Dynamic Auto Collection

SearchBlox provides a web crawler that can index content from any intranet, portal, or website. It can index HTTPS content without extra setup, and it supports crawling through a proxy server or behind form authentication.

The Dynamic Auto Collection feature improves secure content indexing by:

  • Allowing authentication and crawling of content protected by web forms
  • Treating each page as an individual document during indexing
  • Indexing pages only after they are fully rendered

It also supports Single Page Applications (SPAs) and pages that load content dynamically using JavaScript.

🚧

For Linux:

Run the following command to install the browser dependencies, which are required for a Dynamic Auto Collection without a form URL:

sudo apt-get install libxkbcommon0 libgbm1

Creating a Dynamic Auto Collection

To create a Dynamic Auto Collection, follow these steps:

  1. Log in to the Admin Console

  2. Go to the Collections tab and click Create a New Collection or the + icon

  3. Choose Dynamic Auto Collection as the Collection Type

  4. Enter a unique name for your collection (for example: "DynamicAuto")

  5. Set RAG options:

    • Enable if you want to use Hybrid RAG search
    • Disable if you don’t need RAG features
  6. Configure access settings:

    • Select Private or Public access
    • Turn on encryption if needed for extra security
  7. Choose the content language (if it’s not English)

  8. Click Save to create the collection

  • Once the Dynamic Auto Collection is created, you will be automatically redirected to the Path tab.

Dynamic Auto Collection Paths

The Dynamic Auto Collection Paths let you set the Root URLs and the Allow/Disallow paths that guide how the crawler moves through your site. These settings help you control what areas should be included or skipped during indexing. To view or update these path settings, just click on the collection name in the Collections list.

Root URLs

  • The root URL is the main starting point for the crawler. It opens this URL first, indexes its content, and then follows the links found on that page.
  • Only one root URL should be entered. The crawler will then move through all the pages linked from that URL.
  • Make sure the root URL contains proper HTML HREF links, so the crawler can easily follow and index connected pages.
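
The link-following behavior described above can be illustrated with a minimal sketch. It is not SearchBlox's crawler code; it only shows why proper HREF links matter: a parser can discover `<a href="...">` targets, while navigation wired up through script handlers is not visible as a link. The names `HrefCollector` and `extract_links`, and the example.com URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url: str, html: str) -> list[str]:
    """Return the absolute URLs a link-following crawler could visit next."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    return collector.links


sample_html = """
<a href="/products">Products</a>
<a href="docs/intro.html#setup">Docs</a>
<span onclick="location='/hidden'">Not a real link</span>
"""
links = extract_links("https://example.com/", sample_html)
print(links)
```

In this sample, the `<span>` with an `onclick` handler contributes nothing to the list, which is the kind of page a crawler that relies on HREF links cannot traverse.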

Allow/Disallow Paths

  • Allow/Disallow paths help control which URLs the crawler should include or skip during indexing.
  • These paths let you manage your collection by blocking unwanted URLs and allowing only the needed ones.
  • In a Dynamic Auto collection, giving an allow path is required to ensure the crawler indexes only the pages within the subdomain of the root URL.

| Field | Description |
|---|---|
| Root URLs | The main starting URL for the crawler. At least one root URL is required to begin crawling. |
| Allow Paths | Tells the crawler where it can go. Example: http://www.cnn.com/ keeps crawling inside cnn.com. A value of .* allows crawling on any external site. |
| Disallow Paths | Tells the crawler which URLs to avoid. Examples: .jsp, /cgi-bin/, /videos/, ?params |
| Allowed Formats | Select which document types or file formats should be indexed and made searchable. |

❗️

Important Note:

  • Enter the Root URL domain name (for example cnn.com or nytimes.com) within the Allow Paths to ensure the crawler stays within the required domains.
  • If .* is left as the value within the Allow Paths, the crawler will go to any external domain and index the web pages.
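
To make the effect of these settings concrete, here is a small sketch of how allow and disallow paths could filter candidate URLs. The matching rule is an assumption made for illustration (plain substring matching, with `.*` treated as "match everything", and disallow taking precedence); SearchBlox's actual matching logic may differ. The function names are hypothetical.

```python
def matches(pattern: str, url: str) -> bool:
    """Assumed rule: '.*' matches every URL; anything else is a substring test."""
    return pattern == ".*" or pattern in url


def should_crawl(url: str, allow: list[str], disallow: list[str]) -> bool:
    """A URL is crawled only if an allow path matches and no disallow path does."""
    if any(matches(p, url) for p in disallow):
        return False
    return any(matches(p, url) for p in allow)


disallow_paths = [".jsp", "/cgi-bin/", "/videos/"]

# Allow path restricted to the root URL's domain: the crawler stays on cnn.com.
print(should_crawl("http://www.cnn.com/world/story.html", ["cnn.com"], disallow_paths))
print(should_crawl("http://www.cnn.com/videos/clip.html", ["cnn.com"], disallow_paths))
print(should_crawl("http://www.nytimes.com/", ["cnn.com"], disallow_paths))

# Allow path left as .*: external domains are now accepted as well.
print(should_crawl("http://www.nytimes.com/", [".*"], disallow_paths))
```

The last call shows the situation the note warns about: with `.*` in the Allow Paths, a link to any other domain passes the filter.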

Dynamic Auto Collection Settings

The Dynamic Auto Collection Settings page allows you to configure how the SearchBlox crawler behaves. When a new collection is created, default settings are automatically applied, but you can modify them anytime based on your needs. These settings help you control what the crawler collects, how deep it crawls, and how it processes different types of content, giving you full flexibility in managing the indexing process.


📘

Generate Title, Description, and Topics using SearchAI PrivateLLM and enable Hybrid Search:

  • Choose and enable Generate Using LLM and Auto Relevance.
  • Clicking the Compare Keyword Search with Hybrid link redirects you to the Comparison Plugin.

| Setting | Description |
|---|---|
| Title | Generates concise and relevant titles for the indexed documents using the LLM. |
| Description | Generates descriptions for indexed documents using the LLM. |
| Topic | Generates relevant topics for indexed documents using the LLM, based on each document's content. |
| Auto Relevance | Enables or disables Hybrid Search for automatic relevance ranking. |

| Section | Setting | Description |
|---|---|---|
| Relevance | Auto Relevance | Enables or disables Hybrid Search to automatically rank search results by relevance. |
| Relevance | Remove Duplicates | When enabled, the system avoids indexing documents that have the same content. |
| Relevance | Stemming | Reduces words to their base form. Example: “run,” “running,” “ran,” and “runs” are treated as the same word. |
| Relevance | Spelling Suggestions | Shows spelling corrections based on indexed words. When enabled, users get suggestions in both normal and faceted search. |
| Keyword-in-Context Search Settings | Keyword-in-Context Display | Shows search results along with a snippet of text from the page where the search term appears. |
| HTML Parser Settings | Last-Modified Date | Determines where a document's last-modified date is taken from. By default, the webpage header is used. Select the Meta option to use a meta tag, or the Custom option to use a custom date set on the web server. |
| Scanner Settings | Spider Delay | Defines how long the crawler waits (in milliseconds) between HTTP requests to a web server. |
| Scanner Settings | Ignore Robots | Choose Yes to ignore robots.txt rules or No to follow them. Default is No (the crawler follows robots.txt). |
| Scanner Settings | Follow Sitemaps | Choose Yes to index only the URLs listed in sitemaps, or No to index all URLs on the site. Default is No. |
| Form Authentication | Form authentication fields | When documents are protected with form-based authentication, the crawler can automatically log in and access the content. The required attributes are listed under "Form authentication attributes". |
| HTML Proxy Server Settings | Proxy server credentials | Required when accessing HTTP content through a proxy. Provide the proxy server URL, username, and password so the crawler can access and index the content. |
| Enable Detailed Log Settings | Enable Logging | Provides detailed indexer activity in ..\webapps\ROOT\logs\index.log. The recorded details are listed under "Detailed log contents". |

Form authentication attributes:

  • Form URL: The root or landing page URL after login. The crawler starts crawling from this page.
  • Username reference: ID or XPath of the username field (found using Inspect Element).
  • Username: Enter your username.
  • Password reference: ID or XPath of the password field (found using Inspect Element).
  • Password: Enter your password.
  • Submit: Type submit (do not change).
  • Submit reference: ID or XPath of the submit button (found using Inspect Element).
  • Name 1: Optional field for actions like wait or iframe.
  • Value 1: Optional reference ID for the wait/iframe element, if required.

Detailed log contents (recorded when logging or debug mode is on):

  • List of links crawled
  • Processing details for each URL with timestamps, including whether it was indexed or skipped
  • Indexing completion time and duration per URL
  • Last-modified date of the URL
  • The reason a URL was skipped, if it was skipped
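
The Spider Delay and Ignore Robots settings can be pictured with a short sketch built on Python's standard `urllib.robotparser`. This is an illustration of the general technique, not SearchBlox's scanner: the function `plan_fetches`, its parameter names, and the sample robots.txt content are assumptions. No network request is made; the robots.txt text is parsed from a string.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from text instead of downloading them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetches(urls, robots, ignore_robots=False, spider_delay_ms=0, user_agent="*"):
    """Return the URLs that would be fetched, pausing between requests."""
    fetched = []
    for url in urls:
        # With Ignore Robots = No, URLs disallowed by robots.txt are skipped.
        if not ignore_robots and not robots.can_fetch(user_agent, url):
            continue
        if fetched:
            time.sleep(spider_delay_ms / 1000)  # Spider Delay is given in milliseconds
        fetched.append(url)  # a real crawler would issue the HTTP request here
    return fetched


robots = build_robots(ROBOTS_TXT)
candidates = [
    "https://example.com/",
    "https://example.com/private/report.html",
    "https://example.com/about",
]
print(plan_fetches(candidates, robots, ignore_robots=False))
print(plan_fetches(candidates, robots, ignore_robots=True))
```

With the default (Ignore Robots set to No), the `/private/` URL is left out; with Ignore Robots set to Yes, all three URLs are fetched.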

🚧

Important for Linux users!

  • Before indexing the created collection, open /opt/searchblox/webapps/ROOT/WEB-INF/authconfig.yml and set the headless value to true:
    headless: true
  • Restart the SearchBlox service to apply the change.

📘

NOTE:

  • To add additional time-outs, update the following parameter in /webapps/ROOT/WEB-INF/authconfig.yml:

    submit-delay: 30

    The value is in seconds.

  • Restart the SearchBlox service to apply the change.
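
For readers who want a mental model of how the form authentication fields, the headless value, and submit-delay fit together, the following sketch uses the Playwright Python API. It is an illustration only and not SearchBlox's implementation. It assumes that opening the Form URL presents the login form, that a reference starting with `/` or `(` is an XPath and anything else is an element ID, and that submit-delay acts as an upper bound on waiting after submit. The function names are hypothetical, and `login` needs the `playwright` package and its browsers installed.

```python
def to_selector(reference: str) -> str:
    """Turn an ID or XPath reference into a Playwright selector string."""
    if reference.startswith(("/", "(")):
        return f"xpath={reference}"
    return f'[id="{reference}"]'


def login(form_url, username_ref, username, password_ref, password,
          submit_ref, headless=True, submit_delay_s=30):
    """Log in through a web form and return the HTML of the page that follows."""
    # Imported here so to_selector() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=True runs the browser without a display, as on a Linux server.
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        page.goto(form_url)
        page.fill(to_selector(username_ref), username)
        page.fill(to_selector(password_ref), password)
        page.click(to_selector(submit_ref))
        # Assumed meaning of submit-delay: wait up to this many seconds after submit.
        page.wait_for_load_state("networkidle", timeout=submit_delay_s * 1000)
        html = page.content()
        browser.close()
        return html


print(to_selector("username"))
print(to_selector("//button[@type='submit']"))
```

The two reference styles correspond to what Inspect Element shows for a field: either its `id` attribute or an XPath copied from the element.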

Synonyms

Synonyms allow the search engine to find documents related to a search term, even if the exact term does not appear in the content. For example, if a user searches for “global,” the search results can also include documents containing synonyms like “world” or “international.”

Additionally, there is an option to load synonyms directly from existing documents to enhance the search results.
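
As a rough illustration of the idea, the sketch below expands a query with synonyms before matching it against documents. The synonym table, the sample documents, and the whole-word matching are assumptions made for the example; they do not describe how SearchBlox stores or applies synonyms.

```python
# Hypothetical synonym table, using the example from the text.
SYNONYMS = {"global": ["world", "international"]}


def expand_query(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the query words followed by any synonyms configured for them."""
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(synonyms.get(word, []))
    return terms


def search(query: str, documents: list[str], synonyms: dict[str, list[str]]) -> list[str]:
    """Return documents containing the query term or one of its synonyms."""
    terms = expand_query(query, synonyms)
    return [doc for doc in documents if any(t in doc.lower().split() for t in terms)]


documents = [
    "World markets rallied today",
    "Local bakery opens downtown",
    "International shipping rates",
]
print(expand_query("global", SYNONYMS))
print(search("global", documents, SYNONYMS))
```

None of the sample documents contains the word "global", yet two of them are returned because they contain one of its synonyms.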

Schedule and Index

This setting allows you to define when a collection should be indexed and how often the indexing should occur, starting from the root URLs. You can set the specific start date and time for the indexing process. SearchBlox supports different schedule frequencies, giving you control over how regularly the collection is updated.

  • Once
  • Hourly
  • Daily
  • Every 48 Hours
  • Every 96 Hours
  • Weekly
  • Monthly

The following operations can be performed in Dynamic Auto Collections:

| Activity | Description |
|---|---|
| Enable Scheduler for Indexing | Turn this on to set the start date and frequency for indexing the collection. |
| Save | Save the schedule settings so the collection will follow the defined indexing plan. |
| View all Collection Schedules | Opens the Schedules section to see all scheduled indexing tasks for collections. |
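
To show how a start time and a frequency combine into a series of indexing runs, here is a minimal sketch. It is not the SearchBlox scheduler. The frequency names come from the list above; the function name `upcoming_runs` is hypothetical, and treating Monthly as a fixed 30-day interval is an assumption made only to keep the example short.

```python
from datetime import datetime, timedelta

INTERVALS = {
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
    "Monthly": timedelta(days=30),  # assumption: a fixed 30-day month
}


def upcoming_runs(start: datetime, frequency: str, count: int = 3) -> list[datetime]:
    """Return the first `count` run times for a schedule starting at `start`."""
    if frequency == "Once":
        return [start]  # a one-time schedule has a single run
    step = INTERVALS[frequency]
    return [start + i * step for i in range(count)]


start = datetime(2025, 1, 1, 2, 0)
for run in upcoming_runs(start, "Every 48 Hours"):
    print(run.isoformat())
```

Each run begins again from the root URLs, so shorter intervals keep the index fresher at the cost of more crawl traffic.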

Manage Documents

  • The Manage Documents tab allows you to:

    • Add/Update documents
    • Filter documents
    • View content
    • View metadata
    • Refresh documents
    • Delete documents
  • To add a document, click the “+” icon.

  • Enter the document or URL, then click Add/Update.

  • After updating, the document URL will appear on the screen, and you can perform all the above operations.

Data Fields

  • The Data Fields tab lets you create custom fields for searching. You can also see the default data fields for non-encrypted collections.

  • SearchBlox supports four types of data fields:

    1. Keyword – for alphanumeric values like IDs, tags, or codes.
    2. Number – for numeric values like prices or quantities.
    3. Date – for date values to use in search and filters.
    4. Text – for full-text search in custom fields.

📘

Note:

Once data fields are configured, the collection must be cleared and re-indexed for the changes to take effect.

To learn more about data fields, refer to Data Fields Tab.
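
The difference between the four field types is easiest to see in how a raw value would be interpreted. The sketch below is a conceptual illustration, not SearchBlox code: the function name `coerce` is hypothetical, and the ISO date format (YYYY-MM-DD) is an assumption.

```python
from datetime import date


def coerce(field_type: str, raw: str):
    """Interpret a raw string according to one of the four data field types."""
    if field_type == "Keyword":
        return raw.strip()              # kept whole, for exact matches on IDs, tags, or codes
    if field_type == "Number":
        return float(raw)               # numeric, so range filters and sorting work
    if field_type == "Date":
        return date.fromisoformat(raw)  # assumes YYYY-MM-DD input
    if field_type == "Text":
        return raw                      # free text, analyzed for full-text search
    raise ValueError(f"Unknown data field type: {field_type}")


print(coerce("Keyword", " SKU-1042 "))
print(coerce("Number", "19.99"))
print(coerce("Date", "2024-03-15"))
```

A value that cannot be interpreted as the declared type (for example, text in a Number field) raises an error in this sketch, which hints at why existing documents need to be re-indexed after field definitions change.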


Models

Embedding

  • Provider specifies the embedding provider used to generate vector representations of documents.
  • Model defines the embedding model used to convert document content into vectors for semantic search.

Reranker

  • Provider specifies the reranker provider used for improving search result relevance.
  • Model defines the reranker model used to re-score and reorder search results based on relevance.

LLM

  • Provider specifies the Large Language Model provider used for AI-powered features.

  • Model defines the LLM used for tasks such as document enrichment, summaries, and SmartFAQs.

  • These settings override global configurations and apply only to the current collection.
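
The override behavior can be sketched as a simple merge in which collection-level values win and anything left unset falls back to the global configuration. The dictionary layout, the placeholder provider and model names, and the function `effective_models` are all hypothetical; they only illustrate the precedence rule stated above.

```python
# Hypothetical global configuration with placeholder names.
GLOBAL_MODELS = {
    "embedding": {"provider": "global-embed-provider", "model": "global-embed-model"},
    "reranker": {"provider": "global-rerank-provider", "model": "global-rerank-model"},
    "llm": {"provider": "global-llm-provider", "model": "global-llm-model"},
}


def effective_models(global_cfg: dict, collection_cfg: dict) -> dict:
    """Collection values win; anything not overridden falls back to the global value."""
    return {
        kind: {**defaults, **collection_cfg.get(kind, {})}
        for kind, defaults in global_cfg.items()
    }


# This collection overrides only the LLM model.
collection_overrides = {"llm": {"model": "collection-llm-model"}}
resolved = effective_models(GLOBAL_MODELS, collection_overrides)
print(resolved["llm"])
print(resolved["embedding"])
```

Because the merge produces a new dictionary, the global configuration is left unchanged and other collections are unaffected.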

🚧

NOTE:

If you have trouble indexing a Dynamic Auto Collection and find the following error in the logs:

ERROR <xx xxx xxxx 10:56:01,368> <status> <Exception caught on initialize browser: >
com.microsoft.playwright.PlaywrightException: Error { message='
Host system is missing dependencies to run browsers.

Run the following command:

sudo apt-get install libxkbcommon0 libgbm1
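
Before running the install command, it can be useful to check whether the shared libraries are already visible to the system. The sketch below uses Python's standard `ctypes.util.find_library` for that. The mapping from library names to the two packages named above is an assumption, and the result depends on how the host resolves shared libraries, so treat the output as a hint rather than a definitive diagnosis.

```python
import ctypes.util

# Assumed mapping: shared library name -> package from the install command.
REQUIRED_LIBRARIES = {
    "xkbcommon": "libxkbcommon0",
    "gbm": "libgbm1",
}


def missing_packages() -> list[str]:
    """Return the packages whose shared library could not be located."""
    return [
        package
        for library, package in REQUIRED_LIBRARIES.items()
        if ctypes.util.find_library(library) is None
    ]


missing = missing_packages()
if missing:
    print("Possibly missing:", " ".join(missing))
else:
    print("Both browser dependencies appear to be installed.")
```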