SearchBlox includes a web crawler to index content from any intranet, portal, or website. The crawler can index HTTPS-based content without any additional configuration, and can crawl through a proxy server or behind HTTP Basic Auth/Form authentication.
- After logging in to the Admin Console, click on the Add Collection button
- Enter a unique Collection name for the data source (for example, intranetsite)
- Choose HTTP Collection as Collection Type
- Choose the language of the content (if the language is other than English)
- Click Add to create the collection
The HTTP collection Paths allow you to configure the Root URLs and the Allow/Disallow paths for the crawler. To access the paths for the HTTP collection, click on the collection name in the Collections list.
The root URL is the starting URL for the crawler. It requests this URL, indexes the content, and follows links from the URL. Make sure the root URL entered has regular HTML HREF links that the crawler can follow. In the paths sub-tab, enter at least one root URL for the HTTP Collection in the Root URLs.
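The way a crawler follows HTML href links from a root URL can be sketched with Python's standard library (this is an illustration only, not SearchBlox's internal implementation; the page content and URLs are made up):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

page = '<a href="/about.html">About</a> <a href="http://other.example/x">X</a>'
extractor = LinkExtractor("http://www.example.com/index.html")
extractor.feed(page)
print(extractor.links)
```

A page whose links are rendered only by JavaScript exposes no href attributes to a parser like this, which is why the root URL must contain regular HTML links.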
Allow/Disallow paths tell the crawler which URLs to include or exclude, making it possible to manage a collection by keeping unwanted URLs out of the index.
- http://www.cnn.com/ (keeps the crawler within the cnn.com site)
- .* (allows the crawler to go to any external URL or domain)
Select the document formats that need to be searchable within the collection.
Keep the crawler within the required domain(s)
Enter the Root URL domain name(s) (for example, cnn.com or nytimes.com) in the Allow Paths to keep the crawler within the required domains. If .* is left as the value in the Allow Paths, the crawler will follow links to any external domain and index those web pages.
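SearchBlox's exact path-matching rules are not spelled out here, but the general idea behind Allow/Disallow filtering can be sketched as regular-expression matching (the patterns and URLs below are illustrative assumptions):

```python
import re

def url_allowed(url, allow_patterns, disallow_patterns):
    """A URL is crawled only if it matches at least one Allow path
    and no Disallow path."""
    allowed = any(re.search(p, url) for p in allow_patterns)
    blocked = any(re.search(p, url) for p in disallow_patterns)
    return allowed and not blocked

# Keep the crawler on cnn.com, but skip its /videos section.
allow = [r"cnn\.com"]
disallow = [r"/videos/"]
print(url_allowed("http://www.cnn.com/world/story.html", allow, disallow))
print(url_allowed("http://www.cnn.com/videos/clip.html", allow, disallow))
print(url_allowed("http://www.nytimes.com/article", allow, disallow))
```

Under this model, replacing the allow list with `.*` makes every URL pass the allow check, which is why the crawler then wanders to external domains.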
The Settings page has configurable parameters for the crawler. SearchBlox provides default parameters when a new collection is created. Most crawler settings can be changed for your specific requirements.
The keyword-in-context setting returns search results with the description displayed from the content areas where the search term occurs.
HTML Parser Settings
This setting configures the HTML parser to read the description for a document from one of the HTML tags: H1, H2, H3, H4, H5, H6
Maximum Document Age
Specifies the maximum allowable age in days of a document in the collection.
Maximum Document Size
Specifies the maximum allowable size in kilobytes of a document in the collection.
Maximum Spider Depth
Specifies the maximum depth the spider is allowed to proceed to index documents.
Specifies the wait time in milliseconds for the spider between HTTP requests to a web server.
The name under which the spider requests documents from a web server.
This is a URL value set in the request headers to specify where the user agent previously visited.
Set to Yes or No to tell the spider whether to obey robots.txt rules. The default value is No.
Set to Yes to index only the URLs listed in sitemaps, or No to index all discovered URLs. The default value is No.
Set to Yes or No to instruct the spider whether to automatically follow redirects.
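How a robots-obeying spider applies these rules can be sketched with Python's standard `urllib.robotparser`; the user-agent string, robots.txt content, and URLs below are made-up stand-ins for the collection's settings, not SearchBlox internals:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical stand-ins for the Settings fields described above.
USER_AGENT = "example-spider"   # name the spider presents to web servers
OBEY_ROBOTS = True              # the obey-robot-rules setting

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def may_fetch(url):
    """Honor robots.txt only when the setting is enabled."""
    return (not OBEY_ROBOTS) or rp.can_fetch(USER_AGENT, url)

print(may_fetch("http://example.com/index.html"))
print(may_fetch("http://example.com/private/x"))
```

With the setting switched to No, `may_fetch` returns True for every URL, matching the documented default behavior of ignoring robot rules.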
Boosting
Boost search terms for the collection by setting a value greater than 1 (maximum value 9999).
When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".
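A toy illustration of what stemming does (real stemmers such as Porter's algorithm use much richer rule sets than this deliberately naive sketch):

```python
def naive_stem(word):
    """Toy stemmer: strips a few common inflectional suffixes.
    Irregular forms like "ran" cannot be handled by suffix rules
    alone, so they need a lookup table."""
    irregular = {"ran": "run"}
    if word in irregular:
        return irregular[word]
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["running", "runs", "ran"]])
```

Because all three inflected forms reduce to the same root, a search for any one of them can match documents containing the others.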
When enabled, prevents duplicate documents from being indexed.
When enabled, a spell index is created at the end of the indexing process.
When logging is enabled, indexer activity is recorded in detail in ../searchblox/logs/index.log
When logging or debug logging mode is enabled, index.log records the following details:
- List of links that are crawled.
- Processing done on each URL, with a timestamp for when processing starts, whether the URL is indexed or skipped, and the final indexing outcome, each as a separate entry in index.log.
- Timestamp of when indexing completed, along with the time taken, recorded against the indexed URL entry in the log file.
- Last modified date of the URL.
- If a URL is skipped or not indexed, the reason why.
HTTP Basic Authentication
When the spider requests a document, it presents these values (user/password) to the HTTP server in the Authorization MIME header. The attributes required for basic authentication are username and password.
When access to documents is protected using form-based authentication, the spider can automatically log in and access the documents. The attributes required for form authentication are the Form URL, Form Action, and Name/Value pairs as required.
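Conceptually, a form login is an HTTP POST of the name/value pairs to the form's action URL; the session cookie returned is then reused for subsequent requests. A sketch with hypothetical field names and URLs (the real values come from the site's login page):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form-login values standing in for the collection settings.
form_action = "http://intranet.example.com/login"
fields = {"username": "crawler", "password": "secret"}

# A spider would POST the name/value pairs to the form action and then
# carry the returned session cookie on later requests.
req = Request(form_action, data=urlencode(fields).encode(), method="POST")
print(req.get_method(), req.full_url)
print(req.data.decode())
```

The field names ("username", "password") must match the input names in the site's login form, which is why they are configured as name/value pairs rather than fixed attributes.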
Proxy Server Indexing
When HTTP content is accessed through proxy servers, the proxy server settings are required to enable the spider to successfully access and index content. The attributes required for proxy server indexing are the Proxy server URL and Username/Password.
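Routing requests through a proxy can be sketched with Python's standard `urllib` (the proxy URL below is a made-up placeholder; proxy credentials, when required, would be supplied via an additional auth handler):

```python
from urllib.request import ProxyHandler, build_opener, install_opener

# Hypothetical proxy details standing in for the collection's settings.
PROXY_URL = "http://proxy.example.com:8080"

handler = ProxyHandler({"http": PROXY_URL, "https": PROXY_URL})
opener = build_opener(handler)
# After install_opener, every urlopen() call is routed via the proxy.
install_opener(opener)
print(handler.proxies["https"])
```

Without these settings, a spider running inside a proxied network simply cannot reach external web servers, so no content gets indexed.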
The HTTP Collection crawler/parser can be controlled using the following HTML markup tags:
Robots Meta Tag
These tags in the HTML page specify whether SearchBlox can or cannot index the page, and can or cannot spider the entire website. The different types of robots meta tags are as follows:
<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="index, nofollow">
<meta name="robots" content="noindex, nofollow">
NoIndex/StopIndex Tags
With HTTP collections, there is often a requirement to exclude content from sections of an HTML page from being indexed, such as headers, footers, and navigation. SearchBlox provides two ways to achieve this.
<noindex> Content to Exclude</noindex>
<!--stopindex-->Content to Exclude <!--startindex-->
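The effect of both exclusion markers can be sketched as stripping the wrapped regions from the page before the remaining text is indexed (an illustrative regex sketch, not SearchBlox's parser):

```python
import re

def strip_excluded(html):
    """Remove content wrapped in <noindex> tags or between
    <!--stopindex--> and <!--startindex--> comments before indexing."""
    html = re.sub(r"<noindex>.*?</noindex>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<!--stopindex-->.*?<!--startindex-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    return html

page = ("Intro <noindex>nav menu</noindex> body "
        "<!--stopindex-->footer<!--startindex--> end")
print(strip_excluded(page))
```

Because the comment form is invisible to browsers, it is often the easier marker to add around headers, footers, and navigation in existing templates.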
A canonical URL is specified in the link tag to set the preferred URL for HTML pages that are copies or duplicates.
A URL such as http://www.sample.com/new.html?uid=dxdm59652xhax can specify its preferred URL in the HEAD part of the document as follows:
<link rel="canonical" href="http://sample.com"/>
HTTP Collection can be indexed, refreshed or cleared on-demand, on a schedule or through API requests.
Index
Starts the indexer for the selected collection, beginning from the root URLs.
Clear
Clears the current index for the selected collection.
Refresh
Revisits URLs from the current index to make sure they are still valid, and then continues to index newly discovered URLs.
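One common way a refresh can revalidate an already-indexed URL is a conditional HTTP request: if the server answers 304 Not Modified, the stored copy is still current. A sketch of building such a request with made-up values (not a description of SearchBlox's actual refresh protocol):

```python
from urllib.request import Request

def refresh_request(url, last_modified):
    """Re-request an already-indexed URL conditionally; a 304 reply
    would mean the stored copy is still valid and needs no re-index."""
    return Request(url, headers={"If-Modified-Since": last_modified})

req = refresh_request("http://example.com/page.html",
                      "Wed, 01 May 2024 10:00:00 GMT")
print(req.get_header("If-modified-since"))
```

Conditional requests make a refresh much cheaper than a full re-index, since unchanged documents are never downloaded or re-parsed.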
For each collection, any of the following scheduled indexer activity can be set:
Index - Set the frequency and the start date/time for indexing a collection.
Refresh - Set the frequency and the start date/time for refreshing a collection.
Clear - Set the frequency and the start date/time for clearing a collection.
- Indexing is controlled from the Index sub-tab for a collection or through API. The current status of a collection is always indicated on the Collection Dashboard and the Index page.
- Refresh can be performed for a collection only after Indexing is completed.
- Index and Refresh operations can also be initiated from Collection Dashboard.
- Scheduling can be performed only from the Index sub-tab.
Best Practices for Scheduling Index/Refresh of Collections
Do not set the same time schedule for all three operations (Index, Refresh, Clear), as this creates conflicts between the activities.
If you have multiple collections, stagger the schedules so that no more than 2-3 collections are indexing or refreshing at the same time.