SearchBlox Developer Hub

Welcome to the SearchBlox developer hub. Here you will find comprehensive guides and documentation to help you start working with SearchBlox as quickly as possible, as well as support if you get stuck. Let's jump right in!


HTTP Collection

SearchBlox includes a web crawler to index content from any intranet, portal, or website. The crawler can index HTTPS-based content without any additional configuration, crawl through a proxy server, and authenticate using HTTP Basic Auth or form-based login.

  • After logging in to the Admin Console, click on the Add Collection button.
  • Enter a unique Collection name for the data source (for example, intranetsite).
  • Choose HTTP Collection as Collection Type.
  • Choose the language of the content (if the content is in a language other than English).
  • Click Add to create the collection.

Collection Paths

The HTTP collection Paths allow you to configure the Root URLs and the Allow/Disallow paths for the crawler. To access the paths for the HTTP collection, click on the collection name in the Collections list.

Root URLs
The root URL is the starting URL for the crawler. The crawler requests this URL, indexes the content, and follows links from the URL. Make sure the root URL entered has regular HTML HREF links that the crawler can follow. In the Paths sub-tab, enter at least one root URL for the HTTP Collection in the Root URLs field.

Allow/Disallow Paths
Allow/Disallow paths control which URLs the crawler includes or excludes, making it possible to manage a collection by keeping out unwanted URLs.

Allow Paths (Informs the crawler to stay within the site.)
.* (Allows the crawler to go to any external URL or domain.)

Disallow Paths
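Conceptually, the Allow/Disallow filtering above works as pattern matching over discovered URLs: a URL is crawled only if it matches an allow pattern and no disallow pattern. A minimal sketch, assuming regular-expression semantics (the function and variable names here are illustrative, not SearchBlox internals):

```python
import re

def should_crawl(url, allow_patterns, disallow_patterns):
    """Return True if url matches an allow pattern and no disallow pattern."""
    if any(re.search(p, url) for p in disallow_patterns):
        return False
    return any(re.search(p, url) for p in allow_patterns)

# Keep the crawler on a hypothetical www.example.com, but skip its /private area.
allow = [r"https?://www\.example\.com"]
disallow = [r"/private/"]

print(should_crawl("https://www.example.com/docs/page.html", allow, disallow))  # True
print(should_crawl("https://www.example.com/private/x.html", allow, disallow))  # False
print(should_crawl("https://other.org/page.html", allow, disallow))             # False
```

With `.*` as the only allow pattern, every URL passes the allow check, which is why the crawler would wander to external domains.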


Allowed Formats

Select the document formats that need to be searchable within the collection.

Keep the crawler within the required domain(s)

Enter the Root URL domain name(s) within the Allow Paths to ensure the crawler stays within the required domains. If .* is left as the value within the Allow Paths, the crawler will go to any external domain and index its web pages.

Collection Settings

The Settings page has configurable parameters for the crawler. SearchBlox provides default parameters when a new collection is created. Most crawler settings can be changed for your specific requirements.


Keyword-in-Context Display

The keyword-in-context setting returns search results with the description drawn from the content areas where the search term occurs.

HTML Parser Settings

This setting configures the HTML parser to read the description for a document from one of the HTML tags: H1, H2, H3, H4, H5, H6

Maximum Document Age

Specifies the maximum allowable age in days of a document in the collection.

Maximum Document Size

Specifies the maximum allowable size in kilobytes of a document in the collection.

Maximum Spider Depth

Specifies the maximum depth the spider is allowed to proceed to index documents. The maximum spider depth that can be set in SearchBlox is 15.
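The depth limit can be pictured as a breadth-first traversal that stops following links beyond the configured depth. A simplified sketch over an in-memory link graph (no networking; the dictionary stands in for a website and is purely illustrative):

```python
from collections import deque

def crawl(graph, root, max_depth):
    """Breadth-first traversal that indexes pages up to max_depth links from the root."""
    seen = {root}
    queue = deque([(root, 0)])
    indexed = []
    while queue:
        url, depth = queue.popleft()
        indexed.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return indexed

site = {"/": ["/a", "/b"], "/a": ["/a/deep"], "/a/deep": ["/a/deeper"]}
print(crawl(site, "/", 2))  # ['/', '/a', '/b', '/a/deep']
```

Pages deeper than the limit ("/a/deeper" above) are never requested, which keeps crawls of large sites bounded.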

Spider Delay

Specifies the wait time in milliseconds for the spider between HTTP requests to a web server.

User Agent

The name under which the spider requests documents from a web server.


Referer

This is a URL value set in the request headers to specify where the user agent previously visited.

Ignore Robots

Set to Yes or No to tell the spider whether to ignore robots.txt rules. The default value is No (robots.txt rules are obeyed).

Follow Sitemaps

Set to Yes or No to tell the spider whether to index only the URLs listed in sitemaps (Yes) or all discovered URLs (No). The default value is No.

Follow Redirects

Set to Yes or No to instruct the spider whether to follow HTTP redirects automatically.


Stemming

When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".

Remove Duplicates

When enabled, prevents indexing duplicate documents.


Boost

Boost search terms for the collection by setting a value greater than 1 (maximum value 9999).

Spelling Suggestions

When enabled, a spelling index is created at the end of the indexing process.


Logging

Provides detailed indexer activity in ../searchblox/logs/index.log.
When logging or debug-logging mode is enabled, index.log records:

  • List of links that are crawled.
  • Processing performed on each URL, with a timestamp for when processing starts and whether the URL is indexed or skipped, each recorded as a separate entry in index.log.
  • Timestamp of when indexing completed, and the time taken for indexing, alongside the indexed URL entry in the log file.
  • Last modified date of the URL.
  • Whether the URL was skipped, and why.

HTTP Basic Authentication

When the spider requests a document, it presents these values (user/password) to the HTTP server in the Authorization MIME header. The attributes required for basic authentication are username and password.
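The Authorization header described above follows the standard HTTP Basic scheme: the literal `Basic` followed by the Base64-encoded `username:password` pair. A sketch of how such a header value is constructed (illustrative of the scheme, not SearchBlox code; the credentials are made up):

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

print(basic_auth_header("crawler", "s3cret"))  # Basic Y3Jhd2xlcjpzM2NyZXQ=
```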

Form Authentication

When access to documents is protected using form-based authentication, the spider can automatically log in and access the documents. The attributes required for form authentication are:
Form URL, Form Action, Name/Value pairs as required.
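Form authentication amounts to submitting the login form's fields to the Form Action URL before crawling. A sketch of the POST request this implies, using Python's standard library (the URL and the field names `j_username`/`j_password` are hypothetical examples; the real values come from the site's login form):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical login form; the real Form Action and name/value pairs
# are taken from the site being crawled.
form_action = "https://www.example.com/login"
fields = {"j_username": "crawler", "j_password": "s3cret"}

# Encode the name/value pairs as an application/x-www-form-urlencoded body.
body = urlencode(fields).encode("ascii")
req = Request(form_action, data=body, method="POST")
print(req.get_method(), req.full_url)  # POST https://www.example.com/login
```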

Proxy server Indexing

When HTTP content is accessed through proxy servers, the proxy server settings are required to enable the spider to successfully access and index content. The attributes required for proxy server indexing are:
Proxy server URL, Username/password.
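The same information (proxy URL plus credentials) is what any HTTP client needs to route requests through a proxy. As a sketch of the concept with Python's standard library (the proxy host and credentials are hypothetical):

```python
import urllib.request

# Hypothetical proxy; username/password are embedded in the proxy URL.
proxy = urllib.request.ProxyHandler({
    "http": "http://crawler:s3cret@proxy.example.com:8080",
    "https": "http://crawler:s3cret@proxy.example.com:8080",
})
opener = urllib.request.build_opener(proxy)
# opener.open("https://www.example.com/") would now route through the proxy.
```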

Metatags Customization

The HTTP Collection crawler/parser can be controlled using the following HTML markup tags.


Robots Meta Tag

These tags in the HTML page specify whether SearchBlox can or cannot index the page, and can or cannot spider the entire website. The different types of robots meta tags are as shown below:

<meta name="robots" content="index, follow">

<meta name="robots" content="noindex, follow">

<meta name="robots" content="index, nofollow">

<meta name="robots" content="noindex, nofollow">
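A crawler honoring these tags parses the robots meta content into index/follow flags. A sketch using Python's standard html.parser (an illustration of the logic, not the SearchBlox parser):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Extract index/follow directives from a page's robots meta tag."""
    def __init__(self):
        super().__init__()
        self.index = True   # defaults when no robots meta tag is present
        self.follow = True

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            directives = {d.strip().lower() for d in a.get("content", "").split(",")}
            self.index = "noindex" not in directives
            self.follow = "nofollow" not in directives

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
p = RobotsMetaParser()
p.feed(page)
print(p.index, p.follow)  # False True
```

Here the page may be spidered for links (follow) but its own content is excluded from the index (noindex).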

NoIndex/ StopIndex Tags

With HTTP collections, there is often a requirement to exclude content from sections of an HTML page from being indexed, such as headers, footers, and navigation. SearchBlox provides two ways to achieve this.

<noindex> Content to Exclude</noindex>
<!--stopindex-->Content to Exclude <!--startindex-->
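The effect of the stopindex/startindex comments can be sketched as stripping the marked spans from the HTML before indexing. A simplified illustration, assuming the markers are not nested (not the SearchBlox implementation):

```python
import re

# Remove spans between <!--stopindex--> and <!--startindex--> before indexing.
STOPINDEX = re.compile(r"<!--stopindex-->.*?<!--startindex-->", re.DOTALL)

def indexable_text(html):
    """Return the HTML with excluded spans stripped out."""
    return STOPINDEX.sub("", html)

html = "Keep this.<!--stopindex-->Navigation to skip.<!--startindex-->Keep this too."
print(indexable_text(html))  # Keep this.Keep this too.
```

This is how headers, footers, and navigation blocks can be kept out of search descriptions while the rest of the page remains searchable.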

Canonical Tag

The canonical link tag specifies the preferred URL for HTML pages that are copies or duplicates.

For example:
The preferred URL can be specified in the HEAD part of the document:

<link rel="canonical" href=""/>

Index and Refresh Activity

An HTTP Collection can be indexed, refreshed or cleared on-demand, on a schedule, or through API requests.


Index

Starts the indexer for the selected collection, beginning from the root URLs.


Clear

Clears the current index for the selected collection.


Refresh

Revisits URLs from the current index to make sure they are still valid, and then continues to index newly discovered URLs.

Scheduled Activity

For each collection, any of the following scheduled indexer activity can be set:
  • Index - Set the frequency and the start date/time for indexing a collection.
  • Refresh - Set the frequency and the start date/time for refreshing a collection.
  • Clear - Set the frequency and the start date/time for clearing a collection.

  • Indexing is controlled from the Index sub-tab for a collection or through API. The current status of a collection is always indicated on the Collection Dashboard and the Index page.
  • Refresh can be performed for a collection only after Indexing is completed.
  • Index and Refresh operations can also be initiated from Collection Dashboard.
  • Scheduling can be performed only from the Index sub-tab.

Best Practices for Scheduling Index/Refresh of Collections

Do not schedule all three operations (Index, Refresh, Clear) at the same time, as this will create conflicts between activities.

If you have multiple collections, stagger the schedules so that no more than 2-3 collections index or refresh at the same time.

Using Canonical URLs

You can use canonical URLs to avoid duplicates and improve link and ranking signals for content available through multiple URL structures. The same content may be accessible through multiple URLs, and content can even be distributed across entirely different URLs and domains.

For Example

  1. Your site has multiple URLs because you position the same page under multiple sections.
  2. You serve the same content for the www subdomain or the http/https protocol.
  3. Content is replicated partially or fully in different pages or sites in the domain.

This replication of pages can create challenges for people searching in SearchBlox, as results contain repetitive content pointing to different pages. To overcome this, define a canonical URL for pages whose content is similar or equivalent across multiple URLs/pages.

  1. Set the site which is most preferred as the canonical URL
  2. Indicate the preferred URL with the rel="canonical" link element
<link rel="canonical" href="" />

The above link element indicates the preferred URL for accessing the content.

On indexing the URL, SearchBlox will index only the URL that has been specified as canonical.
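Extracting the canonical URL from a page can be sketched with Python's standard html.parser (an illustration of the idea; the page and URL below are hypothetical examples):

```python
from html.parser import HTMLParser

class CanonicalParser(HTMLParser):
    """Find the href of a <link rel="canonical"> element, if any."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel", "").lower() == "canonical":
            self.canonical = a.get("href")

page = '<head><link rel="canonical" href="https://www.example.com/tea"/></head>'
p = CanonicalParser()
p.feed(page)
print(p.canonical)  # https://www.example.com/tea
```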

To avoid errors, use absolute paths instead of relative paths in canonical URLs.
Note that while we encourage you to use any of these methods, none of them are mandatory. If you don't indicate a canonical URL, SearchBlox will identify the current URL as the best version.


Remove Duplicates

When Remove Duplicates is enabled, pages with 100% identical content, i.e., exact duplicates, will not be indexed. If there is any difference in the content, the files/URLs are not considered duplicates.
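Exact-duplicate detection of this kind can be sketched as hashing each document's content and skipping any hash already seen (an illustration of the idea, not the SearchBlox implementation; the URLs are made up):

```python
import hashlib

def dedupe(documents):
    """Index only documents whose content hash has not been seen before."""
    seen = set()
    indexed = []
    for url, content in documents:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # 100% identical content: treat as a duplicate
        seen.add(digest)
        indexed.append(url)
    return indexed

docs = [("/a", "same text"), ("/b", "same text"), ("/c", "same text!")]
print(dedupe(docs))  # ['/a', '/c']
```

Note how even a one-character difference ("/c") produces a different hash, so the page is not treated as a duplicate.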
