
HTTP Collection

SearchBlox includes a web crawler to index content from any intranet, portal, or website. The crawler can also index HTTPS-based content without any additional configuration, and it can crawl through a proxy server or HTTP Basic Auth/Form authentication.

  • After logging in to the Admin Console, click on the Add Collection button
  • Enter a unique Collection name for the data source (for example, intranetsite)
  • Choose HTTP Collection as Collection Type
  • Choose the language of the content (if the language is other than English)
  • Click Add to create the collection

Collection Paths

The Paths sub-tab for an HTTP collection allows you to configure the Root URLs and the Allow/Disallow paths for the crawler. To access the paths for the HTTP collection, click on the collection name in the Collections list.

Root URLs
The root URL is the starting URL for the crawler: the crawler requests this URL, indexes the content, and follows links from the URL. Make sure the root URL entered has regular HTML HREF links that the crawler can follow. In the Paths sub-tab, enter at least one root URL for the HTTP Collection in the Root URLs.
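
For example, a hypothetical intranet collection might use the following as its root URL:

http://intranet.example.com/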

Allow/Disallow Paths
Allow/Disallow paths control which URLs the crawler includes or excludes, making it possible to manage a collection by keeping unwanted URLs out of the index.

Allow Paths

http://www.cnn.com/ (Informs the crawler to stay within the cnn.com site)
.* (Allows the crawler to go to any external URL or domain)

Disallow Paths

.jsp
/cgi-bin/
/videos/
?params
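
Allow and Disallow paths compose: in general, a URL is crawled only if it matches an Allow path and does not match any Disallow path. For example, a hypothetical collection with the entries below would crawl pages on cnn.com but skip any URL containing /videos/, such as http://www.cnn.com/videos/latest.html.

Allow Paths: http://www.cnn.com/
Disallow Paths: /videos/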

Allowed Formats

Select the document formats that need to be searchable within the collection.

Keep the crawler within the required domain(s)

Enter the Root URL domain name(s) (for example, cnn.com or nytimes.com) in the Allow Paths to ensure the crawler stays within the required domains. If .* is left as the value in the Allow Paths, the crawler will follow links to any external domain and index those pages.
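
For example, to keep a hypothetical crawler within two news sites, the Allow Paths could list:

cnn.com
nytimes.com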

Collection Settings

The Settings page has configurable parameters for the crawler. SearchBlox provides default parameters when a new collection is created. Most crawler settings can be changed for your specific requirements.

Keyword-in-Context Display

Keyword-in-context display returns search results with the description drawn from the content areas where the search term occurs.

HTML Parser Settings

This setting configures the HTML parser to read the description for a document from one of the HTML tags: H1, H2, H3, H4, H5, H6

Maximum Document Age

Specifies the maximum allowable age in days of a document in the collection.

Maximum Document Size

Specifies the maximum allowable size in kilobytes of a document in the collection.

Maximum Spider Depth

Specifies the maximum link depth to which the spider will follow links when indexing documents.

Spider Delay

Specifies the wait time in milliseconds for the spider between HTTP requests to a web server. For example, a value of 1000 throttles the spider to roughly one request per second.

User Agent

The name under which the spider requests documents from a web server.

Referrer

This is a URL value set in the request headers to specify where the user agent previously visited.

Ignore Robots

Set to Yes to have the spider ignore robots.txt rules, or No to obey them. The default value is No.

Follow Sitemaps

Set to Yes to index only the URLs listed in the site's sitemaps, or No to index all discovered URLs. The default value is No.

Follow Redirects

Set to Yes or No to instruct the spider whether to follow redirects automatically.

Stemming

When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".

Remove Duplicates

When enabled, prevents duplicate documents from being indexed.

Boosting

Boost search terms for the collection by setting a value greater than 1 (maximum value 9999).

Spelling Suggestions

When enabled, a spell index is created at the end of the indexing process.

Logging

When logging is enabled, detailed indexer activity is written to ../searchblox/logs/index.log
The details recorded in index.log when logging or debug logging is enabled are:

  • The list of links that are crawled.
  • The processing performed on each URL, including a timestamp for when processing starts, whether the URL is indexed or skipped, and the final outcome, each recorded as a separate entry in index.log.
  • A timestamp for when indexing completed, along with the time taken, recorded against the indexed URL entry in the log file.
  • The last modified date of each URL.
  • If a URL is skipped or not indexed, the reason why.

HTTP Basic Authentication

When the spider requests a document, it presents these values (username/password) to the HTTP server in the Authorization MIME header. The attributes required for basic authentication are a username and password.
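
For example, with a hypothetical username crawler and password secret, the spider sends the Base64-encoded credentials in the request header:

Authorization: Basic Y3Jhd2xlcjpzZWNyZXQ=

The value after Basic is the Base64 encoding of crawler:secret.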

Form Authentication

When access to documents is protected using form-based authentication, the spider can automatically log in and access the documents. The attributes required for form authentication are the Form URL, the Form Action, and Name/Value pairs as required.
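
As a sketch, assume a hypothetical login page at http://intranet.example.com/login that contains the following form:

<form action="/dologin" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
</form>

For this page, the Form URL would be http://intranet.example.com/login, the Form Action would be /dologin, and the Name/Value pairs would be username and password with the corresponding credentials.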

Proxy server Indexing

When HTTP content is accessed through proxy servers, the proxy server settings are required for the spider to successfully access and index content. The attributes required for proxy server indexing are the Proxy server URL and a Username/Password.
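
For example (hypothetical values):

Proxy server URL: http://proxy.example.com:8080
Username/Password: the credentials for the proxy, if it requires authentication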

Metatags customization

The HTTP Collection crawler/parser can be controlled using the following HTML markup tags:

Robots Meta Tag

These tags in an HTML page specify whether SearchBlox can index the page (index/noindex) and whether it can follow the page's links to spider the rest of the website (follow/nofollow). The different combinations of robots meta tags are as follows:

<meta name="robots" content="index, follow">

<meta name="robots" content="noindex, follow">

<meta name="robots" content="index, nofollow">

<meta name="robots" content="noindex, nofollow">

NoIndex/StopIndex Tags

With HTTP collections, there is often a requirement to exclude content from sections of an HTML page from being indexed, such as headers, footers, and navigation. SearchBlox provides two ways to achieve this.

<noindex> Content to Exclude</noindex>
<!--stopindex-->Content to Exclude <!--startindex-->
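
For instance, a page template (hypothetical markup) might wrap its navigation and footer in stopindex/startindex comments so that only the main content is indexed:

<body>
<!--stopindex-->
<div class="nav">Site navigation links</div>
<!--startindex-->
<h1>Product Overview</h1>
<p>This main content is indexed.</p>
<!--stopindex-->
<div class="footer">Copyright notice</div>
<!--startindex-->
</body>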

Canonical Tag

A canonical URL is specified in a link tag to set the preferred URL for HTML pages that are copies or duplicates.
For example, for a URL such as http://www.sample.com/new.html?uid=dxdm59652xhax, the following can be specified in the HEAD section of the document:

<link rel="canonical" href="http://www.sample.com/new.html"/>

Index and Refresh Activity

An HTTP Collection can be indexed, refreshed, or cleared on demand, on a schedule, or through API requests.

Index

Starts the indexer for the selected collection, beginning from the root URLs.

Clear

Clears the current index for the selected collection.

Refresh

Revisits URLs from the current index to make sure they are still valid, and then continues to index newly discovered URLs.

Scheduled Activity

For each collection, any of the following scheduled indexer activities can be set:

  • Index - Set the frequency and the start date/time for indexing a collection.
  • Refresh - Set the frequency and the start date/time for refreshing a collection.
  • Clear - Set the frequency and the start date/time for clearing a collection.

  • Indexing is controlled from the Index sub-tab for a collection or through the API. The current status of a collection is always indicated on the Collection Dashboard and the Index page.
  • Refresh can be performed for a collection only after Indexing is completed.
  • Index and Refresh operations can also be initiated from the Collection Dashboard.
  • Scheduling can be performed only from the Index sub-tab.

Best Practices for Scheduling Index/Refresh of Collections

Do not give the same time schedule for all three operations (Index, Refresh, Clear), as this would create conflicts between the activities.

If you have multiple collections, stagger the schedules so that no more than 2-3 collections are indexing or refreshing at the same time.
