RSS Collection

SearchBlox includes a built-in spider/crawler that can index RSS and Atom feeds. RSS and Atom feeds are essentially XML files that provide information about the recently changed content on a website or blog. A feed search collection can be created by following the steps below.

  • After logging in to the Admin Console, click on the Add Collection button. The Add Collection screen will be displayed.
  • Enter a unique name for the collection (for example, News).
  • Click on the RSS Collection radio button.
  • Choose the language of the web pages that need indexing.
  • Click Add to create the collection.

Settings

To access the path settings for the RSS collection, click the collection name in the collections list.

Path Settings

Feed URLs
The feed URL is the URL of the RSS/Atom feed. The content at this URL is an XML file that lists URLs on the website. The SearchBlox feed crawler indexes the URLs listed in the RSS/Atom file, but it does not follow links found on those pages. Enter at least one feed URL for the collection (for example, http://rss.cnn.com/rss/cnn_topstories.rss) and save the settings.
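
As a rough illustration of the information a feed provides, the sketch below extracts the item links from a small RSS 2.0 document using Python's standard library. The sample feed and the feed_links helper are hypothetical, and this is not SearchBlox's own implementation; it only shows that the list of URLs comes from the feed itself rather than from links on the pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed would be fetched from the feed URL.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <item>
      <title>First story</title>
      <link>http://www.example.com/stories/1.html</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/stories/2.html</link>
    </item>
  </channel>
</rss>"""


def feed_links(rss_text):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


if __name__ == "__main__":
    # Only the item links are listed; the channel-level link is not an item.
    for url in feed_links(SAMPLE_RSS):
        print(url)
```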

Collection Filters
Filters let you control which documents the spider includes in or excludes from the index. Allow and Disallow filters make it possible to manage a collection by excluding unwanted documents.

Allow Paths

http://www.searchblox.com/.*
(Informs the spider to stay only within the searchblox.com site.)
.*
(Lets the spider go anywhere it wants, potentially indexing any site linked from the root URL.)

Disallow Paths

.*.jsp
/cgi-bin/.*
/internal/.*
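
The effect of combining allow and disallow paths can be sketched with Python's re module. This is only an approximation under stated assumptions: the patterns are treated as regular expressions searched for anywhere in the URL, the disallow patterns are written in an assumed wildcard form, and is_indexable is a hypothetical helper. The spider's actual matching rules may differ.

```python
import re

# Allow pattern taken from the example above.
ALLOW_PATHS = [r"http://www.searchblox.com/.*"]

# Assumed regular-expression forms of the disallow examples.
DISALLOW_PATHS = [r".*\.jsp", r"/cgi-bin/.*", r"/internal/.*"]


def is_indexable(url, allow=ALLOW_PATHS, disallow=DISALLOW_PATHS):
    """Return True if url matches an allow pattern and no disallow pattern."""
    allowed = any(re.search(pattern, url) for pattern in allow)
    blocked = any(re.search(pattern, url) for pattern in disallow)
    return allowed and not blocked


if __name__ == "__main__":
    for candidate in [
        "http://www.searchblox.com/products/index.html",
        "http://www.searchblox.com/cgi-bin/search",
        "http://www.searchblox.com/login.jsp",
        "http://www.example.com/index.html",
    ]:
        print(candidate, is_indexable(candidate))
```

In this sketch a disallow match always wins over an allow match, which is one common convention for crawl filters.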

Allow Formats

Select which formats are eligible to be part of your collection.

Collection Settings

The Settings sub-tab holds tunable parameters for the spider. A new collection is created with pre-configured default values for these parameters. The settings that can be configured from SearchBlox are listed below.

Each setting name below is followed by its description.

Keyword-in-Context Display

Keyword-in-context display returns search results whose description is drawn from the parts of the content where the search term occurs.

Maximum Document Age

Specifies the maximum allowable age in days of a document in the collection.

Maximum Document Size

Specifies the maximum allowable size in kilobytes of a document in the collection.

Boosting

Boost search terms for the collection by setting a value greater than 1 (maximum value 9999).

Remove Duplicates

When enabled, prevents indexing duplicate documents.

Stemming

When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".

Spelling Suggestions

When enabled, a spelling index is created at the end of the indexing process.

Logging

When logging is enabled, the indexer activity will be available in detail here: ../searchblox/logs/index.log.

HTTP Basic Authentication

When the spider requests a document, it presents these values (username and password) to the HTTP server in the Authorization HTTP header. The attributes required for basic authentication are username and password.
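
For reference, the value of the Authorization header in HTTP Basic authentication is the word "Basic" followed by the Base64 encoding of "username:password". The sketch below shows that encoding in Python; basic_auth_header is a hypothetical helper name, and the credentials are the example pair from the HTTP Basic authentication specification, not SearchBlox defaults.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
    print(basic_auth_header("Aladdin", "open sesame"))
```

Because Base64 is an encoding and not encryption, the credentials are only protected in transit when the feed and its documents are served over HTTPS.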

Form Authentication

When access to documents is protected using form-based authentication, the spider can automatically log in and access the documents. The attributes required for form authentication are Form URL, Form Action, and Name/Value pairs.

Proxy Server Indexing

When HTTP content is accessed through proxy servers, the proxy server settings are required to enable the spider to successfully access and index content. The attributes required for proxy server indexing are Proxy Server URL and Username/Password.

Indexing and Other Operations

The following operations can be performed in RSS collections.

Index

Starts the indexer for the selected collection, beginning from the feed URLs.

Clear

Clears the current index for the selected collection.

Refresh

Revisits URLs from the current index to make sure they are still valid, and then continues to index newly discovered URLs.

Scheduled Activity

For each collection, any of the following scheduled indexer activities can be set:
Index: Set the frequency and the start date/time for indexing a collection.
Refresh: Set the frequency and the start date/time for refreshing a collection.
Clear: Set the frequency and the start date/time for clearing a collection.

  • Indexer activity is controlled from the Index sub-tab in the collection, which also shows the current status of the indexer for that collection.
  • Once indexing is completed, a refresh can be performed.
  • Index and refresh operations can also be performed from the collection dashboard.
  • Scheduling can be performed only from the Index sub-tab.

Best Practices for Scheduling Operations

Do not schedule all three operations (Index, Refresh, Clear) for the same time, because the activities will conflict.

If you have multiple collections, stagger the schedules so that no more than 2-3 collections are indexing or refreshing at the same time.

Always schedule Clear at least five minutes before Index.
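
The practices above can be sketched as a small planning helper that computes start times to enter in the scheduler. Everything here is illustrative: stagger_schedule, its parameters, and the collection names are hypothetical, and the sketch assumes that indexing a collection finishes within one time slot. It does not call any SearchBlox API.

```python
from datetime import datetime, timedelta


def stagger_schedule(collection_names, first_clear,
                     clear_lead=timedelta(minutes=5),
                     slot=timedelta(hours=1), per_slot=2):
    """Plan Clear and Index start times for several collections.

    At most per_slot collections share a time slot, and each Clear
    starts clear_lead before the Index of the same collection.
    """
    schedule = {}
    for position, name in enumerate(collection_names):
        slot_start = first_clear + slot * (position // per_slot)
        schedule[name] = {
            "clear": slot_start,
            "index": slot_start + clear_lead,
        }
    return schedule


if __name__ == "__main__":
    plan = stagger_schedule(
        ["News", "Blog", "Press", "Docs", "Events"],
        first_clear=datetime(2024, 1, 1, 1, 0),
    )
    for name, times in plan.items():
        print(name, times["clear"].time(), times["index"].time())
```

With the defaults, News and Blog are cleared at 01:00 and indexed at 01:05, Press and Docs follow one hour later, and Events one hour after that. Refresh times could be planned the same way in slots that do not overlap with Index or Clear.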