
Dynamic Content Collection

SearchBlox can index dynamically generated web content, such as pages rendered with JavaScript or Single Page Applications (SPAs).

This collection indexes more slowly than the HTTP collection because each page must be dynamically rendered before it can be indexed. Use this collection type only for JavaScript-generated content.

The Chrome browser is required for using this collection.

Prerequisites

  • The Chrome browser must be installed on the system to use a Dynamic Content Collection.
  • Chrome version 83 or higher is required.

Windows:

Download and install Chrome from https://www.google.com/chrome/.

Linux:

Install Chrome by running the following commands:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
sudo yum install ./google-chrome-stable_current_*.rpm
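
To confirm that the installed version meets the minimum requirement, check it from the shell:

google-chrome --version
# Prints something like "Google Chrome 120.0.6099.109"; version 83 or higher is sufficient.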

Creating Dynamic Content Collection

  • After logging in to the Admin Console, click the Add Collection button.
  • Enter a unique Collection name for the data source (for example, dynamic).
  • Choose Dynamic Content Collection as Collection Type.
  • Choose the language of the content (if the language is other than English).
  • Click Add to create the collection.
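
Collections can also be managed through the SearchBlox REST Collection API instead of the Admin Console. The sketch below is a hypothetical example only: the endpoint path, JSON field names, and the coltype value are assumptions, so verify them against the SearchBlox Collection API reference before use.

# Hypothetical sketch: create the collection over REST.
# Endpoint path, field names, and coltype value are assumptions.
curl -X POST "http://localhost:8080/searchblox/rest/v2/api/coll" \
  -H "Content-Type: application/json" \
  -d '{"apikey": "YOUR-API-KEY", "colname": "dynamic", "coltype": "dynamic"}'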

Dynamic Content Collection Paths

The Paths sub-tab allows you to configure the Root URLs and the Allow/Disallow paths for the crawler. To access the paths for the collection, click the collection name in the Collections list.

Root URLs

  • The root URL is the starting URL for the crawler. It requests this URL, indexes the content, and follows links from the URL.
  • The root URL entered should have regular HTML HREF links that the crawler can follow.
  • In the Paths sub-tab, enter at least one root URL for the Dynamic Content Collection in the Root URLs field.

Allow/Disallow Paths

  • Allow/Disallow paths control which URLs the crawler includes or excludes.
  • Allow and Disallow paths make it possible to manage a collection by excluding unwanted URLs.
  • An allow path is mandatory to limit indexing to the subdomain provided in the Root URLs.

Field: Root URLs
Description: The starting URL for the crawler. You need to provide at least one root URL.

Field: Allow Paths
Description: For example, http://www.cnn.com/ (informs the crawler to stay within the cnn.com site) or .* (allows the crawler to follow any external URL or domain).

Field: Disallow Paths
Description: For example: .jsp, /cgi-bin/, /videos/, ?params

Field: Allowed Formats
Description: Select the document formats that need to be searchable within the collection.

❗️

Keep the crawler within the required domain(s)

Enter the Root URL domain name(s) (for example cnn.com or nytimes.com) within the Allow Paths to ensure the crawler stays within the required domains. If .* is left as the value within the Allow Paths, the crawler will go to any external domain and index the web pages.
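
As a rough illustration only, the example paths above behave approximately like the shell pipeline below; this models the matching with standard grep and is not SearchBlox's actual implementation.

# Simplified model: a URL is crawled if it matches an allow pattern
# and does not match any disallow pattern.
echo "http://www.cnn.com/world/story.html" \
  | grep -E "http://www\.cnn\.com/" \
  | grep -vE '\.jsp|/cgi-bin/|/videos/|\?params'
# The URL is printed, so it would be crawled; a URL containing /videos/
# or .jsp would produce no output and would be skipped.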

Dynamic Content Collection Settings

  • Only one setting is specific to this collection: the option to select an uploaded HAR file for indexing the dynamically generated content.
  • HAR is an HTTP Archive file that can be downloaded from a dynamically generated website. The file must be copied into the ../WEB-INF/har folder, and the SearchBlox instance must then be restarted (see the sketch after this list).
  • After the restart, the HAR file can be selected from the Dynamic Content Collection settings for indexing.
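
A minimal sketch of that copy-and-restart step on Linux follows; the installation path and service name are assumptions, so adjust them for your environment.

# Copy the downloaded HAR file into the har folder under WEB-INF,
# then restart SearchBlox. The path and service name are assumptions.
sudo cp mysite.har /path/to/searchblox/WEB-INF/har/
sudo systemctl restart searchblox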

Section: Keyword-in-Context Search Settings
Setting: Keyword-in-Context Display
Description: The keyword-in-context display returns search results with the description drawn from the content areas where the search term occurs.

Section: HAR files
Setting: HAR files
Description: The HAR file required to fetch the URLs of the page must be selected here. Note that the downloaded HAR file must first be copied into the ../WEB-INF/har folder.

Index Activity

A Dynamic Content Collection can be indexed or cleared on demand, on a schedule, or through API requests (see the sketch at the end of this section).

Index

Starts the indexer for the selected collection, beginning from the root URLs.

Clear

Clears the current index for the selected collection.

Scheduled Activity

For each collection, any of the following scheduled indexer activities can be set:
Index - Set the frequency and the start date/time for indexing a collection.
Clear - Set the frequency and the start date/time for clearing a collection.

  • The indexing operation starts the indexer for the selected collection from the root URLs.
  • On reindexing (that is, clicking Index again after the initial index operation), all crawled documents are reindexed. Documents that have been deleted from the source website or directory since the first index operation are removed from the index, and new documents are indexed.
  • Indexing can also be controlled from the Index sub-tab for a collection.
  • The current status of a collection is always indicated on the Collection Dashboard and the Index page.
  • The index operation can also be initiated from the Collection Dashboard.
  • Scheduling can be performed only from the Index sub-tab.
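
Because indexing can also be triggered through API requests, a hedged sketch follows. The endpoint path and field names here are assumptions, not confirmed API details; check the SearchBlox API reference for the exact request.

# Hypothetical sketch: start an index run over REST.
# Endpoint path and field names are assumptions.
curl -X POST "http://localhost:8080/searchblox/rest/v2/api/coll/index" \
  -H "Content-Type: application/json" \
  -d '{"apikey": "YOUR-API-KEY", "colname": "dynamic"}'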

👍

Best Practices

  • Index only dynamic content in this collection; use an HTTP Collection for static web content.
  • Most of the settings for an HTTP Collection are not available for a Dynamic Content Collection.
  • Do not schedule two collection operations (Index, Clear) at the same time.
  • If you have multiple collections, schedule indexing so that no more than three collections run at the same time.



What's Next

Searching
