HTTP Collection

SearchBlox includes a web crawler to index content from any intranet, portal or website. The crawler can also index HTTPS-based content without any additional configuration, and crawl through a proxy server or HTTP Basic Auth/Form authentication.

  • After logging in to the Admin Console, click on the Add Collection button.
  • Enter a unique Collection name for the data source (for example, intranetsite).
  • Choose HTTP Collection as Collection Type.
  • Choose the language of the content (if the language is other than English).
  • Click Add to create the collection.

Collection Paths

The HTTP collection Paths allow you to configure the Root URLs and the Allow/Disallow paths for the crawler. To access the paths for the HTTP collection, click on the collection name in the Collections list.

Root URLs
The root URL is the starting URL for the crawler: it requests this URL, indexes the content, and follows links from the URL. Make sure the root URL entered has regular HTML HREF links that the crawler can follow. In the Paths sub-tab, enter at least one root URL for the HTTP collection in the Root URLs field.

Allow/Disallow Paths
Allow and Disallow paths control which URLs the crawler includes or excludes, making it possible to manage a collection by keeping out unwanted URLs. A minimal sketch of how such patterns filter discovered URLs follows the examples below.

Allow Paths

http://www.cnn.com/ (Informs the crawler to stay within the cnn.com site.)
.* (Allows the crawler to go to any external URL or domain.)

Disallow Paths

.jsp
/cgi-bin/
/videos/
?params
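
For intuition, the following minimal Python sketch shows how allow/disallow patterns such as the examples above might decide whether a discovered URL is crawled. It assumes each pattern is treated as a regular expression matched anywhere in the URL; the patterns and URLs are illustrative, not SearchBlox's actual implementation.

import re

# Illustrative patterns mirroring the Allow/Disallow examples above.
ALLOW_PATHS = [r"http://www\.cnn\.com/"]                             # stay within cnn.com
DISALLOW_PATHS = [r"\.jsp", r"/cgi-bin/", r"/videos/", r"\?params"]

def should_crawl(url: str) -> bool:
    """Crawl a URL only if it matches an allow pattern and no disallow pattern."""
    allowed = any(re.search(p, url) for p in ALLOW_PATHS)
    blocked = any(re.search(p, url) for p in DISALLOW_PATHS)
    return allowed and not blocked

print(should_crawl("http://www.cnn.com/world/index.html"))  # True
print(should_crawl("http://www.cnn.com/videos/clip.html"))  # False: disallowed path
print(should_crawl("http://example.org/page.html"))         # False: outside the allow paths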

Allowed Formats

Select the document formats that need to be searchable within the collection.

Keep the crawler within the required domain(s)

Enter the Root URL domain name(s) (for example cnn.com or nytimes.com) within the Allow Paths to ensure the crawler stays within the required domains. If .* is left as the value within the Allow Paths, the crawler will go to any external domain and index the web pages.

Collection Settings

The Settings page has configurable parameters for the crawler. SearchBlox provides default parameters when a new collection is created. Most crawler settings can be changed for your specific requirements.

Section | Setting | Description
Keyword-in-Context Search Settings | Keyword-in-Context Display | Returns search results with the description drawn from the content areas where the search term occurs.
HTML Parser Settings | Description | Configures the HTML parser to read the description for a document from one of the HTML tags: H1, H2, H3, H4, H5, H6.
Scanner Settings | Maximum Document Age | Specifies the maximum allowable age in days of a document in the collection.
Scanner Settings | Maximum Document Size | Specifies the maximum allowable size in kilobytes of a document in the collection.
Scanner Settings | Maximum Spider Depth | Specifies the maximum depth the spider is allowed to follow when indexing documents. The maximum spider depth that can be set in SearchBlox is 15.
Scanner Settings | Spider Delay | Specifies the wait time in milliseconds between the spider's HTTP requests to a web server.
Scanner Settings | User Agent | The name under which the spider requests documents from a web server.
Scanner Settings | Referrer | A URL value set in the request headers to specify where the user agent previously visited.
Scanner Settings | Ignore Robots | Set to Yes or No to tell the spider whether to ignore robots rules. The default value is No.
Scanner Settings | Ignore Canonical | Set to Yes or No to tell the spider whether to ignore canonical URLs specified in the page. The default value is Yes.
Scanner Settings | Follow Sitemaps | Set to Yes or No to tell the spider whether to index only the URLs listed in sitemaps, or all discovered URLs. The default value is No.
Scanner Settings | Follow Redirects | Set to Yes or No to instruct the spider whether to follow redirects automatically.
Relevance | Boosting | Boost search terms for the collection by setting a value greater than 1 (maximum value 9999).
Relevance | Remove Duplicates | When enabled, prevents indexing of duplicate documents.
Relevance | Stemming | When enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".
Relevance | Spelling Suggestions | When enabled, a spelling index is created at the end of the indexing process.
Relevance | Enable Logging | Provides detailed indexer activity in ../searchblox/logs/index.log (see below).

When logging or debug logging mode is enabled, index.log records:

  • The list of links that are crawled.
  • The processing done on each URL, with a timestamp for when processing starts, whether the URL is indexed or skipped, and each step recorded as a separate entry in index.log.
  • A timestamp for when indexing completed, and the time taken for indexing, recorded against the indexed URL entry in the log file.
  • The last modified date of the URL.
  • Whether the URL was skipped, and why.

HTTP Basic Authentication

Basic Authentication credentials

When the spider requests a document, it presents these values (username/password) to the HTTP server in the Authorization MIME header. The attributes required for basic authentication are the username and password.
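
As an illustration of the mechanism only (not SearchBlox code), a request with Basic Authentication carries the username and password, base64-encoded, in the Authorization header; the URL and credentials below are hypothetical.

import base64
from urllib.request import Request, urlopen

# Hypothetical credentials as configured for the collection.
username, password = "crawler", "secret"
token = base64.b64encode(f"{username}:{password}".encode()).decode()

# The spider would send this header with each request to the protected server.
req = Request("http://intranet.example.com/protected/page.html")
req.add_header("Authorization", f"Basic {token}")
html = urlopen(req).read()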

Form Authentication

Form authentication fields

When access to documents is protected using form-based authentication, the spider can automatically log in and access the documents. The attributes required for form authentication are:
Form URL, Form Action, Name/Value pairs as required.
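
For illustration, form-based login roughly amounts to posting the configured name/value pairs to the form action URL and reusing the resulting session cookies, as in this sketch (hypothetical URLs and field names, using the third-party requests library; the actual login is performed by the spider):

import requests

session = requests.Session()

# Hypothetical form authentication attributes.
form_url = "http://intranet.example.com/login"           # Form URL: page serving the login form
form_action = "http://intranet.example.com/do-login"     # Form Action: URL the form posts to
fields = {"username": "crawler", "password": "secret"}   # Name/Value pairs as required

session.get(form_url)                    # pick up any session cookies set by the login page
session.post(form_action, data=fields)   # submit the login form

# Requests in the same session can now reach the protected documents.
page = session.get("http://intranet.example.com/protected/page.html")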

Proxy server Indexing

Proxy server credentials

When HTTP content is accessed through proxy servers, the proxy server settings are required to enable the spider to successfully access and index content. The attributes required for proxy server indexing are:
Proxy server URL, Username/password.
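
As a rough illustration (hypothetical proxy URL and credentials, again using the requests library), routing requests through an authenticated proxy looks like this; the spider does the equivalent internally once the proxy settings are configured:

import requests

# Hypothetical proxy server settings (URL plus username/password).
proxies = {
    "http": "http://proxyuser:proxypass@proxy.example.com:8080",
    "https": "http://proxyuser:proxypass@proxy.example.com:8080",
}

# Every crawler request would be routed through the proxy server.
response = requests.get("http://www.example.com/page.html", proxies=proxies)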

Metatags Customization

The HTTP Collection crawler/parser can be controlled using the following HTML markup tags.

HTML Tags
Description

Robots Meta Tag

These tags in the HTML page specify whether SearchBlox can or cannot index the page, and can or cannot spider the entire website. The different types of robots meta tags are as shown below:

<meta name="robots" content="index, follow">

<meta name="robots" content="noindex, follow">

<meta name="robots" content="index, nofollow">

<meta name="robots" content="noindex, nofollow">

NoIndex/StopIndex Tags

With HTTP collections, there is often a requirement to exclude content from sections of an HTML page from being indexed, such as headers, footers, and navigation. SearchBlox provides two ways to achieve this.

<noindex> Content to Exclude</noindex>
<!--stopindex-->Content to Exclude <!--startindex-->
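
The effect of these markers can be pictured as stripping the marked regions before the page text is extracted, as in the following sketch (illustrative only, assuming a simple regular-expression removal; not the actual parser):

import re

def strip_excluded_regions(html: str) -> str:
    """Drop <noindex>...</noindex> and <!--stopindex-->...<!--startindex--> regions."""
    html = re.sub(r"<noindex>.*?</noindex>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<!--stopindex-->.*?<!--startindex-->", "", html, flags=re.DOTALL)
    return html

sample = "<body><noindex>Site navigation</noindex>Main article text<!--stopindex-->Footer<!--startindex--></body>"
print(strip_excluded_regions(sample))  # <body>Main article text</body>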

Canonical Tag

Canonical is specified in the link tag in order to set the preferred URL in HTML pages that are copies or duplicates.

For example:
For a URL such as http://www.sample.com/new.html?uid=dxdm59652xhax, the preferred URL can be specified in the HEAD part of the document:

<link rel="canonical" href="http://sample.com"/>
Note: Canonical tags are ignored by default in the collection; canonical handling can be enabled by setting the Ignore Canonical value to No in the collection settings.

Index and Refresh Activity

An HTTP Collection can be indexed, refreshed or cleared on-demand, on a schedule, or through API requests.

Index

Starts the indexer for the selected collection, indexing from the root URLs.

Clear

Clears the current index for the selected collection.

Refresh

Revisits URLs from the current index to make sure they are still valid, indexes newly discovered URLs, and deletes URLs that are no longer valid.

Scheduled Activity

For each collection, any of the following scheduled indexer activity can be set:
Index - Set the frequency and the start date/time for indexing a collection.
Refresh - Set the frequency and the start date/time for refreshing a collection.
Clear - Set the frequency and the start date/time for clearing a collection.

  • The Index operation starts the indexer for the selected collection, starting from the root URLs. On reindexing (clicking Index again after the initial index operation), all crawled documents are reindexed. However, documents that have been deleted from the source website or directory since the first index operation are not deleted from the index. Indexing is controlled from the Index sub-tab for a collection or through the API. The current status of a collection is always indicated on the Collection Dashboard and the Index page.
  • The Refresh operation revisits URLs from the current index to make sure they are still valid, indexes newly discovered and modified URLs, and deletes URLs that are no longer valid. Unmodified URLs (URLs whose last modified date has not changed since the last indexing run) are not reindexed. Refresh can be performed for a collection after indexing is completed, or started directly from the Collections Dashboard or via the API. A simplified sketch of this refresh logic follows this list.
  • Index and Refresh operations can also be initiated from the Collection Dashboard.
  • Scheduling can be performed only from the Index sub-tab.
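
Conceptually, a refresh pass behaves roughly like the sketch below, using an in-memory index and made-up server responses (illustrative only; SearchBlox's actual implementation differs):

# In-memory "index": URL -> last-modified value recorded when it was indexed.
indexed_docs = {
    "http://site.example.com/a.html": "2023-01-01",
    "http://site.example.com/b.html": "2023-01-01",
    "http://site.example.com/gone.html": "2023-01-01",
}

# Made-up server responses: URL -> (HTTP status, current Last-Modified).
responses = {
    "http://site.example.com/a.html": (200, "2023-01-01"),  # unchanged since last run
    "http://site.example.com/b.html": (200, "2023-06-01"),  # modified since last run
    "http://site.example.com/gone.html": (404, None),       # no longer valid
}

for url, last_modified in list(indexed_docs.items()):
    status, current = responses[url]
    if status >= 400:
        del indexed_docs[url]         # invalid URL: deleted from the index
    elif current != last_modified:
        indexed_docs[url] = current   # modified URL: reindexed
    # unchanged URLs are skipped, i.e. not reindexed

print(indexed_docs)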

Best Practices for Scheduling Index/Refresh of Collections

Do not schedule all three operations (Index, Refresh, Clear) at the same time; this will create conflicts between the activities.

If you have multiple collections, schedule the activities so that no more than 2-3 collections are indexing or refreshing at the same time.

Using Sitemaps

You can index only the URLs listed in sitemaps by enabling Follow Sitemaps in the HTTP collection settings.
SearchBlox supports a sitemap.xml that is referenced in robots.txt when the "Follow Sitemaps" setting is enabled.
However, if the sitemap.xml contains a list of other sitemap.xml files (a sitemap index), SearchBlox cannot index them; this is not currently supported.
Only sitemap.xml files that list page URLs directly are supported.
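
As an illustration of the supported case (a sitemap.xml referenced from robots.txt and listing page URLs directly), sitemap discovery looks roughly like this sketch; the site URL is hypothetical and the code is not SearchBlox's implementation:

import urllib.request
import xml.etree.ElementTree as ET

SITE = "http://www.example.com"  # hypothetical site

# 1. Find "Sitemap:" entries in robots.txt.
robots = urllib.request.urlopen(f"{SITE}/robots.txt").read().decode()
sitemap_urls = [line.split(":", 1)[1].strip()
                for line in robots.splitlines()
                if line.lower().startswith("sitemap:")]

# 2. Collect the page URLs listed in each sitemap.xml (<urlset> of <url><loc> entries).
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for sitemap_url in sitemap_urls:
    tree = ET.fromstring(urllib.request.urlopen(sitemap_url).read())
    page_urls = [loc.text for loc in tree.findall("sm:url/sm:loc", ns)]
    # A nested sitemap index (<sitemapindex> of other sitemaps) would not be followed.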

Using Canonical URLs

You can use canonical URLs to avoid duplicates and improve link and ranking signals for content available through multiple URL structures. The same content may be accessible through multiple URLs, and content can even be distributed across different URLs and domains entirely.
To enable canonical URL indexing, set Ignore Canonical to No in the HTTP collection settings.

For example:

  1. Your site has multiple URLs because you position the same page under multiple sections.
http://sample.com/tea/

http://sample.com/hot-beverage/tea

http://sample.com/drinks/tea
  2. You serve the same content for the www subdomain or the http/https protocol.
http://sample.com/tea/

https://sample.com/tea

http://www.sample.com/tea
  3. Content is replicated partially or fully in different pages or sites in the domain.
http://discussion.sample.com/tea/

http://blog.sample.com/tea

http://www.sample.com/tea

This replication of pages can create challenges for users searching in SearchBlox, with repetitive content appearing in results that point to different pages. To overcome this, you can define a canonical URL for pages that have similar or equivalent content across multiple URLs/pages.
Steps

  1. Choose the most preferred URL as the canonical URL
  2. Indicate the preferred URL with the rel="canonical" link element
<link rel="canonical" href="http://www.sample.com/tea" />

The above indicates the preferred URL for accessing the tea post.

On indexing the URL, SearchBlox will index only the URL that has been specified as canonical.

To avoid errors, use an absolute path rather than a relative path in canonical URLs.
Note that while we encourage you to use these methods, none of them is mandatory. If you do not indicate a canonical URL, SearchBlox will treat the current URL as the best version.

ref: https://support.google.com/webmasters/answer/139066?hl=en

Enable Canonical

By default, canonical tags are ignored; canonical handling can be enabled by setting the Ignore Canonical value to No in the collection settings.

Remove Duplicates

When Remove Duplicates is enabled, pages with 100% identical content (i.e., duplicates) are not indexed. The content used for duplicate checking includes the title, keywords, description, other meta fields, and the content of the page.
If there is any difference in this combined content, the files/URLs are not considered duplicates.
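
For intuition, duplicate checking can be thought of as fingerprinting the combined title, meta fields, and page content, and skipping any document whose fingerprint has already been indexed; a rough sketch (not the actual implementation):

import hashlib

def content_fingerprint(doc: dict) -> str:
    """Hash the title, meta fields, and body together; identical content gives identical hashes."""
    combined = "\n".join([doc.get("title", ""), doc.get("keywords", ""),
                          doc.get("description", ""), doc.get("content", "")])
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

seen = set()
docs = [
    {"title": "Tea", "content": "All about tea."},
    {"title": "Tea", "content": "All about tea."},           # 100% identical: skipped
    {"title": "Tea", "content": "All about tea. Updated."},  # any difference: indexed
]
for doc in docs:
    fingerprint = content_fingerprint(doc)
    if fingerprint in seen:
        continue  # duplicate: not indexed
    seen.add(fingerprint)
    print("indexing:", doc["title"], fingerprint[:8])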

Add Documents

Using the Add Documents tab, you can manually add, update, or delete a document/URL through the Admin Console for an HTTP collection. To reach it, click a collection and select the last tab, named Add Documents.
Note that this only adds the individual URL and does not start crawling from that URL.
To add a URL to your collection, enter the web address and click "Add/Update":

To delete a URL from your collection, enter the web address and click "Delete":

To see the status of a URL, click "Status":
