Email Collection

Email Collections can index content from PST files including attachments and documents from file systems.

Email Collections can be created by following the steps below:

After logging in to the Admin Console, click on the Add Collection button. The Add Collection screen will be displayed.

  • Enter a unique name for the collection (for example, EmailArchive).
  • Click on the Email radio button.
  • Choose the language of the content.
  • Click Add to create the new collection.

Settings

The email collection settings page allows you to configure the directory paths and filters for the collection. To access the path settings for the collection, click on the collection name in the collections list.

Path Settings

Directory Paths
The directory path is the starting path for the crawler. The crawler recursively indexes files within the folders. Enter at least one directory path for the collection. For example, c:\salesdocs or /var/web/html/salesdocs.
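
As a rough illustration of what recursive indexing from a starting path means, the sketch below walks a directory tree and collects every file beneath it. It is not SearchBlox's crawler; `list_crawl_candidates` is a hypothetical helper name used only for this example.

```python
import os


def list_crawl_candidates(start_path):
    """Return every file found under start_path, descending into sub-folders."""
    found = []
    # os.walk visits start_path first and then each nested folder in turn.
    for folder, _subfolders, filenames in os.walk(start_path):
        for name in filenames:
            found.append(os.path.join(folder, name))
    return sorted(found)
```

Calling `list_crawl_candidates("/var/web/html/salesdocs")` would list the files in that folder and in all of its sub-folders.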

Collection Filters
Filters allow you to configure the crawler to include or exclude documents or sub-folders. Allow and Disallow filters make it possible to manage a collection by excluding unwanted documents.

Map Directory Path

The directory path that needs to be indexed. This path is mapped to the URL entered in the To URL field below. Both fields are optional.

To URL

The URL that has to be mapped to the directory path.
For example, C:\testfolder can be mapped to http://www.examplesite.com so that even though SearchBlox indexes the content from the file system, when you click on the search result, the web document is served from the web server. The child path is automatically mapped to the URL.
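
The mapping described above can be pictured with a minimal sketch. It assumes the child path is simply appended to the To URL with forward slashes; `map_path_to_url` is a hypothetical helper, not part of SearchBlox, and it does not URL-encode special characters.

```python
from pathlib import PureWindowsPath


def map_path_to_url(file_path, mapped_dir, to_url):
    """Translate a file under mapped_dir into the matching URL under to_url."""
    # relative_to raises ValueError if file_path is not inside mapped_dir.
    relative = PureWindowsPath(file_path).relative_to(PureWindowsPath(mapped_dir))
    # as_posix() turns backslashes into the forward slashes a URL expects.
    return to_url.rstrip("/") + "/" + relative.as_posix()


print(map_path_to_url(r"C:\testfolder\reports\q1.html",
                      r"C:\testfolder",
                      "http://www.examplesite.com"))
# http://www.examplesite.com/reports/q1.html
```

The file name `reports\q1.html` is invented for the example; only `C:\testfolder` and the site URL come from the text above.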

Allow Paths

C:\www\html\*
When creating an email collection, specifying an allow filter is optional because the indexer only looks into sub-folders of the directory paths. However, if symbolic links are present, the crawler follows them into the linked directories.

Disallow Paths

C:\\emails\\old\\noindex\\.
\\videos\\.
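
The disallow examples above use escaped backslashes, which suggests they are read as regular expressions. The sketch below shows how such filters could be evaluated under that assumption. The precedence rules (disallow wins, and an empty allow list permits everything) are also assumptions, and `is_path_indexable` is a hypothetical helper, not a SearchBlox API.

```python
import re


def is_path_indexable(path, allow_patterns, disallow_patterns):
    """Apply allow/disallow filters to one path (assumed regex semantics)."""
    # Assumption: a match on any disallow pattern excludes the path outright.
    if any(re.search(p, path) for p in disallow_patterns):
        return False
    # Assumption: with no allow patterns configured, everything else passes.
    if not allow_patterns:
        return True
    return any(re.search(p, path) for p in allow_patterns)


disallow = [r"C:\\emails\\old\\noindex\\.", r"\\videos\\."]
print(is_path_indexable(r"C:\emails\2023\inbox.pst", [], disallow))     # True
print(is_path_indexable(r"C:\emails\videos\demo.pst", [], disallow))    # False
print(is_path_indexable(r"C:\emails\old\noindex\a.pst", [], disallow))  # False
```

The file names in the usage lines are invented for illustration.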

Allowed Formats

Select which formats are eligible to be part of the collection using the checkboxes.
The file formats supported in an email collection are HTML, XML, Word, PowerPoint, Excel, Visio, PDF, Text, RTF, EPUB, AutoCAD, OpenOffice, iWorks, WordPerfect, Images, Audio, Video, PST files, Email, and Archive.

Collection Settings

The Settings sub-tab holds tunable parameters for the email collection. SearchBlox comes pre-configured with parameters when a new collection is created.

The settings that can be configured are listed below:

Email Settings

All Mail

Selecting All Mail enables all folders in the PST files to be indexed.

Partial Mail

Selecting Partial Mail will allow you to select specific folders in PST files for indexing.
The folders that can be selected are:

  • Inbox
  • Outbox
  • Deleted Items
  • Drafts
  • Sent mail
  • Others (all remaining folders fall into this category)

Keyword-in-Context Display

When keyword-in-context display is enabled, the description shown for each search result is taken from the parts of the content where the search term occurs.

Maximum Document Age

Specifies the maximum allowable age in days of a document in the collection.

Maximum Document Size

Specifies the maximum allowable size in kilobytes of a document in the collection.

Remove Duplicates

When enabled, prevents indexing duplicate documents.

Boosting

Boost search terms for the collection by setting a value greater than 1 (maximum value 9999).

Stemming

When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".

Spelling Suggestions

When enabled, a spelling index is created at the end of the indexing process.

Logging

Records indexer activity in detail in ../searchblox/logs/index.log.
When logging or debug logging mode is enabled, index.log contains the following details:

  • The list of files that are crawled.
  • The processing done on each file, with a timestamp for when processing starts, whether the item is being indexed or skipped, and whether the file was indexed. Each of these is written as a separate entry in index.log.
  • The timestamp of when indexing completed and the time taken for indexing, recorded on the indexed file's entry in the log.
  • The last modified date of the file.
  • Whether the file was skipped, and why.
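
The Maximum Document Age and Maximum Document Size settings described above can be pictured as a simple eligibility check. The sketch below is only an illustration: it assumes age is measured from the file's last-modified time and that one kilobyte is 1024 bytes, neither of which is stated here, and `within_limits` is a hypothetical name.

```python
from datetime import datetime, timedelta


def within_limits(last_modified, size_bytes, max_age_days, max_size_kb, now=None):
    """Return True when a document satisfies both the age and size limits."""
    now = now or datetime.now()
    # Assumption: age is measured from the file's last-modified time.
    too_old = (now - last_modified) > timedelta(days=max_age_days)
    # Assumption: one kilobyte is 1024 bytes.
    too_large = size_bytes > max_size_kb * 1024
    return not (too_old or too_large)
```

For example, with a 60-day age limit and a 10 KB size limit, a 2 KB file modified 30 days ago passes, while the same file modified a year ago does not.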

Extraction of Emails

  • You can extract emails as text, along with their attachments, to a specific folder (all emails and attachments are exported to the specified location).
  • The location is specified in <searchblox installation path>/webapps/searchblox/WEB-INF/pst.yml.
  • Restart SearchBlox after entering the storage location in pst.yml, then clear and reindex the collection.

Indexing and Other Operations

The following operations can be performed in email collections:

Index

Starts the indexer for the selected collection, beginning from the configured directory paths.

Clear

Clears the current index for the selected collection.

Refresh

Revisits the files from the current index to make sure they are still valid, and then continues to index newly discovered documents.

Scheduled Activity

For each collection, any of the following scheduled indexer activities can be set:
Index - Set the frequency and the start date/time for indexing a collection.
Refresh - Set the frequency and the start date/time for refreshing a collection.
Clear - Set the frequency and the start date/time for clearing a collection.

  • Indexer activity is controlled from the Index sub-tab in the collection. The current status of an indexer for a particular collection is indicated.
  • Once indexing has been completed, you can perform a refresh.
  • Index and refresh operations can also be performed from the collection dashboard.
  • Scheduling can be performed only from indexer sub-tabs.

Best Practices for Scheduling

Do not schedule all three operations (Index, Refresh, Clear) for the same time. Doing so creates conflicts between the activities.

If you have multiple collections, stagger the schedules so that no more than two or three collections are indexing or refreshing at the same time.
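
One way to plan staggered start times is sketched below. It is a planning aid only, not a SearchBlox feature: `stagger_schedules` is a hypothetical helper, the collection names other than EmailArchive are invented, and the plan only avoids overlap if each run finishes within the chosen slot length.

```python
from datetime import datetime, timedelta


def stagger_schedules(collections, first_start, slot_length, max_concurrent=2):
    """Assign start times so at most max_concurrent collections share a slot."""
    schedule = {}
    for position, name in enumerate(collections):
        # Integer division groups collections into batches of max_concurrent.
        slot = position // max_concurrent
        schedule[name] = first_start + slot * slot_length
    return schedule


plan = stagger_schedules(
    ["EmailArchive", "Sales", "Legal", "HR", "Finance"],
    first_start=datetime(2024, 1, 1, 1, 0),
    slot_length=timedelta(hours=2),
)
for name, start in plan.items():
    print(name, start.strftime("%H:%M"))
# EmailArchive 01:00
# Sales 01:00
# Legal 03:00
# HR 03:00
# Finance 05:00
```

The resulting times would then be entered by hand as the start date/time for each collection's scheduled activity.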

Logging

Log files whose names start with EmailCollection_<date> are generated in the ../webapps/searchblox/logs folder. They list the status of the action performed on each PST file.