Filesystem Collection
SearchBlox includes a built-in file system crawler to index content from file systems. To create a file system search collection, follow the steps below.
Creating File Collection
After logging in to the Admin Console, click the Add Collection button. The Add Collection screen will be displayed.
- Enter a unique name for the collection (for example, SalesDocs).
- Select the File System radio button.
- Choose the language of the documents that need indexing.
- Click Add to create the collection.
The file collection settings configure the directory paths and the filters for the collection. To access the path settings for the collection, click on the collection name in the collections list.
File Collection Path Settings
Directory Paths
The directory path is the starting path for the crawler. The crawler recursively indexes content within the folders. In the paths sub-tab, enter at least one directory path for the collection (for example, c:\salesdocs or /var/web/html/salesdoc) and then save the settings.
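To illustrate what "recursively indexes content within the folders" means, here is a minimal Python sketch that lists every file under a starting path. It is not SearchBlox's crawler; the function name `iter_files` is hypothetical and the example path is taken from the text above.

```python
import os
from typing import Iterator


def iter_files(start_path: str) -> Iterator[str]:
    """Yield every file path found under start_path, descending into subfolders."""
    # os.walk visits start_path and then each subfolder in turn.
    # By default it does not follow symbolic links to directories.
    for folder, _subfolders, filenames in os.walk(start_path):
        for name in sorted(filenames):
            yield os.path.join(folder, name)


# Prints nothing if the directory does not exist on this machine.
for path in iter_files("/var/web/html/salesdoc"):
    print(path)
```

Every file in every nested folder is visited once, which is why a single directory path is enough to cover a whole document tree.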
Allow/Disallow Paths
Allow and Disallow filters make it possible to manage a collection by excluding unwanted documents.
The path settings are listed in the table:
Field | Description |
---|---|
Directory Path | The directory path is the starting path for the crawler. |
Map Directory Path | The directory path that needs to be indexed. It is mapped to the URL entered in the To URL field. These two fields are optional. |
To URL | The URL that has to be mapped to the directory path. For example, C:\testfolder can be mapped to http://www.examplesite.com so that, although SearchBlox indexes the content from the file system, the web document is served from the web server when a user clicks on the search result. The child path is automatically mapped to the URL. |
Allow Paths | Example: C:\\www\\html\\* When creating a file system-based collection, specifying an allow filter is optional because the indexer only looks into sub-folders of the directory path. However, if symbolic links are present, the crawler follows them into the linked directories. |
Disallow Paths | C:\\www\\html\\noindex\\.* \\cgi-bin\\.* |
Allowed Formats | Select the formats eligible to be part of the collection using the checkboxes. File formats supported in File Collection are HTML, XML, Word, PowerPoint, Excel, Visio, PDF, Text, RTF, EPUB, AutoCAD, OpenOffice, iWorks, WordPerfect, Images, Audio, Video, PST files, Email and Archive. |
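The sketch below shows one way Allow and Disallow filters could be evaluated against a file path. It assumes the filters are regular expressions (the doubled backslashes in the table suggest this, but the source does not confirm it), that a Disallow match always wins, and that an empty Allow list permits everything. The function name `is_path_allowed` is hypothetical.

```python
import re
from typing import Sequence


def is_path_allowed(path: str, allow: Sequence[str], disallow: Sequence[str]) -> bool:
    """Return True when path passes the allow filters and matches no disallow filter."""
    # Assumption: a disallow match always excludes the path.
    if any(re.search(pattern, path) for pattern in disallow):
        return False
    # Assumption: allow filters are optional, so an empty list allows everything.
    if not allow:
        return True
    return any(re.search(pattern, path) for pattern in allow)


# Patterns copied from the table above, written as Python raw strings.
allow = [r"C:\\www\\html\\*"]
disallow = [r"C:\\www\\html\\noindex\\.*", r"\\cgi-bin\\.*"]

print(is_path_allowed(r"C:\www\html\docs\report.pdf", allow, disallow))
print(is_path_allowed(r"C:\www\html\noindex\draft.doc", allow, disallow))
```

Checking Disallow first keeps the rule simple: a folder such as noindex stays excluded even when a broader Allow pattern also matches it.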
File Collection Settings
The Settings sub-tab holds tunable parameters for the file system crawler and the indexer. SearchBlox comes pre-configured with parameters when a new collection is created. The settings that can be configured from SearchBlox are listed as follows.
Setting | Description |
---|---|
Keyword-in-Context Display | The keyword-in-context returns search results with the description displayed from content areas where the search term occurs. |
Maximum Document Age | Specifies the maximum allowable age in days of a document in the collection. |
Maximum Document Size | Specifies the maximum allowable size in kilobytes of a document in the collection. |
Remove Duplicates | When enabled, prevents indexing of duplicate documents. |
Boosting | Boost search terms for the collection by setting a value greater than 1 (maximum value 9999). |
Stemming | When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run". |
Spelling Suggestions | When enabled, a spell index is created at the end of the indexing process. |
Enable OCR | SearchBlox now supports Optical Character Recognition (OCR) to search text within images. When Enable OCR is set to Yes in file collection, SearchBlox is able to index the text within the image. However, for the feature to work, it is necessary to install Tesseract, an OCR engine. File types supported by OCR recognition are JPG, PNG and TIFF. |
Logging | Records detailed indexer activity in <SearchBlox_installation_dir>/searchblox/logs/index.log. When logging or debug logging mode is enabled, index.log contains: the list of files that are crawled; the processing done on each file, with a timestamp for when processing starts, whether indexing is taking place or the URL is skipped, and whether the file is indexed; the timestamp of when indexing completed and the time taken to index each page; the last modified date of the file; and whether the file was skipped, and why. All data is available as separate entries in index.log. |
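As an illustration of the Maximum Document Age and Maximum Document Size settings, the following sketch checks a single file against both limits. It assumes that age is measured from the file's last-modified time and that one kilobyte is 1024 bytes; the source states neither. The function name `within_limits` and its `now` parameter are hypothetical.

```python
import os
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def within_limits(path: str, max_age_days: int, max_size_kb: int,
                  now: Optional[float] = None) -> bool:
    """Return True if the file is within both the age limit and the size limit."""
    info = os.stat(path)
    current = time.time() if now is None else now
    # Assumption: document age is counted from the last-modified timestamp.
    age_days = (current - info.st_mtime) / SECONDS_PER_DAY
    # Assumption: 1 kilobyte = 1024 bytes.
    size_kb = info.st_size / 1024
    return age_days <= max_age_days and size_kb <= max_size_kb
```

A file that fails either test would be left out of the collection, so raising one limit does not compensate for the other.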
Indexing File Share Using File collection
- To index content from file share using file collection, provide the share path (UNC path) to the folder that contains the files to be indexed.
- If the file share is available on another server within the same network and requires permission, run the SearchBlox server service with Admin access, and enter the credentials as listed in the following screenshot.
- Files on the share are indexed successfully only when SearchBlox runs under an admin account or an account that has access to the files.
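Before entering a share path, it can help to confirm that it has the UNC form \\server\share\folder. The sketch below does this with Python's standard `pathlib`; the function name `is_unc_path` and the server and share names are hypothetical, and the check says nothing about whether the share is reachable or readable.

```python
from pathlib import PureWindowsPath


def is_unc_path(path: str) -> bool:
    """Return True when path has the UNC form: two backslashes, server, share."""
    drive = PureWindowsPath(path).drive
    # For a UNC path, pathlib reports \\server\share as the drive component.
    parts = drive.strip("\\").split("\\")
    return drive.startswith("\\\\") and len(parts) == 2


# "fileserver" and "sales" are made-up names used only for illustration.
print(is_unc_path(r"\\fileserver\sales\docs"))
print(is_unc_path(r"C:\salesdocs"))
```

`PureWindowsPath` only parses the string, so the check behaves the same on Windows and Linux hosts.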
Indexing and Other Operations
The following operations can be performed in file collection:
Operation | Description |
---|---|
Index | Starts the indexer for the selected collection. Starts indexing from the directory paths. |
Clear | Clears the current index for the selected collection. |
Refresh | Revisits the files from the current index to make sure they are still valid, and then continues to index newly discovered documents. |
Scheduled Activity | For each collection, any of the following scheduled indexer activities can be set: Index - Set the frequency and the start date/time for indexing a collection. Clear - Set the frequency and the start date/time for clearing a collection. |
- Indexer activity is controlled from the Index sub-tab in the collection. The current status of an indexer for a particular collection is indicated.
- Once indexing has been completed, refresh can be performed.
- The refresh operation revisits files from the current index to make sure they are still valid, indexes newly discovered and modified files, and deletes the documents that are no longer valid.
- Index and refresh operations can also be performed from the collections dashboard.
- Scheduling can be performed only from indexer sub-tabs.
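The refresh behavior described above (index new and modified files, delete documents that are no longer valid) can be sketched as a comparison between two snapshots. This is only an illustration: it assumes changes are detected by last-modified timestamp, which the source does not state, and the function name `plan_refresh` is hypothetical.

```python
from typing import Dict, List, Tuple


def plan_refresh(indexed: Dict[str, float],
                 on_disk: Dict[str, float]) -> Tuple[List[str], List[str], List[str]]:
    """Compare the index with the file system and return (new, modified, deleted).

    Both dictionaries map a file path to its last-modified timestamp.
    """
    new = sorted(p for p in on_disk if p not in indexed)
    # Assumption: a newer timestamp on disk means the file was modified.
    modified = sorted(p for p in on_disk if p in indexed and on_disk[p] > indexed[p])
    deleted = sorted(p for p in indexed if p not in on_disk)
    return new, modified, deleted


indexed = {"a.pdf": 100.0, "b.doc": 100.0, "c.txt": 100.0}
on_disk = {"a.pdf": 100.0, "b.doc": 250.0, "d.xls": 300.0}
print(plan_refresh(indexed, on_disk))
```

Unchanged files (a.pdf in the example) appear in none of the three lists, which is what makes a refresh cheaper than clearing and re-indexing the whole collection.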
Schedule Frequency
SearchBlox supports the following schedule frequencies:
- Once
- Every Minute
- Hourly
- Daily
- Every 48 Hours
- Every 96 Hours
- Weekly
- Monthly
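To make the frequencies concrete, the sketch below computes the next run time from the previous one. The frequency labels come from the list above; everything else is an assumption, in particular that "Monthly" means the same day of the following month (clamped to the month's last day) and that "Once" has no follow-up run. The name `next_run` is hypothetical.

```python
import calendar
from datetime import datetime, timedelta
from typing import Optional

# Frequencies that correspond to a fixed interval.
FIXED_INTERVALS = {
    "Every Minute": timedelta(minutes=1),
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
}


def next_run(last_run: datetime, frequency: str) -> Optional[datetime]:
    """Return the next run time after last_run, or None for a one-time schedule."""
    if frequency == "Once":
        return None
    if frequency == "Monthly":
        # Assumption: same day next month, clamped to that month's last day.
        year = last_run.year + (last_run.month // 12)
        month = last_run.month % 12 + 1
        day = min(last_run.day, calendar.monthrange(year, month)[1])
        return last_run.replace(year=year, month=month, day=day)
    return last_run + FIXED_INTERVALS[frequency]


print(next_run(datetime(2024, 1, 31, 9, 0), "Monthly"))
print(next_run(datetime(2024, 1, 31, 9, 0), "Every 48 Hours"))
```

An unknown frequency label raises `KeyError`, which is deliberate in this sketch so that a typo does not silently disable a schedule.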
Best Practices
- Provide a directory path or a valid UNC path in the File Collection path settings.
- When mapping a directory to a URL, also enter the directory path in the Map Directory Path field.
- Do not schedule all three operations (Index, Refresh, Clear) for the same time. This will create a conflict between the activities.
- If you have multiple collections, stagger the schedules so that no more than 2-3 collections are indexing or refreshing at the same time.
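The directory-to-URL mapping mentioned above can be pictured as replacing the mapped directory prefix with the target URL while keeping the child path. The sketch below is an illustration under that assumption, not SearchBlox's implementation; `map_path_to_url` is a hypothetical name, and the example values come from the To URL row of the path settings table.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def map_path_to_url(file_path: str, mapped_dir: str, to_url: str) -> str:
    """Translate a file under mapped_dir into the matching URL under to_url."""
    # relative_to raises ValueError if file_path is not inside mapped_dir.
    relative = PureWindowsPath(file_path).relative_to(PureWindowsPath(mapped_dir))
    # as_posix() converts backslashes to forward slashes; quote() escapes spaces.
    return to_url.rstrip("/") + "/" + quote(relative.as_posix())


print(map_path_to_url(r"C:\testfolder\reports\q1.pdf",
                      r"C:\testfolder",
                      "http://www.examplesite.com"))
```

Because the child path is carried over unchanged, the web server must expose the same folder layout as the mapped directory for the resulting links to resolve.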
OCR Recognition
- SearchBlox supports Optical Character Recognition (OCR) to search text within images.
- When Enable OCR is set to Yes in File Collection, SearchBlox is able to index the text within the image.
- However, for the feature to work, it is necessary to install Tesseract, an OCR engine.
- This feature has been tested on Windows and Linux operating systems.
- The image file types supported by OCR Recognition are JPG, PNG and TIFF.
About Tesseract
- The Tesseract download links for Windows are listed below. Install the software in the default location in Windows.
  https://github.com/tesseract-ocr/tesseract/wiki/Downloads
  http://sourceforge.net/projects/tesseract-ocr-alt/files/
  Please download version 3.02.02:
  https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-setup-3.02.02.exe/download
- Tesseract is available directly for many Linux distributions. The package is generally called "tesseract" or "tesseract-ocr".
- You can search your distribution's repositories to find it. Packages are also generally available for language training data (search the repositories), but if not, you will need to download the appropriate training data, unpack it, and copy the .traineddata file into the "tessdata" directory, i.e., /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata (ref: https://code.google.com/p/tesseract-ocr/wiki/ReadMe).
- Tesseract software installation instructions and download links are available here:
  https://code.google.com/p/tesseract-ocr/
  https://code.google.com/p/tesseract-ocr/downloads/list
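After installation, a quick check that the Tesseract executable can be found may save a failed indexing run. The sketch below looks for `tesseract` on the system PATH and asks it for its version with the `-v` flag. How SearchBlox itself locates Tesseract is not stated in this document, so finding it on PATH is an assumption here, and the function names are hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


def tesseract_version(executable_path: str) -> str:
    """Return the first line printed by `tesseract -v`, or an empty string."""
    result = subprocess.run([executable_path, "-v"],
                            capture_output=True, text=True, check=False)
    # Some Tesseract releases print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


path = find_tesseract()
if path is None:
    print("Tesseract was not found on PATH")
else:
    print(path, tesseract_version(path))
```

A result of "not found" does not necessarily mean the installation failed; on Windows the installer's default location may simply not be on PATH.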
Setting up OCR Recognition in SearchBlox
- After installing the Tesseract software, restart SearchBlox.
- In the file collection, select Yes to Enable OCR as shown:
- Index the file collection. Search for the word in the image. Sample results are shown here:
Add Documents Tab
- The Documents tab lets you manually add, update, or delete a document or file in a File System collection through the Admin Console.
- To open the Documents tab, click a collection and select the last tab, named Documents.
- Please note that this only adds the individual file and does not start crawling from the path.
- To add a file to your collection, enter the file path and click "Add/Update".
- To delete a file from your collection, enter the file path and click "Delete".
- To see the status of an indexed file, click "Status".