Filesystem Collection

SearchBlox includes a built-in file system crawler to index content from file systems.

Creating a File Collection

To create a File System Collection in SearchBlox, follow these steps:

1. Log in to the Admin Console.
2. Navigate to the Collections tab and click either "Create a New Collection" or the "+" icon.
3. Select "File System Collection" as the Collection Type.
4. Enter a unique name for your collection (e.g., "SalesDocs").
5. Configure RAG settings:
   - Enable for Hybrid RAG search
   - Disable if not required
6. Set access permissions:
   - Choose between Private or Public Collection Access
   - Configure Collection Encryption based on your security requirements
7. Select the content language (if other than English).
8. Click Save to create the collection.

Once the File collection is created, you are taken to the Path tab, where you configure the directory paths and the filters for the collection.

File Collection Path Settings

Directory Paths

The directory path is the starting path for the crawler. The crawler recursively indexes content within the folders. In the Paths sub-tab, enter at least one directory path for the collection (for example, c:\salesdocs or /var/web/html/salesdoc), and then save the settings.
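To make the idea of recursive crawling from a starting path concrete, here is a minimal Python sketch. It is not the SearchBlox crawler; the function name `iter_files` and the `follow_links` parameter are hypothetical and only illustrate how a walk from one directory path reaches every file in its sub-folders.

```python
import os
from typing import Iterator


def iter_files(start_path: str, follow_links: bool = False) -> Iterator[str]:
    """Yield the path of every file under start_path, including sub-folders.

    Illustrative only: follow_links is a hypothetical switch that decides
    whether symbolic links to directories are descended into.
    """
    for folder, _subfolders, filenames in os.walk(start_path, followlinks=follow_links):
        for name in sorted(filenames):
            yield os.path.join(folder, name)
```

Calling `list(iter_files("/var/web/html/salesdoc"))` would return every file below that directory, which is roughly the set of candidates a crawler then filters and indexes.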

Allow/Disallow Paths

Allow and Disallow filters make it possible to manage a collection by excluding unwanted documents.

The path settings are listed in the table:

Field	Description
Directory Path	The directory path is the starting path for the crawler.
Map Directory Path	The directory path that needs to be indexed. It is mapped to the URL entered in the To URL field. These two fields are optional.
To URL	The URL to which the directory path is mapped. For example, `C:\testfolder` can be mapped to a URL so that, although SearchBlox indexes the content from the file system, the web document is served from the web server when a user clicks the search result. Child paths are mapped to the URL automatically.
Allow Paths	Example: `C:\\www\\html\\*` For a file system collection, an allow filter is optional because the indexer only looks into sub-folders of the directory path. If symbolic links are present, however, the spider follows them into the linked directories.
Disallow Paths	`C:\\www\\html\\noindex\\.` `\\cgi-bin\\.`
Allowed Formats	Select the formats eligible to be part of the collection using the checkboxes. File formats supported in File Collection are HTML, XML, Word, PowerPoint, Excel, Visio, PDF, Text, RTF, EPUB, AutoCAD, OpenOffice, iWorks, WordPerfect, Images, Audio, Video, PST files, Email and Archive.
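The following Python sketch shows one way the Allow and Disallow examples from the table could be evaluated. It rests on assumptions that the table does not state: that the patterns are regular expressions matched anywhere in the path, that a Disallow match overrides an Allow match, and that an empty Allow list permits everything. The function name `is_path_allowed` is hypothetical.

```python
import re
from typing import Sequence


def is_path_allowed(path: str, allow: Sequence[str], disallow: Sequence[str]) -> bool:
    """Return True if path passes the allow/disallow filters."""
    # Assumption: a Disallow match always wins over an Allow match.
    if any(re.search(pattern, path) for pattern in disallow):
        return False
    # Assumption: an empty Allow list means "allow everything".
    if not allow:
        return True
    return any(re.search(pattern, path) for pattern in allow)


# Patterns copied from the table; raw strings keep the backslashes intact.
ALLOW = [r"C:\\www\\html\\*"]
DISALLOW = [r"C:\\www\\html\\noindex\\.", r"\\cgi-bin\\."]

print(is_path_allowed(r"C:\www\html\index.html", ALLOW, DISALLOW))          # True
print(is_path_allowed(r"C:\www\html\noindex\draft.html", ALLOW, DISALLOW))  # False
```

Checking Disallow first keeps the rule simple to reason about: a path under an excluded folder is never indexed, no matter how broad the Allow pattern is.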

File Collection Settings

The Settings sub-tab holds tunable parameters for the file system crawler and the indexer. SearchBlox comes pre-configured with parameters when a new collection is created. The settings that can be configured from SearchBlox are listed as follows.

📘
Generate Title, Description and Topics using SearchAI PrivateLLM and Enable Hybrid Search:

Choose and enable Generate Using LLM and Auto Relevance

Clicking Compare Keyword Search with Hybrid redirects to the Comparison Plugin.

Settings	Description
Title	Generates concise and relevant titles for the indexed documents using LLM.
Description	Generates the description for indexed documents using LLM.
Topic	Generates relevant topics for indexed documents using LLM based on document's content.
Auto Relevance	Enables or disables Hybrid Search for automatic relevance ranking.

Setting	Description
Remove Duplicates	When enabled, prevents indexing of duplicate documents.
Stemming	When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".
Spelling Suggestions	Provides spelling suggestions for the collection. The default is YES.
Keyword-in-Context Display	Keyword-in-context display returns search results whose description is taken from the content areas where the search term occurs.
HTML Parser Setting	The setting configures the HTML parser to read the description for a document from one of the HTML tags: META, H1, H2, H3, H4, H5, H6.
Maximum Document Age	Specifies the maximum allowable age in days of a document in the collection.
Maximum Document Size	Specifies the maximum allowable size in kilobytes of a document in the collection.
Logging	Provides the indexer activity in detail in `<SearchBlox_installation_dir>/webapps/ROOT/logs/index.log`. When logging or debug logging mode is enabled, index.log records the following: - The list of files that are crawled. - The processing done on each file, with a timestamp of when processing starts, whether the indexing process is taking place or the URL is skipped, and whether the file gets indexed. - All data as separate entries in index.log. - The timestamp of when indexing completed and the time taken to index each page. - The last modified date of the file. - Whether the file is skipped, and why.
Enable Content API	Provides the ability to crawl the content with special characters included.
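As an illustration of how the Maximum Document Age and Maximum Document Size limits interact, here is a small Python sketch. It is not SearchBlox code. It assumes that age is measured from the file's last modified date and that one kilobyte equals 1024 bytes; neither detail is stated in the table, and the function name `within_limits` is hypothetical.

```python
from datetime import datetime, timedelta


def within_limits(last_modified: datetime, size_bytes: int,
                  max_age_days: int, max_size_kb: int,
                  now: datetime) -> bool:
    """Return True if a document satisfies both the age and the size limit."""
    # Assumption: age is counted from the last modified date.
    young_enough = (now - last_modified) <= timedelta(days=max_age_days)
    # Assumption: 1 KB = 1024 bytes.
    small_enough = size_bytes <= max_size_kb * 1024
    return young_enough and small_enough


now = datetime(2024, 6, 1)
print(within_limits(datetime(2024, 5, 1), 200_000, max_age_days=90, max_size_kb=1024, now=now))  # True
print(within_limits(datetime(2023, 1, 1), 200_000, max_age_days=90, max_size_kb=1024, now=now))  # False
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to try different limits before changing the collection settings.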

Indexing File Share Using File collection

To index content from a file share using a file collection, provide the share path (UNC path) to the folder that contains the files to be indexed.
If the file share is on another server within the same network and requires permission, run the SearchBlox server service with Admin access and enter the credentials.
Running the service as an admin account, or as an account with access to the files, helps index files from the share successfully.
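Before entering a share path, it can help to confirm that it really is a UNC path of the form `\\server\share\folder`. The Python sketch below uses the standard library's `PureWindowsPath` for that check, so it runs on any operating system. The server and share names are made up for the example, and `is_unc_path` is a hypothetical helper, not part of SearchBlox.

```python
from pathlib import PureWindowsPath


def is_unc_path(path: str) -> bool:
    """Return True if path starts with a \\\\server\\share prefix."""
    # For a UNC path, PureWindowsPath reports "\\\\server\\share" as the drive;
    # for a local path such as C:\salesdocs the drive is "C:".
    drive = PureWindowsPath(path).drive
    return drive.startswith("\\\\")


# Hypothetical server and share names, for illustration only.
print(is_unc_path(r"\\fileserver\salesdocs\2024"))  # True
print(is_unc_path(r"C:\salesdocs"))                 # False
```

This only validates the shape of the path. Whether the SearchBlox service account can actually read the share still depends on the permissions described above.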

Schedule and Index

Sets the frequency and the start date/time for indexing a collection from its root paths. SearchBlox supports the following schedule frequencies:

Once
Hourly
Daily
Every 48 Hours
Every 96 Hours
Weekly
Monthly
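To see what these frequencies mean in practice, the Python sketch below lists the first few run times for a given start date and frequency. It is an illustration, not the SearchBlox scheduler. In particular, it assumes that Monthly means "the same day of the next month, clamped to the month's last day"; the names `run_times` and `add_months` are hypothetical.

```python
import calendar
from datetime import datetime, timedelta
from typing import List

# Fixed-length frequencies from the list above.
INTERVALS = {
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
}


def add_months(start: datetime, months: int) -> datetime:
    """Shift start by whole months, clamping the day (assumed behaviour)."""
    index = start.month - 1 + months
    year, month = start.year + index // 12, index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return start.replace(year=year, month=month, day=day)


def run_times(start: datetime, frequency: str, count: int) -> List[datetime]:
    """Return the first `count` scheduled run times for a frequency."""
    if frequency == "Once":
        return [start]
    if frequency == "Monthly":
        return [add_months(start, i) for i in range(count)]
    step = INTERVALS[frequency]
    return [start + i * step for i in range(count)]


for run in run_times(datetime(2024, 1, 1, 9, 0), "Every 48 Hours", 3):
    print(run)
```

With a start of 1 January 2024 at 09:00 and a frequency of Every 48 Hours, the runs fall on 1, 3, and 5 January at 09:00.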

The following operations can be performed in a file collection:

Activity	Description
Enable Scheduler for Indexing	Once enabled, you can set the Start Date and Frequency
Save	For each collection, indexing can be scheduled based on the above options.
View all Collection Schedules	Redirects to the Schedules section, where all the Collection Schedules are listed.

👍
Best Practices

Provide a directory path or a valid UNC path in the File collection path settings.

When mapping a directory to a URL, also enter the directory path in the Map Directory Path field.

If you have multiple collections, schedule indexing so that no more than 2-3 collections are indexing or refreshing at the same time.
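One way to plan schedules that respect the last recommendation is to assign start times in batches. The Python sketch below is a planning aid only, with hypothetical collection names and a hypothetical `stagger_starts` helper. It assumes each indexing run finishes within one time slot, which you would need to verify against your own indexing times.

```python
from datetime import datetime, timedelta
from typing import Dict, Sequence


def stagger_starts(collections: Sequence[str], first_start: datetime,
                   slot: timedelta, max_concurrent: int = 2) -> Dict[str, datetime]:
    """Assign start times so at most max_concurrent collections share a slot."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    # Collections 0..max_concurrent-1 start in the first slot, the next
    # batch starts one slot later, and so on.
    return {
        name: first_start + (position // max_concurrent) * slot
        for position, name in enumerate(collections)
    }


# Hypothetical collection names, for illustration only.
plan = stagger_starts(
    ["SalesDocs", "HRDocs", "Legal", "Wiki", "Archive"],
    first_start=datetime(2024, 1, 1, 1, 0),
    slot=timedelta(hours=1),
)
for name, start in plan.items():
    print(name, start)
```

The resulting start times can then be entered as the Start Date for each collection in the scheduler.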

Manage Documents Tab

The Manage Documents tab supports the following operations:
1. Add/Update
2. Filter
3. View content
4. View metadata
5. Refresh
6. Delete
To add a document, click the "+" icon.
Enter the document path or URL, then click Add/Update.
Once the document is updated, its URL appears on the screen and you can perform the operations listed above.
To delete a file from your collection, enter the file path and click "Delete".
To see the status of an indexed file, click "View Metadata".
