Email Collection

An Email Collection indexes content from PST files, including attachments, and can also index files from a file system. This collection type is intended mainly for PST files.

Creating an Email Collection (Easy Steps)

  1. Log in to the Admin Console
  2. Go to the Collections tab
  3. Click "Create a New Collection" or the "+" icon
  4. Choose "Email Collection" as the Collection Type
  5. Enter a unique name (e.g., "Email Archive")
  6. Set Collection Access (Private or Public)
  7. Enable Encryption if needed
  8. Choose the content language
  9. Click Save to create the collection
  • After creating the Email Collection, you will be taken to the Path tab.
    The Email Collection settings page lets you set folder paths and filters. To access these settings, click on the collection name in the Collections list.

Email Collection Path Settings

Directory Paths

The directory path is the folder where the crawler starts indexing; the crawler traverses every subfolder beneath it. Enter at least one path, for example, c:\salesdocs or /var/web/html/salesdocs.

Allow/Disallow Paths

Use Allow and Disallow filters to control the collection by including only needed files and excluding unwanted ones.

Field | Description
Directory Path | The folder where the crawler starts searching for files.
Allow Paths | Example: C:\\www\\html\\*. For a file system-based collection, an allow filter is optional because the indexer scans all subfolders. If a folder contains symbolic links, the crawler also follows them into the linked directories.
Disallow Paths | Examples: C:\\www\\html\\noindex\\.* and \\cgi-bin\\.*
Allowed Formats | Select the file formats you want to include by checking the boxes. Supported formats include HTML, XML, Word, PowerPoint, Excel, Visio, PDF, Text, RTF, EPUB, AutoCAD, OpenOffice, iWorks, WordPerfect, Images, Audio, Video, PST files, Emails, and Archives.
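
The filter examples above look like regular expressions with escaped backslashes. As a minimal sketch, assuming the patterns are evaluated as regular expressions that may match anywhere in a path and that a disallow match takes precedence (how SearchBlox itself evaluates them is not stated here), the following Python snippet shows which paths such filters would accept. The function name is_indexable and the sample paths are hypothetical.

```python
import re

# Filter values copied from the examples above. Raw strings keep the
# doubled backslashes, which a regex reads as one literal backslash.
ALLOW = [r"C:\\www\\html\\*"]
DISALLOW = [r"C:\\www\\html\\noindex\\.*", r"\\cgi-bin\\.*"]


def is_indexable(path, allow=ALLOW, disallow=DISALLOW):
    """Return True if path matches an allow pattern and no disallow pattern.

    Assumption: patterns are regular expressions matched anywhere in the
    path, and a disallow match wins over an allow match.
    """
    if any(re.search(p, path) for p in disallow):
        return False
    return any(re.search(p, path) for p in allow)


print(is_indexable(r"C:\www\html\mail\archive.pst"))  # True
print(is_indexable(r"C:\www\html\noindex\old.pst"))   # False
print(is_indexable(r"C:\www\html\cgi-bin\form.pl"))   # False
```

Testing patterns this way before saving them can help catch a filter that is broader or narrower than intended.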

Email Collection Settings

The Settings tab contains adjustable options for the Email Collection. SearchBlox provides default settings when a new collection is created.

The following settings can be configured:

Setting | Description
Remove Duplicates | Turn this on to avoid indexing duplicate documents.
Stemming | When on, words are reduced to their root form (e.g., "running", "runs", "ran" → "run").
Spelling Suggestions | Provides spelling suggestions for the collection. The default setting is YES.
Keyword-in-Context Display | Shows search results with snippets from the content where the search term appears.
HTML Parser Settings | Configures the HTML parser to take the document description from HTML tags such as META, H1, H2, H3, H4, H5, or H6.
Email Settings - All Mail | Indexes all documents from all extracted PST folders. This option is enabled by default.
Email Settings - Partially | Indexes documents only from selected PST folders. Default folders include Inbox, Outbox, Deleted Items, Drafts, and Sent Mail. Custom folders are included under "Others Option".
Maximum Document Age | Sets the maximum age (in days) a document can be in the collection.
Maximum Document Size | Sets the maximum allowed size of a document in the collection (in kilobytes).
Enable Detailed Log Settings | Provides detailed indexer activity in ../webapps/ROOT/logs/index.log. When logging or debug mode is enabled, the log includes:
  - List of files that were crawled
  - Processing details for each file, with timestamps and whether it was indexed or skipped, recorded as separate entries
  - Timestamp of indexing completion and the time taken for each file
  - Last modified date of the file
  - Information on skipped files and the reasons
Enable Content API | Allows the crawler to index content that contains special characters.
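
As a minimal sketch of how the Maximum Document Age and Maximum Document Size limits could be applied, the snippet below checks a document's modification time and size against limits expressed in days and kilobytes, the units given above. The function within_limits, its parameters, and the use of 1024 bytes per kilobyte are assumptions for illustration, not SearchBlox's implementation.

```python
from datetime import datetime, timedelta


def within_limits(size_bytes, modified, max_age_days, max_size_kb, now=None):
    """Return True if a document passes both the age and the size limit."""
    now = now or datetime.now()
    too_old = now - modified > timedelta(days=max_age_days)
    too_big = size_bytes > max_size_kb * 1024  # assumption: 1 KB = 1024 bytes
    return not (too_old or too_big)


now = datetime(2024, 6, 1)
recent = datetime(2024, 5, 20)

# 12 days old and 200 KB: passes both limits.
print(within_limits(200 * 1024, recent, max_age_days=30, max_size_kb=500, now=now))  # True
# Last modified more than 30 days ago: rejected by the age limit.
print(within_limits(200 * 1024, datetime(2023, 1, 1), 30, 500, now=now))  # False
# 900 KB: rejected by the size limit.
print(within_limits(900 * 1024, recent, 30, 500, now=now))  # False
```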

Extraction of Emails

  • Emails and attachments can be extracted to a specific folder, with all content exported to that location.
  • Set the location in <SEARCHBLOX_INSTALLATION_PATH>/webapps/searchblox/WEB-INF/pst.yml
  • After updating pst.yml, restart SearchBlox, then clear and reindex the collection.

Schedule and Index

Set when and how often the collection is indexed, starting from its directory paths. SearchBlox supports these schedule options:

  • Once
  • Hourly
  • Daily
  • Every 48 Hours
  • Every 96 Hours
  • Weekly
  • Monthly

The following operations can be performed in email collections:

Activity | Description
Enable Scheduler for Indexing | Turn this on to set the start date and how often indexing should run.
Save | Saves your scheduling settings for the collection.
View all Collection Schedules | Opens the Schedules page where you can see all scheduled collections.
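
To illustrate how the schedule options translate into repeat intervals, here is a minimal sketch that computes the next run time from a start date and a frequency. The interval mapping, the function next_run, and treating Monthly as 30 days are assumptions; SearchBlox's scheduler may compute these differently.

```python
from datetime import datetime, timedelta

# Assumed repeat interval for each recurring schedule option.
INTERVALS = {
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
    "Monthly": timedelta(days=30),  # assumption: a fixed 30-day month
}


def next_run(start, frequency, now):
    """Return the next run at or after now, or None if no run remains."""
    if now <= start:
        return start
    if frequency == "Once":
        return None  # the single run is already in the past
    step = INTERVALS[frequency]
    periods = -(-(now - start) // step)  # ceiling division on timedeltas
    return start + periods * step


start = datetime(2024, 1, 1, 9, 0)
print(next_run(start, "Every 48 Hours", datetime(2024, 1, 4, 10, 0)))  # 2024-01-05 09:00:00
print(next_run(start, "Once", datetime(2024, 1, 4, 10, 0)))            # None
```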

Data Fields Tab

The Data Fields tab lets you create custom fields for search and view the default fields in non-encrypted collections.
SearchBlox supports 4 types of Data Fields:

  1. Keyword
  2. Number
  3. Date
  4. Text
  • After configuring Data Fields, you must clear and re-index the collection for changes to take effect.

To learn more about Data Fields, refer to the Data Fields Tab documentation.

Models

Embedding

  • Provider specifies the embedding provider used to generate vector representations of documents.
  • Model defines the embedding model used to convert document content into vectors for semantic search.

Reranker

  • Provider specifies the reranker provider used for improving search result relevance.
  • Model defines the reranker model used to re-score and reorder search results based on relevance.

LLM

  • Provider specifies the Large Language Model provider used for AI-powered features.
  • Model defines the LLM used for tasks such as document enrichment, summaries, and SmartFAQs.

  • These settings override global configurations and apply only to the current collection.
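
The override behavior described above can be pictured as a lookup that prefers collection-level values and falls back to global ones. This is a sketch only; the dictionary keys, provider names, and model names are hypothetical placeholders, not actual SearchBlox configuration values.

```python
# Hypothetical global defaults for the three model types.
GLOBAL_MODELS = {
    "embedding": {"provider": "global-provider", "model": "global-embedding-model"},
    "reranker": {"provider": "global-provider", "model": "global-reranker-model"},
    "llm": {"provider": "global-provider", "model": "global-llm"},
}


def effective_models(global_cfg, collection_cfg):
    """Merge settings so collection-level values override global ones."""
    merged = {}
    for kind, defaults in global_cfg.items():
        # Later keys win, so any collection value replaces the global one.
        merged[kind] = {**defaults, **collection_cfg.get(kind, {})}
    return merged


# This collection overrides only the LLM model; everything else is inherited.
email_archive = {"llm": {"model": "collection-llm"}}
result = effective_models(GLOBAL_MODELS, email_archive)
print(result["llm"])        # {'provider': 'global-provider', 'model': 'collection-llm'}
print(result["embedding"])  # {'provider': 'global-provider', 'model': 'global-embedding-model'}
```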

Best Practices

  • To use the email extraction and download feature, update pst.yml as mentioned earlier, restart SearchBlox, and then index the collection.
  • For multiple collections, schedule them so that only 2–3 collections index or refresh at the same time.
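
One way to follow the scheduling recommendation above is to group collections into batches and give each batch its own start time. The sketch below assumes a batch size of 3 and a fixed gap between batches; the collection names, the 2-hour gap, and the function stagger are hypothetical, and a suitable gap depends on how long each collection takes to index.

```python
from datetime import datetime, timedelta


def stagger(collections, first_start, batch_size=3, gap=timedelta(hours=2)):
    """Assign start times so at most batch_size collections start together."""
    schedule = {}
    for i, name in enumerate(collections):
        # Collections 0-2 share the first slot, 3-5 the next, and so on.
        schedule[name] = first_start + (i // batch_size) * gap
    return schedule


names = ["Email Archive", "Sales Mail", "Support Mail", "HR Mail", "Legal Mail"]
plan = stagger(names, datetime(2024, 1, 1, 22, 0))
for name, start in plan.items():
    print(name, start)
```

With these inputs, the first three collections start at 22:00 and the remaining two start two hours later.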