Email Collection

Email Collections allow you to index content from PST files, including attachments, as well as documents from file systems. We recommend using this collection type exclusively for PST files.

Creating an Email Collection

Follow these steps to create a new Email Collection:

Log in to the Admin Console
Navigate to the Collections tab
Click on "Create a New Collection" or the "+" icon
Select "Email Collection" as the Collection Type
Enter a unique name for your collection (e.g., "Email Archive")
Configure Collection Access settings (Private/Public)
Set Collection Encryption according to your security requirements
Select the language of the content
Click "Save" to create your collection

Once the Email Collection is created, you are taken to the Path tab.

The email collection settings page allows you to configure the directory paths and filters for the collection. To access the path settings for a collection, click the collection name in the collections list.

Email Collection Path Settings

Directory Paths

The directory path is the starting path for the crawler, which recursively indexes files in that folder and its sub-folders. Enter at least one directory path for the collection, for example c:\salesdocs or /var/web/html/salesdocs.
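As a rough illustration of what a recursive crawl from a directory path involves, the sketch below walks a starting folder and collects PST files. It is not SearchBlox's crawler; the function name `find_pst_files` and the `follow_links` parameter are hypothetical and exist only for this example.

```python
import os


def find_pst_files(start_path, follow_links=True):
    """Walk start_path recursively and return the .pst files found, sorted.

    follow_links=True descends into symbolic links to directories,
    which is an assumption made for this illustration.
    """
    matches = []
    for folder, _subfolders, filenames in os.walk(start_path, followlinks=follow_links):
        for name in filenames:
            # Compare case-insensitively so "MAIL.PST" is also picked up.
            if name.lower().endswith(".pst"):
                matches.append(os.path.join(folder, name))
    return sorted(matches)
```

Calling `find_pst_files("/var/web/html/salesdocs")` would list every PST file below that folder, which is a quick way to check what a directory path is likely to cover before indexing.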

Allow/Disallow Paths

Allow and Disallow filters make it possible to manage a collection by excluding unwanted documents.

Field	Description
Directory Path	The directory path is the starting path for the crawler.
Allow Paths	`C:\\www\\html\\*` Specifying an allow filter is optional for an email collection because the indexer only looks into sub-folders of the directory path. If those folders contain symbolic links, however, the crawler follows them into the linked directories.
Disallow Paths	`C:\\www\\html\\noindex\\.` `\\cgi-bin\\.`
Allowed Formats	Use the checkboxes to select which formats are eligible to be part of the collection. The file formats supported in email collections are HTML, XML, Word, PowerPoint, Excel, Visio, PDF, Text, RTF, EPUB, AutoCAD, OpenOffice, iWorks, WordPerfect, Images, Audio, Video, PST files, Email, and Archive.
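The following sketch shows one way allow and disallow filters could combine when deciding whether a path is indexed. It makes three assumptions that the text above does not confirm: that the patterns are regular expressions, that a disallow match takes precedence over an allow match, and that an empty allow list means every path is allowed. The function name `is_path_eligible` and the example patterns are hypothetical.

```python
import re


def is_path_eligible(path, allow_patterns, disallow_patterns):
    """Return True if path passes the allow/disallow filters.

    Assumptions for this sketch: patterns are regular expressions,
    disallow wins over allow, and an empty allow list allows everything.
    """
    if any(re.search(pattern, path) for pattern in disallow_patterns):
        return False
    if not allow_patterns:
        return True
    return any(re.search(pattern, path) for pattern in allow_patterns)


# Illustrative patterns only; "\\\\" in a regex matches one literal backslash.
allow = [r"C:\\mail\\archive\\.*"]
disallow = [r"\\noindex\\.*"]

print(is_path_eligible(r"C:\mail\archive\2020\inbox.pst", allow, disallow))
print(is_path_eligible(r"C:\mail\archive\noindex\old.pst", allow, disallow))
```

Under these assumptions the first path is accepted and the second is rejected, because it contains a `noindex` folder.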

Email Collection Settings

The Settings sub-tab holds tunable parameters for the email collection. SearchBlox comes pre-configured with parameters when a new collection is created.

The settings that can be configured are listed as follows:

Setting	Description
Remove Duplicates	When enabled, prevents indexing duplicate documents.
Stemming	When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".
Spelling Suggestions	Provides spelling suggestions for the collection. The default is YES.
Keyword-in-Context Display	Displays the description in search results from the content areas where the search term occurs.
HTML Parser Settings	Configures the HTML parser to read the description for a document from one of the HTML tags: META, H1, H2, H3, H4, H5, H6.
Email Settings - All Mail	Allows the crawler to index all documents from all extracted PST folders. All Mail is enabled by default.
Email Settings - Partially	Allows the crawler to index all documents from selected extracted PST folders. The selectable folders are Inbox, Outbox, Deleted Items, Drafts, and Sent Mail. All custom email folders fall under the Others option.
Maximum Document Age	Specifies the maximum allowable age in days of a document in the collection.
Maximum Document Size	Specifies the maximum allowable size in kilobytes of a document in the collection.
Enable Detailed Log Settings	Records detailed indexer activity in ../webapps/ROOT/logs/index.log. When logging or debug logging mode is enabled, index.log contains the following as separate entries: the list of files crawled; the processing performed on each file, with a timestamp for when processing started and whether the file was indexed or skipped; the timestamp when indexing completed and the time taken to index each file; the last modified date of each file; and, for skipped files, the reason they were skipped.
Enable Content API	Provides the ability to crawl the document content with special characters included.
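To make the Maximum Document Age and Maximum Document Size settings concrete, here is a minimal sketch of the kind of check they imply. It assumes that age is measured from the file's last modified time, that one kilobyte is 1024 bytes, and that both limits are inclusive; none of these details are stated above, and the function `within_limits` is hypothetical.

```python
import time

SECONDS_PER_DAY = 86400


def within_limits(size_bytes, modified_epoch, max_size_kb, max_age_days, now=None):
    """Return True if a document is within the size and age limits.

    Assumptions for this sketch: 1 KB = 1024 bytes, age is counted from
    the last modified time, and both limits are inclusive.
    """
    now = time.time() if now is None else now
    age_days = (now - modified_epoch) / SECONDS_PER_DAY
    size_kb = size_bytes / 1024
    return size_kb <= max_size_kb and age_days <= max_age_days
```

For example, with a maximum size of 2 KB and a maximum age of 10 days, a 2048-byte file modified exactly 10 days ago passes this check, while a 2049-byte file does not.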

Extraction of Emails

You can extract emails as text, together with their attachments, to a specific folder. All emails and attachments are exported to the location you specify.
Specify the location in <SEARCHBLOX_INSTALLATION_PATH>/webapps/searchblox/WEB-INF/pst.yml.
After entering the storage location in pst.yml, restart SearchBlox, then clear and reindex the collection.

Schedule and Index

Sets the frequency and the start date/time for indexing a collection, starting from its directory paths. SearchBlox supports the following schedule frequencies:

Once
Hourly
Daily
Every 48 Hours
Every 96 Hours
Weekly
Monthly
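The frequencies above can be thought of as fixed intervals added to the start date/time. The sketch below computes the next run under that model. It is an illustration only: the mapping, the function `next_run`, and in particular the treatment of "Monthly" as 30 days are assumptions, not SearchBlox's actual scheduler behavior.

```python
from datetime import datetime, timedelta

# Interval per frequency; None means the schedule runs a single time.
FREQUENCY_INTERVALS = {
    "Once": None,
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
    "Monthly": timedelta(days=30),  # assumption: a month is treated as 30 days
}


def next_run(start, frequency, after):
    """Return the first scheduled run strictly after `after`, or None."""
    interval = FREQUENCY_INTERVALS[frequency]
    if after < start:
        return start
    if interval is None:
        return None  # a "Once" schedule has already run
    # Count the whole intervals elapsed since the start, then step to the next one.
    periods = (after - start) // interval + 1
    return start + periods * interval
```

With a start of 1 January 2024 at midnight and a frequency of "Every 48 Hours", `next_run` called for 4 January returns 5 January at midnight under this model.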

The following operations can be performed in email collections:

Activity	Description
Enable Scheduler for Indexing	Once enabled, you can set the Start Date and Frequency.
Save	Saves the indexing schedule for the collection based on the options above.
View all Collection Schedules	Redirects to the Schedules section, where all the Collection Schedules are listed.

Data Fields Tab

Use the Data Fields tab to create custom fields for search. For non-encrypted collections, the tab also shows the Default Data Fields. SearchBlox supports four types of Data Fields:

Keyword
Number
Date
Text

Once the Data Fields are configured, the collection must be cleared and re-indexed for the changes to take effect.

To learn more about Data Fields, refer to the Data Fields Tab documentation.

👍
Best Practices

If you need the extraction and download feature of the email collection, make the required changes in pst.yml as described earlier, restart the instance, and then index the collection.

If you have multiple collections, always schedule indexing so that no more than 2-3 collections are indexing or refreshing at the same time.
