Amazon S3 Collection
Creating Amazon S3 Collection
You can create an Amazon S3 collection by following the steps given below.
- After logging in to the Admin Console, select the Collections tab and click Create a New Collection or the "+" icon.
- Choose Amazon S3 Collection as the Collection Type.
- Enter a unique name for your collection (for example, AmazonS3).
- Choose Private/Public Collection Access and Collection Encryption as per the requirements.
- Choose the language of the content (if the language is other than English).
- Click Save to create the collection.
- Once the AmazonS3 collection is created, you will be taken to the AmazonS3 tab.
AmazonS3 Collection Settings
- The Settings sub-tab holds settings for Amazon S3 and tunable parameters for the search.
- Amazon S3 settings must be set explicitly in each Amazon S3 collection.
- The mandatory fields for an Amazon S3 collection are:
  - Access key
  - Secret key
  - Bucket name
- When a new collection is created, SearchBlox also comes pre-configured with a few other Amazon S3 parameters, such as includes and excludes.
- The following table lists the settings for an Amazon S3 collection:
Field | Description |
---|---|
Access Key | Access key from Amazon S3 security credentials. Mandatory field. |
Secret Key | Secret key from Amazon S3 security credentials. Mandatory field. |
Name | Optional name. |
Bucket | Amazon S3 bucket to index. Mandatory field. |
Path Prefix | Path prefix to index within the bucket. Example: Work/. Optional. If specified, it must be an existing path and end with a trailing /. |
Includes | File types to be included. Example: .pdf, .jpg. |
Excludes | File types to be excluded. Example: *.zip. |
Relevance - Remove Duplicates | Avoids indexing duplicate documents, i.e., documents that have exactly the same content. The default is NO. |
Relevance - Stemming | Stemming lets a search match the inflected forms of a root word. For example, "running", "runs", and "ran" are all inflected forms of "run". The default is YES. |
Relevance - Spelling Suggestions | When enabled, a spelling index is created at the end of the indexing process. |
Keyword-in-Context Display | The keyword-in-context returns search results with the description displayed from content areas where the search term occurs. |
Enable Detailed Log Settings | When detailed logging is enabled, indexing activity is logged in detail in index.log. Log details include the indexing status of each URL with its timestamp, the status code, and the time taken for indexing. The default is NO. |
Enable Content API | Provides the ability to crawl the document content with special characters included. |
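
The mandatory-field and Path Prefix rules from the table can be checked before the values are entered in the Admin Console. The following is a minimal sketch, not SearchBlox code: the `S3CollectionSettings` class and the `validate_settings` function are hypothetical names introduced only for illustration, and the sketch does not verify that the prefix actually exists in the bucket.

```python
from dataclasses import dataclass


@dataclass
class S3CollectionSettings:
    """Hypothetical container mirroring the fields in the settings table."""
    access_key: str
    secret_key: str
    bucket: str
    path_prefix: str = ""  # optional, e.g. "Work/"


def validate_settings(settings: S3CollectionSettings) -> list[str]:
    """Return a list of problems; an empty list means the basic rules are met."""
    problems = []
    mandatory = (
        ("Access Key", settings.access_key),
        ("Secret Key", settings.secret_key),
        ("Bucket", settings.bucket),
    )
    for label, value in mandatory:
        if not value.strip():
            problems.append(f"{label} is mandatory")
    # The table states that a path prefix, if given, needs a trailing "/".
    if settings.path_prefix and not settings.path_prefix.endswith("/"):
        problems.append("Path Prefix must end with a trailing /")
    return problems
```

For example, calling `validate_settings` with a path prefix of `Work` (no trailing slash) returns a single problem message, while `Work/` passes.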
Additional Note
- Do not log transactions to the S3 buckets being indexed, because those log files will also be indexed, increasing bandwidth usage.
- If logging is needed, exclude the log files by extension in the collection settings.
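
To illustrate how excluding log files by extension keeps them out of the index, here is a small sketch of include/exclude filtering over S3 object keys. It assumes shell-style glob patterns such as `*.log`; the exact pattern syntax SearchBlox applies is not described here, and `should_index` is a hypothetical helper, not part of the product.

```python
from fnmatch import fnmatch


def should_index(key: str, includes: list[str], excludes: list[str]) -> bool:
    """Decide whether an object key passes the include/exclude patterns.

    Assumption: patterns are shell-style globs, excludes take priority,
    and an empty include list means "include everything".
    """
    name = key.lower()
    if any(fnmatch(name, pattern.lower()) for pattern in excludes):
        return False
    if not includes:
        return True
    return any(fnmatch(name, pattern.lower()) for pattern in includes)


# Example: index PDFs only, and never index log or zip files.
includes = ["*.pdf"]
excludes = ["*.log", "*.zip"]
for key in ["Work/report.pdf", "logs/2024-01-01.log", "Work/archive.zip"]:
    print(key, should_index(key, includes, excludes))
```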
Schedule and Index
Sets the frequency and the start date/time for indexing a collection. SearchBlox supports the following schedule frequencies:
- Once
- Hourly
- Daily
- Every 48 Hours
- Every 96 Hours
- Weekly
- Monthly
The following operations can be performed in an Amazon S3 collection:
Activity | Description |
---|---|
Schedule | For each collection, indexing can be scheduled based on the above options. |
Data Fields Tab
Using the Data Fields tab, you can create custom fields for search and view the default data fields of a non-encrypted collection. SearchBlox supports four types of data fields:
- Keyword
- Number
- Date
- Text

Once the data fields are configured, the collection must be cleared and re-indexed for the changes to take effect.
To learn more about data fields, refer to Data Fields Tab.
Best Practices
- It is mandatory to provide the access key, secret key, bucket name, and update rate in the S3 collection settings.
- File types can be included or excluded in the collection settings. Use these settings to avoid indexing unnecessary file types.
- If you have multiple collections, schedule indexing so that no more than 2-3 collections are indexed at the same time.