Amazon S3 Collection
Creating an Amazon S3 Collection
Follow these steps to create a new Amazon S3 Collection:
- Log in to the Admin Console
- Go to the Collections tab
- Click "Create a New Collection" or the "+" icon
- Select "Amazon S3 Collection" as the Collection Type
- Enter a unique name (e.g., AmazonS3)
- Configure RAG settings (Enable for Hybrid RAG search, Disable for standard search)
- Set Collection Access (Private or Public)
- Configure Encryption as needed
- Choose the content language if it’s not English
- Click Save to create the collection

After you create the Amazon S3 collection, you are taken to the Amazon S3 tab.
Amazon S3 Collection Settings
- The Settings tab contains options for Amazon S3 and other tunable search parameters.
- Amazon S3 settings must be configured explicitly for each collection.
- Mandatory fields for an Amazon S3 collection:
  - Access Key
  - Secret Key
  - Bucket Name
- SearchBlox also provides default settings like includes and excludes when a new collection is created.
- The table below lists all available settings for the Amazon S3 collection.


| Field | Description |
|---|---|
| Access Key | Amazon S3 access key from your security credentials. Mandatory field. |
| Secret Key | Amazon S3 secret key from your security credentials. Mandatory field. |
| Name | Optional name for the collection. |
| Bucket | Amazon S3 bucket to index. Mandatory field. |
| Path Prefix | Optional path prefix to index within the bucket, e.g., Work/. Must be an existing path with trailing / if specified. |
| Includes | File types to include, e.g., .pdf, .jpg. |
| Excludes | File types to exclude, e.g., *.zip. |
| Relevance - Remove Duplicates | Prevents indexing duplicate documents with the same content. Default is NO. |
| Relevance - Stemming | Treats inflected words as their root form (e.g., "running", "runs", "ran" → "run"). Default is YES. |
| Relevance - Spelling Suggestions | Creates a spelling index at the end of the indexing process when enabled. |
| Keyword-in-Context Display | Shows search results with snippets from content where the search term appears. |
| Enable Detailed Log Settings | When debug mode is on, logs detailed indexing activity in index.log, including URL status, timestamps, status codes, and time taken. Default is NO. |
| Enable Content API | Allows the crawler to index document content that contains special characters. |
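
The mandatory fields and the Path Prefix rule from the table above can be checked before you enter them in the Admin Console. The following is a minimal sketch, not part of SearchBlox: the `S3CollectionSettings` class and `validate` function are hypothetical names introduced for illustration. It only checks that the three mandatory fields are non-empty and that a Path Prefix, if given, ends with `/`. It cannot confirm that the path actually exists in the bucket.

```python
from dataclasses import dataclass


@dataclass
class S3CollectionSettings:
    """Hypothetical container mirroring a few fields from the settings table."""
    access_key: str
    secret_key: str
    bucket: str
    path_prefix: str = ""  # optional, e.g. "Work/"


def validate(settings):
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    mandatory = (
        ("Access Key", settings.access_key),
        ("Secret Key", settings.secret_key),
        ("Bucket", settings.bucket),
    )
    for label, value in mandatory:
        if not value.strip():
            problems.append(f"{label} is mandatory")
    # The prefix is optional, but must carry a trailing slash when present.
    if settings.path_prefix and not settings.path_prefix.endswith("/"):
        problems.append("Path Prefix must end with a trailing /")
    return problems


if __name__ == "__main__":
    print(validate(S3CollectionSettings("AK", "SK", "my-bucket", "Work/")))
```
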

Generate Title, Description and Topics using SearchAI PrivateLLM and Enable Hybrid Search:
- Choose and enable Generate Using LLM and Auto Relevance.
- Clicking Compare Keyword Search with Hybrid redirects to the Comparison Plugin.

| Settings | Description |
|---|---|
| Title | Generates concise and relevant titles for the indexed documents using LLM. |
| Description | Generates the description for indexed documents using LLM. |
| Topic | Generates relevant topics for indexed documents using LLM based on the document's content. |
| Auto Relevance | Enables or disables Hybrid Search for automatic relevance ranking. |

Additional Note
- Avoid writing transaction logs to the S3 bucket being indexed. The log files would also be indexed, which increases bandwidth usage.
- If logging is needed, exclude the log files by extension in the Excludes field of Collection Settings.
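
To see the effect of excluding log files by extension, here is a small sketch. It is an illustration only: `filter_keys` is a hypothetical helper, the object keys are made up, and the way SearchBlox matches its exclude patterns internally is not described here.

```python
import posixpath


def filter_keys(keys, excluded_extensions):
    """Drop object keys whose file extension is in the excluded set.

    Matching is case-insensitive; this is an assumption made for the sketch.
    """
    excluded = {ext.lower() for ext in excluded_extensions}
    return [
        key for key in keys
        if posixpath.splitext(key)[1].lower() not in excluded
    ]


if __name__ == "__main__":
    sample_keys = [
        "Work/report.pdf",
        "logs/2024-01-01.log",
        "Work/photo.JPG",
        "access.LOG",
    ]
    # Only the non-log objects remain candidates for indexing.
    print(filter_keys(sample_keys, [".log"]))
```
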
Schedule and Index
Set when and how often a collection should be indexed. SearchBlox supports these schedule options:
- Once
- Hourly
- Daily
- Every 48 Hours
- Every 96 Hours
- Weekly
- Monthly

The following operations can be performed on an Amazon S3 collection:
| Activity | Description |
|---|---|
| Enable Scheduler for Indexing | Turn this on to set the start date and how often indexing should run. |
| Save | Saves your scheduling settings for the collection. |
| View all Collection Schedules | Opens the Schedules page where you can see all scheduled collections. |
Models
Embedding
- Provider specifies the embedding provider used to generate vector representations of documents.
- Model defines the embedding model used to convert document content into vectors for semantic search.
Reranker
- Provider specifies the reranker provider used for improving search result relevance.
- Model defines the reranker model used to re-score and reorder search results based on relevance.
LLM
- Provider specifies the Large Language Model provider used for AI-powered features.
- Model defines the LLM used for tasks such as document enrichment, summaries, and SmartFAQs.
- These settings override global configurations and apply only to the current collection.
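
The override behavior can be pictured as a per-section merge in which collection-level values win over global ones. The sketch below is a conceptual illustration only. The dictionary layout, the `effective_models` function, and every provider and model name are placeholders, not SearchBlox's actual configuration format.

```python
# Placeholder global configuration; the names are invented for this sketch.
GLOBAL_MODELS = {
    "embedding": {"provider": "global-embed-provider", "model": "global-embed-model"},
    "reranker": {"provider": "global-rerank-provider", "model": "global-rerank-model"},
    "llm": {"provider": "global-llm-provider", "model": "global-llm-model"},
}


def effective_models(global_cfg, collection_cfg):
    """Merge settings so that collection-level values override global ones."""
    merged = {}
    for section, defaults in global_cfg.items():
        # Later keys win, so anything set on the collection replaces the default.
        merged[section] = {**defaults, **collection_cfg.get(section, {})}
    return merged


if __name__ == "__main__":
    overrides = {"llm": {"model": "collection-llm-model"}}
    print(effective_models(GLOBAL_MODELS, overrides)["llm"])
```

Because the merge builds new dictionaries, the global configuration is left unchanged, which matches the statement that overrides apply only to the current collection.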
Best Practices
- Always provide Access Key, Secret Key, Bucket Name, and Update Rate in S3 collection settings.
- Use the include/exclude options to avoid indexing unnecessary file types.
- For multiple collections, schedule them so that only 2–3 collections index at the same time.
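
One way to plan start times so that no more than a few collections index together is to assign them to consecutive time slots. The sketch below is a planning aid under stated assumptions, not a SearchBlox feature: `stagger` is a hypothetical helper, the collection names are invented, and it assumes each batch finishes within one slot. The resulting times would still be entered manually in each collection's scheduler.

```python
from datetime import datetime, timedelta


def stagger(collections, first_start, slot_length, max_concurrent=3):
    """Assign start times so at most max_concurrent collections share a slot."""
    schedule = {}
    for position, name in enumerate(collections):
        slot = position // max_concurrent  # 0 for the first batch, 1 for the next
        schedule[name] = first_start + slot * slot_length
    return schedule


if __name__ == "__main__":
    plan = stagger(
        ["AmazonS3", "Docs", "Intranet", "Archive", "Media"],
        first_start=datetime(2024, 1, 1, 1, 0),
        slot_length=timedelta(hours=2),
        max_concurrent=2,
    )
    for name, start in plan.items():
        print(name, start.isoformat())
```
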


