Amazon S3 Collection
Amazon S3 Collections allow SearchBlox to index files stored in Amazon S3 buckets, making your cloud-stored documents fully searchable. This is useful for organisations that store large volumes of files — such as PDFs, images, or documents — in S3 and need to make that content searchable without moving it.
Creating an Amazon S3 Collection
Follow these steps to create a new Amazon S3 Collection:
- Log in to the Admin Console
- Go to the Collections tab
- Click "Create a New Collection" or the "+" icon
- Select "Amazon S3 Collection" as the Collection Type
- Enter a unique name (e.g., AmazonS3)
- Configure RAG settings (Enable for Hybrid RAG search, Disable for standard search)
- Set Collection Access (Private or Public)
- Configure Encryption as needed
- Choose the content language if it’s not English
- Click Save to create the collection

After creating the Amazon S3 collection, you will be taken to the Amazon S3 tab.
Amazon S3 Collection Settings
- The Settings tab contains options for Amazon S3 and other tunable search parameters.
- Amazon S3 settings must be configured explicitly for each collection.
- Mandatory fields for an Amazon S3 collection:
  - Access Key
  - Secret Key
  - Bucket Name
- SearchBlox also provides default settings like includes and excludes when a new collection is created.
- The table below lists all available settings for the Amazon S3 collection.


| Field | Description |
|---|---|
| Access Key | Amazon S3 access key from your security credentials. Mandatory field. |
| Secret Key | Amazon S3 secret key from your security credentials. Mandatory field. |
| Name | Optional name for the collection. |
| Bucket | Amazon S3 bucket to index. Mandatory field. |
| Path Prefix | Optional path prefix to index within the bucket, e.g., Work/. Must be an existing path with trailing / if specified. |
| Includes | File types to include, e.g., .pdf, .jpg. |
| Excludes | File types to exclude, e.g., *.zip. |
| Relevance - Remove Duplicates | Prevents indexing duplicate documents with the same content. Default is NO. |
| Relevance - Stemming | Treats inflected words as their root form (e.g., "running", "runs", "ran" → "run"). Default is YES. |
| Relevance - Spelling Suggestions | Creates a spelling index at the end of the indexing process when enabled. |
| Keyword-in-Context Display | Shows search results with snippets from content where the search term appears. |
| Enable Detailed Log Settings | When debug mode is on, logs detailed indexing activity in index.log, including URL status, timestamps, status codes, and time taken. Default is NO. |
| Enable Content API | Allows the crawler to index document content that contains special characters. |
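
The mandatory fields and the Path Prefix rule from the table above can be checked before the values are entered in the Admin Console. The following is a minimal sketch, not part of SearchBlox: the `S3CollectionSettings` class and the `validate` function are hypothetical names introduced for illustration. It cannot confirm that the prefix actually exists in the bucket.

```python
from dataclasses import dataclass


@dataclass
class S3CollectionSettings:
    """Hypothetical container mirroring a few fields from the settings table."""
    access_key: str
    secret_key: str
    bucket: str
    path_prefix: str = ""  # optional, e.g. "Work/"


def validate(settings: S3CollectionSettings) -> list[str]:
    """Return a list of problems; an empty list means the basic rules are met."""
    errors = []
    mandatory = {
        "Access Key": settings.access_key,
        "Secret Key": settings.secret_key,
        "Bucket": settings.bucket,
    }
    for label, value in mandatory.items():
        if not value.strip():
            errors.append(f"{label} is mandatory")
    # The table requires a trailing "/" whenever a prefix is given.
    if settings.path_prefix and not settings.path_prefix.endswith("/"):
        errors.append("Path Prefix must end with /")
    return errors


# Placeholder credentials; "Work" is missing its trailing slash.
print(validate(S3CollectionSettings("AKIA-EXAMPLE", "secret", "my-bucket", "Work")))
```
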

Generate Title, Description and Topics using SearchAI PrivateLLM and Enable Hybrid Search:
- Choose and enable Generate Using LLM and Auto Relevance.
- Clicking Compare Keyword Search with Hybrid redirects you to the Comparison Plugin.

| Settings | Description |
|---|---|
| Title | Generates concise and relevant titles for the indexed documents using the LLM. |
| Description | Generates the description for indexed documents using the LLM. |
| Topic | Generates relevant topics for indexed documents using the LLM, based on the document's content. |
| Auto Relevance | Enables or disables Hybrid Search for automatic relevance ranking. |

Additional Note
- Do not write transaction logs to the S3 bucket being indexed; the log files would also be indexed, increasing bandwidth usage.
- If logging to the bucket is required, exclude the log files by extension in the Collection Settings.
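
To see how an extension-based exclude keeps log files out of the index, the sketch below filters S3 object keys with glob patterns in the style of the `*.zip` example from the settings table. It is illustrative only: `should_index` is a hypothetical helper, and the choice to let excludes win over includes is an assumption, not a statement about how SearchBlox evaluates its filters.

```python
from fnmatch import fnmatchcase


def should_index(key: str, includes: list[str], excludes: list[str]) -> bool:
    """Decide whether an S3 object key passes the include/exclude patterns."""
    name = key.lower()  # compare case-insensitively so "REPORT.PDF" matches "*.pdf"
    # Assumption: an exclude match always wins over an include match.
    if any(fnmatchcase(name, pattern.lower()) for pattern in excludes):
        return False
    return any(fnmatchcase(name, pattern.lower()) for pattern in includes)


keys = ["Work/report.PDF", "logs/2024/access.log", "Work/archive.zip"]
includes = ["*.pdf", "*.log"]
excludes = ["*.zip", "*.log"]
print([k for k in keys if should_index(k, includes, excludes)])
```
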
Synonyms
Synonyms help the search show relevant documents even when the exact search word is not used.
For example, if someone searches for “global,” the results can also include documents that use “world” or “international.”
SearchBlox also provides an option to load synonyms from existing documents.
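
Conceptually, synonym matching works like expanding the query with related terms before searching. The sketch below shows the idea using the "global" example; the `SYNONYMS` mapping and `expand_query` function are hypothetical and do not reflect how SearchBlox stores or applies synonyms internally.

```python
# Hypothetical synonym map based on the example above.
SYNONYMS = {"global": {"world", "international"}}


def expand_query(query: str, synonyms: dict[str, set[str]]) -> list[str]:
    """Return the query terms, each followed by its synonyms in sorted order."""
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(sorted(synonyms.get(word, ())))
    return terms


print(expand_query("global markets", SYNONYMS))
# → ['global', 'international', 'world', 'markets']
```
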

Schedule and Index
Set when and how often a collection should be indexed. SearchBlox supports these schedule options:
- Once
- Hourly
- Daily
- Every 48 Hours
- Every 96 Hours
- Weekly
- Monthly
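
As a rough way to reason about when a collection will next be indexed, the schedule options above can be mapped to intervals. This is a sketch under stated assumptions: `next_run` is a hypothetical helper, "Once" is treated as having no follow-up run, and "Monthly" is assumed to mean the same day of the next calendar month (clamped to the month's last day). SearchBlox may interpret these options differently.

```python
import calendar
from datetime import datetime, timedelta

# Fixed-length intervals for the schedule options that have one.
INTERVALS = {
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
}


def next_run(last_run: datetime, frequency: str):
    """Return the next indexing time, or None when the schedule runs only once."""
    if frequency == "Once":
        return None
    if frequency == "Monthly":
        # Assumption: same day next month, clamped to that month's last day.
        year = last_run.year + last_run.month // 12
        month = last_run.month % 12 + 1
        day = min(last_run.day, calendar.monthrange(year, month)[1])
        return last_run.replace(year=year, month=month, day=day)
    return last_run + INTERVALS[frequency]


print(next_run(datetime(2024, 1, 31, 2, 0), "Monthly"))
print(next_run(datetime(2024, 1, 1, 2, 0), "Every 48 Hours"))
```
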

The following operations can be performed in an Amazon S3 collection:
| Activity | Description |
|---|---|
| Enable Scheduler for Indexing | Turn this on to set the start date and how often indexing should run. |
| Save | Saves your scheduling settings for the collection. |
| View all Collection Schedules | Opens the Schedules page where you can see all scheduled collections. |
Manage Documents
Using the Manage Documents tab, you can perform the following operations:
- Add/Update
- Filter
- View Content
- View Metadata
- Refresh
- Delete
To add a document, click the + icon.
Enter the document URL and click Add/Update.
Once the document is added or updated, the document URL will be displayed on the screen, and you will be able to perform the operations listed above.

Data Fields Tab
Using the Data Fields tab, you can create custom fields for search and view the default fields in non-encrypted collections. SearchBlox supports four types of Data Fields:
| Type | Description |
|---|---|
| Keyword | Used for alphanumeric values such as IDs, tags, codes, or other exact-match fields. |
| Number | Used for numeric values such as prices, quantities, ratings, or counts. |
| Date | Used for date values that can be searched, sorted, and filtered. |
| Text | Used for full-text search within custom field content. |
After configuring Data Fields, you must clear and re-index the collection for changes to take effect.

Prompts
When LLM/RAG is enabled, you can edit AI-based prompts for Title, Description, Topic, Image Description, and Smart FAQs.
You can customize these prompts anytime, and use Restore Default to reset them back to the original SearchBlox settings.


Models
The Models section lets you override the global embedding, reranking, and LLM settings for this specific collection. Changes made here apply only to the current collection and do not affect other collections.
Embedding
- Provider specifies the embedding provider used to generate vector representations of documents.
- Model defines the embedding model used to convert document content into vectors for semantic search.
Reranker
- Provider specifies the reranker provider used for improving search result relevance.
- Model defines the reranker model used to re-score and reorder search results based on relevance.
LLM
- Provider specifies the Large Language Model provider used for AI-powered features.
- Model defines the LLM used for tasks such as document enrichment, summaries, and SmartFAQs.

These settings override global configurations and apply only to the current collection.
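
The override behaviour described in this section can be pictured as a per-collection layer on top of the global defaults. The sketch below is illustrative only: the dictionary keys, the placeholder provider and model names, and the `effective_models` function are all hypothetical and do not correspond to a SearchBlox configuration file or API.

```python
# Placeholder global defaults; real provider and model names will differ.
GLOBAL_MODELS = {
    "embedding": {"provider": "global-embed-provider", "model": "global-embed-model"},
    "reranker": {"provider": "global-rerank-provider", "model": "global-rerank-model"},
    "llm": {"provider": "global-llm-provider", "model": "global-llm-model"},
}


def effective_models(global_models: dict, overrides: dict) -> dict:
    """Merge collection-level overrides over the global settings, section by section."""
    resolved = {}
    for section, defaults in global_models.items():
        # Values set on the collection win; anything unset falls back to global.
        resolved[section] = {**defaults, **overrides.get(section, {})}
    return resolved


# Only the LLM model is overridden for this collection.
collection_overrides = {"llm": {"model": "collection-llm-model"}}
print(effective_models(GLOBAL_MODELS, collection_overrides)["llm"])
```

Because the merge returns a new dictionary, the global settings stay unchanged, which mirrors the statement that other collections are not affected.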
Best Practices
- Always provide Access Key, Secret Key, Bucket Name, and Update Rate in S3 collection settings.
- Use the include/exclude options to avoid indexing unnecessary file types.
- For multiple collections, schedule them so that only 2–3 collections index at the same time.
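
One way to follow the last recommendation is to assign start times in batches so that no more than a fixed number of collections begin indexing in the same time slot. The sketch below is a planning aid, not a SearchBlox feature: `stagger_start_times` is a hypothetical helper, and it assumes each indexing run finishes within one slot (one hour by default).

```python
from datetime import datetime, timedelta


def stagger_start_times(collections, first_start, max_concurrent=3,
                        slot=timedelta(hours=1)):
    """Assign start times so at most max_concurrent collections share a slot."""
    schedule = {}
    for position, name in enumerate(collections):
        # Every group of max_concurrent collections moves to the next slot.
        schedule[name] = first_start + (position // max_concurrent) * slot
    return schedule


names = ["AmazonS3", "Intranet", "Wiki", "Contracts", "Archive"]
plan = stagger_start_times(names, datetime(2024, 1, 1, 1, 0), max_concurrent=2)
for name, start in plan.items():
    print(f"{name}: {start:%H:%M}")
```

The resulting times can then be entered as the start date for each collection's scheduler.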


