Amazon S3 Collection

Creating an Amazon S3 Collection

Follow these steps to create a new Amazon S3 Collection:

  1. Log in to the Admin Console
  2. Go to the Collections tab
  3. Click "Create a New Collection" or the "+" icon
  4. Select "Amazon S3 Collection" as the Collection Type
  5. Enter a unique name (e.g., AmazonS3)
  6. Configure RAG settings (Enable for Hybrid RAG search, Disable for standard search)
  7. Set Collection Access (Private or Public)
  8. Configure Encryption as needed
  9. Choose the content language if it’s not English
  10. Click Save to create the collection

  • After creating the Amazon S3 collection, you will be taken to the Amazon S3 tab.

Amazon S3 Collection Settings

  • The Settings tab contains options for Amazon S3 and other tunable search parameters.

  • Amazon S3 settings must be configured explicitly for each collection.

  • Mandatory fields for an Amazon S3 collection:

    • Access Key
    • Secret Key
    • Bucket Name
  • SearchBlox also provides default settings like includes and excludes when a new collection is created.

  • The table below lists all available settings for the Amazon S3 collection.


Field | Description
Access Key | Amazon S3 access key from your security credentials. Mandatory field.
Secret Key | Amazon S3 secret key from your security credentials. Mandatory field.
Name | Optional name for the collection.
Bucket | Amazon S3 bucket to index. Mandatory field.
Path Prefix | Optional path prefix to index within the bucket, e.g., Work/. Must be an existing path with trailing / if specified.
Includes | File types to include, e.g., .pdf, .jpg.
Excludes | File types to exclude, e.g., *.zip.
Relevance - Remove Duplicates | Prevents indexing duplicate documents with the same content. Default is NO.
Relevance - Stemming | Treats inflected words as their root form (e.g., "running", "runs", "ran" → "run"). Default is YES.
Relevance - Spelling Suggestions | Creates a spelling index at the end of the indexing process when enabled.
Keyword-in-Context Display | Shows search results with snippets from content where the search term appears.
Enable Detailed Log Settings | When debug mode is on, logs detailed indexing activity in index.log, including URL status, timestamps, status codes, and time taken. Default is NO.
Enable Content API | Allows the crawler to index document content that contains special characters.
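
The rules for the mandatory fields and the Path Prefix can be checked before the values are entered in the Admin Console. The following is a minimal sketch, not part of SearchBlox: the S3CollectionSettings class and the validate_settings function are hypothetical names used only for illustration. It cannot confirm that the prefix actually exists in the bucket, since that requires a call to Amazon S3.

```python
from dataclasses import dataclass


@dataclass
class S3CollectionSettings:
    """Hypothetical container mirroring some fields of the settings table."""
    access_key: str
    secret_key: str
    bucket: str
    path_prefix: str = ""  # optional; must end with "/" when set


def validate_settings(settings):
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    mandatory = (
        ("Access Key", settings.access_key),
        ("Secret Key", settings.secret_key),
        ("Bucket", settings.bucket),
    )
    for label, value in mandatory:
        if not value.strip():
            problems.append(f"{label} is mandatory")
    if settings.path_prefix and not settings.path_prefix.endswith("/"):
        problems.append("Path Prefix must end with a trailing /")
    return problems


if __name__ == "__main__":
    candidate = S3CollectionSettings("EXAMPLEKEY", "EXAMPLESECRET", "my-bucket", "Work")
    for problem in validate_settings(candidate):
        print(problem)
```

Running the example reports the missing trailing slash on the Work prefix.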

📘

Generate Title, Description and Topics using SearchAI PrivateLLM and Enable Hybrid Search:

  • Choose and enable Generate Using LLM and Auto Relevance

  • Clicking Compare Keyword Search with Hybrid redirects you to the Comparison Plugin.

Settings | Description
Title | Generates concise and relevant titles for the indexed documents using LLM.
Description | Generates the description for indexed documents using LLM.
Topic | Generates relevant topics for indexed documents using LLM based on the document's content.
Auto Relevance | Enables or disables Hybrid Search for automatic relevance ranking.

📘

Additional Note

  • Do not write transaction logs to an S3 bucket that is being indexed. The log files will be indexed as well, which increases bandwidth usage.
  • If logging is required, exclude the log files by extension using the Excludes field in Collection Settings.
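
To see how an extension-based exclude keeps log files out of the index, the following sketch filters a list of object keys. It is an illustration only: should_index is a hypothetical helper, and it assumes that patterns are shell-style globs matched against the whole key, that excludes take priority over includes, and that an empty include list means "include everything". The pattern syntax SearchBlox actually applies may differ, so check the default includes and excludes of a new collection.

```python
from fnmatch import fnmatch


def should_index(key, includes, excludes):
    """Decide whether an object key passes the include/exclude patterns.

    Assumption: glob patterns, case-insensitive, excludes win over includes.
    """
    name = key.lower()
    if any(fnmatch(name, pattern.lower()) for pattern in excludes):
        return False
    if not includes:
        return True
    return any(fnmatch(name, pattern.lower()) for pattern in includes)


if __name__ == "__main__":
    keys = ["Work/report.pdf", "logs/2024/access.log", "Work/archive.zip"]
    for key in keys:
        decision = should_index(key, ["*.pdf", "*.jpg"], ["*.log", "*.zip"])
        print(key, "index" if decision else "skip")
```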

Schedule and Index

Set when and how often a collection should be indexed. SearchBlox supports these schedule options:

  • Once
  • Hourly
  • Daily
  • Every 48 Hours
  • Every 96 Hours
  • Weekly
  • Monthly
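
As a rough guide to when a collection would next be indexed under each option, the sketch below maps the schedule names to fixed intervals. The interval values are assumptions (in particular, Monthly is approximated as 30 days), and next_run is a hypothetical helper, not a SearchBlox function.

```python
from datetime import datetime, timedelta

# Assumed interval per schedule option; "Monthly" is approximated as 30 days.
INTERVALS = {
    "Once": None,
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
    "Monthly": timedelta(days=30),
}


def next_run(start, frequency, now):
    """Return the first scheduled run at or after `now`, or None if none remains."""
    interval = INTERVALS[frequency]
    if now <= start:
        return start
    if interval is None:
        return None  # a "Once" schedule whose start date has already passed
    elapsed = now - start
    periods = -(-elapsed // interval)  # ceiling division on timedeltas
    return start + periods * interval


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    print(next_run(start, "Every 48 Hours", datetime(2024, 1, 2, 10, 0)))
```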

The following operations can be performed on an Amazon S3 collection:

Activity | Description
Enable Scheduler for Indexing | Turn this on to set the start date and how often indexing should run.
Save | Saves your scheduling settings for the collection.
View all Collection Schedules | Opens the Schedules page where you can see all scheduled collections.

Models

Embedding

  • Provider specifies the embedding provider used to generate vector representations of documents.
  • Model defines the embedding model used to convert document content into vectors for semantic search.

Reranker

  • Provider specifies the reranker provider used for improving search result relevance.
  • Model defines the reranker model used to re-score and reorder search results based on relevance.

LLM

  • Provider specifies the Large Language Model provider used for AI-powered features.

  • Model defines the LLM used for tasks such as document enrichment, summaries, and SmartFAQs.

  • These settings override global configurations and apply only to the current collection.
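
The override behavior can be pictured as a merge of two configurations. The sketch below is only a mental model under stated assumptions: effective_models is a hypothetical function, the provider and model names are placeholders, and it assumes a collection value replaces the global value field by field when it is set. How SearchBlox stores and resolves these settings internally is not described here.

```python
def effective_models(global_cfg, collection_cfg):
    """Merge per-collection model settings over the global ones.

    Both arguments map a section ("embedding", "reranker", "llm") to a dict
    with "provider" and "model" keys. A collection value replaces the global
    value only when it is set (not missing, None, or empty).
    """
    merged = {}
    for section, defaults in global_cfg.items():
        overrides = collection_cfg.get(section, {})
        merged[section] = {
            key: overrides.get(key) or default
            for key, default in defaults.items()
        }
    return merged


if __name__ == "__main__":
    # Placeholder names, not real providers or models.
    global_cfg = {
        "embedding": {"provider": "provider-a", "model": "embed-small"},
        "llm": {"provider": "provider-a", "model": "llm-base"},
    }
    collection_cfg = {"llm": {"model": "llm-large"}}
    print(effective_models(global_cfg, collection_cfg))
```

Other collections keep using the global values, because the merge result is computed per collection and the global configuration is never modified.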

👍

Best Practices

  • Always provide Access Key, Secret Key, Bucket Name, and Update Rate in S3 collection settings.
  • Use the include/exclude options to avoid indexing unnecessary file types.
  • For multiple collections, schedule them so that only 2–3 collections index at the same time.
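
One way to plan the last recommendation is to group collections into time slots before entering the start dates in the scheduler. The sketch below is a planning aid under stated assumptions: stagger_starts is a hypothetical helper, and it assumes every indexing run finishes within the slot length you supply.

```python
from datetime import datetime, timedelta


def stagger_starts(collections, first_start, slot_length, max_concurrent=3):
    """Assign start times so at most `max_concurrent` collections share a slot.

    Assumes each indexing run completes within `slot_length`; that estimate
    comes from your own observations, not from this sketch.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    plan = {}
    for position, name in enumerate(collections):
        slot = position // max_concurrent
        plan[name] = first_start + slot * slot_length
    return plan


if __name__ == "__main__":
    names = ["AmazonS3", "Docs", "Website", "Archive", "Intranet"]
    plan = stagger_starts(names, datetime(2024, 1, 1, 1, 0), timedelta(hours=2))
    for name, start in plan.items():
        print(name, start)
```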