AEM Collection

AEM Collections index pages and assets in the AEM content repository, treating each page or asset as a separate document.

Prerequisites

Before creating an AEM Collection, make sure:

  • AEM author instance is running and accessible
  • AEM publisher instance is running and accessible
  • Admin credentials (username and password) for the AEM author instance are available

Note: SearchBlox must have network access to the AEM instances so that it can reach and crawl the AEM site pages.
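The reachability requirement above can be checked before creating the collection. The following is a minimal sketch, not part of SearchBlox: the function name `is_reachable` is hypothetical, and it only confirms that an HTTP server answers at the given URL. It should be run from the machine that hosts SearchBlox.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at url, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (for example with 401 or 403), so it is reachable.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or a malformed URL.
        return False
```

Calling `is_reachable` once with the author instance URL and once with the publisher instance URL gives a quick yes/no answer for each prerequisite. A `True` result does not prove that the credentials are valid, only that the instance responds.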

Create an AEM Collection

Follow these steps to create a new AEM Collection:

  • Log in to the Admin Console
  • Go to the Collections tab
  • Click on "Create a New Collection" or the "+" icon
  • Select "AEM Collection" as the Collection Type
  • Enter a unique name for your collection (e.g., "intranet site")
  • Configure RAG settings (Enable for ChatBot and Hybrid RAG search)
  • Set Collection Access permissions (Private/Public)
  • Select the content language (if not English)
  • Click "Save" to create your collection

  • After creating the AEM collection, you will be taken to the Settings tab.

Settings Tab

  • Provide the Authentication fields.

| Field | Description |
| --- | --- |
| Author Instance URL | URL of the AEM Author instance to index documents from the content repository. |
| Publisher Instance URL | URL of the AEM Publisher instance. Documents are served from the publisher, while indexing happens from the author instance. |
| Username | AEM username with admin privileges. Not required if service security is disabled. |
| Password | Corresponding password for the AEM username. Not required if service security is disabled. |

Choose the settings for Generate Using LLM and Hybrid Search.

| Setting | Description |
| --- | --- |
| Title | Generates concise and relevant titles for the indexed documents using LLM. |
| Description | Generates descriptions for the indexed documents using LLM. |
| Topic | Generates relevant topics for the indexed documents using LLM, based on each document's content. |
| Auto Relevance | Enables or disables Hybrid Search for automatic relevance ranking. |

  • Click Save, then click Test Connection.
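If Test Connection fails, it can help to check the username and password against the author instance independently. The sketch below assumes the author instance accepts HTTP Basic authentication; it is not a description of what Test Connection does internally, and the helper names are hypothetical.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build an HTTP Basic Authorization header for the given credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def build_author_request(
    author_url: str, username: str, password: str
) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated request to the author instance."""
    return urllib.request.Request(
        author_url, headers=basic_auth_header(username, password)
    )
```

Passing the prepared request to `urllib.request.urlopen` sends it. A 401 response would suggest that the credentials are wrong rather than that the instance is unreachable.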

AEM Collection Paths to Index Specific Site Pages

  • AEM collection paths let you set Allow and Disallow paths for the crawler. To index only specific site pages or assets, add them as Allow paths. To access the path settings, click the collection name in the Collections list.

Allow/Disallow Paths

  • Allow/Disallow paths let the crawler include or exclude URLs.
  • They help manage a collection by excluding unwanted URLs.
  • All Allow and Disallow paths relate to the publisher instance URL.

| Field | Description |
| --- | --- |
| Allow Paths | https://xxx.xxx.xx.xx:xxxx/wk-events/, /aqua-collections/, /wellness-care/, https://xxx.xxx.xx.xx:xxxx/wk-events/standard.html, .* (allows the crawler to go to any external URL or domain) |
| Disallow Paths | .jsp, /cgi-bin/, /videos/, ?params |
| Allowed Formats | Select the document formats to be searchable in the collection. |
| Enable Content API | Allows crawling of document content with special characters included. |
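To illustrate how Allow and Disallow paths interact, here is a minimal sketch of a URL filter. It rests on assumptions the documentation does not confirm: each path is treated as a plain substring of the URL, `.*` is treated as "match everything", and a Disallow match wins over an Allow match. The function names are hypothetical, and the matching rules SearchBlox actually applies may differ.

```python
def pattern_matches(pattern: str, url: str) -> bool:
    """Assumed rule: '.*' matches any URL, anything else is a substring test."""
    if pattern == ".*":
        return True
    return pattern in url


def should_crawl(url: str, allow_paths: list[str], disallow_paths: list[str]) -> bool:
    """Return True if the URL matches an Allow path and no Disallow path."""
    if any(pattern_matches(p, url) for p in disallow_paths):
        return False  # assumed: Disallow takes precedence
    return any(pattern_matches(p, url) for p in allow_paths)


# Example values taken from the table above; example.com is a placeholder host.
allow = ["/wk-events/", "/aqua-collections/"]
disallow = [".jsp", "/cgi-bin/", "/videos/"]
print(should_crawl("https://example.com/wk-events/standard.html", allow, disallow))
print(should_crawl("https://example.com/wk-events/page.jsp", allow, disallow))
```

Under these assumptions the first URL is crawled and the second is skipped because it contains `.jsp`.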

Schedule and Index


An AEM collection should index only published pages. You can schedule indexing for a collection with one of the following frequency options:

  • Once
  • Hourly
  • Daily
  • Every 48 Hours
  • Every 96 Hours
  • Weekly
  • Monthly

The following operations can be performed in AEM collections:

| Activity | Description |
| --- | --- |
| Enable Scheduler for Indexing | Turn on to set the Start Date and Frequency for indexing. |
| Schedule | Set the indexing schedule for each collection based on the selected options. |
| View all Collection Schedules | Go to the Schedules section to see all collection schedules. |
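As a rough illustration of how the Start Date and Frequency combine, the sketch below maps each frequency option to an interval and computes the next run time. The names `INTERVALS` and `next_run` are hypothetical, and "Monthly" is approximated as 30 days because the documentation does not say how SearchBlox defines a month.

```python
from datetime import datetime, timedelta
from typing import Optional

# Frequency options from the list above, mapped to fixed intervals.
INTERVALS = {
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
    "Monthly": timedelta(days=30),  # assumption: a month is treated as 30 days
}


def next_run(last_run: datetime, frequency: str) -> Optional[datetime]:
    """Return the next scheduled run, or None when the job runs only once."""
    if frequency == "Once":
        return None
    return last_run + INTERVALS[frequency]
```

For example, a collection last indexed at midnight on 1 January with "Every 48 Hours" would next run at midnight on 3 January.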

Data Fields Tab

Using the Data Fields tab, you can create custom fields for search and view the default fields in non-encrypted collections. SearchBlox supports four types of Data Fields:

  1. Keyword
  2. Number
  3. Date
  4. Text

After configuring Data Fields, you must clear and re-index the collection for the changes to take effect.

To learn more about Data Fields, refer to Data Fields Tab.


Models

Embedding

  • Provider specifies the embedding provider used to generate vector representations of documents.
  • Model defines the embedding model used to convert document content into vectors for semantic search.

Reranker

  • Provider specifies the reranker provider used for improving search result relevance.
  • Model defines the reranker model used to re-score and reorder search results based on relevance.

LLM

  • Provider specifies the Large Language Model provider used for AI-powered features.

  • Model defines the LLM used for tasks such as document enrichment, summaries, and SmartFAQs.

  • These settings override global configurations and apply only to the current collection.
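The override behaviour can be pictured as a per-section merge in which collection values replace global ones. This is only a conceptual sketch: the dictionary layout, the function name `effective_models`, and the placeholder provider and model names are all hypothetical and do not reflect how SearchBlox stores its configuration.

```python
def effective_models(global_cfg: dict, collection_cfg: dict) -> dict:
    """Merge settings per section; collection values win over global ones."""
    merged = {}
    for section, defaults in global_cfg.items():
        merged[section] = {**defaults, **collection_cfg.get(section, {})}
    return merged


# Placeholder values for illustration only.
global_cfg = {
    "embedding": {"provider": "global-provider", "model": "global-embedding-model"},
    "reranker": {"provider": "global-provider", "model": "global-reranker-model"},
    "llm": {"provider": "global-provider", "model": "global-llm"},
}
collection_cfg = {"llm": {"model": "collection-llm"}}

print(effective_models(global_cfg, collection_cfg)["llm"])
```

Here only the LLM model is overridden for the collection; the embedding and reranker sections keep their global values, and the global configuration itself is left unchanged.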

Monitoring & Webhooks

The Monitoring & Webhooks tab provides settings for monitoring content changes and configuring webhook endpoints for automatic synchronization between AEM and SearchBlox.

Content Monitoring

Scheduled Monitoring

Enables automatic synchronization based on the configured schedule. When enabled, SearchBlox periodically checks the content source and performs synchronization according to the selected interval.

Delta Sync

Controls whether synchronization processes only changed content or performs a full synchronization.

  • When enabled, only new, updated, or deleted content is synchronized.
  • When disabled, every synchronization performs a full crawl of the content source.
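The difference between a delta and a full synchronization can be sketched by comparing two snapshots of the content source. The example assumes each snapshot maps a page path to some version marker, such as a last-modified timestamp or a content hash. How SearchBlox actually detects changes is not documented here, and `compute_delta` is a hypothetical name.

```python
def compute_delta(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as new, updated, or deleted between two snapshots."""
    new = sorted(p for p in current if p not in previous)
    updated = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    deleted = sorted(p for p in previous if p not in current)
    return {"new": new, "updated": updated, "deleted": deleted}


# Hypothetical snapshots: path -> version marker.
previous = {"/a.html": "v1", "/b.html": "v1", "/c.html": "v1"}
current = {"/a.html": "v1", "/b.html": "v2", "/d.html": "v1"}
print(compute_delta(previous, current))
```

With Delta Sync enabled, only the paths in the three lists would need processing. With it disabled, every path in the current snapshot would be crawled again.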

Sync Interval

Specifies how frequently SearchBlox checks for content updates when Scheduled Monitoring is enabled. The selected interval determines how often synchronization jobs are executed.

Sync History

The Sync History section displays information about previous synchronization jobs.

The table includes:

  • Type – The type of synchronization that was performed.
  • Status – The result of the synchronization job.
  • Started – The date and time when the synchronization began.
  • Duration – The time taken to complete the synchronization.

Webhooks

The Webhooks section provides endpoints and security settings used to receive content update notifications from AEM.

Standard AEM Webhook URL

Endpoint used by a classic or on-premise AEM replication agent to notify SearchBlox when content is published.

Adobe I/O Events Webhook URL

Endpoint used by Adobe Experience Manager as a Cloud Service to send Adobe I/O event notifications to SearchBlox. Incoming requests are validated using the Adobe I/O signature.

Standard Webhook Secret

Shared secret value used to validate requests sent to the Standard AEM Webhook URL. Leave the field blank to retain the current secret.

Adobe I/O Webhook Secret

Signing secret used to verify Adobe I/O event notifications. This value is used to validate the x-adobe-signature included in incoming requests. Leave the field blank to retain the current secret.
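Secret-based validation of a webhook request is commonly done with an HMAC. The sketch below assumes the signature is a base64-encoded HMAC-SHA256 of the raw request body, keyed with the signing secret. That scheme is an assumption, as are the function names, so confirm the exact algorithm and encoding against the Adobe I/O Events and SearchBlox documentation before relying on it.

```python
import base64
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """Compute a base64-encoded HMAC-SHA256 signature of the raw body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_signature(secret: str, body: bytes, received_signature: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    expected = sign(secret, body).encode("utf-8")
    return hmac.compare_digest(expected, received_signature.encode("utf-8"))
```

The comparison uses `hmac.compare_digest` rather than `==` so that the check does not leak timing information. The signature must be computed over the raw bytes of the request body, not over a re-serialized copy of the JSON.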

Save

Saves any changes made to the webhook configuration and monitoring settings.