AEM Collection

AEM Collections index pages and assets in the AEM content repository, treating each page or asset as a separate document.

Prerequisites

Before creating an AEM Collection, make sure:

  • AEM author instance is running and accessible
  • AEM publisher instance is running and accessible
  • Admin credentials (username and password) for the AEM author instance are available

Note:

SearchBlox must have network access to the AEM instances, and the AEM site pages must be reachable so that they can be crawled.

Create an AEM Collection

Follow these steps to create a new AEM Collection:

  • Log in to the Admin Console
  • Go to the Collections tab
  • Click on "Create a New Collection" or the "+" icon
  • Select "AEM Collection" as the Collection Type
  • Enter a unique name for your collection (e.g., "intranet site")
  • Configure RAG settings (Enable for ChatBot and Hybrid RAG search)
  • Set Collection Access permissions (Private/Public)
  • Select the content language (if not English)
  • Click "Save" to create your collection

  • After creating the AEM collection, you will be taken to the Settings tab.

AEM Collection – Authentication Settings

The Authentication section in an AEM Collection allows you to configure how SearchBlox connects to your Adobe Experience Manager (AEM) instance to index content.

Fields Overview

Author Instance URL
The URL of the AEM Author instance from which documents are indexed. This is the source of content (e.g., http://localhost:4502). The URL must begin with http:// or https://.

Publisher Instance URL
The publish domain of the AEM instance. When configured, indexed documents are served from the publish domain while indexing is still performed from the author instance. The URL must begin with http:// or https://.
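
The URL rule for both fields can be checked before the values are entered in the Admin Console. The following is a minimal sketch and not part of SearchBlox; the function name is_valid_instance_url is hypothetical, and the check only covers the scheme and the presence of a host.

```python
from urllib.parse import urlparse


def is_valid_instance_url(url: str) -> bool:
    """Return True if the URL uses http:// or https:// and names a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_valid_instance_url("http://localhost:4502"))  # True
print(is_valid_instance_url("localhost:4502"))         # False
```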

Authentication Type
Defines how SearchBlox authenticates with AEM. Three options are available:

  • Basic – Username and password authentication.
  • IMS S2S – OAuth-based Server-to-Server authentication using Adobe IMS client credentials.
  • IMS JWT – Service account authentication using a JSON Web Token (JWT).

Index Mode
Set to Auto by default. In Auto mode, the crawler discovers published pages first, then falls back to all pages. This mode is recommended for most setups.

Authentication Type: Basic

When Basic is selected as the Authentication Type, the following credentials are required to connect SearchBlox to the AEM instance:

Username
The AEM username used to authenticate. If service security is disabled on the AEM instance, this field can be left blank. When provided, it must be at least 3 characters long.

Password
The password for the AEM user account. If service security is disabled, this field is not required. When provided, it must be at least 3 characters long.

NOTE: Basic authentication is straightforward and suitable for development or internal environments where OAuth-based credentials are not configured.
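
For reference, HTTP Basic authentication sends the username and password as a Base64-encoded "username:password" pair in the Authorization header. The sketch below shows what such a header looks like. It is an illustration of the standard scheme, not SearchBlox's internal code, and the function name basic_auth_header and the credentials are placeholders.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP Basic authentication."""
    pair = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(pair).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# "admin"/"admin" are placeholder credentials for a local development instance.
headers = basic_auth_header("admin", "admin")
print(headers["Authorization"])
```

Base64 is an encoding, not encryption, so an https:// Author Instance URL is preferable whenever Basic authentication is used outside a local environment.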

Generate Using LLM

This section allows SearchBlox to automatically enrich indexed AEM documents using a Large Language Model (LLM) at the time of indexing. The following fields can be toggled on or off:

Field | Description
Title | Automatically generates concise and relevant titles for documents during indexing.
Description | Automatically generates relevant descriptions for documents during indexing.
Topics | Automatically generates relevant topics/tags for documents during indexing.

NOTE: All three toggles are set to No by default. Enabling them improves content discoverability by ensuring documents have meaningful, AI-generated metadata even when the original AEM content lacks it.


Authentication Type: IMS S2S (Server-to-Server)

IMS S2S is Adobe's OAuth 2.0 Server-to-Server credential method, used to authenticate machine-to-machine integrations without user involvement. When selected, the following fields are required:

Field | Description
Client ID | The Adobe Developer Console client ID from the Service Account (Server-to-Server) credential.
Client Secret | The client secret from Adobe Developer Console, stored encrypted at rest.
Scopes | Comma-separated IMS permission scopes (for example, AdobeID, openid, aem.folders, aem.assets.author). Leave blank to use the default scopes.
Organization ID | The Adobe Organization ID from the Developer Console project (format: XXXXXXXX@AdobeOrg).

Note: IMS S2S is the recommended authentication method for modern AEM integrations. It replaces the older JWT-based service account approach and supports secure, long-lived OAuth credentials managed through the Adobe Developer Console.
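
To show how these fields fit together, the sketch below prepares (but does not send) an OAuth client_credentials token request. It is an illustration under assumptions: the endpoint URL is the Adobe IMS token endpoint as commonly documented for this flow, the function name build_token_request is hypothetical, and the credential values are placeholders. How SearchBlox performs the exchange internally is not described here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed Adobe IMS token endpoint for OAuth Server-to-Server credentials.
IMS_TOKEN_URL = "https://ims-na1.adobelogin.com/ims/token/v3"


def build_token_request(client_id: str, client_secret: str, scopes: list[str]) -> Request:
    """Prepare (but do not send) a client_credentials token request."""
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": ",".join(scopes),  # scopes are sent comma-separated
    }).encode("ascii")
    return Request(
        IMS_TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder values; real ones come from the Adobe Developer Console project.
req = build_token_request(
    "my-client-id",
    "my-client-secret",
    ["AdobeID", "openid", "aem.folders", "aem.assets.author"],
)
print(req.get_method(), req.full_url)
```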

Authentication Type: IMS JWT

When IMS JWT is selected, SearchBlox authenticates with AEM using a JSON Web Token (JWT) via Adobe's Identity Management System (IMS). This is a service account-based approach where a signed JWT is exchanged for an access token. The following fields are required:

Field | Description
Client ID | The Adobe Developer Console client ID from the Service Account (JWT) credential.
Client Secret | The client secret from Adobe Developer Console, stored encrypted at rest.
Organization ID | The Adobe Organization ID from the Developer Console project (format: XXXXXXXX@AdobeOrg).
Technical Account ID | The technical account ID associated with the JWT credential in the Developer Console (format: XXXXXXXX@techacct.adobe.com).
Metascopes | Comma-separated JWT metascopes that define the permissions granted to the integration (e.g., ent_aem_cloud_api). Leave blank to use the default scope.
Private Key | The RSA private key in PEM format, including the full -----BEGIN RSA PRIVATE KEY----- and -----END RSA PRIVATE KEY----- markers. Stored encrypted at rest.

NOTE: IMS JWT is suited for legacy Adobe service account integrations. Adobe has deprecated this method in favor of IMS S2S (OAuth Server-to-Server), so new integrations are encouraged to use IMS S2S where possible.
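
The sketch below shows how the fields above typically map onto the claims of the JWT that is signed and exchanged for an access token. It reflects Adobe's service account (JWT) flow as generally documented and is an assumption as far as SearchBlox internals are concerned. Signing with the private key (RS256) and the exchange call are omitted, the function name build_jwt_claims is hypothetical, and all values are placeholders.

```python
import time


def build_jwt_claims(client_id: str, org_id: str, technical_account_id: str,
                     metascopes: list[str], ttl_seconds: int = 300,
                     ims_host: str = "https://ims-na1.adobelogin.com") -> dict:
    """Assemble the claims that would be signed with the RSA private key."""
    claims = {
        "exp": int(time.time()) + ttl_seconds,  # short-lived expiry
        "iss": org_id,                          # Organization ID
        "sub": technical_account_id,            # Technical Account ID
        "aud": f"{ims_host}/c/{client_id}",     # audience is tied to the Client ID
    }
    # Each metascope becomes a boolean claim keyed by its IMS URL.
    for scope in metascopes:
        claims[f"{ims_host}/s/{scope}"] = True
    return claims


# Placeholder identifiers, not real credentials.
claims = build_jwt_claims("my-client-id", "my-org-id", "my-technical-account-id",
                          ["ent_aem_cloud_api"])
print(sorted(claims))
```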


Settings | Description
Title | Generates concise and relevant titles for the indexed documents using LLM.
Description | Generates descriptions for indexed documents using LLM.
Topic | Generates relevant topics for indexed documents using LLM, based on each document's content.
Auto Relevance | Enables or disables Hybrid Search for automatic relevance ranking.

  • Click Save, then click Test Connection.

AEM Collection Paths to Index Specific Site Pages

  • AEM collection paths let you set Allow/Disallow paths for the crawler. To index only specific site pages or assets, add their paths to Allow Paths. To access the path settings, click the collection name in the Collections list.

Allow/Disallow Paths

  • Allow/Disallow paths let the crawler include or exclude URLs.
  • They help manage a collection by excluding unwanted URLs.
  • All Allow and Disallow paths relate to the publisher instance URL.

Field | Description
Allow Paths | https://xxx.xxx.xx.xx:xxxx/wk-events/
 | /aqua-collections/
 | /wellness-care/
 | https://xxx.xxx.xx.xx:xxxx/wk-events/standard.html
 | .* (Allows the crawler to go to any external URL or domain.)
Disallow Paths | .jsp
 | /cgi-bin/
 | /videos/
 | ?params
Allowed Formats | Select the document formats to be searchable in the collection.
Enable Content API | Allows crawling of document content with special characters included.
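
The effect of Allow and Disallow paths can be pictured with a small filter. The sketch below is an illustration only: the exact matching rules used by the crawler are not specified here, so it assumes plain substring matching, that a Disallow match always wins, and that ".*" in Allow Paths matches every URL. The function name is_url_allowed and the host publisher.example.com are placeholders.

```python
def is_url_allowed(url: str, allow: list[str], disallow: list[str]) -> bool:
    """Assumed semantics: disallow wins; ".*" in allow matches every URL."""
    if any(pattern in url for pattern in disallow):
        return False
    return any(pattern == ".*" or pattern in url for pattern in allow)


allow_paths = ["/wk-events/", "/aqua-collections/", "/wellness-care/"]
disallow_paths = [".jsp", "/cgi-bin/", "/videos/"]

print(is_url_allowed("https://publisher.example.com/wk-events/standard.html",
                     allow_paths, disallow_paths))  # True
print(is_url_allowed("https://publisher.example.com/wk-events/videos/intro.html",
                     allow_paths, disallow_paths))  # False
```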

Synonyms

Synonyms help the search show relevant documents even when the exact search word is not used.
For example, if someone searches for “global,” the results can also include documents that use “world” or “international.”
Synonyms can also be loaded from the existing documents.
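
Conceptually, synonyms expand the terms of a query before matching. The sketch below illustrates the idea with the "global" example. It is not how SearchBlox implements synonyms, and the function name expand_query and the dictionary layout are hypothetical.

```python
def expand_query(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the query terms plus any configured synonyms, without duplicates."""
    expanded: list[str] = []
    for term in query.lower().split():
        for candidate in [term, *synonyms.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


synonyms = {"global": ["world", "international"]}
print(expand_query("global markets", synonyms))
# ['global', 'world', 'international', 'markets']
```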


Schedule and Index


An AEM collection should index only published pages. You can schedule indexing for a collection with the following frequency options:

  • Once
  • Hourly
  • Daily
  • Every 48 Hours
  • Every 96 Hours
  • Weekly
  • Monthly

The following operations can be performed in AEM collections:

Activity | Description
Enable Scheduler for Indexing | Turn on to set the Start Date and Frequency for indexing.
Schedule | Set the indexing schedule for each collection based on the selected options.
View all Collection Schedules | Go to the Schedules section to see all collection schedules.
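
To reason about when a scheduled collection will next run, the frequency options can be treated as fixed intervals counted from the Start Date. The sketch below is a rough planning aid under assumptions, not the scheduler's actual logic: "Monthly" is approximated as 30 days, and the names FREQUENCY_INTERVALS and next_run are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Interval lengths for the repeating options. "Once" has no repeat, and
# "Monthly" is approximated as 30 days here, which is an assumption.
FREQUENCY_INTERVALS = {
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
    "Monthly": timedelta(days=30),
}


def next_run(start: datetime, frequency: str, now: datetime) -> Optional[datetime]:
    """Return the first scheduled run at or after `now`, or None if there is none."""
    if now <= start:
        return start
    if frequency == "Once":
        return None
    interval = FREQUENCY_INTERVALS[frequency]
    intervals_needed = -(-(now - start) // interval)  # ceiling division
    return start + intervals_needed * interval


start = datetime(2024, 1, 1, 2, 0)
print(next_run(start, "Every 48 Hours", datetime(2024, 1, 3, 9, 30)))
# 2024-01-05 02:00:00
```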

Manage Documents

Using the Manage Documents tab, you can perform the following operations:

  • Add/Update
  • Filter
  • View Content
  • View Metadata
  • Refresh
  • Delete

To add a document, click the + icon.

Enter the document URL and click Add/Update.

Once the document is added or updated, the document URL will be displayed on the screen, and you will be able to perform the operations listed above.

Data Fields Tab

Using the Data Fields tab, you can create custom fields for search and view the default fields in non-encrypted collections. SearchBlox supports four types of Data Fields:

Type | Description
Keyword | Used for alphanumeric values such as IDs, tags, codes, or other exact-match fields.
Number | Used for numeric values such as prices, quantities, ratings, or counts.
Date | Used for date values that can be searched, sorted, and filtered.
Text | Used for full-text search within custom field content.

  • After configuring Data Fields, you must clear and re-index the collection for changes to take effect.

For more information about Data Fields, refer to the Data Fields Tab documentation.

Prompts

When LLM/RAG is enabled, you can edit AI-based prompts for Title, Description, Topic, Image Description, and Smart FAQs.
You can customize these prompts anytime, and use Restore Default to reset them back to the original SearchBlox settings.


Models

The Models section lets you override the global embedding, reranking, and LLM settings for this specific collection. Changes made here apply only to the current collection and do not affect other collections.

Embedding

  • Provider specifies the embedding provider used to generate vector representations of documents.
  • Model defines the embedding model used to convert document content into vectors for semantic search.

Reranker

  • Provider specifies the reranker provider used for improving search result relevance.
  • Model defines the reranker model used to re-score and reorder search results based on relevance.

LLM

  • Provider specifies the Large Language Model provider used for AI-powered features.

  • Model defines the LLM used for tasks such as document enrichment, summaries, and SmartFAQs.

  • These settings override global configurations and apply only to the current collection.

Monitoring & Webhooks

The Monitoring & Webhooks tab provides settings for monitoring content changes and configuring webhook endpoints for automatic synchronization between AEM and SearchBlox.

Content Monitoring

Scheduled Monitoring

Enables automatic synchronization based on the configured schedule. When enabled, SearchBlox periodically checks the content source and performs synchronization according to the selected interval.

Delta Sync

Controls whether synchronization processes only changed content or performs a full synchronization.

  • When enabled, only new, updated, or deleted content is synchronized.
  • When disabled, every synchronization performs a full crawl of the content source.
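
The difference between a delta and a full synchronization can be illustrated by comparing two snapshots of the content source. The sketch below assumes each snapshot maps a content path to a last-modified value. This data model and the function name compute_delta are hypothetical, and how SearchBlox detects changes is not specified here.

```python
def compute_delta(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {path: last_modified} snapshots and classify the changes."""
    return {
        "new": sorted(p for p in current if p not in previous),
        "updated": sorted(p for p in current
                          if p in previous and current[p] != previous[p]),
        "deleted": sorted(p for p in previous if p not in current),
    }


previous = {"/content/site/a": "2024-01-01", "/content/site/b": "2024-01-01"}
current = {"/content/site/a": "2024-02-01", "/content/site/c": "2024-02-01"}
print(compute_delta(previous, current))
# {'new': ['/content/site/c'], 'updated': ['/content/site/a'], 'deleted': ['/content/site/b']}
```

With Delta Sync enabled, only the paths in these three lists would need processing. With it disabled, every path in the current snapshot would be crawled again.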

Sync Interval

Specifies how frequently SearchBlox checks for content updates when Scheduled Monitoring is enabled. The selected interval determines how often synchronization jobs are executed.

Sync History

The Sync History section displays information about previous synchronization jobs.

The table includes:

  • Type – The type of synchronization that was performed.
  • Status – The result of the synchronization job.
  • Started – The date and time when the synchronization began.
  • Duration – The time taken to complete the synchronization.

Webhooks

The Webhooks section provides endpoints and security settings used to receive content update notifications from AEM.

Standard AEM Webhook URL

Endpoint used by a classic or on-premise AEM replication agent to notify SearchBlox when content is published.

Adobe I/O Events Webhook URL

Endpoint used by Adobe Experience Manager as a Cloud Service to send Adobe I/O event notifications to SearchBlox. Incoming requests are validated using the Adobe I/O signature.

Standard Webhook Secret

Shared secret value used to validate requests sent to the Standard AEM Webhook URL. Leave the field blank to retain the current secret.

Adobe I/O Webhook Secret

Signing secret used to verify Adobe I/O event notifications. This value is used to validate the x-adobe-signature included in incoming requests. Leave the field blank to retain the current secret.
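
Signature validation of this kind generally works by recomputing a keyed hash of the raw request body and comparing it with the value in the header. The sketch below assumes HMAC-SHA256 with Base64 encoding. The algorithm and encoding that SearchBlox expects are not stated here, so treat both as assumptions. The function name is_signature_valid, the secret, and the payload are placeholders.

```python
import base64
import hashlib
import hmac


def is_signature_valid(secret: str, body: bytes, received_signature: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw body and compare in constant time."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode("ascii")
    return hmac.compare_digest(expected, received_signature)


# Simulate a sender that signs the payload with the same shared secret.
secret = "example-signing-secret"
body = b'{"event": "page.published"}'
signature = base64.b64encode(
    hmac.new(secret.encode("utf-8"), body, hashlib.sha256).digest()
).decode("ascii")

print(is_signature_valid(secret, body, signature))          # True
print(is_signature_valid("wrong-secret", body, signature))  # False
```

The comparison uses hmac.compare_digest rather than == so that the check does not leak timing information about the expected value.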

Save

Saves any changes made to the webhook configuration and monitoring settings.

Best Practices

  • Verify that both the AEM Author and Publisher instance URLs are correct and accessible from the SearchBlox server before saving the collection settings.
  • Always click Test Connection after saving the settings to confirm that the connection is working before starting indexing.
  • Use Allow Paths to limit indexing to specific sections of the site. Indexing the entire AEM repository without path restrictions can significantly increase indexing time.
  • Index only published pages to ensure that search results reflect live content.
  • When managing multiple collections, schedule indexing so that only 2–3 collections run simultaneously to optimize system performance.