AEM Collection

AEM Collections index pages and assets in the AEM content repository, treating each page or asset as a separate document.

Prerequisites

Before creating an AEM Collection, make sure:

  • AEM author instance is running and accessible
  • AEM publisher instance is running and accessible
  • Admin credentials (username and password) for the AEM author instance are available

Note:

SearchBlox must have network access to the AEM instances so that it can reach and crawl the AEM site pages.
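
A quick way to confirm the reachability prerequisite from the SearchBlox host is a small script like the sketch below. It is only an illustration: the function name `is_reachable` and the example host names are hypothetical, and the check treats any HTTP response (including a login wall such as 401 or 403) as proof that the instance answers on the network.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded with an error status (for example 401 or 403),
        # so it is reachable even though the request was not authorized.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or a malformed URL.
        return False


# Example usage (hypothetical hosts, replace with your own instances):
#   is_reachable("http://author.example.com:4502/")
#   is_reachable("http://publisher.example.com:4503/")
```

Run it on the machine where SearchBlox is installed, since reachability from your workstation does not guarantee reachability from the server.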

Create an AEM Collection

Follow these steps to create a new AEM Collection:

  1. Log in to the Admin Console
  2. Go to the Collections tab
  3. Click on "Create a New Collection" or the "+" icon
  4. Select "AEM Collection" as the Collection Type
  5. Enter a unique name for your collection (e.g., "intranet site")
  6. Configure RAG settings (Enable for ChatBot and Hybrid RAG search)
  7. Set Collection Access permissions (Private/Public)
  8. Select the content language (if not English)
  9. Click "Save" to create your collection


  • After creating the AEM collection, you will be taken to the Settings tab.

Settings Tab

  • Provide the Authentication fields.
Field | Description
Author Instance URL | URL of the AEM Author instance to index documents from the content repository.
Publisher Instance URL | URL of the AEM Publisher instance. Documents are served from the publisher, while indexing happens from the author instance.
Username | AEM username with admin privileges. Not required if service security is disabled.
Password | Corresponding password for the AEM username. Not required if service security is disabled.

Choose the settings for Generate Using LLM and Hybrid Search.

Setting | Description
Title | Generates concise and relevant titles for the indexed documents using LLM.
Description | Generates the description for indexed documents using LLM.
Topic | Generates relevant topics for indexed documents using LLM based on document's content.
Auto Relevance | Enable/Disable Hybrid Search for automatic relevance ranking.
  • Click Save, then click Test Connection.
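
If Test Connection fails, it can help to check the author credentials outside SearchBlox. The sketch below builds an HTTP Basic authentication request against the author instance. It is not how SearchBlox itself connects; the helper names are hypothetical, and the probe path `/bin/querybuilder.json` assumes the AEM QueryBuilder servlet is enabled on your instance, which may not be the case.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_author_probe(author_url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a small authenticated request to the author instance."""
    # Assumption: the QueryBuilder servlet is available at this path.
    url = author_url.rstrip("/") + "/bin/querybuilder.json?path=/content&type=cq:Page&p.limit=1"
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Example usage (hypothetical host and credentials):
#   req = build_author_probe("http://author.example.com:4502", "admin", "secret")
#   urllib.request.urlopen(req, timeout=10)  # raises HTTPError 401 if credentials are rejected
```

A successful response suggests the username and password are accepted by the author instance; a 401 error points to a credentials problem rather than a network problem.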

AEM Collection Paths to Index Specific Site Pages

  • AEM collection paths let you set Allow/Disallow paths for the crawler. To index only specific site pages or assets, add them as Allow paths. To access the paths, click the collection name in the Collections list.

Allow/Disallow Paths

  • Allow/Disallow paths let the crawler include or exclude URLs.
  • They help manage a collection by excluding unwanted URLs.
  • All Allow and Disallow paths are matched against the publisher instance URL.
Field | Description
Allow Paths | https://xxx.xxx.xx.xx:xxxx/wk-events/, /aqua-collections/, /wellness-care/, https://xxx.xxx.xx.xx:xxxx/wk-events/standard.html, .* (allows the crawler to go to any external URL or domain)
Disallow Paths | .jsp, /cgi-bin/, /videos/, ?params
Allowed Formats | Select the document formats to be searchable in the collection.
Enable Content API | Allows crawling of document content with special characters included.
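
To reason about which URLs a given set of Allow/Disallow paths would include, a small model of the filter can be useful. The sketch below is an assumption, not the crawler's actual matching logic: it treats each path as a plain substring of the URL, treats `.*` as "allow everything", and lets Disallow entries win over Allow entries. The function name `should_crawl` and the example host are hypothetical.

```python
# Example entries taken from the table above.
ALLOW_PATHS = ["/wk-events/", "/aqua-collections/", "/wellness-care/"]
DISALLOW_PATHS = [".jsp", "/cgi-bin/", "/videos/", "?params"]


def should_crawl(url, allow=ALLOW_PATHS, disallow=DISALLOW_PATHS):
    """Return True if `url` passes the simplified Allow/Disallow filter."""
    # Assumption: a Disallow match always excludes the URL.
    if any(pattern in url for pattern in disallow):
        return False
    # Assumption: ".*" allows any URL; other entries match as substrings.
    return any(pattern == ".*" or pattern in url for pattern in allow)


for candidate in [
    "https://publisher.example.com/wellness-care/index.html",
    "https://publisher.example.com/wellness-care/videos/intro.html",
    "https://publisher.example.com/about.html",
]:
    print(candidate, should_crawl(candidate))
```

Under these assumptions, only the first URL is crawled: the second contains a Disallow entry (`/videos/`) and the third matches no Allow entry.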

Schedule and Index


An AEM Collection should index only published pages. You can schedule indexing for a collection with one of the following frequency options:

  • Once
  • Hourly
  • Daily
  • Every 48 Hours
  • Every 96 Hours
  • Weekly
  • Monthly

The following operations can be performed in AEM Collections:

Activity | Description
Enable Scheduler for Indexing | Turn on to set the Start Date and Frequency for indexing.
Schedule | Set the indexing schedule for each collection based on the selected options.
View all Collection Schedules | Go to the Schedules section to see all collection schedules.
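
To estimate when a collection would next be indexed for a given Start Date and Frequency, the options listed earlier can be modeled as time intervals. This is a planning aid only, not the scheduler's implementation: the function name `next_run` is hypothetical, and treating Monthly as "same day next month, clamped to the month's length" is an assumption.

```python
import calendar
from datetime import datetime, timedelta

# Fixed-length frequency options from the list above.
INTERVALS = {
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Every 48 Hours": timedelta(hours=48),
    "Every 96 Hours": timedelta(hours=96),
    "Weekly": timedelta(weeks=1),
}


def next_run(last_run: datetime, frequency: str):
    """Return the next indexing time after `last_run`, or None for "Once"."""
    if frequency == "Once":
        return None  # a one-time schedule has no follow-up run
    if frequency == "Monthly":
        # Assumption: same day of the following month, clamped to its last day.
        year = last_run.year + last_run.month // 12
        month = last_run.month % 12 + 1
        day = min(last_run.day, calendar.monthrange(year, month)[1])
        return last_run.replace(year=year, month=month, day=day)
    return last_run + INTERVALS[frequency]


start = datetime(2024, 1, 31, 9, 0)
for option in ["Hourly", "Every 48 Hours", "Weekly", "Monthly"]:
    print(option, next_run(start, option))
```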

Data Fields Tab

Using the Data Fields tab, you can create custom fields for search and view the default fields in non-encrypted collections. SearchBlox supports 4 types of Data Fields:

  1. Keyword
  2. Number
  3. Date
  4. Text
  • After configuring Data Fields, you must clear and re-index the collection for changes to take effect.

For more information about Data Fields, see Data Fields Tab.


Models

Embedding

  • Provider specifies the embedding provider used to generate vector representations of documents.
  • Model defines the embedding model used to convert document content into vectors for semantic search.

Reranker

  • Provider specifies the reranker provider used for improving search result relevance.
  • Model defines the reranker model used to re-score and reorder search results based on relevance.

LLM

  • Provider specifies the Large Language Model provider used for AI-powered features.

  • Model defines the LLM used for tasks such as document enrichment, summaries, and SmartFAQs.

  • These settings override global configurations and apply only to the current collection.
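
The override behavior can be pictured as a per-section merge in which collection-level values take precedence over global ones. The sketch below illustrates that idea only; the dictionary layout, the function name `resolve_models`, and all provider and model names are hypothetical placeholders, not actual SearchBlox configuration keys.

```python
SECTIONS = ("embedding", "reranker", "llm")


def resolve_models(global_cfg: dict, collection_cfg: dict) -> dict:
    """Merge model settings, letting collection-level values override global ones."""
    resolved = {}
    for section in SECTIONS:
        merged = dict(global_cfg.get(section, {}))      # start from the global defaults
        merged.update(collection_cfg.get(section, {}))  # apply collection overrides
        resolved[section] = merged
    return resolved


# Hypothetical values for illustration.
global_settings = {
    "embedding": {"provider": "provider-a", "model": "embed-default"},
    "reranker": {"provider": "provider-a", "model": "rerank-default"},
    "llm": {"provider": "provider-a", "model": "llm-default"},
}
collection_settings = {
    "embedding": {"provider": "provider-b", "model": "embed-custom"},
}

effective = resolve_models(global_settings, collection_settings)
print(effective["embedding"])  # collection-level override
print(effective["llm"])        # falls back to the global setting
```

Because the merge produces a new dictionary, the global settings stay unchanged, which mirrors the point that overrides apply only to the current collection.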