Best Practices
Collection Path Settings
This page covers how to configure path settings, crawl behaviour, HTML page rules, and scheduling for WEB collections in SearchBlox.
Note: All settings described on this page can be configured from Collections > WEB Collection > Settings in the SearchBlox Admin Console.
Path Settings
Path settings control which URLs the crawler is allowed to index.
Allow Path
Use Allow Path to limit indexing to specific domains or URL paths within the root path. Only URLs matching the allowed paths will be indexed.
Disallow Path
Use Disallow Path to prevent specific sub-paths from being indexed. Any URL matching a disallowed path will be skipped during crawling.
Important: Do not disable HTML in the allowed formats. If HTML files are blocked, indexing will not work.
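To illustrate how Allow Path and Disallow Path interact, here is a minimal Python sketch of a URL filter. It is not SearchBlox's implementation: the regular-expression matching, the rule that a disallow match takes precedence, and the `example.com` URLs are all assumptions made for illustration.

```python
import re

def is_allowed(url, allow_patterns, disallow_patterns):
    """Return True if url matches an allow pattern and no disallow pattern.

    Assumption: patterns are regular expressions and a disallow match wins.
    """
    if any(re.search(p, url) for p in disallow_patterns):
        return False
    return any(re.search(p, url) for p in allow_patterns)

# Hypothetical settings: index only /docs/, but skip the archive below it.
allow = [r"^https://www\.example\.com/docs/"]
disallow = [r"/docs/archive/"]

for url in [
    "https://www.example.com/docs/intro.html",
    "https://www.example.com/docs/archive/2019.html",
    "https://www.example.com/blog/post.html",
]:
    print(url, is_allowed(url, allow, disallow))
```

Only the first URL passes: the second matches a disallowed sub-path, and the third falls outside every allowed path.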
Collection Settings
| Setting | Description |
|---|---|
| Spider Depth | Controls how many levels of links the crawler follows from the root URL. Set to 0 to index only the root URL. |
| Max Document Size | Limits indexing to documents below a specified size. Enter -1 to remove the size limit. |
| Max Document Age | Limits indexing to documents newer than a specified age. Enter -1 to remove the age limit. |
| User-Agent | Specify a custom user-agent if your robots.txt file contains rules for specific crawlers. |
| Ignore Canonical | When enabled, the crawler ignores canonical tags. Disable this to index only the canonical (preferred) version of a page. Enabled by default. |
| Redirects | Enable this if your webpages use redirects. Include the redirected URL in the Allow Path as well. |
| Follow Sitemaps | Enable this to speed up indexing when all URLs are listed in a sitemap.xml file. |
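The effect of Spider Depth can be pictured as a depth-limited, breadth-first walk over the site's link graph. The sketch below assumes a small hypothetical in-memory link map (`LINKS`) instead of real HTTP requests, and it is only a model of the setting, not the crawler's actual code.

```python
from collections import deque

def crawl(links, root, spider_depth):
    """Return URLs reachable from root by following at most spider_depth levels of links."""
    seen = {root}
    queue = deque([(root, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == spider_depth:
            continue  # depth limit reached: do not follow links from this page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited

# Hypothetical site structure: page -> pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": ["/a/1/x"],
}

print(crawl(LINKS, "/", 0))  # ['/']
print(crawl(LINKS, "/", 1))  # ['/', '/a', '/b']
print(crawl(LINKS, "/", 2))  # ['/', '/a', '/b', '/a/1']
```

With a depth of 0 only the root is visited, which matches the table's description; each additional level adds the pages linked from the previous level.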
HTML Page Rules
You can control what gets crawled and indexed directly from your web pages:
- Use stopindex tags in HTML content to exclude specific sections of a page, such as headers, footers, or navigation, from being indexed. See Using Meta Robots for supported tags.
- Use robots.txt to control which sub-paths of your website the crawler can access or block. See Using Robots.txt for details.
- Use robots meta tags on individual pages to control crawling and indexing behaviour at the page level.
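To check how a set of robots.txt rules applies to a particular user-agent before crawling, you can test them locally. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `examplebot` user-agent name, and the URLs are hypothetical, and SearchBlox's own parser may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one crawler-specific group and one default group.
ROBOTS_TXT = """\
User-agent: examplebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def can_crawl(user_agent, url):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)

print(can_crawl("examplebot", "https://www.example.com/private/page.html"))
print(can_crawl("examplebot", "https://www.example.com/tmp/page.html"))
print(can_crawl("otherbot", "https://www.example.com/tmp/page.html"))
```

A crawler that matches a named group follows only that group's rules, which is why setting a custom User-Agent in the collection settings changes which sub-paths are blocked.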
Scheduling Collections
When scheduling a WEB collection for recurring indexing, follow these guidelines:
- Ensure the next scheduled indexing run starts only after the previous one has completed — overlapping runs can cause indexing errors.
- Use a daily schedule if your full indexing cycle completes within 24 hours.
- Use a weekly schedule if indexing takes longer than a day to complete.
Tip: Monitor your indexing logs to understand how long each run takes before deciding on a schedule frequency.
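The scheduling guidelines above can be expressed as a small decision rule. This is a sketch only: the function name `recommend_schedule` is hypothetical, and the run durations are assumed to be values you have read from your own indexing logs.

```python
from datetime import timedelta

def recommend_schedule(run_durations):
    """Suggest 'daily' or 'weekly' based on the longest observed indexing run."""
    longest = max(run_durations)
    # A daily schedule is safe only if every run completes within 24 hours;
    # otherwise the next run could start before the previous one ends.
    if longest <= timedelta(hours=24):
        return "daily"
    return "weekly"

# Hypothetical durations taken from past runs.
print(recommend_schedule([timedelta(hours=3), timedelta(hours=20)]))  # daily
print(recommend_schedule([timedelta(hours=5), timedelta(hours=30)]))  # weekly
```

Using the longest observed run instead of the average leaves a margin against overlapping runs.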
