Crawler Rules
Understand how the SearchBlox web crawler decides what to index, what to skip, and in what order rules are applied.
Rule priority order: When the crawler encounters a URL, it checks rules in this order, from highest to lowest priority:
- robots.txt (highest priority; overrides all other rules)
- Allow / Disallow paths
- Supported file formats
- Meta robots tags
- Collection-specific settings (redirects, duplicate removal, etc.)
Basic Crawler Rules
When SearchBlox initiates crawling from a specified root path, it follows this sequential verification process:
- Checks the robots.txt file
- Verifies allowed and disallowed paths
- Validates supported file formats
- Examines meta robots tags
- Applies additional settings (redirects, duplicate removal, etc.) according to WEB collection configuration
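The sequential process above can be sketched as a short pipeline. This is only an illustrative model, not SearchBlox code: the `Candidate` class, its fields, and `first_failed_check` are hypothetical names, and the sketch assumes a URL is skipped as soon as one check rejects it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model of a URL being evaluated; not a SearchBlox data structure.
@dataclass
class Candidate:
    url: str
    robots_allowed: bool = True     # result of the robots.txt check
    path_allowed: bool = True       # result of the allow/disallow path check
    format_supported: bool = True   # result of the file format check
    meta_index: bool = True         # False when meta robots says "noindex"

# Checks listed in the documented order.
CHECKS = [
    ("robots.txt", lambda c: c.robots_allowed),
    ("allow/disallow paths", lambda c: c.path_allowed),
    ("file format", lambda c: c.format_supported),
    ("meta robots", lambda c: c.meta_index),
]

def first_failed_check(candidate: Candidate) -> Optional[str]:
    """Return the name of the first check that rejects the URL, or None."""
    for name, passes in CHECKS:
        if not passes(candidate):
            return name
    return None
```

When a URL fails several checks at once, the sketch reports the earliest one, which mirrors the idea that robots.txt is consulted first.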
You can control crawling from two places:
In the SearchBlox Admin Console:
- WEB collection path rules (Allow / Disallow paths)
- File format settings
- Collection-specific configuration (spider depth, redirects, duplicate removal, etc.)
Directly on your web pages:
- robots.txt: tells all crawlers what to skip
- Canonical tags: tell the crawler which URL is the authoritative version of a page
- Meta robots tags: control indexing and link following on a per-page basis
Rules for Robots.txt
- By default, the crawler follows rules in robots.txt because Ignore Robots is disabled.
- To ignore robots.txt, enable Ignore Robots in the WEB collection settings.
- Rules in robots.txt have the highest priority and override all other WEB collection settings or rules.
- Learn more about Using Robots.txt in HTTP Collection
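To see how a robots.txt rule blocks a URL, the following sketch uses Python's standard `urllib.robotparser` module. It illustrates generic robots.txt behavior rather than SearchBlox's internal implementation; the `is_allowed` helper, the `examplebot` user agent, and the `ignore_robots` flag (standing in for the Ignore Robots setting) are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, url: str,
               user_agent: str = "examplebot",
               ignore_robots: bool = False) -> bool:
    """Return True if the URL may be crawled under the given robots.txt."""
    if ignore_robots:
        # Models a crawler configured to skip robots.txt entirely.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the sample file, a URL under `/private/` is rejected unless `ignore_robots=True` is passed, while URLs elsewhere on the site are allowed.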
Rules for Canonical
- By default, canonical rules are ignored because Ignore Canonical is enabled in collection settings.
- To index canonical URLs, disable Ignore Canonical in the collection settings.
- If a canonical link is empty, e.g., <link href="" rel="canonical">, the actual URL will be used, which is the same as ignoring canonical.
- If a URL matches the allow path but its canonical URL matches a disallow path (or doesn’t match the allow path), the URL will still be indexed.
- The base URL that is crawled is used for indexing.
- The canonical URL will replace the URL in the index, even if it’s in a disallow path.
- Disallow paths are used only for crawling, not for replacing the URL in the index.
- Learn more about Using Canonical Tags in HTTP Collection
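The empty-canonical rule above can be illustrated with a small sketch built on Python's standard `html.parser`. It is a simplified model rather than SearchBlox code: `CanonicalFinder`, `url_for_index`, and the `ignore_canonical` parameter (standing in for the Ignore Canonical setting) are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional

class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = (attr.get("href") or "").strip()

def url_for_index(crawled_url: str, html: str,
                  ignore_canonical: bool = True) -> str:
    """Pick the URL to store in the index for a crawled page."""
    if ignore_canonical:
        return crawled_url
    finder = CanonicalFinder()
    finder.feed(html)
    # An empty or missing canonical falls back to the crawled URL.
    return finder.canonical or crawled_url
```

Note that the sketch never consults allow or disallow paths when choosing the indexed URL, in line with the statement that disallow paths affect crawling only.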
Rules for meta robots
- Meta robots rules are applied after robots.txt and path settings in WEB collections.
- Meta robots can be used to allow crawling but prevent indexing, or allow indexing but prevent crawling.
- To avoid indexing but allow crawling, add this meta tag to the page:
<meta name="robots" content="noindex, follow">
- To avoid crawling but allow indexing, use:
<meta name="robots" content="index, nofollow">
- To avoid both indexing and crawling, use:
<meta name="robots" content="noindex, nofollow">
- Learn more about Using Meta Robots
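As a rough illustration of how the three tag variants above map to crawler decisions, the sketch below parses a page's meta robots tag into two flags. It uses Python's standard `html.parser`; the `MetaRobotsParser` class and `meta_robots` function are hypothetical names, and the sketch assumes that a page without a meta robots tag is both indexed and followed.

```python
from html.parser import HTMLParser
from typing import Tuple

class MetaRobotsParser(HTMLParser):
    """Collects index/follow decisions from <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.index = True    # assumed default: index the page
        self.follow = True   # assumed default: follow its links

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",")}
        if "noindex" in directives:
            self.index = False
        if "nofollow" in directives:
            self.follow = False

def meta_robots(html: str) -> Tuple[bool, bool]:
    """Return (index, follow) for the given page source."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.index, parser.follow
```

For example, `meta_robots('<meta name="robots" content="noindex, follow">')` yields `(False, True)`: the page is left out of the index but its links are still followed.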
Rules for sitemaps
- If Ignore Sitemaps is disabled, only the links in the sitemap (e.g., https://example.com/sitemap.xml) will be indexed.
- By default, Ignore Sitemaps is enabled in HTTP collection settings.
- Multiple sitemaps can be indexed by providing multiple URLs in the root path.
- Spider depth and other HTTP settings do not apply to sitemaps.
- SearchBlox supports standard XML sitemaps, but compressed XML files (tar or gzip) are not supported.
- Learn more on Using Sitemaps
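For reference, a standard XML sitemap is a flat list of `<loc>` entries, and those entries are the links that would be indexed when sitemaps are in use. The sketch below extracts them with Python's standard `xml.etree.ElementTree`; the `sitemap_urls` helper and the sample URLs are assumptions for illustration, not part of SearchBlox.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)

def sitemap_urls(xml_text: str) -> List[str]:
    """Return every <loc> value found in an uncompressed XML sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

The helper expects plain XML text, which matches the note above that compressed sitemaps are not supported.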
Rules for stopindex, noindex, google tags
- Content wrapped in stopindex, noindex, or googleoff/googleon tags will not be indexed.
- These tags cannot be used in the head section or meta tags.
- These tags should not be nested inside each other.
- Each tag must be properly closed: stopindex → startindex, noindex → noindex end, googleoff → googleon.
- For the conventions behind these tags, refer to the relevant Wikipedia and Google documentation.
- Learn more about Using Exclusion Meta Robots
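The effect of these exclusion tags can be approximated with a few regular expressions. This section names the tags but does not show their exact markup, so the syntax below is an assumption: comment-style `<!--stopindex-->` / `<!--startindex-->`, element-style `<noindex>` / `</noindex>`, and comment-style `<!--googleoff: all-->` / `<!--googleon: all-->`. The `strip_excluded` helper is hypothetical, and because nesting is disallowed, simple non-greedy matching is enough for a sketch.

```python
import re

# Assumed open/close markup for each pair; verify against the tag standards.
EXCLUSION_PATTERNS = [
    re.compile(r"<!--\s*stopindex\s*-->.*?<!--\s*startindex\s*-->", re.S | re.I),
    re.compile(r"<noindex>.*?</noindex>", re.S | re.I),
    re.compile(r"<!--\s*googleoff:\s*\w+\s*-->.*?<!--\s*googleon:\s*\w+\s*-->",
               re.S | re.I),
]

def strip_excluded(html: str) -> str:
    """Remove every properly closed exclusion block from the page body."""
    for pattern in EXCLUSION_PATTERNS:
        html = pattern.sub("", html)
    return html
```

A block that is opened but never closed is left untouched by this sketch, which is one reason each tag needs its matching closing tag.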
