Crawler Rules
Basic Crawler Rules
When SearchBlox initiates crawling from a specified root path, it follows this sequential verification process:
- Checks the robots.txt file
- Verifies allowed and disallowed paths
- Validates supported file formats
- Examines meta robots tags
- Applies additional settings (redirects, duplicate removal, etc.) according to WEB collection configuration
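The sequential checks above can be sketched as a short-circuit pipeline: the first failing check rejects the URL and later checks never run. The check names and boolean results below are purely illustrative, not SearchBlox's API.

```python
def crawl_decision(url, checks):
    """Apply the checks in order; the first failing check rejects the URL."""
    for name, passed in checks:
        if not passed:
            return f"rejected by {name}"
    return "accepted"

# Hypothetical results for a single URL, in the order listed above.
checks = [
    ("robots.txt", True),
    ("allow/disallow paths", True),
    ("supported file format", False),  # e.g. an unsupported binary format
    ("meta robots", True),
]
print(crawl_decision("https://example.com/file.bin", checks))  # rejected by supported file format
```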
Crawling can be managed directly in SearchBlox through:
- WEB collection paths specification
- File format settings
- Collection-specific configuration parameters
Crawling can also be controlled at the webpage level using:
- robots.txt directives
- Canonical tags
- Robot meta tags
Rules for Robots.txt
- By default, the crawler follows rules in robots.txt because Ignore Robots is disabled.
- To ignore robots.txt, enable Ignore Robots in the WEB collection settings.
- Rules in robots.txt have the highest priority and override all other WEB collection settings or rules.
- Learn more about Using Robots.txt in HTTP Collection
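The robots.txt check can be reproduced outside SearchBlox with Python's standard `urllib.robotparser`; the rules and user-agent string below are illustrative only. Note that Python's parser applies rules in file order, so the more specific Allow line is placed first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; SearchBlox applies the same
# allow/disallow logic before fetching any page.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-report.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A page under a disallowed path is skipped...
print(parser.can_fetch("searchblox", "https://example.com/private/secret.html"))  # False
# ...unless a more specific Allow rule permits it.
print(parser.can_fetch("searchblox", "https://example.com/private/public-report.html"))  # True
```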
Rules for Canonical
- By default, canonical rules are ignored because Ignore Canonical is enabled in the collection settings.
- To index canonical URLs, enable the canonical setting in the collection.
- If a canonical link is empty, e.g., <link href="" rel="canonical">, the actual URL is used, which is the same as ignoring the canonical tag.
- If a URL matches an allow path but its canonical URL matches a disallow path (or does not match an allow path), the URL is still indexed:
  - The base URL that is crawled is used for indexing.
  - The canonical URL then replaces the crawled URL in the index, even if it falls under a disallow path.
  - Disallow paths apply only to crawling, not to the URL stored in the index.
- Learn more about Using Canonical Tags in HTTP Collection
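The empty-canonical fallback can be sketched with Python's standard `html.parser`; `index_url` is a hypothetical helper for illustration, not part of SearchBlox.

```python
from html.parser import HTMLParser

class CanonicalExtractor(HTMLParser):
    """Collects the href of a <link rel="canonical"> tag, if any."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

def index_url(crawled_url, html):
    """Return the URL that would be stored in the index: the canonical
    URL when present, falling back to the crawled URL when the
    canonical link is empty or missing (mirroring the rule above)."""
    p = CanonicalExtractor()
    p.feed(html)
    return p.canonical if p.canonical else crawled_url

print(index_url("https://example.com/a?page=2",
                '<link href="https://example.com/a" rel="canonical">'))
# -> https://example.com/a  (canonical replaces the crawled URL)
print(index_url("https://example.com/b",
                '<link href="" rel="canonical">'))
# -> https://example.com/b  (empty canonical: the actual URL is used)
```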
Rules for meta robots
- Meta robots rules are applied after robots.txt and path settings in WEB collections.
- Meta robots can be used to allow crawling but prevent indexing, or to allow indexing but prevent crawling.
- To avoid indexing but allow crawling, add this meta tag to the page:
  <meta name="robots" content="noindex, follow">
- To avoid crawling but allow indexing, use:
  <meta name="robots" content="index, nofollow">
- To avoid both indexing and crawling, use:
  <meta name="robots" content="noindex, nofollow">
- Learn more about Using Meta Robots
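A minimal sketch of how a crawler might interpret the `content` value of a robots meta tag; the helper name and dictionary shape are assumptions, not SearchBlox's implementation.

```python
def robots_directives(content):
    """Parse a robots meta content value into index/follow flags.
    Pages default to index,follow unless a directive negates them."""
    tokens = {t.strip().lower() for t in content.split(",")}
    return {
        "index": "noindex" not in tokens,
        "follow": "nofollow" not in tokens,
    }

# The three tags from the rules above:
print(robots_directives("noindex, follow"))    # crawled, but not indexed
print(robots_directives("index, nofollow"))    # indexed, but links not followed
print(robots_directives("noindex, nofollow"))  # neither indexed nor followed
```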
Rules for sitemaps
- By default, Ignore Sitemaps is enabled in HTTP collection settings.
- If Ignore Sitemaps is disabled, only the links in the sitemap (e.g., https://example.com/sitemap.xml) will be indexed.
- Multiple sitemaps can be indexed by providing multiple URLs in the root path.
- Spider depth and other HTTP settings do not apply to sitemaps.
- SearchBlox supports standard (uncompressed) XML sitemaps; compressed sitemap files (tar or gzip) are not supported.
- Learn more on Using Sitemaps
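Extracting the indexable links from a standard XML sitemap can be sketched with Python's built-in `xml.etree`; the sitemap content below is illustrative.

```python
import xml.etree.ElementTree as ET

# A minimal uncompressed XML sitemap per the sitemaps.org schema, the
# only form the crawler accepts (tar/gzip archives are rejected).
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)  # only these links would be indexed when Ignore Sitemaps is disabled
```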
Rules for stopindex, noindex, google tags
- Content wrapped in stopindex, noindex, or googleoff/googleon tags will not be indexed.
- These tags cannot be used in the head section or in meta tags.
- These tags should not be nested inside each other.
- Each tag must be properly closed: stopindex → startindex, noindex → noindex end, googleoff → googleon.
- Refer to the published standards for these tags on Wikipedia or in Google's documentation.
- Learn more about Using Exclusion Meta Robots
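A regex sketch of stripping excluded spans before indexing, assuming the tags appear as HTML comments (e.g. `<!--stopindex-->` … `<!--startindex-->`); the exact comment forms and the `strip_excluded` helper are assumptions for illustration.

```python
import re

# Assumed comment forms for the exclusion tag pairs; each opening tag
# must be closed by its counterpart, and pairs must not be nested.
EXCLUSION_PAIRS = [
    (r"<!--\s*stopindex\s*-->", r"<!--\s*startindex\s*-->"),
    (r"<!--\s*noindex\s*-->",   r"<!--\s*noindex\s+end\s*-->"),
    (r"<!--\s*googleoff\s*-->", r"<!--\s*googleon\s*-->"),
]

def strip_excluded(html):
    """Remove every span wrapped by a matched exclusion pair."""
    for start, end in EXCLUSION_PAIRS:
        html = re.sub(start + r".*?" + end, "", html, flags=re.DOTALL)
    return html

page = "Keep this. <!--stopindex-->Drop this.<!--startindex--> Keep this too."
print(strip_excluded(page))
```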
Updated 4 months ago
