Crawler Rules

Basic Crawler Rules

When SearchBlox initiates crawling from a specified root path, it follows this sequential verification process:

Checks the robots.txt file
Verifies allowed and disallowed paths
Validates supported file formats
Examines meta robots tags
Applies additional settings (redirects, duplicate removal, etc.) according to WEB collection configuration
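The order of these checks can be sketched in a few lines of Python. This is only an illustration of the sequence described above, not SearchBlox's implementation: the `Page` class, its fields, and `first_failed_check` are hypothetical names, and the final step (redirects, duplicate removal, and so on) is left out.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record holding the outcome of each check for one URL."""
    url: str
    robots_txt_allows: bool      # result of the robots.txt check
    path_allowed: bool           # result of the allow/disallow path check
    format_supported: bool       # result of the file-format check
    meta_noindex: bool = False   # page carries a "noindex" meta robots directive


def first_failed_check(page: Page, ignore_robots: bool = False):
    """Return the name of the first check that rejects the page, or None."""
    checks = [
        ("robots.txt", ignore_robots or page.robots_txt_allows),
        ("allow/disallow paths", page.path_allowed),
        ("file format", page.format_supported),
        ("meta robots", not page.meta_noindex),
    ]
    # Checks are evaluated in the documented order; the first failure wins.
    for name, passed in checks:
        if not passed:
            return name
    return None
```

A page blocked by both robots.txt and the path rules is reported as rejected by robots.txt, because that check runs first.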

Crawling can be managed directly in SearchBlox through:

Crawling can also be controlled at the webpage level using:

By default, the crawler follows the rules in robots.txt, because the ignore robots setting is disabled by default.
To have the crawler disregard robots.txt, enable ignore robots in the WEB collection settings.
As noted in Basic Crawler Rules, the rules in robots.txt take precedence over all other rules and settings specified in the WEB collection.
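The effect of the ignore robots setting can be illustrated with Python's standard `urllib.robotparser` module. This is a minimal sketch, not SearchBlox's own logic: the `ROBOTS_TXT` sample, the `allowed_by_robots` helper, and its `ignore_robots` parameter are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(url: str, robots_txt: str,
                      ignore_robots: bool = False,
                      user_agent: str = "*") -> bool:
    """Return True if the URL may be crawled under the given robots.txt."""
    if ignore_robots:
        # With ignore robots enabled, robots.txt is not consulted at all.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the sample above, a URL under /private/ is rejected unless `ignore_robots=True` is passed.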
Learn more about Using Robots.txt in HTTP Collection

By default, canonical tags are ignored, because ignore canonical is enabled by default in the collection settings.
To index pages under their canonical URLs, disable ignore canonical in the collection settings.
If a canonical link is present but empty, for example <link href="" rel="canonical">, the page's actual URL is used, which is equivalent to enabling ignore canonical.
When a URL matches an allow path but its canonical URL matches a disallow path (or does not match any allow path), the page is still indexed, because the allow and disallow checks are applied to the crawled URL. The indexed entry carries the canonical URL even if that URL falls under a disallow path. Disallow paths affect crawling only; the canonical tag only replaces the URL stored for the entry.
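The canonical behavior described above, including the empty-href fallback, can be sketched with Python's standard `html.parser`. The helper names `_CanonicalFinder` and `indexed_url` and the `ignore_canonical` parameter are hypothetical; the sketch does not resolve relative canonical links and is not SearchBlox's implementation.

```python
from html.parser import HTMLParser


class _CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.href is None:
            self.href = (attr.get("href") or "").strip()


def indexed_url(crawled_url: str, html: str, ignore_canonical: bool = True) -> str:
    """Return the URL that would be stored for the indexed entry."""
    if ignore_canonical:
        return crawled_url
    finder = _CanonicalFinder()
    finder.feed(html)
    # An empty or missing canonical falls back to the crawled URL.
    return finder.href or crawled_url
```

Note that no path rule is consulted here: allow and disallow paths apply to the crawled URL during crawling, not to the canonical URL.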
Learn more about Using Canonical Tags in HTTP Collection

Meta robots rules are applied after robots.txt and the path settings specified in the WEB collection.
With meta robots, you can prevent a page from being indexed while still allowing the crawler to follow its links, do the reverse, or prevent both.
To prevent indexing of a page but allow crawling, specify the following meta tag in the page:
<meta name="robots" content="noindex, follow">
To prevent crawling but allow indexing, specify the following meta tag:
<meta name="robots" content="index, nofollow">
To prevent both indexing and crawling, specify the following meta tag:
<meta name="robots" content="noindex, nofollow">
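The three combinations above can be read out of a page with a small parser. This is a sketch using Python's standard `html.parser`; `_MetaRobots` and `meta_robots_decision` are hypothetical names, and the default of "index, follow" when no tag is present is an assumption of the example.

```python
from html.parser import HTMLParser


class _MetaRobots(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def meta_robots_decision(html: str):
    """Return (index_page, follow_links) for the given HTML."""
    parser = _MetaRobots()
    parser.feed(html)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)
```

For example, a page containing `<meta name="robots" content="noindex, follow">` yields `(False, True)`: the page is not indexed, but its links are followed.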
Learn more about Using Meta Robots

When ignore sitemaps is disabled, only sitemap links are indexed; that is, only the links listed in a sitemap such as https://example.com/sitemap.xml are indexed.
By default, ignore sitemaps is enabled in the HTTP collection settings.
To index multiple sitemaps, provide multiple sitemap URLs in the root path.
Spider depth and other HTTP settings do not apply to sitemaps.
SearchBlox supports standard XML sitemaps. Sitemaps delivered as compressed XML files with tar or gzip file extensions are not currently supported.
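To show what "the links available in the sitemap" means in practice, the following sketch extracts the `<loc>` entries from a standard XML sitemap with Python's `xml.etree.ElementTree`. The `SAMPLE_SITEMAP` text and the `sitemap_urls` helper are made up for illustration; how SearchBlox reads sitemaps internally is not covered here.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative sitemap content.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about.html</loc></url>
</urlset>"""


def sitemap_urls(xml_text: str):
    """Return the page URLs listed in an uncompressed XML sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

The function expects plain XML text; a compressed sitemap would have to be decompressed before it could be parsed this way.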
Learn more on Using Sitemaps

Body content enclosed in stopindex tags, noindex tags, or googleoff and googleon tags is not included in the index.
These tags do not apply in the head section or to meta tags.
These tags must not be nested.
Each opening tag needs its matching closing tag: stopindex must be followed by startindex, a noindex start tag by a noindex end tag, and googleoff by googleon.
For the conventions governing these tags, refer to their descriptions on Wikipedia or in Google's documentation.
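The exclusion of tagged body content can be approximated with regular expressions. The sketch below assumes the commonly used spellings `<!--stopindex-->` / `<!--startindex-->`, `<noindex>` / `</noindex>`, and `<!--googleoff: all-->` / `<!--googleon: all-->`; these spellings, like the `strip_excluded` helper itself, are assumptions rather than a statement of what SearchBlox accepts. It relies on the tags not being nested.

```python
import re

# One pattern per tag pair; non-greedy so each pair closes at its nearest end tag.
_EXCLUDED = [
    re.compile(r"<!--\s*stopindex\s*-->.*?<!--\s*startindex\s*-->", re.I | re.S),
    re.compile(r"<noindex>.*?</noindex>", re.I | re.S),
    re.compile(r"<!--\s*googleoff:\s*\w+\s*-->.*?<!--\s*googleon:\s*\w+\s*-->",
               re.I | re.S),
]


def strip_excluded(body_html: str) -> str:
    """Remove body content wrapped in any of the exclusion tag pairs."""
    for pattern in _EXCLUDED:
        body_html = pattern.sub("", body_html)
    return body_html
```

An opening tag without its closing counterpart matches nothing here, so the content after it would remain in the result.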
Learn more about Using Exclusion Meta Robots