Crawler Rules

Basic Crawler Rules

  • When SearchBlox starts crawling from the root path, robots.txt is checked first, then the allow and disallow paths and the allowed formats are verified, then the meta robots tags are checked, and finally other settings such as redirects and remove duplicates are applied based on the HTTP collection settings.
  • Within SearchBlox, crawling can be controlled through the paths, file formats, and settings specified in the HTTP collection.
  • From the website side, crawling can be controlled using robots.txt, canonical tags, and robots meta tags.

Rules for Robots.txt

  • Rules specified in robots.txt are honored by the crawler by default, because the Ignore Robots setting is disabled by default (a sample robots.txt is shown after this list).
  • If you do not want the crawler to follow robots.txt, enable Ignore Robots in the HTTP collection settings.
  • As mentioned in Basic Crawler Rules, the rules specified in robots.txt take the highest precedence over the other rules and settings specified in the HTTP collection.
  • Learn more about Using Robots.txt in HTTP Collection
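
For example, SearchBlox honors robots.txt rules like the following (the paths shown are only illustrative) unless Ignore Robots is enabled in the collection:
    User-agent: *
    Disallow: /private/
    Disallow: /tmp/
    Allow: /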

Rules for Canonical

  • Canonical rules are ignored by default, because the Ignore Canonical setting is enabled by default in the collection settings.
  • If you want canonical URLs to be indexed, disable Ignore Canonical in the collection settings (a sample canonical tag is shown after this list).
  • When a canonical link exists but is empty, for example <link href="" rel="canonical">, the actual crawled URL is used; this is equivalent to enabling Ignore Canonical.
  • When a URL matches the allow path but its canonical URL matches the disallow path (or does not match the allow path), the page is still indexed, because the allow/disallow check is applied to the base URL that is crawled. The indexed entry, however, uses the canonical URL even if that URL falls under the disallow path. Disallow paths are used for crawling purposes only; when canonical tags are honored, only the URL of the indexed document is replaced.
  • Learn more about Using Canonical Tags in HTTP Collection
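
A typical canonical tag, placed in the head of the page, looks like the following (the URL is only an example); when Ignore Canonical is disabled, the indexed entry uses this URL in place of the crawled URL:
    <link rel="canonical" href="https://www.example.com/products/page-1">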

Rules for meta robots

  • Crawling rules based on meta robots are applied after robots.txt and the path settings specified in the HTTP collection (see the placement example after this list).
  • Using meta robots, you can prevent a page from being indexed while still allowing its links to be crawled, and the other way around: allow the page to be indexed while preventing its links from being crawled.
  • To avoid indexing a page but allow crawling of its links, specify the following meta tag in the page that is crawled:
    <meta name="robots" content="noindex, follow">
  • To avoid crawling of its links but allow indexing of the page, specify the following meta tag:
    <meta name="robots" content="index, nofollow">
  • To avoid both indexing and crawling, specify the following meta tag:
    <meta name="robots" content="noindex, nofollow">
  • Learn more about Using Meta Robots
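
The robots meta tag must be placed in the head section of the page; a minimal sketch of a page that is followed but not indexed:
    <html>
      <head>
        <title>Example Page</title>
        <meta name="robots" content="noindex, follow">
      </head>
      <body>
        <!-- Page content: excluded from the index, but links here are still followed -->
        <a href="http://example.com/other-page.html">Other page</a>
      </body>
    </html>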

Rules for sitemaps

  • When Ignore Sitemaps is disabled, only the sitemaps are indexed; that is, only the links listed in http://example.com/sitemap.xml are indexed.
  • By default, Ignore Sitemaps is enabled in the HTTP collection settings.
  • Multiple sitemaps can be indexed by providing multiple sitemap URLs as root paths.
  • Spider depth and other HTTP settings are not applicable to sitemaps.
  • SearchBlox supports standard XML sitemaps. Compressed sitemaps (XML files with tar or gzip file extensions) are not currently supported (a minimal sitemap example is shown after this list).
  • Learn more on Using Sitemaps
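
A standard XML sitemap follows the sitemaps.org protocol; a minimal example listing two URLs (the URLs and date are only illustrative) looks like this:
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://example.com/</loc>
        <lastmod>2024-01-01</lastmod>
      </url>
      <url>
        <loc>http://example.com/about.html</loc>
      </url>
    </urlset>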

Rules for stopindex, noindex, google tags

  • Body content enclosed within stopindex tags, noindex tags, or googleoff/googleon tags will not be included in the index (see the sketch after this list).
  • These tags are applicable only to body content; they have no effect in the head section or in meta tags.
  • These tags should not be nested.
  • A stopindex tag should be followed by a startindex tag, a noindex start tag by a noindex end tag, and a googleoff tag by a googleon tag.
  • Please check the standards for these tags on Wikipedia or Google.
  • Learn more about Using Exclusion Meta Robots
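
A sketch of how these exclusion tags are commonly written inside the body (verify the exact comment syntax against the linked documentation before relying on it); the content between each opening and closing marker is excluded from the index, while the rest of the body is indexed:
    <body>
      <p>This text is indexed.</p>
      <!-- stopindex -->
      <p>This text is excluded from the index.</p>
      <!-- startindex -->
      <!--noindex-->
      <p>This text is also excluded.</p>
      <!--/noindex-->
      <!--googleoff: index-->
      <p>This text is excluded as well.</p>
      <!--googleon: index-->
      <p>This text is indexed again.</p>
    </body>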