Crawler Rules

Basic Crawler Rules

When SearchBlox initiates crawling from a specified root path, it follows this sequential verification process:

  1. Checks the robots.txt file
  2. Verifies allowed and disallowed paths
  3. Validates supported file formats
  4. Examines meta robots tags
  5. Applies additional settings (redirects, duplicate removal, etc.) according to WEB collection configuration

Crawling can be managed directly in SearchBlox through:

  1. WEB collection paths specification
  2. File format settings
  3. Collection-specific configuration parameters

Crawling can also be controlled at the webpage level using:

  1. robots.txt directives
  2. Canonical tags
  3. Robot meta tags

Rules for Robots.txt

  • By default, the crawler follows rules in robots.txt because Ignore Robots is disabled.
  • To ignore robots.txt, enable Ignore Robots in the WEB collection settings.
  • While Ignore Robots is disabled, rules in robots.txt take the highest priority and override all other WEB collection settings and rules.
  • Learn more about Using Robots.txt in HTTP Collection
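
As an illustration, a minimal robots.txt that the crawler would honor by default might look like the following. The paths and the searchblox user-agent token are hypothetical; check your server's actual robots.txt for the real rules.

```
# Hypothetical robots.txt served at https://example.com/robots.txt
User-agent: *
Disallow: /private/    # skipped while Ignore Robots is disabled
Allow: /public/

# Rules for a specific user agent take precedence for that crawler
User-agent: searchblox
Disallow: /drafts/
```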

Rules for Canonical

  • By default, canonical rules are ignored because Ignore Canonical is enabled in collection settings.

  • To index canonical URLs, disable Ignore Canonical in the collection settings.

  • If a canonical link is empty, e.g., <link href="" rel="canonical">, the crawled URL is used instead, which is equivalent to ignoring the canonical tag.

  • If a URL matches the allow path but its canonical URL matches a disallow path (or doesn’t match the allow path), the URL will still be indexed.

    • The base URL that matches the allow path is the one that is crawled and fetched.
    • The canonical URL then replaces the crawled URL in the index, even if it falls under a disallow path.
    • Disallow paths apply only to crawling; they do not prevent a canonical URL from replacing the crawled URL in the index.
  • Learn more about Using Canonical Tags in HTTP Collection
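
For reference, a typical non-empty canonical tag looks like this (the URLs are hypothetical):

```html
<!-- In the <head> of https://example.com/product?sort=price -->
<!-- With Ignore Canonical disabled, the page is indexed under the canonical URL -->
<link rel="canonical" href="https://example.com/product">
```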

Rules for meta robots

  • Meta robots rules are applied after robots.txt and path settings in WEB collections.
  • Meta robots can be used to allow crawling but prevent indexing, or allow indexing but prevent crawling.
  • To avoid indexing but allow crawling, add this meta tag to the page:
    <meta name="robots" content="noindex, follow">
  • To avoid crawling but allow indexing, use:
    <meta name="robots" content="index, nofollow">
  • To avoid both indexing and crawling, use:
    <meta name="robots" content="noindex, nofollow">
  • Learn more about Using Meta Robots

Rules for sitemaps

  • If Ignore Sitemaps is disabled, only the links listed in the sitemap (e.g., https://example.com/sitemap.xml) will be crawled and indexed.
  • By default, Ignore Sitemaps is enabled in HTTP collection settings.
  • Multiple sitemaps can be indexed by providing multiple URLs in the root path.
  • Spider depth and other HTTP settings do not apply to sitemaps.
  • SearchBlox supports standard XML sitemaps, but compressed XML files (tar or gzip) are not supported.
  • Learn more on Using Sitemaps
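
A standard uncompressed XML sitemap of the kind described above might look like this (the URLs and dates are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
```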

Rules for stopindex, noindex, google tags

  • Content wrapped in stopindex, noindex, or googleoff/googleon tags will not be indexed.
  • These tags cannot be used in the head section or in meta tags.
  • These tags should not be nested inside each other.
  • Each tag must be properly closed: stopindex → startindex, noindex → noindex end, googleoff → googleon.
  • Check the published conventions for these tags on Wikipedia or in Google's documentation.
  • Learn more about Using Exclusion Meta Robots
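
Assuming the exclusion tags are written as HTML comments (verify the exact syntax in the SearchBlox documentation), a page using them might look like the sketch below. Note that each tag is properly closed and none are nested, as required above.

```html
<body>
  <p>This content is indexed.</p>

  <!-- stopindex -->
  <p>Navigation or footer text excluded from the index.</p>
  <!-- startindex -->

  <!--googleoff: index-->
  <p>Also excluded from the index.</p>
  <!--googleon: index-->
</body>
```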