Rules for URL Pattern

In the Web Collection, there are three important fields in the path settings:

  • Root Path
  • Allow Path
  • Disallow Path

Root Path

  • The Root Path is the exact URL where the crawler starts, also called the start URL.
  • Do not use URL patterns in the Root Path.
  • The Root Path must be a complete URL with the protocol (http:// or https://). It should work when pasted in a browser. Incomplete or relative URLs will not work.
  • If the Root Path redirects to another URL, use the redirected URL in the Root Path or in the Allow Path for proper indexing.
  • Only HTTP and HTTPS protocols are supported. Other protocols (e.g., googleconnector://) are not supported.
    Example:
    http://www.searchblox.com
    https://www.searchblox.com
  • Regex patterns are not allowed in the Root Path.
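The Root Path rules above can be sketched as a simple validation check. This is an illustrative sketch, not SearchBlox code; the function name `is_valid_root_path` is a hypothetical helper.

```python
from urllib.parse import urlparse

def is_valid_root_path(url):
    """Check that a Root Path is a complete http(s) URL with no regex metacharacters."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # only HTTP and HTTPS protocols are supported
    if not parsed.netloc:
        return False  # must be a complete URL, not a relative path
    # regex patterns are not allowed in the Root Path
    return "$" not in url and "*" not in url

print(is_valid_root_path("https://www.searchblox.com"))   # True
print(is_valid_root_path("googleconnector://host/path"))  # False
print(is_valid_root_path("/blog/"))                       # False
```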

Allow Path

Allow Path lets you include specific domains, paths, or patterns for crawling and indexing:

  • Provide the complete path to include a domain. Example (include the whole site):
    https://www.searchblox.com/
  • Provide a subpath or folder to include that subpath in indexing. Example (include a specific folder):
    /blog/
  • Provide a suffix pattern to limit indexing to a particular ending of the URL string. Example (limit to specific endings):
    pdf$
    com$
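The Allow Path entries above behave like regular expressions matched against each discovered URL. A minimal sketch of that matching, assuming a URL is crawled if it matches at least one pattern (the function name `is_allowed` is hypothetical):

```python
import re

# Allow Path patterns taken from the examples above
allow_paths = ["/blog/", r"pdf$"]

def is_allowed(url, patterns):
    """A URL qualifies for crawling if it matches at least one Allow Path pattern."""
    return any(re.search(p, url) for p in patterns)

print(is_allowed("https://www.searchblox.com/blog/post-1", allow_paths))    # True
print(is_allowed("https://www.searchblox.com/docs/guide.pdf", allow_paths)) # True
print(is_allowed("https://www.searchblox.com/about", allow_paths))          # False
```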

Disallow Path

Disallow Path lets you exclude specific domains, paths, or patterns from crawling and indexing:

  • Provide the complete path to exclude a domain. Example:
    https://www.searchblox.com/
  • Provide a subpath or folder to exclude that subpath from indexing. Example:
    /blog/
  • Provide a suffix pattern to exclude URLs with a particular ending. Example:
    pdf$
    com$
  • Comment character: # can be used at the start of a line in the Disallow Path to mark it as a comment. Example:
    #comment
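The Disallow Path entries, including comment lines, can be sketched the same way: lines beginning with # are skipped, and a URL matching any remaining pattern is excluded. This is an illustrative sketch of the assumed semantics, and `is_disallowed` is a hypothetical helper.

```python
import re

# Disallow Path entries as they might appear in the settings box;
# lines starting with "#" are comments and are ignored
disallow_entries = [
    "# exclude the blog and all PDFs",
    "/blog/",
    r"pdf$",
]

# drop comment lines before matching
patterns = [e for e in disallow_entries if not e.startswith("#")]

def is_disallowed(url):
    """A URL is excluded from crawling if it matches any Disallow Path pattern."""
    return any(re.search(p, url) for p in patterns)

print(is_disallowed("https://www.searchblox.com/blog/post-1"))  # True  (excluded)
print(is_disallowed("https://www.searchblox.com/about"))        # False (crawled)
```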

🚧

Allow and Disallow Paths support the basic regular expressions provided by the GNU regular expression library:
https://www.gnu.org/software/gnulib/manual/html_node/Regular-expressions.html