Rules for URL Pattern

A Web collection has three important fields in its path settings:

  • Root Path
  • Allow Path
  • Disallow Path

Root Path

  • The Root Path specifies the exact URL where the crawler will begin its operation. This is also referred to as the "start URL" and serves as the entry point for all crawling activities.
  • URL patterns should not be provided in the root path.
  • The Root Path must contain a complete URL, including the protocol (http:// or https://). This URL should be fully functional - when copied and pasted into a browser's address bar, it should directly access the intended starting page for the crawl operation. Incomplete URLs or relative paths will not function correctly as Root Path entries.
  • If the root path redirects to another URL, enter the redirected URL as the root path, or add it to the allow path, so that indexing succeeds.
  • SearchBlox supports only the HTTP and HTTPS protocols in a Web collection. No other protocol is supported.
    Examples of supported root paths:
    http://www.searchblox.com
    https://www.searchblox.com
    Example of an unsupported root path:
    googleconnector://
  • Regex patterns cannot be provided in the root path. Regex prefixes are not supported.
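
The Root Path rules above can be checked before a URL is entered. The following is a minimal sketch in Python, not SearchBlox code: the function name is_valid_root_path is hypothetical, and the check only covers the protocol and host requirements, not redirects or regex characters.

```python
from urllib.parse import urlparse


def is_valid_root_path(url: str) -> bool:
    """Return True if url is a complete http:// or https:// URL with a host.

    Hypothetical helper for illustration; it does not follow redirects
    and does not detect regex patterns in the URL.
    """
    parsed = urlparse(url.strip())
    # The scheme must be http or https, and a host must be present.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in (
        "https://www.searchblox.com",
        "www.searchblox.com",
        "/blog/",
        "googleconnector://",
    ):
        print(candidate, is_valid_root_path(candidate))
```

A URL that passes this check should still be pasted into a browser to confirm that it opens the intended start page.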

Allow Path

The Allow Path ensures that crawling and indexing include a particular domain or path, based on the path or regex provided.

Examples

  • To include a domain, provide the complete path:
    https://www.searchblox.com/
  • To include a subpath or folder in indexing, provide that subpath:
    /blog/
  • To limit indexing to URLs with a particular suffix (end of the URL string):
    pdf$
    com$
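
To see how such allow patterns could select URLs, here is a rough Python sketch. It assumes that each pattern is searched for anywhere in the URL and that one matching pattern is enough to include the URL; the exact matching behavior inside SearchBlox is not documented here. The function name is_allowed is hypothetical, and Python's re module is used in place of the GNU regular expression libraries, which differ in syntax details but agree on these simple patterns.

```python
import re


def is_allowed(url: str, allow_patterns: list[str]) -> bool:
    """Return True if any allow pattern matches somewhere in the URL.

    Assumption: unanchored search, so "pdf$" matches only at the end of
    the URL and "/blog/" matches anywhere in it.
    """
    return any(re.search(pattern, url) for pattern in allow_patterns)


if __name__ == "__main__":
    patterns = ["/blog/", "pdf$"]
    for url in (
        "https://www.searchblox.com/blog/post-1",
        "https://www.searchblox.com/docs/guide.pdf",
        "https://www.searchblox.com/docs/guide.html",
    ):
        print(url, is_allowed(url, patterns))
```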

Disallow Path

The Disallow Path ensures that crawling and indexing exclude a particular domain or path, based on the path or regex provided.

Examples

  • To exclude a domain, provide the complete path:
    https://www.searchblox.com/
  • To exclude a subpath or folder from indexing, provide that subpath:
    /blog/
  • To exclude URLs with a particular suffix (end of the URL string):
    pdf$
    com$
  • The # character can be used as a comment character in the disallow path:
    #comment
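
The disallow entries, including # comments, could be handled roughly as follows. This Python sketch is illustrative only: it assumes one pattern per line, treats lines that start with # as comments, and searches each pattern anywhere in the URL. The function names parse_patterns and is_disallowed are hypothetical, and Python's re module stands in for the GNU regular expression libraries.

```python
import re


def parse_patterns(text: str) -> list[str]:
    """Split a disallow-path field into patterns, skipping comments.

    Assumption: one pattern per line; blank lines and lines that start
    with "#" are ignored.
    """
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns


def is_disallowed(url: str, disallow_patterns: list[str]) -> bool:
    """Return True if any disallow pattern matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in disallow_patterns)


if __name__ == "__main__":
    field = "#comment\n/blog/\npdf$\n"
    patterns = parse_patterns(field)
    print(patterns)
    print(is_disallowed("https://www.searchblox.com/blog/post-1", patterns))
    print(is_disallowed("https://www.searchblox.com/products", patterns))
```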

Note: Allow and disallow paths can use basic regular expressions, as defined in the GNU regular expression libraries:
https://www.gnu.org/software/gnulib/manual/html_node/Regular-expressions.html