Rules for URL Pattern

A SearchBlox HTTP Collection has three important fields in its path settings:

  • Root Path
  • Allow Path
  • Disallow Path

Root Path

  • The root path must be the exact root URL where crawling starts. It is also called the start URL.
  • URL patterns should not be provided in the root path.
  • The root path must be the complete URL, including the protocol. Pasting it into a browser's address bar should open the page where crawling is supposed to start.
  • If the root path redirects to another URL, give the redirected URL as the root path, or add it to the allow path, so that indexing succeeds.
  • SearchBlox supports only the HTTP and HTTPS protocols in a WEB collection. No other protocol is supported.
    Example:
    SearchBlox supports:
    http://www.searchblox.com
    https://www.searchblox.com
    SearchBlox does not support:
    googleconnector://
  • Regex patterns cannot be provided in the root path, and regex prefixes are not supported.
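
The root path rules above can be sketched as a small pre-check to run before saving a collection. This is only an illustration in Python, not SearchBlox code: the function name is_valid_root_path is hypothetical, and the test for regex characters is a rough heuristic chosen for this example.

```python
from urllib.parse import urlsplit

# Only these protocols are accepted for a root path, per the rules above.
SUPPORTED_SCHEMES = {"http", "https"}

# Heuristic (an assumption of this sketch): these characters suggest a
# regex pattern rather than an exact start URL.
REGEX_HINTS = ("^", "$", "*")


def is_valid_root_path(root_path: str) -> bool:
    """Return True if root_path looks like a complete http(s) URL."""
    candidate = root_path.strip()
    parts = urlsplit(candidate)
    # The protocol must be present and must be HTTP or HTTPS.
    if parts.scheme.lower() not in SUPPORTED_SCHEMES:
        return False
    # A complete URL also needs a host name.
    if not parts.hostname:
        return False
    # Patterns are not allowed in the root path.
    return not any(hint in candidate for hint in REGEX_HINTS)
```

With this sketch, http://www.searchblox.com and https://www.searchblox.com pass, while googleconnector://, a bare www.searchblox.com, and a URL ending in pdf$ are rejected. It does not follow redirects, so a redirected root path still has to be checked by hand.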

Allow Path

The allow path ensures that crawling/indexing includes a particular domain or path, based on the path or regex provided.

Example

  • Provide the complete path to include a domain: www.searchblox.com
  • Provide a subpath or folder to include that subpath in indexing: /blog/
  • To limit indexing to a particular suffix or end of the URL string: pdf$ or com$
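
To see how entries like these behave, the following Python sketch tests URLs against a list of allow patterns. It assumes each pattern is applied as an unanchored regular-expression search over the full URL. That is an assumption made for illustration, not a description of SearchBlox internals, and Python's re module is not identical to GNU basic regular expressions. The function name is_allowed is hypothetical.

```python
import re
from typing import Iterable


def is_allowed(url: str, allow_patterns: Iterable[str]) -> bool:
    """Return True if any allow pattern is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in allow_patterns)


# Limit indexing to one folder.
blog_only = ["/blog/"]
in_blog = is_allowed("https://www.searchblox.com/blog/post", blog_only)
outside_blog = is_allowed("https://www.searchblox.com/about", blog_only)

# Limit indexing to URLs that end in "pdf".
pdf_only = ["pdf$"]
ends_in_pdf = is_allowed("https://www.searchblox.com/docs/guide.pdf", pdf_only)
has_query = is_allowed("https://www.searchblox.com/docs/guide.pdf?v=2", pdf_only)
```

Under these assumptions, in_blog and ends_in_pdf are true, while outside_blog and has_query are false. The last case shows that a $ suffix pattern stops matching once a query string follows the file name.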

Disallow Path

The disallow path ensures that crawling/indexing excludes a particular domain or path, based on the path or regex provided.

Example

  • Provide the complete path to exclude a domain: www.searchblox.com or https://www.searchblox.com
  • Provide a subpath or folder to exclude that subpath from indexing: /blog/
  • To exclude a particular suffix or end of the URL string from indexing: pdf$ or com$
  • Comment character: # can be used as a comment character in the disallow path, for example #comment
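
A disallow list with comment lines could be read and applied roughly as follows. This Python sketch assumes one entry per line, that a line starting with # is a comment, and that each pattern is applied as an unanchored regex search. These are assumptions made for illustration, and the names load_patterns and is_disallowed are hypothetical.

```python
import re
from typing import Iterable, List


def load_patterns(lines: Iterable[str]) -> List[str]:
    """Keep non-empty entries and drop lines that start with '#'."""
    patterns = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # blank line or comment
        patterns.append(entry)
    return patterns


def is_disallowed(url: str, patterns: Iterable[str]) -> bool:
    """Return True if any disallow pattern is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Sample field contents: a comment, a folder to skip, and a suffix to skip.
disallow_field = "#comment\n/blog/\npdf$\n"
patterns = load_patterns(disallow_field.splitlines())
```

Here patterns holds only /blog/ and pdf$, so is_disallowed would skip https://www.searchblox.com/blog/post and any URL ending in pdf, and keep https://www.searchblox.com/about.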

Basic regular expressions, as supported by the GNU regular expression libraries, can be used in the allow and disallow paths.
https://www.gnu.org/software/gnulib/manual/html_node/Regular-expressions.html