Rules for URL Patterns
In the Web Collection, there are three important fields in the path settings:
- Root Path
- Allow Path
- Disallow Path
Root Path
- The Root Path is the exact URL where the crawler starts, also called the start URL.
- Do not use URL patterns in the Root Path.
- The Root Path must be a complete URL with the protocol (http:// or https://). It should work when pasted into a browser. Incomplete or relative URLs will not work.
- If the Root Path redirects to another URL, use the redirected URL in the Root Path or in the Allow Path for proper indexing.
- Only HTTP and HTTPS protocols are supported. Other protocols (e.g., googleconnector://) are not supported.
- Regex patterns are not allowed in the Root Path.
Examples:
http://www.searchblox.com
https://www.searchblox.com
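The Root Path rules above can be sketched as a small validation check. This is a hypothetical helper for illustration only (`is_valid_root_path` is not part of SearchBlox); it assumes that rejecting common regex metacharacters is a reasonable proxy for the "no regex patterns" rule:

```python
from urllib.parse import urlparse

def is_valid_root_path(url: str) -> bool:
    """Check a candidate Root Path against the rules above (illustrative only)."""
    parsed = urlparse(url)
    # Rule: only HTTP and HTTPS protocols are supported.
    if parsed.scheme not in ("http", "https"):
        return False
    # Rule: must be a complete URL, so a hostname is required
    # (relative URLs like "/blog/" have no hostname).
    if not parsed.netloc:
        return False
    # Rule: regex patterns are not allowed -- reject common regex metacharacters.
    if any(ch in url for ch in "$*^[]()|"):
        return False
    return True

print(is_valid_root_path("https://www.searchblox.com"))   # complete HTTPS URL: valid
print(is_valid_root_path("googleconnector://host/path"))  # unsupported protocol: invalid
print(is_valid_root_path("/blog/"))                       # relative URL: invalid
```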
Allow Path
Allow Path lets you include specific domains, paths, or patterns for crawling and indexing:
| Rule | Example |
|---|---|
| Provide the complete path to include a domain | Include the whole site: https://www.searchblox.com/ |
| Provide a subpath or folder to include that subpath in indexing | Include a specific folder: /blog/ |
| Provide a suffix pattern to limit indexing to a particular suffix or end of the URL string | Limit to specific endings: pdf$ com$ |
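The Allow Path patterns in the table above can be tried out with Python's `re` module, which behaves like the GNU regex library for simple patterns such as these. This is a sketch, not SearchBlox's actual matching code:

```python
import re

# Example Allow Path entries from the table above.
allow_patterns = ["/blog/", "pdf$"]

def is_allowed(url: str) -> bool:
    # re.search matches anywhere in the URL, so "/blog/" acts as a
    # substring (folder) match, while "pdf$" anchors to the end of the URL.
    return any(re.search(pattern, url) for pattern in allow_patterns)

print(is_allowed("https://www.searchblox.com/blog/post-1"))     # folder match
print(is_allowed("https://www.searchblox.com/docs/guide.pdf"))  # suffix match
print(is_allowed("https://www.searchblox.com/about.html"))      # no match
```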
Disallow Path
Disallow Path lets you exclude specific domains, paths, or patterns from crawling and indexing:
| Rule | Example |
|---|---|
| Provide the complete path to exclude a domain | https://www.searchblox.com/ |
| Provide a subpath or folder to exclude that subpath from indexing | /blog/ |
| Provide a suffix pattern to exclude a particular suffix or end of the URL string | pdf$ com$ |
| Use # as a comment character in the Disallow Path | #comment |
The Allow Path and Disallow Path support the basic regular expressions provided by the GNU regular expression library:
https://www.gnu.org/software/gnulib/manual/html_node/Regular-expressions.html
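Putting the two fields together, a crawler decision for one URL can be sketched as: skip comment lines in the Disallow Path, reject anything a disallow pattern matches, then accept anything an allow pattern matches. The entries below are illustrative, and Python's `re` stands in for the GNU regex library for these simple patterns:

```python
import re

# Illustrative Allow/Disallow entries (not from a real collection).
allow_entries = [r"https://www\.searchblox\.com/"]
disallow_entries = [
    "#skip the archive folder",  # lines starting with # are comments
    "/archive/",
    "xml$",
]

def should_crawl(url: str) -> bool:
    # Drop comment lines (the # comment character from the table above).
    disallow = [p for p in disallow_entries if not p.startswith("#")]
    # Disallow patterns take precedence over allow patterns.
    if any(re.search(p, url) for p in disallow):
        return False
    return any(re.search(p, url) for p in allow_entries)

print(should_crawl("https://www.searchblox.com/blog/post"))     # allowed
print(should_crawl("https://www.searchblox.com/archive/2020"))  # excluded folder
print(should_crawl("https://www.searchblox.com/sitemap.xml"))   # excluded suffix
```

Evaluating disallow patterns first mirrors the usual convention that exclusions override inclusions.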