Using Canonical and Robots.txt
Using Canonical URLs
- Canonical URLs help avoid duplicate content and consolidate link signals and ranking for content available through multiple URLs.
- The same content may be accessible via different URLs or even different domains.
- By default, canonical tags are ignored.
- To index canonical URLs, set Ignore Canonical to No in WEB collection settings.
About Canonical Tags
- Use canonical tags when the same page is available under multiple sections or URLs.
https://sample.com/tea/
https://sample.com/hot-beverage/tea
https://sample.com/drinks/tea
- The same content is served with and without the www subdomain, or over both http and https.
https://sample.com/tea/
http://sample.com/tea
https://www.sample.com/tea
- Content is partially or fully replicated on different pages or sites within the domain.
https://discussion.sample.com/tea/
https://blog.sample.com/tea
https://www.sample.com/tea
This replication can cause repetitive search results in SearchBlox. To fix this, define a Canonical URL for pages with the same or similar content.
- Choose the most preferred page as the canonical URL.
- Mark the preferred URL using the rel="canonical" link element.
<link rel="canonical" href="https://www.sample.com/tea" />
During indexing, SearchBlox indexes only the canonical URL.
Important Information:
- To avoid errors, use an absolute path rather than a relative path in canonical URLs.
- Please note that while we encourage you to use any of these methods, none of them is mandatory.
- If you don't indicate a canonical URL, SearchBlox treats the current URL as the best version.
ref: https://support.google.com/webmasters/answer/139066?hl=en
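Before crawling, it can be useful to verify that a page actually declares the canonical URL you expect. A minimal sketch using Python's standard-library HTML parser (the page markup and function name here are hypothetical, not part of SearchBlox):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags also reach this handler.
        if tag == "link" and self.canonical is None:
            d = dict(attrs)
            if d.get("rel") == "canonical":
                self.canonical = d.get("href")

def find_canonical(html: str):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical

page = '<html><head><link rel="canonical" href="https://www.sample.com/tea" /></head></html>'
print(find_canonical(page))  # https://www.sample.com/tea
```

If the function returns None, the page has no canonical tag and SearchBlox will fall back to the current URL, as noted above.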
Using Robots.txt
- During HTTP crawling, robots.txt rules are applied first, before path settings.
- To ignore robots.txt, enable Ignore Robots in the WEB collection settings.
- Rules in robots.txt have the highest priority and override all other collection settings.
About Robots.txt
- The robots.txt file is usually found at https://www.example.com/robots.txt.
- It tells crawling bots which pages they can or cannot crawl.
- By default, the crawler follows robots.txt because Ignore Robots is disabled.
- User-agent: * applies the rules to all robots.
- Disallow: / means the robot cannot visit any pages on the site.
- Disallow: (empty) means the robot can visit all pages.
- Sitemap rules in robots.txt are followed when the Sitemap setting in the WEB collection is enabled.
- Sitemap: specifies which sitemaps the robot can use. Nested sitemaps are also supported.
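The effect of these directives can be checked with Python's standard-library urllib.robotparser. A minimal sketch, with a hypothetical robots.txt and example URLs:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block /private/ for all robots, advertise a sitemap.
rules = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse() accepts an iterable of lines

print(rp.can_fetch("*", "https://www.example.com/private/page"))  # False
print(rp.can_fetch("*", "https://www.example.com/tea"))           # True
print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```

With Ignore Robots disabled, the crawler behaves like can_fetch here: URLs under a Disallow rule are skipped, everything else is eligible for indexing.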