Using Canonical and Robots.txt

Using Canonical URLs

  • Canonical URLs help avoid duplicate results and consolidate link signals and ranking for content that is available through multiple URLs.
  • The same content may be accessible via different URLs or even different domains.
  • By default, canonical tags are ignored.
  • To index canonical URLs, set Ignore Canonical to No in WEB collection settings.

About Canonical Tags

  1. Use canonical tags when the same page is available under multiple sections or URLs.
https://sample.com/tea/

https://sample.com/hot-beverage/tea

https://sample.com/drinks/tea
  2. The same content is served on the www subdomain or over both the http and https protocols.
https://sample.com/tea/

http://sample.com/tea

https://www.sample.com/tea
  3. Content is partially or fully replicated on different pages or sites within the domain.
https://discussion.sample.com/tea/

https://blog.sample.com/tea

https://www.sample.com/tea

This replication can cause repetitive search results in SearchBlox. To fix this, define a Canonical URL for pages with the same or similar content.

  1. Choose the most preferred page as the canonical URL.
  2. Mark the preferred URL using the rel="canonical" link element.
<link rel="canonical" href="https://www.sample.com/tea" />

When the site is crawled, SearchBlox indexes only the canonical URL.
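As an illustration of how a crawler can read the rel="canonical" link element described above, here is a minimal sketch using Python's standard library. This is not SearchBlox's actual implementation; the `CanonicalParser` class and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class CanonicalParser(HTMLParser):
    """Collect the href of a <link rel="canonical"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; tag/attr names are lowercased
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

html = '<html><head><link rel="canonical" href="https://www.sample.com/tea" /></head></html>'
p = CanonicalParser()
p.feed(html)
print(p.canonical)  # https://www.sample.com/tea
```

A crawler honoring canonical tags would then index the extracted URL instead of the URL it fetched.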

🚧

Important Information:

  • To avoid errors, use an absolute path rather than a relative path in canonical URLs.
  • None of these methods is mandatory, though using one of them is encouraged.
  • If you don't indicate a canonical URL, SearchBlox identifies the current URL as the best version.
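To illustrate the absolute-path recommendation above (sample.com is a placeholder domain):

```html
<!-- Preferred: absolute URL -->
<link rel="canonical" href="https://www.sample.com/tea" />

<!-- Avoid: relative path, which may be resolved against the wrong base URL -->
<link rel="canonical" href="/tea" />
```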

ref: https://support.google.com/webmasters/answer/139066?hl=en

Using Robots.txt

  • Robots.txt rules are applied during HTTP crawling, after the path settings.
  • To ignore robots.txt, enable Ignore Robots in the WEB collection settings.
  • Rules in robots.txt have the highest priority and override all other collection settings.

About Robots.txt

  • The robots.txt file is usually found at https://www.example.com/robots.txt.
  • It tells crawling bots what pages they can or cannot crawl.
  • By default, the crawler follows robots.txt because Ignore Robots is disabled.
  • User-agent: * applies the rules to all robots.
  • Disallow: / means the robot cannot visit any pages on the site.
  • Disallow: (empty) means the robot can visit all pages.
  • Sitemap rules in robots.txt are followed when the Sitemap setting in the WEB collection is enabled.
  • Sitemap: specifies which sitemaps the robot can use. Nested sitemaps are also supported.
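The User-agent, Disallow, and Sitemap directives above can be sketched with Python's standard-library robots.txt parser. The rules string below is a hypothetical robots.txt, not one from a real site:

```python
import urllib.robotparser

# Hypothetical robots.txt: all robots may crawl everything except /private/,
# and a sitemap is advertised for crawlers that support it.
rules = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://www.example.com/tea"))                # True
```

With `Disallow: /` the first check would also fail for every path, and with an empty `Disallow:` both checks would pass, matching the rules described above.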