Using Canonical and Robots.txt
Using Canonical URLs
- You can use canonical URLs to avoid duplicate content and to consolidate link and ranking signals for content that is available through multiple URL structures.
- The same content may be accessible through multiple URLs, and content can also be distributed across different URLs or domains entirely.
- By default, canonical tags are ignored.
- To enable canonical URL indexing, set Ignore Canonical to No in the WEB collection settings.
About Canonical Tags
- Your site has multiple URLs because the same page appears under multiple sections:
http://sample.com/tea/
http://sample.com/hot-beverage/tea
http://sample.com/drinks/tea
- You serve the same content on the www subdomain or over both the http and https protocols:
http://sample.com/tea/
https://sample.com/tea
http://www.sample.com/tea
- Content is replicated partially or fully on different pages or sites within the domain:
http://discussion.sample.com/tea/
http://blog.sample.com/tea
http://www.sample.com/tea
Such replication can create problems for users searching in SearchBlox: results show repetitive content pointing to different pages. To overcome this, you can define a canonical URL for pages whose content is identical or equivalent across multiple URLs.
- Set the most preferred URL as the canonical URL.
- Indicate the preferred URL with the rel="canonical" link element as shown:
<link rel="canonical" href="http://www.sample.com/tea" />
When indexing the URL, SearchBlox will index only the URL that has been specified as canonical.
- To avoid errors, use an absolute URL rather than a relative path in canonical URLs.
- Note that while we encourage you to use any of these methods, none of them is mandatory.
- If you do not indicate a canonical URL, SearchBlox will treat the current URL as the best version.
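To make the canonical-resolution behavior described above concrete, here is a minimal sketch in Python of how a crawler could extract a page's canonical URL and fall back to the page's own URL when none is declared. This is an illustration, not SearchBlox's actual implementation; it also resolves relative hrefs against the page URL, though absolute URLs are recommended as noted above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class CanonicalParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            a = dict(attrs)
            if a.get("rel", "").lower() == "canonical" and a.get("href"):
                self.canonical = a["href"]

def canonical_url(page_url, html):
    """Return the canonical URL declared in html, or page_url if none exists."""
    parser = CanonicalParser()
    parser.feed(html)
    # Resolve relative hrefs against the page URL; absolute hrefs pass through.
    return urljoin(page_url, parser.canonical) if parser.canonical else page_url

page = '<html><head><link rel="canonical" href="http://www.sample.com/tea" /></head></html>'
print(canonical_url("http://sample.com/drinks/tea", page))  # http://www.sample.com/tea
```

In this sketch, indexing http://sample.com/drinks/tea would record http://www.sample.com/tea as the only URL to index, which mirrors the behavior described above.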
Using Robots.txt
- robots.txt is the basic rule set considered in HTTP crawling, after the path settings.
- If you do not want the crawler to follow robots.txt, enable Ignore Robots in the WEB collection settings.
- The rules specified in robots.txt take the highest precedence over other rules or settings specified in the WEB collection.
- robots.txt can be found at the root URL of the site, e.g., http://sample.com/robots.txt
- robots.txt is used by a website to instruct crawling bots on what can and cannot be crawled.
- Rules specified in robots.txt are honored by the crawler by default, because the Ignore Robots field is disabled by default.
User-agent: * means that the Disallow section that follows applies to all robots.
Disallow: / tells the robot that it should not visit any pages on the site.
Disallow: (with no value) tells the robot that it can visit all pages on the site.
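Putting these directives together, a robots.txt file might look like the following (the paths and bot name are illustrative examples, not part of any real site):

```
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: BadBot
Disallow: /
```

Here, all robots are told not to crawl anything under /private/ or /tmp/, while the robot identifying itself as BadBot is barred from the entire site.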
- Sitemap rules specified in robots.txt are also considered by the crawler when the Sitemap setting of the WEB collection is enabled.
Sitemap: points the robot to the specified list of sitemap(s) covering the pages it can visit. Nested sitemaps are supported as well.
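As an illustration of how a crawler can honor these rules, Python's standard urllib.robotparser module can parse a robots.txt and answer per-URL fetch questions. The rules below are a made-up example for sample.com, parsed from a string rather than fetched over the network; this is not SearchBlox's implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for sample.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler consults the parsed rules before fetching each URL.
print(parser.can_fetch("*", "http://sample.com/tea/"))       # allowed
print(parser.can_fetch("*", "http://sample.com/private/x"))  # disallowed
```

A crawler that ignores robots (the Ignore Robots setting above) would simply skip this check and fetch every URL matched by its other collection rules.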