Best Practices

Collection Path Settings

  • Provide an allow path to limit indexing to the domain specified in the root path.
  • Provide a disallow path if you need to exclude certain subpaths of the domain from being indexed (see the example path settings after this list).
  • Do not disable the HTML format in allowed formats; indexing is not possible if HTML files are disallowed.
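For example, assuming a site rooted at https://www.example.com (the domain and paths here are illustrative only), the path settings might look like this:

    Root path:       https://www.example.com
    Allow path:      https://www.example.com/docs/
    Disallow path:   https://www.example.com/docs/internal/

With these values the crawler stays within the /docs/ section of the domain and skips everything under /docs/internal/.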

Collection Settings

  • Increase or decrease the spider depth to control how many levels deep the crawler goes.
  • If you want to exclude pages from indexing based on size or age, provide the relevant limits in the settings.
  • If you do not want to limit by size or age, set the value to -1.
  • Provide the relevant User-Agent if your robots.txt applies rules to a specific user agent.
  • If you need canonical URLs to be considered for indexing, disable the ignore canonical option (sample settings follow this list).
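As a rough illustration, a collection with a three-level crawl and no size or age limits might use values like these (the field labels follow the settings described above; the exact names in your console may differ):

    Spider depth:       3          crawl up to three levels below the root path
    Size limit:         -1         no size-based exclusion
    Age limit:          -1         no age-based exclusion
    User-Agent:         MyCrawler/1.0   matches the agent named in your robots.txt
    Ignore canonical:   disabled   canonical URLs are considered for indexing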

HTML Page Rules

  • To exclude a specific portion of a page's content from indexing, use stopindex tags in your webpages (see the snippet after this list).
  • Use robots.txt to control crawler access to subpaths of your site.
  • Use robots meta tags in your webpages to control crawling and indexing.
  • If all your URLs are listed in sitemap.xml, you can index all pages faster by enabling follow sitemaps and then indexing the collection.
  • If your webpages redirect, enable redirects in the collection's settings and also include the redirect target URL in the collection's allow path.
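The snippet below sketches how these page-level controls might look. The robots.txt directives and the robots meta tag are standard web conventions; the stopindex/startindex comment syntax shown here is an assumption and may differ for your crawler.

    robots.txt (controls crawler access to subpaths):

        User-agent: *
        Disallow: /private/
        Allow: /

    Webpage (robots meta tag plus assumed stopindex markers):

        <html>
          <head>
            <!-- "noindex, nofollow" here would exclude the whole page instead -->
            <meta name="robots" content="index, follow">
          </head>
          <body>
            <p>This content is indexed.</p>
            <!--stopindex-->
            <p>This navigation block is excluded from the index.</p>
            <!--startindex-->
          </body>
        </html>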

Schedule Operations

  • When scheduling collections, ensure that indexing starts only after the previously scheduled operation has completed.
  • If you need to clear the collection before indexing, schedule the clear operation two minutes before the index operation (a sample schedule follows this list).
  • Use a daily schedule for indexing if the indexing operation completes within a day; otherwise, use a weekly schedule.
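As a sample, a daily schedule that clears the collection shortly before indexing might look like this (the times are illustrative):

    02:58 AM daily    Clear collection
    03:00 AM daily    Index collection

The two-minute gap ensures the clear operation completes before indexing begins, in line with the first point above.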