Best Practices

Collection Path Settings

  • Use Allow Path to limit indexing to specific domains or URL patterns under the root path.
  • Use Disallow Path to prevent certain sub-paths from being indexed.
  • Do not remove HTML from the allowed formats; indexing cannot work if HTML files are blocked.
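The Allow Path / Disallow Path logic above can be sketched as a simple filter. This is a minimal illustration, assuming both lists are treated as substring patterns; the helper name and matching rule are illustrative, not the product's actual implementation:

```python
def should_crawl(url, allow_paths, disallow_paths):
    """Return True if url matches an Allow Path and no Disallow Path.

    Illustrative only: real crawlers may use prefix or regex matching.
    """
    if not any(p in url for p in allow_paths):
        return False
    return not any(p in url for p in disallow_paths)

allow = ["https://example.com/docs"]
disallow = ["/docs/archive/"]

print(should_crawl("https://example.com/docs/guide.html", allow, disallow))       # True
print(should_crawl("https://example.com/docs/archive/old.html", allow, disallow)) # False
```

Note that a Disallow Path always wins over an Allow Path: a URL must match at least one allow pattern and no disallow pattern to be crawled.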

Collection Settings

  • Set Spider Depth to control how many levels of links the crawler follows from the root URL.
  • To skip documents above a certain size or older than a certain age, set the maximum document size and age in the settings.
  • To remove either limit, set its value to -1.
  • Specify the User-Agent if your robots.txt file contains rules for specific crawlers.
  • To index canonical URLs, disable Ignore Canonical so that canonical links are honored and only the preferred version of each page is indexed.
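If your robots.txt contains crawler-specific rules, the User-Agent you configure must match the token used in the file. A minimal robots.txt of that shape (paths and crawler name are examples only) looks like:

```text
# Rules scoped to one crawler's User-Agent token
User-agent: MyCrawler
Disallow: /private/
Disallow: /tmp/

# Fallback rules for all other crawlers
User-agent: *
Disallow: /
```

Here a crawler identifying itself as MyCrawler may fetch everything except /private/ and /tmp/, while all other crawlers are blocked entirely.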

HTML Page Rules

  • To exclude certain parts of a webpage from indexing, use stopindex tags in the HTML content.
  • Use robots.txt to control which sub paths of your website the crawler can access or block.
  • You can also control crawling and indexing directly from webpages using robots meta tags.
  • If all URLs are listed in sitemap.xml, you can speed up indexing by enabling Follow Sitemaps before indexing the collection.
  • If your webpages use redirects, enable Redirects in the collection settings and include the redirect target URL in the Allow Path.
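The page-level controls above can be combined in a single HTML document. The robots meta tag is a widely supported standard; the stopindex/startindex comment syntax varies by crawler, so treat the markers below as illustrative:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Index this page, but do not follow its links -->
  <meta name="robots" content="index, nofollow">
</head>
<body>
  <h1>Product Guide</h1>
  <p>This content is indexed normally.</p>

  <!-- stopindex -->
  <nav>Navigation and footer boilerplate excluded from the index.</nav>
  <!-- startindex -->
</body>
</html>
```

Everything between the stopindex and startindex markers is fetched but omitted from the search index, which keeps boilerplate such as menus out of search results.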

Schedule Operations

  • When scheduling collections, ensure that the next indexing run starts only after the previous scheduled task has completed.
  • Use a daily schedule if indexing finishes within a day; if it takes longer, use a weekly schedule instead.
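If your scheduler accepts cron expressions (your product may instead offer its own scheduling UI), the two cases map to entries like the following sketch, where the reindex command and collection name are placeholders:

```text
# Daily at 02:00 — suitable when a full index completes within a day
0 2 * * *    reindex my-collection

# Weekly on Sunday at 02:00 — for collections that take longer than a day
0 2 * * 0    reindex my-collection
```

Either way, the interval between runs must exceed the longest observed indexing time, so that one run never starts while the previous one is still in progress.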