Best Practices
Collection Path Settings
- Use Allow Path to restrict crawling and indexing to specific domains or URL patterns under the root path.
- Use Disallow Path to prevent specific sub-paths from being indexed.
- Keep HTML enabled in the allowed formats; indexing will not work if HTML files are blocked.
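For example, for a hypothetical site at https://www.example.com, the path settings might look like this (the domain and paths are placeholders):

```text
Allow Path:    https://www.example.com/docs/
Disallow Path: https://www.example.com/docs/archive/
```

With these values, pages under /docs/ are crawled and indexed, while anything under /docs/archive/ is skipped.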
Collection Settings
- Set Spider Depth to control how many levels of links the crawler follows from the root URL.
- To skip documents that are too large or too old, set the maximum document size and age values in the settings.
- To remove the size or age limits, enter -1.
- Specify the User-Agent if your robots.txt file contains rules for specific crawlers.
- To index only the preferred version of each page, disable Ignore Canonical so that canonical URLs are honored during indexing.
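If robots.txt contains crawler-specific rules such as the ones below, the collection's User-Agent setting must match the name used in the file. A minimal sketch, where the crawler name is a hypothetical placeholder:

```text
# robots.txt — rules applied only to one named crawler
User-agent: ExampleSearchBot
Disallow: /internal/

# rules applied to all other crawlers
User-agent: *
Disallow:
```

If the User-Agent setting does not match the name in robots.txt, the crawler falls back to the rules in the `User-agent: *` section.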
HTML Page Rules
- To exclude certain parts of a webpage from indexing, use stopindex tags in the HTML content.
- Use robots.txt to allow or block crawler access to specific sub-paths of your website.
- You can also control crawling and indexing directly from webpages using robots meta tags.
- If all URLs are listed in sitemap.xml, you can speed up indexing by enabling Follow Sitemaps before indexing the collection.
- If your webpages use redirects, enable Redirects in the collection settings and make sure the redirect target URL is covered by the Allow Path.
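The page-level controls above can be combined in a single HTML document. The exact comment syntax for stopindex tags varies by product, so treat the snippet below as an illustrative convention rather than exact syntax; the URL is a placeholder:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Robots meta tag: index this page, but do not follow its links -->
  <meta name="robots" content="index, nofollow">
  <!-- Canonical link: the preferred URL the indexer should keep -->
  <link rel="canonical" href="https://www.example.com/docs/page/">
</head>
<body>
  <p>This content is indexed normally.</p>
  <!--stopindex-->
  <nav>Navigation and boilerplate excluded from the index.</nav>
  <!--startindex-->
  <p>Indexing resumes here.</p>
</body>
</html>
```

Everything between the stopindex and startindex comments is crawled but excluded from the index, which keeps repeated navigation and footer text out of search results.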
Schedule Operations
- When scheduling collections, ensure that the next indexing run starts only after the previous scheduled task has completed.
- Use a daily schedule if indexing finishes within a day; if it takes longer, use a weekly schedule instead.