Best Practices
Collection Path Settings
- Provide an allow path to restrict indexing to the domain specified in the root path.
- Provide a disallow path if you need to exclude certain subpaths of the domain from indexing (see the path sketch after this list).
- Do not disable the HTML format in allowed formats; indexing is not possible when HTML files are disallowed.
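As an illustration (the field layout and example.com URLs below are hypothetical; use the paths of your own site):

```
Root path:      https://www.example.com/
Allow path:     https://www.example.com/docs/
Disallow path:  https://www.example.com/docs/archive/
```

With these values, indexing stays within www.example.com, only pages under /docs/ are indexed, and anything under /docs/archive/ is skipped.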
Collection Settings
- Increase or decrease the spider depth to control how many levels deep the crawler follows links.
- To exclude documents from indexing by size or age, provide the relevant limits in the settings.
- If you do not want to limit by size or age, set the value to -1.
- Provide the relevant User-Agent if your robots.txt applies rules to a specific user agent (see the robots.txt sketch after this list).
- If you need canonical URLs to be considered for indexing, disable the ignore canonical setting.
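A minimal robots.txt sketch, assuming the crawler identifies itself as MyCrawler (a hypothetical name; substitute the user agent configured for your collection):

```
# Rules applied only to the crawler's own user agent
User-agent: MyCrawler
Disallow: /private/
Disallow: /tmp/

# Rules applied to all other user agents
User-agent: *
Disallow: /
```

If your robots.txt contains a block like the first one, set the collection's User-Agent to the same value so the crawler follows those rules rather than the generic ones. The Disallow lines also show how robots.txt restricts access to specific subpaths of the site.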
HTML Page Rules
- To exclude a specific portion of a page's content from indexing, use stopindex tags in your webpages (see the HTML sketch after this list).
- You can control crawler access to subpaths of your site using robots.txt, as illustrated in the sketch above.
- You can control crawling and indexing at the page level using robots meta tags.
- If all of your URLs are listed in sitemap.xml, you can index every page faster by enabling follow sitemaps before indexing the collection (a minimal sitemap appears after this list).
- If your webpages redirect, enable redirects in the collection's settings and add the redirect target URL to the collection's allow path.
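A minimal HTML sketch showing a robots meta tag and stopindex/startindex markers, assuming the crawler recognizes these as HTML comments (verify the exact marker syntax your crawler expects):

```html
<!DOCTYPE html>
<html>
  <head>
    <!-- Allow this page to be indexed, but do not follow its links -->
    <meta name="robots" content="index, nofollow">
  </head>
  <body>
    <p>This paragraph is indexed.</p>

    <!--stopindex-->
    <p>This navigation or boilerplate block is excluded from the index.</p>
    <!--startindex-->

    <p>Indexing resumes here.</p>
  </body>
</html>
```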
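A minimal sitemap.xml sketch with hypothetical URLs; when follow sitemaps is enabled, listing every URL here lets the crawler discover pages directly instead of relying on deep link-following:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/docs/getting-started</loc>
    <lastmod>2021-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/docs/api-reference</loc>
    <lastmod>2021-02-01</lastmod>
  </url>
</urlset>
```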
Schedule Operations
- When scheduling collections, ensure that indexing starts only after the previously scheduled operation has completed.
- If you need to clear the collection before indexing, schedule the clear operation two minutes before the index operation.
- Use a daily schedule for indexing if the operation completes within a day; otherwise, use a weekly schedule (see the example after this list).
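For example (times and frequencies are hypothetical; adjust them to how long your operations actually take):

```
Clear operation:  Weekly, Sunday 01:00 AM
Index operation:  Weekly, Sunday 01:02 AM   (starts two minutes after the clear)
```

A weekly schedule is used here on the assumption that indexing takes longer than a day to finish; if it reliably completes within a day, a daily schedule with the same two-minute offset works as well.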