SearchBlox includes a web crawler to index content from any intranet, portal, or website. The crawler can index HTTPS-based content without additional configuration, crawl through a proxy server, and authenticate using HTTP Basic Auth or form-based login.
- After logging in to the Admin Console, click on the Add Collection button.
- Enter a unique Collection name for the data source (for example, intranetsite).
- Choose HTTP Collection as Collection Type.
- Choose the language of the content (if the language is other than English).
- Click Add to create the collection.
The HTTP collection Paths allow you to configure the Root URLs and the Allow/Disallow paths for the crawler. To access the paths for the HTTP collection, click on the collection name in the Collections list.
The root URL is the starting URL for the crawler. It requests this URL, indexes the content, and follows links from the URL. Make sure the root URL entered has regular HTML HREF links that the crawler can follow. In the paths sub-tab, enter at least one root URL for the HTTP Collection in the Root URLs.
Allow/Disallow paths control which URLs the crawler includes or excludes, making it possible to manage a collection by keeping unwanted URLs out of the index.
http://www.cnn.com/ (Keeps the crawler within the cnn.com site.)
.* (Allows the crawler to go to any external URL or domain.)
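The filtering behavior described above can be sketched as follows. This is an illustrative model, not SearchBlox's actual implementation: it assumes Allow/Disallow paths act as patterns matched against each discovered URL (the `.*\.pdf$` Disallow pattern is a hypothetical example).

```python
import re

# Hypothetical Allow/Disallow paths, treated as regular expressions.
allow_paths = [r"http://www\.cnn\.com/"]   # keep the crawler within cnn.com
disallow_paths = [r".*\.pdf$"]             # hypothetical: skip PDF links

def should_crawl(url):
    # A URL is crawled if it matches an Allow path and no Disallow path.
    allowed = any(re.match(p, url) for p in allow_paths)
    blocked = any(re.match(p, url) for p in disallow_paths)
    return allowed and not blocked

print(should_crawl("http://www.cnn.com/world/story.html"))  # True
print(should_crawl("http://www.cnn.com/report.pdf"))        # False
print(should_crawl("http://other-site.com/page.html"))      # False
```

This also illustrates the note below: with `.*` in the Allow Paths, every URL would pass the allow check, so the crawler would wander to external domains.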
Select the document formats that need to be searchable within the collection.
Keep the crawler within the required domain(s)
Enter the Root URL domain name(s) (for example, cnn.com or nytimes.com) within the Allow Paths to ensure the crawler stays within the required domains. If .* is left as the value within the Allow Paths, the crawler will go to any external domain and index the web pages.
The Settings page has configurable parameters for the crawler. SearchBlox provides default parameters when a new collection is created. Most crawler settings can be changed for your specific requirements.
Keyword-in-Context
The keyword-in-context setting returns search results with the description drawn from the content areas where the search term occurs.
HTML Parser Settings
This setting configures the HTML parser to read the description for a document from one of the HTML tags: H1, H2, H3, H4, H5, H6
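A minimal sketch of what this setting does, assuming the parser takes the description from the first occurrence of the configured tag (the class and sample HTML here are illustrative, not SearchBlox code):

```python
from html.parser import HTMLParser

class DescriptionExtractor(HTMLParser):
    """Reads a document description from the first instance of a given tag."""
    def __init__(self, tag="h1"):
        super().__init__()
        self.tag = tag
        self.in_tag = False
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and self.description is None:
            self.in_tag = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.in_tag = False

    def handle_data(self, data):
        if self.in_tag and self.description is None:
            self.description = data.strip()

# With the HTML parser setting at H1, the description comes from the <h1> tag.
parser = DescriptionExtractor("h1")
parser.feed("<html><body><h1>Quarterly Report</h1><p>Body text</p></body></html>")
print(parser.description)  # Quarterly Report
```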
Maximum Document Age
Specifies the maximum allowable age in days of a document in the collection.
Maximum Document Size
Specifies the maximum allowable size in kilobytes of a document in the collection.
Maximum Spider Depth
Specifies the maximum depth to which the spider is allowed to proceed when indexing documents. The maximum spider depth that can be set in SearchBlox is 15.
Spider Delay
Specifies the wait time in milliseconds for the spider between HTTP requests to a web server.
User Agent
The name under which the spider requests documents from a web server.
Referrer
This is a URL value set in the request headers to specify where the user agent previously visited.
Obey Robots
Value is set to Yes or No to tell the spider whether to obey robot rules. The default value is No.
Ignore Canonical
Value is set to Yes or No to tell the spider whether to ignore canonical URLs specified in the page. The default value is Yes.
Index Sitemaps Only
Value is set to Yes or No to tell the spider whether only the sitemaps should be indexed (Yes) or all discovered URLs (No). The default value is No.
Follow Redirects
Value is set to Yes or No to instruct the spider whether to follow redirects automatically.
Stemming
When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".
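A toy illustration of the idea: real search engines use full stemming algorithms (such as Porter stemming), while this sketch only strips a couple of regular suffixes and is not SearchBlox's implementation.

```python
def naive_stem(word):
    # Strip a common suffix, then collapse a doubled final consonant,
    # e.g. "running" -> "runn" -> "run". Illustrative only.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            if len(word) >= 2 and word[-1] == word[-2]:
                word = word[:-1]
            break
    return word

print(naive_stem("running"))  # run
print(naive_stem("runs"))     # run
# Irregular forms such as "ran" need dictionary lookups, not suffix rules.
```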
Remove Duplicates
When enabled, prevents indexing of duplicate documents.
Boost
Boost search terms for the collection by setting a value greater than 1 (maximum value 9999).
Spelling Index
When enabled, a spelling index is created at the end of the indexing process.
Logging
Provides detailed indexer activity in ../searchblox/logs/index.log.
When logging or debug logging mode is enabled, index.log records:
- The list of links that are crawled.
- The processing done on each URL, with a timestamp for when processing starts and whether the URL is indexed or skipped. Each URL appears as a separate entry in index.log.
- The timestamp of when indexing completed, and the time taken, recorded against each indexed URL entry in the log file.
- The last modified date of the URL.
- Whether the URL was skipped, and why.
HTTP Basic Authentication
When the spider requests a document, it presents these values (username/password) to the HTTP server in the Authorization MIME header. The attributes required for basic authentication are the username and password.
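The Authorization header for HTTP Basic Auth is built by joining the username and password with a colon and Base64-encoding the result. A sketch with example credentials (not actual SearchBlox internals):

```python
import base64

username = "crawler"   # assumed example credentials
password = "s3cret"

# Per the HTTP Basic scheme, encode "user:password" in Base64.
token = base64.b64encode(f"{username}:{password}".encode()).decode()
header = f"Authorization: Basic {token}"
print(header)
```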
Form Authentication
When access to documents is protected using form-based authentication, the spider can automatically log in and access the documents. The attributes required for form authentication are:
Form URL, Form Action, and Name/Value pairs as required.
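A hedged sketch of the kind of request a form login produces: the Form Action receives a POST whose body is the URL-encoded Name/Value pairs. The URL and field names below are hypothetical examples, not SearchBlox settings values.

```python
from urllib.parse import urlencode
from urllib.request import Request

form_action = "http://intranet.example.com/login"       # hypothetical Form Action
fields = {"username": "crawler", "password": "s3cret"}  # hypothetical Name/Value pairs

# Build the POST request a login form submission would send.
body = urlencode(fields).encode()
request = Request(form_action, data=body, method="POST")
request.add_header("Content-Type", "application/x-www-form-urlencoded")
print(request.get_full_url(), request.get_method())
```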
Proxy Server Indexing
When HTTP content is accessed through proxy servers, the proxy server settings are required to enable the spider to successfully access and index content. The attributes required for proxy server indexing are:
Proxy server URL and username/password.
The HTTP Collection crawler/parser can be controlled using the following HTML markup tags.
Robots Meta Tag
These tags in the HTML page specify whether SearchBlox may index the page and whether it may follow links from the page. The different types of robots meta tags are shown below:
<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="index, nofollow">
<meta name="robots" content="noindex, nofollow">
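A sketch of how a crawler might interpret the content value of these tags, assuming the standard semantics: "noindex" blocks indexing the page and "nofollow" blocks following its links (illustrative only, not SearchBlox code):

```python
def parse_robots_meta(content):
    # Split the content attribute into individual directives.
    directives = {d.strip().lower() for d in content.split(",")}
    can_index = "noindex" not in directives
    can_follow = "nofollow" not in directives
    return can_index, can_follow

print(parse_robots_meta("index, follow"))      # (True, True)
print(parse_robots_meta("noindex, follow"))    # (False, True)
print(parse_robots_meta("noindex, nofollow"))  # (False, False)
```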
NoIndex/StopIndex Tags
With HTTP collections, there is often a requirement to exclude content from sections of an HTML page from being indexed, such as headers, footers, and navigation. SearchBlox provides two ways to achieve this.
<noindex> Content to Exclude</noindex>
<!--stopindex-->Content to Exclude <!--startindex-->
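The effect of these markers can be sketched as a pre-indexing step that removes the excluded regions; this is an illustrative model, not SearchBlox's parser:

```python
import re

def strip_excluded(html):
    # Drop content inside <noindex>...</noindex>.
    html = re.sub(r"<noindex>.*?</noindex>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop content between <!--stopindex--> and <!--startindex-->.
    html = re.sub(r"<!--stopindex-->.*?<!--startindex-->", "", html,
                  flags=re.DOTALL)
    return html

page = "<noindex>nav menu</noindex>Main article text<!--stopindex-->footer<!--startindex-->"
print(strip_excluded(page))  # Main article text
```

Only "Main article text" survives to be indexed; the navigation and footer regions are stripped.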
A canonical URL is specified in the link tag to set the preferred URL for HTML pages that are copies or duplicates.
For a URL such as http://www.sample.com/new.html?uid=dxdm59652xhax, the preferred URL can be specified in the HEAD part of the document:
<link rel="canonical" href="http://sample.com"/>
Note: Canonical tags are ignored by default; to honor them, set the Ignore Canonical value to No in the collection settings.
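A sketch of the resolution step: when canonical tags are honored, a page's indexed URL is the declared canonical URL, falling back to the fetched URL when no tag is present (illustrative regex-based extraction, not SearchBlox code):

```python
import re

def canonical_url(html, fallback):
    # Look for <link rel="canonical" href="..."/> in the page.
    match = re.search(r'<link\s+rel="canonical"\s+href="([^"]+)"\s*/?>',
                      html, re.IGNORECASE)
    return match.group(1) if match else fallback

page = '<head><link rel="canonical" href="http://sample.com"/></head>'
print(canonical_url(page, "http://www.sample.com/new.html?uid=dxdm59652xhax"))
# http://sample.com
```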
An HTTP Collection can be indexed, refreshed or cleared on-demand, on a schedule, or through API requests.
Index
Starts the indexer for the selected collection, beginning from the root URLs.
Clear
Clears the current index for the selected collection.
Refresh
Revisits URLs from the current index to make sure they are still valid, and then continues to index newly discovered URLs.
For each collection, any of the following scheduled indexer activity can be set:
Index - Set the frequency and the start date/time for indexing a collection.
Refresh - Set the frequency and the start date/time for refreshing a collection.
Clear - Set the frequency and the start date/time for clearing a collection.
- Indexing is controlled from the Index sub-tab for a collection or through API. The current status of a collection is always indicated on the Collection Dashboard and the Index page.
- Refresh can be performed for a collection only after Indexing is completed.
- Index and Refresh operations can also be initiated from Collection Dashboard.
- Scheduling can be performed only from the Index sub-tab.
Best Practices for Scheduling Index/Refresh of Collections
Do not schedule all three operations (Index, Refresh, Clear) at the same time, as this creates conflicts between the activities.
If you have multiple collections, stagger the schedules so that no more than 2-3 collections index or refresh at the same time.
You can use canonical URLs to avoid duplicates and improve link and ranking signals for content that is available through multiple URL structures. The same content may be accessible through multiple URLs, and content may even be distributed across different URLs and domains entirely.
To enable canonical URL indexing, set the Ignore Canonical value to No in the HTTP collection settings.
- Your site has multiple URLs because you position the same page under multiple sections.
  http://sample.com/tea/
  http://sample.com/hot-beverage/tea
  http://sample.com/drinks/tea
- You serve the same content with and without the www subdomain, or over both the http and https protocols.
  http://sample.com/tea/
  https://sample.com/tea
  http://www.sample.com/tea
- Content is replicated partially or fully on different pages or sites in the domain.
  http://discussion.sample.com/tea/
  http://blog.sample.com/tea
  http://www.sample.com/tea
This replication of pages can create challenges for people searching in SearchBlox: results contain repetitive content pointing to different pages. To overcome this, define a canonical URL for pages that have similar or equivalent content across multiple URLs/pages.
- Set the most preferred URL as the canonical URL
- Indicate the preferred URL with the rel="canonical" link element
<link rel="canonical" href="http://www.sample.com/tea" />
The above indicates the preferred URL for accessing the tea post.
On indexing the URL, SearchBlox will index only the URL that has been specified as canonical.
To avoid errors, use absolute paths instead of relative paths in canonical URLs.
Please note that while we encourage you to use canonical URLs, they are not mandatory. If you do not indicate a canonical URL, SearchBlox will treat the current URL as the best version.
By default, canonical tags are ignored; they can be honored by setting the Ignore Canonical value to No in the collection settings.