SearchBlox includes a web crawler to index content from any intranet, portal, or website. The crawler can index HTTPS-based content without any additional configuration, and can crawl through a proxy server or behind HTTP Basic Auth/Form authentication.
- After logging in to the Admin Console, click on the Add Collection button
- Enter a unique Collection name for the data source (for example, intranetsite)
- Choose HTTP Collection as Collection Type
- Choose the language of the content (if the language is other than English)
- Click Add to create the collection
The HTTP collection Paths allow you to configure the Root URLs and the Allow/Disallow paths for the crawler. To access the paths for the HTTP collection, click on the collection name in the Collections list.
The root URL is the starting URL for the crawler. It requests this URL, indexes the content, and follows links from the URL. Make sure the root URL entered has regular HTML HREF links that the crawler can follow. In the paths sub-tab, enter at least one root URL for the HTTP Collection in the Root URLs.
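The way a crawler follows HTML href links from a root URL can be sketched with Python's standard library (this is an illustration only, not SearchBlox's internal implementation; the page content and URLs are made up):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

page = '<a href="/about.html">About</a> <a href="http://other.example/x">X</a>'
extractor = LinkExtractor("http://www.example.com/index.html")
extractor.feed(page)
print(extractor.links)
```

A page whose links are rendered only by JavaScript exposes no href attributes to a parser like this, which is why the root URL must contain regular HTML links.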
Allow/Disallow paths tell the crawler which URLs to include or exclude, making it possible to manage a collection by keeping unwanted URLs out of the index.
- http://www.cnn.com/ (keeps the crawler within the cnn.com site)
- .* (allows the crawler to go to any external URL or domain)
Select the document formats that need to be searchable within the collection.
Keep the crawler within the required domain(s)
Enter the Root URL domain name(s) (for example, cnn.com or nytimes.com) in the Allow Paths to keep the crawler within the required domains. If .* is left as the value in the Allow Paths, the crawler will follow links to any external domain and index those web pages.
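SearchBlox's exact path-matching rules are not spelled out here, but the general idea behind Allow/Disallow filtering can be sketched as regular-expression matching (the patterns and URLs below are illustrative assumptions):

```python
import re

def url_allowed(url, allow_patterns, disallow_patterns):
    """A URL is crawled only if it matches at least one Allow path
    and no Disallow path."""
    allowed = any(re.search(p, url) for p in allow_patterns)
    blocked = any(re.search(p, url) for p in disallow_patterns)
    return allowed and not blocked

# Keep the crawler on cnn.com, but skip its /videos section.
allow = [r"cnn\.com"]
disallow = [r"/videos/"]
print(url_allowed("http://www.cnn.com/world/story.html", allow, disallow))
print(url_allowed("http://www.cnn.com/videos/clip.html", allow, disallow))
print(url_allowed("http://www.nytimes.com/article", allow, disallow))
```

Under this model, replacing the allow list with `.*` makes every URL pass the allow check, which is why the crawler then wanders to external domains.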
The Settings page has configurable parameters for the crawler. SearchBlox provides default parameters when a new collection is created. Most crawler settings can be changed for your specific requirements.
The keyword-in-context setting returns search results with the description displayed from the content areas where the search term occurs.
HTML Parser Settings
This setting configures the HTML parser to read the description for a document from one of the HTML tags: H1, H2, H3, H4, H5, H6
Maximum Document Age
Specifies the maximum allowable age in days of a document in the collection.
Maximum Document Size
Specifies the maximum allowable size in kilobytes of a document in the collection.
Maximum Spider Depth
Specifies the maximum depth the spider is allowed to proceed to index documents.
Specifies the wait time in milliseconds for the spider between HTTP requests to a web server.
The name under which the spider requests documents from a web server.
This is a URL value set in the request headers to specify where the user agent previously visited.
Set to Yes or No to tell the spider whether to obey robots.txt rules. The default value is No.
Set to Yes to index only the URLs listed in sitemaps, or No to index all discovered URLs. The default value is No.
Set to Yes or No to instruct the spider whether to automatically follow redirects.
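How a robots-obeying spider applies these rules can be sketched with Python's standard `urllib.robotparser`; the user-agent string, robots.txt content, and URLs below are made-up stand-ins for the collection's settings, not SearchBlox internals:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical stand-ins for the Settings fields described above.
USER_AGENT = "example-spider"   # name the spider presents to web servers
OBEY_ROBOTS = True              # the obey-robot-rules setting

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def may_fetch(url):
    """Honor robots.txt only when the setting is enabled."""
    return (not OBEY_ROBOTS) or rp.can_fetch(USER_AGENT, url)

print(may_fetch("http://example.com/index.html"))
print(may_fetch("http://example.com/private/x"))
```

With the setting switched to No, `may_fetch` returns True for every URL, matching the documented default behavior of ignoring robot rules.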
Boosting
Boost search terms for the collection by setting a value greater than 1 (maximum value 9999).
When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".
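A toy illustration of what stemming does (real stemmers such as Porter's algorithm use much richer rule sets than this deliberately naive sketch):

```python
def naive_stem(word):
    """Toy stemmer: strips a few common inflectional suffixes.
    Irregular forms like "ran" cannot be handled by suffix rules
    alone, so they need a lookup table."""
    irregular = {"ran": "run"}
    if word in irregular:
        return irregular[word]
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["running", "runs", "ran"]])
```

Because all three inflected forms reduce to the same root, a search for any one of them can match documents containing the others.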
When enabled, prevents duplicate documents from being indexed.
When enabled, a spell index is created at the end of the indexing process.
When logging is enabled, indexer activity is recorded in detail in ../searchblox/logs/index.log
When logging or debug logging mode is enabled, index.log records the following details:
- List of links that are crawled.
- Processing done on each URL, with a timestamp for when processing starts, whether the URL is indexed or skipped, and the final indexing outcome, each as a separate entry in index.log.
- Timestamp of when indexing completed, along with the time taken, recorded against the indexed URL entry in the log file.
- Last modified date of the URL.
- If a URL is skipped or not indexed, the reason why.
HTTP Basic Authentication
When the spider requests a document, it presents these values (user/password) to the HTTP server in the Authorization MIME header. The attributes required for basic authentication are username and password.
When access to documents is protected using form-based authentication, the spider can automatically log in and access the documents. The attributes required for form authentication are the Form URL, Form Action, and Name/Value pairs as required.
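Conceptually, a form login is an HTTP POST of the name/value pairs to the form's action URL; the session cookie returned is then reused for subsequent requests. A sketch with hypothetical field names and URLs (the real values come from the site's login page):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form-login values standing in for the collection settings.
form_action = "http://intranet.example.com/login"
fields = {"username": "crawler", "password": "secret"}

# A spider would POST the name/value pairs to the form action and then
# carry the returned session cookie on later requests.
req = Request(form_action, data=urlencode(fields).encode(), method="POST")
print(req.get_method(), req.full_url)
print(req.data.decode())
```

The field names ("username", "password") must match the input names in the site's login form, which is why they are configured as name/value pairs rather than fixed attributes.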
Proxy Server Indexing
When HTTP content is accessed through proxy servers, the proxy server settings are required to enable the spider to successfully access and index content. The attributes required for proxy server indexing are the Proxy server URL and Username/Password.
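Routing requests through a proxy can be sketched with Python's standard `urllib` (the proxy URL below is a made-up placeholder; proxy credentials, when required, would be supplied via an additional auth handler):

```python
from urllib.request import ProxyHandler, build_opener, install_opener

# Hypothetical proxy details standing in for the collection's settings.
PROXY_URL = "http://proxy.example.com:8080"

handler = ProxyHandler({"http": PROXY_URL, "https": PROXY_URL})
opener = build_opener(handler)
# After install_opener, every urlopen() call is routed via the proxy.
install_opener(opener)
print(handler.proxies["https"])
```

Without these settings, a spider running inside a proxied network simply cannot reach external web servers, so no content gets indexed.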
The HTTP Collection crawler/parser can be controlled using the following HTML markup tags:
Robots Meta Tag
These tags in the HTML page specify whether SearchBlox can or cannot index the page, and can or cannot spider the entire website. The different types of robots meta tags are as follows:
<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="index, nofollow">
<meta name="robots" content="noindex, nofollow">
NoIndex/StopIndex Tags
With HTTP collections, there is often a requirement to exclude content from sections of an HTML page from being indexed, such as headers, footers, and navigation. SearchBlox provides two ways to achieve this.
<noindex> Content to Exclude</noindex>
<!--stopindex-->Content to Exclude <!--startindex-->
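The effect of both exclusion markers can be sketched as stripping the wrapped regions from the page before the remaining text is indexed (an illustrative regex sketch, not SearchBlox's parser):

```python
import re

def strip_excluded(html):
    """Remove content wrapped in <noindex> tags or between
    <!--stopindex--> and <!--startindex--> comments before indexing."""
    html = re.sub(r"<noindex>.*?</noindex>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<!--stopindex-->.*?<!--startindex-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    return html

page = ("Intro <noindex>nav menu</noindex> body "
        "<!--stopindex-->footer<!--startindex--> end")
print(strip_excluded(page))
```

Because the comment form is invisible to browsers, it is often the easier marker to add around headers, footers, and navigation in existing templates.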
A canonical URL is specified in the link tag to set the preferred URL for HTML pages that are copies or duplicates.
A URL such as http://www.sample.com/new.html?uid=dxdm59652xhax can specify its preferred URL in the HEAD part of the document as follows:
<link rel="canonical" href="http://sample.com"/>
HTTP Collection can be indexed, refreshed or cleared on-demand, on a schedule or through API requests.
Index
Starts the indexer for the selected collection, beginning from the root URLs.
Clear
Clears the current index for the selected collection.
Refresh
Revisits URLs from the current index to make sure they are still valid, and then continues to index newly discovered URLs.
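One common way a refresh can revalidate an already-indexed URL is a conditional HTTP request: if the server answers 304 Not Modified, the stored copy is still current. A sketch of building such a request with made-up values (not a description of SearchBlox's actual refresh protocol):

```python
from urllib.request import Request

def refresh_request(url, last_modified):
    """Re-request an already-indexed URL conditionally; a 304 reply
    would mean the stored copy is still valid and needs no re-index."""
    return Request(url, headers={"If-Modified-Since": last_modified})

req = refresh_request("http://example.com/page.html",
                      "Wed, 01 May 2024 10:00:00 GMT")
print(req.get_header("If-modified-since"))
```

Conditional requests make a refresh much cheaper than a full re-index, since unchanged documents are never downloaded or re-parsed.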
For each collection, any of the following scheduled indexer activity can be set:
Index - Set the frequency and the start date/time for indexing a collection.
Refresh - Set the frequency and the start date/time for refreshing a collection.
Clear - Set the frequency and the start date/time for clearing a collection.
- Indexing is controlled from the Index sub-tab for a collection or through API. The current status of a collection is always indicated on the Collection Dashboard and the Index page.
- Refresh can be performed for a collection only after Indexing is completed.
- Index and Refresh operations can also be initiated from Collection Dashboard.
- Scheduling can be performed only from the Index sub-tab.
Best Practices for Scheduling Index/Refresh of Collections
Do not set the same time schedule for all three operations (Index, Refresh, Clear), as this creates conflicts between the activities.
If you have multiple collections, stagger the schedules so that no more than 2-3 collections are indexing or refreshing at the same time.