SearchBlox includes a web crawler to index content from any intranet, portal, or website. The crawler can index HTTPS-based content without any additional configuration, and it can crawl through a proxy server or access content protected by HTTP Basic Authentication or Form Authentication.
You can create a WEB Collection with the following steps:
- After logging in to the Admin Console, select the Collections tab and click Create a New Collection or the "+" icon.
- Choose WEB Collection as Collection Type.
- Enter a unique Collection name for the data source (for example, intranet site).
- Choose Private/Public Collection Access and Collection Encryption as per the requirements.
- Choose the language of the content (if the language is other than English).
- Click Save to create the collection.
- Once the WEB collection is created, you will be taken to the Path tab.
The WEB collection Paths allow you to configure the Root URLs and the Allow/Disallow paths for the crawler. To access the paths for the WEB collection, click on the collection name in the Collections list.
- The root URL is the starting URL for the crawler. It requests this URL, indexes the content, and follows links from the URL.
- Make sure the root URL entered has regular HTML HREF links that the crawler can follow.
- In the paths sub-tab, enter at least one root URL for the WEB Collection in the Root URLs.
- Allow/Disallow paths ensure the crawler can include or exclude URLs.
- Allow and Disallow paths make it possible to manage a collection by excluding unwanted URLs.
- An Allow path is mandatory in a WEB collection; it limits indexing to the subdomain provided in Root URLs.
The starting URL for the crawler. You need to provide at least one root URL.
http://www.cnn.com/ (Informs the crawler to stay within the cnn.com site.)
Select the document formats that need to be searchable within the collection.
- Enter the Root URL domain name(s) (for example cnn.com or nytimes.com) within the Allow Paths to ensure the crawler stays within the required domains.
- If .* is left as the value within the Allow Paths, the crawler will go to any external domain and index the web pages.
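The interaction between Allow and Disallow paths can be sketched in a few lines of Python. This is a minimal illustration, not SearchBlox's implementation: it assumes that each path is a regular expression searched anywhere in the URL and that a Disallow match takes precedence over an Allow match. The function name `is_url_allowed` and the `\.pdf$` Disallow pattern are hypothetical.

```python
import re

def is_url_allowed(url, allow_paths, disallow_paths):
    """Return True if the URL matches an allow pattern and no disallow pattern.

    Assumption: patterns are regular expressions searched anywhere in the URL,
    and a disallow match always wins. Illustrative only.
    """
    if any(re.search(p, url) for p in disallow_paths):
        return False
    return any(re.search(p, url) for p in allow_paths)

allow = [r"cnn\.com"]      # keep the crawler within cnn.com
disallow = [r"\.pdf$"]     # hypothetical exclusion of PDF links

print(is_url_allowed("http://www.cnn.com/world", allow, disallow))
print(is_url_allowed("http://www.cnn.com/report.pdf", allow, disallow))
print(is_url_allowed("http://www.example.org/", allow, disallow))
# With ".*" as the only allow path, external domains pass the check too.
print(is_url_allowed("http://www.example.org/", [r".*"], disallow))
```

The last call shows why leaving `.*` in the Allow Paths lets the crawler leave the intended domain.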
The Settings page has configurable parameters for the crawler. SearchBlox provides default parameters when a new collection is created. Most crawler settings can be changed for your specific requirements.
When enabled, prevents indexing duplicate documents.
When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs", and "ran" are inflected forms of "run".
The spelling suggestions are based on the words found within the search index. By enabling Spelling Suggestion in collection settings, spelling suggestions will appear for the search box in both regular and faceted search.
Keyword-in-Context Search Settings
Keyword-in-context search returns results whose description is taken from the content areas where the search term occurs.
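As a rough illustration of the keyword-in-context idea, the sketch below builds a description from the text surrounding the first occurrence of the search term. The function `kwic_snippet`, the `width` parameter, and the sample page text are hypothetical; how SearchBlox actually selects and highlights the passage is not described here.

```python
def kwic_snippet(text, term, width=30):
    """Return the part of `text` surrounding the first occurrence of `term`.

    Falls back to the start of the text when the term is absent.
    Assumes plain text where lowercasing does not change string length.
    """
    pos = text.lower().find(term.lower())
    if pos == -1:
        return text[: 2 * width]
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    return text[start:end]

page = (
    "Welcome to our site. " * 5
    + "The annual report covers revenue growth. "
    + "Contact us. " * 5
)
print(kwic_snippet(page, "revenue"))
```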
HTML Parser Settings
This setting configures the HTML parser to read the description for a document from one of the HTML tags: H1, H2, H3, H4, H5, or H6.
HTML Parser Settings
This setting determines from where the lastmodified date is fetched for the indexed document.
Maximum Document Age
Specifies the maximum allowable age in days of a document in the collection.
Maximum Document Size
Specifies the maximum allowable size in kilobytes of a document in the collection.
Maximum Spider Depth
Specifies the maximum depth to which the spider is allowed to follow links when indexing documents. The maximum spider depth that can be set in SearchBlox is 15.
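The effect of a depth limit can be sketched with a breadth-first traversal over a small in-memory link graph. This is an illustration only: the function `crawl_order` and the sample `site` dictionary are hypothetical, and the sketch assumes the root URL counts as depth 0, which may not match how SearchBlox counts depth.

```python
from collections import deque

def crawl_order(links, root, max_depth):
    """Breadth-first traversal that stops following links beyond max_depth.

    `links` maps each URL to the URLs it links to (a stand-in for fetching pages).
    Assumption: the root URL is at depth 0.
    """
    seen = {root}
    queue = deque([(root, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited

site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
}
print(crawl_order(site, "/", 1))
print(crawl_order(site, "/", 2))
```

With a limit of 1 the page `/a/1` is never reached; raising the limit to 2 includes it but still excludes `/a/1/x`.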
Specifies the wait time in milliseconds for the spider between HTTP requests to a web server.
The name under which the spider requests documents from a web server.
This is a URL value set in the request headers to specify where the user agent previously visited.
Value is set to Yes or No to tell the spider to obey robot rules or not. The default value is no.
Value is set to Yes or No to tell the spider whether to ignore canonical URLs specified in the page. The default value is yes.
Value is set to Yes or No to tell the spider whether to index only the URLs listed in sitemaps (Yes) or all of the URLs (No). The default value is no.
Value is set to Yes or No to instruct the spider whether to follow redirects automatically.
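The user agent name and the robots rule setting described above can be illustrated with Python's standard `urllib.robotparser` module. This is a sketch under stated assumptions, not SearchBlox's implementation: the robots.txt content, the user agent name `examplebot`, and the helper `may_fetch` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content served by the target site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def may_fetch(url, user_agent="examplebot", obey_robots=True):
    """Decide whether a URL may be requested.

    When obey_robots is False the robots rules are ignored entirely,
    mirroring a crawler configured not to obey them.
    """
    if not obey_robots:
        return True
    return parser.can_fetch(user_agent, url)

print(may_fetch("http://example.com/public/index.html"))
print(may_fetch("http://example.com/private/data.html"))
print(may_fetch("http://example.com/private/data.html", obey_robots=False))
```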
HTTP Basic Authentication
Basic Authentication credentials
When the spider requests a document, it presents these values (username and password) to the HTTP server in the Authorization HTTP header. The attributes required for basic authentication are Username and Password.
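For reference, the Authorization header used by HTTP Basic Authentication is the Base64 encoding of `username:password`, prefixed with `Basic`. The sketch below shows how such a header is built; the helper name `basic_auth_header` is hypothetical, and the credentials are the sample pair commonly used in HTTP specifications, not SearchBlox defaults.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header for HTTP Basic Authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("Aladdin", "open sesame"))
# {'Authorization': 'Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=='}
```

Because Base64 is an encoding and not encryption, Basic Authentication credentials are only protected in transit when the site is served over HTTPS.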
Proxy server Indexing
Proxy server credentials
When HTTP content is accessed through proxy servers, the proxy server settings are required to enable the spider to successfully access and index content. The attributes required for proxy server indexing are:
Form authentication fields
When access to documents is protected using form-based authentication, the spider can automatically log in and access the documents. The attributes required for form authentication are:
Enable Detailed Log Settings
Provides the indexer activity in detail in ..\webapps\ROOT\logs\index.log.
Synonyms find relevant documents related to a search term, even if the search term is not present. For example, while searching for documents that use the term “global,” results with synonyms “world” and “international” would be listed in the search results.
Synonyms can also be loaded from the existing documents.
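Conceptually, synonym search expands the query term into a set of equivalent terms before matching. The sketch below illustrates that idea with a plain dictionary and whole-word matching; the functions `expand_query` and `search`, the synonym table, and the sample documents are hypothetical and do not reflect how SearchBlox stores or applies synonyms.

```python
def expand_query(term, synonyms):
    """Return the search term followed by its configured synonyms."""
    key = term.lower()
    return [key] + [s for s in synonyms.get(key, []) if s != key]

def search(term, documents, synonyms):
    """Return documents containing the term or any of its synonyms as a whole word."""
    terms = expand_query(term, synonyms)
    return [
        doc for doc in documents
        if any(t in doc.lower().split() for t in terms)
    ]

synonyms = {"global": ["world", "international"]}
documents = [
    "World markets rally",
    "International trade grows",
    "Local news roundup",
]
print(search("global", documents, synonyms))
```

Neither matching document contains the word "global"; both are returned because of the synonym expansion.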
Sets the frequency and the start date/time for indexing a collection from the root URLs. The schedule frequencies supported in SearchBlox are as follows:
- Every 48 Hours
- Every 96 Hours
The following operations can be performed in WEB collections:
For each collection, indexing can be scheduled based on the above options.
Using the Manage Documents tab, you can perform the following operations:
- View content
- View metadata
To add a document, click the "+" icon.
- Enter the document/URL and click Add/Update.
- Once the document is updated, its URL appears on the screen and you can perform the operations listed above on it.
Using the Data Fields tab, you can create custom fields for search and view the Default Data Fields for a non-encrypted collection. SearchBlox supports 4 types of Data Fields as listed below:
Once the Data Fields are configured, the collection must be cleared and re-indexed for the changes to take effect.
- The Data Fields tab is disabled while the collection is being created. Once the collection has been created, the collection-specific Data Fields tab is available when editing the collection.
- When Default Data Fields is enabled, the collection-specific SearchBlox reserved fields are shown. A reserved field name cannot be used as a custom field name.
- To create a custom Data Field, click the Add Data Field button, provide the data field name from your document, select the type of the data field values, and save the field. The added Data Fields are listed on the Data Fields screen.
Custom Field Name
The Custom Field Name must be unique, must not contain spaces, and may contain only lowercase letters and underscores, e.g. author, topic_name.
Data type can be keyword, text, number, or date, depending on the data field values.
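The naming rule for custom fields can be checked before a field is created. The sketch below is one reading of the rule stated above (lowercase letters and underscores only, no spaces, unique among existing names); the function `is_valid_field_name` and the `existing` set are hypothetical, and SearchBlox may accept or reject names differently, for example names containing digits.

```python
import re

# Lowercase letters and underscores only, per the rule described above.
FIELD_NAME = re.compile(r"[a-z_]+")
DATA_TYPES = {"keyword", "text", "number", "date"}

def is_valid_field_name(name, existing=()):
    """Check the stated naming rule and uniqueness against existing names."""
    return bool(FIELD_NAME.fullmatch(name)) and name not in existing

existing = {"author"}  # hypothetical names already in use or reserved
print(is_valid_field_name("topic_name", existing))   # valid
print(is_valid_field_name("Topic Name", existing))   # uppercase and space
print(is_valid_field_name("author", existing))       # not unique
```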
- Custom Data Fields are shown in the Data Fields tab as well as in the Default Data Fields list.
- Custom fields are also shown under the Default Data Fields list.
- The custom mapping file can be found under the path: /webapps/ROOT/WEB-INF/mappings/collections
- The mapping file can be edited if specific Analyzers need to be added. Mapping fields are added at the end of the collection mapping file.
SearchBlox version 10.0.1 includes ML fields (ml_topic, ml_sentimentLabel, ml_entity_org, ml_entity_product, ml_entity_person, ml_entity_loc, ml_entity_gpe) as default mappings for all types of collections.
- The Data Fields tab cannot be used with an Encrypted collection.
- We recommend not editing the collection mapping files unless you need to add analyzers or fielddata configuration.