SearchBlox includes a web crawler to index content from any intranet, portal, or website. The crawler can also index HTTPS-based content without any additional configuration, and can crawl through a proxy server or behind form authentication.
Dynamic Auto Collection authenticates and crawls secure content behind a web form. Each page is rendered by the crawler, treated as a document, and indexed.
It supports single-page applications (SPAs) as well as pages whose content is loaded dynamically by JavaScript.
You can create a Dynamic Auto Collection with the following steps:
- After logging in to the Admin Console, select the Collections tab and click Create a New Collection or the "+" icon.
- Choose Dynamic Auto Collection as Collection Type.
- Enter a unique Collection name for the data source (For example, DynamicAuto).
- Choose Private/Public Collection Access and Collection Encryption as per your requirements.
- Choose the language of the content (if the language is other than English).
- Click Save to create the collection.
- Once the Dynamic Auto collection is created, you will be taken to the Path tab.
The Dynamic Auto collection Paths tab allows you to configure the Root URLs and the Allow/Disallow paths for the crawler. To access the paths for the collection, click on the collection name in the Collections list.
- The root URL is the starting URL for the crawler. It requests this URL, indexes the content, and follows links from the URL.
- Each root URL should be a single URL; the crawler indexes the pages linked from that URL.
- Make sure the root URL entered has regular HTML HREF links that the crawler can follow (see the example after this list).
- Allow/Disallow paths let the crawler include or exclude URLs.
- Allow and Disallow paths make it possible to manage a collection by excluding unwanted URLs.
- It is mandatory to provide an Allow path in a Dynamic Auto collection, to limit indexing to the subdomain provided in the Root URLs.
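For reference, "regular HTML HREF links" are standard anchor tags in the page markup, for example (illustrative markup, not from the product):

```html
<!-- A regular HTML HREF link that the crawler can discover and follow -->
<a href="https://www.example.com/products/index.html">Products</a>
```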
| Setting | Description |
| --- | --- |
| Root URLs | The starting URL for the crawler. You need to provide at least one root URL. |
| Allow Paths | `http://www.cnn.com/` (informs the crawler to stay within the cnn.com site.)<br>`.*` (allows the crawler to go to any external URL or domain.) |
| Allowed Formats | Select the document formats that need to be searchable within the collection. |
- Enter the Root URL domain name (for example, cnn.com or nytimes.com) within the Allow Paths to ensure the crawler stays within the required domains.
- If .* is left as the value within the Allow Paths, the crawler will go to any external domain and index those web pages.
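As an illustration, a paths setup for a hypothetical site might look like the following (all values are examples, not defaults):

```
Root URL:        https://www.example.com/
Allow Paths:     example.com
Disallow Paths:  /login
                 .pdf
```

Here the Allow path keeps the crawler within example.com, while the Disallow paths skip the login page and PDF documents.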
The Settings page has configurable parameters for the crawler. SearchBlox provides default parameters when a new collection is created. Most crawler settings can be changed for your specific requirements.
| Settings | Field | Description |
| --- | --- | --- |
| HTML Parser Settings | Description | Configures the HTML parser to read the description for a document from one of the HTML tags: H1, H2, H3, H4, H5, H6. |
| HTML Parser Settings | Last-Modified Date | Determines where the last-modified date for an indexed document is fetched from. By default it is taken from the webpage header; select the Meta option to read it from the meta tags instead, or select the custom option to use a custom date set in the web server for SearchBlox. |
| Form Authentication | Form authentication fields | When access to documents is protected using form-based authentication, the crawler can automatically log in and access the documents. The attributes required for form authentication are:<br>• Form URL: the root URL or landing-page URL after authentication (the crawler starts crawling from this point).<br>• Username reference: the ID of the username element (available via Inspect Element on the username field).<br>• User name: `<User_name_input>`<br>• Password reference: the ID of the password element (available via Inspect Element on the password field).<br>• submit: submit (do not change).<br>• submit reference: the ID of the submit element (available via Inspect Element on the submit button).<br>• name 1: action for wait/iframe (optional).<br>• value 1: the reference ID of the wait/iframe element, if needed (optional). |
| HTML Proxy Server Settings | Proxy server credentials | When HTTP content is accessed through proxy servers, the proxy server settings are required for the crawler to successfully access and index content. The required attributes are the proxy server URL, username, and password. |
| Enable Detailed Log Settings | Enable Logging | Provides the indexer activity in detail in ..\webapps\ROOT\logs\index.log. When logging or debug logging mode is enabled, index.log records:<br>• the list of links that are crawled;<br>• the processing done on each URL, with a timestamp for when processing starts, whether the URL is indexed or skipped, and why, each as a separate entry;<br>• the timestamp of when indexing completed, and the time taken, recorded against the indexed URL entry;<br>• the last-modified date of the URL. |
| Enable Content API | Enable Content API | Provides the ability to crawl content that includes special characters. |
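For example, the form-authentication fields for a hypothetical intranet whose login form uses the element IDs `username-field`, `password-field`, and `login-button` might be filled in as follows (all values are illustrative):

```
Form URL:           https://intranet.example.com/home
Username reference: username-field
User name:          crawler_user
Password reference: password-field
submit:             submit
submit reference:   login-button
```

The element IDs can be found by right-clicking each field on the login page and choosing Inspect Element.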
Important for Linux users!
- Before indexing the created collection, navigate to /opt/searchblox/webapps/ROOT/WEB-INF/authconfig.yml and change the relevant setting to true (see the sketch after this list).
- Restarting the SearchBlox service is required to apply the above changes.
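A minimal sketch of the change, assuming a YAML key/value flag; the actual parameter name is not reproduced on this page, so `dynamic-crawler-setting` below is a placeholder, not the real key:

```yaml
# /opt/searchblox/webapps/ROOT/WEB-INF/authconfig.yml
# 'dynamic-crawler-setting' is a placeholder name; substitute the
# parameter documented for your SearchBlox version.
dynamic-crawler-setting: true
```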
To add additional time-outs, update the parameter `submit-delay: 30`, where 30 is the delay in seconds.
Restarting the SearchBlox service is required to apply the above changes.
Synonyms find relevant documents related to a search term, even if the search term is not present. For example, while searching for documents that use the term “global,” results with synonyms “world” and “international” would be listed in the search results.
There is also an option to load synonyms from the existing documents.
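As an illustration only (the exact configuration format depends on your SearchBlox version), a synonym entry typically groups equivalent terms on one line:

```
global, world, international
```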
Sets the frequency and the start date/time for indexing a collection from the root URLs. The schedule frequencies supported in SearchBlox include:
- Every 48 Hours
- Every 96 Hours
The following operations can be performed in Dynamic Auto collections:

| Operation | Description |
| --- | --- |
| Schedule | For each collection, indexing can be scheduled based on the above options. |
Using the Manage Documents tab, you can perform the following operations:
- View content
- View metadata
To add a document, click the "+" icon as shown in the screenshot.
- Enter the document URL and click Add/Update.
- Once the document is added or updated, its URL appears on the screen and the operations mentioned above can be performed on it.
Using the Data Fields tab, you can create custom fields for search; the Default Data Fields are visible for a non-encrypted collection. SearchBlox supports 4 types of Data Fields.
Once the Data Fields are configured, the collection must be cleared and re-indexed for the changes to take effect.
To know more about Data Fields, please refer to the Data Fields Tab documentation.