Special Metadata Parser

The special metadata parser indexes fields present within web pages that are not normally captured and indexed.

The parser uses CSS selectors to extract metadata from a web page. The extracted metadata is then indexed as fields within OpenSearch, so that previously unindexed data becomes searchable and retrievable.

The only requirement is that the CSS location of the data to be indexed is consistent across all web pages.

Example

  • Enable the special metadata parser in the collection settings.
  • Navigate to the following folder in the SearchBlox installation directory: <SearchBlox-Installation-Path>/webapps/ROOT/WEB-INF/specialmetadataparser
  • A metadata config file named with the corresponding collection ID will be created with default values (example: metadata-fields-config-001.json).
  • In this JSON file, replace the "name" and "selector" fields:
    name --> Name of the field to be indexed within OpenSearch.
    selector --> CSS selector that locates the specific data within the web page.
  • Once the file is updated, restart SearchBlox.
  • Navigate to the console and click the index button.

Additionally, the special metadata parser comes with two built-in functions, "parse-date" and "mask".
parse-date --> Takes the date format present within the web pages (if any) and converts it to an ISO-compliant format that OpenSearch supports. This conversion enables efficient date querying within OpenSearch.
mask --> Selectively masks characters within the data obtained by the CSS selector, so that only the unmasked portion of the data is indexed.
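
As a rough illustration of the kind of conversion "parse-date" performs, the Python sketch below turns a page date into an ISO 8601 date string. It is not SearchBlox code: the helper name parse_date and the assumed source format (month-day-year, as in 03-26-2024) are hypothetical and chosen only for this example.

```python
from datetime import datetime


def parse_date(raw: str, source_format: str = "%m-%d-%Y") -> str:
    """Hypothetical helper: convert a page date string to ISO 8601 (YYYY-MM-DD).

    source_format is an assumption for this sketch; a real page may use
    a different layout.
    """
    # Strip surrounding whitespace, parse with the assumed layout,
    # then emit the date part in ISO 8601 form.
    return datetime.strptime(raw.strip(), source_format).date().isoformat()


print(parse_date("03-26-2024"))  # 2024-03-26
```

Dates in this form sort chronologically as plain strings, which is one reason an ISO format is convenient for range queries.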

{
  "name": "Year Issued",
  "selector": "div.lrd-inline-field:has(div:contains(Date Issued)) div:nth-child(2)",
  "mask": "******####"
}

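Example: "mask": "******####"
Applied to the date 03-26-2024 extracted for the "Year Issued" field, this mask hides the first six characters (03-26-) and keeps the last four, so only the year 2024 is indexed as a separate field.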
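
The masking behaviour in this example can be sketched in Python as follows. This is only an illustration inferred from the example above, not the parser's actual implementation: the function name apply_mask is hypothetical, and ignoring characters beyond the shorter of the value and the mask is an assumption.

```python
def apply_mask(value: str, mask: str) -> str:
    """Hypothetical helper: keep characters aligned with '#', drop those aligned with '*'."""
    # zip() pairs each character with its mask symbol and stops at the
    # shorter string, so unmatched trailing characters are ignored
    # (an assumption of this sketch).
    return "".join(ch for ch, symbol in zip(value, mask) if symbol == "#")


print(apply_mask("03-26-2024", "******####"))  # 2024
```

Because the mask is positional, it only gives the intended result when the extracted value has the same layout on every page.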