Custom Analyzers

  • SearchBlox supports custom Elasticsearch analyzers that extend the standard Elasticsearch analyzers.
  • An analyzer determines how a string is converted into tokens, which affects how searchable the content is (its recall).
  • Analyzers are also used to split field values into the terms shown in SearchBlox filters.
  • The character that splits a value into terms is called the separator.

Mapping Files for Collections

The analyzers are mapped in the JSON files available in <SEARCHBLOX_INSTALLATION_PATH>/webapps/searchblox/WEB-INF/

The collection types and their associated mapping JSON files are as follows:

Collection           JSON file
HTTP Collection      mapping.json
File Collection      mapping.json
Email Collection     mapping.json
Database Collection  jdbc.json
CSV Collection       csv.json
Amazon S3            amazonS3.json
MongoDB              mongodb.json
Custom Collection    mapping.json

If a field is to be analyzed, map it to the relevant analyzer in the JSON file using the following format:

{
		"type": "text",
		"store": true,
		"fielddata": true,
		"analyzer": "comma_analyzer"
 },
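
Hand-edited JSON fragments are easy to break with a missing comma or quote. As a minimal sketch, the entry for one field can be generated with a JSON library so that the syntax is always valid. The helper name field_mapping and the field name "keywords" are illustrative assumptions; the keys and values are the ones shown in the format above.

```python
import json

def field_mapping(analyzer, fielddata=True):
    """Build the mapping entry for one analyzed text field (hypothetical helper)."""
    return {
        "type": "text",
        "store": True,
        "fielddata": fielddata,
        "analyzer": analyzer,
    }

# "keywords" is only an example field name.
snippet = json.dumps({"keywords": field_mapping("comma_analyzer")}, indent=2)
print(snippet)
```

The printed fragment can then be merged by hand into the relevant mapping file for the collection.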

Analyzers supported in SearchBlox are given below.

sb_analyzer

sb_analyzer treats the space, comma, and hyphen characters as separators when tokenizing indexed content. It also strips most special characters from the content during indexing.
sb_analyzer is the default analyzer for most string fields used in searches, such as title, description, and content. It is also the analyzer most commonly used for custom fields that need to be filtered.
Add the following JSON to the mapping file for the field, with "analyzer" set to sb_analyzer.

"test": {
        "type": "text",
        "store": true,
        "analyzer": "sb_analyzer",
        "fielddata": true
      },

For example, if the meta field has the following data
<meta name="test" content="world ,news, breaking news, tv radio, part-time" />,
on filtering the field using sb_analyzer, the filter would have the following terms

  • world
  • news
  • breaking
  • news
  • tv
  • radio
  • part
  • time
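
The splitting in this example can be approximated in a few lines of Python. This is only a sketch of the separator behavior described above (space, comma, and hyphen), not the SearchBlox or Elasticsearch implementation; it does not model the stripping of other special characters, and the function name sb_analyzer_terms is hypothetical.

```python
import re

def sb_analyzer_terms(text):
    # Assumption: any run of whitespace, commas, or hyphens separates terms.
    return [term for term in re.split(r"[\s,\-]+", text) if term]

terms = sb_analyzer_terms("world ,news, breaking news, tv radio, part-time")
print(terms)
```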

sb_analyzer_special

sb_analyzer_special is similar to sb_analyzer, except that special characters are not stripped from the content during indexing. Use this analyzer when special characters need to appear in context when searching.
Add the following JSON to the mapping file for the field, with "analyzer" set to sb_analyzer_special.

"test": {
        "type": "text",
        "store": true,
        "analyzer": "sb_analyzer_special",
        "fielddata": true
      },

For example, if the meta field has the following data
<meta name="test" content="world ,news, breaking news, tv radio, part-time" />,
on filtering the field using sb_analyzer_special, the filter would have the following terms

  • world
  • news
  • breaking
  • news
  • tv
  • radio
  • part-time
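
As a rough Python approximation of this example, the value can be split on spaces and commas only, so that the hyphen in part-time survives. The separator set is inferred from the term list above rather than from the analyzer definition, and the function name sb_analyzer_special_terms is hypothetical.

```python
import re

def sb_analyzer_special_terms(text):
    # Assumption: whitespace and commas separate terms; hyphens and other
    # special characters are left inside the term.
    return [term for term in re.split(r"[\s,]+", text) if term]

terms = sb_analyzer_special_terms("world ,news, breaking news, tv radio, part-time")
print(terms)
```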

comma_analyzer

comma_analyzer treats the comma character as the separator when tokenizing indexed content. Currently, comma_analyzer is used for the keywords field.
Add the following JSON to the mapping file for the field, with "analyzer" set to comma_analyzer.

"test": {
        "type": "text",
        "store": true,
        "analyzer": "comma_analyzer",
        "fielddata": true
      },

For example, if the meta field has the following data
<meta name="test" content="world ,news, breaking news, tv radio, part-time" />,
on filtering the field using comma_analyzer, the filter would have the following terms

  • world
  • news
  • breaking news
  • tv radio
  • part-time
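
A minimal Python sketch of the same split, assuming that only the comma separates terms and that whitespace around each term is trimmed (the trimming is inferred from the term list above). The function name comma_terms is hypothetical, and this is not the analyzer's actual implementation.

```python
def comma_terms(text):
    # Split on commas only, then trim surrounding whitespace and drop empties.
    return [part.strip() for part in text.split(",") if part.strip()]

terms = comma_terms("world ,news, breaking news, tv radio, part-time")
print(terms)
```

Because spaces are not separators here, multi-word values such as "breaking news" stay together as one filter term.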

pipe_analyzer

pipe_analyzer is a custom analyzer that uses the pipe character (|) as the separator. Use it when neither the comma nor the space should act as a separator.
This analyzer is not used by default in SearchBlox.
Add the following JSON to the mapping file for the field, with "analyzer" set to pipe_analyzer.

"keywords": {
        "type": "text",
        "store": true,
        "analyzer": "pipe_analyzer",
        "fielddata": true
      },

For example, if the meta field has the following data
<meta name="test" content="world news| breaking news| tv radio, part-time" />,
on filtering the field using pipe_analyzer, the filter would have the following terms

  • world news
  • breaking news
  • tv radio, part-time
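
The same behavior can be sketched in Python by splitting on the pipe character alone. The trimming of surrounding whitespace is an assumption inferred from the term list above, and the function name pipe_terms is hypothetical.

```python
def pipe_terms(text):
    # Only "|" separates terms; commas and spaces remain part of the term.
    return [part.strip() for part in text.split("|") if part.strip()]

terms = pipe_terms("world news| breaking news| tv radio, part-time")
print(terms)
```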

whitespace

The whitespace analyzer uses the space character as the separator when tokenizing indexed content.
Add the following JSON to the mapping file for the field, with "analyzer" set to whitespace.

"test": {
        "type": "text",
        "store": true,
        "analyzer": "whitespace",
        "fielddata": true
      },

For example, if the meta field has the following data
<meta name="test" content="world news breaking news tv radio part-time" />,
on filtering the field using whitespace, the filter would have the following terms

  • world
  • news
  • breaking
  • news
  • tv
  • radio
  • part-time
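
In Python terms, this corresponds closely to splitting a string on whitespace with no further processing. The sketch below is an approximation for illustration only, and the function name whitespace_terms is hypothetical.

```python
def whitespace_terms(text):
    # str.split() with no arguments splits on any run of whitespace.
    return text.split()

terms = whitespace_terms("world news breaking news tv radio part-time")
print(terms)
```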

sb_analyzer_alphanumeric

sb_analyzer_alphanumeric is similar to sb_analyzer, except that the special characters listed below are stripped from the content during indexing. Most of these characters also act as separators.
The characters stripped by this analyzer are:

Characters stripped off and treated as separators
_ . + ! # ^ & * ( ) { } > < : ; ' " ~ , - \ / []

Characters stripped off but not treated as separators
@ $ % ?

Add the following JSON to the mapping file for the field, with "analyzer" set to sb_analyzer_alphanumeric.

"description": {
        "type": "text",
        "store": true,
        "analyzer": "sb_analyzer_alphanumeric"
      },

For example, if the meta field has the following data
<meta name="sbaspl" content="cat_ pat.bat+vat!(mat)rat{sat}fat@chat$dat"/>,
on filtering the field using sb_analyzer_alphanumeric, the filter would have the following terms

  • cat
  • pat
  • bat
  • vat
  • mat
  • rat
  • sat
  • fatchatdat
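
The two character classes behave differently, which explains why fat@chat$dat collapses into a single term. A rough Python sketch of that two-step behavior follows: first delete the characters that are stripped but are not separators, then split on the separator characters. The constant and function names are hypothetical, treating whitespace as a separator is an assumption, and the real analyzer may order or implement these steps differently.

```python
import re

# Character sets copied from the two lists above.
STRIPPED_ONLY = "@$%?"
SEPARATORS = "_.+!#^&*(){}><:;'\"~,-\\/[]"

def alphanumeric_terms(text):
    # Step 1: remove characters that are stripped without splitting the term.
    for ch in STRIPPED_ONLY:
        text = text.replace(ch, "")
    # Step 2: split on any run of separator characters or whitespace.
    pattern = "[" + re.escape(SEPARATORS) + r"\s]+"
    return [term for term in re.split(pattern, text) if term]

terms = alphanumeric_terms("cat_ pat.bat+vat!(mat)rat{sat}fat@chat$dat")
print(terms)
```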

category_analyzer

category_analyzer is similar to comma_analyzer, except that the resulting terms keep their original case.

Add the following JSON to the mapping file for the field, with "analyzer" set to category_analyzer.

"test": {
        "type": "text",
        "store": true,
        "analyzer": "category_analyzer",
        "fielddata": true
      },

For example, if the meta field has the following data:
<meta name="test" content="World, news, Breaking News, TV, Part-time" />,
on filtering the field using category_analyzer, the filter would have the following terms:

  • World
  • news
  • Breaking News
  • TV
  • Part-time
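
The difference from comma_analyzer can be sketched in Python with a single flag. This assumes that comma_analyzer lowercases its terms, which the comparison above implies but does not state outright; the function name and the preserve_case parameter are hypothetical.

```python
def comma_split(text, preserve_case=False):
    # Split on commas and trim whitespace around each term.
    terms = [part.strip() for part in text.split(",") if part.strip()]
    # Assumption: comma_analyzer lowercases, category_analyzer does not.
    return terms if preserve_case else [term.lower() for term in terms]

value = "World, news, Breaking News, TV, Part-time"
category_terms = comma_split(value, preserve_case=True)
lowercased_terms = comma_split(value)
print(category_terms)
print(lowercased_terms)
```

With case preserved, "TV" and "tv" would appear as two distinct filter values, so source data should use consistent capitalization.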