Custom Analyzers

  • SearchBlox supports custom OpenSearch analyzers, which extend the standard OpenSearch analyzers.
  • Analyzers determine how a string is broken into tokens, which in turn affects search matching and recall.
  • They are also used to split terms for filters in SearchBlox.
  • The character that splits a term is called a separator.
  • For related information, see Using Custom Fields in Search.

Mapping Files for Collections

  • A separate mapping file, such as mapping.json, is created for each collection. The analyzer assigned to each field is specified in these JSON files, which are located at <SEARCHBLOX_INSTALLATION_PATH>/webapps/ROOT/WEB-INF/mappings/collections/.
  • If you want a field to be analyzed, map it to the appropriate analyzer in the JSON file using the following format:
{
        "type": "text",
        "store": true,
        "fielddata": true,
        "analyzer": "comma_analyzer"
      },
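
Hand-edited fragments are easy to break with a missing comma or quote. As a minimal sketch, a field entry can be generated with Python's json module so that the output is always well-formed. The helper name field_mapping and the field name "test" are illustrative assumptions, and where the entry belongs inside a collection's mapping file is not shown here.

```python
import json

def field_mapping(analyzer, fielddata=True):
    """Hypothetical helper: build the mapping entry for one text field."""
    return {
        "type": "text",
        "store": True,
        "fielddata": fielddata,
        "analyzer": analyzer,
    }

# Serialize one field entry; json.dumps guarantees valid JSON syntax.
fragment = json.dumps({"test": field_mapping("comma_analyzer")}, indent=2)
print(fragment)
```

The printed entry can then be merged by hand into the mapping file for the collection.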

The analyzers supported in SearchBlox are described below.

sb_analyzer

  • sb_analyzer uses spaces, commas, and hyphens as separators to break content into tokens. It also removes most special characters during indexing.
  • sb_analyzer is the default analyzer for common string fields such as title, description, and content, and is often used for custom fields to enable filtering.
  • To use it for a specific field, specify sb_analyzer in the "analyzer" field in the JSON mapping file.
"test": {
        "type": "text",
        "store": true,
        "analyzer": "sb_analyzer",
        "fielddata": true
      },
  • For example, if a meta field contains:
    <meta name="test" content="world ,news, breaking news, tv radio, part-time" />

  • When sb_analyzer is applied to this field, the following terms are generated:

    • world
    • news
    • breaking
    • news
    • tv
    • radio
    • part
    • time
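
The splitting rule can be approximated outside SearchBlox with a short Python sketch. This is an illustration of the separators described above, not the actual analyzer definition: the function name is hypothetical, and the removal of other special characters is not modeled.

```python
import re

def sb_analyzer_tokens(text):
    """Approximate sb_analyzer: split on spaces, commas, and hyphens."""
    # Runs of separators count as one split point; empty pieces are dropped.
    return [tok for tok in re.split(r"[\s,\-]+", text) if tok]

print(sb_analyzer_tokens("world ,news, breaking news, tv radio, part-time"))
# ['world', 'news', 'breaking', 'news', 'tv', 'radio', 'part', 'time']
```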

sb_analyzer_special

sb_analyzer_special works like sb_analyzer, but it keeps special characters in the content during indexing. This allows special characters to appear in the search context.

To use it for a field, specify sb_analyzer_special in the "analyzer" field in the JSON mapping file.

"test": {
        "type": "text",
        "store": true,
        "analyzer": "sb_analyzer_special",
        "fielddata": true
      },

For example, if a meta field contains:
<meta name="test" content="world ,news, breaking news, tv radio, part-time" />

When sb_analyzer_special is applied to this field, the following terms are generated:

  1. world
  2. news
  3. breaking
  4. news
  5. tv
  6. radio
  7. part-time

comma_analyzer

comma_analyzer uses the comma character to split content into tokens. It is commonly used for the keywords field.

To use it for a field, specify comma_analyzer in the "analyzer" field in the JSON mapping file.

"test": {
        "type": "text",
        "store": true,
        "analyzer": "comma_analyzer",
        "fielddata": true
      },

For example, if a meta field contains:
<meta name="test" content="world ,news, breaking news, tv radio, part-time" />

When comma_analyzer is applied to this field, the following terms are generated:

  1. world
  2. news
  3. breaking news
  4. tv radio
  5. part-time
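
As a rough sketch of this behavior in Python, the content can be split on commas only, with surrounding spaces trimmed from each piece. The function name is hypothetical, and case handling is left out of this approximation.

```python
def comma_analyzer_tokens(text):
    """Approximate comma_analyzer: commas are the only separator."""
    # Spaces inside a piece are kept; leading and trailing spaces are trimmed.
    return [part.strip() for part in text.split(",") if part.strip()]

print(comma_analyzer_tokens("world ,news, breaking news, tv radio, part-time"))
# ['world', 'news', 'breaking news', 'tv radio', 'part-time']
```

Multi-word values such as "breaking news" stay together because the space is not treated as a separator.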

pipe_analyzer

pipe_analyzer is a custom analyzer that uses the pipe (|) character as a separator. It is useful when the values themselves contain commas or spaces, so neither can serve as a separator. This analyzer is not used by default in SearchBlox.

To use it for a field, specify pipe_analyzer in the "analyzer" field in the JSON mapping file.

"keywords": {
        "type": "text",
        "store": true,
        "analyzer": "pipe_analyzer",
        "fielddata": true
      },

For example, if a meta field contains:
<meta name="test" content="world news| breaking news| tv radio, part-time" />

When pipe_analyzer is applied to this field, the following terms are generated:

  1. world news
  2. breaking news
  3. tv radio, part-time

whitespace

The whitespace analyzer splits content into tokens wherever whitespace occurs. To use it for a field, specify whitespace in the "analyzer" field in the JSON mapping file.

"test": {
        "type": "text",
        "store": true,
        "analyzer": "whitespace",
        "fielddata": true
      },

For example, if a meta field contains:
<meta name="test" content="world news breaking news tv radio part-time" />

When the whitespace analyzer is applied to this field, the following terms are generated:

  1. world
  2. news
  3. breaking
  4. news
  5. tv
  6. radio
  7. part-time

sb_analyzer_alphanumeric

sb_analyzer_alphanumeric is similar to sb_analyzer, but certain special characters are removed during indexing. Most special characters are also used as separators.

Characters that are stripped and also act as separators:
_ . + ! # ^ & * ( ) { } > < : ; ' " ~ , - \ / [ ]

Characters that are stripped but do not act as separators:
@ $ % ?

To use it for a field, specify sb_analyzer_alphanumeric in the "analyzer" field in the JSON mapping file.

"description": {
        "type": "text",
        "store": true,
        "analyzer": "sb_analyzer_alphanumeric"
      },

For example, if a meta field contains:
<meta name="sbaspl" content="cat_ pat.bat+vat!(mat)rat{sat}fat@chat$dat" />

When sb_analyzer_alphanumeric is applied to this field, the following terms are generated:

  1. cat
  2. pat
  3. bat
  4. vat
  5. mat
  6. rat
  7. sat
  8. fatchatdat
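
The difference between the two character groups can be illustrated with a small Python sketch. It is an approximation built from the character lists above, not the analyzer's real definition, and the names used are hypothetical. Characters in the second group are deleted before splitting, which is why "fat@chat$dat" collapses into a single term.

```python
# Characters that are removed and also split terms.
SEPARATORS = set("_.+!#^&*(){}><:;'\"~,-\\/[]")
# Characters that are removed without splitting the term.
STRIPPED_ONLY = set("@$%?")

def alphanumeric_tokens(text):
    """Approximate sb_analyzer_alphanumeric using the two character lists."""
    # Step 1: delete the strip-only characters so neighbors join together.
    kept = "".join(ch for ch in text if ch not in STRIPPED_ONLY)
    # Step 2: turn every separator into a space, then split on whitespace.
    spaced = "".join(" " if ch in SEPARATORS else ch for ch in kept)
    return spaced.split()

print(alphanumeric_tokens("cat_ pat.bat+vat!(mat)rat{sat}fat@chat$dat"))
# ['cat', 'pat', 'bat', 'vat', 'mat', 'rat', 'sat', 'fatchatdat']
```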

category_analyzer

category_analyzer works like comma_analyzer, but it keeps the original case of the values, so the generated terms are case-sensitive.

To use it for a field, specify category_analyzer in the "analyzer" field in the JSON mapping file.

"test": {
        "type": "text",
        "store": true,
        "analyzer": "category_analyzer",
        "fielddata": true
      },

For example, if a meta field contains:
<meta name="test" content="World, news, Breaking News, TV, Part-time" />

When category_analyzer is applied to this field, the following terms are generated:

  1. World
  2. news
  3. Breaking News
  4. TV
  5. Part-time
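
The effect of preserving case can be sketched in Python as follows. The function and its keep_case flag are hypothetical, and the lowercasing shown for the comparison is an assumption inferred from the statement that category_analyzer differs from comma_analyzer by keeping the original case.

```python
def split_on_commas(text, keep_case):
    """Split on commas; optionally lowercase (assumed comma_analyzer behavior)."""
    parts = [p.strip() for p in text.split(",") if p.strip()]
    return parts if keep_case else [p.lower() for p in parts]

content = "World, news, Breaking News, TV, Part-time"
category_terms = split_on_commas(content, keep_case=True)
lowercased_terms = split_on_commas(content, keep_case=False)

print(category_terms)    # ['World', 'news', 'Breaking News', 'TV', 'Part-time']
print(lowercased_terms)  # ['world', 'news', 'breaking news', 'tv', 'part-time']
```

With case preserved, a filter value of "TV" and a filter value of "tv" are distinct terms.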