Custom Analyzers
- SearchBlox supports custom Elasticsearch analyzers which have been extended from standard analyzers in Elasticsearch.
- The analyzers determine how a string is converted to tokens to improve their searchability or recall.
- They are also used to split the terms for the filters used in SearchBlox.
- The character that splits a term is called the separator.
- See the SearchBlox documentation on using Custom Fields in Search to learn more.
Mapping Files for Collections
Mapping files, such as mapping.json, are generated separately for each collection. The analyzers are mapped in the JSON files available in <SEARCHBLOX_INSTALLATION_PATH>/webapps/ROOT/WEB-INF/mappings/collections/
If fields are to be analyzed, then they have to be mapped to the relevant analyzer in the JSON file in the following format:
{
"type": "text",
"store": true,
"fielddata": true,
"analyzer": "comma_analyzer"
},
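As a minimal sketch, the field mapping format above can be built and checked in Python before it is pasted into a mapping file. The helper name field_mapping and the field name "test" are hypothetical, and where the fragment belongs inside mapping.json is not shown here, so treat this only as a way to produce well-formed JSON for one field.

```python
import json


def field_mapping(analyzer, fielddata=True):
    """Build one field's mapping fragment in the format shown above.

    Hypothetical helper: it only assembles the keys that appear in this
    document (type, store, analyzer, fielddata).
    """
    mapping = {"type": "text", "store": True, "analyzer": analyzer}
    if fielddata:
        mapping["fielddata"] = True
    return mapping


# "test" is an illustrative custom field name, not a required one.
fragment = {"test": field_mapping("comma_analyzer")}

# json.dumps guarantees valid JSON (no missing or trailing commas).
fragment_json = json.dumps(fragment, indent=2)
```

Generating the fragment this way avoids hand-editing mistakes such as a missing comma between keys.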
Analyzers supported in SearchBlox are given below.
sb_analyzer
sb_analyzer treats the space, comma, and hyphen characters as separators when tokenizing indexed content. It strips most special characters from the content while indexing.
sb_analyzer is the default analyzer for most string fields used in searches such as title, description, and content. This is the most common analyzer used for custom fields in order to filter them.
The following JSON code needs to be specified in the JSON file for the specific field and the analyzer has to be mentioned as sb_analyzer in the "analyzer" field.
"test": {
"type": "text",
"store": true,
"analyzer": "sb_analyzer",
"fielddata": true
},
For example, if the meta field has the following data:
<meta name="test" content="world ,news, breaking news, tv radio, part-time" />
then filtering the field using sb_analyzer produces the following terms:
- world
- news
- breaking
- news
- tv
- radio
- part
- time
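The splitting behavior in this example can be approximated with a few lines of Python. This is only a sketch of the separator rule described above (space, comma, and hyphen), not the Elasticsearch analyzer itself. The function name sb_analyzer_sketch is hypothetical, and the sketch does not reproduce the stripping of other special characters.

```python
import re


def sb_analyzer_sketch(text):
    """Approximate sb_analyzer tokenization: split on runs of
    whitespace, commas, and hyphens, and drop empty pieces."""
    return [token for token in re.split(r"[\s,\-]+", text) if token]


terms = sb_analyzer_sketch("world ,news, breaking news, tv radio, part-time")
# terms == ["world", "news", "breaking", "news", "tv", "radio", "part", "time"]
```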
sb_analyzer_special
sb_analyzer_special is similar to sb_analyzer, except that special characters are not stripped from the content while indexing. This analyzer makes special characters appear in context while searching.
The following JSON code needs to be specified in the JSON file for the specific field and the analyzer has to be mentioned as sb_analyzer_special in the "analyzer" field.
"test": {
"type": "text",
"store": true,
"analyzer": "sb_analyzer_special",
"fielddata": true
},
For example, if the meta field has the following data:
<meta name="test" content="world ,news, breaking news, tv radio, part-time" />
then filtering the field using sb_analyzer_special produces the following terms:
- world
- news
- breaking
- news
- tv
- radio
- part-time
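Judging from the example terms above, sb_analyzer_special still splits on spaces and commas but keeps the hyphen inside a term. The Python sketch below reproduces only that observed behavior. The function name is hypothetical, and the exact set of characters the real analyzer preserves is an assumption drawn from this one example.

```python
import re


def sb_analyzer_special_sketch(text):
    """Approximate sb_analyzer_special: split on runs of whitespace and
    commas only, so characters such as the hyphen stay inside a term."""
    return [token for token in re.split(r"[\s,]+", text) if token]


terms = sb_analyzer_special_sketch("world ,news, breaking news, tv radio, part-time")
# terms == ["world", "news", "breaking", "news", "tv", "radio", "part-time"]
```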
comma_analyzer
comma_analyzer treats the comma character as the separator when tokenizing indexed content. Currently, comma_analyzer is used for the keywords field.
The following JSON code needs to be specified in the JSON file for the specific field and the analyzer has to be mentioned as comma_analyzer in the "analyzer" field.
"test": {
"type": "text",
"store": true,
"analyzer": "comma_analyzer",
"fielddata": true
},
For example, if the meta field has the following data:
<meta name="test" content="world ,news, breaking news, tv radio, part-time" />
then filtering the field using comma_analyzer produces the following terms:
- world
- news
- breaking news
- tv radio
- part-time
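A rough Python equivalent of this comma-based splitting is shown below. The function name is hypothetical. Trimming surrounding spaces follows the example terms, and lowercasing is an assumption inferred from the note that category_analyzer, unlike comma_analyzer, keeps values case sensitive.

```python
def comma_analyzer_sketch(text):
    """Approximate comma_analyzer: split on commas, trim surrounding
    whitespace, lowercase (assumed), and drop empty pieces."""
    return [part.strip().lower() for part in text.split(",") if part.strip()]


terms = comma_analyzer_sketch("world ,news, breaking news, tv radio, part-time")
# terms == ["world", "news", "breaking news", "tv radio", "part-time"]
```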
pipe_analyzer
pipe_analyzer is a custom analyzer that uses the pipe character (|) as the separator. It can be used when neither the comma nor the space should act as a separator.
This analyzer is not used by default in SearchBlox.
The following JSON code needs to be specified in the JSON file for the specific field and the analyzer has to be mentioned as pipe_analyzer in the "analyzer" field.
"keywords": {
"type": "text",
"store": true,
"analyzer": "pipe_analyzer",
"fielddata": true
},
For example, if the meta field has the following data:
<meta name="test" content="world news| breaking news| tv radio, part-time" />
then filtering the field using pipe_analyzer produces the following terms:
- world news
- breaking news
- tv radio, part-time
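The same idea can be sketched in Python for the pipe separator. The function name is hypothetical, and the sketch only mirrors the splitting and trimming visible in the example. It makes no claim about case handling in the real analyzer.

```python
def pipe_analyzer_sketch(text):
    """Approximate pipe_analyzer: split on the pipe character only and
    trim surrounding whitespace; commas and spaces stay inside terms."""
    return [part.strip() for part in text.split("|") if part.strip()]


terms = pipe_analyzer_sketch("world news| breaking news| tv radio, part-time")
# terms == ["world news", "breaking news", "tv radio, part-time"]
```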
whitespace
The whitespace analyzer uses the space character as the separator when tokenizing indexed content.
The following JSON code needs to be specified in the JSON file for the specific field and the analyzer has to be mentioned as whitespace in the "analyzer" field.
"test": {
"type": "text",
"store": true,
"analyzer": "whitespace",
"fielddata": true
},
For example, if the meta field has the following data:
<meta name="test" content="world news breaking news tv radio part-time" />
then filtering the field using whitespace produces the following terms:
- world
- news
- breaking
- news
- tv
- radio
- part-time
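In Python terms, this behaves much like calling split() with no arguments. The sketch below is an approximation under that assumption, with a hypothetical function name. It leaves case and punctuation untouched.

```python
def whitespace_analyzer_sketch(text):
    """Approximate the whitespace analyzer: split on runs of whitespace
    and leave everything else, including hyphens, unchanged."""
    return text.split()


terms = whitespace_analyzer_sketch("world news breaking news tv radio part-time")
# terms == ["world", "news", "breaking", "news", "tv", "radio", "part-time"]
```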
sb_analyzer_alphanumeric
sb_analyzer_alphanumeric is similar to sb_analyzer, except that the special characters listed below are stripped from the content while indexing. Most of these characters also act as separators.
The characters stripped when using this analyzer are:
Characters stripped and treated as separators
_ . + ! # ^ & * ( ) { } > < : ; ' " ~ , - \ / []
Characters stripped but not treated as separators
@ $ % ?
The following JSON code needs to be specified in the JSON file for the specific field and the analyzer has to be mentioned as sb_analyzer_alphanumeric in the "analyzer" field.
"description": {
"type": "text",
"store": true,
"analyzer": "sb_analyzer_alphanumeric"
},
For example, if the meta field has the following data:
<meta name="sbaspl" content="cat_ pat.bat+vat!(mat)rat{sat}fat@chat$dat"/>
then filtering the field using sb_analyzer_alphanumeric produces the following terms:
- cat
- pat
- bat
- vat
- mat
- rat
- sat
- fatchatdat
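The two character lists above can be modeled in Python with a translation table: separator characters become spaces, and the remaining stripped characters are deleted outright. This is a sketch of the documented rule only. The names SEPARATORS, REMOVED, and sb_analyzer_alphanumeric_sketch are hypothetical, and splitting on whitespace is assumed from the example.

```python
# Characters that are stripped and also split terms (from the list above).
SEPARATORS = "_.+!#^&*(){}><:;'\"~,-\\/[]"
# Characters that are stripped without splitting the term.
REMOVED = "@$%?"

# Map every separator to a space and delete the REMOVED characters.
_TABLE = str.maketrans(SEPARATORS, " " * len(SEPARATORS), REMOVED)


def sb_analyzer_alphanumeric_sketch(text):
    """Approximate sb_analyzer_alphanumeric using the two lists above."""
    return text.translate(_TABLE).split()


terms = sb_analyzer_alphanumeric_sketch("cat_ pat.bat+vat!(mat)rat{sat}fat@chat$dat")
# terms == ["cat", "pat", "bat", "vat", "mat", "rat", "sat", "fatchatdat"]
```

Because @ and $ are deleted rather than replaced with a space, fat@chat$dat collapses into the single term fatchatdat.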
category_analyzer
category_analyzer is similar to comma_analyzer except that its resulting values remain case sensitive.
The following JSON code needs to be specified in the JSON file for the specific field and the analyzer has to be mentioned as category_analyzer in the "analyzer" field.
"test": {
"type": "text",
"store": true,
"analyzer": "category_analyzer",
"fielddata": true
},
For example, if the meta field has the following data:
<meta name="test" content="World, news, Breaking News, TV, Part-time" />
then filtering the field using category_analyzer produces the following terms:
- World
- news
- Breaking News
- TV
- Part-time
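A Python approximation of this case-preserving comma split is shown below. The function name is hypothetical, and the sketch only mirrors what the example shows: split on commas, trim spaces, keep the original case.

```python
def category_analyzer_sketch(text):
    """Approximate category_analyzer: split on commas and trim
    whitespace, leaving the case of each value unchanged."""
    return [part.strip() for part in text.split(",") if part.strip()]


terms = category_analyzer_sketch("World, news, Breaking News, TV, Part-time")
# terms == ["World", "news", "Breaking News", "TV", "Part-time"]
```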