Preface

Elasticsearch ships with a lot of built-in token filters, and most of them never come up in everyday work. While preparing for the Elastic Certified Engineer exam I had to get familiar with these less common filters. The official documentation covers some of them only in passing, so I decided to turn my study notes into a blog post, both as a reference for myself and hopefully as a help to anyone with the same need.

length filter

The official description:

A token filter of type length that removes words that are too long or too short for the stream.

What this filter does is remove tokens that are too long or too short. It has two configurable parameters:

  • min — the minimum length, default 0
  • max — the maximum length, default Integer.MAX_VALUE

Let's run a quick test to see its effect:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "length", "min": 1, "max": 3}],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"
}

Output:

{  "tokens" : [    {      "token" : "The",      "start_offset" : 0,      "end_offset" : 3,      "type" : "<ALPHANUM>",      "position" : 0    },    {      "token" : "2",      "start_offset" : 4,      "end_offset" : 5,      "type" : "<NUM>",      "position" : 1    },    {      "token" : "the",      "start_offset" : 36,      "end_offset" : 39,      "type" : "<ALPHANUM>",      "position" : 7    }  ]}

As you can see, all tokens longer than 3 characters have been filtered out.

If you want to attach a length filter to an index, you can follow this example:

PUT /length_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["my_length"]
        }
      },
      "filter": {
        "my_length": {
          "type": "length",
          "min": 1,
          "max": 3
        }
      }
    }
  }
}

GET length_example/_analyze
{
  "analyzer": "default",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}

ngram filter

To understand the ngram filter, it helps to look at the ngram tokenizer: that tokenizer is essentially a keyword tokenizer plus an ngram filter, and produces the same result.

What it does: the text is split using the N-gram algorithm. An N-gram works like a sliding window moving across the word, emitting contiguous character sequences of a given length.

That sounds rather abstract, so here is an example:

GET _analyze
{
  "tokenizer": "ngram",
  "text": "北京大学"
}

GET _analyze
{
  "tokenizer": "keyword",
  "filter": [{"type": "ngram", "min_gram": 1, "max_gram": 2}],
  "text": "北京大学"
}

As you can see, there are two parameters:

  • min_gram — the minimum character length of a gram, default 1
  • max_gram — the maximum character length of a gram, default 2

By default the gap between max_gram and min_gram can be at most 1; you can raise that limit via the index setting max_ngram_diff, as in the following example:

PUT /ngram_example
{
  "settings": {
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "keyword",
          "filter": ["my_ngram"]
        }
      },
      "filter": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 4
        }
      }
    }
  }
}

Test it with the index's analyzer:

GET ngram_example/_analyze
{
  "analyzer": "default",
  "text": "北京大学"
}

Output:

{  "tokens" : [    {      "token" : "北京",      "start_offset" : 0,      "end_offset" : 4,      "type" : "word",      "position" : 0    },    {      "token" : "北京大",      "start_offset" : 0,      "end_offset" : 4,      "type" : "word",      "position" : 0    },    {      "token" : "北京大学",      "start_offset" : 0,      "end_offset" : 4,      "type" : "word",      "position" : 0    },    {      "token" : "京大",      "start_offset" : 0,      "end_offset" : 4,      "type" : "word",      "position" : 0    },    {      "token" : "京大学",      "start_offset" : 0,      "end_offset" : 4,      "type" : "word",      "position" : 0    },    {      "token" : "大学",      "start_offset" : 0,      "end_offset" : 4,      "type" : "word",      "position" : 0    }  ]}

By now you should have a basic feel for the ngram filter, but you may still wonder when it is actually used. In practice it is a good fit for matching on word fragments and prefixes, for example in search suggestions: when you have typed only part of a phrase, the search engine can show entries that start with that fragment, which is how suggestion features are built. A sketch of such a setup follows.
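Here is a minimal sketch of that idea. The index name suggest_example, the title field, the sample document and the query are my own assumptions for illustration, not from the official docs: the field is indexed with an ngram-based analyzer but searched with the standard analyzer, so typing a fragment such as 北京 matches the grams indexed for 北京大学.

PUT /suggest_example
{
  "settings": {
    "index": { "max_ngram_diff": 10 },
    "analysis": {
      "filter": {
        "my_ngram": { "type": "ngram", "min_gram": 1, "max_gram": 10 }
      },
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "my_ngram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

PUT /suggest_example/_doc/1
{ "title": "北京大学" }

GET /suggest_example/_search
{
  "query": { "match": { "title": "北京" } }
}

For strictly prefix-only suggestions the edge_ngram filter is the more common choice; plain ngram, as used here, also matches fragments from the middle of a word.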

trim filter

As the name suggests, this filter removes leading and trailing whitespace. Here is an example:

GET _analyze
{
  "tokenizer": "keyword",
  "filter": [{"type": "trim"}],
  "text": " 北京大学"
}

Output:

{  "tokens" : [    {      "token" : " 北京大学",      "start_offset" : 0,      "end_offset" : 5,      "type" : "word",      "position" : 0    }  ]}

truncate filter

This filter has a length parameter (default 10) and truncates tokens so that no term is longer than length characters. Here is an example:

GET _analyze
{
  "tokenizer": "keyword",
  "filter": [{"type": "truncate", "length": 3}],
  "text": "北京大学"
}

Output:

{  "tokens" : [    {      "token" : "北京大",      "start_offset" : 0,      "end_offset" : 4,      "type" : "word",      "position" : 0    }  ]}

One more example:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "truncate", "length": 3}],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

{  "tokens" : [    {      "token" : "The",      "start_offset" : 0,      "end_offset" : 3,      "type" : "<ALPHANUM>",      "position" : 0    },    {      "token" : "2",      "start_offset" : 4,      "end_offset" : 5,      "type" : "<NUM>",      "position" : 1    },    {      "token" : "QUI",      "start_offset" : 6,      "end_offset" : 11,      "type" : "<ALPHANUM>",      "position" : 2    },    ...    

When keyword values can be very long, this filter can be used to guard against problems such as OOM errors.
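For example, you could cap every indexed term at a fixed size with an index-level analyzer. This is only a sketch with names and numbers of my own choosing (the index truncate_keyword_example, the filter name my_truncate, and the 100-character limit are illustrative):

PUT /truncate_keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "keyword",
          "filter": ["my_truncate"]
        }
      },
      "filter": {
        "my_truncate": {
          "type": "truncate",
          "length": 100
        }
      }
    }
  }
}

Any value indexed through this analyzer is cut off at 100 characters, which keeps individual terms small (Lucene itself rejects single terms longer than 32,766 bytes).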

unique filter

The unique token filter simply ensures that identical tokens are emitted only once. Here is an example:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["unique"],
  "text": "this is a test test test"
}

Output:

{  "tokens" : [    {      "token" : "this",      "start_offset" : 0,      "end_offset" : 4,      "type" : "<ALPHANUM>",      "position" : 0    },    {      "token" : "is",      "start_offset" : 5,      "end_offset" : 7,      "type" : "<ALPHANUM>",      "position" : 1    },    {      "token" : "a",      "start_offset" : 8,      "end_offset" : 9,      "type" : "<ALPHANUM>",      "position" : 2    },    {      "token" : "test",      "start_offset" : 10,      "end_offset" : 14,      "type" : "<ALPHANUM>",      "position" : 3    }  ]}

synonym token filter

The synonym filter. Its typical use case: suppose a document contains the word 番茄 (tomato); we would like searches for 番茄, 西红柿 or 圣女果 (all Chinese words for tomato) to find that document. For example:

PUT /synonym_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym": {
          "tokenizer": "whitespace",
          "filter": ["my_synonym"]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}

We need to create a file analysis/synonym.txt under the config directory of the ES node, with the following content:

番茄,西红柿,圣女果

Remember to restart Elasticsearch afterwards.
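As a side note, if you would rather not manage a file on disk, the synonym filter also accepts the synonym list inline through its synonyms parameter. A minimal sketch; the index name synonym_inline_example is my own:

PUT /synonym_inline_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym": {
          "tokenizer": "whitespace",
          "filter": ["my_synonym"]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["番茄,西红柿,圣女果"]
        }
      }
    }
  }
}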

Now test it against the file-based synonym_example index:

GET /synonym_example/_analyze
{
  "analyzer": "synonym",
  "text": "番茄"
}

Output:

{  "tokens" : [    {      "token" : "番茄",      "start_offset" : 0,      "end_offset" : 2,      "type" : "word",      "position" : 0    },    {      "token" : "西红柿",      "start_offset" : 0,      "end_offset" : 2,      "type" : "SYNONYM",      "position" : 0    },    {      "token" : "圣女果",      "start_offset" : 0,      "end_offset" : 2,      "type" : "SYNONYM",      "position" : 0    }  ]}

How to combine multiple filters

We know that an analyzer can contain multiple token filters, so how do we wire that up? Look at the following example:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {"type": "length", "min": 1, "max": 4},
    {"type": "truncate", "length": 3}
  ],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

In this example we chain the length filter and the truncate filter: the text is first split by the standard tokenizer, terms longer than 4 characters are dropped, and the remaining terms are then truncated to 3 characters. The output is:

{  "tokens" : [    {      "token" : "The",      "start_offset" : 0,      "end_offset" : 3,      "type" : "<ALPHANUM>",      "position" : 0    },    {      "token" : "2",      "start_offset" : 4,      "end_offset" : 5,      "type" : "<NUM>",      "position" : 1    },    {      "token" : "ove",      "start_offset" : 31,      "end_offset" : 35,      "type" : "<ALPHANUM>",      "position" : 6    },    {      "token" : "the",      "start_offset" : 36,      "end_offset" : 39,      "type" : "<ALPHANUM>",      "position" : 7    },    {      "token" : "laz",      "start_offset" : 40,      "end_offset" : 44,      "type" : "<ALPHANUM>",      "position" : 8    },    {      "token" : "bon",      "start_offset" : 51,      "end_offset" : 55,      "type" : "<ALPHANUM>",      "position" : 10    }  ]}

To use this in an index, refer to the example below:

PUT /length_truncate_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["my_length", "my_truncate"]
        }
      },
      "filter": {
        "my_length": {
          "type": "length",
          "min": 1,
          "max": 4
        },
        "my_truncate": {
          "type": "truncate",
          "length": 3
        }
      }
    }
  }
}

GET length_truncate_example/_analyze
{
  "analyzer": "default",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}
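One last point worth remembering: token filters run in the order they are listed, so swapping them changes the result. A small sketch of my own to illustrate; compare it with the output above:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {"type": "truncate", "length": 3},
    {"type": "length", "min": 1, "max": 4}
  ],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Here every token is truncated to 3 characters before the length filter runs, so nothing exceeds the 4-character limit and no token is dropped; with the original order, longer words such as QUICK and jumped were removed before truncation ever happened.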