A few words up front
ES ships with a lot of built-in token filters, and most of them are rarely needed in everyday work. I have been preparing for the Elastic Certified Engineer exam recently, and studying for it meant getting familiar with these less common filters. The official documentation only touches on some of them briefly, so I decided to turn my study notes into a blog post for future reference, and hopefully they will also help anyone with similar needs.
length filter
The official docs describe it as:
A token filter of type length that removes words that are too long or too short for the stream.
What this filter does is remove tokens that are too long or too short. It has two configurable parameters:
- min: the minimum token length; defaults to 0
- max: the maximum token length; defaults to Integer.MAX_VALUE
Let's run a quick test to see how it behaves:
GET _analyze{ "tokenizer" : "standard", "filter": [{"type": "length", "min":1, "max":3 }], "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"}
Output:
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 } ]}
You can see that all tokens longer than 3 characters have been filtered out.
To configure a length filter for a specific index, you can refer to the following example:
PUT /length_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["my_length"]
        }
      },
      "filter": {
        "my_length": { "type": "length", "min": 1, "max": 3 }
      }
    }
  }
}

GET length_example/_analyze
{
  "analyzer": "default",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}
ngram filter
For what the ngram filter means, you can refer to the ngram tokenizer; the latter is equivalent to a keyword tokenizer plus an ngram filter, and the effect is the same.
What it does is split the text using the N-gram algorithm. An N-gram works like a sliding window moving across a word: it is a contiguous sequence of characters of a specified length.
That sounds fairly abstract, so here is an example:
GET _analyze{ "tokenizer": "ngram", "text": "北京大学"}GET _analyze{ "tokenizer" : "keyword", "filter": [{"type": "ngram", "min_gram":1, "max_gram":2 }], "text" : "北京大学"}
You can see it has two parameters:
- min_gram: the minimum length of characters in a gram; defaults to 1
- max_gram: the maximum length of characters in a gram; defaults to 2
By default the gap between max_gram and min_gram (i.e. the step size) can be at most 1; you can change this with the index-level max_ngram_diff setting, as in the following example:
PUT /ngram_example
{
  "settings": {
    "index": { "max_ngram_diff": 10 },
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "keyword",
          "filter": ["my_ngram"]
        }
      },
      "filter": {
        "my_ngram": { "type": "ngram", "min_gram": 2, "max_gram": 4 }
      }
    }
  }
}
Test it with the index's analyzer:
GET ngram_example/_analyze
{
  "analyzer": "default",
  "text": "北京大学"
}
Output:
{ "tokens" : [ { "token" : "北京", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "北京大", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "北京大学", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "京大", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "京大学", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "大学", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 } ]}
By now you should have a basic grasp of how the ngram filter is used, but you may still wonder what it is actually good for. In practice it suits prefix matching, for example a search-suggestion feature: when the user has typed only part of a phrase, the search engine shows matches that start with that prefix, which is how the suggestion feature is implemented. A sketch of such a setup is shown below.
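Here is a minimal sketch of that idea, assuming made-up names (the index prefix_suggest_example, the field title and the sample query text are all just for illustration): the ngram analyzer is applied at index time only, while a plain keyword analyzer is used at search time, so the user's input is matched against the stored grams. For strictly prefix-only matching the built-in edge_ngram filter is usually the better choice, but the structure is the same.

PUT /prefix_suggest_example
{
  "settings": {
    "index": { "max_ngram_diff": 10 },
    "analysis": {
      "analyzer": {
        "my_gram_analyzer": {
          "tokenizer": "keyword",
          "filter": ["my_ngram"]
        }
      },
      "filter": {
        "my_ngram": { "type": "ngram", "min_gram": 1, "max_gram": 10 }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_gram_analyzer",
        "search_analyzer": "keyword"
      }
    }
  }
}

PUT /prefix_suggest_example/_doc/1
{ "title": "北京大学" }

GET /prefix_suggest_example/_search
{
  "query": { "match": { "title": "北京" } }
}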
trim filter
As the name suggests, this filter removes leading and trailing whitespace. Here is an example:
GET _analyze{ "tokenizer" : "keyword", "filter": [{"type": "trim"}], "text" : " 北京大学"}
Output:
{ "tokens" : [ { "token" : " 北京大学", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 0 } ]}
truncate filter
This filter has a length parameter and truncates the terms produced by the tokenizer so that no term is longer than length. Here is an example:
GET _analyze{ "tokenizer" : "keyword", "filter": [{"type": "truncate", "length": 3}], "text" : "北京大学"}
Output:
{ "tokens" : [ { "token" : "北京大", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 } ]}
Here is another example:
GET _analyze{ "tokenizer" : "standard", "filter": [{"type": "truncate", "length": 3}], "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."}
Output:
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "QUI", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 }, ...
This filter is useful when keyword values can be very long; truncating them helps prevent problems such as OOM.
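As a sketch of that idea (the index name truncate_example and the length of 10 are arbitrary choices for illustration), the filter can be wired into an index just like the earlier examples:

PUT /truncate_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "keyword",
          "filter": ["my_truncate"]
        }
      },
      "filter": {
        "my_truncate": { "type": "truncate", "length": 10 }
      }
    }
  }
}

For keyword fields, the ignore_above mapping parameter is a related option: instead of truncating, it simply skips indexing values that exceed the given length.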
unique filter
The unique token filter ensures that tokens with identical text are emitted only once. Here is an example:
GET _analyze{ "tokenizer": "standard", "filter": ["unique"], "text": "this is a test test test"}
Output:
{ "tokens" : [ { "token" : "this", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "a", "start_offset" : 8, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "test", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 3 } ]}
synonym token filter
The synonym filter. Its use case is this: suppose a document contains the word 番茄 (tomato); we want searches for 番茄, 西红柿 or 圣女果 to all find that document. Here is an example:
PUT /synonym_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym": {
          "tokenizer": "whitespace",
          "filter": ["my_synonym"]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}
We need to create a file analysis/synonym.txt under the config directory of the ES instance, with the following content:
番茄,西红柿,圣女果
Remember to restart.
Then test it:
GET /synonym_example/_analyze
{
  "analyzer": "synonym",
  "text": "番茄"
}
Output:
{ "tokens" : [ { "token" : "番茄", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "西红柿", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "圣女果", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 } ]}
How to combine multiple filters
We know that an analyzer can contain multiple filters, so how is that done? Take a look at this example:
GET _analyze{ "tokenizer" : "standard", "filter": [{"type": "length", "min":1, "max":4 },{"type": "truncate", "length": 3}], "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."}
In this example the length filter and the truncate filter are combined: the text is first split by the standard tokenizer, terms longer than 4 characters are then filtered out, and the remaining terms are truncated to at most 3 characters. The output is:
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "ove", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "laz", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "bon", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 } ]}
To use this in an index, refer to the following example:
PUT /length_truncate_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["my_length", "my_truncate"]
        }
      },
      "filter": {
        "my_length": { "type": "length", "min": 1, "max": 4 },
        "my_truncate": { "type": "truncate", "length": 3 }
      }
    }
  }
}

GET length_truncate_example/_analyze
{
  "analyzer": "default",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}