Before we begin
ES ships with a lot of built-in token filters, most of which you rarely need in day-to-day work. I've recently been preparing for the Elastic Certified Engineer exam, and studying for it meant getting familiar with these less common filters. The official ES docs only touch on some of them briefly, so I decided to tidy my study notes into a blog post for future reference, and hopefully it helps others with the same need.
length filter
Official description:
A token filter of type length that removes words that are too long or too short for the stream.
This filter removes tokens that are too long or too short for the stream. It has two configurable parameters:
- min: the minimum token length; defaults to 0
- max: the maximum token length; defaults to Integer.MAX_VALUE
Let's start with a quick test of what it does:
GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "length", "min": 1, "max": 3}],
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"
}
Output:
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}
As you can see, all tokens longer than 3 characters have been filtered out.
To configure a length filter for an index, refer to the following example:
PUT /length_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["my_length"]
        }
      },
      "filter" : {
        "my_length" : {
          "type" : "length",
          "min" : 1,
          "max" : 3
        }
      }
    }
  }
}
GET length_example/_analyze
{
  "analyzer": "default",
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}
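If everything is wired up correctly, this should return only the terms The, 2 and the, just like the inline test earlier; note that the text here ends in bonet, which at 5 characters is filtered out as well.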
ngram filter
For what the ngram filter does, you can refer to the ngram tokenizer: the latter is equivalent to a keyword tokenizer combined with an ngram filter, and the results are the same.
In short, it splits the text using the N-gram algorithm: an N-gram is a contiguous character sequence of a given length, like a sliding window moving across a word.
That sounds rather abstract, so here's an example:
GET _analyze
{
  "tokenizer": "ngram",
  "text": "北京大学"
}
GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "ngram", "min_gram": 1, "max_gram": 2}],
  "text" : "北京大学"
}
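Both requests should produce the same result. With min_gram 1 and max_gram 2 (which are also the ngram tokenizer's defaults), 北京大学 is split into: 北, 北京, 京, 京大, 大, 大学, 学.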
As you can see, there are two parameters:
- min_gram: the minimum gram length; defaults to 1
- max_gram: the maximum gram length; defaults to 2
By default the difference between max_gram and min_gram may be at most 1; you can raise that limit with the index-level max_ngram_diff setting, as in this example:
PUT /ngram_example
{
  "settings" : {
    "index": {"max_ngram_diff": 10},
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "keyword",
          "filter" : ["my_ngram"]
        }
      },
      "filter" : {
        "my_ngram" : {
          "type" : "ngram",
          "min_gram" : 2,
          "max_gram" : 4
        }
      }
    }
  }
}
Test it with the index's analyzer:
GET ngram_example/_analyze
{
  "analyzer": "default",
  "text" : "北京大学"
}
Output:
{
  "tokens" : [
    {
      "token" : "北京",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "北京大",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "北京大学",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "京大",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "京大学",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "大学",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
Notice that every emitted token keeps the offsets of the original token (0 to 4): the filter runs after the keyword tokenizer has already produced a single token.
By now you should have a basic grasp of the ngram filter, but you might still wonder where it's actually useful. It's well suited to prefix-style retrieval, for example search suggestions: when you have typed only part of a phrase, the search engine shows matches that begin with that fragment.
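To make that concrete, here is a minimal sketch of such a setup; the index name suggest_example and the field title are made up for illustration. Documents are indexed with the ngram analyzer, while queries are analyzed with the plain keyword analyzer, so a partial input such as 北京 matches the stored grams. (For strict prefix-only matching, the edge_ngram filter is usually the more precise choice.)
PUT /suggest_example
{
  "settings" : {
    "index": {"max_ngram_diff": 10},
    "analysis" : {
      "analyzer" : {
        "my_ngram_analyzer" : {
          "tokenizer" : "keyword",
          "filter" : ["my_ngram"]
        }
      },
      "filter" : {
        "my_ngram" : {
          "type" : "ngram",
          "min_gram" : 1,
          "max_gram" : 10
        }
      }
    }
  },
  "mappings" : {
    "properties" : {
      "title" : {
        "type" : "text",
        "analyzer" : "my_ngram_analyzer",
        "search_analyzer" : "keyword"
      }
    }
  }
}
PUT /suggest_example/_doc/1
{"title": "北京大学"}
GET /suggest_example/_search
{
  "query": {"match": {"title": "北京"}}
}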
trim filter
As its name suggests, this filter removes leading and trailing whitespace from each token. An example (the text below ends with a space, which the offsets in the output reflect):
GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "trim"}],
  "text" : "北京大学 "
}
Output:
{
  "tokens" : [
    {
      "token" : "北京大学",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}
truncate filter
This filter has a length parameter (defaulting to 10) and truncates the terms produced by the tokenizer so that no term is longer than length characters. Here's an example:
GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "truncate", "length": 3}],
  "text" : "北京大学"
}
Output:
{
  "tokens" : [
    {
      "token" : "北京大",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
Another example:
GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "truncate", "length": 3}],
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "QUI",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    ...
In scenarios where keywords can get very long, this filter is useful to guard against problems such as OOM errors.
unique filter
The unique token filter guarantees that tokens with identical text appear only once. An example:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["unique"],
  "text": "this is a test test test"
}
Output:
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
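By default, duplicates are removed across the entire token stream. The filter also has an only_on_same_position option that restricts de-duplication to tokens sitting at the same position (useful, for instance, when a synonym filter stacks identical tokens on one position). A minimal sketch of a custom filter using it; the index name unique_example is just for illustration:
PUT /unique_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["my_unique"]
        }
      },
      "filter" : {
        "my_unique" : {
          "type" : "unique",
          "only_on_same_position" : true
        }
      }
    }
  }
}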
synonym token filter
The synonym filter handles synonyms. The typical scenario: a document contains the word 番茄 (tomato), and we want searches for 番茄, 西红柿 (another word for tomato), or 圣女果 (cherry tomato) to all find that document. An example:
PUT /synonym_example
{
  "settings": {
    "analysis" : {
      "analyzer" : {
        "synonym" : {
          "tokenizer" : "whitespace",
          "filter" : ["my_synonym"]
        }
      },
      "filter" : {
        "my_synonym" : {
          "type" : "synonym",
          "synonyms_path" : "analysis/synonym.txt"
        }
      }
    }
  }
}
We need to create a file analysis/synonym.txt under the config directory of the ES instance, with the following content:
番茄, 西红柿, 圣女果
Remember to restart the node afterwards.
Then test it:
GET /synonym_example/_analyze
{
  "analyzer": "synonym",
  "text": "番茄"
}
Output:
{
  "tokens" : [
    {
      "token" : "番茄",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "西红柿",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "圣女果",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
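By the way, for a short synonym list you can also define the synonyms inline via the synonyms parameter instead of synonyms_path, which saves you from managing a file on disk. The filter above would then look like this:
"my_synonym" : {
  "type" : "synonym",
  "synonyms" : ["番茄, 西红柿, 圣女果"]
}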
How to combine multiple filters
We know that an analyzer can contain multiple filters, so how is that done? Take a look at the following example:
GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "length", "min": 1, "max": 4}, {"type": "truncate", "length": 3}],
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Here we chain the length filter and the truncate filter. The text is first split by the standard tokenizer; the length filter then drops every term longer than 4 characters, and the surviving terms are finally truncated to at most 3 characters. The output is:
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "ove",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "laz",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "bon",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
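One thing to keep in mind: the filters run in the order they are listed. If you swapped the two, truncate would run first, every token would already be at most 3 characters when it reaches the length filter, and nothing would be dropped; terms like QUI, Bro, Fox, jum and dog would then show up in the output:
GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "truncate", "length": 3}, {"type": "length", "min": 1, "max": 4}],
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}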
To use this combination in an index, refer to the following example:
PUT /length_truncate_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["my_length", "my_truncate"]
        }
      },
      "filter" : {
        "my_length" : {
          "type" : "length",
          "min" : 1,
          "max" : 4
        },
        "my_truncate" : {
          "type" : "truncate",
          "length" : 3
        }
      }
    }
  }
}
GET length_truncate_example/_analyze
{
  "analyzer": "default",
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}
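This should return the same terms as the inline test above, minus the last one: The, 2, ove, the and laz. The final word here is bonet (5 characters), so the length filter drops it before truncate ever sees it.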