About Elasticsearch: a summary of some rarely used ES filters

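Preface

ES ships with many built-in token filters, and most of them never come up in day-to-day work. I have been preparing for the Elastic Certified Engineer exam lately, and that requires being familiar with these rarely used filters. The official ES documentation only touches on some of them briefly, so I decided to turn my study notes into a blog post for my own reference, and hopefully to help anyone else who needs the same thing.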

length filter

The official documentation explains it as follows:

A token filter of type length that removes words that are too long or too short for the stream.

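This filter removes tokens that are too long or too short. It has two configurable parameters:

min defines the minimum length; the default is 0
max defines the maximum length; the default is Integer.MAX_VALUE

Let's run a quick test to see its effect: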

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "length", "min":1, "max":3}],  
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"
}

Output:

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}

As you can see, every token longer than 3 characters has been filtered out.
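To make the filtering rule concrete, here is a minimal Python sketch of what the length filter does to a token stream. It does not call Elasticsearch: the `TOKENS` list is copied by hand from what the standard tokenizer yields for the sample sentence, and `length_filter` is a hypothetical helper name used only for illustration.

```python
# Tokens the standard tokenizer produces for the sample sentence
# (hard-coded here; this sketch does not talk to Elasticsearch).
TOKENS = ["The", "2", "QUICK", "Brown", "Foxes", "jumped",
          "over", "the", "lazy", "dog's", "bone"]

INT_MAX = 2**31 - 1  # Integer.MAX_VALUE, the default for max


def length_filter(tokens, min_len=0, max_len=INT_MAX):
    """Keep only tokens whose character count lies in [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(length_filter(TOKENS, min_len=1, max_len=3))  # ['The', '2', 'the']
```

With the defaults (0 and Integer.MAX_VALUE) nothing is dropped, which is why the filter only becomes useful once at least one bound is set.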

To configure the length filter on an index, refer to the following example:

PUT /length_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_length"]
                }
            },
            "filter" : {
                "my_length" : {
                    "type" : "length",
                    "min" : 1,
                    "max": 3
                }
            }
        }
    }
}

GET length_example/_analyze
{
  "analyzer": "default",
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"
}

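ngram filter

The ngram filter is easiest to understand by way of the ngram tokenizer: the latter is equivalent to the keyword tokenizer plus the ngram filter, and it produces the same terms.

It first breaks the text apart and then splits it using the N-gram algorithm. An N-gram is like a sliding window moving across a word: a contiguous sequence of characters of a specified length.

That sounds rather abstract, so here is an example: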

GET _analyze
{
  "tokenizer": "ngram",
  "text": "北京大学"
}

GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "ngram", "min_gram":1, "max_gram":2}],  
  "text" : "北京大学"
}

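As you can see, it has two properties:

min_gram: the minimum number of characters in a gram; defaults to 1
max_gram: the maximum number of characters in a gram; defaults to 2

By default, the difference between max_gram and min_gram may be at most 1. This limit can be changed through the index setting max_ngram_diff, as in the following example: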

PUT /ngram_example
{
    "settings" : {
      "index": {"max_ngram_diff": 10},
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_ngram"]
                }
            },
            "filter" : {
                "my_ngram" : {
                    "type" : "ngram",
                    "min_gram" : 2,
                    "max_gram": 4
                }
            }
        }
    }
}
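The constraint between min_gram, max_gram, and max_ngram_diff can be pictured as a simple check. The sketch below only illustrates the rule described above; `check_ngram_settings` is a hypothetical helper, and neither its name nor its error messages come from Elasticsearch.

```python
def check_ngram_settings(min_gram, max_gram, max_ngram_diff=1):
    """Return max_gram - min_gram, or raise if it exceeds max_ngram_diff.

    Illustrative only: this mirrors the rule from the text, not the
    actual validation code or error wording used by Elasticsearch.
    """
    if min_gram > max_gram:
        raise ValueError("min_gram must not be greater than max_gram")
    diff = max_gram - min_gram
    if diff > max_ngram_diff:
        raise ValueError(
            f"difference {diff} exceeds max_ngram_diff {max_ngram_diff}"
        )
    return diff


# The settings from the index above: a difference of 2 fits within 10.
print(check_ngram_settings(2, 4, max_ngram_diff=10))  # 2
```

With the default limit of 1, the same min_gram/max_gram pair would be rejected, which is why the index above raises max_ngram_diff.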

Test with the index's analyzer:

GET ngram_example/_analyze
{
  "analyzer": "default", 
  "text" : "北京大学"
}

Output:

{
  "tokens" : [
    {
      "token" : "北京",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "北京大",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "北京大学",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "京大",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "京大学",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "大学",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
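The list of terms above can be reproduced with a small sliding-window loop. This is a minimal Python sketch of the idea, not the Lucene implementation; `ngram_filter` is a hypothetical name, and it counts Python characters, which is enough for this sample text.

```python
def ngram_filter(token, min_gram=1, max_gram=2):
    """Return every substring of token with min_gram..max_gram characters.

    Grams are listed by start position first and by length second,
    which matches the order of the terms in the output above.
    """
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(token):
                break  # the window would run past the end of the token
            grams.append(token[start:end])
    return grams


print(ngram_filter("北京大学", min_gram=2, max_gram=4))
# ['北京', '北京大', '北京大学', '京大', '京大学', '大学']
```

Note that in the real output every gram keeps the offsets and position of the original token (0 to 4, position 0), because the filter works on a token that the keyword tokenizer has already emitted.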

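By now you should have a basic grasp of how the ngram filter is used, but you may be wondering what it is for. In practice it suits prefix and infix search, for example search suggestions: when you have typed only part of a sentence, the search engine shows matches that have that part as a prefix, which gives you a suggestion feature.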
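To see why indexing n-grams enables partial matching, here is a toy Python sketch. It uses an in-memory dictionary as a stand-in for an inverted index and looks up the typed fragment as a single term; the document texts, the `build_index` and `ngrams` helpers, and the gram sizes are all assumptions chosen for illustration.

```python
from collections import defaultdict


def ngrams(text, min_gram, max_gram):
    """All substrings of text with min_gram..max_gram characters."""
    return [text[i:i + n]
            for i in range(len(text))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(text)]


def build_index(docs, min_gram=1, max_gram=4):
    """Map every gram to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, min_gram, max_gram):
            index[gram].add(doc_id)
    return index


# Hypothetical sample documents.
docs = {1: "北京大学", 2: "北京天安门", 3: "南京大学"}
index = build_index(docs)

print(sorted(index.get("北京", set())))  # [1, 2]  prefix match
print(sorted(index.get("京大", set())))  # [1, 3]  infix match
```

Because every fragment is already stored as a term at index time, a partial input becomes an exact term lookup, at the cost of a larger index.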

trim filter

As the name suggests, this filter removes leading and trailing whitespace from each token. Here is an example:

GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "trim"}],
  "text" : " 北京大学"
}

Output:

{
  "tokens" : [
    {
      "token" : "北京大学",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}
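The end_offset of 5 in the output above is worth a second look: the trimmed token has 4 characters, but its offsets still describe a 5-character input, since the trim filter changes the token text and leaves the offsets alone. Below is a minimal Python sketch of that behaviour, assuming an input with one leading space; `trim_filter` is a hypothetical helper, and Python's `str.strip()` only approximates the whitespace rules used by Lucene.

```python
def trim_filter(tokens):
    """Strip surrounding whitespace from each token's text.

    Offsets are copied through unchanged, so they still point at the
    untrimmed span in the original input.
    """
    return [{**tok, "token": tok["token"].strip()} for tok in tokens]


# What the keyword tokenizer would emit for " 北京大学" (assumed input).
tokens = [{"token": " 北京大学", "start_offset": 0, "end_offset": 5}]

trimmed = trim_filter(tokens)
print(trimmed)
# [{'token': '北京大学', 'start_offset': 0, 'end_offset': 5}]
```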

truncate filter

This filter has a length property that truncates the terms produced by tokenization, so that no term is longer than length. Here is an example:

GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "truncate", "length": 3}],  
  "text" : "北京大学"
}

Output:

{
  "tokens" : [
    {
      "token" : "北京大",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}

Here is another example:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "truncate", "length": 3}],  
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "QUI",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    ...

When keyword terms are long, this filter can be used to avoid problems such as OOM.
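The two truncate examples above boil down to cutting each term at a fixed number of characters. Here is a minimal Python sketch of that rule; `truncate_filter` is a hypothetical helper, and the default of 10 mirrors the filter's documented default for length.

```python
def truncate_filter(tokens, length=10):
    """Cut every token down to at most `length` characters."""
    return [t[:length] for t in tokens]


print(truncate_filter(["北京大学"], length=3))           # ['北京大']
print(truncate_filter(["The", "2", "QUICK"], length=3))  # ['The', '2', 'QUI']
```

Unlike the length filter, nothing is removed here: tokens that are too long are shortened, and shorter tokens pass through untouched.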

unique filter

The unique token filter ensures that identical tokens appear only once. Here is an example:

GET _analyze
{
    "tokenizer": "standard",
    "filter": ["unique"],
    "text": "this is a test test test"
}

Output:

{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
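The behaviour shown above amounts to an order-preserving de-duplication of the token stream. A minimal Python sketch, with `unique_filter` as a hypothetical helper name:

```python
def unique_filter(tokens):
    """Drop repeated tokens, keeping the first occurrence of each."""
    seen = set()
    result = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            result.append(tok)
    return result


print(unique_filter(["this", "is", "a", "test", "test", "test"]))
# ['this', 'is', 'a', 'test']
```

The comparison is on the exact token text, so "Test" and "test" would count as different tokens unless a lowercase filter runs first.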

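synonym token filter

This is the synonym filter. Its use case looks like this: suppose a document contains the word 番茄, and we want a search for 番茄, 西红柿, or 圣女果 to find that document. For example: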

PUT /synonym_example
{
    "settings": {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "whitespace",
                        "filter" : ["my_synonym"]
                    }
                },
                "filter" : {
                    "my_synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    }
                }
            }
    }
}

We need to create a file named analysis/synonym.txt under the config directory of the ES instance, with the following content:

番茄, 西红柿, 圣女果

Remember to restart.

Then test it:

GET /synonym_example/_analyze
{
  "analyzer": "synonym",
  "text": "番茄"
}

Output:

{
  "tokens" : [
    {
      "token" : "番茄",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "西红柿",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "圣女果",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
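The output above shows the synonyms being emitted at the same position as the original token. The Python sketch below imitates that for the simplest case only: comma-separated groups of single-token equivalents, as in the synonym.txt file above. It does not handle explicit `=>` mappings or multi-word synonyms, and `parse_synonyms` and `synonym_filter` are hypothetical helpers, not part of any ES client.

```python
def parse_synonyms(lines):
    """Parse comma-separated groups of equivalent synonyms."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        group = [w.strip() for w in line.split(",") if w.strip()]
        for word in group:
            table[word] = group
    return table


def synonym_filter(tokens, table):
    """Emit each token, then its synonyms at the same position."""
    out = []
    for position, tok in enumerate(tokens):
        out.append((tok, position, "word"))
        for syn in table.get(tok, []):
            if syn != tok:
                out.append((syn, position, "SYNONYM"))
    return out


table = parse_synonyms(["番茄, 西红柿, 圣女果"])
print(synonym_filter(["番茄"], table))
# [('番茄', 0, 'word'), ('西红柿', 0, 'SYNONYM'), ('圣女果', 0, 'SYNONYM')]
```

Because all three terms share position 0, a query for any one of them matches a document that was indexed with this analyzer and contains 番茄.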

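How to combine multiple filters

We know that an analyzer can contain multiple filters, so how is that done? Look at the following example: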

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "length", "min":1, "max":4},{"type": "truncate", "length": 3}],  
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

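In this example we combine the length filter and the truncate filter. The text is first split by the standard tokenizer; terms longer than 4 characters are filtered out first, and the remaining terms are then truncated to 3 characters. The output is: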

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "ove",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "laz",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "bon",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
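Filters run in the order they are listed, and each one consumes the output of the previous one. The Python sketch below reproduces the terms above and then reverses the order to show that the result changes. It does not call Elasticsearch: the token list is hard-coded from the standard tokenizer's output for this sentence, and `apply_chain`, `length_filter`, and `truncate_filter` are hypothetical helpers.

```python
TOKENS = ["The", "2", "QUICK", "Brown", "Foxes", "jumped",
          "over", "the", "lazy", "dog's", "bone"]


def length_filter(tokens, min_len, max_len):
    """Keep tokens whose character count lies in [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


def truncate_filter(tokens, length):
    """Cut every token down to at most `length` characters."""
    return [t[:length] for t in tokens]


def apply_chain(tokens, filters):
    """Feed the tokens through each filter in list order."""
    for f in filters:
        tokens = f(tokens)
    return tokens


length_step = lambda ts: length_filter(ts, 1, 4)
truncate_step = lambda ts: truncate_filter(ts, 3)

print(apply_chain(TOKENS, [length_step, truncate_step]))
# ['The', '2', 'ove', 'the', 'laz', 'bon']

print(apply_chain(TOKENS, [truncate_step, length_step]))
# ['The', '2', 'QUI', 'Bro', 'Fox', 'jum', 'ove', 'the', 'laz', 'dog', 'bon']
```

In the reversed chain every token is already at most 3 characters long when it reaches the length filter, so nothing is removed.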

To use this in an index, refer to the following example:

PUT /length_truncate_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_length", "my_truncate"]
                }
            },
            "filter" : {
                "my_length" : {
                    "type" : "length",
                    "min" : 1,
                    "max": 4
                },
                "my_truncate" : {
                    "type" : "truncate",
                    "length": 3
                }
            }
        }
    }
}

GET length_truncate_example/_analyze
{
  "analyzer": "default",
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"
}

