About Elasticsearch: Practice 003 - Elasticsearch Analyzer


[toc]


I. Components of an Elasticsearch Analyzer

1. The three building blocks

1.1 Character Filter

Pre-processes the raw text. For example, if the original text is HTML, the HTML tags need to be stripped: html_strip.

1.2 Tokenizer

Splits the input (the text already processed by the Character Filter) according to some rule, such as whitespace.

1.3 Token Filter

Post-processes the candidate terms produced by the Tokenizer, for example uppercase -> lowercase, or stop word filtering (dropping in, the, etc.).
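
As a quick combined illustration (a minimal sketch using only built-in components; the sample text is made up), all three stages can be run in a single _analyze request:

GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>The QUICK Brown Foxes!</p>"
}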

II. Testing tokenization with an Analyzer

2.1 Testing tokenization with a specified analyzer

2.1.1 standard analyzer

  • Tokenizer: Standard Tokenizer

    Splits text based on Unicode text segmentation; works for most languages

  • Token Filter: Lower Case Token Filter / Stop Token Filter (disabled by default)

    • Lower Case Token Filter: terms come out lowercased –> so searches against standard-analyzed fields match in lowercase
    • Stop Token Filter (disabled by default) –> stop words: tokens that are dropped from the index after tokenization
GET _analyze
{
  "analyzer": "standard",
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

2.1.2 Observations on the standard result

  • everything is lowercased
  • the numbers are kept
  • no stop words removed (disabled by default)
{
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

2.2 Other analyzers

  • standard
  • stop: removes stop words
  • simple
  • whitespace: splits on whitespace only, removes nothing
  • keyword: keeps the full text as a single token, no splitting (whitespace and keyword are compared in the sketch after this list)
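
For example, running the same sentence through whitespace and keyword shows the difference directly (a small sketch; the sample text is arbitrary):

GET _analyze
{
  "analyzer": "whitespace",
  "text": "The QUICK brown-foxes jumped."
}

GET _analyze
{
  "analyzer": "keyword",
  "text": "The QUICK brown-foxes jumped."
}

whitespace keeps the four space-separated chunks as-is (case, hyphen, and trailing period included), while keyword returns the whole sentence as one token.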

2.3 Testing tokenization with a specified Tokenizer and Token Filter

2.3.1 Using the same Tokenizer and Filter as standard

The previous section said that the standard analyzer uses the standard tokenizer and the lowercase filter. Let's replace the analyzer with an explicit tokenizer and filter and try it:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

The result is identical to the one above:

{
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

2.3.2 Adding a stop filter and trying again

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase","stop"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

Observation: in is gone, so the stop filter's stop word list evidently contains in.

The filter array now has two entries (two token filters are applied; like many ES fields, filter accepts multiple values as an array). If you remove lowercase from filter, uppercase letters are no longer converted to lowercase; that variant's request is sketched after the output below, and its result is not shown here.

{
  "tokens" : [
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
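
For reference, the no-lowercase variant mentioned above would look like this (request only, output omitted):

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}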

III. Elasticsearch's built-in Analyzer components

3.1 Built-in character filters

3.1.1 What is a character filter?

It processes the text before the tokenizer, for example adding, removing, or replacing characters; multiple character filters can be configured.

It affects the position and offset information seen by the tokenizer.

3.1.2 Some built-in character filters

  • html strip: strips HTML tags
  • mapping: string replacement
  • pattern replace: regex-based replacement

3.2 Built-in tokenizers

3.2.1 What is a tokenizer?

It splits the raw text (after character-filter processing) into terms/tokens according to certain rules.

3.2.2 Built-in tokenizers

  • whitespace: splits on whitespace
  • standard
  • uax_url_email: keeps URLs/emails intact (see the sketch after this list)
  • pattern
  • keyword: no splitting
  • path_hierarchy: splits path names
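
A small illustration of uax_url_email (a sketch; the sample text is made up), which keeps URLs and email addresses as single tokens where standard would split them:

GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Mail me at someone@example.com or visit https://www.elastic.co"
}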

3.2.3 Custom tokenizers can also be implemented as Java plugins

3.3 Built-in token filters

3.3.1 What is a token filter?

It further processes the words output by the tokenizer (i.e., it works on terms).

3.3.2 Built-in token filters

  • lowercase: lowercases terms
  • stop: removes stop words (in/the, etc.)
  • synonym: adds synonyms (see the sketch after this list)
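
The synonym filter is not covered by the demos below, so here is a minimal sketch with an inline, one-off synonym rule (the rule and sample text are made up); quick should then also emit fast at the same position:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "The quick fox"
}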

IV. Demo examples

4.1 html_strip / mapping + keyword

GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {"type": "html_strip"},
    {
      "type": "mapping",
      "mappings": ["- => _", ":) => _happy_", ":(=> _sad_"]
    }
  ],
  "text": "<b>Hello :) this-is-my-book,that-is-not :(World</b>"}

The keyword tokenizer is used, i.e., the text is kept whole and never split;

Two char_filters are used: html_strip (strips the HTML tags) and mapping (replaces the specified content).

Result below: the HTML tags are gone, the hyphens have been replaced with underscores, and the emoticons with _happy_ / _sad_.

{
  "tokens" : [
    {
      "token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World",
      "start_offset" : 3,
      "end_offset" : 52,
      "type" : "word",
      "position" : 0
    }
  ]
}

4.2 char_filter with regex replacement

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}

The regex replacement is configured via type / pattern / replacement.

Result:

{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

4.3 Tokenizer: splitting by directory path

GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/niewj/a/b/c"
}

Tokenization result:

{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b/c",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}

4.4 Token filters: whitespace with stop

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"], // ["lowercase", "stop"]
  "text": "The girls in China are playing this game !"
}

Result: in, are, and this are removed (stop words), but the remaining terms keep their original case and The survives, because the whitespace tokenizer is used without a lowercase filter (unlike the standard analyzer), so the capitalized The does not match the lowercase stop word the. The lowercase + stop variant hinted at in the request comment is sketched after this output.

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "girls",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "playing",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game",
      "start_offset" : 36,
      "end_offset" : 40,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "!",
      "start_offset" : 41,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    }
  ]
}
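
For completeness, here is the variant from the comment in the request above (a sketch; with lowercase applied first, The should also be lowercased and then dropped by stop):

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The girls in China are playing this game !"
}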

4.5 Custom analyzer

4.5.1 Defining a custom analyzer in the index settings

PUT my_new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{ // 1. 自定义 analyzer 的名称
          "type": "custom",
          "char_filter": ["my_emoticons"], 
          "tokenizer": "my_punctuation", 
          "filter": ["lowercase", "my_english_stop"]
        }
      },
      "tokenizer": {
        "my_punctuation": { // 3. 自定义 tokenizer 的名称
          "type": "pattern", "pattern":"[.,!?]"
        }
      },
      "char_filter": {
        "my_emoticons": { // 2. 自定义 char_filter 的名称
          "type": "mapping", "mappings":[":) => _hapy_", ":(=> _sad_"]
        }
      },
      "filter": {
        "my_english_stop": { // 4. 自定义 token filter 的名称
          "type": "stop", "stopwords": "_english_"
        }
      }
    }
  }
}

4.5.2 Testing the custom analyzer:

POST my_new_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm a :) person in the earth, :(And You? "}

Output:

{
  "tokens" : [
    {"token" : "i'm","start_offset": 0,"end_offset": 3,"type":"word","position" : 0},
    {
      "token" : "_hapy_",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "person",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "earth",
      "start_offset" : 23,
      "end_offset" : 28,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "_sad_",
      "start_offset" : 30,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "you",
      "start_offset" : 37,
      "end_offset" : 40,
      "type" : "word",
      "position" : 9
    }
  ]
}