About Elasticsearch: Practice 003, Elasticsearch Analyzers

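Character Filter: filters the raw text before it is tokenized. For example, if the source text is HTML, the HTML tags need to be removed, which is the job of html_strip.

Tokenizer: splits its input (the text already processed by the character filters) according to some rule, such as whitespace.

Token Filter: post-processes the candidate terms produced by the tokenizer, for example converting uppercase to lowercase or removing stop words (dropping in, the, and so on).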
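
The three stages always run in the same order: character filters, then one tokenizer, then token filters. As a rough mental model only, the pipeline can be sketched as function composition. This is plain Python, not Elasticsearch code; `analyze` and the four helper functions are hypothetical names that merely stand in for html_strip, whitespace, lowercase, and stop.

```python
import re
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: Iterable[CharFilter],
            tokenizer: Tokenizer,
            token_filters: Iterable[TokenFilter]) -> List[str]:
    """Toy model of an analyzer: char filters -> tokenizer -> token filters."""
    for char_filter in char_filters:      # 1. rewrite the raw text
        text = char_filter(text)
    tokens = tokenizer(text)              # 2. split the text into tokens
    for token_filter in token_filters:    # 3. post-process the tokens
        tokens = token_filter(tokens)
    return tokens


def strip_tags(text: str) -> str:
    """Crude stand-in for html_strip: delete anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def split_on_whitespace(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


def drop_stop_words(tokens: List[str]) -> List[str]:
    # Tiny illustrative stop list, not the real default list.
    return [token for token in tokens if token not in {"in", "the"}]


tokens = analyze("<p>The Cat in the Hat</p>",
                 [strip_tags], split_on_whitespace,
                 [lowercase, drop_stop_words])
print(tokens)  # ['cat', 'hat']
```

Swapping the order of `lowercase` and `drop_stop_words` would leave `The` in the result, which is why filter order matters.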

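The standard analyzer is composed of:

Tokenizer: Standard Tokenizer

Splits text according to Unicode text segmentation rules; suitable for most languages.
Token Filter: Lower Case Token Filter / Stop Token Filter (disabled by default)
- Lower Case Token Filter: lowercases every token, so terms produced by the standard analyzer are matched in lowercase
- Stop Token Filter (disabled by default): removes stop words, that is, tokens that are dropped from the index after tokenization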

GET _analyze
{
  "analyzer": "standard",
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

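All tokens are lowercase.
Numbers are kept.
Stop words are not removed (the stop filter is disabled by default), so in is still present.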

{
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
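
To see where `start_offset`, `end_offset`, and `position` come from, here is a small Python approximation. It assumes that, for this particular ASCII sentence, the regular expression `\w+` splits the text the same way the standard tokenizer does; the real tokenizer follows Unicode text segmentation rules and will differ on other inputs. The function name `rough_standard_analyze` is made up for this sketch.

```python
import re


def rough_standard_analyze(text):
    """Approximate standard tokenizer + lowercase filter for simple ASCII text."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),   # lowercase token filter
            "start_offset": match.start(),    # index of the first character
            "end_offset": match.end(),        # index just past the last character
            "position": position,             # running token count
        })
    return tokens


text = "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
for token in rough_standard_analyze(text):
    print(token)
```

Offsets are character indexes into the original text (the end is exclusive), while position simply counts tokens. That is why `for` starts at offset 3, after the three leading symbols, but has position 0.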

standard
stop: removes stop words
simple
whitespace: splits on whitespace only and removes nothing
keyword: keeps the full text as a single token, no tokenization

The previous section noted that the standard analyzer uses the standard tokenizer and the lowercase filter. Let's replace the analyzer with an explicit tokenizer and filter and see what happens:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

The result is identical to the one above:

{
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase","stop"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

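Observation: for and in are gone, so the stop filter's word list must contain them. Their positions (0 and 8) are left as gaps in the output.

The filter array holds two entries here, so two token filters are applied (in ES, a field that accepts multiple values takes them as an array). If lowercase is removed from filter, uppercase letters are no longer converted to lowercase; that result is not shown here.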

{
  "tokens" : [
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

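A character filter processes the text before the tokenizer, for example by adding, deleting, or replacing characters. Multiple character filters can be configured.

It affects the position and offset values reported by the tokenizer.

html_strip: strips HTML tags
mapping: string replacement
pattern_replace: regular expression replacement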

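A tokenizer splits the original text (as left by the character filters) into terms, also called tokens, according to some rule.

whitespace: splits on whitespace
standard
uax_url_email: keeps URLs and email addresses as single tokens
pattern
keyword: no tokenization
path_hierarchy: splits path names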

A token filter post-processes the terms output by the tokenizer.

lowercase: converts to lowercase
stop: removes stop words (in, the, and so on)
synonym: adds synonyms

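GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {"type": "html_strip"},
    {
      "type": "mapping",
      "mappings": ["- => _", ":) => _happy_", ":( => _sad_"]
    }
  ],
  "text": "<b>Hello :) this-is-my-book,that-is-not :( World</b>"
}

The tokenizer is keyword, which keeps the text whole instead of splitting it.

Two char_filter entries are used: html_strip (strips HTML tags) and mapping (replaces each listed string with its replacement).

Result of the request above: the HTML tags are gone, hyphens are replaced with underscores, and the two emoticons are replaced with _happy_ and _sad_.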

{
  "tokens" : [
    {
      "token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World",
      "start_offset" : 3,
      "end_offset" : 52,
      "type" : "word",
      "position" : 0
    }
  ]
}
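
The effect of the two character filters can be imitated in a few lines of Python. This is only a sketch under simplifying assumptions: `strip_html` is a naive regular expression, whereas the real html_strip filter also handles cases such as HTML entities, and `apply_mapping` is a hypothetical helper that tries the longest key first at each position. Neither name comes from Elasticsearch.

```python
import re


def strip_html(text):
    """Naive stand-in for html_strip: delete anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def apply_mapping(text, rules):
    """Replace every occurrence of a key in `rules`, longest key first."""
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])   # emit the replacement
                i += len(key)            # and skip past the matched key
                break
        else:
            out.append(text[i])          # no rule matched: copy the character
            i += 1
    return "".join(out)


rules = {"-": "_", ":)": "_happy_", ":(": "_sad_"}
text = "<b>Hello :) this-is-my-book,that-is-not :( World</b>"

# The keyword tokenizer would then emit this whole string as one token.
print(apply_mapping(strip_html(text), rules))
```

Because the keyword tokenizer does not split anything, the single token in the response is just the text as it looks after both character filters have run.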

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}

Regular expression replacement takes three settings: type, pattern, and replacement.

Result:

{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
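
The same substitution can be tried in Python to check what the pattern does. The only assumption is the syntax difference: Python's `re.sub` writes the first capture group as `\1`, where the request above uses `$1`.

```python
import re

text = "http://www.elastic.co"

# Keep only what the capture group (.*) matched after the scheme.
replaced = re.sub(r"http://(.*)", r"\1", text)

print(replaced)       # www.elastic.co
print(len(replaced))  # 14
print(len(text))      # 21
```

The token is 14 characters long, yet the response reports `start_offset` 0 and `end_offset` 21. The offsets refer to the original text, not to the text after the character filter has rewritten it.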

GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/niewj/a/b/c"
}

Tokenization result:

{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b/c",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}
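
The output follows a simple rule: one token per path prefix. A minimal Python sketch of that rule, assuming the default `/` delimiter and ignoring the tokenizer's other settings (the function name `path_hierarchy` is reused here purely for illustration):

```python
def path_hierarchy(path, delimiter="/"):
    """Return every prefix of `path` that ends at a delimiter boundary."""
    parts = path.split(delimiter)
    tokens = []
    for end in range(1, len(parts) + 1):
        prefix = delimiter.join(parts[:end])
        if prefix:                 # skip the empty prefix before a leading "/"
            tokens.append(prefix)
    return tokens


for token in path_hierarchy("/user/niewj/a/b/c"):
    print(token)
```

Indexing every prefix as its own term is what makes it possible to match a document by any of its parent directories with an exact term lookup.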

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"], // ["lowercase", "stop"]
  "text": "The girls in China are playing this game !"
}

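Result: in, are, and this were removed as stop words. Capitalized terms keep their case because no lowercase filter is applied here (the whitespace tokenizer itself never changes case). For the same reason The survives: the stop filter compares tokens case-sensitively by default, so The does not match the stop word the.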

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "girls",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "playing",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game",
      "start_offset" : 36,
      "end_offset" : 40,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "!",
      "start_offset" : 41,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    }
  ]
}
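
Both effects visible above, the surviving `The` and the gaps in `position`, can be reproduced with a short Python sketch. The stop word set below is a small hand-picked subset that is enough for this sentence, not the full default list, and `whitespace_then_stop` is a hypothetical name.

```python
# Illustrative subset of English stop words (assumption: enough for this demo).
STOP_WORDS = {"the", "in", "are", "this"}


def whitespace_then_stop(text, stop_words=STOP_WORDS):
    """Split on whitespace, then drop stop words while keeping positions."""
    kept = []
    for position, token in enumerate(text.split()):
        if token in stop_words:   # exact, case-sensitive comparison
            continue              # the position is skipped, not reused
        kept.append((token, position))
    return kept


text = "The girls in China are playing this game !"

# Mirrors "filter": ["stop"]: "The" is kept because it is not lowercase.
print(whitespace_then_stop(text))

# Mirrors "filter": ["lowercase", "stop"]: lowercasing first lets "the" match.
print(whitespace_then_stop(text.lower()))
```

Keeping the original positions means a phrase query can still tell that `girls` and `China` were separated by one removed word.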

PUT my_new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { // 1. name of the custom analyzer
          "type": "custom",
          "char_filter": ["my_emoticons"],
          "tokenizer": "my_punctuation",
          "filter": ["lowercase", "my_english_stop"]
        }
      },
      "tokenizer": {
        "my_punctuation": { // 3. name of the custom tokenizer
          "type": "pattern", "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "my_emoticons": { // 2. name of the custom char_filter
          "type": "mapping", "mappings": [":) => _hapy_", ":( => _sad_"]
        }
      },
      "filter": {
        "my_english_stop": { // 4. name of the custom token filter
          "type": "stop", "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_new_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm a :) person in the earth, :( And You? "
}

Output:

{
  "tokens" : [
    {"token" : "i'm","start_offset": 0,"end_offset": 3,"type":"word","position" : 0},
    {
      "token" : "_hapy_",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "person",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "earth",
      "start_offset" : 23,
      "end_offset" : 28,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "_sad_",
      "start_offset" : 30,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "you",
      "start_offset" : 37,
      "end_offset" : 40,
      "type" : "word",
      "position" : 9
    }
  ]
}
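
The whole custom analyzer can be traced by hand with a short Python sketch. It assumes the tokenizer pattern splits on spaces as well as on `.,!?`, which is what the word-level tokens and offsets in the output imply, and it uses a small stop word subset that is enough for this sentence. `my_analyzer_sketch` is a made-up name; none of this is Elasticsearch code.

```python
import re

# Illustrative subset of English stop words (assumption: enough for this demo).
STOP_WORDS = {"a", "in", "the", "and"}

# Same replacements as the my_emoticons char_filter.
EMOTICONS = ((":)", "_hapy_"), (":(", "_sad_"))


def my_analyzer_sketch(text):
    """Imitate my_analyzer: mapping -> pattern split -> lowercase -> stop."""
    # 1. char_filter: replace emoticons before any splitting happens
    for source, target in EMOTICONS:
        text = text.replace(source, target)

    # 2. tokenizer: split on space or . , ! ? and discard empty pieces
    raw_tokens = [piece for piece in re.split(r"[ .,!?]", text) if piece]

    # 3. token filters: lowercase first, then drop stop words (positions kept)
    result = []
    for position, token in enumerate(raw_tokens):
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append((token, position))
    return result


print(my_analyzer_sketch("I'm a :) person in the earth, :( And You? "))
```

The emoticons must be replaced by the character filter, because by the time the token filters run, the tokenizer has already decided where the token boundaries are. Lowercasing before the stop filter is what allows `And` to be removed.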

I. Components of an Elasticsearch analyzer

1. The three building blocks

1.1 Character Filter

1.2 Tokenizer

1.3 Token Filter

II. Testing tokenization with an analyzer

2.1 Testing with a named analyzer

2.1.1 standard analyzer

2.1.2 What the standard result shows

2.2 Other analyzers

2.3 Testing with an explicit Tokenizer and Token Filter

2.3.1 Using the same Tokenizer and Filter as standard

2.3.2 Adding a stop filter and trying again

III. Analyzer components built into Elasticsearch

3.1 Built-in character filters

3.1.1 What is a character filter?

3.1.2 Some built-in character filters

3.2 Built-in tokenizers

3.2.1 What is a tokenizer?

3.2.2 Built-in tokenizers

3.2.3 Custom tokenizers can be implemented as a Java plugin

3.3 Built-in token filters

3.3.1 What is a token filter?

3.3.2 Built-in token filters

IV. Demo cases

4.1 html_strip/mapping + keyword

4.2 char_filter with regular expression replacement

4.3 tokenizer that splits directory paths

4.4 whitespace tokenizer with the stop token filter

4.5 Custom analyzer

4.5.1 Defining a custom analyzer in settings

4.5.2 Testing the custom analyzer: