[toc]
I. Components of an Elasticsearch Analyzer
1. The three building blocks
1.1 Character Filter
Filters the raw text before tokenization. For example, if the original text is HTML, the HTML tags need to be stripped: html_strip.
1.2 Tokenizer
Splits the input (the text produced by the character filters) into tokens according to some rule, for example on whitespace.
1.3 Token Filter
Post-processes the candidate terms produced by the tokenizer, for example converting uppercase to lowercase, or stop-word filtering (removing in, the, and so on).
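The three stages can be combined in a single _analyze call. The sketch below only illustrates the pipeline order; the char_filter, tokenizer, and filter values here are arbitrary examples, not part of the tests in the next sections:
```
GET _analyze
{
  "char_filter": ["html_strip"],      // 1. character filter: strip HTML tags
  "tokenizer": "standard",            // 2. tokenizer: split into words
  "filter": ["lowercase", "stop"],    // 3. token filters: lowercase, then drop stop words
  "text": "<p>The QUICK Brown Foxes!</p>"
}
```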
II. Testing tokenization with _analyze
2.1 Testing with a named analyzer
2.1.1 standard analyzer
Tokenizer: Standard Tokenizer
Splits text according to the Unicode text segmentation algorithm, which works for most languages.
Token Filters: Lower Case Token Filter / Stop Token Filter (disabled by default)
- Lower Case Token Filter: tokens are lowercased --> so search matching against terms produced by the standard analyzer is done in lowercase
- Stop Token Filter (disabled by default) --> stop words: tokens that are dropped from the index after tokenization
GET _analyze{ "analyzer": "standard", "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."}
2.1.2 Observations on the standard result
- everything is lowercased
- the numbers are kept
- no stop words are removed (the stop filter is disabled by default)
{ "tokens" : [ { "token" : "for", "start_offset" : 3, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "example", "start_offset" : 7, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "uuu", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "you", "start_offset" : 20, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "can", "start_offset" : 24, "end_offset" : 27, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "see", "start_offset" : 28, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "27", "start_offset" : 32, "end_offset" : 34, "type" : "<NUM>", "position" : 6 }, { "token" : "accounts", "start_offset" : 35, "end_offset" : 43, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "in", "start_offset" : 44, "end_offset" : 46, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "id", "start_offset" : 47, "end_offset" : 49, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "idaho", "start_offset" : 51, "end_offset" : 56, "type" : "<ALPHANUM>", "position" : 10 } ]}
2.2 Other analyzers
- standard
- stop: removes stop words
- simple: splits on non-letter characters and lowercases (compared with whitespace below)
- whitespace: splits on whitespace only, nothing is removed
- keyword: keeps the full text as one token, no tokenization
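To get a feel for the differences, you can run two of these analyzers against the same text. This is just a quick sketch and the sample sentence is arbitrary:
```
GET _analyze
{
  "analyzer": "simple",
  "text": "The 3 QUICK Brown-Foxes!"
}

GET _analyze
{
  "analyzer": "whitespace",
  "text": "The 3 QUICK Brown-Foxes!"
}
```
With simple you should get lowercased tokens split on every non-letter character (the number 3 disappears), while whitespace only splits on spaces and keeps The, 3, QUICK, Brown-Foxes! exactly as written.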
2.3 Testing with an explicit tokenizer and token filter
2.3.1 Using the same tokenizer and filter as standard
The previous section said that the standard analyzer uses the standard tokenizer and the lowercase filter, so let's replace the analyzer with an explicit tokenizer plus filter and try it:
GET _analyze{ "tokenizer": "standard", "filter": ["lowercase"], "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."}
The result is identical to the one above:
{ "tokens" : [ { "token" : "for", "start_offset" : 3, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "example", "start_offset" : 7, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "uuu", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "you", "start_offset" : 20, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "can", "start_offset" : 24, "end_offset" : 27, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "see", "start_offset" : 28, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "27", "start_offset" : 32, "end_offset" : 34, "type" : "<NUM>", "position" : 6 }, { "token" : "accounts", "start_offset" : 35, "end_offset" : 43, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "in", "start_offset" : 44, "end_offset" : 46, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "id", "start_offset" : 47, "end_offset" : 49, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "idaho", "start_offset" : 51, "end_offset" : 56, "type" : "<ALPHANUM>", "position" : 10 } ]}
2.3.2 Adding a stop filter and trying again
GET _analyze{ "tokenizer": "standard", "filter": ["lowercase","stop"], "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."}
Observation: in and for are gone, so the stop filter's default stop-word list evidently contains them.
The filter array now holds two entries (two token filters are applied; ES fields like this accept multiple values as an array). If you remove lowercase from the filter, uppercase letters are no longer converted to lowercase; that result is not shown here.
{ "tokens" : [ { "token" : "example", "start_offset" : 7, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "uuu", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "you", "start_offset" : 20, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "can", "start_offset" : 24, "end_offset" : 27, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "see", "start_offset" : 28, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "27", "start_offset" : 32, "end_offset" : 34, "type" : "<NUM>", "position" : 6 }, { "token" : "accounts", "start_offset" : 35, "end_offset" : 43, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "id", "start_offset" : 47, "end_offset" : 49, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "idaho", "start_offset" : 51, "end_offset" : 56, "type" : "<ALPHANUM>", "position" : 10 } ]}
III. Elasticsearch's built-in analyzer components
3.1 Built-in character filters
3.1.1 What is a character filter?
It processes the text before the tokenizer runs, for example adding, deleting, or replacing characters; multiple character filters can be configured. Character filters affect the position and offset values produced by the tokenizer.
3.1.2 Some built-in character filters
- html_strip: removes HTML tags
- mapping: string replacement
- pattern_replace: regex-based replacement
3.2 Built-in tokenizers
3.2.1 What is a tokenizer?
Splits the original text (after the character filters have processed it) into terms/tokens according to certain rules.
3.2.2 Built-in tokenizers
- whitespace: splits on whitespace
- standard
- uax_url_email: keeps URLs/emails intact (see the sketch after this list)
- pattern
- keyword: no splitting
- path_hierarchy: splits path names
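uax_url_email is the only tokenizer in this list that does not show up in the demos of section IV, so here is a minimal sketch (the sample text is made up). It keeps the e-mail address and the URL as single tokens, whereas standard would break them apart:
```
GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "mail me at john.doe@example.com or visit https://www.elastic.co"
}
```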
3.2.3 Custom tokenizers can also be implemented with a Java plugin
3.3 Built-in token filters
3.3.1 What is a token filter?
It post-processes the terms output by the tokenizer.
3.3.2 Built-in token filters
- lowercase: lowercases terms
- stop: removes stop words (in, the, etc.)
- synonym: adds synonyms (see the sketch after this list)
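Since synonym does not appear in the demos below, here is a minimal sketch that defines the synonym filter inline in _analyze (the synonym pair "elasticsearch, es" is made up for illustration); if it works as expected, es is emitted at the same position as elasticsearch:
```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["elasticsearch, es"]
    }
  ],
  "text": "Elasticsearch is powerful"
}
```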
IV. Demos
4.1 html_strip/mapping + keyword
GET _analyze{ "tokenizer": "keyword", "char_filter": [ { "type": "html_strip" }, { "type": "mapping", "mappings": [ "- => _", ":) => _happy_", ":( => _sad_" ] } ], "text": "<b>Hello :) this-is-my-book,that-is-not :( World</b>"}
This uses tokenizer: keyword, i.e. the text is kept whole and not split, plus two char_filter entries: html_strip (removes the HTML tags) and mapping (replaces the original characters with the specified replacements).
In the result below, the HTML tags are gone, the hyphens have been replaced with underscores, and the emoticons with _happy_ / _sad_:
{ "tokens" : [ { "token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World", "start_offset" : 3, "end_offset" : 52, "type" : "word", "position" : 0 } ]}
4.2 char_filter with a regex replacement
GET _analyze{ "tokenizer": "standard", "char_filter": [ { "type": "pattern_replace", "pattern": "http://(.*)", "replacement": "$1" } ], "text": "http://www.elastic.co"}
The regex replacement is configured with type / pattern / replacement. The result:
{ "tokens" : [ { "token" : "www.elastic.co", "start_offset" : 0, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 0 } ]}
4.3 Tokenizer that splits by path hierarchy
GET _analyze{ "tokenizer": "path_hierarchy", "text": "/user/niewj/a/b/c"}
Tokenization result:
{ "tokens" : [ { "token" : "/user", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 0 }, { "token" : "/user/niewj", "start_offset" : 0, "end_offset" : 11, "type" : "word", "position" : 0 }, { "token" : "/user/niewj/a", "start_offset" : 0, "end_offset" : 13, "type" : "word", "position" : 0 }, { "token" : "/user/niewj/a/b", "start_offset" : 0, "end_offset" : 15, "type" : "word", "position" : 0 }, { "token" : "/user/niewj/a/b/c", "start_offset" : 0, "end_offset" : 17, "type" : "word", "position" : 0 } ]}
4.4 whitespace tokenizer with the stop token filter
GET _analyze{ "tokenizer": "whitespace", "filter": ["stop"], // ["lowercase", "stop"] "text": "The girls in China are playing this game !"}
Result: in, are, and this are removed (stop words), but the terms keep their original case because only the whitespace tokenizer is used here with no lowercase filter. Note that The survives: the stop filter is case-sensitive by default, so the uppercase The does not match the stop word the.
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 }, { "token" : "girls", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 1 }, { "token" : "China", "start_offset" : 13, "end_offset" : 18, "type" : "word", "position" : 3 }, { "token" : "playing", "start_offset" : 23, "end_offset" : 30, "type" : "word", "position" : 5 }, { "token" : "game", "start_offset" : 36, "end_offset" : 40, "type" : "word", "position" : 7 }, { "token" : "!", "start_offset" : 41, "end_offset" : 42, "type" : "word", "position" : 8 } ]}
4.5 Custom analyzer
4.5.1 Defining a custom analyzer in the index settings
PUT my_new_index{ "settings": { "analysis": { "analyzer": { "my_analyzer":{ // 1.自定义analyzer的名称 "type": "custom", "char_filter": ["my_emoticons"], "tokenizer": "my_punctuation", "filter": ["lowercase", "my_english_stop"] } }, "tokenizer": { "my_punctuation": { // 3.自定义tokenizer的名称 "type": "pattern", "pattern":"[ .,!?]" } }, "char_filter": { "my_emoticons": { // 2.自定义char_filter的名称 "type": "mapping", "mappings":[":) => _hapy_", ":( => _sad_"] } }, "filter": { "my_english_stop": { // 4.自定义token filter的名称 "type": "stop", "stopwords": "_english_" } } } }}
4.5.2 Testing the custom analyzer:
```
POST my_new_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm a :) person in the earth, :( And You? "
}
```
Output:
{ "tokens" : [ { "token" : "i'm", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 }, { "token" : "_hapy_", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "person", "start_offset" : 9, "end_offset" : 15, "type" : "word", "position" : 3 }, { "token" : "earth", "start_offset" : 23, "end_offset" : 28, "type" : "word", "position" : 6 }, { "token" : "_sad_", "start_offset" : 30, "end_offset" : 32, "type" : "word", "position" : 7 }, { "token" : "you", "start_offset" : 37, "end_offset" : 40, "type" : "word", "position" : 9 } ]}