[toc]
I. Components of an Elasticsearch Analyzer
1. The three building blocks
1.1 Character Filter
Filters the raw text before tokenization. For example, if the original text is HTML, the HTML tags need to be stripped: html_strip.
1.2 Tokenizer
Splits the input (the text produced by the character filters) into tokens according to some rule, for example on whitespace.
1.3 Token Filter
Post-processes the candidate terms produced by the tokenizer, for example converting uppercase to lowercase, or stop-word filtering (removing in, the, and so on).
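The three stages can be combined in a single _analyze call. The sketch below only illustrates the pipeline order; the char_filter, tokenizer, and filter values here are arbitrary examples, not part of the tests in the next sections:
```
GET _analyze
{
  "char_filter": ["html_strip"],      // 1. character filter: strip HTML tags
  "tokenizer": "standard",            // 2. tokenizer: split into words
  "filter": ["lowercase", "stop"],    // 3. token filters: lowercase, then drop stop words
  "text": "<p>The QUICK Brown Foxes!</p>"
}
```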
II. Testing tokenization with _analyze
2.1 Testing with a named analyzer
2.1.1 standard analyzer
Tokenizer: Standard Tokenizer
Splits text according to the Unicode text segmentation algorithm, which works for most languages.
Token Filters: Lower Case Token Filter / Stop Token Filter (disabled by default)
- Lower Case Token Filter: tokens are lowercased --> so search matching against terms produced by the standard analyzer is done in lowercase
- Stop Token Filter (disabled by default) --> stop words: tokens that are dropped from the index after tokenization
GET _analyze{ "analyzer": "standard", "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."}
2.1.2 Observations on the standard result
- everything is lowercased
- the numbers are kept
- no stop words are removed (the stop filter is disabled by default)
{ "tokens" : [ { "token" : "for", "start_offset" : 3, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "example", "start_offset" : 7, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "uuu", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "you", "start_offset" : 20, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "can", "start_offset" : 24, "end_offset" : 27, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "see", "start_offset" : 28, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "27", "start_offset" : 32, "end_offset" : 34, "type" : "<NUM>", "position" : 6 }, { "token" : "accounts", "start_offset" : 35, "end_offset" : 43, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "in", "start_offset" : 44, "end_offset" : 46, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "id", "start_offset" : 47, "end_offset" : 49, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "idaho", "start_offset" : 51, "end_offset" : 56, "type" : "<ALPHANUM>", "position" : 10 } ]}
2.2 Other analyzers
- standard
- stop: removes stop words
- simple: splits on non-letter characters and lowercases (compared with whitespace below)
- whitespace: splits on whitespace only, nothing is removed
- keyword: keeps the full text as one token, no tokenization
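To get a feel for the differences, you can run two of these analyzers against the same text. This is just a quick sketch and the sample sentence is arbitrary:
```
GET _analyze
{
  "analyzer": "simple",
  "text": "The 3 QUICK Brown-Foxes!"
}

GET _analyze
{
  "analyzer": "whitespace",
  "text": "The 3 QUICK Brown-Foxes!"
}
```
With simple you should get lowercased tokens split on every non-letter character (the number 3 disappears), while whitespace only splits on spaces and keeps The, 3, QUICK, Brown-Foxes! exactly as written.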
2.3 Testing with an explicit tokenizer and token filter
2.3.1 Using the same tokenizer and filter as standard
The previous section said that the standard analyzer uses the standard tokenizer and the lowercase filter, so let's replace the analyzer with an explicit tokenizer plus filter and try it:
GET _analyze{ "tokenizer": "standard", "filter": ["lowercase"], "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."}
The result is identical to the one above:
{ "tokens" : [ { "token" : "for", "start_offset" : 3, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "example", "start_offset" : 7, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "uuu", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "you", "start_offset" : 20, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "can", "start_offset" : 24, "end_offset" : 27, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "see", "start_offset" : 28, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "27", "start_offset" : 32, "end_offset" : 34, "type" : "<NUM>", "position" : 6 }, { "token" : "accounts", "start_offset" : 35, "end_offset" : 43, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "in", "start_offset" : 44, "end_offset" : 46, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "id", "start_offset" : 47, "end_offset" : 49, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "idaho", "start_offset" : 51, "end_offset" : 56, "type" : "<ALPHANUM>", "position" : 10 } ]}
2.3.2 Adding a stop filter and trying again
GET _analyze{ "tokenizer": "standard", "filter": ["lowercase","stop"], "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."}
Observation: in and for are gone, so the stop filter's default stop-word list evidently contains them.
The filter array now holds two entries (two token filters are applied; ES fields like this accept multiple values as an array). If you remove lowercase from the filter, uppercase letters are no longer converted to lowercase; that result is not shown here.
{ "tokens" : [ { "token" : "example", "start_offset" : 7, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "uuu", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "you", "start_offset" : 20, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "can", "start_offset" : 24, "end_offset" : 27, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "see", "start_offset" : 28, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "27", "start_offset" : 32, "end_offset" : 34, "type" : "<NUM>", "position" : 6 }, { "token" : "accounts", "start_offset" : 35, "end_offset" : 43, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "id", "start_offset" : 47, "end_offset" : 49, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "idaho", "start_offset" : 51, "end_offset" : 56, "type" : "<ALPHANUM>", "position" : 10 } ]}
III. Elasticsearch's built-in analyzer components
3.1 Built-in character filters
3.1.1 What is a character filter?
It processes the text before the tokenizer runs, for example adding, deleting, or replacing characters; multiple character filters can be configured. Character filters affect the position and offset values produced by the tokenizer.
3.1.2 Some built-in character filters
- html_strip: removes HTML tags
- mapping: string replacement
- pattern_replace: regex-based replacement
3.2 Built-in tokenizers
3.2.1 What is a tokenizer?
Splits the original text (after the character filters have processed it) into terms/tokens according to certain rules.
3.2.2 Built-in tokenizers
- whitespace: splits on whitespace
- standard
- uax_url_email: keeps URLs/emails intact (see the sketch after this list)
- pattern
- keyword: no splitting
- path_hierarchy: splits path names
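uax_url_email is the only tokenizer in this list that does not show up in the demos of section IV, so here is a minimal sketch (the sample text is made up). It keeps the e-mail address and the URL as single tokens, whereas standard would break them apart:
```
GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "mail me at john.doe@example.com or visit https://www.elastic.co"
}
```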
3.2.3 Custom tokenizers can also be implemented with a Java plugin
3.3 Built-in token filters
3.3.1 What is a token filter?
It post-processes the terms output by the tokenizer.
3.3.2 Built-in token filters
- lowercase: lowercases terms
- stop: removes stop words (in, the, etc.)
- synonym: adds synonyms (see the sketch after this list)
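Since synonym does not appear in the demos below, here is a minimal sketch that defines the synonym filter inline in _analyze (the synonym pair "elasticsearch, es" is made up for illustration); if it works as expected, es is emitted at the same position as elasticsearch:
```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["elasticsearch, es"]
    }
  ],
  "text": "Elasticsearch is powerful"
}
```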
IV. Demos
4.1 html_strip/mapping + keyword
GET _analyze{ "tokenizer": "keyword", "char_filter": [ { "type": "html_strip" }, { "type": "mapping", "mappings": [ "- => _", ":) => _happy_", ":( => _sad_" ] } ], "text": "<b>Hello :) this-is-my-book,that-is-not :( World</b>"}
This uses tokenizer: keyword, i.e. the text is kept whole and not split, plus two char_filter entries: html_strip (removes the HTML tags) and mapping (replaces the original characters with the specified replacements).
In the result below, the HTML tags are gone, the hyphens have been replaced with underscores, and the emoticons with _happy_ / _sad_:
{ "tokens" : [ { "token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World", "start_offset" : 3, "end_offset" : 52, "type" : "word", "position" : 0 } ]}
4.2 char_filter with a regex replacement
GET _analyze{ "tokenizer": "standard", "char_filter": [ { "type": "pattern_replace", "pattern": "http://(.*)", "replacement": "$1" } ], "text": "http://www.elastic.co"}
The regex replacement is configured with type / pattern / replacement. The result:
{ "tokens" : [ { "token" : "www.elastic.co", "start_offset" : 0, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 0 } ]}
4.3 Tokenizer that splits by path hierarchy
GET _analyze{ "tokenizer": "path_hierarchy", "text": "/user/niewj/a/b/c"}
Tokenization result:
{ "tokens" : [ { "token" : "/user", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 0 }, { "token" : "/user/niewj", "start_offset" : 0, "end_offset" : 11, "type" : "word", "position" : 0 }, { "token" : "/user/niewj/a", "start_offset" : 0, "end_offset" : 13, "type" : "word", "position" : 0 }, { "token" : "/user/niewj/a/b", "start_offset" : 0, "end_offset" : 15, "type" : "word", "position" : 0 }, { "token" : "/user/niewj/a/b/c", "start_offset" : 0, "end_offset" : 17, "type" : "word", "position" : 0 } ]}
4.4 whitespace tokenizer with the stop token filter
GET _analyze{ "tokenizer": "whitespace", "filter": ["stop"], // ["lowercase", "stop"] "text": "The girls in China are playing this game !"}
Result: in, are, and this are removed (stop words), but the terms keep their original case because only the whitespace tokenizer is used here with no lowercase filter. Note that The survives: the stop filter is case-sensitive by default, so the uppercase The does not match the stop word the.
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 }, { "token" : "girls", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 1 }, { "token" : "China", "start_offset" : 13, "end_offset" : 18, "type" : "word", "position" : 3 }, { "token" : "playing", "start_offset" : 23, "end_offset" : 30, "type" : "word", "position" : 5 }, { "token" : "game", "start_offset" : 36, "end_offset" : 40, "type" : "word", "position" : 7 }, { "token" : "!", "start_offset" : 41, "end_offset" : 42, "type" : "word", "position" : 8 } ]}
4.5 Custom analyzer
4.5.1 Defining a custom analyzer in the index settings
PUT my_new_index{ "settings": { "analysis": { "analyzer": { "my_analyzer":{ // 1.自定义analyzer的名称 "type": "custom", "char_filter": ["my_emoticons"], "tokenizer": "my_punctuation", "filter": ["lowercase", "my_english_stop"] } }, "tokenizer": { "my_punctuation": { // 3.自定义tokenizer的名称 "type": "pattern", "pattern":"[ .,!?]" } }, "char_filter": { "my_emoticons": { // 2.自定义char_filter的名称 "type": "mapping", "mappings":[":) => _hapy_", ":( => _sad_"] } }, "filter": { "my_english_stop": { // 4.自定义token filter的名称 "type": "stop", "stopwords": "_english_" } } } }}
4.5.2 Testing the custom analyzer:
```
POST my_new_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm a :) person in the earth, :( And You? "
}
```
Output:
{ "tokens" : [ { "token" : "i'm", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 }, { "token" : "_hapy_", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "person", "start_offset" : 9, "end_offset" : 15, "type" : "word", "position" : 3 }, { "token" : "earth", "start_offset" : 23, "end_offset" : 28, "type" : "word", "position" : 6 }, { "token" : "_sad_", "start_offset" : 30, "end_offset" : 32, "type" : "word", "position" : 7 }, { "token" : "you", "start_offset" : 37, "end_offset" : 40, "type" : "word", "position" : 9 } ]}