About Elasticsearch: Practice 003 - Elasticsearch Analyzer


[toc]


I. Components of an Elasticsearch Analyzer

1. The three building blocks

1.1 Character Filter

Pre-processes the raw text. For example, if the original text is HTML, the HTML tags need to be stripped: html_strip.

1.2 Tokenizer

Splits the input (the text already processed by the Character Filter) according to some rule, such as whitespace.

1.3 Token Filter

Post-processes the candidate terms produced by the Tokenizer, for example uppercase -> lowercase, or stop word filtering (dropping in, the, etc.).
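
As a quick combined illustration (a minimal sketch using only built-in components; the sample text is made up), all three stages can be run in a single _analyze request:

GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>The QUICK Brown Foxes!</p>"
}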

II. Testing tokenization with an Analyzer

2.1 Testing tokenization with a specified analyzer

2.1.1 standard analyzer

  • Tokenizer: Standard Tokenizer

    Splits text based on Unicode text segmentation; works for most languages

  • Token Filter: Lower Case Token Filter / Stop Token Filter (disabled by default)

    • Lower Case Token Filter: terms come out lowercased –> so searches against standard-analyzed fields match in lowercase
    • Stop Token Filter (disabled by default) –> stop words: tokens that are dropped from the index after tokenization
GET _analyze
{
  "analyzer": "standard",
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

2.1.2 Observations on the standard result

  • everything is lowercased
  • the numbers are kept
  • no stop words removed (disabled by default)
{
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

2.2 Other analyzers

  • standard
  • stop: removes stop words
  • simple
  • whitespace: splits on whitespace only, removes nothing
  • keyword: keeps the full text as a single token, no splitting (whitespace and keyword are compared in the sketch after this list)
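
For example, running the same sentence through whitespace and keyword shows the difference directly (a small sketch; the sample text is arbitrary):

GET _analyze
{
  "analyzer": "whitespace",
  "text": "The QUICK brown-foxes jumped."
}

GET _analyze
{
  "analyzer": "keyword",
  "text": "The QUICK brown-foxes jumped."
}

whitespace keeps the four space-separated chunks as-is (case, hyphen, and trailing period included), while keyword returns the whole sentence as one token.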

2.3 Testing tokenization with a specified Tokenizer and Token Filter

2.3.1 Using the same Tokenizer and Filter as standard

The previous section said that the standard analyzer uses the standard tokenizer and the lowercase filter. Let's replace the analyzer with an explicit tokenizer and filter and try it:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

The result is identical to the one above:

{
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

2.3.2 Adding a stop filter and trying again

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase","stop"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

Observation: in is gone, so the stop filter's stop word list evidently contains in.

The filter array now has two entries (two token filters are applied; like many ES fields, filter accepts multiple values as an array). If you remove lowercase from filter, uppercase letters are no longer converted to lowercase; that variant's request is sketched after the output below, and its result is not shown here.

{
  "tokens" : [
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
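
For reference, the no-lowercase variant mentioned above would look like this (request only, output omitted):

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}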

III. Elasticsearch's built-in Analyzer components

3.1 Built-in character filters

3.1.1 What is a character filter?

It processes the text before the tokenizer, for example adding, removing, or replacing characters; multiple character filters can be configured.

It affects the position and offset information seen by the tokenizer.

3.1.2 Some built-in character filters

  • html strip: strips HTML tags
  • mapping: string replacement
  • pattern replace: regex-based replacement

3.2 Built-in tokenizers

3.2.1 What is a tokenizer?

It splits the raw text (after character-filter processing) into terms/tokens according to certain rules.

3.2.2 Built-in tokenizers

  • whitespace: splits on whitespace
  • standard
  • uax_url_email: keeps URLs/emails intact (see the sketch after this list)
  • pattern
  • keyword: no splitting
  • path_hierarchy: splits path names
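
A small illustration of uax_url_email (a sketch; the sample text is made up), which keeps URLs and email addresses as single tokens where standard would split them:

GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Mail me at someone@example.com or visit https://www.elastic.co"
}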

3.2.3 Custom tokenizers can also be implemented as Java plugins

3.3 Built-in token filters

3.3.1 What is a token filter?

It further processes the words output by the tokenizer (i.e., it works on terms).

3.3.2 Built-in token filters

  • lowercase: lowercases terms
  • stop: removes stop words (in/the, etc.)
  • synonym: adds synonyms (see the sketch after this list)
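
The synonym filter is not covered by the demos below, so here is a minimal sketch with an inline, one-off synonym rule (the rule and sample text are made up); quick should then also emit fast at the same position:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "The quick fox"
}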

IV. Demo examples

4.1 html_strip / mapping + keyword

GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {"type": "html_strip"},
    {
      "type": "mapping",
      "mappings": ["- => _", ":) => _happy_", ":(=> _sad_"]
    }
  ],
  "text": "<b>Hello :) this-is-my-book,that-is-not :(World</b>"}

The keyword tokenizer is used, i.e., the text is kept whole and never split;

Two char_filters are used: html_strip (strips the HTML tags) and mapping (replaces the specified content).

Result below: the HTML tags are gone, the hyphens have been replaced with underscores, and the emoticons with _happy_ / _sad_.

{
  "tokens" : [
    {
      "token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World",
      "start_offset" : 3,
      "end_offset" : 52,
      "type" : "word",
      "position" : 0
    }
  ]
}

4.2 char_filter with regex replacement

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}

The regex replacement is configured via type / pattern / replacement.

Result:

{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

4.3 Tokenizer: splitting by directory path

GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/niewj/a/b/c"
}

Tokenization result:

{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b/c",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}

4.4 Token filters: whitespace with stop

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"], // ["lowercase", "stop"]
  "text": "The girls in China are playing this game !"
}

Result: in, are, and this are removed (stop words), but the remaining terms keep their original case and The survives, because the whitespace tokenizer is used without a lowercase filter (unlike the standard analyzer), so the capitalized The does not match the lowercase stop word the. The lowercase + stop variant hinted at in the request comment is sketched after this output.

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "girls",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "playing",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game",
      "start_offset" : 36,
      "end_offset" : 40,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "!",
      "start_offset" : 41,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    }
  ]
}
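
For completeness, here is the variant from the comment in the request above (a sketch; with lowercase applied first, The should also be lowercased and then dropped by stop):

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The girls in China are playing this game !"
}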

4.5 Custom analyzer

4.5.1 Defining a custom analyzer in the index settings

PUT my_new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{ // 1. 自定义 analyzer 的名称
          "type": "custom",
          "char_filter": ["my_emoticons"], 
          "tokenizer": "my_punctuation", 
          "filter": ["lowercase", "my_english_stop"]
        }
      },
      "tokenizer": {
        "my_punctuation": { // 3. 自定义 tokenizer 的名称
          "type": "pattern", "pattern":"[.,!?]"
        }
      },
      "char_filter": {
        "my_emoticons": { // 2. 自定义 char_filter 的名称
          "type": "mapping", "mappings":[":) => _hapy_", ":(=> _sad_"]
        }
      },
      "filter": {
        "my_english_stop": { // 4. 自定义 token filter 的名称
          "type": "stop", "stopwords": "_english_"
        }
      }
    }
  }
}

4.5.2 Testing the custom analyzer:

POST my_new_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm a :) person in the earth, :(And You? "}

Output:

{
  "tokens" : [
    {"token" : "i'm","start_offset": 0,"end_offset": 3,"type":"word","position" : 0},
    {
      "token" : "_hapy_",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "person",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "earth",
      "start_offset" : 23,
      "end_offset" : 28,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "_sad_",
      "start_offset" : 30,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "you",
      "start_offset" : 37,
      "end_offset" : 40,
      "type" : "word",
      "position" : 9
    }
  ]
}