Elasticsearch in Practice (4): IK Word Segmentation

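Environment: Elasticsearch 6.2.4 + Kibana 6.2.4 + ik 6.2.4

Out of the box, Elasticsearch can also tokenize Chinese text. Let's first look at what the built-in analysis does with the sentence 今天天气真好 ("The weather is really nice today"):

    curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "default", "text": "今天天气真好"}'

The same request in the Kibana console:

    GET /_analyze
    {
      "analyzer": "default",
      "text": "今天天气真好"
    }

Result:

    {
      "tokens": [
        { "token": "今", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
        { "token": "天", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
        { "token": "天", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
        { "token": "气", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
        { "token": "真", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
        { "token": "好", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 }
      ]
    }

The text is split into one token per character. In a real application this will not give the search behavior you want, although for log search the built-in analysis is good enough. Note that analyzer=default actually invokes the standard analyzer.

Next, we install the IK analysis plugin and use it for segmentation.

Installing IK

IK project page: https://github.com/medcl/elas…

First of all, the IK plugin version must match the Elasticsearch version exactly; otherwise the two are incompatible.

Installation method 1: download the archive from https://github.com/medcl/elas…, create an analysis-ik subdirectory under the ES plugins directory, and copy the contents of the archive into it. The plugins/analysis-ik/ directory should end up looking like this:

    plugins/analysis-ik/
        commons-codec-1.9.jar
        commons-logging-1.2.jar
        elasticsearch-analysis-ik-6.2.4.jar
        httpclient-4.5.2.jar
        httpcore-4.4.4.jar
        plugin-descriptor.properties

Then restart Elasticsearch.

Installation method 2:

    /usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

If you have already downloaded the archive, install it from the local file instead:

    /usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install file:///tmp/elasticsearch-analysis-ik-6.2.4.zip

Then restart Elasticsearch.

IK segmentation

IK supports two segmentation modes:

ik_max_word: splits the text at the finest granularity, exhausting every possible combination of words.
ik_smart: splits the text at the coarsest granularity.

Now let's test how IK segmentation differs from the built-in analysis:

    curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "ik_smart", "text": "今天天气真好"}'

Result:

    {
      "tokens": [
        { "token": "今天天气", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 },
        { "token": "真好", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 1 }
      ]
    }

And the result with ik_max_word:

    {
      "tokens": [
        { "token": "今天天气", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 },
        { "token": "今天", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 },
        { "token": "天天", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 2 },
        { "token": "天气", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 3 },
        { "token": "真好", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 4 }
      ]
    }

Setting the default analyzer in a mapping

Example:

    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }

Note: search_analyzer is set to the same value as analyzer here so that the same analyzer is used at search time and at index time, which guarantees that the terms in a query have the same form as the terms in the inverted index. If search_analyzer is not set, it defaults to the value of analyzer. For details see: https://www.elastic.co/guide/…

Copyright notice: this is an original article, published on the WeChat official account 飞鸿影的博客 (fhyblog) and on cnblogs (博客园). Reproduction requires the author's permission.

Custom dictionaries

We can also define our own dictionaries for IK to use. For example:

    curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "ik_smart", "text": "去朝阳公园"}'

Result:

    {
      "tokens": [
        { "token": "去", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
        { "token": "朝阳", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 1 },
        { "token": "公园", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 2 }
      ]
    }

We would like 朝阳公园 (Chaoyang Park) to be treated as a single token. To get that, add the word to a dictionary of our own. Creating one takes only a few steps:

1. Add a file named my.dic to the elasticsearch-6.2.4/config/analysis-ik/ directory:

    $ touch my.dic
    $ echo 朝阳公园 > my.dic
    $ cat my.dic
    朝阳公园

A .dic file is a dictionary file. It is just a plain text file with one word per line, and it must be UTF-8 encoded. Here is what one of the bundled dictionary files looks like:

    $ head -n 5 main.dic
    一一列举
    一一对应
    一一道来
    一丁
    一丁不识

2. Then edit elasticsearch-6.2.4/config/analysis-ik/IKAnalyzer.cfg.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Configure your own extension dictionaries here -->
        <entry key="ext_dict">my.dic</entry>
        <!-- Configure your own extension stop word dictionaries here -->
        <entry key="ext_stopwords"></entry>
        <!-- Configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Configure a remote extension stop word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
    </properties>

With my.dic added, restart ES and check the result again:

    GET /_analyze
    {
      "analyzer": "ik_smart",
      "text": "去朝阳公园"
    }

Result:

    {
      "tokens": [
        { "token": "去", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
        { "token": "朝阳公园", "start_offset": 1, "end_offset": 5, "type": "CN_WORD", "position": 1 }
      ]
    }

This shows that the custom dictionary has taken effect.

To use several dictionaries, separate them with an ASCII semicolon:

    <entry key="ext_dict">my.dic;custom/single_word_low_freq.dic</entry>

The configuration also has an entry for extension stop word dictionaries, which are used to help break up sentences. Here is one of the bundled extension stop word dictionaries:

    $ head -n 5 extra_stopword.dic
    也
    了
    仍
    从
    以

In other words, when the IK analyzer meets one of these words, it assumes that the preceding word does not combine with it to form a longer word.

IK also supports remote dictionaries, whose advantage is that they can be hot-reloaded. The format is the same as for local dictionaries, one word per line (with \n as the line break), and the configured URL must meet one further requirement: the HTTP response has to carry two headers, Last-Modified and ETag. Both are strings, and as soon as either one changes, the plugin fetches the new word list and updates its dictionary. For details, see the section on hot-updating IK dictionaries at https://github.com/medcl/elas….

Note: the examples above edit files under elasticsearch-6.2.4/config/analysis-ik/ because IK was installed with elasticsearch-plugin (method 2). If you installed it by unpacking the archive, the IK configuration lives under the plugins directory instead, that is, in elasticsearch-6.2.4/plugins/analysis-ik/config. In other words, a plugin's configuration can sit either in the plugin's own directory or in the Elasticsearch config directory.

ES built-in analyzers

ES ships with many built-in analyzers that can be used in an index without any configuration:

Standard analyzer (standard): splits the string into terms on word boundaries, following the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases the terms, and supports stop words.
Simple analyzer (simple): splits the string whenever it meets a non-letter character and lowercases all terms.
Whitespace analyzer (whitespace): splits the string whenever it meets a whitespace character.
Stop analyzer (stop): like the simple analyzer, but also supports removing stop words.
Keyword analyzer (keyword): a no-op analyzer that outputs its input unchanged as a single term.
Pattern analyzer (pattern): uses a regular expression to split the string into terms. It supports lowercasing and stop words.
Language analyzers (language): a family of analyzers for specific languages, such as english or french.
Fingerprint analyzer (fingerprint): a specialist analyzer that produces a fingerprint which can be used for duplicate detection.

Custom analyzers: if the built-in analyzers do not meet your needs, you can define an analyzer of type custom that combines character filters, a tokenizer, and token filters. IK addresses the same kind of need, although it is delivered as a plugin rather than configured as a custom analyzer.

For details see the documentation: https://www.elastic.co/guide/…

References

1. medcl/elasticsearch-analysis-ik: The IK Analysis plugin integrates Lucene IK analyzer into elasticsearch, support customized dictionary. https://github.com/medcl/elas…
2. ElesticSearch IK中文分词使用详解 - xsdxs的博客 - CSDN博客 https://blog.csdn.net/xsdxs/a…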
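Supplement to the _analyze examples: when comparing analyzers it can be handy to script the request instead of reading raw JSON. The following is only a minimal sketch. It assumes Elasticsearch is reachable at localhost:9200 as in the curl commands, and the helper names (build_analyze_body, extract_tokens, analyze) are made up for this illustration and are not part of any Elasticsearch client. The offline check at the end reuses the ik_smart response shown in this article, so it runs without a cluster.

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on localhost:9200, as in the curl examples.
ES_URL = "http://localhost:9200/_analyze"


def build_analyze_body(analyzer, text):
    """Serialize the JSON body for the _analyze endpoint as UTF-8 bytes."""
    payload = {"analyzer": analyzer, "text": text}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def extract_tokens(response):
    """Return only the token strings, in order, from a parsed _analyze response."""
    return [item["token"] for item in response["tokens"]]


def analyze(analyzer, text, url=ES_URL):
    """Call _analyze on a running node and return the token strings."""
    request = urllib.request.Request(
        url,
        data=build_analyze_body(analyzer, text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_tokens(json.load(reply))


# Offline check against the ik_smart response shown in the article.
sample = {
    "tokens": [
        {"token": "今天天气", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0},
        {"token": "真好", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 1},
    ]
}
print(extract_tokens(sample))  # prints ['今天天气', '真好']
```

With a node running and IK installed, analyze("ik_smart", "去朝阳公园") and analyze("ik_max_word", "去朝阳公园") can then be compared side by side, for example before and after adding a word to my.dic.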
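Supplement to the remote dictionary requirement: the article only states that the response must carry Last-Modified and ETag headers and that a change in either triggers a reload. One way to satisfy that, sketched here under assumptions, is a tiny HTTP server that derives both headers from the dictionary file itself. The file name remote_words.dic, the port, and all function and class names are hypothetical, and answering HEAD as well as GET is a precaution on my part, not something the article specifies.

```python
import hashlib
import os
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical path to a UTF-8 dictionary file with one word per line.
DICT_PATH = "remote_words.dic"


def dictionary_headers(path):
    """Build Last-Modified and ETag values derived from the file's mtime and content."""
    with open(path, "rb") as handle:
        content = handle.read()
    return {
        "Last-Modified": formatdate(os.path.getmtime(path), usegmt=True),
        "ETag": '"' + hashlib.sha256(content).hexdigest() + '"',
    }


class DictionaryHandler(BaseHTTPRequestHandler):
    """Serve the dictionary file with the two headers on every response."""

    def _send(self, include_body):
        with open(DICT_PATH, "rb") as handle:
            content = handle.read()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(content)))
        for name, value in dictionary_headers(DICT_PATH).items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(content)

    def do_GET(self):
        self._send(include_body=True)

    def do_HEAD(self):
        self._send(include_body=False)


def serve(port=8000):
    """Block forever, serving the dictionary on the given port."""
    HTTPServer(("", port), DictionaryHandler).serve_forever()
```

Calling serve() and pointing the remote_ext_dict entry in IKAnalyzer.cfg.xml at this server's URL would be the intended use. Because the ETag is a hash of the file content, editing the word list changes the header without any extra bookkeeping.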