About Elasticsearch: Developing Applications with Elasticsearch

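In my day-to-day project work I often need fuzzy search. When the database column being searched is small, we can simply use column_name LIKE '%search term%'. This is inefficient, though, and adding an index does not help, because a pattern with a leading wildcard cannot use it. For large columns MySQL also offers full-text indexes, but they carry significant overhead as well.

Is there a "database" built specifically for search? One that not only does efficient fuzzy search but, like Baidu or Google, also picks out the keywords from a piece of text I type in and searches on them? Elasticsearch, introduced below, is an expert at exactly this.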

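1. Introduction

Full-text search is one of the most common requirements, and the open-source Elasticsearch is currently the first choice among full-text search engines. It can store, search, and analyze massive amounts of data quickly. Wikipedia, Stack Overflow, and GitHub all use it. Under the hood, Elasticsearch is built on the open-source library Lucene. However, Lucene cannot be used on its own; you have to write your own code to call its interfaces. Elasticsearch wraps Lucene and exposes a REST API that works out of the box.

Index

The top-level unit of data management in Elasticsearch is the Index. It is roughly the equivalent of a single database. The name of every Index (that is, database) must be lowercase. The following command lists all Indexes on the current node.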

 curl -X GET 'http://localhost:9200/_cat/indices?v'

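Document

A single record inside an Index is called a Document. Many Documents make up an Index, and each Document is represented as JSON.

Documents in the same Index are not required to share the same structure (schema), but it is best to keep them the same, because that improves search efficiency.

Type (removed)

Type is a legacy concept that Elasticsearch 7.x no longer uses. Documents could be grouped; for example, in a weather Index they could be grouped by city (Beijing and Shanghai). Such a group was called a Type: a virtual, logical grouping used to filter Documents.

However, Documents in different groups should still keep their data structures as consistent as possible, otherwise search efficiency suffers, which left Type with little real purpose. Elasticsearch therefore restricted it outright: in 6.x an index can contain only one type, and in 7.x the type concept is dropped from the APIs (types are deprecated there and removed completely in 8.0), so an index no longer has types. Wherever older requests named a type, use _doc under the index instead.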

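Inverted index

What is an inverted index? It is also called a reverse index. Put simply, a forward index looks up a value by its key, while an inverted index looks up keys by value. Elasticsearch uses a structure called an inverted index, which is well suited to fast full-text search. An inverted index consists of a list of all the unique terms that appear in any document and, for each term, a list of the documents that contain it. It maps terms to documents, so the data is organized around terms rather than around documents.

Suppose we have two documents, each with a content field containing the following: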
The quick brown fox jumped over the lazy dog
Quick brown foxes leap over lazy dogs in summer

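To create the inverted index, we first split the content field of each document into separate words (which we call terms or tokens), build a sorted list of all the unique terms, and then list which documents each term appears in.

Now, if we want to search for quick brown, we only need to find the documents that contain each term. Both documents match, but the first matches better than the second: the lowercase term quick appears only in the first document, because Quick in the second is a different term until the text is normalized. With a naive similarity algorithm that just counts the number of matching terms, we can say that the first document is more relevant to our query than the second.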
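The steps above can be sketched in a few lines of Java. This is only an illustration of the idea, not how Lucene implements it: the class name InvertedIndexDemo and its methods are made up for this example, terms are split on whitespace only, and no lowercasing or stemming is applied, which is why Quick and quick stay separate terms.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, simplified inverted index for illustration only.
class InvertedIndexDemo {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Split the content on whitespace and record the document id under each term.
    void add(int docId, String content) {
        for (String term : content.trim().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns docId -> number of distinct query terms found in that document.
    Map<Integer, Integer> search(String query) {
        Map<Integer, Integer> scores = new TreeMap<>();
        Set<String> terms = new LinkedHashSet<>(Arrays.asList(query.trim().split("\\s+")));
        for (String term : terms) {
            for (int docId : postings.getOrDefault(term, Collections.emptySet())) {
                scores.merge(docId, 1, Integer::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        InvertedIndexDemo index = new InvertedIndexDemo();
        index.add(1, "The quick brown fox jumped over the lazy dog");
        index.add(2, "Quick brown foxes leap over lazy dogs in summer");
        // Document 1 matches both terms, document 2 only "brown".
        System.out.println(index.search("quick brown")); // prints {1=2, 2=1}
    }
}
```

A real analyzer would normally lowercase and stem the terms first, so that quick and Quick, or fox and foxes, end up as the same term.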

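Tokenization

When building the inverted index, Elasticsearch runs the text through the configured analyzer. The query string is likewise run through the same analyzer first, and only then is the search executed.

Elasticsearch ships with many built-in analyzers, but they are not friendly to Chinese. For example, the standard analyzer breaks a Chinese sentence into single characters. In that case you can use a third-party analyzer plugin such as ik or pinyin; this article uses ik.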
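One way to compare analyzers is Elasticsearch's _analyze endpoint, which returns the tokens an analyzer produces for a given text. The sketch below assumes Java 11+ (for java.net.http), an Elasticsearch node at http://localhost:9200, and the ik plugin already installed; the class name AnalyzeDemo and the sample text are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper that asks Elasticsearch how an analyzer tokenizes a text.
class AnalyzeDemo {
    // Builds a POST /_analyze request. The JSON is assembled by hand, so the
    // arguments must not contain quotes or backslashes (illustration only).
    static HttpRequest analyzeRequest(String baseUrl, String analyzer, String text) {
        String body = "{\"analyzer\":\"" + analyzer + "\",\"text\":\"" + text + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Send the same text through the built-in analyzer and through ik.
        for (String analyzer : new String[] {"standard", "ik_max_word"}) {
            HttpRequest request = analyzeRequest("http://localhost:9200", analyzer, "中文分词器");
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(analyzer + " -> " + response.body());
        }
    }
}
```

Comparing the two responses side by side should make the difference between per-character and word-level tokens visible.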

2. Installation

Besides Elasticsearch itself, we will also install Kibana (a product for visualizing Elasticsearch data) and preload Elasticsearch with the ik Chinese analyzer plugin. Since the Elasticsearch version chosen here is 7.9.3, Kibana and the ik plugin use the matching version to ensure compatibility.

First, take a look at the containerized installation script below:

# Start Elasticsearch
docker run -d \
  --name elasticsearch \
  --restart=on-failure:3 \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -v /Volumes/elasticsearch/data/:/usr/share/elasticsearch/data/ \
  -v /Volumes/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  -v /Volumes/elasticsearch/plugins/:/usr/share/elasticsearch/plugins/ \
  elasticsearch:7.9.3

# Start Kibana
docker run -d \
  --name kibana \
  --link elasticsearch:es \
  -p 5601:5601 \
  -e ELASTICSEARCH_HOSTS=http://es:9200 \
  kibana:7.9.3

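Elasticsearch

It runs as a single node and exposes ports 9200 and 9300. Three mounted volumes are worth noting:

/data: all of Elasticsearch's data lives in this directory, so it can be mounted directly.
/config/elasticsearch.yml: after installation, elasticsearch.yml needs the following two lines appended at the end to allow cross-origin (CORS) requests. For convenience, the file itself is mounted.

http.cors.enabled: true
http.cors.allow-origin: "*"

/plugins: this is the directory where Elasticsearch plugins are installed. It is empty by default; usually you extract a plugin into it and restart to use the plugin. The analyzer plugin is covered below.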

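Kibana

Kibana's environment variables must include the Elasticsearch address, but by default Docker containers are isolated from each other on the network and cannot reach one another by name. One solution is to put both containers on the same network. The approach used here instead is --link, which gives the elasticsearch container an alias; this amounts to maintaining a DNS name mapping inside the Kibana container.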

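ik analyzer plugin

Download the ik Chinese analyzer plugin, preferably the release that matches your Elasticsearch version.

Plugins are normally installed after the Elasticsearch service has started and take effect after a restart. To keep things simple, I first downloaded the plugin zip locally, extracted elasticsearch-analysis-ik-7.9.3.zip into /Volumes/elasticsearch/plugins/analysis-ik/, and then mounted the /Volumes/elasticsearch/plugins/ directory. By the time the Elasticsearch container starts, the plugin is already active.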

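3. API examples

Here we build a complete example from scratch using the REST API. We create an index named articles that uses the ik Chinese analyzer, add some data, and then query it. This article does not cover every API, but because the style is RESTful, many usages can be worked out on your own.

Create the index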

(PUT) http://localhost:9200/articles

Update the index mapping

(POST) http://localhost:9200/articles/_mapping

{
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word"
        },
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word"
        }
    }
}

View the index mapping

(GET) http://localhost:9200/articles/_mapping

Create a document

(POST) http://localhost:9200/articles/_doc

{
    "title": "都江堰",
    "content": "一位年迈的老祖宗，没有成为挂在墙上的画像，没有成为写在书里的回顾..."
}

Search documents

(GET) http://localhost:9200/articles/_search

{
    "query": {
        "match": {
            "content": "画像回顾"
        }
    }
}
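For readers who prefer to drive these calls from code rather than a REST client, the whole sequence can be sketched with the JDK's built-in HTTP client. This assumes Java 11+ and a node at http://localhost:9200 with the ik plugin installed; the class name ArticlesApiDemo and its helper methods are made up for this example. Two details differ from the requests above and are choices made for this sketch: the search is sent as POST (Elasticsearch accepts both GET and POST for _search), and refresh=true is added when indexing so that the document is searchable immediately.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Hypothetical walk-through of the REST calls from this section.
class ArticlesApiDemo {
    static final String BASE = "http://localhost:9200"; // assumed local node

    // Builds a JSON request; a null body means "send no body".
    static HttpRequest request(String method, String path, String jsonBody) {
        HttpRequest.BodyPublisher body = jsonBody == null
                ? HttpRequest.BodyPublishers.noBody()
                : HttpRequest.BodyPublishers.ofString(jsonBody);
        return HttpRequest.newBuilder(URI.create(BASE + path))
                .header("Content-Type", "application/json")
                .method(method, body)
                .build();
    }

    // The four steps in order: create index, set mapping, add document, search.
    static List<HttpRequest> steps() {
        String field = "{\"type\":\"text\",\"analyzer\":\"ik_max_word\","
                + "\"search_analyzer\":\"ik_max_word\"}";
        String mapping = "{\"properties\":{\"title\":" + field + ",\"content\":" + field + "}}";
        String doc = "{\"title\":\"都江堰\",\"content\":"
                + "\"一位年迈的老祖宗，没有成为挂在墙上的画像，没有成为写在书里的回顾...\"}";
        String query = "{\"query\":{\"match\":{\"content\":\"画像回顾\"}}}";
        return List.of(
                request("PUT", "/articles", null),
                request("POST", "/articles/_mapping", mapping),
                request("POST", "/articles/_doc?refresh=true", doc),
                request("POST", "/articles/_search", query));
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (HttpRequest step : steps()) {
            HttpResponse<String> response =
                    client.send(step, HttpResponse.BodyHandlers.ofString());
            System.out.println(step.method() + " " + step.uri().getPath()
                    + " -> " + response.statusCode());
            System.out.println(response.body());
        }
    }
}
```

Running it a second time will report an error for the first step, since the articles index already exists by then.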

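4. Spring development

Real-world development with Spring and Elasticsearch is fairly involved, and covering it in detail would take several articles of its own. The main purpose of this article is a brief introduction to using Elasticsearch, so this chapter is only a very simple demo; for real projects, consult the relevant documentation.

The code below uses spring-boot-starter-data-elasticsearch; fortunately the current version, 4.2, is compatible with Elasticsearch 7.9.x. Because it belongs to the same Spring Data family as Spring Data JPA, the syntax of the DAO layer is easy to understand. More complex operations can be carried out by injecting ElasticsearchRestTemplate.

The /articles endpoint in the code below does the same thing as http://localhost:9200/articles/_search above: a fuzzy search over the article body.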

pom.xml

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
        </dependency>

application.yml

spring:
  data:
    elasticsearch:
      client:
        reactive:
          endpoints: localhost:9200

ArticlesEO.java

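import lombok.Data;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

// Entity mapped to the "articles" index; both text fields are analyzed with ik_max_word.
@Data
@Document(indexName = "articles")
public class ArticlesEO {

    @Id
    private String id;

    @Field(type = FieldType.Text, analyzer = "ik_max_word")
    private String title;

    @Field(type = FieldType.Text, analyzer = "ik_max_word")
    private String content;
}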

ArticlesRepository.java

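import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;
import org.springframework.stereotype.Repository;

@Repository
public interface ArticlesRepository extends ElasticsearchRepository<ArticlesEO, String> {

    // Derived query: searches the content field and returns one page of results.
    Page<ArticlesEO> findByContent(String content, Pageable pageable);
}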

ArticleController.java

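import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping
public class ArticleController {

    // Always return the first page with up to 10 results.
    private final Pageable pageable = PageRequest.of(0, 10);

    private final ArticlesRepository articlesRepository;

    // Injected for more complex operations; not used in this simple demo.
    private final ElasticsearchRestTemplate elasticsearchRestTemplate;

    public ArticleController(ArticlesRepository articlesRepository,
                             ElasticsearchRestTemplate elasticsearchRestTemplate) {
        this.articlesRepository = articlesRepository;
        this.elasticsearchRestTemplate = elasticsearchRestTemplate;
    }

    /**
     * Search documents.
     *
     * @param content text to match against the article body
     * @return the first page of matching articles
     */
    @GetMapping("/articles")
    public Page<ArticlesEO> searchArticle(@RequestParam("content") String content) {
        return articlesRepository.findByContent(content, pageable);
    }
}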