Part 18: Indexing MongoDB with ElasticSearch, a simple autocomplete index project

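My Elasticsearch article series, updated gradually; feel free to follow along.
0A. About Elasticsearch and example applications
00. Solr vs. ElasticSearch
01. What can ElasticSearch do?
02. An introduction to the Elastic Stack
03. How to install and set up the Elasticsearch API
04. Creating an index and CRUD operations with the elasticsearch head plugin
05. Multiple Elasticsearch instances and the head plugin
06. How does Elasticsearch work when it indexes a document?
07. Mappings in Elasticsearch: a concise tutorial
08. Analysis and analyzers in Elasticsearch
09. Building custom analyzers in Elasticsearch
10. A Kibana primer: Kibana as an Elasticsearch development tool
11. Elasticsearch query methods
12. Elasticsearch full-text queries
13. Elasticsearch queries: term-level queries
14. Getting started with Elasticsearch in Python
15. A simple way to use ElasticSearch with Django
16. 6 not-so-obvious things about Elasticsearch
17. A beginner's Elasticsearch tutorial using Python
18. Indexing MongoDB with ElasticSearch, a simple autocomplete index project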

For getting started with Elasticsearch, I also strongly recommend the ElasticSearch getting-started guide, a very thorough introductory handbook.

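About full-text search
Nowadays it is common for any website or application to have a search feature. This usually happens on platforms that have a lot of information to offer their users, from e-commerce sites with thousands of products in different categories to blogs or news sites with thousands of articles. Whenever a customer, user or reader visits such a site, they automatically look for a search box where they can type a query to find the specific article, product or content they want. A bad search engine frustrates users, and they will most likely never come back to our site.
Full-text search powers all the search boxes you use on websites every day to find what you need: whenever you want to find that Batman phone case in the Amazon product database, or when you search YouTube for videos of cats playing with laser pointers. Of course, these huge sites rely on many other features to enhance their search engines, but the foundation of all search is full-text indexing. With that said, let's see what this post is about.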
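Limitations of MongoDB
If you do a quick Google search for MongoDB full text, you will find that the MongoDB documentation describes support for full-text search. So why should we bother learning a new, complex technology like Elastic Search, and why introduce new complexity into our system architecture? Let's look at MongoDB's text search support to find out.
I will assume you already have MongoDB installed and know its basics. If so, open a console, run the mongo command to access the MongoDB shell, and create a database called fulltext:

$ mongo
> use fulltext
  switched to db fulltext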

Our test database will store articles, so let's add a collection called articles:

> db.createCollection('articles')
  { "ok" : 1 }

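Now let's add a few documents that are useful for testing. We will insert articles that have a title and a paragraph of content. I took a few paragraphs from two articles in The New York Times' DealBook.

> db.articles.insert({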
  ... title: 'Yahoo sale to Verizon',
  ... content: 'The sale is being done in two steps. The first step will be the transfer of any assets related to Yahoo business to a singular subsidiary. This includes the stock in the business subsidiaries that make up Yahoo that are not already in the single subsidiary, as well as the odd assets like benefit plan rights. This is what is being sold to Verizon. A license of Yahoo’s oldest patents is being held back in the so-called Excalibur portfolio. This will stay with Yahoo, as will Yahoo’s stakes in Alibaba Group and Yahoo Japan.'
  ... })
  WriteResult({"nInserted" : 1})


> db.articles.insert({
  ... title: 'Chinese Group to Pay $4.4 Billion for Caesars Mobile Games',
  ... content: 'In the most recent example in a growing trend of big deals for smartphone-based games, a consortium of Chinese investors led by the game company Shanghai Giant Network Technology said in a statement on Saturday that it would pay $4.4 billion to Caesars Interactive Entertainment for Playtika, its social and mobile games unit. Caesars Interactive is controlled by the owners of Caesars Palace and other casinos in Las Vegas and elsewhere.'
  ... })
  WriteResult({"nInserted" : 1})

Now that we have documents, we need to index them with a MongoDB text index. So let's create a text index on the title and content fields of the articles collection:

> db.articles.createIndex({
  ... title: 'text',
  ... content: 'text'
  ... })
  {
      "createdCollectionAutomatically" : false,
      "numIndexesBefore" : 1,
      "numIndexesAfter" : 2,
      "ok" : 1
  }

The index has been created, so it is time to run a few searches and see how it does. Let's take a look!

> db.articles.find({ $text: { $search: "chinese"} } )
{"_id" : ObjectId("579e0a35c6d02e54ad6fe556"), "title" : "Chinese Group to Pay $4.4 Billion for Caesars Mobile Games", "content" : "In the most recent example in a growing trend of big deals for smartphone-based games, a consortium of Chinese investors led by the game company Shanghai Giant Network Technology said in a statement on Saturday that it would pay $4.4 billion to Caesars Interactive Entertainment for Playtika, its social and mobile games unit. Caesars Interactive is controlled by the owners of Caesars Palace and other casinos in Las Vegas and elsewhere." }

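Good, everything seems to work: we searched for the word chinese and it matched the article about the Chinese group. Now let's make things harder for MongoDB. Suppose we want to build an autocomplete input (one that suggests completions while the user is typing). For that, I would assume that if I search for chi, MongoDB will return the same article:

> db.articles.find({ $text: { $search: "chi"} } )

The query returns nothing. This is one of the biggest limitations of MongoDB's full-text search. The problem is that it indexes documents at the word level, so a text index cannot do so-called partial matching, that is, matching part of a word.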
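To make the word-level limitation concrete, here is a minimal Python sketch of an index keyed by whole words. It is only an illustration of the idea, not how MongoDB is implemented: the helper names tokenize and build_word_index are made up for this example, and stemming and stop words are ignored.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a rough stand-in for word-level indexing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_index(docs):
    """Map each whole word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return index


titles = {
    1: "Yahoo sale to Verizon",
    2: "Chinese Group to Pay $4.4 Billion for Caesars Mobile Games",
}
word_index = build_word_index(titles)

print(word_index.get("chinese", set()))  # the whole word is a key in the index
print(word_index.get("chi", set()))      # a prefix is not a key, so nothing matches
```

Because only complete words are stored as keys, a lookup for a fragment has nothing to hit, which is the behaviour we just saw in the mongo shell.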
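This is where a more powerful text-indexing platform becomes useful. In our case I chose Elastic Search, mainly because its documentation is very helpful and because it offers a complete set of RESTful API endpoints out of the box, which makes testing very easy.
Elastic Search
What we are trying to do
I just want to point out that this post is only a small, simple example of what you can achieve with Elastic Search. Whole books have been written about it, so I do not want you to think Elastic Search is only useful for implementing autocomplete inputs. I simply found it an easy-to-understand example that shows how Elastic can help with complex searches that MongoDB cannot give us.
The second purpose of this post is to show how to import existing MongoDB documents into a full-text index in ElasticSearch. Again, the autocomplete example is small enough to be explained in a single post. If you find the world of text indexing interesting, keep reading about ElasticSearch (ES from now on) and its rich feature set.
I will not explain how to install ES here, because the process is very simple. Since ES is built on Java, just make sure Java is installed and the JAVA_HOME variable is set.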

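Once ES is installed, we will follow this overall process:
1. Create an index for our documents.
2. Import our MongoDB collection into ES using a tool called mongo-connector.
3. Migrate the index that mongo-connector creates in ES to the index we created in step 1.
4. Try out our new index and see how documents keep being indexed as long as mongo-connector is running.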
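Creating the ES index
So... how do we create an index that performs better than the built-in MongoDB text index? What do we need to configure in ES? We have to define what ES calls the Analysis Chain. In short, it is the pipeline that every document we insert into the index goes through before it is indexed.
The analysis chain is made up of analyzers. An analyzer is a filter that takes a document, analyzes and modifies it, and passes it on to the next one. For example, there might be an analyzer that removes so-called stop words: very common words, such as "the" or "and", that provide no useful information for indexing.
An analyzer is in turn made up of three kinds of components: character filters, a tokenizer, and token filters. The first is responsible for cleaning up the string before it is tokenized, for example by stripping HTML tags. The second splits it into terms, for example by splitting the string on whitespace. The last modifies the terms to optimize them for indexing, for example by removing stop words or lowercasing all terms.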
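The three stages can be imitated in a few lines of Python. This is just a toy sketch to show the order in which things happen, not ES code: the function names and the tiny stop-word set are assumptions made for the example.

```python
import re

# Illustrative stop words only; this is not the list ES uses.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the cleaned string on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every term."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Token filter: drop stop words."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    """Run the character filter, then the tokenizer, then each token filter in order."""
    tokens = whitespace_tokenizer(strip_html(text))
    for token_filter in (lowercase_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>The sale of <b>Yahoo</b> to Verizon</p>"))
# ['sale', 'yahoo', 'verizon']
```

Note that the order of the token filters matters here: the stop-word set is lowercase, so lowercasing has to run first.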
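ES ships with different analyzers that serve as a starting point for building custom analyzers that better fit any indexing need. One of the building blocks ES provides is the edge_ngram token filter. To understand what an edge n-gram is, we first need to understand what an n-gram is. As the Wikipedia page on n-grams puts it:
An n-gram is a contiguous sequence of n items from a given sequence of text or speech
So suppose you have the word blueberry. Its 2-grams, for example, are [bl, lu, ue, eb, be, er, rr, ry]. Now, according to the ES documentation:
Edge n-grams are anchored to the beginning of the word
This means that for blueberry the edge n-grams are:
[b, bl, blu, blue, blueb, bluebe, blueber, blueberr, blueberry]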
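If it helps, both definitions fit in a couple of lines of Python. The function names ngrams and edge_ngrams are my own for this sketch; the min_gram and max_gram parameters simply mirror the names ES uses for its filter settings.

```python
def ngrams(word, n):
    """All contiguous substrings of length n, from left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def edge_ngrams(word, min_gram=1, max_gram=20):
    """Prefixes of the word whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


print(ngrams("blueberry", 2))
print(edge_ngrams("blueberry"))
print(edge_ngrams("blueberry", min_gram=3))
```

The second call prints the list shown above; the third drops the one- and two-character prefixes, which is the setting we will use for the index.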

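See where we are going? If the edge n-grams of the word blueberry are indexed, it is easy to build an autocomplete search module: as long as what the user has typed is a prefix of the word, such as blu or blueb, it matches one of the indexed terms; once the typed text is no longer a prefix, it stops matching and the autocomplete suggestion disappears.
So edge n-grams should definitely be part of our index, and this is how we define them: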

{
    "filter": {
        "autocomplete_filter": {
            "type": "edge_ngram",
            "min_gram": 3,
            "max_gram": 20
        }
    }
}

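With this JSON object we define a token filter (filter) called "autocomplete_filter". We also say that it is an edge_ngram filter that produces grams from 3 to 20 characters long. I use 3 as the minimum because, on very large databases, indexing unigrams hurts performance a lot, since a great many documents would match each search. That is why many sites with autocomplete require the user to type at least three characters before they start suggesting anything.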
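A tiny experiment shows why short prefixes are a problem. The sketch below counts how many titles have a word starting with what the user has typed so far. The last three titles are invented for the example, and titles_matching_prefix is a hypothetical helper, not part of any library.

```python
titles = [
    "Chinese Group to Pay $4.4 Billion for Caesars Mobile Games",
    "Yahoo sale to Verizon",
    # The titles below are made up for illustration.
    "Caesars Palace opens new casino",
    "Chicago startup raises funding",
    "Chrome browser gets an update",
]


def titles_matching_prefix(prefix, candidates):
    """Return the titles that contain at least one word starting with the prefix."""
    prefix = prefix.lower()
    return [
        title for title in candidates
        if any(word.startswith(prefix) for word in title.lower().split())
    ]


for typed in ("c", "ch", "chi"):
    print(typed, len(titles_matching_prefix(typed, titles)))
```

Even with five titles, a single letter matches almost everything, while three letters already narrow things down. On a real collection the difference is much larger.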
Now that we have defined the token filter, we need to define our custom analyzer:

{
    "analyzer": {
        "autocomplete": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "autocomplete_filter"]
        }
    }
}

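Here we define an analyzer called "autocomplete". We tell ES that it is a custom analyzer that uses the standard tokenizer and applies two filters in order: lowercase (which is self-explanatory) and then our custom autocomplete_filter.
Now that we have defined the filter and the analyzer, let's create the index. Grab a console and run the following curl command: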

$ curl -H 'Content-Type: application/json' \
       -X PUT http://localhost:9200/fulltext_opt \
       -d \
      "{ \
          \"settings\": { \
              \"number_of_shards\": 1, \
              \"analysis\": { \
                  \"filter\": { \
                      \"autocomplete_filter\": { \
                          \"type\":     \"edge_ngram\", \
                          \"min_gram\": 3, \
                          \"max_gram\": 20 \
                      } \
                  }, \
                  \"analyzer\": { \
                      \"autocomplete\": { \
                          \"type\":      \"custom\", \
                          \"tokenizer\": \"standard\", \
                          \"filter\": [ \
                              \"lowercase\", \
                              \"autocomplete_filter\" \
                          ] \
                      } \
                  } \
              } \
          } \
      }"
{"acknowledged":true}
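All those escaped quotes are easy to get wrong. If you would rather avoid them, the same request body can be built as a plain Python dictionary and serialized with the standard library. The sketch below is one possible way to do it, not the only one: build_index_settings and create_index are names I made up, and create_index assumes ES is listening on localhost:9200 as in the rest of this post.

```python
import json
import urllib.request


def build_index_settings():
    """Return the same settings body that the curl command sends."""
    return {
        "settings": {
            "number_of_shards": 1,
            "analysis": {
                "filter": {
                    "autocomplete_filter": {
                        "type": "edge_ngram",
                        "min_gram": 3,
                        "max_gram": 20,
                    }
                },
                "analyzer": {
                    "autocomplete": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "autocomplete_filter"],
                    }
                },
            },
        }
    }


def create_index(base_url="http://localhost:9200", index="fulltext_opt"):
    """PUT the settings to ES (assumes a local ES instance is running)."""
    body = json.dumps(build_index_settings()).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Print the body so it can be inspected before anything is sent.
print(json.dumps(build_index_settings(), indent=2))
```

Calling create_index() would send the request; printing the body first is a cheap way to check it before touching the cluster.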

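The fulltext_opt in the endpoint URL tells ES to create a new index with that name. I chose that name because our MongoDB database is called fulltext, and when we first import it into ES an index called fulltext will be created automatically. Later we will move all documents from fulltext to the optimized fulltext_opt index.
The last thing to do on the fulltext_opt index is to create a mapping. A mapping describes a group of documents of the same type and their fields. We will create a mapping called articles and define the title and content properties on it. (The request below uses the ES 2.x syntax that this walkthrough targets; newer ES releases replace the string field type with text and no longer use mapping types.)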

$ curl -H 'Content-Type: application/json' \
        -X PUT http://localhost:9200/fulltext_opt/_mapping/articles \
        -d \
       "{ \
           \"articles\": { \
               \"properties\": { \
                   \"title\": { \
                       \"type\":     \"string\", \
                       \"analyzer\": \"autocomplete\" \
                   }, \
                   \"content\": { \
                       \"type\":    \"string\" \
                   } \
               } \
           } \
       }"
{"acknowledged":true}

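You can see that we use the autocomplete analyzer only for the title property. Since it is meant for autocompletion, analyzing the article content this way would make no sense (unless you want to suggest article content to the user... which would be odd).
The acknowledged: true response means that our index was created and the mapping was added successfully. Now it is time to import the documents from our MongoDB into it.
Importing from MongoDB into ES
To import our documents I could simply insert them into our ES index by hand (there are only two documents in my articles collection). The problem is that in real life we want MongoDB and our index to stay in sync, so that whenever a new document is inserted into MongoDB, the same document is indexed in ES.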
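Luckily for us, there is a tool that does exactly what we need: mongo-connector. Even better, it supports Elastic Search. I will not go deep into mongo-connector; its documentation has plenty of detail on how it works. Let's stick with the idea that it takes the documents in MongoDB and puts them into our ES index.
You can install mongo-connector with pip, the Python package manager. You also need to install elastic2-doc-manager, which adds support for replicating content from MongoDB into ElasticSearch 2.X.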
$ pip install mongo-connector
$ pip install elastic2-doc-manager

The next step is to start the MongoDB server as a replica set. If you do not know what a replica set is in MongoDB, I will not go deep into that either :). To run MongoDB as a replica set, just pass the --replSet option when starting it and give the replica set a name (rs0 in this case):

$ mongod --replSet rs0

Then all you have to do is open a mongo shell and run rs.initiate(). When you try to initialize the replica set, you may see this error message:

> rs.initiate()
 {
     "info2" : "no configuration explicitly specified -- making one",
     "me" : "mbp-mauricio:27017",
     "ok" : 0,
     "errmsg" : "couldn't initiate : can't find self in the replset config"
 }

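The problem is that the replica set cannot find a machine with that name (mbp-mauricio in this case). All you have to do is go to the /etc/hosts file and add an entry:

127.0.0.1 [your machine name]

MongoDB is now up and running, so let's start ES. Go to your ES installation directory and run:

$ ./bin/elasticsearch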

Everything is ready, so it is time to run mongo-connector.

$ mongo-connector -m 127.0.0.1:27017 -t 127.0.0.1:9200 -d elastic2_doc_manager

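You can replace the arguments with your own values; this is just the default localhost setup. Here we are basically telling mongo-connector to take the MongoDB data at localhost:27017 and send it to the ES instance running at localhost:9200, and to do all of this using elastic2_doc_manager. After a while (depending on the number and size of your MongoDB databases) you should be able to see the new index in your ES instance. In my case it was almost instantaneous, since my fulltext database only has two documents. So if you call the ES endpoint that lists the indices, you should see something like this: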

$ curl localhost:9200/_cat/indices?v
health status index                pri rep docs.count docs.deleted store.size pri.store.size
yellow open   fulltext               5   1          2            0     10.9kb         10.9kb
yellow open   fulltext_opt           1   1          0            0       159b           159b

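If there are other databases in your MongoDB instance, there may be more entries. The nice thing about mongo-connector is that it is highly configurable, so you can tell it which collections to import from which databases.
Moving documents between indices
So now we have two indices: one created by mongo-connector, which is not optimized but holds the two documents, and one that is optimized but empty. What we need to do now is copy the documents between the indices. Once again, inserting them by hand would be simpler for me since I only have two documents, but a real application has thousands of them.
There is a great tool for this purpose, elasticdump, which makes the task very easy. You can install it through NPM: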

$ npm install -g elasticdump

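With elasticdump you can import analyzers, mappings and data from one ES index into another (or even into a JSON file). In our case we do not care about the analyzers and mappings; we will import only the data, because the analyzer and the mapping are already defined in the fulltext_opt index.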

$ elasticdump \
   --input=http://localhost:9200/fulltext \
   --output=http://localhost:9200/fulltext_opt
Mon, 01 Aug 2016 01:21:10 GMT | starting dump
Mon, 01 Aug 2016 01:21:10 GMT | got 2 objects from source elasticsearch (offset: 0)
Mon, 01 Aug 2016 01:21:10 GMT | sent 2 objects to destination elasticsearch, wrote 2
Mon, 01 Aug 2016 01:21:10 GMT | got 0 objects from source elasticsearch (offset: 2)
Mon, 01 Aug 2016 01:21:10 GMT | Total Writes: 2
Mon, 01 Aug 2016 01:21:10 GMT | dump complete

Now, if you list the indices in ES again, you should see that docs.count for the fulltext_opt index has changed from 0 to 2:

$ curl localhost:9200/_cat/indices?v
health status index                pri rep docs.count docs.deleted store.size pri.store.size
yellow open   fulltext               5   1          2            0     10.9kb         10.9kb
yellow open   fulltext_opt           1   1          2            0       159b           159b
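If you want to check the document counts from a script instead of reading the table by eye, the output is simple enough to parse. Here is a rough sketch that works on the text shown above. parse_cat_indices is a name I made up, and it assumes every column has a value on every row, which is true for the output in this post but is not something I have verified for every index state.

```python
def parse_cat_indices(text):
    """Turn the whitespace-separated _cat/indices table into a list of dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


sample = """\
health status index                pri rep docs.count docs.deleted store.size pri.store.size
yellow open   fulltext               5   1          2            0     10.9kb         10.9kb
yellow open   fulltext_opt           1   1          2            0       159b           159b
"""

doc_counts = {row["index"]: int(row["docs.count"]) for row in parse_cat_indices(sample)}
print(doc_counts)  # {'fulltext': 2, 'fulltext_opt': 2}
```

In a real script the sample string would be replaced by the response body of the _cat/indices?v request.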

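That's it, our documents were copied from one index to the other. You can now delete the index created by mongo-connector if you like. The last step is to try out our new index and see whether it really supports partial matching for our autocomplete feature: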

curl -H 'Content-Type: application/json' \
    localhost:9200/fulltext_opt/articles/_search?pretty \
    -d "{\"query\": { \"match\": { \"title\": { \"query\": \"chi\", \"analyzer\": \"standard\"} } } }"
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.3125,
    "hits" : [ {
      "_index" : "fulltext_opt",
      "_type" : "articles",
      "_id" : "579e0a35c6d02e54ad6fe556",
      "_score" : 0.3125,
      "_source" : {
        "content" : "In the most recent example in a growing trend of big deals for smartphone-based games, a consortium of Chinese investors led by the game company Shanghai Giant Network Technology said in a statement on Saturday that it would pay $4.4 billion to Caesars Interactive Entertainment for Playtika, its social and mobile games unit. Caesars Interactive is controlled by the owners of Caesars Palace and other casinos in Las Vegas and elsewhere.",
        "title" : "Chinese Group to Pay $4.4 Billion for Caesars Mobile Games"
      }
    } ]
  }
}

Voilà! We get our document back. Note that in the query we specified the analyzer to use and set it to the standard analyzer:

{ 
    title: { 
        query: "chi", 
        analyzer: "standard" 
    } 
}

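If we did not do this, then because we are querying a field that was indexed with a custom analyzer, ES would by default use the autocomplete analyzer for the query text as well and search for the edge n-grams of that text. This would lead to unwanted results, because we want to search for exactly the text the user typed, not for every prefix of it: with our settings a query such as chinatown would be expanded to chi, chin, china and so on, and would match Chinese through the shared term chi. That is why we have to set the analyzer explicitly to the standard analyzer.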
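The difference between index-time and search-time analysis is easier to see with a toy inverted index. The sketch below is a simplified imitation in Python, not ES itself: standard_tokens is only a rough stand-in for the standard tokenizer plus the lowercase filter, and all function names are assumptions for this example.

```python
import re


def standard_tokens(text):
    """Rough stand-in for the standard tokenizer followed by lowercase."""
    return re.findall(r"[a-z0-9]+", text.lower())


def edge_ngrams(token, min_gram=3, max_gram=20):
    """Prefixes of the token between min_gram and max_gram characters long."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def autocomplete_tokens(text):
    """Imitation of our custom analyzer: standard tokens expanded to edge n-grams."""
    terms = []
    for token in standard_tokens(text):
        terms.extend(edge_ngrams(token))
    return terms


titles = {
    "yahoo": "Yahoo sale to Verizon",
    "caesars": "Chinese Group to Pay $4.4 Billion for Caesars Mobile Games",
}

# Index time: every title is analyzed with the autocomplete analyzer.
index = {}
for doc_id, title in titles.items():
    for term in autocomplete_tokens(title):
        index.setdefault(term, set()).add(doc_id)


def search(query, analyze):
    """Return the ids of documents that contain any term produced from the query."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


print(search("chi", standard_tokens))          # matches the Caesars article
print(search("chicken", standard_tokens))      # no match, as we would want
print(search("chicken", autocomplete_tokens))  # matches anyway, via the prefix "chi"
```

The last line is the unwanted result: once the query itself is broken into edge n-grams, a word that merely shares its first three letters with an indexed word produces a hit.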
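Handling new MongoDB inserts
So far we have used mongo-connector and elasticdump to get the contents of our MongoDB collection into the fulltext_opt index. As you may remember, the only catch is that mongo-connector replicates from MongoDB into an index with the same name as the database. This means that if we keep mongo-connector running as it is now, every new document inserted into the database will be indexed in the fulltext index in ES rather than in the optimized fulltext_opt.
The way to fix this is to pass more configuration to mongo-connector. Its documentation lists many configuration options. We are going to use two of them: namespaces.include (-n on the command line) and namespaces.mapping (-g on the command line). You can also configure mongo-connector through a JSON file, but here I will only use command-line arguments.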
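The -n option tells mongo-connector which MongoDB collections we want to index. The syntax is database_name.collection_name. In our case we want to index all the articles in the fulltext database, so we pass this command-line argument: -n fulltext.articles
The -g option tells mongo-connector into which index it should put the documents from the collections defined with the -n option. In our case we want to put all the articles into the fulltext_opt index. We also need to provide the type to use in ES, so the full argument is -g fulltext_opt.articles, because we want the articles to be stored in the index under the articles type.
That's it. Now we can run the command like this: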

$ mongo-connector -m 127.0.0.1:27017 -t 127.0.0.1:9200 -d elastic2_doc_manager -n fulltext.articles -g fulltext_opt.articles

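If you leave mongo-connector running, every new insert will be indexed in ES as well. Go ahead, insert a new document into the articles collection and then query the ES index; the new document should come back.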
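If you launch mongo-connector from a deployment script, it can be handy to assemble the arguments in one place. The following is a small sketch that only uses the flags shown in this post (-m, -t, -d, -n, -g). The function name mongo_connector_command and its parameter names are made up for the example.

```python
import shlex


def mongo_connector_command(mongo="127.0.0.1:27017",
                            target="127.0.0.1:9200",
                            source_ns="fulltext.articles",
                            dest_ns="fulltext_opt.articles"):
    """Build the argument list for the mongo-connector invocation used in this post."""
    return [
        "mongo-connector",
        "-m", mongo,
        "-t", target,
        "-d", "elastic2_doc_manager",
        "-n", source_ns,
        "-g", dest_ns,
    ]


command = mongo_connector_command()
# Print the command exactly as it would be typed in a shell.
print(" ".join(shlex.quote(part) for part in command))
# To start the process from Python, pass the list to subprocess.run(command, check=True).
```

Keeping the source and destination namespaces as parameters makes it harder to forget the -g mapping and end up writing to the unoptimized index again.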

Conclusion
With the excuse of building an autocomplete-friendly index, we learned how to combine MongoDB with Elastic Search and keep the two in sync with the mongo-connector module.