Part 09: Building a Custom Analyzer in Elasticsearch

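My Elasticsearch article series, updated gradually. Follow along if you are interested.
0A. About Elasticsearch and example applications
00. Solr vs. ElasticSearch
01. What can ElasticSearch do?
02. An introduction to Elastic Stack features
03. How to install and set up the Elasticsearch API
04. How to create an index with the Elasticsearch head plugin: CRUD operations
05. Multiple Elasticsearch instances and using the head plugin
06. How does Elasticsearch work when it indexes a document?
07. Mappings in Elasticsearch: a concise tutorial
08. Analysis and analyzers in Elasticsearch
09. Building a custom analyzer in Elasticsearch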

Also, if you are just getting started with Elasticsearch, I strongly recommend the Elasticsearch basic introductory tutorial, a very thorough beginner's guide.

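Introduction
In the previous post of this phase, I explained the general structure of an analyzer, its components, and what each component does. In this post we turn to the implementation side: we build a custom analyzer, run it against some text, and look at the difference.
The case for a custom analyzer
Let us consider a case that calls for a custom analyzer. Suppose the text we feed into Elasticsearch contains the following: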

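1. HTML tags
HTML tags may show up in our text at index time, and in most cases they are not wanted. So we need to remove them.
2. Stop words
Words such as the, and, or carry little meaning when searching content and are generally called stop words.
3. Uppercase letters.
4. Short forms such as H2O, $, and %.
In some cases, short forms like these should be replaced with the full English words.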

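Applying the custom analyzer
Take the following sample text:

Arun has 100 $ which accounts to 3 % of the total <h2> money </h2>

The operations it needs, and the component of the custom analyzer that handles each one, are:

Remove the HTML tags: char filter (html_strip)
Replace $ and % with words: char filter (mapping)
Split the text into tokens: tokenizer (standard)
Remove the stop words: token filter (stop)
Convert to lowercase: token filter (lowercase)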

The hierarchy inside "settings" looks like this:
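
As a rough sketch of that nesting, here is the skeleton written as a Python dictionary. The names in angle brackets are placeholders invented for this sketch, not real Elasticsearch names; only the four section keys under "analysis" and the keys of a custom analyzer are taken from the request shown later in this post.

```python
import json

# Skeleton of the "analysis" section of the index settings.
# Everything in <angle brackets> is a placeholder for this sketch.
settings_skeleton = {
    "settings": {
        "analysis": {
            # Named components are defined once here...
            "char_filter": {"<char_filter_name>": {"type": "<char_filter_type>"}},
            "tokenizer": {"<tokenizer_name>": {"type": "<tokenizer_type>"}},
            "filter": {"<token_filter_name>": {"type": "<token_filter_type>"}},
            # ...and an analyzer then refers to them by name.
            "analyzer": {
                "<analyzer_name>": {
                    "type": "custom",
                    "char_filter": ["<char_filter_name>"],
                    "tokenizer": "<tokenizer_name>",
                    "filter": ["<token_filter_name>"],
                }
            },
        }
    }
}

print(json.dumps(settings_skeleton, indent=2))
```

The point of the layout is that an analyzer holds only names; the definitions live in the sibling sections, or are built in.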

Applying all the components
Now we apply all of the components above to create a custom analyzer, as follows:

# Create the index with a custom analyzer and map the "text" field to it.
# html_strip, standard and lowercase are built in, so only the mapping
# char filter and the stop token filter need their own definitions.
curl -XPUT localhost:9200/testindex_0204 -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "subsitute": {
          "type": "mapping",
          "mappings": [
            "$ => dollar",
            "% => percentage"
          ]
        }
      },
      "filter": {
        "stopwords_removal": {
          "type": "stop",
          "stopwords": ["has", "which", "to", "of", "the"]
        }
      },
      "analyzer": {
        "custom_analyzer_type_01": {
          "type": "custom",
          "char_filter": ["subsitute", "html_strip"],
          "tokenizer": "standard",
          "filter": ["stopwords_removal", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "custom_analyzer_type_01"
        }
      }
    }
  }
}'

This creates the index with a custom analyzer named "custom_analyzer_type_01".
The figure below breaks this index definition down and explains each part:
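
One thing worth checking in a definition like this is that every name the analyzer lists is either defined in the analysis section or built in. The helper below is a minimal sketch of such a check; `unresolved_references` and the `BUILT_IN` table are hypothetical names made up here, and the table lists only the built-in components this post uses, not everything Elasticsearch ships.

```python
# Built-in component names used in this post; Elasticsearch has many more.
BUILT_IN = {
    "char_filter": {"html_strip"},
    "tokenizer": {"standard"},
    "filter": {"lowercase"},
}

def unresolved_references(analysis):
    """Return (analyzer, kind, name) for each component name that is neither
    defined in the analysis section nor listed in BUILT_IN."""
    problems = []
    for analyzer_name, spec in analysis.get("analyzer", {}).items():
        for kind in ("char_filter", "tokenizer", "filter"):
            value = spec.get(kind, [])
            # "tokenizer" holds a single name, the other two hold lists.
            names = [value] if isinstance(value, str) else value
            defined = analysis.get(kind, {})
            for name in names:
                if name not in defined and name not in BUILT_IN[kind]:
                    problems.append((analyzer_name, kind, name))
    return problems

analysis = {
    "char_filter": {
        "subsitute": {"type": "mapping", "mappings": ["$ => dollar", "% => percentage"]},
    },
    "filter": {
        "stopwords_removal": {"type": "stop", "stopwords": ["has", "which", "to", "of", "the"]},
    },
    "analyzer": {
        "custom_analyzer_type_01": {
            "type": "custom",
            "char_filter": ["subsitute", "html_strip"],
            "tokenizer": "standard",
            "filter": ["stopwords_removal", "lowercase"],
        }
    },
}

print(unresolved_references(analysis))  # → []
```

A misspelled reference, for example "html-strip" in place of "html_strip", would show up in the returned list instead of surfacing only when the index is created.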

Generating tokens with the custom analyzer
The _analyze API shows the tokens this analyzer generates, as follows:

curl -XGET "localhost:9200/testindex_0204/_analyze?analyzer=custom_analyzer_type_01&pretty=true" -d 'Arun has 100 $ which accounts to 3 % of the total <h2> money </h2>'

The token list is as follows:
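
As a rough cross-check, the chain can be approximated in plain Python. This is a sketch, not Elasticsearch's implementation: the regular expressions only approximate html_strip and the standard tokenizer, the function name `analyze` is made up here, and the numbers printed are simply positions in the resulting list, not the offsets or position values that an _analyze response carries.

```python
import re

SAMPLE = "Arun has 100 $ which accounts to 3 % of the total <h2> money </h2>"
MAPPINGS = {"$": "dollar", "%": "percentage"}
STOPWORDS = {"has", "which", "to", "of", "the"}

def analyze(text):
    # char filter 1 ("subsitute"): replace each mapped character with its word.
    for symbol, word in MAPPINGS.items():
        text = text.replace(symbol, word)
    # char filter 2 (html_strip): approximated by deleting anything tag-shaped.
    text = re.sub(r"<[^>]+>", " ", text)
    # tokenizer (standard): approximated by runs of word characters.
    tokens = re.findall(r"\w+", text)
    # token filter 1 (stopwords_removal): drop exact, case-sensitive matches.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # token filter 2 (lowercase)
    return [t.lower() for t in tokens]

for number, token in enumerate(analyze(SAMPLE), start=1):
    print(number, token)
# 1 arun
# 2 100
# 3 dollar
# 4 accounts
# 5 3
# 6 percentage
# 7 total
# 8 money
```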

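A few observations can be made here:
Tokens 3 and 6 were originally $ and %, but were replaced with "dollar" and "percentage" as specified in the char_filter section.
The HTML tags <h2> and </h2> were also removed from the token list by the html_strip char filter.
The terms listed as stopwords in the stop filter, such as "to", "the", "which", and "has", were removed from the token list. Token 1 would originally have been "Arun", but the lowercase filter turned it into "arun".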
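
The order of the token filters matters. The stop filter compares terms case-sensitively by default (its ignore_case option defaults to false), and the analyzer in this post runs it before lowercase. The sketch below uses a hypothetical input that is not from the sample text to show the effect; the function names `stop` and `lowercase` are stand-ins written for this illustration.

```python
STOPWORDS = {"has", "which", "to", "of", "the"}

def stop(tokens):
    # Case-sensitive removal, mirroring the stop filter's default behaviour.
    return [t for t in tokens if t not in STOPWORDS]

def lowercase(tokens):
    return [t.lower() for t in tokens]

# Hypothetical input with a capitalized stop word.
tokens = ["The", "total", "of", "the", "money"]

print(lowercase(stop(tokens)))  # stop first, as in this post → ['the', 'total', 'money']
print(stop(lowercase(tokens)))  # lowercase first → ['total', 'money']
```

The sample sentence contains its stop words in lowercase only, so the difference does not show up there. If capitalized stop words should be dropped as well, listing lowercase before the stop filter is one way to get that.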

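Conclusion
In this post we saw how to build a custom analyzer and apply it to a field in Elasticsearch. With this post I am closing the second phase of the series (indexing, mapping, and analysis). This phase is one of the foundations for understanding Elasticsearch, and we will draw on it for many purposes later. Starting with phase 03, I will introduce you to the world of the Elasticsearch query DSL.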