Elasticsearch Study Notes, Advanced Part 13: Combining match with Proximity Matching to Balance Recall and Precision

Recall and precision

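In Elasticsearch, when you run a match query:
Recall is used loosely in these notes to mean matched documents / all documents, so the more documents a query matches, the higher its recall. (Strictly speaking, recall is the number of relevant documents returned divided by the number of relevant documents that exist.)
Precision here means that, among the matched documents, the ones we actually want get higher relevance scores and rank nearer the top of the results; the better that ranking, the higher the precision. (Strictly speaking, precision is the fraction of returned documents that are relevant.)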
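
To make the two terms concrete, here is a minimal sketch that uses the strict information-retrieval definitions. The document IDs, the hit sets, and the judgment of which documents are "relevant" are all made up for illustration; nothing here comes from a real index.

```python
def recall(returned: set, relevant: set) -> float:
    """Share of the relevant documents that the query actually returned."""
    return len(returned & relevant) / len(relevant) if relevant else 0.0


def precision(returned: set, relevant: set) -> float:
    """Share of the returned documents that are relevant."""
    return len(returned & relevant) / len(returned) if returned else 0.0


# Hypothetical relevance judgment: the user really wants docs 1 and 3.
relevant = {"1", "3"}

# Hypothetical hit sets: a broad query returns a lot, a strict one very little.
broad_hits = {"1", "2", "3"}
strict_hits = {"1"}

print("broad :", recall(broad_hits, relevant), precision(broad_hits, relevant))
print("strict:", recall(strict_hits, relevant), precision(strict_hits, relevant))
```

The broad query finds everything relevant but also returns noise; the strict query returns no noise but misses a relevant document. That is the trade-off the rest of these notes try to balance.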

match and match_phrase

With a match query and the search text java spark, every document that contains java or spark is returned (the default operator is or). Using match alone therefore gives very high recall but low precision.

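A match_phrase query requires every term to appear in the document's field, and the terms must also lie within the distance allowed by slop. With the search text java spark, documents that contain only java or only spark are not returned, and a document that contains both is still excluded if the terms are farther apart than slop allows. Precision is high, but recall can be too low: the query may return very few documents, or none at all.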
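
The slop check can be sketched in a few lines. This is a simplified model, not Elasticsearch's implementation: it tokenizes on whitespace (no analyzer), and it measures the required slop as the spread of "position minus offset in the phrase" across the chosen term occurrences, which is how sloppy phrase matching is commonly described. The function name `min_slop` is hypothetical.

```python
from itertools import product


def min_slop(text: str, phrase: str):
    """Smallest slop at which `phrase` would match `text`, or None if a term is missing.

    Simplified model: naive whitespace tokens, and the distance is the spread of
    (position in text - offset in phrase) over one occurrence of each term.
    """
    tokens = text.lower().split()
    terms = phrase.lower().split()
    # For each phrase term, every position where it occurs in the text.
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    if any(not p for p in positions):
        return None  # match_phrase needs every term to be present
    best = None
    for combo in product(*positions):
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        needed = max(shifted) - min(shifted)
        best = needed if best is None else min(best, needed)
    return best


doc1 = ("spark is best big data solution based on scala ,an programming "
        "language similar to java spark")
doc2 = "i think java is the best programming language"

print(min_slop(doc1, "java spark"))  # terms are adjacent and in order
print(min_slop(doc2, "java spark"))  # spark is missing
```

Under this model a document matches `match_phrase` with a given `slop` when `min_slop` is not `None` and is less than or equal to that `slop`. Note that swapping two adjacent terms costs 2, not 1.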

Balancing recall and precision with match and match_phrase

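Sometimes we want a document to be returned as soon as it matches some of the terms, which keeps recall high. At the same time we want to use match_phrase's ability to boost the score by distance: the closer the terms are to each other, the higher the score and the earlier the document is returned. In other words, with the search text java spark, any document that contains java or spark is returned, but a document that contains both java and spark close together gets a much higher score and is returned first.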

How to do it:

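Combine match and match_phrase in a bool query: put match in the must clause so that as many documents as possible are matched, and put match_phrase in the should clause to raise the relevance score of the documents we actually want, so that they are returned first.
Example:
Using match only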

GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "test_field": "java spark"
          }
        }
      ]
    }
  }
}

Output:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.031828,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.031828,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.21110919,
        "_source" : {
          "test_field" : "i think java is the best programming language"
        }
      }
    ]
  }
}

Using match_phrase only

GET /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "test_field": {
              "query": "java spark",
              "slop": 10
            }
          }
        }
      ]
    }
  }
}

Output:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.7704125,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.7704125,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      }
    ]
  }
}

Combining match with proximity matching to balance recall and precision

GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "test_field": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "test_field": {
              "query": "java spark",
              "slop": 10
            }
          }
        }
      ]
    }
  }
}

Output:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.8022406,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.8022406,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.21110919,
        "_source" : {
          "test_field" : "i think java is the best programming language"
        }
      }
    ]
  }
}
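
The three outputs above fit together arithmetically: a bool query adds up the scores of its matching clauses, so document 1's combined score should be its match score plus its match_phrase score, while document 2, which only satisfies the must clause, keeps its match score. The sketch below builds the same request body as a plain Python dict and checks that sum. The helper name `match_with_phrase_boost` is hypothetical, and the body would still have to be sent with whatever client you use.

```python
import math


def match_with_phrase_boost(field: str, text: str, slop: int) -> dict:
    """Build a bool query: match in must (recall), match_phrase in should (ranking)."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }


body = match_with_phrase_boost("test_field", "java spark", 10)

# Scores copied from the outputs above for document 1.
match_only = 1.031828
phrase_only = 0.7704125
combined = 1.8022406

# The combined score equals the sum of the two clause scores (up to float rounding).
scores_add_up = math.isclose(match_only + phrase_only, combined, abs_tol=1e-6)
print(scores_add_up)
```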

Using rescoring to optimize the performance of proximity matching

Differences between match and match_phrase

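match: as soon as a single term matches, the documents listed for that term are returned. The query only has to look the terms up in the inverted index.
match_phrase: it first fetches the document list of every term and finds the documents that contain all of them. Then, for each of those documents, it compares the positions of the terms to check whether they fall within the allowed range. This extra computation is what decides whether the document can be matched by moving terms within slop.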
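
The difference in work can be seen on a toy positional inverted index. This is only a sketch with naive whitespace tokenization and hypothetical helper names; a real Lucene index is far more elaborate. The point is that match needs a union of posting lists, while match_phrase needs an intersection and then still has to read positions for every surviving document.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Map term -> {doc_id: [positions]} using naive whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_candidates(index: dict, query: str) -> set:
    """match (default or): union of the posting lists, positions are never read."""
    hits = set()
    for term in query.lower().split():
        hits |= set(index.get(term, {}))
    return hits


def phrase_candidates(index: dict, query: str) -> set:
    """match_phrase, step 1: keep only documents that contain every term.

    Step 2 (not shown here) must compare term positions for each survivor.
    """
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


docs = {
    "1": ("spark is best big data solution based on scala ,an programming "
          "language similar to java spark"),
    "2": "i think java is the best programming language",
}
index = build_index(docs)

print(match_candidates(index, "java spark"))
print(phrase_candidates(index, "java spark"))
print(index["java"]["1"], index["spark"]["1"])  # positions needed for step 2
```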

Performance comparison

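A match query is much faster than match_phrase and proximity match (match_phrase with slop), because the latter two have to compute the distance between term positions.
As a rough rule of thumb, a match query is about 10 times faster than match_phrase and about 20 times faster than proximity match (match_phrase with slop).
Even so, Elasticsearch is fast and all of these generally finish within milliseconds: a match query may take a few milliseconds, while match_phrase and proximity match usually take somewhere between tens and hundreds of milliseconds.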

Performance optimization

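Optimizing match_phrase and proximity match usually comes down to reducing the number of documents that the proximity match has to examine.
The main idea is to use a match query first to filter out the documents we need, and then use proximity match to raise document scores according to term distance. The proximity match is applied only to the top N documents by score on each shard and adjusts their scores; this process is called rescoring. It works because users normally page through results and only look at the first few pages, so there is no need to run proximity match against every hit. In other words, match + proximity match together deliver both recall and precision.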

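By default, match might match 1000 documents, and proximity match would have to run on every one of them, checking whether the terms can be matched within slop and then contributing its score. In practice, though, users mostly page through results and may only look at the first 5 pages of 10 hits each, which is just 50 documents. Proximity match then only needs to run on the top 50 documents and contribute its score to them, rather than computing and contributing a score for all 1000. The window_size parameter limits how many documents (per shard) are rescored.
Example: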

GET /test_index/_search
{
  "query": {
    "match": {
      "test_field": "java spark"
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "match_phrase": {
          "test_field": {
            "query": "java spark",
            "slop": 10
          }
        }
      }
    },
    "window_size": 50
  }
}

Output:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.8022406,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.8022406,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.21110919,
        "_source" : {
          "test_field" : "i think java is the best programming language"
        }
      }
    ]
  }
}

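As you can see, the result is the same as with the bool approach. That holds here because both documents fall inside the rescore window and, by default, the rescore query's score is added to the original score.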
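
What window_size does on a single shard can be simulated in a few lines. This is a rough model under stated assumptions, not Elasticsearch code: it assumes the default behavior of adding the two scores with both weights at 1, it treats a document that does not match the rescore query as contributing 0, and the function name `rescore_top_n` is hypothetical.

```python
def rescore_top_n(hits, window_size, rescore_scores,
                  query_weight=1.0, rescore_query_weight=1.0):
    """Re-rank only the first `window_size` hits of one shard.

    hits: list of (doc_id, score), already sorted by score descending.
    rescore_scores: doc_id -> score of the rescore query (missing means no match).
    """
    window, rest = hits[:window_size], hits[window_size:]
    rescored = [
        (doc_id,
         query_weight * score + rescore_query_weight * rescore_scores.get(doc_id, 0.0))
        for doc_id, score in window
    ]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    # Hits outside the window keep their original score and order.
    return rescored + rest


# Scores copied from the match-only and match_phrase-only outputs earlier in these notes.
match_hits = [("1", 1.031828), ("2", 0.21110919)]
phrase_scores = {"1": 0.7704125}

result = rescore_top_n(match_hits, window_size=50, rescore_scores=phrase_scores)
print(result)

# Made-up data: with a window of 1, the second hit is never examined,
# even though the rescore query would have boosted it.
narrow = rescore_top_n([("a", 1.0), ("b", 0.9)], window_size=1,
                       rescore_scores={"b": 5.0})
print(narrow)
```

The second call shows the price of the optimization: a document outside the window cannot be promoted, so window_size should cover at least as many hits as users are likely to page through.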
