Elasticsearch Study Notes, Advanced Part 11: Multi-Field Search (Part 2)

This post continues from the previous one: https://segmentfault.com/a/11...

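4. most_fields query

most_fields is field-centric: it favors documents in which the largest number of fields match.
Suppose we let users search for addresses, and the index holds the following two documents: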

PUT /test_index/_create/1
{
    "street":   "5 Poland Street",
    "city":     "Poland",
    "country":  "United W1V",
    "postcode": "W1V 3DG"
}

PUT /test_index/_create/2
{
    "street":   "5 Poland Street W1V",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "3DG"
}

A most_fields-style query can be written by hand as a bool query with one match clause per field:

GET /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "street": "Poland Street W1V" } },
        { "match": { "city": "Poland Street W1V" } },
        { "match": { "country": "Poland Street W1V" } },
        { "match": { "postcode": "Poland Street W1V" } }
      ]
    }
  }
}

Repeating the query string for every field quickly becomes verbose. multi_match simplifies the query as follows:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

Result:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.3835402,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3835402,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}
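The relationship between the hand-written bool query and the multi_match form can be sketched in Python. This is only an illustration of the expansion described in the Elasticsearch documentation for most_fields; the helper name expand_most_fields is hypothetical and not part of any client library.

```python
def expand_most_fields(query, fields):
    """Build the bool/should body that a most_fields multi_match corresponds to.

    Hypothetical helper for illustration: one match clause per field,
    each carrying the same query string.
    """
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: query}} for field in fields]
            }
        }
    }


body = expand_most_fields(
    "Poland Street W1V", ["street", "city", "country", "postcode"]
)
clauses = body["query"]["bool"]["should"]
print(len(clauses), clauses[0])
```

The resulting dictionary has the same shape as the bool query shown earlier, which is why the two requests rank the documents the same way.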

With best_fields, doc 2 ranks ahead of doc 1:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

Result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.99938464,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      }
    ]
  }
}
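The two orderings can be understood from how each type combines per-field scores. The sketch below assumes that most_fields adds up the scores of the matching fields and that best_fields keeps the best field's score plus tie_breaker times the others. The per-field scores are made-up numbers chosen for illustration, not values returned by Elasticsearch.

```python
def combine_scores(field_scores, mode, tie_breaker=0.0):
    """Combine per-field scores the way the two multi_match types are assumed to."""
    matched = [score for score in field_scores.values() if score > 0]
    if not matched:
        return 0.0
    if mode == "most_fields":
        # Every matching field contributes to the total.
        return sum(matched)
    if mode == "best_fields":
        # Only the best field counts, plus an optional share of the rest.
        best = max(matched)
        return best + tie_breaker * (sum(matched) - best)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical per-field scores: doc 1 matches weakly in four fields,
# doc 2 matches strongly in a single field.
doc1 = {"street": 0.4, "city": 0.7, "country": 0.6, "postcode": 0.6}
doc2 = {"street": 1.0, "city": 0.0, "country": 0.0, "postcode": 0.0}

for mode in ("most_fields", "best_fields"):
    print(mode, combine_scores(doc1, mode), combine_scores(doc2, mode))
```

Under these assumptions doc 1 wins with most_fields because its many partial matches add up, while doc 2 wins with best_fields because its single street field matches all three terms.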

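Problems with most_fields

(1) It is designed to find the largest number of fields matching any of the words, rather than the best-matching words across all fields.
(2) operator and minimum_should_match are applied to each field separately, so they cannot be used to trim the long tail of low-relevance results across fields.
(3) Term frequencies differ from field to field and can interfere with each other, producing a poor final ordering.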
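Point (3) can be made concrete with a small calculation. The sketch below assumes the default BM25 similarity and its inverse document frequency formula, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents that have the field and n is the number of those documents containing the term. The counts come from the two example documents.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency as used by BM25 (assumed default similarity)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# "poland" occurs in the street field of both documents,
# but in the city field of only one of them.
idf_street = bm25_idf(doc_count=2, doc_freq=2)
idf_city = bm25_idf(doc_count=2, doc_freq=1)

print(f"street:poland idf = {idf_street:.4f}")
print(f"city:poland   idf = {idf_city:.4f}")
```

The same word is treated as rare in city and common in street, so a match on city:poland carries far more weight than a match on street:poland, even though the user typed the word only once.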

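5. Searching all fields with the copy_to parameter

Having described the problems with most_fields, we now look at how to solve them. The first approach is the copy_to parameter.
copy_to lets us combine several fields into a single field.
Create the following index: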

DELETE /test_index

PUT /test_index
{
  "mappings": {
    "properties": {
      "street": {
        "type": "text",
        "copy_to": "full_address"
      },
      "city": {
        "type": "text",
        "copy_to": "full_address"
      },
      "country": {
        "type": "text",
        "copy_to": "full_address"
      },
      "postcode": {
        "type": "text",
        "copy_to": "full_address"
      },
      "full_address": {
        "type": "text"
      }
    }
  }
}

Insert the same data as before:

PUT /test_index/_create/1
{
    "street":   "5 Poland Street",
    "city":     "Poland",
    "country":  "United W1V",
    "postcode": "W1V 3DG"
}

PUT /test_index/_create/2
{
    "street":   "5 Poland Street W1V",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "3DG"
}

Query:

GET /test_index/_search
{
  "query": {
    "match": {
      "full_address": "Poland Street W1V"
    }
  }
}

Result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.68370587,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68370587,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5469647,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}

Once the fields are merged into the single field full_address, the problems with most_fields no longer apply.
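The effect of copy_to can be imitated in plain Python to see what the combined field contains. This is a rough sketch: lowercasing and splitting on whitespace stands in for the standard analyzer, and the function names are hypothetical.

```python
DOCS = {
    "1": {"street": "5 Poland Street", "city": "Poland",
          "country": "United W1V", "postcode": "W1V 3DG"},
    "2": {"street": "5 Poland Street W1V", "city": "London",
          "country": "United Kingdom", "postcode": "3DG"},
}
SOURCE_FIELDS = ["street", "city", "country", "postcode"]


def tokenize(text):
    """Simplified stand-in for the standard analyzer (an assumption)."""
    return text.lower().split()


def full_address_tokens(doc):
    """Pool the tokens of every source field, as copy_to does for full_address."""
    tokens = []
    for field in SOURCE_FIELDS:
        tokens.extend(tokenize(doc[field]))
    return tokens


def term_counts(doc, query):
    """Count how often each query term occurs in the pooled field."""
    tokens = full_address_tokens(doc)
    return {term: tokens.count(term) for term in tokenize(query)}


for doc_id, doc in DOCS.items():
    print(doc_id, term_counts(doc, "Poland Street W1V"))
```

Because all tokens live in one field, there is a single set of term statistics, and every query term is looked up against the whole address rather than against each field separately.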

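6. cross_fields query

The second way to solve the problems with most_fields is the cross_fields query.
If we can rely on _all or define copy_to in advance, before documents are indexed, there is no problem. Elasticsearch also offers a search-time solution, however: the cross_fields query. cross_fields takes a term-centric approach, which differs markedly from the field-centric approach of best_fields and most_fields. It treats all the fields as one big field and looks for each term in any of them.
The difference between field-centric and term-centric is explained below.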

Field-centric

Running the query:

GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

returns:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "((postcode:poland postcode:street postcode:w1v) | (country:poland country:street country:w1v) | (city:poland city:street city:w1v) | (street:poland street:street street:w1v))"
    }
  ]
}

((postcode:poland postcode:street postcode:w1v) |
(country:poland country:street country:w1v) |
(city:poland city:street city:w1v) |
(street:poland street:street street:w1v))
This is the rule the query is rewritten to.
Setting operator to and changes it to:
((+postcode:poland +postcode:street +postcode:w1v) |
(+country:poland +country:street +country:w1v) |
(+city:poland +city:street +city:w1v) |
(+street:poland +street:street +street:w1v))
which means that all three terms must appear in the same field.

Term-centric

Running the query:

GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

returns:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "+blended(terms:[postcode:poland, country:poland, city:poland, street:poland]) +blended(terms:[postcode:street, country:street, city:street, street:street]) +blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])"
    }
  ]
}

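+blended(terms:[postcode:poland, country:poland, city:poland, street:poland])
+blended(terms:[postcode:street, country:street, city:street, street:street])
+blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])
This is the rule. In other words, every term must appear in at least one of the fields.
The cross_fields type first analyzes the query string into a list of terms, then searches for each term in any field. It blends the inverse document frequencies of a term across the fields to deal with the term-frequency problem, which resolves the problems with most_fields described earlier.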
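The difference between the two rewritten rules with operator set to and can be checked against the two example documents. The sketch below only models which documents match, not how they are scored, and it uses lowercasing plus whitespace splitting as a simplified stand-in for the standard analyzer; the function names are hypothetical.

```python
DOCS = {
    "1": {"street": "5 Poland Street", "city": "Poland",
          "country": "United W1V", "postcode": "W1V 3DG"},
    "2": {"street": "5 Poland Street W1V", "city": "London",
          "country": "United Kingdom", "postcode": "3DG"},
}
FIELDS = ["street", "city", "country", "postcode"]


def tokenize(text):
    """Simplified stand-in for the standard analyzer (an assumption)."""
    return text.lower().split()


def field_centric_and(doc, terms, fields):
    """best_fields / most_fields with AND: some single field holds every term."""
    return any(all(t in tokenize(doc[f]) for t in terms) for f in fields)


def term_centric_and(doc, terms, fields):
    """cross_fields with AND: every term is found in at least one field."""
    return all(any(t in tokenize(doc[f]) for f in fields) for t in terms)


terms = tokenize("Poland Street W1V")
for doc_id, doc in DOCS.items():
    print(doc_id,
          "field-centric:", field_centric_and(doc, terms, FIELDS),
          "term-centric:", term_centric_and(doc, terms, FIELDS))
```

Doc 1 spreads the three terms over street and country, so it fails the field-centric rule but satisfies the term-centric one; doc 2 holds all three terms in street and satisfies both.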
Compared with copy_to, cross_fields has the advantage that individual fields can be boosted at query time.
Example:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields",
      "fields": ["street^2", "city", "country", "postcode"]
    }
  }
}

Here the street field has a boost of 2, and every other field has a boost of 1.
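When the boosts come from application settings rather than being typed by hand, the request body can be assembled in code. The following is a minimal sketch that only builds the JSON body; the helper names are hypothetical, and sending the request is left to whichever client is in use.

```python
import json


def boosted_fields(boosts):
    """Turn {"street": 2, "city": 1} into ["street^2", "city"]."""
    return [
        field if boost == 1 else f"{field}^{boost}"
        for field, boost in boosts.items()
    ]


def cross_fields_body(query, boosts):
    """Build a multi_match body of type cross_fields with per-field boosts."""
    return {
        "query": {
            "multi_match": {
                "query": query,
                "type": "cross_fields",
                "fields": boosted_fields(boosts),
            }
        }
    }


body = cross_fields_body(
    "Poland Street W1V",
    {"street": 2, "city": 1, "country": 1, "postcode": 1},
)
print(json.dumps(body, indent=2))
```

Fields with a boost of 1 are written without a suffix, matching the request shown above.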