1. What is proximity matching

Suppose we have two sentences:

java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.

Search for java spark with a match query:

{
    "query": {
        "match": {
            "content": "java spark"
        }
    }
}

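A match query can only tell that a document contains java or spark (the default operator is or). It knows nothing about how close the two terms are to each other.
Suppose we want documents in which java and spark are close together to be returned first, that is, to receive a higher relevance score. This is where proximity matching comes in.
Here are the two requirements to implement:
(1) Search for java spark as an exact phrase: the two terms must be adjacent, with nothing between them.
(2) Search for java spark so that the closer the two terms are, the higher the document's score and the higher it ranks.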

2. match_phrase

Prepare the data:

PUT /test_index/_create/1
{
  "content": "java is my favourite programming language, and I also think spark is a very good big data system."
}

PUT /test_index/_create/2
{
  "content": "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."
}

Requirement 1 (java spark must be adjacent, with nothing in between) cannot be met with a match query:

GET /test_index/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  }
}

Result (both documents are returned):

{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.4255141,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.4255141,
        "_source" : {
          "content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.37266707,
        "_source" : {
          "content" : "java is my favourite programming language, and I also think spark is a very good big data system."
        }
      }
    ]
  }
}

A match_phrase query does meet it:

GET /test_index/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}

Result (only document 2, where java and spark are adjacent, is returned):

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.35695744,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.35695744,
        "_source" : {
          "content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."
        }
      }
    ]
  }
}

3. Term position

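Suppose we have two documents. The index records the position of each term in each document, written here as doc(position):

doc1: hello world, java spark
doc2: hi, spark java

hello doc1(0)
world doc1(1)
java  doc1(2) doc2(2)
spark doc1(3) doc2(1)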

The positions can be inspected with the _analyze API:

GET /_analyze
{
  "text": ["hello world, java spark"],
  "analyzer": "standard"
}

{
  "tokens" : [
    { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "world", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "java", "start_offset" : 13, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "spark", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}

GET /_analyze
{
  "text": ["hi, spark java"],
  "analyzer": "standard"
}

{
  "tokens" : [
    { "token" : "hi", "start_offset" : 0, "end_offset" : 2, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "spark", "start_offset" : 4, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "java", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}
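To make the idea of a positional index concrete, here is a minimal Python sketch. The `tokenize` and `build_positional_index` helpers are hypothetical names for illustration; the tokenizer is only a rough stand-in for the standard analyzer (for example, it splits on apostrophes, which the real analyzer handles differently).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters.

    Only an approximation of Elasticsearch's standard analyzer.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Build a mapping term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {"doc1": "hello world, java spark", "doc2": "hi, spark java"}
index = build_positional_index(docs)
for term in ("hello", "world", "java", "spark"):
    print(term, index[term])
```

For the two sample documents, the positions printed for java and spark agree with the `position` fields returned by _analyze above.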

4. How match_phrase works

Positions in the index, as used by match_phrase:

hello world, java spark        doc1
hi, spark java                 doc2

hello        doc1(0)
world        doc1(1)
java         doc1(2) doc2(2)
spark        doc1(3) doc2(1)

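A match_phrase query first finds the documents that contain every term in the query, then checks the term positions inside each of them. A document matches only if it contains every query term and the positions satisfy the phrase order.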

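doc1 --> java and spark --> java is at position 2 and spark at position 3, so spark's position is exactly 1 greater than java's --> condition satisfied
doc1 matches
doc2 --> java and spark --> java is at position 2 and spark at position 1, so spark's position is 1 less than java's rather than 1 greater --> the position check fails
doc2 does not match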
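The position check above can be sketched in a few lines of Python. This is an illustration of the idea only, not Lucene's actual implementation; `phrase_match` is a hypothetical helper, and the index is written out by hand from the table in this section.

```python
def phrase_match(index, terms):
    """Return the doc ids in which `terms` occur at consecutive positions."""
    postings = [index.get(term, {}) for term in terms]

    # Step 1: a candidate document must contain every query term.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)

    # Step 2: the i-th query term must sit exactly i positions
    # after some occurrence of the first term.
    matches = []
    for doc_id in sorted(candidates):
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(terms))):
                matches.append(doc_id)
                break
    return matches


# term -> {doc_id: [positions]}, copied from the table above
index = {
    "hello": {"doc1": [0]},
    "world": {"doc1": [1]},
    "java": {"doc1": [2], "doc2": [2]},
    "spark": {"doc1": [3], "doc2": [1]},
}

print(phrase_match(index, ["java", "spark"]))  # ['doc1']
print(phrase_match(index, ["spark", "java"]))  # ['doc2']
```

Both documents pass step 1 for java spark, but only doc1 passes the position check, which mirrors the walk-through above.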

5. The slop parameter

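Meaning: slop is the number of position moves the terms of the query string are allowed to make in order to match a document.
A concrete example:
Take the text hello world, java is very good, spark is also very good. and suppose we want match_phrase to match java spark. A plain match_phrase query finds nothing. (The results below assume test_index was deleted first: _create fails if ID 1 already exists, and document 2 from section 2 would otherwise still match.)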

PUT /test_index/_create/1
{
  "content": "hello world, java is very good, spark is also very good."
}

GET /test_index/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Run the text through the analyzer to see the term positions:

GET /_analyze
{
  "text": ["hello world, java is very good, spark is also very good."],
  "analyzer": "standard"
}

Result:

{
  "tokens" : [
    { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "world", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "java", "start_offset" : 13, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "is", "start_offset" : 18, "end_offset" : 20, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "very", "start_offset" : 21, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "good", "start_offset" : 26, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "spark", "start_offset" : 32, "end_offset" : 37, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "is", "start_offset" : 38, "end_offset" : 40, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "also", "start_offset" : 41, "end_offset" : 45, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "very", "start_offset" : 46, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "good", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}

java        is        very        good        spark        is
java        spark
java        -->       spark
java                  -->         spark
java                              -->         spark

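java is at position 2 and spark is at position 6. For the phrase java spark, spark would have to be at position 3, so it has to move 3 positions. Setting slop to 3 or more therefore makes the document match: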

GET /test_index/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 3
      }
    }
  }
}

Result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.21824157,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.21824157,
        "_source" : {
          "content" : "hello world, java is very good, spark is also very good."
        }
      }
    ]
  }
}
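The slop needed for a given document can be estimated with a small Python sketch. This is a simplified model, not Lucene's sloppy phrase matcher: it shifts each term's document position back by the term's offset in the query and measures the spread of the shifted values, and it does not handle repeated query terms properly. `required_slop` is a hypothetical helper, and the documents are pre-tokenized by hand.

```python
from itertools import product


def required_slop(query_terms, doc_tokens):
    """Smallest slop that lets the phrase match, or None if a term is missing.

    Simplified model: an exact phrase gives identical shifted positions
    (spread 0); every extra move widens the spread by one.
    """
    positions = []
    for term in query_terms:
        found = [i for i, token in enumerate(doc_tokens) if token == term]
        if not found:
            return None
        positions.append(found)

    best = None
    # Try every combination of occurrences and keep the tightest one.
    for combo in product(*positions):
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        if best is None or spread < best:
            best = spread
    return best


doc = "hello world java is very good spark is also very good".split()
print(required_slop(["java", "spark"], doc))                              # 3
print(required_slop(["java", "spark"], "hello world java spark".split()))  # 0
print(required_slop(["java", "spark"], "hi spark java".split()))           # 2
```

The first call reproduces the slop of 3 used in the query above. The last call shows that, under this model, matching the terms in reversed order costs a slop of 2.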

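A match_phrase query with slop is a proximity match. With slop, many more documents can match, but the closer together the terms are, the higher the relevance score, so those documents are returned first.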
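To see why closer matches score higher, the sketch below assumes the weighting commonly attributed to Lucene's sloppy phrase scoring, where a match contributes 1 / (1 + distance) to the phrase frequency. Treat the formula as an assumption for illustration: the real _score also depends on BM25 factors such as field length and term statistics, so the numbers here are not Elasticsearch scores. `sloppy_weight` and the sample distances are hypothetical.

```python
def sloppy_weight(distance):
    """Weight of one phrase match that needed `distance` moves (assumed formula)."""
    return 1.0 / (1.0 + distance)


# Hypothetical documents and the slop each one needs for "java spark".
distances = {"adjacent": 0, "one word between": 1, "three words between": 3}

# Rank the documents by weight, highest first.
ranked = sorted(distances, key=lambda name: sloppy_weight(distances[name]), reverse=True)
for name in ranked:
    print(name, sloppy_weight(distances[name]))
```

Under this assumption an adjacent match gets weight 1.0, one intervening word gives 0.5, and three intervening words give 0.25, so nearer matches rank first.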