1. What is proximity search
Suppose we have two sentences:
java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.
Search for java spark with a match query:
{
"query": {
"match": {"content": "java spark"}
}
}
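A match query only finds documents that contain java or spark (the default operator is or; both terms are required only with "operator": "and"). It knows nothing about how close java and spark are to each other.
If we want documents in which java and spark are close together to be returned first, they need a higher relevance score. This is where proximity match comes in.
Here are the two requirements to implement:
(1) Search for java spark as an exact phrase: the two terms must be adjacent, with no other term between them
(2) Search for java spark so that the closer the two words are in a doc, the higher its score and the higher it ranks
2. match phrase
Prepare the data: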
PUT /test_index/_create/1
{"content": "java is my favourite programming language, and I also think spark is a very good big data system."}
PUT /test_index/_create/2
{"content": "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."}
For requirement (1), searching for java spark as adjacent terms with nothing in between:
A match query cannot do this; both documents come back
GET /test_index/_search
{
"query": {
"match": {"content": "java spark"}
}
}
Result:
{
"took" : 16,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.4255141,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.4255141,
"_source" : {"content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.37266707,
"_source" : {"content" : "java is my favourite programming language, and I also think spark is a very good big data system."}
}
]
}
}
A match_phrase query can
GET /test_index/_search
{
"query": {
"match_phrase": {"content": "java spark"}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.35695744,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.35695744,
"_source" : {"content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."}
}
]
}
}
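3. term position
Suppose we have two documents:
doc1: hello world, java spark
doc2: hi, spark java
After analysis, the inverted index records the position of each term in each document, written here as doc(position):
hello doc1(0)
world doc1(1)
java doc1(2) doc2(2)
spark doc1(3) doc2(1)
The positions can be checked with the _analyze API: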
GET /_analyze
{"text": ["hello world, java spark"],
"analyzer": "standard"
}
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "java",
"start_offset" : 13,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "spark",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
GET /_analyze
{"text": ["hi, spark java"],
"analyzer": "standard"
}
{
"tokens" : [
{
"token" : "hi",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "spark",
"start_offset" : 4,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "java",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
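The term-to-position mapping in this section can be sketched in a few lines of Python. This is only an illustration: `tokenize` is a rough stand-in for the standard analyzer (lowercase, split on non-word characters), and `build_positional_index` is a hypothetical helper name, not part of Elasticsearch.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Rough stand-in for the standard analyzer (an assumption):
    # lowercase the text and split on non-word characters.
    return re.findall(r"\w+", text.lower())


def build_positional_index(docs):
    # Maps term -> {doc_id -> [positions of the term in that doc]}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    "doc1": "hello world, java spark",
    "doc2": "hi, spark java",
}
index = build_positional_index(docs)

for term in ("hello", "world", "java", "spark"):
    print(term, index[term])
```

For the two sample documents, the printed positions agree with the doc(position) listing and with the _analyze output.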
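4. How match phrase works
Positions in the index, as used by match_phrase:
doc1: hello world, java spark
doc2: hi, spark java
hello doc1(0)
world doc1(1)
java doc1(2) doc2(2)
spark doc1(3) doc2(1)
A match_phrase query first finds the docs that contain every term in the query, then checks the positions of those terms. A doc matches only if it contains all the terms and their positions follow the order of the phrase.
doc1 --> java and spark --> java is at position 2 and spark at position 3, so spark's position is exactly java's position plus 1, which satisfies the condition
doc1 matches
doc2 --> java and spark --> java is at position 2 and spark at position 1, so spark's position is 1 less than java's rather than 1 greater --> the positions alone rule it out
doc2 does not match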
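The position check described above can be sketched as follows. This is a simplified model for illustration, not Lucene's implementation; `tokenize` and `phrase_match` are hypothetical names, and the tokenizer only approximates the standard analyzer.

```python
import re


def tokenize(text):
    # Rough stand-in for the standard analyzer (an assumption).
    return re.findall(r"\w+", text.lower())


def phrase_match(text, phrase):
    # Collect the positions of every term in the document.
    doc_positions = {}
    for position, term in enumerate(tokenize(text)):
        doc_positions.setdefault(term, set()).add(position)

    terms = tokenize(phrase)
    # Step 1: the doc must contain every term of the phrase.
    if not terms or any(term not in doc_positions for term in terms):
        return False

    # Step 2: some occurrence of the first term must be followed by
    # the k-th phrase term exactly k positions later.
    for start in doc_positions[terms[0]]:
        if all(start + k in doc_positions[term] for k, term in enumerate(terms)):
            return True
    return False


print(phrase_match("hello world, java spark", "java spark"))  # True
print(phrase_match("hi, spark java", "java spark"))           # False
```

doc1 passes because spark sits one position after java; doc2 contains both terms but fails the position check.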
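5. The slop parameter
Meaning: slop is the number of times the terms of the query string may be moved in order to match a document.
A concrete example:
Take hello world, java is very good, spark is also very good. and suppose we want match_phrase to match java spark. A plain match_phrase query finds nothing. (The results below assume test_index was emptied first, for example by deleting the index; _create fails with a version conflict if id 1 already exists.)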
PUT /test_index/_create/1
{"content": "hello world, java is very good, spark is also very good."}
GET /test_index/_search
{
"query": {
"match_phrase": {"content": "java spark"}
}
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : []}
}
Check the positions with the _analyze API:
GET /_analyze
{"text": ["hello world, java is very good, spark is also very good."],
"analyzer": "standard"
}
Result:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "java",
"start_offset" : 13,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "is",
"start_offset" : 18,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "very",
"start_offset" : 21,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "good",
"start_offset" : 26,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "spark",
"start_offset" : 32,
"end_offset" : 37,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "is",
"start_offset" : 38,
"end_offset" : 40,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "also",
"start_offset" : 41,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "very",
"start_offset" : 46,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "good",
"start_offset" : 51,
"end_offset" : 55,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
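java   is     very   good   spark  is
java   spark
java   -->    spark
java          -->    spark
java                 -->    spark
We can see that java is at position 2 and spark at position 6, so the query's spark has to be moved 3 times to line up with the document. Setting slop to 3 or more is therefore enough for the search to find it: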
GET /test_index/_search
{
"query": {
"match_phrase": {
"content": {
"query": "java spark",
"slop": 3
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.21824157,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.21824157,
"_source" : {"content" : "hello world, java is very good, spark is also very good."}
}
]
}
}
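The move counting used in this section can be modeled in Python. This is a simplified sketch, assuming no term is repeated in the query: for each query term it takes the document position minus the query position, and treats the spread between the largest and smallest of those shifts as the number of moves. `required_slop` is a hypothetical helper, and the tokenizer only approximates the standard analyzer.

```python
import re
from itertools import product


def tokenize(text):
    # Rough stand-in for the standard analyzer (an assumption).
    return re.findall(r"\w+", text.lower())


def required_slop(text, phrase):
    # Returns the smallest slop under which the phrase would match,
    # or None if the doc is missing one of the terms.
    doc_terms = tokenize(text)
    query_terms = tokenize(phrase)
    occurrences = [
        [pos for pos, term in enumerate(doc_terms) if term == query_term]
        for query_term in query_terms
    ]
    if not query_terms or any(not positions for positions in occurrences):
        return None

    best = None
    # Try every combination of occurrences and keep the tightest one.
    for combo in product(*occurrences):
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        distance = max(shifts) - min(shifts)
        if best is None or distance < best:
            best = distance
    return best


doc = "hello world, java is very good, spark is also very good."
print(required_slop(doc, "java spark"))                        # 3
print(required_slop("hello world, java spark", "java spark"))  # 0
print(required_slop("hi, spark java", "java spark"))           # 2
```

Under this model the sample document needs a slop of 3, in line with the query above. Terms in reversed order need 2 moves, since spark has to step over java.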
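A match_phrase query with slop is a proximity match. With slop the match becomes approximate and can return many more results, but documents whose terms are closer together are returned first, that is, they get a higher relevance score.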
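To give a feel for why closer terms rank higher, here is a minimal sketch of a weight that shrinks as the number of moves grows. The formula 1 / (1 + distance) is an assumption chosen for illustration, and `sloppy_weight` is a hypothetical name. The real _score also depends on other factors such as term statistics and field length, so these numbers will not reproduce the scores in the responses above.

```python
def sloppy_weight(distance):
    # Assumed weighting for illustration: an exact phrase (distance 0)
    # gets full weight, and every extra move lowers the weight.
    return 1.0 / (1.0 + distance)


for distance in (0, 1, 3, 10):
    print(distance, round(sloppy_weight(distance), 3))
```

An exact phrase keeps full weight, while a match that needs 3 moves keeps only a quarter of it, so among documents that all satisfy the slop limit the tighter ones come first.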