Elasticsearch Study Notes, Advanced Part 12: Mastering Phrase-Matching Search

1. What is proximity search

Suppose we have two sentences:

java is my favourite programming language, and I also think spark is a very good big data system.

java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.

Use a match query to search for java spark:

{
    "query": {
        "match": {"content": "java spark"}
    }
}

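A match query only finds documents that contain java or spark (with the default or operator, either term is enough); it knows nothing about how close java and spark are to each other.
Suppose we want documents in which java and spark are close together to be returned first. Those documents need a higher relevance score, and that is where proximity match comes in.
Here are the two requirements we want to implement:
(1) Search for java spark where the two words must be adjacent, with no other term between them
(2) Search for java spark where the closer java and spark are, the higher the doc's score and the higher it ranks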

2. match phrase

Prepare the data:

PUT /test_index/_create/1
{"content": "java is my favourite programming language, and I also think spark is a very good big data system."}
PUT /test_index/_create/2
{"content": "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."}

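For requirement 1 (search for java spark where the two words are adjacent, with no other term between them):
A match query cannot do this, because it returns both documents: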

GET /test_index/_search
{
  "query": {
    "match": {"content": "java spark"}
  }
}

Result:

{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.4255141,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.4255141,
        "_source" : {"content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."}
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.37266707,
        "_source" : {"content" : "java is my favourite programming language, and I also think spark is a very good big data system."}
      }
    ]
  }
}

A match_phrase query does the job:

GET /test_index/_search
{
  "query": {
    "match_phrase": {"content": "java spark"}
  }
}

Result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.35695744,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.35695744,
        "_source" : {"content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."}
      }
    ]
  }
}

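3. term position

Suppose we have two documents. The inverted index records, for each term, the position at which it occurs in each doc: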

doc1: hello world, java spark
doc2: hi, spark java

hello doc1(0)
world doc1(1)
java  doc1(2) doc2(2)
spark doc1(3) doc2(1)

The positions in detail, as reported by the _analyze API:

GET /_analyze
{"text": ["hello world, java spark"],
  "analyzer": "standard"
}
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "java",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "spark",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
GET /_analyze
{"text": ["hi, spark java"],
  "analyzer": "standard"
}
{
  "tokens" : [
    {
      "token" : "hi",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "spark",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "java",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
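
The position listing above can be reproduced with a small Python sketch. It is only an illustration: `tokenize` is a hypothetical, very rough stand-in for the standard analyzer (lowercase, split on non-alphanumerics), and `build_positional_index` is a made-up helper, not anything Elasticsearch exposes.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Rough stand-in for the standard analyzer (assumption):
    lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    "doc1": "hello world, java spark",
    "doc2": "hi, spark java",
}
index = build_positional_index(docs)
for term in ("hello", "world", "java", "spark"):
    print(term, index[term])
```

For these two documents the printed positions agree with the `position` values returned by `_analyze`. The real analyzer handles cases such as apostrophes differently, so the sketch should not be trusted beyond simple text like this.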

4. How match phrase works

Positions in the index and match_phrase:

hello world, java spark        doc1
hi, spark java                doc2

hello        doc1(0)
world        doc1(1)
java         doc1(2) doc2(2)
spark        doc1(3) doc2(1)

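A match_phrase query first finds the documents that contain every term of the query, then checks that the term positions inside each of those documents satisfy the phrase's position arithmetic.

doc1 --> java and spark --> java is at position 2 and spark at position 3, so spark's position is exactly 1 greater than java's
doc1 matches
doc2 --> java and spark --> java is at position 2 and spark at position 1, so spark's position is 1 less than java's rather than 1 greater --> the positions alone rule it out
doc2 does not match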
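
The two steps described above (keep only docs that contain every term, then check positions) can be sketched in Python. This is a simplified model for illustration, not Lucene's actual implementation; `phrase_match` is a hypothetical helper, and the `index` dict is hand-written from the position listing in this section.

```python
def phrase_match(index, terms):
    """Return the doc ids in which `terms` occur at consecutive positions, in order."""
    if not terms or any(term not in index for term in terms):
        return []

    # Step 1: candidate docs must contain every term of the query.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])

    # Step 2: the i-th term must sit exactly i positions after the first term.
    matches = []
    for doc_id in sorted(candidates):
        later = [set(index[term][doc_id]) for term in terms[1:]]
        for start in index[terms[0]][doc_id]:
            if all(start + i + 1 in positions for i, positions in enumerate(later)):
                matches.append(doc_id)
                break
    return matches


# Positional index copied from the listing above.
index = {
    "hello": {"doc1": [0]},
    "world": {"doc1": [1]},
    "java": {"doc1": [2], "doc2": [2]},
    "spark": {"doc1": [3], "doc2": [1]},
}

print(phrase_match(index, ["java", "spark"]))  # ['doc1']
```

Both docs survive step 1, but only doc1 passes the position check, which is the same conclusion as the walkthrough.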

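5. The slop parameter

Meaning: slop is the number of moves the terms of the query string may make in order to match a document.
A concrete example:
Take hello world, java is very good, spark is also very good. and suppose we want match phrase to find java spark in it. Querying directly finds nothing. The _create call below assumes test_index was emptied first, because _create fails when document 1 already exists.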

PUT /test_index/_create/1
{"content": "hello world, java is very good, spark is also very good."}

GET /test_index/_search
{
  "query": {
    "match_phrase": {"content": "java spark"}
  }
}
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : []}
}

Now analyze the text to see the term positions:

GET /_analyze
{"text": ["hello world, java is very good, spark is also very good."],
  "analyzer": "standard"
}

Result:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "java",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "very",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "good",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "spark",
      "start_offset" : 32,
      "end_offset" : 37,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 38,
      "end_offset" : 40,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "also",
      "start_offset" : 41,
      "end_offset" : 45,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "very",
      "start_offset" : 46,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "good",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
java        is        very        good        spark        is

java        spark
java        -->        spark
java                -->            spark
java                            -->            spark

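java is at position 2 and spark is at position 6. For the phrase to match as written, spark would have to sit at position 3, so spark has to move 3 positions. Setting slop to 3 or more is therefore enough for the document to match: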
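
The position arithmetic can be checked with a short Python sketch. It uses a simplified model of slop: shift each term's position back by its offset in the query and take the spread of the shifted positions. `min_slop` is a hypothetical helper, and the sketch assumes the query terms are all distinct; Lucene's real sloppy phrase matching has extra handling for repeated terms.

```python
from itertools import product


def min_slop(query_terms, doc_tokens):
    """Smallest slop (in this simplified model) that lets the phrase match.

    Returns None if some query term does not occur in the document.
    Assumes the query terms are distinct.
    """
    shifted_positions = []
    for offset, term in enumerate(query_terms):
        # Shift each occurrence back by the term's offset inside the query.
        shifted = [pos - offset for pos, tok in enumerate(doc_tokens) if tok == term]
        if not shifted:
            return None
        shifted_positions.append(shifted)
    # An exact phrase has identical shifted positions (spread 0);
    # otherwise the spread is the number of moves needed.
    return min(max(combo) - min(combo) for combo in product(*shifted_positions))


tokens = "hello world java is very good spark is also very good".split()
print(min_slop(["java", "spark"], tokens))  # 3
```

In this model, a reversed pair such as spark java needs a slop of 2 to match java spark, because the terms have to swap places.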

GET /test_index/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 3
      }
    }
  }
}

Result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.21824157,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.21824157,
        "_source" : {"content" : "hello world, java is very good, spark is also very good."}
      }
    ]
  }
}

With slop added, match phrase becomes a proximity match. Slop lets many more documents match, but documents whose terms are closer together get a higher relevance score and are returned first.
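
To see why closer terms score higher, here is a minimal sketch. It assumes that each sloppy match is weighted by 1 / (1 + distance), which is how Lucene's sloppy phrase matching works as far as I know. `sloppy_weight` is a hypothetical name, and the final _score also depends on BM25 factors such as field length and term rarity, so this is only the proximity part.

```python
def sloppy_weight(match_distance):
    """Weight of one phrase match when the terms had to move `match_distance` positions.

    Assumption: the weight is 1 / (1 + distance), so an exact phrase counts fully
    and looser matches count progressively less.
    """
    return 1.0 / (1.0 + match_distance)


for distance in (0, 1, 3):
    print(distance, sloppy_weight(distance))
```

An exact phrase gets weight 1.0, while the java ... spark match at distance 3 gets 0.25. That fits the lower _score seen with slop 3, although the two scores above come from different documents and are not directly comparable.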
