A Personal Guide to Using Elasticsearch

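Elasticsearch (ES) is far faster than a relational database at searching large data sets. This is not a professional ES guide or an analysis of ES internals; it only records the problems I have run into while using ES and the kinds of searches I use. The ES documentation and API reference are hard to search, and I kept losing time looking things up and translating them, so I decided to write down my problems and searches as a short guide to everyday ES usage.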

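What is a document in ES?

ES stores each document as a JSON object of key-value pairs. Every document carries three pieces of metadata.

_index: where the document is stored

Data is stored and indexed in shards; an index is just a logical namespace that groups one or more shards. ES manages this itself, so users do not need to care about it.

_type: the kind of thing the document represents. Since 6.x an index may hold only one type; types are deprecated in 7.x and removed in 8.x.
_id: the unique identifier of the document.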

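How does mapping work in ES?

The mapping mechanism determines field types: each field is matched to a specific type (string, number, boolean, date, and so on).
The analysis mechanism tokenizes full text to build the inverted index used for searching.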
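To make the idea of an inverted index concrete, here is a minimal sketch in Python. It assumes a toy analyzer that only lowercases and splits on whitespace; `analyze` and `build_inverted_index` are names made up for this illustration, and real ES analyzers do far more (and handle Chinese very differently).

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer (an assumption): lowercase, then split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


docs = {1: "Quick brown fox", 2: "Lazy brown dog", 3: "Quick dog"}
inverted = build_inverted_index(docs)
print(inverted["brown"])
print(inverted["quick"])
```

Searching for a word then becomes a dictionary lookup instead of a scan over every document.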

Viewing the mapping of a type in an index

ES guesses the type of each field and dynamically generates the mapping between fields and types.
A date field and a string field are indexed differently, so searches against them behave differently. Every data type is indexed in its own way.

GET /user/_mapping/user
--Response
{
    "user": {
        "mappings": {
            "user": {
                "properties": {
                    "height": {
                        "type": "long"
                    },
                    "nickname": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        }
    }
}

Exact values and full text

An exact value is a definite value, such as a date or a number.
Full text is unstructured data, for example: this is a piece of text

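The index parameter controls how a string is indexed

Value	Meaning
analyzed	Analyze the string first, then index it (index the field as full text).
not_analyzed	Index the field so that it is searchable, but index the value exactly as given, without analysis.
no	Do not index the field; it cannot be searched.

For a string field the default is analyzed. To map a field as an exact value, set it to not_analyzed. (From 5.x on, the string type is split into text, which is analyzed, and keyword, which is not.)

{
    "nickname": {
        "type": "string",
        "index": "not_analyzed"
    }
}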
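Since the string type only exists before 5.x, it can help to see how the three index values line up with the newer text and keyword types. This is a rough sketch: `upgrade_string_mapping` is a hypothetical helper, and the choice for `no` (a keyword field with indexing switched off) is my own assumption, one reasonable option rather than an official rule.

```python
def upgrade_string_mapping(legacy):
    """Translate a pre-5.x string field mapping into a 5.x+ style mapping (illustrative)."""
    if legacy.get("type") != "string":
        return dict(legacy)  # not a legacy string field: return an unchanged copy
    index = legacy.get("index", "analyzed")  # analyzed was the default for string
    if index == "analyzed":
        return {"type": "text"}
    if index == "not_analyzed":
        return {"type": "keyword"}
    if index == "no":
        # Assumption: keep the field but make it unsearchable.
        return {"type": "keyword", "index": False}
    raise ValueError(f"unknown index setting: {index!r}")


print(upgrade_string_mapping({"type": "string", "index": "not_analyzed"}))
```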

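What ES returns, as seen from an empty search

GET /_search
--Response
{
    "took": 7,                  // time the whole search request took (milliseconds)
    "timed_out": false,         // whether the request timed out
    "_shards": {
        "total": 25,            // shards that took part in the query
        "successful": 25,       // shards that succeeded
        "skipped": 0,           // shards that were skipped
        "failed": 0             // shards that failed
    },
    "hits": {
        "total": 1291,          // number of matching documents
        "max_score": 1,         // highest _score among the results
        "hits": [
            {
                "_index": "feed",
                "_type": "feed",
                "_id": "1JNC42oB07Tkhuy89JSd",
                "_score": 1,
                "_source": {
                }
            }
        ]
    }
}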
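In client code I usually only want the total and the hits themselves. A minimal sketch, assuming the response has already been parsed into a Python dict; `summarize_hits` is a made-up name. It also accepts the 7.x shape, where `hits.total` is an object with `value` and `relation` instead of a plain number.

```python
def summarize_hits(response):
    """Return (total, [(id, score, source), ...]) from a parsed search response."""
    hits = response["hits"]
    total = hits["total"]
    if isinstance(total, dict):  # 7.x+ shape: {"value": ..., "relation": ...}
        total = total["value"]
    rows = [(h["_id"], h["_score"], h.get("_source", {})) for h in hits["hits"]]
    return total, rows


# Trimmed copy of the response shown above.
response = {
    "took": 7,
    "timed_out": False,
    "hits": {
        "total": 1291,
        "max_score": 1,
        "hits": [
            {"_index": "feed", "_type": "feed", "_id": "1JNC42oB07Tkhuy89JSd",
             "_score": 1, "_source": {}}
        ],
    },
}
total, rows = summarize_hits(response)
print(total, rows[0][0])
```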

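Creating an index

Create an index named megacorp. The PUT request itself only returns an acknowledgement; the settings below are what GET /megacorp shows for the new index.

PUT /megacorp

GET /megacorp
--Response
{
    "megacorp": {
        "aliases": {},
        "mappings": {},
        "settings": {
            "index": {
                "creation_date": "1558956839146",       // creation time (millisecond timestamp)
                "number_of_shards": "5",
                "number_of_replicas": "1",
                "uuid": "pGBqNmxrR1S8I7_jAYdvBA",       // unique id
                "version": {
                    "created": "6060199"
                },
                "provided_name": "megacorp"
            }
        }
    }
}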

Creating a document

PUT /user/user/1001
--Body
{
    "nickname": "你有病啊",
    "height": 180,
    "expect": "我喜欢看_书",
    "tags": ["大学党", "求偶遇"]
}
--Response
{
    "_index": "user",
    "_type": "user",
    "_id": "1001",
    "_version": 1,      // version number: every document has one, and _version increases whenever the document changes (including deletion)
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 1,
    "_primary_term": 1
}

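Using a filter

Search for documents whose height is greater than 170 and whose nickname is threads. (The filtered query that was used for this before 5.0 has been removed; a bool query with a filter clause replaces it.)

GET /user/user/_search
--Body
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "nickname": "threads"
                }
            },
            "filter": {
                "range": {
                    "height": {
                        "gt": 170
                    }
                }
            }
        }
    }
}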

Structured queries

A structured query is passed in the query parameter.

GET /_search
{
    "query": SUBQUERY
}

# Subquery
{
    "query_name": {
        "argument": value, ...
    }
}

# Subquery that targets a specific field
{
    "query_name": {
        "field_name": {
            "argument": value, ...
        }
    }
}

Combining multiple clauses

Query clauses can be combined, so that simple clauses build up a complex query. For example:
Simple (leaf) clauses compare a query string against one or more fields.
Compound clauses combine other clauses.

GET /user/user/_search
--Body
{
    "query": {
        "bool": {
            "must": {                   // must: the clause has to match
                "match": {
                    "nickname": "threads"
                }
            },
            "must_not": {               // must_not: the clause must not match
                "match": {
                    "height": 170
                }
            },
            "should": [                 // should: results may match the clauses in this array
                {
                    "match": {
                        "nickname": "threads"
                    }
                },
                {
                    "match": {
                        "expect": "我喜欢唱歌、跳舞、打游戏"
                    }
                }
            ]
        }
    }
}
--Response
{
    "hits": {
        "total": 1,
        "max_score": 0.91862875,
        "hits": [
            {
                "_index": "user",
                "_type": "user",
                "_id": "98047",
                "_score": 0.91862875,
                "_source": {
                    "nickname": "threads",
                    "height": 175,
                    "expect": "我喜欢唱歌、跳舞、打游戏",
                    "tags": [
                        "工作党",
                        "吃鸡"
                    ]
                }
            }
        ]
    }
}

Full-text search

GET /user/user/_search
--Body
{
    "query": {
        "match": {
            "expect": "打游戏"
        }
    }
}
--Response
{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": 0.8630463,
        "hits": [
            {
                "_index": "user",
                "_type": "user",
                "_id": "98047",
                "_score": 0.8630463,        // relevance score
                "_source": {
                    "nickname": "threads",
                    "height": 175,
                    "expect": "我喜欢唱歌、跳舞、打游戏",
                    "tags": [
                        "工作党",
                        "吃鸡"
                    ]
                }
            },
            {
                "_index": "user",
                "_type": "user",
                "_id": "94302",
                "_score": 0.55900055,
                "_source": {
                    "nickname": "摇了摇头",
                    "height": 173,
                    "expect": "我喜欢rap、跳舞、打游戏",
                    "tags": [
                        "工作党",
                        "吃饭"
                    ]
                }
            },
            {
                "_index": "user",
                "_type": "user",
                "_id": "91031",
                "_score": 0.53543615,
                "_source": {
                    "nickname": "你有病啊",
                    "height": 180,
                    "expect": "我喜欢学习、逛街、打游戏",
                    "tags": [
                        "大学党",
                        "求偶遇"
                    ]
                }
            }
        ]
    }
}

ES sorts the result set by relevance score, that is, by how well each document matches the query.

Phrase search

Match several words or a phrase exactly.

GET /user/user/_search
--Body
{
    "query": {
        "match_phrase": {
            "expect": "学习"
        }
    }
}
--Response
{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1.357075,
        "hits": [
            {
                "_index": "user",
                "_type": "user",
                "_id": "91031",
                "_score": 1.357075,
                "_source": {
                    "nickname": "你有病啊",
                    "height": 180,
                    "expect": "我喜欢学习、逛街、打游戏",
                    "tags": [
                        "大学党",
                        "求偶遇"
                    ]
                }
            }
        ]
    }
}

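Structured filtering

A filter asks a yes/no question of every document, namely whether a field contains a specific value:
Is the created date between xxx and xxx?
Does expect contain 学习?
Is the location field within xx km of the target point?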

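Performance of filters versus queries

The result of a filter is a plain list of matching documents. It is quick to compute and easy to cache in memory, at a cost of only 1 bit per document.

A query has to find the matching documents and also compute a relevance score for each one, so queries are generally more expensive than filters, and query results are not cacheable.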
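The "one bit per document" idea can be sketched with a plain Python integer used as a bitset. This only illustrates the concept under my own assumptions (`make_bitset` and `matching_docs` are made-up helpers); ES uses its own bitset and cache implementations internally.

```python
def make_bitset(docs, predicate):
    """Encode filter results as an int: bit i is 1 when docs[i] matches."""
    bits = 0
    for i, doc in enumerate(docs):
        if predicate(doc):
            bits |= 1 << i
    return bits


def matching_docs(docs, bits):
    """Decode a bitset back into the list of matching documents."""
    return [doc for i, doc in enumerate(docs) if (bits >> i) & 1]


docs = [
    {"nickname": "threads", "height": 175},
    {"nickname": "摇了摇头", "height": 173},
    {"nickname": "你有病啊", "height": 180},
]
tall = make_bitset(docs, lambda d: d["height"] >= 175)
not_threads = make_bitset(docs, lambda d: d["nickname"] != "threads")
both = tall & not_threads  # combining two cached filters is a single bitwise AND
print(bin(tall), bin(not_threads), bin(both))
```

Because each filter result is just a row of bits, cached filters can be combined cheaply, with no scoring involved.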

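When should you use a filter, and when a query?

As a rule of thumb: use queries for full-text search or anything else that needs relevance scoring, and use filters for everything else.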
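One way to apply this rule in client code is to keep scoring clauses and yes/no clauses in separate lists and let a small helper assemble the bool query. A sketch only; `build_search` is a hypothetical helper, not part of any ES client.

```python
import json


def build_search(scoring=None, filters=None):
    """Put scoring clauses under must and yes/no clauses under filter."""
    bool_query = {}
    if scoring:
        bool_query["must"] = list(scoring)
    if filters:
        bool_query["filter"] = list(filters)
    if not bool_query:
        return {"query": {"match_all": {}}}  # nothing given: match everything
    return {"query": {"bool": bool_query}}


body = build_search(
    scoring=[{"match": {"expect": "打游戏"}}],         # full text, needs a score
    filters=[{"range": {"height": {"gte": 170}}}],    # yes/no, no score needed
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```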

Highlighting results

GET /user/user/_search
--Body
{
    "query": {
        "match_phrase": {
            "expect": "学习"
        }
    },
    "highlight": {
        "fields": {
            "expect": {}
        }
    }
}

Sorting

GET /user/user/_search
--Body
{
    "query": {
        "match_all": {}
    },
    "size": 1000,                   // number of results to return
    "sort": {
        "nickname.keyword": {       // sort by nickname
            "order": "desc"
        },
        "height": {                 // sort by height
            "order": "desc"
        }
    }
}

Paginating search results

GET /user/user/_search
--Body
{
    "from": 2,      // same as OFFSET in MySQL
    "size": 1       // same as LIMIT in MySQL
}
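Application code usually thinks in page numbers rather than offsets, so a small conversion helper is handy. A minimal sketch: `page_params` is a made-up name, and the `max_window` default of 10000 mirrors the default of the `index.max_result_window` setting, beyond which from + size requests are rejected.

```python
def page_params(page, per_page, max_window=10000):
    """Convert a 1-based page number into from/size search parameters."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    if start + per_page > max_window:
        raise ValueError("from + size exceeds the result window")
    return {"from": start, "size": per_page}


print(page_params(3, 1))  # the third page with one result per page
```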

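Aggregations

GET /user/user/_search
--Body
{
    "aggs": {
        "all_tags": {
            "terms": {
                // "field": "tags" aggregates on analyzed tokens, which is unfriendly to Chinese: "学习" is split into "学" and "习"
                "field": "tags.keyword"     // from 5.x on, the keyword sub-field aggregates on the whole, unanalyzed value
            },
            "aggs": {
                "avg_height": {
                    "avg": {
                        "field": "height"
                    }
                }
            }
        }
    }
}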
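To see what a terms aggregation with an avg sub-aggregation computes, here is the same grouping done in memory over the three sample users from the search results above. It is a simplified sketch: `terms_with_avg` is a made-up function, the output shape only loosely imitates ES buckets, and the ordering (by count, then by key) is my own assumption.

```python
from collections import defaultdict


def terms_with_avg(docs, terms_field, avg_field):
    """Group docs by each value of terms_field and average avg_field per group."""
    groups = defaultdict(list)
    for doc in docs:
        for value in set(doc.get(terms_field, [])):  # count each doc once per value
            groups[value].append(doc[avg_field])
    buckets = [
        {"key": key, "doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in groups.items()
    ]
    # Assumed ordering: largest buckets first, ties broken by key.
    buckets.sort(key=lambda b: (-b["doc_count"], b["key"]))
    return buckets


docs = [
    {"nickname": "threads", "height": 175, "tags": ["工作党", "吃鸡"]},
    {"nickname": "摇了摇头", "height": 173, "tags": ["工作党", "吃饭"]},
    {"nickname": "你有病啊", "height": 180, "tags": ["大学党", "求偶遇"]},
]
buckets = terms_with_avg(docs, "tags", "height")
print(buckets[0])
```

Each whole tag becomes one bucket here, which corresponds to aggregating on tags.keyword rather than on the analyzed tags field.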

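A collection of query and filter clauses

`term` filter

term is mainly used to match exact values, such as numbers, dates, booleans, or not_analyzed strings.

{"term": {"height": 175}}               // height is 175 (number)
{"term": {"date": "2014-09-01"}}        // date is 2014-09-01 (date)
{"term": {"public": true}}              // public is true (boolean)
{"term": {"nickname": "threads"}}       // nickname is threads (full-text field; term compares against the indexed tokens, not the original text)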

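`terms` filter

terms works like term but lets you specify several values. A document matches if the field contains any of the listed values.

// match results whose nickname is threads or 摇了摇头 (the keyword sub-field is used to match the Chinese value exactly)
{
    "query": {
        "terms": {
            "nickname.keyword": ["threads", "摇了摇头"]
        }
    }
}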

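`range` filter

The range filter finds data within a specified range.

{
    "query": {
        "range": {
            "height": {"gte": 150, "lt": 180}
        }
    }
}
// gt:  greater than
// gte: greater than or equal to
// lt:  less than
// lte: less than or equal to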

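`exists` and `missing` filters

exists and missing check whether a document has a value for a given field or lacks one, similar to the IS NOT NULL and IS NULL conditions in SQL. (missing was removed in 5.0; wrap exists in a bool must_not instead.)

// results that have a nickname field
{"exists": {"field": "nickname"}}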
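Because missing no longer exists on 5.x and later, the usual substitute is exists wrapped in must_not. A minimal sketch that builds both clauses as Python dicts; `exists_query` and `missing_query` are names I made up for this note.

```python
def exists_query(field):
    """Clause matching documents that have a value for field."""
    return {"exists": {"field": field}}


def missing_query(field):
    """Clause matching documents that lack field: exists negated via must_not."""
    return {"bool": {"must_not": exists_query(field)}}


print(missing_query("nickname"))
```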

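`bool` filter

The bool filter combines the results of several filter conditions with boolean logic:

must: all of the conditions must match, equivalent to AND
must_not: none of the conditions may match, equivalent to NOT
should: at least one of the conditions must match, equivalent to OR

{
    "bool": {
        "must":     {"term": {"nickname": "threads"}},
        "must_not": {"term": {"height": 165}},
        "should": [
            {"term": {"height": 175}},
            {"term": {"nickname": "threads"}}
        ]
    }
}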

`match_all` query

match_all matches all documents; it is the default when no query is given. (It is often used together with filters.)

{
    "match_all": {}
}

`match` query

match is the standard query: whether you need a full-text query or an exact-value query, you will usually reach for it.

{
    "match": {"nickname": "threads"}
}

`multi_match` query

multi_match runs a match query against several fields at once:

// results whose nickname or expect contains threads
{
    "multi_match": {
        "query": "threads",
        "fields": ["nickname", "expect"]
    }
}

`bool` query

If a bool query has no must clause, at least one should clause has to match. If it does have a must clause, a document can match without satisfying any should clause.

{
    "query": {
        "bool": {
            "must": {
                "multi_match": {"query": "学习", "fields": ["nickname", "expect"]}
            },
            "must_not": {
                "match": {"height": 175}
            },
            "should": [
                {"match": {"nickname": "threads"}}
            ]
        }
    }
}
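The must / must_not / should rule described above can be mimicked over plain dicts. This is a deliberately simplified sketch: it only does exact-value comparison, ignores scoring, filter clauses and minimum_should_match, and `term_matches` and `bool_matches` are hypothetical helpers.

```python
def term_matches(doc, clause):
    """clause is {field: value}; a list-valued field matches if it contains value."""
    field, value = next(iter(clause.items()))
    actual = doc.get(field)
    if isinstance(actual, list):
        return value in actual
    return actual == value


def bool_matches(doc, must=(), must_not=(), should=()):
    """Apply the rule above: should is only required when there is no must."""
    if not all(term_matches(doc, c) for c in must):
        return False
    if any(term_matches(doc, c) for c in must_not):
        return False
    if should and not must:
        return any(term_matches(doc, c) for c in should)
    return True


doc = {"nickname": "threads", "height": 175, "tags": ["工作党", "吃鸡"]}
print(bool_matches(doc, must=[{"nickname": "threads"}], must_not=[{"height": 165}]))
print(bool_matches(doc, should=[{"height": 180}]))
```

In real ES a matching should clause also raises the score of the document, which this sketch does not model.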

Queries with a `filter`

Use filter to combine a query and a filter in one request.

// results whose nickname is threads, narrowed down to those whose height is 175
{
    "query": {
        "bool": {
            "must": {"match": {"nickname": "threads"}},
            "filter": {"term": {"height": 175}}
        }
    }
}

Analyzing queries

How do you check whether a query is valid?

GET /user/user/_validate/query
{
    "query": {
        "match_all": {"nickname": "threads"}
    }
}
--Response
{
    "valid": false,
    "error": "ParsingException[[4:13] [match_all] unknown field [nickname], parser not found]; nested: XContentParseException[[4:13] [match_all] unknown field [nickname], parser not found];; org.elasticsearch.common.xcontent.XContentParseException: [4:13] [match_all] unknown field [nickname], parser not found"
}

How do you understand how a query is executed?

The response above reports valid: false, so the query is invalid: match_all does not accept a field name. Adding the explain parameter to the same endpoint returns a description of how ES interprets the query, or a fuller error when it is invalid:

GET /user/user/_validate/query?explain
{
    "query": {
        "match_all": {"nickname": "threads"}
    }
}

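Exceptions that often come up when querying ES

A fielddata exception when using aggregations

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [tags] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
            }
        ],
        ...
    },
    "status": 400
}

From 5.x on, sorting and aggregating on a text field rely on a separate in-memory data structure (fielddata), which is disabled by default and has to be enabled explicitly. As the error message notes, this can use significant memory, and aggregating on a keyword field is the alternative.
Enabling fielddata

PUT user/_mapping/user/
--Body
{
    "properties": {
        "tags": {
            "type": "text",
            "fielddata": true
        }
    }
}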
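Instead of turning on fielddata, client code can look at the mapping and pick the keyword sub-field when one exists. A rough sketch, assuming `properties` is the properties object of a mapping response like the one shown earlier; `aggregatable_field` is a made-up helper.

```python
def aggregatable_field(properties, name):
    """Pick a field name that can be aggregated on without a fielddata error."""
    field = properties[name]
    if field.get("type") != "text" or field.get("fielddata"):
        return name  # not text, or fielddata already enabled
    for sub_name, sub in field.get("fields", {}).items():
        if sub.get("type") == "keyword":
            return f"{name}.{sub_name}"
    raise ValueError(f"{name} is a text field with no keyword sub-field")


properties = {
    "height": {"type": "long"},
    "nickname": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
}
print(aggregatable_field(properties, "nickname"))
print(aggregatable_field(properties, "height"))
```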