Elastic-Search

Getting Started with Elasticsearch

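Elasticsearch (ES for short) is a highly scalable, open-source full-text search and analytics engine. It can store, search, and analyze massive amounts of data in near real time.

Use cases

Common scenarios include:
- product search in online stores
- log analysis systems (ELK)
- business requirements that call for quickly investigating and analyzing large volumes of data (tens of millions of records) and visualizing the results

Installing and running ES

Installing the Java environment

Elasticsearch requires Java 8. If Java is not yet installed on your machine, see the Java installation guide.

Installing Elasticsearch

Once Java is installed, install Elasticsearch as follows, or follow the official documentation:

    wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.zip
    unzip elasticsearch-5.5.1.zip
    cd elasticsearch-5.5.1/

From the extracted directory, run the following command to start Elasticsearch:

    ./bin/elasticsearch

If startup fails, check whether it is one of the following errors.

Error 1

    OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N

Open elasticsearch-5.5.1/config/jvm.options and append this line at the end:

    -XX:-AssumeMP

Error 2

    OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000085330000, 2060255232, 0) failed; error='Cannot allocate memory' (errno=12)

First run:

    sysctl -w vm.max_map_count=262144

Then open elasticsearch-5.5.1/config/jvm.options and set the heap size:

    -Xms512m
    -Xmx512m

Error 3

    [2019-06-27T15:01:43,165][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
    org.elasticsearch.bootstrap.StartupException: java.lang.RuntimeException: can not run elasticsearch as root

Cause: since version 5, Elasticsearch refuses to run as the root user for security reasons. ...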
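The heap-size change for the "Cannot allocate memory" error can also be scripted. The following is only a minimal sketch and not part of the original article: set_heap_size is a hypothetical helper that rewrites the -Xms and -Xmx lines in the text of a jvm.options file, and the 512m default simply mirrors the value used in the excerpt.

```python
import re


def set_heap_size(options_text: str, size: str = "512m") -> str:
    """Return jvm.options content with -Xms and -Xmx both set to `size`.

    Hypothetical helper for illustration; it only edits text and does not
    touch any file on disk.
    """
    lines = []
    seen = set()
    for line in options_text.splitlines():
        # Match active heap flags such as "-Xms2g"; commented lines start
        # with "#" and are therefore left alone.
        match = re.match(r"^-(Xms|Xmx)\S*$", line.strip())
        if match:
            flag = match.group(1)
            if flag not in seen:
                lines.append(f"-{flag}{size}")
                seen.add(flag)
            continue  # drop duplicate heap flags
        lines.append(line)
    # Append any heap flag the file did not define at all.
    for flag in ("Xms", "Xmx"):
        if flag not in seen:
            lines.append(f"-{flag}{size}")
    return "\n".join(lines) + "\n"


sample = "# heap settings\n-Xms2g\n-Xmx2g\n-XX:-AssumeMP\n"
print(set_heap_size(sample))
```

Keeping the helper as a pure text transformation makes it easy to check the result before writing it back to config/jvm.options.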

Geek Time course "Elasticsearch核心技术与实战" (Elasticsearch Core Technology and Practice): Cashback and Study Notes

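Follow the 有课学 WeChat official account and reply with the code word elastic to get the purchase link for the Geek Time (极客时间) column "Elasticsearch核心技术与实战". After purchasing, submit a screenshot of the purchase to receive the cashback. You also get a set of study notes for the column, which will be distributed through the WeChat official account once the course has finished updating.

This course aims to give you the core Elasticsearch skills needed in production, so that you can quickly build distributed search and data analysis systems that fit your own business.

- From basics to depth: from basic concepts to advanced usage, then cluster management and big-data analysis, all applicable to a real production environment.
- Hands-on practice: two Elasticsearch projects walk you through building real services step by step and reinforce what you have learned.
- Certification preparation: the course covers all exam topics of the Elastic certification and helps you prepare for the exam.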

Elasticsearch: Creating an Index / Querying Data

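Changes across versions:

- Starting with ES 6.0, multiple types under one index are no longer supported: an index created in 6.0 can hold only a single type. Types caused field-type conflicts and reduced search efficiency, so they are being phased out (deprecated in 7.0 and removed entirely in 8.0). (5.x to 6.x)
- The _all field has also been dropped; use copy_to to build a custom combined field instead. (5.x to 6.x)
- type: text/keyword decides whether a field is analyzed (tokenized); index: true/false decides whether it is indexed. (2.x to 5.x)
- analyzer sets the analyzer for an individual field. (2.x to 5.x)

Creating the index

We create an index named news:

- the default analyzer is set to the ik analyzer, to handle Chinese
- the type is defined under the default name _doc
- _source storage is disabled (to verify the store option)
- title is not stored, author is not analyzed, content is stored

    PUT /news
    {
      "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "index": {
          "analysis.analyzer.default.type": "ik_smart"
        }
      },
      "mappings": {
        "_doc": {
          "_source": { "enabled": false },
          "properties": {
            "news_id": { "type": "integer", "index": true },
            "title": { "type": "text", "store": false },
            "author": { "type": "keyword" },
            "content": { "type": "text", "store": true },
            "created_at": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }
          }
        }
      }
    }

    # Inspect the structure that was created
    GET /news/_mapping

Verifying that the analyzer works

    # Verify that the analysis plugin is active
    GET /_analyze
    {
      "analyzer": "ik_smart",
      "text": "我热爱祖国"
    }

    GET /_analyze
    {
      "analyzer": "ik_max_word",
      "text": "我热爱祖国"
    }

    # The index's default analyzer
    GET /news/_analyze
    {
      "text": "我热爱祖国！"
    }

    # When a field is given, the text is analyzed according to that field's mapping
    # author is a keyword field, so it is not tokenized
    GET /news/_analyze
    {
      "field": "author",
      "text": "我热爱祖国！"
    }

    # Analysis result for title
    GET /news/_analyze
    {
      "field": "title",
      "text": "我热爱祖国！"
    }

Adding documents

These documents are for demonstration; the queries below use them as examples. Each document is indexed with its own request.

    POST /news/_doc
    {
      "news_id": 1,
      "title": "我们一起学旺叫",
      "author": "才华横溢王大猫",
      "content": "我们一起学旺叫，一起旺旺旺旺旺，在你面撒个娇，哎呦旺旺旺旺旺，我的尾巴可劲儿摇",
      "created_at": "2019-03-26 11:55:20"
    }

    POST /news/_doc
    {
      "news_id": 2,
      "title": "我们一起学猫叫",
      "author": "王大猫不会被分词",
      "content": "我们一起学猫叫，还是旺旺旺旺旺，在你面撒个娇，哎呦旺旺旺旺旺，我的尾巴可劲儿摇",
      "created_at": "2019-03-26 11:55:20"
    }

    POST /news/_doc
    {
      "news_id": 3,
      "title": "实在编不出来了",
      "author": "王大猫",
      "content": "实在编不出来了，随便写点数据做测试吧，旺旺旺",
      "created_at": "2019-03-26 11:55:20"
    }

Searching data

match_all

Returns all documents, with no search condition.

    # Unconditional paged search, sorted by news_id
    GET /news/_doc/_search
    {
      "query": { "match_all": {} },
      "from": 0,
      "size": 2,
      "sort": { "news_id": "desc" }
    }

Because we disabled the _source field, ES only builds the inverted index and does not store the original document, so the hits contain no original document content. It is disabled here mainly to demonstrate the highlight mechanism.

match

The ordinary search query. Many articles say that a match query always analyzes the query text, but that is not entirely accurate: it depends on the type of the field being searched. If the field is a keyword field (not_analyzed), match is equivalent to a term query. We can use the analyze API to see how text would be processed for a given field:

    GET /news/_analyze
    {
      "field": "title",
      "text": "我会被如何处理呢？分词？不分词？"
    }

The query:

    GET /news/_doc/_search
    {
      "query": {
        "match": { "title": "我们会被分词" }
      },
      "highlight": {
        "fields": { "title": {} }
      }
    }

With highlight, the matched keywords are returned highlighted within their surrounding context. If _source is disabled, the field's store attribute must be enabled so that its original value is stored; otherwise there is no original content left in which to highlight the keywords.

multi_match

Searches several fields at once. For example, to find documents whose title or content contains the keyword 我们:

    GET /news/_doc/_search
    {
      "query": {
        "multi_match": {
          "query": "我们是好人",
          "fields": ["title", "content"]
        }
      },
      "highlight": {
        "fields": { "title": {}, "content": {} }
      }
    }

match_phrase

This one needs careful study. match_phrase is the phrase query. Put simply, the document field must contain all the terms produced by analyzing the query text, and the distance between those terms in the document must stay within the threshold set by slop. slop is the number of positions the terms may be moved to line up with their layout in the document. If slop is large enough, a document matches even when the terms are scattered far apart.

    content: i love china
    match_phrase: i china
    slop: 0   // no match: "china" must be moved 1 position, turning "i china" into "i - china"
    slop: 1   // match

Test example

    # First see how the query text is analyzed
    GET /news/_analyze
    {
      "field": "title",
      "text": "我们学"
    }

    # response
    {
      "tokens": [
        { "token": "我们", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
        { "token": "学", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 1 }
      ]
    }

    # Then see how the title of one document is indexed
    GET /news/_analyze
    {
      "field": "title",
      "text": "我们一起学旺叫"
    }

    # response
    {
      "tokens": [
        { "token": "我们", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
        { "token": "一起", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
        { "token": "学", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 2 },
        …
      ]
    }

Note the position field. The query text is analyzed into ["我们", "学"] at positions 0 and 1, while in the document the same two terms sit at positions 0 and 2. One term therefore has to be moved by 1 position, so slop must be greater than or equal to 1 for this document to match.

Using the phrase query:

    GET /news/_doc/_search
    {
      "query": {
        "match_phrase": {
          "title": { "query": "我们学", "slop": 1 }
        }
      },
      "highlight": {
        "fields": { "title": {} }
      }
    }

Result:

    {
      …
      {
        "_index": "news",
        "_type": "_doc",
        "_id": "if-CuGkBddO9SrfVBoil",
        "_score": 0.37229446,
        "highlight": { "title": [ "我们一起学猫叫" ] }
      },
      {
        "_index": "news",
        "_type": "_doc",
        "_id": "iP-AuGkBddO9SrfVOIg3",
        "_score": 0.37229446,
        "highlight": { "title": [ "我们一起学旺叫" ] }
      }
      …
    }

term

A term query simply does not analyze the query text: the whole text is looked up in the index as a single term. Whether the field was tokenized at index time is determined by the mapping. The index may contain the two terms ["我们", "一起"] but no term ["我们一起"], in which case the query below finds nothing. A keyword field is not tokenized at index time and is indexed as one complete term, and the query text is not analyzed either, so the match is exact.

    GET /news/_doc/_search
    {
      "query": {
        "term": { "title": "我们一起" }
      },
      "highlight": {
        "fields": { "title": {} }
      }
    }

terms

terms takes several terms at once, much like tokenizing the query by hand.

    GET /news/_doc/_search
    {
      "query": {
        "terms": { "title": ["我们", "一起"] }
      },
      "highlight": {
        "fields": { "title": {} }
      }
    }

Any document matching either of the terms ["我们", "一起"] is returned.

wildcard

Shell-style wildcard query: ? matches exactly one character and * matches any number of characters (including none). The pattern is matched against the terms in the inverted index. To find documents that have a term of exactly two characters:

    GET /news/_doc/_search
    {
      "query": {
        "wildcard": { "title": "??" }
      },
      "highlight": {
        "fields": { "title": {}, "content": {} }
      }
    }

prefix

Prefix query; matches terms in the inverted index that start with the given prefix.

    GET /news/_doc/_search
    {
      "query": {
        "prefix": { "title": "我" }
      },
      "highlight": {
        "fields": { "title": {}, "content": {} }
      }
    }

regexp

Regular-expression query; matches terms in the inverted index against the pattern. To find documents that have a term of 2 to 3 characters:

    GET /news/_doc/_search
    {
      "query": {
        "regexp": { "title": ".{2,3}" }
      },
      "highlight": {
        "fields": { "title": {}, "content": {} }
      }
    }

bool

A bool query combines several queries:

- must: every clause must match
- must_not: no clause may match
- should: at least one clause should match (when must or filter is also present, should clauses are optional by default and only raise the score)

    GET /news/_doc/_search
    {
      "query": {
        "bool": {
          "must": {
            "match": { "title": "绝对要有我们" }
          },
          "must_not": {
            "term": { "title": "绝对不能有我" }
          },
          "should": [
            { "match": { "content": "我们" } },
            { "multi_match": { "query": "满足", "fields": ["title", "content"] } },
            { "match_phrase": { "title": "一个即可" } }
          ],
          "filter": {
            "range": {
              "created_at": {
                "lt": "2020-12-05 12:00:00",
                "gt": "2019-01-05 12:00:00"
              }
            }
          }
        }
      },
      "highlight": {
        "fields": { "title": {}, "content": {} }
      }
    }

filter

filter is usually combined with queries such as match, to filter the documents that satisfy the query.

    GET /news/_doc/_search
    {
      "query": {
        "bool": {
          "must": { "match_all": {} },
          "filter": {
            "range": {
              "created_at": {
                "lt": "2020-12-05 12:00:00",
                "gt": "2017-12-05 12:00:00"
              }
            }
          }
        }
      }
    }

Or it can be used on its own:

    GET /news/_doc/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "range": {
              "created_at": {
                "lt": "2020-12-05 12:00:00",
                "gt": "2017-12-05 12:00:00"
              }
            }
          }
        }
      }
    }

Multiple filter conditions: 2017-12-05 12:00:00 < created_at < 2020-12-05 12:00:00 and news_id >= 2

    GET /news/_doc/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "bool": {
              "must": [
                {
                  "range": {
                    "created_at": {
                      "lt": "2020-12-05 12:00:00",
                      "gt": "2017-12-05 12:00:00"
                    }
                  }
                },
                {
                  "range": {
                    "news_id": { "gte": 2 }
                  }
                }
              ]
            }
          }
        }
      }
    }

...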
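The slop rule for match_phrase can be checked by hand with a few lines of Python. This is a simplified model rather than the actual Lucene implementation: required_slop is a hypothetical helper, and it assumes every query term occurs exactly once in the document. For each term it compares the position in the document with the position in the query; the spread of those differences is the smallest slop at which the phrase would match.

```python
def required_slop(query_tokens, doc_tokens):
    """Smallest slop at which the phrase would match, under a simplified model.

    Assumes each query token appears exactly once in the document.
    Returns None when a query token is missing from the document.
    """
    doc_pos = {tok: i for i, tok in enumerate(doc_tokens)}
    shifts = []
    for q_pos, tok in enumerate(query_tokens):
        if tok not in doc_pos:
            return None  # a phrase query needs every term to be present
        # How far this term sits from where the query expects it.
        shifts.append(doc_pos[tok] - q_pos)
    if not shifts:
        return 0
    return max(shifts) - min(shifts)


# The English example from the text: "i china" against "i love china".
print(required_slop(["i", "china"], ["i", "love", "china"]))
# The analyzed tokens shown in the article (only the first three are listed there).
print(required_slop(["我们", "学"], ["我们", "一起", "学"]))
```

Both calls yield 1, which agrees with the article: the query only matches once slop is at least 1.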

Elasticsearch Study Notes

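Reference

6.4 (latest)
English: https://www.elastic.co/guide/…
Chinese: https://www.elastic.co/guide/…
5.4
Chinese: http://cwiki.apachecn.org/pag…

Definition

DSL (Domain Specific Language): the query language defined by Elasticsearch
ES field types: https://blog.csdn.net/chengyu…

API

Stats API: retrieves index statistics (http://cwiki.apachecn.org/pag…)

    GET es-index_/_stats
    {
      "_shards": {
        "total": 622,
        "successful": 622,
        "failed": 0
      },
      // The statistics are aggregated at index level, with primaries and total aggregations.
      // primaries covers only the primary shards; total is the sum over primary and replica shards.
      "_all": {
        "primaries": {
          "docs": {
            // Number of documents and of deleted documents (not yet merged away).
            // Note that this value is affected by refreshing the index.
            "count": 2932357017,
            "deleted": 86610
          },
          "store": {
            // Size of the index.
            "size_in_bytes": 2573317479532
          },
          "indexing": {},  // Indexing statistics; can be combined with a comma-separated list of types to get document-level statistics.
          "get": {},       // get API call statistics
          "search": {}     // search API call statistics
        },
        "total": { }
      }
    }

Search API (two forms)

Using a simple query string as a parameter:

    GET es-index/_search?q=eventid:OMGH5PageView

Using a request body:

    GET es-index/_search
    {
      "query": {
        "term": {
          "eventid": { "value": "OMGH5PageView" }
        }
      }
    }

Query DSL

Leaf Query Clause: a query clause that stands on its own
Compound Query Clause: a clause that wraps and combines other clauses

DSL query contexts

query context

In query context, the question answered is: How well does this document match this query clause? Besides deciding whether a document matches the query, ES also computes how well it matches relative to other documents and records that in _score.

filter context

In filter context, the question answered is: Does this document match this query clause?

- It only decides whether the document matches; no _score is computed.
- It is generally used to filter structured data, e.g. whether timestamp falls in the range 2017-2018, or whether status is published.
- Frequently used filters are cached automatically by Elasticsearch, which improves performance.
- When querying, you can first narrow the data with filter and then match the remaining data with query.

Filtering fields in the result

fields: selects which fields are returned
script_fields: computes values from the original data

    "fields": ["eh"],   // return only the eh field
    "script_fields": {
      "test": {
        "script": "doc['eh'].value * 2"
      }
    }   // return the value of eh multiplied by 2, named test

Query filtering: query

bool: combining filter

    {
      "bool": {
        "must": {},      // all clauses must match, like AND in SQL
        "must_not": {},  // no clause may match, like NOT in SQL
        "should": {},    // at least one clause should match, like OR in SQL
        "filter": {}     //
      }
    }

filtered filter (the filtered query was deprecated in 2.0 and removed in 5.0; on newer versions use bool with a filter clause)

    {
      "filtered": {
        "query": {},
        "filter": {}   // data is filtered in filter first, then matched in query
      }
    }

match and term

match (analyzed matching): first checks whether the field is analyzed. If it is, the query text is tokenized and the resulting tokens are matched; if not, the text is matched as a single token.
term (exact matching): matches the token directly.

terms: multi-term query

    { "terms": { "user": ["tony", "kitty"] } }

range: range filter

For range selection on date fields, Date Math can be used.

    {
      "range": {
        "born": {
          "gte": "01/01/2012",
          "lte": "2013",
          "format": "dd/MM/yyyy||yyyy"
        }
      }
    }

    {
      "range": {
        "timestamp": {
          "gte": "now-6d/d",       // Date Math
          "lte": "now/d",          // Date Math
          "time_zone": "+08:00"    // time zone
        }
      }
    }

exists: whether the document has a given field

    {
      "exists": { "field": "user" }
    }

wildcard: wildcard query (matches against the analyzed terms)

Note that this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?

wildcard queries perform poorly; avoid patterns that start with * or ? wherever possible.

prefix: prefix query

regexp: regular-expression query

Tips

Special handling for values containing "-"

A value containing "-" is split into tokens by default, which makes search results inaccurate. One fix is to append .raw to the field name (this assumes the mapping defines a non-analyzed raw sub-field):

    term: {status: 'pre-active'} => term: {status.raw: 'pre-active'}

sort

    GET es-index_*/_search
    {
      "fields": ["eventid", "logtime"],
      "query": {
        "term": {
          "eventid": { "value": "OMGH5PageView" }
        }
      },
      "sort": [
        { "logtime": { "order": "asc" } }
      ]
    }

Aggregation

date_histogram (like histogram) by default returns only buckets whose document count is non-zero. To return buckets that contain no documents as well, set two extra parameters:

    "min_doc_count": 0,     // forces empty buckets to be returned
    "extended_bounds": {    // forces the whole year to be returned
      "min": "2014-01-01",
      "max": "2014-12-31"
    }

Fields of the query response

took: time the query took (in milliseconds)
timed_out: whether the query timed out
_shards: information about the shards queried, including how many shards were searched and how many succeeded or failed
hits: the search results
    total: number of documents that satisfy the query
    max_score: the highest _score among the matching documents
    hits: the matching documents
    _score: how well a document matches

...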
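To make the split between query context and filter context concrete, here is a small sketch that builds a request body as a Python dict. The field names eventid and timestamp and the Date Math values come from the notes above; the helper build_search_body and the message field are assumptions introduced only for illustration. The match clause goes under must and is scored, while the term and range clauses go under filter and only include or exclude documents.

```python
import json


def build_search_body(text, eventid, days=6):
    """Build a bool query that mixes query context and filter context.

    Hypothetical helper: "message" is an assumed full-text field, while
    "eventid" and "timestamp" are the fields used in the notes.
    """
    return {
        "query": {
            "bool": {
                # Query context: contributes to _score.
                "must": [{"match": {"message": text}}],
                # Filter context: yes/no matching only, no scoring.
                "filter": [
                    {"term": {"eventid": eventid}},
                    {
                        "range": {
                            "timestamp": {
                                "gte": f"now-{days}d/d",  # Date Math
                                "lte": "now/d",           # Date Math
                                "time_zone": "+08:00",
                            }
                        }
                    },
                ],
            }
        }
    }


body = build_search_body("page view", "OMGH5PageView")
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed JSON can be pasted as the body of a GET es-index/_search request. Building it as a dict keeps the structure checkable before anything is sent to a cluster.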