About Elasticsearch: How Should You Learn Elasticsearch Aggregations

Hi everyone, I'm Kaka (咔咔). Don't expect quick success; advance one step a day.

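Elasticsearch is built for search, but it also provides aggregations for analyzing data in real time. An aggregation runs the data through a series of calculations and returns the results we actually want.

Although aggregations serve a completely different purpose from search, they use exactly the same data structures, which is why they execute quickly. It also means that a single request can search, filter, and analyze the same data at once.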

Elasticsearch aggregations fall into four categories:

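Bucket Aggregation: bucketing; a set of documents that satisfy a given condition
Metric Aggregation: metric analysis; mathematical operations on the data, such as finding the maximum or minimum value
Pipeline Aggregation: pipeline analysis; a second round of aggregation over results that have already been aggregated
Matrix Aggregation: matrix analysis; operates on multiple fields and produces a result matrix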
Let's start simple and look at the Bucket and Metric types. Bucket gives you what the GROUP BY keyword does in MySQL, and Metric gives you what the MAX and MIN functions do in MySQL.
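To make the GROUP BY and MAX/MIN comparison concrete, here is a minimal plain-Python sketch of the two ideas. It does not talk to Elasticsearch at all; the three flight records are made up for illustration, and only the field names are borrowed from the sample index used later in this article.

```python
from collections import defaultdict

# Hypothetical flight records; only the field names come from the sample index.
flights = [
    {"DestCountry": "IT", "AvgTicketPrice": 250.0},
    {"DestCountry": "US", "AvgTicketPrice": 720.5},
    {"DestCountry": "IT", "AvgTicketPrice": 410.0},
]

# Bucket: group documents that share a key (like GROUP BY DestCountry).
buckets = defaultdict(list)
for doc in flights:
    buckets[doc["DestCountry"]].append(doc)

# Metric: compute a number over the documents of each bucket (like MAX / MIN).
summary = {
    country: {
        "doc_count": len(docs),
        "max_price": max(d["AvgTicketPrice"] for d in docs),
        "min_price": min(d["AvgTicketPrice"] for d in docs),
    }
    for country, docs in buckets.items()
}

print(summary)
```

The shape of `summary` is loosely modelled on an aggregation response: one entry per bucket, each with a document count and the metrics computed inside it.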

1. Bucket Aggregation
Introduction

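The figure above shows the data split into three buckets: the first counts heights below 300, the second counts heights above 600, and the third counts heights between 300 and 600. In this example, each record lands in a bucket according to its height.

The same mechanism can group data by age, location, gender, salary range, order growth, job distribution, and so on. Any data that shares some common trait can be grouped with an aggregation.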

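Common bucketing strategies

terms: buckets by term; on a text field, the buckets are built from the analyzed tokens
range: buckets by numeric ranges that you specify
date range: buckets by date ranges that you specify
histogram: buckets at a fixed interval
date histogram: a histogram (bar chart) over dates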
Terms
Bucket by destination country

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "destcountry_term": {
      "terms": {
        "field": "DestCountry"
      }
    }
  },
  "profile": "true"
}
The response shows the flights grouped by destination country. Note also that if you do not set size yourself, the terms aggregation returns only the top 10 buckets by default.

"aggregations" : {
  "destcountry_term" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 3187,
    "buckets" : [
      { "key" : "IT", "doc_count" : 2371 },
      { "key" : "US", "doc_count" : 1987 },
      { "key" : "CN", "doc_count" : 1096 },
      { "key" : "CA", "doc_count" : 944 },
      { "key" : "JP", "doc_count" : 774 },
      { "key" : "RU", "doc_count" : 739 },
      { "key" : "CH", "doc_count" : 691 },
      { "key" : "GB", "doc_count" : 449 },
      { "key" : "AU", "doc_count" : 416 },
      { "key" : "PL", "doc_count" : 405 }
    ]
  }
}
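If ten buckets are not enough, the terms aggregation accepts its own size parameter. The sketch below builds such a request body in Python and then does a quick sanity check on the response above. The helper name terms_request and the value size=20 are my own choices for illustration; the counts are copied from the response.

```python
import json

def terms_request(field, size=10):
    """Build a search body with a single terms aggregation (hypothetical helper)."""
    return {
        "size": 0,  # return no hits, only the aggregation results
        "aggs": {
            "destcountry_term": {
                "terms": {"field": field, "size": size}  # size = number of buckets
            }
        },
    }

body = terms_request("DestCountry", size=20)
print(json.dumps(body, indent=2))

# doc_count values copied from the ten buckets in the response above
top_counts = [2371, 1987, 1096, 944, 774, 739, 691, 449, 416, 405]
# sum_other_doc_count: documents that fell into buckets that were not returned
sum_other_doc_count = 3187

total = sum(top_counts) + sum_other_doc_count
print(total)
```

The total works out to the same number of documents that the three range buckets in the next section add up to, which is a handy way to confirm that nothing was lost, only left out of the top ten.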
Range
Suppose we want to count flights whose average ticket price is below 300, between 300 and 600, and above 600

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "avgticketprice_range": {
      "range": {
        "field": "AvgTicketPrice",
        "ranges": [
          { "to": 300 },
          { "from": 300, "to": 600 },
          { "from": 600 }
        ]
      }
    }
  }
}
The response is shown below. Each of the three buckets is given a key derived from its range.

"aggregations" : {
  "avgticketprice_range" : {
    "buckets" : [
      { "key" : "*-300.0", "to" : 300.0, "doc_count" : 1816 },
      { "key" : "300.0-600.0", "from" : 300.0, "to" : 600.0, "doc_count" : 4115 },
      { "key" : "600.0-*", "from" : 600.0, "doc_count" : 7128 }
    ]
  }
}
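One detail that is easy to miss: in a range aggregation each range includes its from value and excludes its to value, so a price of exactly 300 belongs to the 300 to 600 bucket. Here is a small Python sketch of that rule, not Elasticsearch's actual implementation. The function name range_key is made up, and for simplicity it returns only the first matching range, whereas real ranges may overlap and count a document more than once.

```python
def range_key(value, ranges):
    """Return the default-style key of the first range containing value, else None."""
    for r in ranges:
        lo = r.get("from")
        hi = r.get("to")
        # from is inclusive, to is exclusive; a missing bound means "unbounded"
        if (lo is None or value >= lo) and (hi is None or value < hi):
            lo_text = "*" if lo is None else str(float(lo))
            hi_text = "*" if hi is None else str(float(hi))
            return f"{lo_text}-{hi_text}"
    return None

ranges = [{"to": 300}, {"from": 300, "to": 600}, {"from": 600}]

for price in (299.99, 300, 599.99, 600):
    print(price, "->", range_key(price, ranges))
```

Because the upper bound of one range is the lower bound of the next, the three ranges in this article cover every price exactly once.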
Setting keyed to true returns the buckets as an object in which each range is addressed by its own name, instead of as an array

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "avgticketprice_range": {
      "range": {
        "field": "AvgTicketPrice",
        "keyed": "true",
        "ranges": [
          { "to": 300 },
          { "from": 300, "to": 600 },
          { "from": 600 }
        ]
      }
    }
  }
}
Compare this carefully with the previous response to see the difference

"aggregations" : {
  "avgticketprice_range" : {
    "buckets" : {
      "*-300.0" : { "to" : 300.0, "doc_count" : 1816 },
      "300.0-600.0" : { "from" : 300.0, "to" : 600.0, "doc_count" : 4115 },
      "600.0-*" : { "from" : 600.0, "doc_count" : 7128 }
    }
  }
}
You can of course name each range yourself

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "avgticketprice_range": {
      "range": {
        "field": "AvgTicketPrice",
        "keyed": "true",
        "ranges": [
          { "key": "below 300", "to": 300 },
          { "key": "300 to 600", "from": 300, "to": 600 },
          { "key": "above 600", "from": 600 }
        ]
      }
    }
  }
}
Response

"aggregations" : {
  "avgticketprice_range" : {
    "buckets" : {
      "below 300" : { "to" : 300.0, "doc_count" : 1816 },
      "300 to 600" : { "from" : 300.0, "to" : 600.0, "doc_count" : 4115 },
      "above 600" : { "from" : 600.0, "doc_count" : 7128 }
    }
  }
}
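Since keyed changes the shape of the response (an array of buckets versus an object of buckets), client code has to handle both. Below is a minimal Python sketch of one way to normalize them. The helper name bucket_counts is my own, and the two sample payloads are copied from the default-key responses shown earlier.

```python
def bucket_counts(agg):
    """Map bucket key -> doc_count for both keyed and non-keyed responses."""
    buckets = agg["buckets"]
    if isinstance(buckets, dict):
        # keyed response: the key is the property name
        return {key: b["doc_count"] for key, b in buckets.items()}
    # default response: the key is a field inside each array element
    return {b["key"]: b["doc_count"] for b in buckets}

array_style = {"buckets": [
    {"key": "*-300.0", "to": 300.0, "doc_count": 1816},
    {"key": "300.0-600.0", "from": 300.0, "to": 600.0, "doc_count": 4115},
    {"key": "600.0-*", "from": 600.0, "doc_count": 7128},
]}

keyed_style = {"buckets": {
    "*-300.0": {"to": 300.0, "doc_count": 1816},
    "300.0-600.0": {"from": 300.0, "to": 600.0, "doc_count": 4115},
    "600.0-*": {"from": 600.0, "doc_count": 7128},
}}

print(bucket_counts(array_style))
print(bucket_counts(keyed_style))
```

Both calls produce the same dictionary, which is really the whole point of keyed: the data is identical, only the lookup style differs.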
Date Range
Buckets are defined by date ranges that you specify, for example splitting the timestamp field into chosen time periods.

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "data_range_timestamp": {
      "date_range": {
        "field": "timestamp",
        "format": "yyyy-MM",
        "ranges": [
          { "from": "2022-01", "to": "2022-02" },
          { "from": "2022-02", "to": "2022-03" }
        ]
      }
    }
  }
}
Here is the response. As an exercise, think about how you would give these buckets fixed key values. Also pay attention to the date format, for example yyyy-MM-dd HH:mm:ss: the from and to values have to be written in the format you declare.

"aggregations" : {
  "data_range_timestamp" : {
    "buckets" : [
      {
        "key" : "2022-01-2022-02",
        "from" : 1.6409952E12,
        "from_as_string" : "2022-01",
        "to" : 1.6436736E12,
        "to_as_string" : "2022-02",
        "doc_count" : 9580
      },
      {
        "key" : "2022-02-2022-03",
        "from" : 1.6436736E12,
        "from_as_string" : "2022-02",
        "to" : 1.6460928E12,
        "to_as_string" : "2022-03",
        "doc_count" : 1837
      }
    ]
  }
}
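One possible answer to the question above, sketched in Python: as far as I know, date_range accepts a key on each range and a keyed flag just like range does. The key names jan_2022 and feb_2022 are made up for illustration. The second half of the sketch decodes the from and to numbers in the response, which are epoch milliseconds in UTC.

```python
import json
from datetime import datetime, timezone

# Request body with a custom key per range (key names are hypothetical).
body = {
    "size": 0,
    "aggs": {
        "data_range_timestamp": {
            "date_range": {
                "field": "timestamp",
                "format": "yyyy-MM",
                "keyed": True,
                "ranges": [
                    {"key": "jan_2022", "from": "2022-01", "to": "2022-02"},
                    {"key": "feb_2022", "from": "2022-02", "to": "2022-03"},
                ],
            }
        }
    },
}
print(json.dumps(body, indent=2))

def millis_to_date(ms):
    """Convert epoch milliseconds (as returned in from/to) to a UTC date string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")

# Values copied from the response above
for ms in (1.6409952E12, 1.6436736E12, 1.6460928E12):
    print(ms, "->", millis_to_date(ms))
```

Decoding the numbers confirms that each range starts on the first day of the month, which is what the yyyy-MM format implies.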
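Histogram
A histogram splits the data at a fixed interval, for example bucketing the AvgTicketPrice field at an interval of 50

interval: the width of each bucket, 50 here
min_doc_count: the minimum number of documents a bucket must contain to be returned; 0 here, so empty buckets are included
extended_bounds: only meaningful when min_doc_count is 0
When you try this out you will find that extended_bounds does not filter buckets. If extended_bounds.min is higher than the values extracted from the documents, the documents still dictate what the first bucket will be (the same goes for extended_bounds.max and the last bucket). To filter buckets, nest the histogram aggregation inside a range filter aggregation with appropriate from/to settings.

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "price_histogram": {
      "histogram": {
        "field": "AvgTicketPrice",
        "interval": 50,
        "min_doc_count": "0",
        "extended_bounds": {
          "min": 0,
          "max": 600
        }
      }
    }
  }
}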
Response

"aggregations" : {
  "price_histogram" : {
    "buckets" : [
      { "key" : 0.0, "doc_count" : 0 },
      { "key" : 50.0, "doc_count" : 0 },
      { "key" : 100.0, "doc_count" : 380 },
      { "key" : 150.0, "doc_count" : 369 },
      { "key" : 200.0, "doc_count" : 398 }
    ]
  }
}
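To see why the buckets at 0 and 50 appear with a count of 0, and why extended_bounds extends rather than filters, here is a rough Python emulation. It assumes the documented rounding rule, key = floor((value - offset) / interval) * interval + offset, and a min_doc_count of 0. The prices are invented, and the function names are my own, so treat it as a sketch of the behaviour rather than how Elasticsearch implements it.

```python
import math
from collections import Counter

def bucket_key(value, interval, offset=0.0):
    """Round a value down to the start of its bucket."""
    return math.floor((value - offset) / interval) * interval + offset

def histogram(values, interval, min_bound, max_bound):
    """Emulate histogram with min_doc_count = 0 and extended_bounds."""
    counts = Counter(bucket_key(v, interval) for v in values)
    # extended_bounds can only widen the range; the data still decides
    # the first and last bucket if it reaches further than the bounds.
    keys = list(counts) + [bucket_key(min_bound, interval),
                           bucket_key(max_bound, interval)]
    first, last = min(keys), max(keys)
    result = []
    key = first
    while key <= last:
        result.append((key, counts.get(key, 0)))  # empty buckets get 0
        key += interval
    return result

prices = [120.0, 130.5, 260.0, 700.0]  # made-up ticket prices
buckets = histogram(prices, interval=50, min_bound=0, max_bound=600)
for key, doc_count in buckets:
    print(key, doc_count)
```

Notice that the price of 700 still gets its own bucket even though the upper bound is 600. That is the "does not filter" behaviour described above; to actually drop it you would have to filter the documents first.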
Date Histogram
A histogram (bar chart) over dates, an aggregation type that is used a lot when analyzing time-series data; for example, bucketing the timestamp field by month

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "timestamp_data_histogram": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "month",
        "min_doc_count": 0,
        "format": "yyyy-MM-dd",
        "extended_bounds": {
          "min": "2021-10-10",
          "max": "2022-01-19"
        }
      }
    }
  }
}
Response

"aggregations" : {
  "timestamp_data_histogram" : {
    "buckets" : [
      { "key_as_string" : "2021-10-01", "key" : 1633046400000, "doc_count" : 0 },
      { "key_as_string" : "2021-11-01", "key" : 1635724800000, "doc_count" : 0 },
      { "key_as_string" : "2021-12-01", "key" : 1638316800000, "doc_count" : 1642 },
      { "key_as_string" : "2022-01-01", "key" : 1640995200000, "doc_count" : 9580 },
      { "key_as_string" : "2022-02-01", "key" : 1643673600000, "doc_count" : 1837 }
    ]
  }
}
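The keys in this response are the first instant of each month in epoch milliseconds, which is also why the lower bound of 2021-10-10 shows up as a 2021-10-01 bucket. A small Python sketch of that rounding, assuming UTC (the default when no time_zone is set, as far as I know); the function name month_bucket_key is made up.

```python
from datetime import datetime, timezone

def month_bucket_key(ts):
    """Return the epoch-millisecond key of the calendar month containing ts.

    ts must be a timezone-aware datetime; it is converted to UTC first.
    """
    start = ts.astimezone(timezone.utc).replace(
        day=1, hour=0, minute=0, second=0, microsecond=0
    )
    return int(start.timestamp() * 1000)

samples = [
    datetime(2021, 10, 10, tzinfo=timezone.utc),          # the extended_bounds min
    datetime(2021, 12, 31, 23, 59, tzinfo=timezone.utc),  # last minute of December
    datetime(2022, 1, 19, 13, 30, tzinfo=timezone.utc),   # the extended_bounds max
]
for ts in samples:
    print(ts.isoformat(), "->", month_bucket_key(ts))
```

The three results match the key values of the 2021-10-01, 2021-12-01, and 2022-01-01 buckets in the response.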
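2. Nested Aggregations
The sections above covered five ways to bucket data. In real development a single standalone aggregation is rare; most of the time aggregations are nested.

Here we first bucket by ticket price, then compute the count, minimum, maximum, average, and sum for the documents in each bucket

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "price_range": {
      "range": {
        "field": "AvgTicketPrice",
        "ranges": [
          { "to": 300 },
          { "from": 300, "to": 600 },
          { "from": 600 }
        ]
      },
      "aggs": {
        "price_status": {
          "stats": {
            "field": "AvgTicketPrice"
          }
        }
      }
    }
  }
}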
Response (truncated for display)

"aggregations" : {
  "price_range" : {
    "buckets" : [
      {
        "key" : "*-300.0",
        "to" : 300.0,
        "doc_count" : 1816,
        "price_status" : {
          "count" : 1816,
          "min" : 100.0205307006836,
          "max" : 299.9529113769531,
          "avg" : 212.5348257619379,
          "sum" : 385963.2435836792
        }
      }
    ]
  }
}
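Conceptually, a sub-aggregation just runs once per parent bucket, over only the documents in that bucket. The Python sketch below mimics the range plus stats combination on a handful of invented prices. The function names are my own, and the values returned for an empty bucket are my assumption about what stats reports in that case.

```python
def in_range(value, r):
    """from is inclusive, to is exclusive; a missing bound is unbounded."""
    lo, hi = r.get("from"), r.get("to")
    return (lo is None or value >= lo) and (hi is None or value < hi)

def stats(values):
    """Count, min, max, avg, and sum, shaped like a stats result."""
    if not values:
        # Assumed shape for an empty bucket
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0.0}
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

def range_with_stats(values, ranges):
    """Outer range bucketing, inner stats computed per bucket."""
    result = []
    for r in ranges:
        members = [v for v in values if in_range(v, r)]
        result.append({
            "range": r,
            "doc_count": len(members),
            "price_status": stats(members),
        })
    return result

prices = [100.0, 200.0, 450.0, 900.0]  # made-up ticket prices
ranges = [{"to": 300}, {"from": 300, "to": 600}, {"from": 600}]
result = range_with_stats(prices, ranges)
for bucket in result:
    print(bucket)
```

As in the real response, each bucket's count inside price_status equals its doc_count, because the stats are computed over exactly the documents that landed in that bucket.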
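There is a lot more waiting to be explored, but get the basics down first. Don't expect quick success; advance one step a day.

Keep learning, keep writing, keep sharing: that is the belief I have held throughout my career. I hope this article brings you a little help somewhere on the vast internet. I'm Kaka, see you next time.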