The code analysis in this article is based on Prometheus 2.19.2.
Basic concepts
What is a TSDB
Time Series DBMS are designed to efficiently collect, store and query various time series with high transaction volumes
Prometheus basic concepts
sample: data point

type sample struct {
    t int64
    v float64
}

- A sample represents one data point.
- Size: 16 bytes, i.e. one 8-byte int64 timestamp plus one 8-byte float64 value.
Label: tag

type Label struct {
    Name, Value string
}

- A single label pair, e.g. job="ec2".
Labels: tag set

type Labels []Label

- The complete set of label pairs of one metric.
Vector

type Vector []Sample

- A Vector is a slice of samples that all share the same timestamp; it is typically the result of an instant query.
Scalar

type Scalar struct {
    T int64
    V float64
}

- A Scalar is a plain number, the result of querying a single value directly.
Series

type Series struct {
    Metric labels.Labels `json:"metric"`
    Points []Point       `json:"values"`
}

- A Series is the data stream of one metric.
Matrix

type Matrix []Series

- A Matrix is a slice of Series; it is the usual result of a range query.
Sample compression

- Following Facebook's Gorilla paper, Prometheus compresses timestamps with delta-of-delta (dod) encoding and values with XOR encoding. Within a two-hour block this shrinks a sample from 16 bytes to about 1.37 bytes, a compression ratio of roughly 11.6x.
dod timestamp compression

      Delta1        Delta2        Delta3
T1 --------> T2 --------> T3 --------> T4

- Principle: in a time-series workload the scrape interval is essentially stable, so consecutive timestamp deltas are nearly identical. Variable-length encoding of the delta-of-deltas replaces the original 8-byte (64-bit) timestamps with just a few bits each.
- As shown above, take the timestamps March 24, 2015 02:01:02, 02:02:02 and 02:03:02 in a block starting at 02:00:00:

  value:      March 24, 2015 02:00:00 | 62 | '10': -2 | '0'
  bit length: 64                      | 14 | 9        | 1

  Compression ratio: (64+14+9+1)/(64*4) = 0.34375
- In the ideal case of a perfectly steady scrape with no missed points, every later timestamp can be encoded as a single '0' bit, and the ratio improves dramatically.
- Using dod rather than the raw delta is, in my view, a second round of compression: the dod values mostly fall into the first few (smallest) buckets.
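The bit-cost arithmetic above can be sketched in Go. This is a simplified cost model only: bitsForDoD is a hypothetical helper whose bucket boundaries follow the Gorilla paper, not Prometheus' exact encoder.

```go
package main

import "fmt"

// bitsForDoD returns the Gorilla-style bit cost of one delta-of-delta:
// '0' for 0, '10'+7 bits, '110'+9 bits, '1110'+12 bits, else '1111'+32 bits.
func bitsForDoD(dod int64) int {
	switch {
	case dod == 0:
		return 1
	case dod >= -63 && dod <= 64:
		return 2 + 7
	case dod >= -255 && dod <= 256:
		return 3 + 9
	case dod >= -2047 && dod <= 2048:
		return 4 + 12
	default:
		return 4 + 32
	}
}

func main() {
	// Seconds since the 02:00:00 block start for 02:01:02, 02:02:02, 02:03:02.
	ts := []int64{62, 122, 182}
	total := 64 // block start timestamp stored raw in 64 bits
	prev, prevDelta := int64(0), int64(0)
	for i, t := range ts {
		delta := t - prev
		if i == 0 {
			total += 14 // first delta stored in 14 bits
		} else {
			total += bitsForDoD(delta - prevDelta)
		}
		prev, prevDelta = t, delta
	}
	fmt.Println(total) // prints 88, i.e. the 64+14+9+1 from the example
}
```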
XOR value compression

- Principle: adjacent points in a series change little, so XOR-ing consecutive float64 values yields long runs of leading and trailing zero bits, and only the short meaningful middle block needs to be stored.
- In my view the effectiveness of XOR compression depends on how volatile the series is: the more violently it fluctuates, the worse the compression; the smoother it is, the better.
Storage process

- Prometheus keeps the most recent, hot data in memory, in what is called the head block.
In-memory series structure: memSeries
// memSeries is the in-memory representation of a series. None of its methods
// are goroutine safe and it is the caller's responsibility to lock it.
type memSeries struct {
    sync.RWMutex

    ref           uint64
    lset          labels.Labels
    mmappedChunks []*mmappedChunk
    headChunk     *memChunk
    chunkRange    int64
    firstChunkID  int

    nextAt        int64 // Timestamp at which to cut the next chunk.
    sampleBuf     [4]sample
    pendingCommit bool // Whether there are samples waiting to be committed to this series.

    app chunkenc.Appender // Current appender for the chunk.

    memChunkPool *sync.Pool

    txs *txRing
}
The three most important fields:

- ref is effectively this series' UID. The getOrCreate function explains the choice explicitly: a monotonically increasing positive integer is used instead of the hash, because hashes are random and compress poorly in the postings index.
func (h *Head) getOrCreate(hash uint64, lset labels.Labels) (*memSeries, bool, error) {
    // Just using `getOrSet` below would be semantically sufficient, but we'd create
    // a new series on every sample inserted via Add(), which causes allocations
    // and makes our series IDs rather random and harder to compress in postings.
    s := h.series.getByHash(hash, lset)
    if s != nil {
        return s, false, nil
    }

    // Optimistically assume that we are the first one to create the series.
    id := atomic.AddUint64(&h.lastSeriesID, 1)

    return h.getOrCreateWithID(id, hash, lset)
}
- lset: the label set of this series.
- headChunk: a memChunk holding this series' data for the current time window.

type memChunk struct {
    chunk            chunkenc.Chunk
    minTime, maxTime int64
}
Hash tables

- Since a memSeries represents one series, how do we find the matching memSeries from a set of labels as fast as possible? A hash table is clearly the best option: compute a hash over the labels and maintain a map from that hash to the memSeries, falling back to a full label comparison when hashes collide. Prometheus therefore maintains the two in-memory hash maps shown below, so that a memSeries can be located quickly by either its ref or its labels.
{
    series map[uint64]*memSeries   // map from ref to memSeries
    hashes map[uint64][]*memSeries // map from hash of the labels to memSeries
}
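A minimal illustration of the dual-map lookup. All names here are hypothetical stand-ins: the real code hashes labels with xxhash and lives in the striped structure shown next, while this sketch keys the collision map by the label string itself.

```go
package main

import "fmt"

// memSeries and seriesIndex are simplified stand-ins for the real structures;
// labels are keyed by their string form here instead of an xxhash value.
type memSeries struct {
	ref  uint64
	lset string
}

type seriesIndex struct {
	series map[uint64]*memSeries   // ref            -> memSeries
	hashes map[string][]*memSeries // "hash"(labels) -> collision list
}

// getByLabels walks the collision bucket and confirms the full label match,
// just as Prometheus falls back to comparing labels on a hash collision.
func (ix *seriesIndex) getByLabels(key string) *memSeries {
	for _, s := range ix.hashes[key] {
		if s.lset == key {
			return s
		}
	}
	return nil
}

func main() {
	s := &memSeries{ref: 1, lset: `{job="ec2"}`}
	ix := &seriesIndex{
		series: map[uint64]*memSeries{s.ref: s},
		hashes: map[string][]*memSeries{s.lset: {s}},
	}
	// Both access paths resolve to the same memSeries.
	fmt.Println(ix.getByLabels(`{job="ec2"}`) == ix.series[1]) // prints true
}
```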
Lock granularity

- Go's map is not safe for concurrent use, and Prometheus performs a huge number of memSeries insertions and deletions; guarding the structures above with a single big lock would clearly fall short of the performance requirements.
- Prometheus' solution is to split the lock (lock striping).
const (
    // DefaultStripeSize is the default number of entries to allocate in the stripeSeries hash map.
    DefaultStripeSize = 1 << 14
)

// stripeSeries locks modulo ranges of IDs and hashes to reduce lock contention.
// The locks are padded to not be on the same cache line. Filling the padded space
// with the maps was profiled to be slower – likely due to the additional pointer
// dereferences.
type stripeSeries struct {
    size   int
    series []map[uint64]*memSeries
    hashes []seriesHashmap
    locks  []stripeLock

    seriesLifecycleCallback SeriesLifecycleCallback
}

type stripeLock struct {
    sync.RWMutex
    // Padding to avoid multiple locks being on the same cache line.
    _ [40]byte
}
When the head is initialized, 16384 small hash maps are created. To find a memSeries by ref, take ref modulo 16384 to locate series[x], and lock only locks[x]. This sharply reduces the cost of lock contention on memSeries reads and writes and improves throughput.
func (s *stripeSeries) getByHash(hash uint64, lset labels.Labels) *memSeries {
    i := hash & uint64(s.size-1)

    s.locks[i].RLock()
    series := s.hashes[i].get(hash, lset)
    s.locks[i].RUnlock()

    return series
}
Note that the modulo here is taken with & rather than %. The bitwise AND is a single cheap instruction, while % compiles down to an integer division, which is far more expensive; the substitution is valid only because the divisor is a power of two:

a % b == a & (b - 1)   // holds when b = 2^n
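A quick Go check of this identity. stripeIndex is a hypothetical helper mirroring the `hash & uint64(s.size-1)` line above.

```go
package main

import "fmt"

// stripeIndex picks the shard/lock for a ref. Because the stripe count is a
// power of two, masking with stripeSize-1 equals ref % stripeSize but avoids
// an integer division instruction.
func stripeIndex(ref uint64) uint64 {
	const stripeSize = 1 << 14 // matches DefaultStripeSize
	return ref & (stripeSize - 1)
}

func main() {
	for _, ref := range []uint64{1, 12345, 99999999} {
		fmt.Println(stripeIndex(ref) == ref%(1<<14)) // prints true for each ref
	}
}
```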
Inverted index

Above we saw how to find a memSeries from a metric's complete label set, but the more common case is fuzzy matching on labels.

Fuzzy matching

For example, node_exporter's CPU metric node_cpu_seconds_total carries a cpu label (cpu="1" means the second core) as well as 7 different mode values; Prometheus even supports regular expressions on label values: node_cpu_seconds_total{mode=~"user|system"}

Queries this complex make an inverted index necessary.
Data structure
// MemPostings holds postings list for series ID per label pair. They may be written
// to out of order.
// ensureOrder() must be called once before any reads are done. This allows for quick
// unordered batch fills on startup.
type MemPostings struct {
    mtx     sync.RWMutex
    m       map[string]map[string][]uint64
    ordered bool
}
Indexing

Suppose Prometheus scrapes a new series node_cpu_seconds_total{mode="user",cpu="0",instance="1.1.1.1:9100"} whose ref is x.

After initializing the memSeries and updating the hash tables, the inverted index must be updated as well:
func (h *Head) getOrCreateWithID(id, hash uint64, lset labels.Labels) (*memSeries, bool, error) {
    s := newMemSeries(lset, id, h.chunkRange, &h.memChunkPool)

    s, created, err := h.series.getOrSet(hash, s)
    if err != nil {
        return nil, false, err
    }
    if !created {
        return s, false, nil
    }

    h.metrics.seriesCreated.Inc()
    atomic.AddUint64(&h.numSeries, 1)

    h.postings.Add(id, lset)
    ......
}
- The core function is addFor. After the series is added, the postings contain:
MemPostings.m["__name__"]["node_cpu_seconds_total"] = {.., x, ..}
MemPostings.m["mode"]["user"]                       = {.., x, ..}
MemPostings.m["cpu"]["0"]                           = {.., x, ..}
MemPostings.m["instance"]["1.1.1.1:9100"]           = {.., x, ..}
func (p *MemPostings) addFor(id uint64, l labels.Label) {
    nm, ok := p.m[l.Name]
    if !ok {
        nm = map[string][]uint64{}
        p.m[l.Name] = nm
    }
    list := append(nm[l.Value], id)
    nm[l.Value] = list

    if !p.ordered {
        return
    }
    // There is no guarantee that no higher ID was inserted before as they may
    // be generated independently before adding them to postings.
    // We repair order violations on insert. The invariant is that the first n-1
    // items in the list are already sorted.
    for i := len(list) - 1; i >= 1; i-- {
        if list[i] >= list[i-1] {
            break
        }
        list[i], list[i-1] = list[i-1], list[i]
    }
}
- Matching then works as follows. Suppose:

MemPostings.m["__name__"]["node_cpu_seconds_total"] = {1,2,3,5,7,8}
MemPostings.m["mode"]["user"]                       = {10,2,3,4,6,8}

Evaluating node_cpu_seconds_total{mode="user"} means taking the intersection:

{1,2,3,5,7,8} & {10,2,3,4,6,8} = {2,3,8}
But if every label pair matches a large number of series, intersecting several postings lists becomes very expensive. Can this be optimized further? Yes: it suffices to keep each label pair's postings list sorted, which lets the lists be intersected in a single linear merge pass instead of comparing entries pairwise.
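The linear merge over sorted postings can be sketched as follows (illustrative only; Prometheus' own implementation lives in the tsdb/index package):

```go
package main

import "fmt"

// intersect merges two ascending postings lists in O(len(a)+len(b)) time,
// which is why each postings list is kept sorted.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default: // present in both lists
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	// Sorted versions of the example lists from the text.
	name := []uint64{1, 2, 3, 5, 7, 8}  // __name__="node_cpu_seconds_total"
	mode := []uint64{2, 3, 4, 6, 8, 10} // mode="user"
	fmt.Println(intersect(name, mode))  // prints [2 3 8]
}
```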
Persistent storage

Below is the layout of a Prometheus data directory:
drwxr-xr-x 3 root root   4096 Jul 14 23:01 01ED6XF97X5B5RKJMB6E6CRG85
drwxr-xr-x 3 root root   4096 Jul 15 05:00 01ED7J1R0E4BZ398TE48CC0WJQ
drwxr-xr-x 3 root root   4096 Jul 15 11:00 01ED86M8WZ4J66F4M842WDWC69
drwxr-xr-x 3 root root   4096 Jul 15 11:00 01ED86N02VPNDEPS694CEZY2DP
drwxr-xr-x 2 root root   4096 Jul 15 11:00 chunks_head
-rw-r--r-- 1 root root      0 May 20 18:29 lock
-rw-r--r-- 1 root root 100001 Jul 15 11:30 queries.active
drwxr-xr-x 3 root root   4096 Jul 15 11:30 wal
wal

Before actually writing scraped data into memory, Prometheus first appends it to the WAL (Write-Ahead Log). Because the WAL lives on disk, it is effectively a complete backup of the in-memory monitoring data; even if Prometheus crashes, this data is not lost. After a restart, Prometheus first replays the WAL into memory, fully restoring the pre-crash state, and only then resumes scraping new data.
# wal directory
-rw-r--r-- 1 root root 128M Jul 15 13:59 00005153
-rw-r--r-- 1 root root 128M Jul 15 14:10 00005154
-rw-r--r-- 1 root root 128M Jul 15 14:20 00005155
-rw-r--r-- 1 root root 128M Jul 15 14:31 00005156
-rw-r--r-- 1 root root 128M Jul 15 14:41 00005157
-rw-r--r-- 1 root root 128M Jul 15 14:52 00005158
-rw-r--r-- 1 root root  96M Jul 15 15:00 00005159
-rw-r--r-- 1 root root 128M Jul 15 15:10 00005160
-rw-r--r-- 1 root root 128M Jul 15 15:21 00005161
-rw-r--r-- 1 root root 128M Jul 15 15:31 00005162
-rw-r--r-- 1 root root 128M Jul 15 15:42 00005163
-rw-r--r-- 1 root root 128M Jul 15 15:52 00005164
-rw-r--r-- 1 root root 128M Jul 15 16:03 00005165
-rw-r--r-- 1 root root  41M Jul 15 16:06 00005166
drwxr-xr-x 2 root root 4.0K Jul 15 15:00 checkpoint.00005152
- headAppender implements a transaction-style Commit. Note that a.log() is executed first, and on error the append is rolled back without touching memory; that a.log() call is the write to the WAL:
func (a *headAppender) Commit() error {
    if err := a.log(); err != nil {
        //nolint: errcheck
        a.Rollback() // Most likely the same error will happen again.
        return errors.Wrap(err, "write to WAL")
    }

    defer a.head.metrics.activeAppenders.Dec()
    defer a.head.putAppendBuffer(a.samples)
    defer a.head.putSeriesBuffer(a.sampleSeries)
    defer a.head.iso.closeAppend(a.appendID)

    total := len(a.samples)
    var series *memSeries
    for i, s := range a.samples {
        series = a.sampleSeries[i]
        series.Lock()
        ok, chunkCreated := series.append(s.T, s.V, a.appendID, a.head.chunkDiskMapper)
        series.cleanupAppendIDsBelow(a.cleanupAppendIDsBelow)
        series.pendingCommit = false
        series.Unlock()

        if !ok {
            total--
            a.head.metrics.outOfOrderSamples.Inc()
        }
        if chunkCreated {
            a.head.metrics.chunks.Inc()
            a.head.metrics.chunksCreated.Inc()
        }
    }

    a.head.metrics.samplesAppended.Add(float64(total))
    a.head.updateMinMaxTime(a.mint, a.maxt)

    return nil
}
block

Periodically persisting the in-memory data to disk makes sense. Each Block stores all the data of its time window, including all series, their samples, and the associated index structures. A Block directory looks like this:
drwxr-xr-x 2 root root 4.0K Jul  9 17:00 chunks
-rw-r--r-- 1 root root  92M Jul  9 17:01 index
-rw-r--r-- 1 root root  906 Jul  9 17:01 meta.json
-rw-r--r-- 1 root root    9 Jul  9 17:01 tombstones
meta.json holds the current Block's metadata:
{
    "ulid": "01ECSCWB8KEP5BVX7R4NAVRV36",
    "minTime": 1594209600000,
    "maxTime": 1594274400000,
    "stats": {
        "numSamples": 1232542974,
        "numSeries": 287055,
        "numChunks": 10363468
    },
    "compaction": {
        "level": 3,
        "sources": [
            "01ECQF1KCYM6F6CH40XSPH24GY",
            "01ECQNXAMYR96DXBNSNTX0CMV2",
            "01ECQWS1WYZ3CNTMMHY92X65DZ",
            "01ECR3MS4ZAHZT7ZC63CST0NKA",
            "01ECRAGGCYXZWBMW1BKQB47MBK",
            "01ECRHC7MYVKWTBCB1Y9QMDH86",
            "01ECRR7YWZ0MTVYBGRYHE9N8ZQ",
            "01ECRZ3P64VF8BF4W4WCSNE05S",
            "01ECS5ZDCY3C05AMWEJE8N4A6J"
        ],
        "parents": [
            { "ulid": "01ECR3NCHN36VP7GYF4KSAVZPW", "minTime": 1594209600000, "maxTime": 1594231200000 },
            { "ulid": "01ECRR8HW7BCVXWFDDNJN9PCBB", "minTime": 1594231200000, "maxTime": 1594252800000 },
            { "ulid": "01ECSCVRGEAPH4VK58HC41E0WJ", "minTime": 1594252800000, "maxTime": 1594274400000 }
        ]
    },
    "version": 1
}
- ulid: identifies this Block; it is identical to the Block's directory name.
- minTime and maxTime: the time window of the data stored in this Block.
- stats: the number of samples, series and chunks this Block contains.
- compaction: the compaction history of this Block. As time passes, Blocks are compacted and merged into larger Blocks. The level field is the number of compactions: a Block freshly persisted from memory has level 1, and each merge gives the child Block a level one higher than its parents'. The sources field lists the ulids of all ancestor Blocks that went into this Block. For Blocks with level >= 2 there is additionally a parents field, holding the ulids and time windows of the direct parent Blocks.
chunks is a subdirectory containing files numbered from 000001, each capped at roughly 512MB. They store all samples within the window [minTime, maxTime]; in essence they are the persisted form of the in-memory memChunks that fall inside that window.

The index file stores everything index-related. Persisted data is read with relatively low probability, but reads do happen, so extracting time series from a Block quickly still matters, and reading the index fast and locating series through it is the key to overall query performance. To achieve this, the layout of the index file is fairly involved:
┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b>     │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                 Symbol Table                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                    Series                    │ │
│ ├──────────────────────────────────────────────┤ │
│ │                Label Index 1                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                     ...                      │ │
│ ├──────────────────────────────────────────────┤ │
│ │                Label Index N                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                  Postings 1                  │ │
│ ├──────────────────────────────────────────────┤ │
│ │                     ...                      │ │
│ ├──────────────────────────────────────────────┤ │
│ │                  Postings N                  │ │
│ ├──────────────────────────────────────────────┤ │
│ │              Label Index Table               │ │
│ ├──────────────────────────────────────────────┤ │
│ │                Postings Table                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                     TOC                      │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
For details, see the documentation of the Prometheus index file format.