关于prometheus:prometheus-本地存储解析及其使用的那些黑科技

Time Series DBMS are designed to efficiently collect, store and query various time series with high transaction volumes

type sample struct {
    t int64
    v float64
}

sample 代表一个数据点
size:16byte: 蕴含 1 个 8byte int64 工夫戳和 1 个 8byte float64 value

type Label struct {Name, Value string}

一对 label 比方 job=”ec2″

type Labels []Label

就是 metric 一个指标的所有 tag values

type Vector []Sample

vector 向量, 是 samples 的别名, 然而所有 sample 具备雷同 timestamp , 罕用作 instance_query 的后果

type Scalar struct {
    T int64
    V float64
}

Scalar 标量间接查个数字

type Series struct {
    Metric labels.Labels `json:"metric"`
    Points []Point       `json:"values"`}

Series 是一个 metric 的数据流

type Matrix []Series

Matrix 是 series 的切片，个别的 range_query 返回的后果

prometheus 参考 facebook-gorilla dod 压缩 timestamp ,xor 压缩 value, 在两个小时的 block 里从 16byte 压缩到 1.37byte，压缩比例高达 11.6

    Delta1       Delta2        Delta3
T1 --------> T2 --------> T3 --------> T4

原理：依据时序数据库特点，采集距离根本稳固，所以 ts 的 delta 基本相同，采纳不等长距离压缩，应用较少 bit 存储代替原来的 16byte 64bite 存储达到压缩的目标
如上图所示: March 24,2015 02:01:02 02:02:02 02:03:02 三个工夫戳

March 24,2015 02:00:00 | 62 | '10':-2 | '0' |
bit length  64          14       9       1 
      压缩率: (64+14+9+1)/(64*4)=0.34375

假如现实状况采集稳固两头没有丢点，前面 timestamp 都能够用 ’0’ 来填补，压缩率会大大晋升
应用 dod 而不间接应用 detal 我认为还是再次压缩
而且 D 的区间更多会落在后面几个里

原理: 时序数据库相邻点变动不大，采纳异或压缩 float64 的前缀和后缀 0 个数

我认为 xor 压缩成果取决于 series 曲线稳定状况，越激烈压缩成果越差，越平滑压缩成果越好

prome 将最近的热数据存储在内存中叫做 head block

// memSeries is the in-memory representation of a series. None of its methods
// are goroutine safe and it is the caller's responsibility to lock it.
type memSeries struct {
    sync.RWMutex

    ref           uint64
    lset          labels.Labels
    mmappedChunks []*mmappedChunk
    headChunk     *memChunk
    chunkRange    int64
    firstChunkID  int

    nextAt        int64 // Timestamp at which to cut the next chunk.
    sampleBuf     [4]sample
    pendingCommit bool // Whether there are samples waiting to be committed to this series.

    app chunkenc.Appender // Current appender for the chunk.

    memChunkPool *sync.Pool

    txs *txRing
}

其中最重要的三个字段

ref 相当于这个 series 的 uid : 在 getOrCreate 函数中给出了明确的解释，应用递增的正整数作为 ref，而没有应用 hash 因为 hash random 且不利于索引

func (h *Head) getOrCreate(hash uint64, lset labels.Labels) (*memSeries, bool, error) {
    // Just using `getOrSet` below would be semantically sufficient, but we'd create
    // a new series on every sample inserted via Add(), which causes allocations
    // and makes our series IDs rather random and harder to compress in postings.
    s := h.series.getByHash(hash, lset)
    if s != nil {return s, false, nil}

    // Optimistically assume that we are the first one to create the series.
    id := atomic.AddUint64(&h.lastSeriesID, 1)

    return h.getOrCreateWithID(id, hash, lset)
}

lset 这个 series 的 label map
headChunk 即 memChunk 在这个 series 在这段时间内的数据汇合

type memChunk struct {
    chunk            chunkenc.Chunk
    minTime, maxTime int64
}

既然 memSeries 代表 series，那么如何通过一堆 label 最快地找到对应的 series 呢？哈希表显然是最佳的计划。基于 label 计算一个哈希值，保护一张哈希值与 memSeries 的映射表，如果产生哈希碰撞的话，则间接用 label 进行匹配。因而，Prometheus 有必要在内存中保护如下所示的两张哈希表，从而无论利用 ref 还是 label 都能很快找到对应的memSeries

{series map[uint64]*memSeries // ref 到 memSeries 的映射
    hashes map[uint64][]*memSeries // labels 的哈希值到 memSeries 的映射}

golang 中的 map 不是并发平安的，而 Prometheus 中又有大量对于 memSeries 的增删操作，如果在读写上述构造时简略地用一把大锁锁住，显然无奈满足性能要求
prometheus 的解决办法就是拆分锁

const (
    // DefaultStripeSize is the default number of entries to allocate in the stripeSeries hash map.
    DefaultStripeSize = 1 << 14
)

// stripeSeries locks modulo ranges of IDs and hashes to reduce lock contention.
// The locks are padded to not be on the same cache line. Filling the padded space
// with the maps was profiled to be slower – likely due to the additional pointer
// dereferences.
type stripeSeries struct {
    size                    int
    series                  []map[uint64]*memSeries
    hashes                  []seriesHashmap
    locks                   []stripeLock
    seriesLifecycleCallback SeriesLifecycleCallback
}

type stripeLock struct {
    sync.RWMutex
    // Padding to avoid multiple locks being on the same cache line.
    _ [40]byte
}

初始化 head 的时候生成 16384 个小哈希表，如果想依据 ref 找到 memSeries 只须要把 ref 对 16384 取模找到对应的 series[x]，只须要 lock[x]，从而大大降低了读写 memSeries 时对锁的抢占造成的耗费，晋升读写吞吐量

func (s *stripeSeries) getByHash(hash uint64, lset labels.Labels) *memSeries {i := hash & uint64(s.size-1)

    s.locks[i].RLock()
    series := s.hashes[i].get(hash, lset)
    s.locks[i].RUnlock()

    return series
}

留神看这里取模的操作应用的是 & 而不是 % 这是因为位运算 (&) 效率要比取模运算 (%) 高很多，次要起因是位运算间接对内存数据进行操作，不须要转成十进制，因而处理速度十分快

 a % b == a & (b - 1) 前提：b 为 2^n

下面咱们看到怎么依据一个 metric 的全副 label 找到对应的 memSeries，然而更通用的是 label 含糊匹配查问

比方 node_exportor 中的 cpu 指标 node_cpu_seconds_total 指定 cpu=”1″ 代表第二个外围，同时还有 7 种 mode，甚至 Prometheus 反对在指定 label 时应用正则表达式：node_cpu_seconds_total{mode=~”user|system”}

面对如此简单的查问，必须要要引入倒排索引来解决这个问题

// MemPostings holds postings list for series ID per label pair. They may be written
// to out of order.
// ensureOrder() must be called once before any reads are done. This allows for quick
// unordered batch fills on startup.
type MemPostings struct {
    mtx     sync.RWMutex
    m       map[string]map[string][]uint64
    ordered bool
}

假如 prometheus 抓取到一个新的 series node_cpu_seconds_total{mode=”user”,cpu=”0″,instance=”1.1.1.1:9100″} 他的 ref 为 x
在初始化 memSeries 后，更新哈希表后，还须要对倒排索引进行更新

func (h *Head) getOrCreateWithID(id, hash uint64, lset labels.Labels) (*memSeries, bool, error) {s := newMemSeries(lset, id, h.chunkRange, &h.memChunkPool)

    s, created, err := h.series.getOrSet(hash, s)
    if err != nil {return nil, false, err}
    if !created {return s, false, nil}

    h.metrics.seriesCreated.Inc()
    atomic.AddUint64(&h.numSeries, 1)

    h.postings.Add(id, lset)
    ......
}

次要函数 addFor

MemPostings.m["__name__"]["node_cpu_seconds_total"]={..,x,..}
MemPostings.m["mode"]["user"]={..,x,..}
MemPostings.m["cpu"]["0"]={..,x,..}
MemPostings.m["instance"]["1.1.1.1:9100"]={..,x,..}

func (p *MemPostings) addFor(id uint64, l labels.Label) {nm, ok := p.m[l.Name]
    if !ok {nm = map[string][]uint64{}
        p.m[l.Name] = nm
    }
    list := append(nm[l.Value], id)
    nm[l.Value] = list

    if !p.ordered {return}
    // There is no guarantee that no higher ID was inserted before as they may
    // be generated independently before adding them to postings.
    // We repair order violations on insert. The invariant is that the first n-1
    // items in the list are already sorted.
    for i := len(list) - 1; i >= 1; i-- {if list[i] >= list[i-1] {break}
        list[i], list[i-1] = list[i-1], list[i]
    }
}

匹配过程如下

假如
MemPostings.m["__name__"]["node_cpu_seconds_total"]={1,2,3,5,7,8}
MemPostings.m["mode"]["user"]={10,2,3,4,6,8}
求 node_cpu_seconds_total{mode="user"}即是
求交加 -->   {1,2,3,5,7,8} & {10,2,3,4,6,8} = {2,3,8}

然而如果每个 label pair 蕴含的 series 足够多，那么对多个 label pair 的 series 做交加也将是十分耗时的操作。那么能不能进一步优化呢？事实上，只有放弃每个 label pair 里蕴含的 series 有序就能够了，这样就能将复杂度从指数级霎时降落到线性级。

上面是 prometheus data 目录的状况

drwxr-xr-x 3 root root   4096 Jul 14 23:01 01ED6XF97X5B5RKJMB6E6CRG85
drwxr-xr-x 3 root root   4096 Jul 15 05:00 01ED7J1R0E4BZ398TE48CC0WJQ
drwxr-xr-x 3 root root   4096 Jul 15 11:00 01ED86M8WZ4J66F4M842WDWC69
drwxr-xr-x 3 root root   4096 Jul 15 11:00 01ED86N02VPNDEPS694CEZY2DP
drwxr-xr-x 2 root root   4096 Jul 15 11:00 chunks_head
-rw-r--r-- 1 root root      0 May 20 18:29 lock
-rw-r--r-- 1 root root 100001 Jul 15 11:30 queries.active
drwxr-xr-x 3 root root   4096 Jul 15 11:30 wal

Prometheus 在将采集到的数据真正写入内存之前，会首先存入 WAL（Write Ahead Log） 中。因为 WAL 是寄存在磁盘中的，相当于对内存中的监控数据做了一个齐全的备份，即便 Prometheus 解体这部分的数据也不至于失落。当 Prometheus 重启之后，它首先会将 WAL 的内容加载到内存中，从而完满复原到解体之前的状态，接着再开始新数据的抓取

# wal 目录
-rw-r--r-- 1 root root 128M Jul 15 13:59 00005153
-rw-r--r-- 1 root root 128M Jul 15 14:10 00005154
-rw-r--r-- 1 root root 128M Jul 15 14:20 00005155
-rw-r--r-- 1 root root 128M Jul 15 14:31 00005156
-rw-r--r-- 1 root root 128M Jul 15 14:41 00005157
-rw-r--r-- 1 root root 128M Jul 15 14:52 00005158
-rw-r--r-- 1 root root  96M Jul 15 15:00 00005159
-rw-r--r-- 1 root root 128M Jul 15 15:10 00005160
-rw-r--r-- 1 root root 128M Jul 15 15:21 00005161
-rw-r--r-- 1 root root 128M Jul 15 15:31 00005162
-rw-r--r-- 1 root root 128M Jul 15 15:42 00005163
-rw-r--r-- 1 root root 128M Jul 15 15:52 00005164
-rw-r--r-- 1 root root 128M Jul 15 16:03 00005165
-rw-r--r-- 1 root root  41M Jul 15 16:06 00005166
drwxr-xr-x 2 root root 4.0K Jul 15 15:00 checkpoint.00005152

headAppender 外面 mock 了事务的 commit 办法，在这里能够看到首先执行 a.log() 如果呈现谬误则先回滚不会写入内存，那么这个 a.log() 就是向 wal 中写入

func (a *headAppender) Commit() error {if err := a.log(); err != nil {
        //nolint: errcheck
        a.Rollback() // Most likely the same error will happen again.
        return errors.Wrap(err, "write to WAL")
    }

    defer a.head.metrics.activeAppenders.Dec()
    defer a.head.putAppendBuffer(a.samples)
    defer a.head.putSeriesBuffer(a.sampleSeries)
    defer a.head.iso.closeAppend(a.appendID)

    total := len(a.samples)
    var series *memSeries
    for i, s := range a.samples {series = a.sampleSeries[i]
        series.Lock()
        ok, chunkCreated := series.append(s.T, s.V, a.appendID, a.head.chunkDiskMapper)
        series.cleanupAppendIDsBelow(a.cleanupAppendIDsBelow)
        series.pendingCommit = false
        series.Unlock()

        if !ok {
            total--
            a.head.metrics.outOfOrderSamples.Inc()}
        if chunkCreated {a.head.metrics.chunks.Inc()
            a.head.metrics.chunksCreated.Inc()}
    }

    a.head.metrics.samplesAppended.Add(float64(total))
    a.head.updateMinMaxTime(a.mint, a.maxt)

    return nil
}

期将内存中的数据长久化到磁盘是正当的。每一个 Block 存储了对应工夫窗口内的所有数据，包含所有的 series，samples 以及相干的索引构造。Block 目录的具体内容如下：

drwxr-xr-x 2 root root 4.0K Jul  9 17:00 chunks
-rw-r--r-- 1 root root  92M Jul  9 17:01 index
-rw-r--r-- 1 root root  906 Jul  9 17:01 meta.json
-rw-r--r-- 1 root root    9 Jul  9 17:01 tombstones

其中 meta.json 蕴含了以后 Block 的元数据信息，其内容如下：

{
    "ulid": "01ECSCWB8KEP5BVX7R4NAVRV36",
    "minTime": 1594209600000,
    "maxTime": 1594274400000,
    "stats": {
        "numSamples": 1232542974,
        "numSeries": 287055,
        "numChunks": 10363468
    },
    "compaction": {
        "level": 3,
        "sources": [
            "01ECQF1KCYM6F6CH40XSPH24GY",
            "01ECQNXAMYR96DXBNSNTX0CMV2",
            "01ECQWS1WYZ3CNTMMHY92X65DZ",
            "01ECR3MS4ZAHZT7ZC63CST0NKA",
            "01ECRAGGCYXZWBMW1BKQB47MBK",
            "01ECRHC7MYVKWTBCB1Y9QMDH86",
            "01ECRR7YWZ0MTVYBGRYHE9N8ZQ",
            "01ECRZ3P64VF8BF4W4WCSNE05S",
            "01ECS5ZDCY3C05AMWEJE8N4A6J"
        ],
        "parents": [
            {
                "ulid": "01ECR3NCHN36VP7GYF4KSAVZPW",
                "minTime": 1594209600000,
                "maxTime": 1594231200000
            },
            {
                "ulid": "01ECRR8HW7BCVXWFDDNJN9PCBB",
                "minTime": 1594231200000,
                "maxTime": 1594252800000
            },
            {
                "ulid": "01ECSCVRGEAPH4VK58HC41E0WJ",
                "minTime": 1594252800000,
                "maxTime": 1594274400000
            }
        ]
    },
    "version": 1
}

ulid：用于辨认这个 Block 的编号，它与 Block 的目录名统一

minTime和maxTime：示意这个 Block 存储的数据的工夫窗口

stats：示意这个 Block 蕴含的 sample, series 以及 chunks 数目

compaction：这个 Block 的压缩信息，因为随着工夫的流逝，多个 Block 也会压缩合并造成更大的 Block。level字段示意了压缩的次数，刚从内存长久化的 Block 的 level 为 1，每被联结压缩一次，子 Block 的 level 就会在父 Block 的根底上加一，而 sources 字段则蕴含了形成以后这个 Block 的所有先人 Block 的 ulid。事实上，对于level >= 2 的 Block，还会有一个 parent 字段，蕴含了它的父 Block 的 ulid 以及工夫窗口信息。

chunks是一个子目录，蕴含了若干个从 000001 开始编号的文件，个别每个文件大小的下限为 512M。文件中存储的就是在工夫窗口 [minTime，maxTime] 以内的所有 samples，实质上就是对于内存中符合要求的memChunk 的长久化。
index文件存储了索引相干的内容，尽管长久化后的数据被读取的概率是比拟低的，然而仍然存在被读取的可能。这样一来，如何尽快地从 Block 中读取时序数据就显得尤为重要了，而疾速读取索引并且基于索引查找时序数据则是放慢整体查问效率的要害。为了达到这一指标，存储索引信息的 index 文件在设计上就显得比较复杂了。

┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b>     │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                 Symbol Table                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                    Series                    │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index 1                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index N                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings 1                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings N                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │               Label Index Table              │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Postings Table               │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      TOC                     │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

具体请看 prometheus index 文件

关于prometheus:prometheus-本地存储解析及其使用的那些黑科技

本文代码基于 prometheus 2.19.2 剖析

基本概念

什么是 tsdb

prometheus 基本概念

sample 数据点

Label 标签

Labels 标签组

vector 向量

Scalar 标量

Series 数据流

Matrix 矩阵

数据点压缩

dod 压缩 timestamp

xor 压缩 value

存储过程

series 内存数据结构 memSeries

哈希表

锁的粒度

倒排索引

含糊匹配

数据结构

索引过程

长久化存储

wal

block