About Databases: A TiKV Disk Usage Problem Where GC Stopped Working Because of a TiCDC Anomaly


From the title alone, you already know the following:

(1) What went wrong? OK: every TiKV node in the cluster was using more than 80% of its storage, firing alert after alert in the middle of the night. Can the DBAs still get a good night's sleep?

(2) Who did it? OK: GC stopped doing its job.

(3) What was the real root cause? OK: TiCDC takes the blame.

And with that, the article is basically summarized. Tempted to close the tab? Please don't: let me walk through the details, because there is real substance ahead. I will explain exactly what I did to bring disk usage down, why GC stopped working, how GC actually works, and why TiCDC is the one to blame.

The late-night alerts and the immediate handling

At 11 p.m. I received a string of disk-space alerts from the TiKV cluster, so I got up, opened the laptop, and started looking. High TiKV disk usage usually comes down to a few causes:

  • Disk space was already tight to begin with; did some application "quietly" bulk-load data overnight again? This can usually be pinpointed from the Grafana overview dashboards.
  • TiKV logs blowing up. I had previously seen TiKV log volume explode after a cluster upgrade because hibernate regions were enabled. Check how fast the TiKV logs are growing; if the logs are the culprit, simply rm them. TiKV also supports a log retention policy that deletes old logs automatically.
  • Other situations.

This incident fell into the "other situations" bucket. The cluster has many users, writes had not spiked, and the TiKV logs showed nothing unusual. Since I knew it was a test cluster, I simply deleted some TiKV logs from 2021 to bring storage back under 80% and silence the alerts.

The next day...

Time to properly clean up the test TiDB cluster and solve the TiKV storage problem. There are two approaches:

(1) Find resources and scale out TiKV.

(2) Check whether some tables are oversized or something else is abnormal.

Since this is a test cluster, there was no rush to scale out; if a few large tables could simply have data cleaned up, the space problem would solve itself. So I first used the following statements to get a rough picture of database and table sizes (not exact, but good enough as a reference):

For non-partitioned tables, check table usage with:

select TABLE_SCHEMA,TABLE_NAME,TABLE_ROWS,(DATA_LENGTH+INDEX_LENGTH)/1024/1024/1024 as table_size from information_schema.tables order by table_size desc limit 20;

The information_schema.PARTITIONS table covers both partitioned and non-partitioned tables (we use partitioned tables a lot):

select TABLE_SCHEMA,TABLE_NAME,PARTITION_NAME,TABLE_ROWS,(DATA_LENGTH+INDEX_LENGTH)/1024/1024/1024 as table_size from information_schema.PARTITIONS order by table_size desc limit 20;

The official tooling also provides an interface to report table disk usage (more accurate):

tidb-ctl table disk-usage --database(-d) [database name] --table(-t) [table name]
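
For instance, a hypothetical invocation (the database and table names below are made up for illustration) would look like:

tidb-ctl table disk-usage -d testdb -t big_table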

Looking at the SQL results first, I found that some tables had hundreds of millions of rows, and some tables reported 0 rows yet still had a nonzero size. Since I had often run into cases where stale statistics make the information_schema tables inaccurate, I went ahead and re-collected statistics for every table in the cluster.

The next step was to talk to the application owners, drop an unused partition, and truncate some table data, and then I waited for the space to be reclaimed. But after more than half an hour, with barely any new writes coming in, the region count and disk usage had not moved at all. Could GC be broken?
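
As a side note, a quick way to sanity-check the region count for a given table (not part of the original troubleshooting, just a sketch; the schema and table names are placeholders) is to query information_schema:

select db_name, table_name, count(*) as region_cnt from information_schema.tikv_region_status where db_name = 'testdb' and table_name = 'big_table' group by db_name, table_name;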

OK, so I first checked the GC status in the mysql.tidb table. One look was enough to shock me: tikv_gc_safe_point was still stuck in April 2021. How many MVCC versions had piled up in this cluster?? My goodness.
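
For reference, the GC-related values in mysql.tidb (including tikv_gc_safe_point and the last run time) can be listed with a simple query; this is a minimal sketch of the check behind the screenshot:

select VARIABLE_NAME, VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME like 'tikv_gc%';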

Analyzing and fixing the GC problem

The screenshot above already tells us this is a GC problem; now we need to find out what exactly is wrong with GC, which usually means logs and monitoring. First, check the tidb-server logs:

{"level":"INFO","time":"2022/03/16 22:55:06.268 +08:00","caller":"gc_worker.go:230","message":"[gc worker] quit","uuid":"5fdb15468e00003"}
{"level":"INFO","time":"2022/03/16 22:55:22.603 +08:00","caller":"gc_worker.go:202","message":"[gc worker] start","uuid":"5fe4ce7c7200002"}
{"level":"INFO","time":"2022/03/16 22:58:22.629 +08:00","caller":"gc_worker.go:608","message":"[gc worker] there's another service in the cluster requires an earlier safe point. gc will continue with the earlier one","uuid":"5fe4ce7c7200002","ourSafePoint":431867062548692992,"minSafePoint":424280964185194498}
{"level":"INFO","time":"2022/03/16 22:58:22.637 +08:00","caller":"gc_worker.go:329","message":"[gc worker] starts the whole job","uuid":"5fe4ce7c7200002","safePoint":424280964185194498,"concurrency":5}
{"level":"INFO","time":"2022/03/16 22:58:22.639 +08:00","caller":"gc_worker.go:1036","message":"[gc worker] start resolve locks","uuid":"5fe4ce7c7200002","safePoint":424280964185194498,"concurrency":5}
{"level":"INFO","time":"2022/03/16 22:58:41.268 +08:00","caller":"gc_worker.go:1057","message":"[gc worker] finish resolve locks","uuid":"5fe4ce7c7200002","safePoint":424280964185194498,"regions":81662}
{"level":"INFO","time":"2022/03/16 22:59:22.617 +08:00","caller":"gc_worker.go:300","message":"[gc worker] there's already a gc job running, skipped","leaderTick on":"5fe4ce7c7200002"}
{"level":"INFO","time":"2022/03/16 23:00:21.331 +08:00","caller":"gc_worker.go:701","message":"[gc worker] start delete ranges","uuid":"5fe4ce7c7200002","ranges":0}
{"level":"INFO","time":"2022/03/16 23:00:21.331 +08:00","caller":"gc_worker.go:750","message":"[gc worker] finish delete ranges","uuid":"5fe4ce7c7200002","num of ranges":0,"cost time":"677ns"}
{"level":"INFO","time":"2022/03/16 23:00:21.336 +08:00","caller":"gc_worker.go:773","message":"[gc worker] start redo-delete ranges","uuid":"5fe4ce7c7200002","num of ranges":0}
{"level":"INFO","time":"2022/03/16 23:00:21.336 +08:00","caller":"gc_worker.go:802","message":"[gc worker] finish redo-delete ranges","uuid":"5fe4ce7c7200002","num of ranges":0,"cost time":"4.016µs"}
{"level":"INFO","time":"2022/03/16 23:00:21.339 +08:00","caller":"gc_worker.go:1562","message":"[gc worker] sent safe point to PD","uuid":"5fe4ce7c7200002","safe point":424280964185194498}
{"level":"INFO","time":"2022/03/16 23:08:22.638 +08:00","caller":"gc_worker.go:608","message":"[gc worker] there's another service in the cluster requires an earlier safe point. gc will continue with the earlier one","uuid":"5fe4ce7c7200002","ourSafePoint":431867219835092992,"minSafePoint":424280964185194498}
{"level":"INFO","time":"2022/03/16 23:08:22.647 +08:00","caller":"gc_worker.go:329","message":"[gc worker] starts the whole job","uuid":"5fe4ce7c7200002","safePoint":424280964185194498,"concurrency":5}
{"level":"INFO","time":"2022/03/16 23:08:22.648 +08:00","caller":"gc_worker.go:1036","message":"[gc worker] start resolve locks","uuid":"5fe4ce7c7200002","safePoint":424280964185194498,"concurrency":5}
{"level":"INFO","time":"2022/03/16 23:08:43.659 +08:00","caller":"gc_worker.go:1057","message":"[gc worker] finish resolve locks","uuid":"5fe4ce7c7200002","safePoint":424280964185194498,"regions":81647}
{"level":"INFO","time":"2022/03/16 23:09:22.616 +08:00","caller":"gc_worker.go:300","message":"[gc worker] there's already a gc job running, skipped","leaderTick on":"5fe4ce7c7200002"}
{"level":"INFO","time":"2022/03/16 23:10:22.616 +08:00","caller":"gc_worker.go:300","message":"[gc worker] there's already a gc job running, skipped","leaderTick on":"5fe4ce7c7200002"}
{"level":"INFO","time":"2022/03/16 23:10:23.727 +08:00","caller":"gc_worker.go:701","message":"[gc worker] start delete ranges","uuid":"5fe4ce7c7200002","ranges":0}
{"level":"INFO","time":"2022/03/16 23:10:23.727 +08:00","caller":"gc_worker.go:750","message":"[gc worker] finish delete ranges","uuid":"5fe4ce7c7200002","num of ranges":0,"cost time":"701ns"}
{"level":"INFO","time":"2022/03/16 23:10:23.729 +08:00","caller":"gc_worker.go:773","message":"[gc worker] start redo-delete ranges","uuid":"5fe4ce7c7200002","num of ranges":0}
{"level":"INFO","time":"2022/03/16 23:10:23.730 +08:00","caller":"gc_worker.go:802","message":"[gc worker] finish redo-delete ranges","uuid":"5fe4ce7c7200002","num of ranges":0,"cost time":"710ns"}
{"level":"INFO","time":"2022/03/16 23:10:23.733 +08:00","caller":"gc_worker.go:1562","message":"[gc worker] sent safe point to PD","uuid":"5fe4ce7c7200002","safe point":424280964185194498}
{"level":"INFO","time":"2022/03/16 23:18:22.647 +08:00","caller":"gc_worker.go:608","message":"[gc worker] there's another service in the cluster requires an earlier safe point. gc will continue with the earlier one","uuid":"5fe4ce7c7200002","ourSafePoint":431867377121492992,"minSafePoint":424280964185194498}

The logs show that GC keeps waiting on an old safe point: 424280964185194498

there's another service in the cluster requires an earlier safe point. gc will continue with the earlier one","uuid":"5fe4ce7c7200002","ourSafePoint":431870050977185792,"minSafePoint":424280964185194498

Decoding that TSO into wall-clock time shows it matches the tikv_gc_safe_point time from the mysql.tidb screenshot above: 2021-04-16 00:17:13.

$ tiup ctl:v5.4.0 pd -u http://10.203.178.96:2379 tso 424280964185194498
system:  2021-04-16 00:17:13.934 +0800 CST
logic:   2

Next, run the pd-ctl service-gc-safepoint command to query the current GC safepoint and the service GC safepoints. The stuck safe point matches TiCDC's safe_point: 424280964185194498.

$ tiup ctl:v5.4.0 pd -u http://10.xxxx.96:2379 service-gc-safepoint
{
  "service_gc_safe_points": [
    {
      "service_id": "gc_worker",
      "expired_at": 9223372036854775807,
      "safe_point": 431880793651412992
    },
    {
      "service_id": "ticdc",
      "expired_at": 1618503433,
      "safe_point": 424280964185194498
    }
  ],
  "gc_safe_point": 424280964185194498
}

Checking the TiCDC tasks in this cluster shows that the stuck safe point matches the TSO of the changefeed whose id is dxl-xxxx-task:
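
The changefeeds in a cluster can be listed with the cdc CLI; here is a sketch using the same masked PD address as elsewhere in this post:

tiup cdc:v5.4.0 cli changefeed list --pd='http://10.xxxx.96:2379'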

At this point we have finally found the cause of the GC problem: a TiCDC replication task that was never removed. Solution: remove that changefeed from TiCDC.

tiup cdc:v5.4.0 cli changefeed remove --pd='http://10.xxxx.96:2379' --changefeed-id=dxl-replication-task

After waiting a while and watching the safe point, I could see GC had started to advance:
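
To keep an eye on the recovery, one option (just a sketch; the 60-second interval is arbitrary) is to poll the service safepoints until gc_safe_point starts moving:

watch -n 60 'tiup ctl:v5.4.0 pd -u http://10.xxxx.96:2379 service-gc-safepoint'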

However, it then got stuck on another changefeed (ticdc-demo-xxxx-test-shbt-core), so I kept removing:

tiup cdc:v5.4.0 cli changefeed remove --pd='http://10.xxxx.96:2379' --changefeed-id=ticdc-demo-xxxx-test-shbt-core

After the removal and a fairly long wait (several hours), the GC safe point finally returned to normal:

Once GC finished reclaiming space, per-TiKV storage dropped from the earlier 800 GB to around 200 GB. You can also see it in the cf size panel on the right of the chart below: the write CF dropped the most, from about 3.2 TB to about 790 GB. As you know, among TiKV's column families the write CF is the one that stores version information (MVCC), so this drop confirms that TiCDC had caused an excessive buildup of MVCC versions. The problem is now fully resolved:

Although the problem is solved, TiCDC by default only holds back GC for 24 hours; beyond that window it should no longer block GC. So why did the above happen? I asked PingCAP's R&D folks, and the short answer is: it's a bug. It was fixed in 4.0.13, 5.0.2, and 5.1.0. Since this test cluster had been upgraded all the way from 3.0, we most likely hit the pit while running an older version. The fix PR is linked here: https://github.com/pingcap/ti…

How GC works

Having walked through the TiCDC -> GC -> disk-alert chain and its resolution, it is still worth giving a brief explanation of how GC works.

The GC module in the TiDB server architecture

GC is a module inside tidb-server whose main job is to clean up old data that is no longer needed.

As you know, TiKV is built on an LSM-tree storage architecture. With this kind of storage, an update does not modify the existing data in place; instead, a new copy is written, and every copy is tagged with a transaction timestamp. This is the MVCC (multi-version concurrency control) mechanism we often talk about. To keep the number of versions of the same record from growing without bound, TiDB uses tikv_gc_life_time to control how long old versions are retained (default 10min), and every tikv_gc_run_interval a GC run is triggered to clear out historical versions. The GC flow is described in detail below.
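
Both knobs live in the mysql.tidb table and can be inspected or adjusted there; a minimal sketch (the 30m value is only an example, not a recommendation, and newer versions also expose these as the tidb_gc_life_time / tidb_gc_run_interval system variables):

select VARIABLE_NAME, VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME in ('tikv_gc_life_time', 'tikv_gc_run_interval');
update mysql.tidb set VARIABLE_VALUE = '30m' where VARIABLE_NAME = 'tikv_gc_life_time';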

The GC flow

GC is always triggered by the GC leader. A cluster of tidb-servers elects one GC leader; which node currently holds that role can be seen in the mysql.tidb table. The leader is just a role: the actual GC work is run by the gc worker on that leader node. You can also trace the GC flow through the logs shown earlier in this article.
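
A minimal query to see which tidb-server currently holds the GC leader role (the leader's UUID, description, and lease are stored in mysql.tidb):

select VARIABLE_NAME, VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME like 'tikv_gc_leader%';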

  • (1) The Resolve Locks stage

The name is self-explanatory: this stage cleans up locks. In the GC leader logs you can see start resolve locks and finish resolve locks entries, each carrying the specific safepoint. The safepoint is simply a timestamp; in this stage all regions are scanned, and any lock left behind by 2PC before that timestamp is either committed or rolled back, whichever is appropriate.

  • (2) The Delete Ranges stage

This stage handles the deletion of the large contiguous ranges of data produced by operations such as drop/truncate. TiDB first records the ranges to be deleted in mysql.gc_delete_range, then uses those recorded ranges to locate the corresponding SST files and calls RocksDB's UnsafeDestroyRange interface to physically delete them; ranges that have been deleted are recorded in the mysql.gc_delete_range_done table (see the query sketch after this list). The start delete ranges and finish delete ranges entries in the logs correspond to this stage.

  • (3) The Do GC stage

This stage cleans up the large number of historical versions produced by DML. Since the default retention is 10min, does that mean a metadata table of mine that has not been updated in a year will also be wiped? Don't worry: for every key, GC keeps the data from the last write before the safe point (in other words, even if a table never changes, GC will not "accidentally clean up" its data). Also, the space freed by this DML-driven GC is not released immediately; RocksDB compaction is needed to actually reclaim it, and it may also involve merging a large number of empty regions (the solution for reclaiming empty regions). The log entries for this stage include start redo-delete ranges and finish redo-delete ranges.

  • (4) Send the safe point to PD and end this round of GC. The log entry is: sent safe point to PD
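
As a follow-up to the Delete Ranges stage above, the two bookkeeping tables can be inspected directly; a minimal sketch, assuming read access to the mysql schema:

select * from mysql.gc_delete_range limit 10;
select * from mysql.gc_delete_range_done limit 10;

Ranges still sitting in gc_delete_range are pending deletion; ranges in gc_delete_range_done have already been physically deleted.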

GC monitoring

For GC, these are the main panels to check:

TiKV-Details -> GC -> GC tasks: as described above, if you have run drop/truncate, check whether the total-unsafe-destroy-range metric in this panel has any value; for everyday DML-driven GC, look at the total-gc-keys task counts.

TiKV-Details -> GC -> GC tasks duration: how long the different kinds of GC tasks take.

TiKV-Details -> GC -> GC speed: how fast old MVCC versions are deleted, in keys/s.

PS: To learn more about GC, see the official documentation, or read the GC trilogy written by 坤爷 and 娇姐 on AskTUG: part 1 is 『GC 原理浅析』 (a brief analysis of GC principles), part 2 is 『GC 监控及日志解读』 (GC monitoring and log interpretation), and part 3 is 『GC 处理案例 & FAQ』 (GC troubleshooting cases & FAQ).

Original author: @代晓磊_360, published on 2022-03-28
Original link: https://tidb.io/blog/36c58d32
