乐趣区

关于mysql:故障分析-记一次-MTS-并行复制导致的死锁排查

作者:刘开洋

爱可生交付服务团队北京 DBA,对数据库及周边技术有浓重的学习趣味,喜爱看书,谋求技术。

本文起源:原创投稿

* 爱可生开源社区出品,原创内容未经受权不得随便应用,转载请分割小编并注明起源。


前段时间在客户现场发现一个奇怪的锁问题,顺便拿来和大家分享一下。

景象

MySQL 版本是 8.0.18,在从库的线程期待连贯中观测到的景象是这样的:

mysql> select * from threads;
+-----------+---------------------------------------------+------------+----------------+------------------+------------------+--------------------+---------------------+------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------+------------------+------+--------------+---------+-----------------+--------------+----------------+
| THREAD_ID | NAME                                        | TYPE       | PROCESSLIST_ID | PROCESSLIST_USER | PROCESSLIST_HOST | PROCESSLIST_DB     | PROCESSLIST_COMMAND | PROCESSLIST_TIME | PROCESSLIST_STATE                           | PROCESSLIST_INFO            | PARENT_THREAD_ID | ROLE | INSTRUMENTED | HISTORY | CONNECTION_TYPE | THREAD_OS_ID | RESOURCE_GROUP |
+-----------+---------------------------------------------+------------+----------------+------------------+------------------+--------------------+---------------------+------------------+---------------------------------------------+----------------------------------------------------------------------------------------------------------+------------------+------+--------------+---------+-----------------+--------------+----------------+
······
|        58 | thread/sql/event_scheduler   | FOREGROUND |          8    | NULL                | NULL                 | NULL       | Sleep         |                 NULL | Waiting on empty queue                               | NULL                                                           |                       1 | NULL | YES         | YES     | NULL            |        34147 | SYS_default    |
|        59 | thread/sql/signal_handler    | BACKGROUND |          NULL | NULL                | NULL                 | NULL      | NULL           |                NULL | NULL                                                              | NULL                                                           |                       1 | NULL | YES          | YES     | NULL            |        34148 | SYS_default    |
|        60 | thread/sql/compress_gtid_table| FOREGROUND |         10   | NULL                | NULL                 | NULL      | Daemon     |         33670997 | Suspending                                                   | NULL                                                           |                       1 | NULL | YES          | YES     | NULL            |        34149 | SYS_default    |
|        61 | thread/mysqlx/acceptor_network| BACKGROUND |         NULL | NULL               | NULL                 | NULL       | NULL          |                NULL | NULL                                                               | NULL                                                          |                     40 | NULL | YES          | YES     | NULL            |        34150 | SYS_default    |
|  46627729 | thread/mysqlx/worker          | BACKGROUND |         NULL | NULL                | NULL                 | NULL      | NULL            |               NULL | NULL                                                                | NULL                                                         |                 NULL | NULL | YES          | YES     | NULL            |        34132 | SYS_default    |
|  57810934 | thread/sql/one_connection     | FOREGROUND |     57808740 | proxy_monitor | 10.108.76.139 | NULL       | Query         |                 718 | Waiting for commit lock                                 | set global read_only=on                           |                 NULL | NULL | YES          | YES     | SSL/TLS       |        76945 | USR_default    |
|  47295992 | thread/sql/slave_io          | FOREGROUND |      47294234 | root                 | localhost            | NULL       | Connect     |         6285905 | Waiting for master to send event                 | NULL                                                           |         47294592 | NULL | YES          | YES     | NULL            |        75880 | SYS_default    |
|  47295993 | thread/sql/slave_sql         | FOREGROUND |      47294235 | root                 | localhost            | NULL       | Query         |            64768 | Waiting for dependent transaction to commit | NULL                                                         |         47294592 | NULL | YES          | YES     | NULL            |        75881 | SYS_default    |
|  47295994 | thread/sql/slave_worker      | FOREGROUND |       47294236 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for global read lock                           | INSERT INTO `dcp_deci_acctflow_info` (······ |         47295993 | NULL | YES          | YES     | NULL            |        75882 | SYS_default    |
|  47295995 | thread/sql/slave_worker    | FOREGROUND |       47294237 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for commit lock                                  | NULL                                                          |         47295993 | NULL | YES          | YES     | NULL            |        75883 | SYS_default    |
|  47295996 | thread/sql/slave_worker    | FOREGROUND |       47294238 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for preceding transaction to commit | NULL                                                          |         47295993 | NULL | YES          | YES     | NULL            |        75884 | SYS_default    |
|  47295997 | thread/sql/slave_worker    | FOREGROUND |       47294239 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for global read lock                            | INSERT INTO `dcp_deci_acctflow_info` (······ |         47295993 | NULL | YES          | YES     | NULL            |        75885 | SYS_default    |
|  47295998 | thread/sql/slave_worker    | FOREGROUND |       47294240 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for global read lock                            | INSERT INTO `dcp_deci_acctflow_info` (······ |         47295993 | NULL | YES          | YES     | NULL            |        75886 | SYS_default    |
|  47295999 | thread/sql/slave_worker    | FOREGROUND |       47294241 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for global read lock                            | INSERT INTO `dcp_deci_acctflow_info` (······ |         47295993 | NULL | YES          | YES     | NULL            |        75887 | SYS_default    |
|  47296000 | thread/sql/slave_worker    | FOREGROUND |       47294242 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75888 | SYS_default    |
|  47296001 | thread/sql/slave_worker    | FOREGROUND |       47294243 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75889 | SYS_default    |
|  47296002 | thread/sql/slave_worker    | FOREGROUND |       47294244 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75890 | SYS_default    |
|  47296003 | thread/sql/slave_worker    | FOREGROUND |       47294245 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75891 | SYS_default    |
|  47296004 | thread/sql/slave_worker    | FOREGROUND |       47294246 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75892 | SYS_default    |
|  47296005 | thread/sql/slave_worker    | FOREGROUND |       47294247 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75893| SYS_default    |
|  47296006 | thread/sql/slave_worker    | FOREGROUND |       47294248 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75894 | SYS_default    |
|  47296007 | thread/sql/slave_worker    | FOREGROUND |       47294249 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75895 | SYS_default    |
|  47296008 | thread/sql/slave_worker    | FOREGROUND |       47294250 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75896 | SYS_default    |
|  47296009 | thread/sql/slave_worker    | FOREGROUND |       47294251 | root                 | localhost            | NULL       | Query         |            67415 | Waiting for an event from Coordinator           | NULL                                                           |         47295993 | NULL | YES          | YES     | NULL            |        75897 | SYS_default    |
|  57811108 | thread/sql/one_connection  | FOREGROUND |       57808914 | proxy_monitor  | 10.108.76.140   | NULL     | Query          |              664 | Waiting for commit lock                                    | set global read_only=on                            |                 NULL | NULL | YES          | YES     | SSL/TLS        |        97442 | USR_default    |
|  57811279 | thread/sql/one_connection  | FOREGROUND |       57809085 | proxy_monitor  | 10.108.76.140   | NULL     | Query          |              610 | Waiting for commit lock                                    | set global read_only=on                            |                 NULL | NULL | YES          | YES     | SSL/TLS        |        98136 | USR_default    |
|  57811110 | thread/sql/one_connection  | FOREGROUND |       57808916 | proxy_monitor  | 10.108.76.139   | NULL     | Query          |              663 | Waiting for commit lock                                    | set global read_only=on                            |                 NULL | NULL | YES          | YES     | SSL/TLS        |        97419 | USR_default    |
······                                                                  
mysql> ^C
// 阐明:因为连贯靠近 2300 条,为不便大家看起来没有那么干燥,这里进行局部省略,大多数被省略的连接线程为 set global read_only 的连贯 

剖析

从下面的连贯能够观测到:这里存在着几个锁期待,有等全局读锁的,有等提交锁的,首先应该理清对应的锁期待程序,看看到底是“谁在等我,而我又在等谁”。

通过下面 threads 表的加锁线程,能够剖析到有三条能够抓源头的线程:

  • 等全局读锁的线程
  • 等提交锁的线程
  • 等 MTS 程序提交的线程

咱们从最容易剖析的全局读锁动手,以线程 47295994 为切入点进行追根溯源,看看线程 47295994 持有的锁是什么?这就须要借助 perfoemance_schema 库中的 metadata_locks,即咱们通常剖析 DDL 阻塞的元数据锁表进行剖析,

mysql> select * from perfoemance_schema.metadata_locks where THREAD_ID = 47295994;
+-------------+---------------+-------------+-------------+-----------------------+-----------+---------------+-------------+--------------+-----------------+----------------+
| OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | COLUMN_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_DURATION | LOCK_STATUS | SOURCE       | OWNER_THREAD_ID | OWNER_EVENT_ID |
+-------------+---------------+-------------+-------------+-----------------------+-----------+---------------+-------------+--------------+-----------------+----------------+
| GLOBAL     | NULL           | NULL        | NULL        |     140000636724432 | INTENTION_EXCLUSIVE | EXPLICIT | PENDING  | sql_base.cc:3002 |    47295994 |      149780492 |
+-------------+---------------+-------------+-------------+-----------------------+-----------+---------------+-------------+--------------+-----------------+----------------+
1 rows in set (0.00 sec)

线程 47295994 中的 SQL 是执行了一条 insert 语句,没有察看到该线程持有锁,只是在等全局读锁,咱们接着剖析全局读锁。

通过元数据锁表进一步确认线程,接下来就看看谁持有了全局读锁:

mysql> select * from perfoemance_schema.metadata_locks where OBJECT_TYPE = 'GLOBAL' and LOCK_STATUS = 'GRANTED';
+-------------+---------------+-------------+-------------+-----------------------+-----------+---------------+-------------+--------------+-----------------+----------------+
| OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | COLUMN_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_DURATION | LOCK_STATUS | SOURCE       | OWNER_THREAD_ID | OWNER_EVENT_ID |
+-------------+---------------+-------------+-------------+-----------------------+-----------+---------------+-------------+--------------+-----------------+----------------+
| GLOBAL      | NULL          | NULL        | NULL        |       139955200762880 | SHARED    | EXPLICIT      | GRANTED     | lock.cc:1014 |        57822380 |              3 |
| GLOBAL      | NULL          | NULL        | NULL        |       140006539051472 | SHARED    | EXPLICIT      | GRANTED     | lock.cc:1014 |        57822266 |              3 |
| GLOBAL      | NULL          | NULL        | NULL        |       139957819325888 | SHARED    | EXPLICIT      | GRANTED     | lock.cc:1014 |        57821979 |              3 |
······
+-------------+---------------+-------------+-------------+-----------------------+-----------+---------------+-------------+--------------+-----------------+----------------+
1104 rows in set (0.00 sec)

咱们挑几条记录拉回 threads 表中看看是否找到对应的执行 SQL:

mysql> select * from perfoemance_schema.threads where THREAD_ID in (57822380,57822266,57821979);
+-----------+---------------------------------------------+------------+----------------+------------------+------------------+--------------------+---------------------+------------------+--------------------------------+---------------------+------------------+------+--------------+---------+-----------------+--------------+----------------+
| THREAD_ID | NAME                     | TYPE       | PROCESSLIST_ID | PROCESSLIST_USER | PROCESSLIST_HOST | PROCESSLIST_DB     | PROCESSLIST_COMMAND | PROCESSLIST_TIME | PROCESSLIST_STATE               | PROCESSLIST_INFO              | PARENT_THREAD_ID | ROLE | INSTRUMENTED | HISTORY | CONNECTION_TYPE | THREAD_OS_ID | RESOURCE_GROUP |
+-----------+---------------------------------------------+------------+----------------+------------------+------------------+--------------------+---------------------+------------------+--------------------------------+---------------------+------------------+------+--------------+---------+-----------------+--------------+----------------+
|  57821979 | thread/sql/one_connection   | FOREGROUND |   57808986 | proxy_monitor    | 10.108.76.140   | NULL               | Query               |              664 | Waiting for commit lock      | set global read_only=on              |             NULL | NULL | YES          | YES     | SSL/TLS         |        97442 | USR_default    |
|  57822266 | thread/sql/one_connection   | FOREGROUND |   57819228 | proxy_monitor    | 10.108.76.139   | NULL               | Query               |              679 | Waiting for commit lock      | set global read_only=on              |             NULL | NULL | YES          | YES     | SSL/TLS         |        97434 | USR_default    |
|  57822380 | thread/sql/one_connection   | FOREGROUND |   57829246 | proxy_monitor    | 10.108.76.140   | NULL               | Query               |              694 | Waiting for commit lock      | set global read_only=on              |             NULL | NULL | YES          | YES     | SSL/TLS         |        97476 | USR_default    |
+-----------+---------------------------------------------+------------+----------------+------------------+------------------+--------------------+---------------------+------------------+-----------------------------+------------------------+------------------+------+--------------+---------+-----------------+--------------+----------------+
3 rows in set (0.00 sec)

这里通过屡次查问能够失去确认,全局读锁是 set global read_only=on 这条语句下发的,从 threads 表中,咱们还能找到下发这条语句的用户 proxy_monitor 和连贯 IP 10.108.76.139/140,便能够进一步确认下发全局读锁的利用或者客户端。

通过理解,原来客户应用的 shell 脚本对从库进行监控并配置只读,只有从库不是只读状态,脚本就不停公开发命令去进行配置,因为后面产生的锁阻塞,上千根 set global read_only 连贯在 MySQL 中产生期待。

在 threads 表中,咱们还察看到一个景象,是所有下发全局读锁的线程都在等 commit 锁,那 commit 锁是被谁持有的呢?咱们接着去元数据锁表中去检索:

mysql> select * from perfoemance_schema.metadata_lockswhere OBJECT_TYPE = 'COMMIT' and LOCK_STATUS = 'GRANTED';
+-------------+---------------+-------------+-------------+-----------------------+-----------+---------------+-------------+--------------+-----------------+----------------+
| OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | COLUMN_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_DURATION | LOCK_STATUS | SOURCE       | OWNER_THREAD_ID | OWNER_EVENT_ID |
+-------------+---------------+-------------+-------------+-----------------------+-----------+---------------+-------------+--------------+-----------------+----------------+
| COMMIT      | NULL          | NULL        | NULL        |       140011172675968 | INTENTION_EXCLUSIVE  | EXPLICIT | GRANTED | handler.cc:1697 |    47295996 |      89520430 |
| COMMIT      | NULL          | NULL        | NULL        |       140011035402992 | INTENTION_EXCLUSIVE  | EXPLICIT | GRANTED | handler.cc:1697 |    47295995 |     113911056 |
+-------------+---------------+-------------+-------------+-----------------------+-----------+---------------+-------------+--------------+-----------------+----------------+
2 rows in set (0.00 sec)

看来全局读锁在等线程 47295995、47295996 手中的 commit 锁开释,而 47295994 在等全局读锁开释,这里还差一个知识点,那就是 线程 47295994 和 47295995、47295996 之间的期待关系是什么呢?

不晓得大家对“Waiting for preceding transaction to commit”这个期待状态有没有理解,它的意思是指该线程中的事务要期待同一个事务提交组中靠前的事务优先实现提交,即同组事务程序提交,在 MySQL 中有一个参数对从库的程序提交进行配置(slave_preserve_commit_order)。

在客户现场并没有对“线程号小的连贯中事务优先提交”这个问题进行验证,咱们稍后在本地环境中进行测试复现一下。

一“图”以蔽之

因为从库进行 set global read_only = ON; 时持有 MDL::global read lock,并进一步想要获取 MDL::global commit lock,而且在 LOCK_TYPE 中显示的是 MDL_SHARED 锁。slave_preserve_commit_order= 1 时,relay log 中事务的提交组中的程序会严格依照在 relay log 中呈现的程序提交,所以事务的执行和 set global read_only 语句取得两个锁都不是原子的,并行复制模式下按以下的程序就会呈现死锁:

  1. 以线程为例,线程 47295994、47295995 和 47295996 中的事务在并行复制的 relay log 中程序提交,这一点在现场并没有进行观测,复现时给出证据。
  2. 在从库中线程 47295996 执行较快,首先 47295996 线程取得 commit 锁,并进入 waiting for preceding transaction to commit 的状态,而后线程 47295995 也取得 commit 锁。
  3. 客户端执行 set global read_only , 进入 waiting for commit 的状态。
  4. 事务 a 的 insert 语句在取得 GRL 之后执行,那么事务 a 就进入 waiting for global read lock 的状态,也就是现场呈现的景象。
  5. 如果事务 a 的 insert 语句在取得 GRL 之前执行,同时取得 global commit 锁之后利用 Xid_event 提交事务,则进入 waiting for commit lock 的状态,即第二种死锁,也就是上面复现到的第二种死锁景象。

解决

  • 计划 1:kill 掉加全局读锁的线程,但因为客户环境中加全局读锁的线程为 proxy_monitor 下发的数量高达上千条,也不是很容易解决掉,代价小但耗时较长。
  • 计划 2:能够设置 slave_preserve_commit_order=0 敞开从库 binlog 的程序提交,防止这个死锁的呈现。敞开这个参数不需进行复制,只是影响并行复制的事务在从库的提交程序,对汇算、风控业务可能存在影响,但对最终的数据一致性并无影响,所以如果无特地要求从库的 binlog 程序必须与主库保持一致。
  • 在应用 shell 脚本进行从库只读的检测,能够批改下脚本逻辑:如果检测到从库没有敞开只读,能够先进行复制,再开启从库只读,从而不需更改从库事务提交一致性而防止再次发生相干问题。

从库执行以下语句均有可能产生下面的死锁状况:

set global read_only ; grant all(FLUSH PRIVILEGES) ; flush table with read lock ;

  1. 下面的语句均会加 global read lock,通过验证,这种死锁的呈现存在肯定几率,只有呈现在同组提交的 relay log 中靠后的事务回放实现期待程序提交的状况下,才会导致 slave_preserve_commit_order 的 MTS 复制执行时死锁。
  2. 倡议尽量避免业务顶峰在从库执行以上语句,如果防止不了,敞开 slave_preserve_commit_order 参数,这个操作须要依据业务状况而定,可能会因为时序问题存在事中影响业务一致性,但不影响数据的最终一致性。

复现两种死锁

应用 sysbench 模仿业务压力,从库开一个窗口进行语句输出,屡次执行 set global read_only=ON ; 才会卡住,观测输入:

kill 掉 set global read_only=ON ; 这条语句之后,复制恢复正常,咱们看看在连贯 hang 住时的状况。

死锁 1 - 客户现场

查看对应 performance_schema.threads 表中的信息依据事务的提交程序,确定对应线程提交程序,阐明客户现场线程号小的执行的事务提交程序在线程号大的执行事务之前。依据事务的提交程序,确定对应线程提交程序,阐明客户现场线程号小的执行的事务提交程序在线程号大的执行事务之前。

查看 performance_schema.metadata_locks 表中锁持有关系

加锁模型和下面剖析中的的死锁闭环雷同。

依据事务的提交程序,确定对应线程提交程序,阐明客户现场线程号小的执行的事务提交程序在线程号大的执行事务之前。

死锁 2

查看对应 performance_schema.threads 表中的信息

查看 performance_schema.metadata_locks 表中锁持无关

依据锁关系画出相应死锁闭环

除了必须的锁外,事务会额定申请 MDL_key::GLOBAL 锁的 MDL_INTENTION_EXCLUSIVE 锁;在事务提交前,会先申请 MDL_key::COMMIT 锁的 MDL_INTENTION_EXCLUSIVE 锁,这个锁是 S 锁,对于范畴锁来说,IX 锁和 S 锁是不兼容的。

依据事务的提交程序,确定对应线程提交程序

参考:

https://sq.163yun.com/blog/ar…
http://dbaplus.cn/news-11-187…
http://mysql.taobao.org/month…
bug 工单:https://bugs.mysql.com/bug.ph…

退出移动版