作者:王顺

爱可生 DBA 团队成员,在公司负责我的项目中解决数据库问题,喜爱学习技术,钻研技术问题。

本文起源:原创投稿

*爱可生开源社区出品,原创内容未经受权不得随便应用,转载请分割小编并注明起源。


问题背景:

第一天早晨 MySQL 数据库降级,版本从5.6.22到5.7.32,降级过程很顺利,业务验证失常。

两天后业务人员反馈数据库卡住了,sql 执行上来数据库没反馈。

景象:

连上数据库看看发现了什么……

mysql> show processlist;+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+| Id    | User        | Host                | db    | Command     | Time   | State                                                         | Info                                                                                                 |+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+|     1 | system user |                     | NULL  | Connect     | 138205 | Waiting for master to send event                              | NULL                                                                                                 ||     2 | system user |                     | NULL  | Connect     | 137998 | Slave has read all relay log; waiting for more updates        | NULL                                                                                                 ||  6891 | user05      | %:37029             | db01  | Query       | 110410 | Sending data                                                  | INSERT         INTO db01.tb_mrt_par_user_order_info_d     (        deal_date                           || 11528 | user05      | %:44043             | db01  | Query       |  91406 | Sending data                                                  | INSERT         INTO db01.tb_mrt_par_user_order_info_d     (        deal_date                           || 12126 | user02      | %:44966             | db02  | Query       |  89331 | Waiting for table metadata lock                               | ALTER TABLE db02.TB_MID_PAR_USER_PDT_ORDER_DAY TRUNCATE PARTITION P20210810                        || 20896 | repl        | 172.24.64.88:54437  | NULL  | Binlog Dump |  53695 | Master has sent all binlog to slave; waiting for more updates   | NULL                                                                                               || 26520 | user03      | 172.24.64.197:45044 | db03  | Sleep       |     80 |                                                               | NULL                                                                                               || 26886 | user04      | 172.24.64.73:39607  | db04  | Execute     |  29567 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_develop                                                                 || 26887 | user04      | 172.24.64.73:39608  | db04  | Execute     |  29565 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_accid_day                                                               |......                                                                                                                                                          || 28069 | user02      | %:41400             | db02  | Query       |  24872 | Waiting for table metadata lock                               | ALTER TABLE db02.TB_MID_PAR_USER_PDT_ORDER_DAY ADD PARTITION (PARTITION P20210811 VALUES IN ('202108 |......                                                                                                                                     || 33666 | user05      | 10.1.16.10:47068    | db04  | Sleep       |   2275 |                                                               | NULL                                                                                                 || 33788 | user04      | 172.24.64.73:50227  | db04  | Execute     |   1833 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_accid_day                                                                 || 33791 | user04      | 172.24.64.73:50236  | db04  | Execute     |   1826 | Waiting for table metadata lock                               | TRUNCATE TABLE tb_par_user_develop                                                                   |......                                                                                                                                      || 34095 | user05      | 10.1.16.10:54601    | db04  | Query       |    710 | Waiting for table metadata lock                               | SELECT * FROM `tb_par_user_accid_day` LIMIT 0, 1000                                                  |......                                                     | show processlist                                                                                     |+-------+-------------+---------------------+-------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+70 rows in set (0.00 sec)

通过输入信息能够看到,有多个 session 呈现 Waiting for table metadata lock ,查看执行最久的两个 sq 别离是 i d为 6891 和 11528 的两个 insert 语句。

insert 波及的 schema 是db01 ,呈现 Waiting for table metadata lock 是 db02/db04 。

很奇怪,执行语句来自不同的 schema ,是谁持有 MDL 锁堵住前面的语句?想不通……

Waiting for table metadata lock 看到 MDL 锁期待信息,须要查看一下起因。

mysql> select * from performance_schema.metadata_locks;Empty set (0.00 sec)

返回空,没有查到 MDL 锁相干信息,查看环境发现 MDL 相干 instrument 并没有开启。

备注:

长期失效

批改performance_schema.setup_instrume nts表,但实例重启后,又会复原为默认值。

UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME = 'wait/lock/metadata/sql/mdl';

永恒失效

在配置文件中设置

[mysqld]

performance-schema-instrument='wait/lock/metadata/sql/mdl=ON'

持续查看大事务

通过屡次执行 select * from INNODB_TRX 查看正在运行的事务,能够看到两个正在执行的 insert 语句始终执行未提交。

mysql> select * from INNODB_TRX\G*************************** 1. row ***************************                    trx_id: 328588354171408                 trx_state: RUNNING               trx_started: 2021-08-11 09:06:34     trx_requested_lock_id: NULL          trx_wait_started: NULL                trx_weight: 1755570       trx_mysql_thread_id: 11528                 trx_query: INSERT         INTO db01.tb_mrt_par_user_order_info_d     (        deal_date                          ……疏忽字段信息                 ,copartner_code                 trx_operation_state: NULL         trx_tables_in_use: 15         trx_tables_locked: 3442          trx_lock_structs: 1755570     trx_lock_memory_bytes: 159899856           trx_rows_locked: 91807791         trx_rows_modified: 0   trx_concurrency_tickets: 3903       trx_isolation_level: REPEATABLE READ         trx_unique_checks: 1    trx_foreign_key_checks: 1trx_last_foreign_key_error: NULL trx_adaptive_hash_latched: 0 trx_adaptive_hash_timeout: 0          trx_is_read_only: 0trx_autocommit_non_locking: 0*************************** 2. row ***************************                    trx_id: 328588354177792                 trx_state: RUNNING               trx_started: 2021-08-11 03:49:50     trx_requested_lock_id: NULL          trx_wait_started: NULL                trx_weight: 1856514       trx_mysql_thread_id: 6891                 trx_query: INSERT         INTO db01.tb_mrt_par_user_order_info_d     (        deal_date                         ……疏忽字段信息                    ,copartner_code              trx_operation_state: fetching rows         trx_tables_in_use: 15         trx_tables_locked: 3597          trx_lock_structs: 1856514     trx_lock_memory_bytes: 169681104           trx_rows_locked: 101871982         trx_rows_modified: 0   trx_concurrency_tickets: 2338       trx_isolation_level: REPEATABLE READ         trx_unique_checks: 1    trx_foreign_key_checks: 1trx_last_foreign_key_error: NULL trx_adaptive_hash_latched: 0 trx_adaptive_hash_timeout: 0          trx_is_read_only: 0trx_autocommit_non_locking: 02 rows in set, 2 warnings (0.06 sec)

看看残缺的 sql 语句是啥?

执行 show full processlist ;看一下残缺的执行语句。后面的排查只执行了 show processlist 没有看到实现的语句。

mysql> show full processlist;| Id    | User        | Host                | db                 | Command     | Time   | State                                                         | Info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |     1 | system user |                     | NULL               | Connect     | 141084 | Waiting for master to send event                              | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ||     2 | system user |                     | NULL               | Connect     | 140877 | Slave has read all relay log; waiting for more updates        | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ||  6891 | user05      | %:37029             | db01               | Query       | 113289 | Sending data                                                  | INSERT         INTO db01.tb_mrt_par_user_order_info_d     (                          ……疏忽字段信息                                                                        )    SELECT                             ……疏忽字段信息                                                                   FROM   (   SELECT DISTINCT        t1.cust_no           as cust_no                    ……疏忽字段信息  ID                             ,t2.PRODPRCname             as PRODPRCname                        FROM (select * from db02.tb_mid_par_user_day where deal_date=20210810) t1      left join (select * from db02.TB_MID_PAR_USER_PDT_ORDER_DAY where deal_date=20210810 and effdate<=SYSDATE() and expdate>=SYSDATE())t2            on t1.user_no=t2.user_no        left join db04.TB_PAR_USER_ACCID_DAY t3            on t1.user_no=t3.id_no        left join db04.TB_PAR_USER_EXTEND_INFO t4            on t1.user_no=t4.user_no        left join (select * from db04.TB_PDT_PRIVATENETWORK_DAY where deal_date=20210810) t5            on t4.USER_ATTR2=t5.net_id        left join (select * from db04.TB_PAR_USERCONTRACT_REL_DAY where deal_date=20210810 and EXPDATE>=SYSDATE() and EFFDATE<=SYSDATE() and STATUS='1' and BUSI_TYPE='1' and TENANCY_CODE='008' GROUP BY ID_NO,CONTRACT_NO ) t6            on t1.user_no=t6.id_no        left join (select * from db04.TB_PAR_USER_CONTACT_INFO where deal_date=20210810) t7            on t1.user_no=t7.user_no        left join (SELECT * from db04.TB_FIN_CONGRPUSER_RELA_DAY where deal_date=20210810) t8            on t1.user_no=t8.id_no        left join (select * from db04.TB_PAR_ACCT_DAY where deal_date=20210810 and account_type='5') t9            on t8.contract_no=t9.contract_no        left join (SELECT * from db04.TB_CDE_PAR_RUNCODE where tenancy_code=008) t10            on t1.USER_STAT_CODE=t10.run_code        left join db04.TB_PAR_USER_DEVELOP t11            on t11.user_no=t1.user_no        left join (SELECT * from db04.TB_RES_CHNL_ORG_INFO where deal_date=20210810) t12            on t11.develop_dep=t12.group_id        left join db04.TB_RES_LOGIN_NO t13            on t11.develop_name=t13.system_user_code        left join db04.TB_RES_STAFF_INFO t14            on t13.system_user_id=t14.STAFF_ID        /*group by         ……疏忽字段信息                               ) t   /*GROUP BY       ……疏忽字段信息       | 11528 | user05        | %:44043             | db01               | Query       |  94285 | Sending data                                                  | INSERT         INTO db01.tb_mrt_par_user_order_info_d     (                           ……疏忽字段信息                                                                                 )    SELECT                         ……疏忽字段信息                                                                          FROM   (   SELECT DISTINCT                t1.cust_no           as cust_no                        ……疏忽字段信息                           ,t2.PRODPRCname             as PRODPRCname                       FROM (select * from db02.tb_mid_par_user_day where deal_date=20210810) t1      left join (select * from db02.TB_MID_PAR_USER_PDT_ORDER_DAY where deal_date=20210810 and effdate<=SYSDATE() and expdate>=SYSDATE())t2            on t1.user_no=t2.user_no        left join db04.TB_PAR_USER_ACCID_DAY t3            on t1.user_no=t3.id_no        left join db04.TB_PAR_USER_EXTEND_INFO t4            on t1.user_no=t4.user_no        left join (select * from db04.TB_PDT_PRIVATENETWORK_DAY where deal_date=20210810) t5            on t4.USER_ATTR2=t5.net_id        left join (select * from db04.TB_PAR_USERCONTRACT_REL_DAY where deal_date=20210810 and EXPDATE>=SYSDATE() and EFFDATE<=SYSDATE() and STATUS='1' and BUSI_TYPE='1' and TENANCY_CODE='008' GROUP BY ID_NO,CONTRACT_NO ) t6            on t1.user_no=t6.id_no        left join (select * from db04.TB_PAR_USER_CONTACT_INFO where deal_date=20210810) t7            on t1.user_no=t7.user_no        left join (SELECT * from db04.TB_FIN_CONGRPUSER_RELA_DAY where deal_date=20210810) t8            on t1.user_no=t8.id_no        left join (select * from db04.TB_PAR_ACCT_DAY where deal_date=20210810 and account_type='5') t9            on t8.contract_no=t9.contract_no        left join (SELECT * from db04.TB_CDE_PAR_RUNCODE where tenancy_code=008) t10            on t1.USER_STAT_CODE=t10.run_code        left join db04.TB_PAR_USER_DEVELOP t11            on t11.user_no=t1.user_no        left join (SELECT * from db04.TB_RES_CHNL_ORG_INFO where deal_date=20210810) t12            on t11.develop_dep=t12.group_id        left join db04.TB_RES_LOGIN_NO t13            on t11.develop_name=t13.system_user_code        left join db04.TB_RES_STAFF_INFO t14            on t13.system_user_id=t14.STAFF_ID        /*group by         ……疏忽字段信息                       ) t   /*GROUP BY        ……疏忽字段信息            

通过残缺的 sql 咱们能够看到,inster 语句中呈现了多表关连子查问,查问波及的库有 db02 和 db04 。

从业务人员那理解到这两条 insert 是跑批工作,波及的查问语句在降级前20多分钟能跑出数据,降级后1天多都没出数据,这又是什么起因?

先复原业务

已确认两个跑批工作能够先终止从新再跑,终止不会影响线上业务。

kill 掉两个 insert,kill 过程很快,两个语句的查问语句并没有跑出数据,insert 始终没有开始执行。kill 执行胜利后,MDL 锁期待景象很快就隐没了。

起因剖析

依据后面业务人员的形容,降级前间接运行 insert 语句中的查问语句20分钟能够出数据,数据库降级后查问跑不出数据。测试环境降级后并没有验证这两个跑批。

想比照同一条 sql 在 5.6 和 5.7 环境的执行打算的区别,因 5.6 环境曾经降级,无奈进行比对。

以前遇到过同样 5.6 降级 5.7 后,sql 运行慢的是因为派生表问题。

5.7 默认 derived_merge = on ,会将派生表合并,而后在执行 where 条件过滤,然而这样会造成的景象是多个未经过滤的表间接合并,数据量过大 sql 执行速度变慢。

官网文档:https://dev.mysql.com/doc/ref...

解决方案

  1. 是在数据库中敞开这个优化器配置
  2. 是改写业务 sql

因须要尽快恢复业务应用和业务人员协商敞开 derived_merge 项,再次验证跑批运行失常。