共计 8832 个字符,预计需要花费 23 分钟才能阅读完成。
作者:余振兴
爱可生 DBA 团队成员,热衷技术分享、编写技术文档。
本文起源:原创投稿
* 爱可生开源社区出品,原创内容未经受权不得随便应用,转载请分割小编并注明起源。
背景信息
因为平安因素, 客户须要将 MySQL 降级到 8.0.26 版本, 但因为 8.0.26 的 一些术语的不兼容性变更
, 对于监控采集的工具 / 程序会出现异常, 针对这个状况,MySQL 官网也提供了解决方案,那就是新增了一个参数terminology_use_previous
,当将该参数设置为BEFORE_8_0_26
时,能够放弃 8.0.26 版本之前的术语模式,如仍旧放弃 master,slave 的术语模式,以下是官网文档 8.0.26 release note 的形容片段摘要
Incompatible Change: From MySQL 8.0.26, new aliases or replacement names are provided for most remaining identifiers that contain the terms“master”, which is changed to“source”;“slave”, which is changed to“replica”; and“mts”(for“multithreaded slave”), which is changed to“mta”(for“multithreaded applier”). Help text is also changed where applicable to use the new names.
If the incompatible changes do have an impact for you, you can set the new system variable terminology_use_previous to BEFORE_8_0_26 to make MySQL Server use the old versions of the names for the objects specified in the previous list. This enables monitoring tools that rely on the old names to continue working until they can be updated to use the new names. The system variable can be set with session scope to support individual functions, or global scope to be a default for all new sessions. When global scope is used, the slow query log contains the old versions of the names.
当降级到 8.0.26 实现后, 数据库开启失常监控采集, 频繁的触发了 MySQL crash,须要剖析是什么起因导致,以下的剖析日志均为测试环境模拟。
故障剖析
查看 crash 日志咱们能够看到 crash 时 MySQL 最初执行的是一条 select * from information_schema.processlist
命令,从 crash 的堆栈信息来看,根本就是执行一个一般的 sql 触发了 libstdc 这个库的异样,导致数据库发出信号间接 abort MySQL,集体不太懂代码。从错误信息中 basic_string::_S_construct null not valid
的关键字搜寻理解到,这是一个 C++ 的报错, 大抵含意是:一个实例化的逻辑谬误导致程序停止。在 what()函数中不能应用 NULL 来结构 basic_string 的对象.
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid
10:11:30 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x7f9924000c20
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f9980596d80 thread_stack 0x46000
/data/mysql/3320/base/bin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x2251b8e]
/data/mysql/3320/base/bin/mysqld(handle_fatal_signal+0x323) [0x10fc283]
/lib64/libpthread.so.0(+0xf5f0) [0x7f99908d25f0]
/lib64/libc.so.6(gsignal+0x37) [0x7f998eb1d337]
/lib64/libc.so.6(abort+0x148) [0x7f998eb1ea28]
/lib64/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f998f42da95]
/lib64/libstdc++.so.6(+0x5ea06) [0x7f998f42ba06]
/lib64/libstdc++.so.6(+0x5ea33) [0x7f998f42ba33]
/lib64/libstdc++.so.6(+0x5ec53) [0x7f998f42bc53]
/lib64/libstdc++.so.6(std::__throw_logic_error(char const*)+0x77) [0x7f998f480857]
/data/mysql/3320/base/bin/mysqld() [0xea8720]
/lib64/libstdc++.so.6(std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)+0x38) [0x7f998f48c998]
/data/mysql/3320/base/bin/mysqld(THD::proc_info(System_variables const&) const+0x43) [0xf35d33]
/data/mysql/3320/base/bin/mysqld() [0x10216b0]
/data/mysql/3320/base/bin/mysqld(Fill_process_list::operator()(THD*)+0x2c8) [0x1024b58]
/data/mysql/3320/base/bin/mysqld(Global_THD_manager::do_for_all_thd_copy(Do_THD_Impl*)+0x220) [0xebc6d0]
/data/mysql/3320/base/bin/mysqld() [0x1020a13]
/data/mysql/3320/base/bin/mysqld(do_fill_information_schema_table(THD*, TABLE_LIST*, Item*)+0x83) [0x1025923]
/data/mysql/3320/base/bin/mysqld(MaterializeInformationSchemaTableIterator::Init()+0x91) [0x14550a1]
/data/mysql/3320/base/bin/mysqld(Query_expression::ExecuteIteratorQuery(THD*)+0x345) [0x1085465]
/data/mysql/3320/base/bin/mysqld(Query_expression::execute(THD*)+0x2c) [0x10856cc]
/data/mysql/3320/base/bin/mysqld(Sql_cmd_dml::execute(THD*)+0x2c5) [0x101a885]
/data/mysql/3320/base/bin/mysqld(mysql_execute_command(THD*, bool)+0xac8) [0xfbc2c8]
/data/mysql/3320/base/bin/mysqld(dispatch_sql_command(THD*, Parser_state*)+0x410) [0xfc0b00]
/data/mysql/3320/base/bin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x1f10) [0xfc2e90]
/data/mysql/3320/base/bin/mysqld(do_command(THD*)+0x174) [0xfc3bf4]
/data/mysql/3320/base/bin/mysqld() [0x10ed858]
/data/mysql/3320/base/bin/mysqld() [0x2778d4c]
/lib64/libpthread.so.0(+0x7e65) [0x7f99908cae65]
/lib64/libc.so.6(clone+0x6d) [0x7f998ebe588d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f9924019488): select * from information_schema.processlist
Connection ID (thread ID): 9
Status: NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
如果是程序逻辑的问题, 对于咱们就有点有力了, 但还是尝试再进一步采集相干信息观测, 咱们配置在 MySQL crash 时主动生成生成 coredump,通过整顿查看到以下信息 (片段有删减, 只截取了局部要害片段),在对_S_construct
构造做定义时呈现了__throw_logic_error
,根本和 carsh 日志中记录的信息统一(能力无限,未能详细分析其中代码逻辑),coredump 的生成形式可参考链接:https://opensource.actionsky….
Thread 1 (Thread 0x7f1ee80c9700 (LWP 12545)):
#0 0x00007f1efa7e3a01 in pthread_kill () from /lib64/libpthread.so.0
#1 0x00000000010fc2ad in handle_fatal_signal ()
#2 <signal handler called>
#3 0x00007f1ef8a33277 in raise () from /lib64/libc.so.6
#4 0x00007f1ef8a34968 in abort () from /lib64/libc.so.6
#5 0x00007f1ef93427d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#6 0x00007f1ef9340746 in ?? () from /lib64/libstdc++.so.6
#7 0x00007f1ef9340773 in std::terminate() () from /lib64/libstdc++.so.6
#8 0x00007f1ef9340993 in __cxa_throw () from /lib64/libstdc++.so.6
#9 0x00007f1ef9395597 in std::__throw_logic_error(char const*) () from /lib64/libstdc++.so.6
#10 0x0000000000ea8720 in char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) ()
#11 0x00007f1ef93a1778 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#12 0x0000000000f35d33 in THD::proc_info(System_variables const&) const ()
#13 0x00000000010216b0 in thread_state_info(THD*, THD*) ()
#14 0x0000000001024b58 in Fill_process_list::operator()(THD*) ()
#15 0x0000000000ebc6d0 in Global_THD_manager::do_for_all_thd_copy(Do_THD_Impl*) ()
#16 0x0000000001020a13 in fill_schema_processlist(THD*, TABLE_LIST*, Item*) ()
#17 0x0000000001025923 in do_fill_information_schema_table(THD*, TABLE_LIST*, Item*) ()
#18 0x00000000014550a1 in MaterializeInformationSchemaTableIterator::Init() ()
#19 0x0000000001085465 in Query_expression::ExecuteIteratorQuery(THD*) ()
#20 0x00000000010856cc in Query_expression::execute(THD*) ()
#21 0x000000000101a885 in Sql_cmd_dml::execute(THD*) ()
#22 0x0000000000fbc2c8 in mysql_execute_command(THD*, bool) ()
#23 0x0000000000fc0b00 in dispatch_sql_command(THD*, Parser_state*) ()
#24 0x0000000000fc2e90 in dispatch_command(THD*, COM_DATA const*, enum_server_command) ()
#25 0x0000000000fc3bf4 in do_command(THD*) ()
#26 0x00000000010ed858 in handle_connection ()
#27 0x0000000002778d4c in pfs_spawn_thread ()
#28 0x00007f1efa7dee25 in start_thread () from /lib64/libpthread.so.0
#29 0x00007f1ef8afbbad in clone () from /lib64/libc.so.6
通过一系列排除式验证,发现如果不思考兼容监控采集的正确性,将 terminology_use_previous
参数设置为默认值 NONE
时,不会触发 crash 的景象, 也就是说是因为该参数的配置导致的异样行为,并且通过一一排除,根本确定以下 2 个关键点
- terminology_use_previous 必须设置为 BEFORE_8_0_26
- 必须有对 PROCESSLIST 表的拜访申请
接下来关注 crash 时记录的对 processlist 表的查问操作,咱们查看官网文档对 processlist 表的阐明能够看到,对 processlist 表的查问会持有 global mutex
,频繁的查问其实会对性能有肯定影响,但实践上并不会造成 crash,当然官网也提供了一个参数performance_sche ma_show_processlist
来实现无锁版的过程状态采集
The default SHOW PROCESSLIST implementation iterates across active threads from within the thread manager while holding a global mutex. This has negative performance consequences, particularly on busy systems. The alternative SHOW PROCESSLIST implementation is based on the Performance Schema processlist table. This implementation queries active thread data from the Performance Schema rather than the thread manager and does not require a mutex.
The SHOW PROCESSLIST statement provides process information by collecting thread data from all active threads. The performance_schema_show_processlist variable determines which SHOW PROCESSLIST implementation to use:
The alternative SHOW PROCESSLIST implementation is based on the Performance Schema processlist table. This implementation queries active thread data from the Performance Schema rather than the thread manager and does not require a mutex.
尝试开启 performance_schema_show_processlist=on
参数及设置 terminology_use_previous=BEFORE_8_0_26
来躲避global mutex
,但仍旧能稳固复现 crash 景象,而应用 performance_schema.threads 表做会话状态的查问则并不会触发 crash,所以实际上尽管官网文档中提到设置 performance_schema_show_processlist=on 和查问 performance_schema.threads 都是不加锁的实现, 但两者的实现形式应该是有差别的。
几番验证下来仍旧定位不到问题所在, 尝试搜寻下 MySQL bug 列表看是否有相似 bug,以及浏览 release note 看是否新版本有对该问题的修复或形容, 失去的状况是并未有相干的 bug 的形容, 但在 8.0.27 的 release note 中提到一句有相关性的信息,当并发的拜访 INFORMATION_SCHEMA.PROCESSLIST 表时, 会导致 MySQL 故障
(该阐明后的 Bug #32625376 是 MySQL 外部 BUG 号,咱们无奈查看到具体信息)
A SELECT query on the INFORMATION_SCHEMA.PROCESSLIST view with concurrent MySQL Server load caused a failure. (Bug #32625376)
尝试在 8.0.26 版本开多个会话,同时执行select * from INFORMATION_SCHEMA.PROCESSLIST
,并且敞开监控采集的烦扰, 果然稳固复现 crash 景象,(用 show processlist 也可复现)。看来的确是这个问题,于是开始将 MySQL 版本升级到 8.0.27,再进行验证,遗憾的是,当降级到 8.0.27 当前仍旧能够复现这个问题,看来并未彻底解决。
于是尝试给 MySQL 提个 BUG 阐明, 很快失去了官网回复, 确认该景象已在外部确认为 BUG,会在行将公布的 8.0.28 版本做修复
故障总结
- 本次集体剖析因能力无限并未彻底理解该 BUG 的深层次起因,但从最近的几个版本来看, 因为一些术语的变动,实对 MySQL 的应用和兼容性上造成了肯定影响
- 对于一些比拟诡异的景象能够尝试搜寻 MySQL BUG 列表或向官网提 issue,也行就能疾速失去后果
- 对于监控采集程序, 如果须要采集会话信息, 举荐应用 performance_schema.threads 表
- 对会话信息的采集频率不要太高,尤其会话较多的状况下,始终会对性能有肯定影响
- 本文中波及的 bug 在单个会话或多个会话低频的拜访 processlist 表都不会触发 crash