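Background
Recently our production MySQL 5.7.20 cluster has been hitting an intermittent problem (sometimes three weeks apart, sometimes only a day or two): mysqld on the primary crashes and triggers a primary/replica failover. The stack trace is as follows.
The stack trace clearly shows that the crash happens during the call to try_acquire_lock_impl.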
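Analysis
Searching the official bug database turned up nothing similar, so I went to the source repository instead and found the matching bug fix, commit 8bc828b982f678d6b57c1853bbe78080c8f84e84: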
BUG#26502135: MYSQLD SEGFAULTS IN MDL_CONTEXT::TRY_ACQUIRE_LOCK_IMPL

ANALYSIS:
=========
Server sometimes exited when multiple threads tried to acquire and release metadata locks simultaneously (for example, necessary to access a table). The same problem could have occurred when new objects were registered/deregistered in Performance Schema.

The problem was caused by a bug in LF_HASH - our lock free hash implementation which is used by metadata locking subsystem in 5.7 branch. In 5.5 and 5.6 we only use LF_HASH in Performance Schema Instrumentation implementation. So for these versions, the problem was limited to P_S.

The problem was in my_lfind() function, which searches for the specific hash element by going through the elements list. During this search it loads information about element checked such as key pointer and hash value into local variables. Then it confirms that they are not corrupted by concurrent delete operation (which will set pointer to 0) by checking if element is still in the list. The latter check did not take into account that compiler (and processor) can reorder reads in such a way that load of key pointer will happen after it, making result of the check invalid.

FIX:
====
This patch fixes the problem by ensuring that no such reordering can take place. This is achieved by using my_atomic_loadptr() which contains compiler and processor memory barriers for the check mentioned above and other similar places.

The default (for non-Windows systems) implementation of my_atomic*() relies on old __sync intrisics and implements my_atomic_loadptr() as read-modify operation. To avoid scalability/performance penalty associated with addition of my_atomic_loadptr()'s we change the my_atomic*() to use newer __atomic intrisics when available. This new default implementation doesn't have such a drawback.
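In short:
When multiple threads acquire and release metadata locks at the same time, or when objects are registered/deregistered in Performance Schema, this problem can be triggered and crash the MySQL server.
The problem is caused by a bug in LF_HASH (Lock-Free Extensible Hash Tables). So where is LF_HASH used?
- In 5.5 and 5.6 it is used only in the Performance Schema Instrumentation module.
- In 5.7 it is also used by the metadata locking module.
The fault is in the my_lfind() function. Its check that *cursor->prev still points to cursor->curr was a plain read, so the compiler or processor could reorder it ahead of the load of the key pointer, which makes the check meaningless. The patch fixes this by performing the check through my_atomic_loadptr(), which includes the necessary memory barriers: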
diff --git a/mysys/lf_hash.c b/mysys/lf_hash.c
index dc019b07bd9..3a3f665a4f1 100644
--- a/mysys/lf_hash.c
+++ b/mysys/lf_hash.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2006, 2016, Oracle and/or its affiliates. All rights reserved.
+/* Copyright (c) 2006, 2017, Oracle and/or its affiliates. All rights reserved.
 
   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
@@ -83,7 +83,8 @@ retry:
   do { /* PTR() isn't necessary below, head is a dummy node */
     cursor->curr= (LF_SLIST *)(*cursor->prev);
     _lf_pin(pins, 1, cursor->curr);
-  } while (*cursor->prev != (intptr)cursor->curr && LF_BACKOFF);
+  } while (my_atomic_loadptr((void**)cursor->prev) != cursor->curr &&
+           LF_BACKOFF);
   for (;;)
   {
     if (unlikely(!cursor->curr))
@@ -97,7 +98,7 @@ retry:
     cur_hashnr= cursor->curr->hashnr;
     cur_key= cursor->curr->key;
     cur_keylen= cursor->curr->keylen;
-    if (*cursor->prev != (intptr)cursor->curr)
+    if (my_atomic_loadptr((void**)cursor->prev) != cursor->curr)
     {
       (void)LF_BACKOFF;
       goto retry;
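To make the "read the fields, then re-check the pointer" pattern concrete, here is a minimal standalone sketch in C11. It is not the MySQL code: the `node` and `snapshot` types and the `read_validated()` function are hypothetical names invented for illustration, hazard-pointer pinning (`_lf_pin`) is left out entirely, and the ordering is expressed with relaxed atomic loads plus an acquire fence from `<stdatomic.h>`, which is a different mechanism from `my_atomic_loadptr()`. The point is only that the field loads must not be allowed to drift past the validation load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list node; NOT the real LF_SLIST layout. */
typedef struct node {
    _Atomic(struct node *) next;
    _Atomic(const char *) key;   /* a concurrent delete would reset this */
    atomic_uint hashnr;
} node;

/* Local copy of the fields read from a node. */
typedef struct snapshot {
    node *curr;
    const char *key;
    unsigned hashnr;
} snapshot;

/*
 * Try once to read the node that *prev points to.
 * Returns 1 and fills *out if the copy was validated (or the list ended),
 * returns 0 if *prev changed in the meantime and the caller should retry.
 */
static int read_validated(_Atomic(node *) *prev, snapshot *out)
{
    node *curr = atomic_load_explicit(prev, memory_order_acquire);
    if (curr == NULL) {
        out->curr = NULL;
        out->key = NULL;
        out->hashnr = 0;
        return 1;
    }

    /* Copy the fields into locals, as my_lfind() does. */
    unsigned hashnr = atomic_load_explicit(&curr->hashnr, memory_order_relaxed);
    const char *key = atomic_load_explicit(&curr->key, memory_order_relaxed);

    /* Keep the two field loads above from being reordered
       after the re-check of *prev below. */
    atomic_thread_fence(memory_order_acquire);

    if (atomic_load_explicit(prev, memory_order_relaxed) != curr)
        return 0;   /* the node was unlinked while we were reading it */

    out->curr = curr;
    out->key = key;
    out->hashnr = hashnr;
    return 1;
}
```

A caller would loop on `read_validated()` until it returns 1, much like the `goto retry` in the patch. Without the fence (or an equivalent barrier), the validation can succeed even though the key was loaded after a concurrent delete had already cleared it, which is the failure the commit message describes.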
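Resolution
According to the change log, this problem was fixed in 5.7.22:
A server exit could result from simultaneous attempts by multiple threads to register and deregister metadata Performance Schema objects, or to acquire and release metadata locks. (Bug #26502135)
We upgraded the server to 5.7.29 and monitored it for one month afterwards. The crash did not recur, so the problem is resolved.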
PS:
Space is limited here; follow-up articles will analyze the MDL and LF_HASH source code separately, so stay tuned.
You are welcome to follow my WeChat official account 【MySQL数据库技术】.
Zhihu, 数据库技术 column: https://zhuanlan.zhihu.com/my...
SegmentFault: https://segmentfault.com/u/db...
OSChina: https://my.oschina.net/dbtech
Juejin: https://juejin.im/user/5e9d3e...
cnblogs: https://www.cnblogs.com/dbtech