乐趣区

关于数据库:Oracle案例一次gc-buffer-busy-acquire诊断

本案例来自某客户两节点 rac 的一次生产故障,景象是大面积的 gc buffer busy acquire 导致业务瘫痪。
首先查看 1 节点 AWR 头部信息和 load profile:


1 节点 AWR
失去的要害信息点:

对于 LCPU 256 的零碎,AAS=13379.42/59.91=223,阐明零碎十分忙碌或者遇到了异样期待。

sessions 异样增长好几倍,DB CPU/DB Time 占比非常低,阐明是遇到了异样期待。

其余指标都还算失常。

既然是遇到异样期待,那么就看看 top event 局部:

1 节点 AWR

能够看到大量的 wait class 为 Cluster 的 session,top event 也看到大量的 gc buffer busy acquire 期待事件,该期待事件十分常见就不独自解释了,粗略计算 cluster 期待事件占据了 dbtime 的 90% 左右。

当遇到大量 Cluster 期待事件的时候,必须先看看 RAC Statistics:

1 节点 AWR

能够看到每秒传输的 block 以及 message 都不多,流量也并不大,所以齐全没有必要去查看 SQL ordered by Cluster Wait Time 局部,持续往下看发现 Avg global cache cr block receive time (ms)过高,达到了 1473ms。判断此次大量的 cluster 期待是因为承受远端实例发送 cr block 过慢导致。

gc cr block receive time = gc cr block (flush time + build time + send time)
从公式能够看出是远端实例的 gc cr block flush time /build time/send time 呈现了问题,所以此时须要去看看 2 节点 AWR 的 RAC Statistics:

2 节点 AWR

能够看到 Avg global cache cr block flush time (ms)十分高,对于 current block flush redo 的行为有很多的介绍,这里就不解释了。对于 cr block flush redo 的行为,通常在须要从远端实例的 current block 结构 cr block 时才会产生。

Normally CR block buffer processing does not include the‘gc cr block flush time’. However, when a CR buffer is cloned from a current buffer that has redo pending, a log flush for a CR block transfer is required. A high percentage is indicative of hot blocks with frequent read after write access.
对于 current/cr flush time 提早较高,通常有两种可能:

LGWR 写性能差;

LGWR 被阻塞。

所以下一步思路是间接去看看 2 节点 AWR 的 Background Wait Events 和 Wait Event Histogram 查看 LGWR 的写性能如何,是否稳固。

2 节点 AWR


2 节点 AWR

能够看到 2 节点 LGWR 写性能十分稳固,并且提早也失常。那么就不是 LGWR 写性能的问题,很有可能是 LGWR 被阻塞。

2 节点 AWR

通过 top event 看到了 2 节点有大量期待是期待日志切换实现,阐明的确 LGWR 受到了阻塞,这个时候是时候去剖析 ash 了,能够间接过滤其余信息去独自查看 LGWR 的 ash 信息。

SQL> select to_char(sample_time,'yyyy-mm-dd hh24:mi:ss'),program,session_id,event,seq#,BLOCKING_SESSION,BLOCKING_INST_ID from m_ash where program like '%LGWR%' and inst_id=2 order by 1;
TO_CHAR(SAMPLE_TIME PROGRAM                        SESSION_ID EVENT                                SEQ# BLOCKING_SESSION BLOCKING_INST_ID
------------------- ------------------------------ ---------- ------------------------------ ---------- ---------------- ----------------
...
2020-03-07 15:44:40 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:41 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:42 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:43 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:44 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:45 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:46 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:47 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:48 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:49 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:50 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:51 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:52 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:53 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:54 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:55 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:56 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:57 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
2020-03-07 15:44:58 oracle@sxdb02 (LGWR)                  913 enq: CF - contention                 3850             2224                1
...

能够看到从 15:44:40 开始,LGWR 就开始期待 CF 队列,并且始终继续十分长的工夫,阻塞会话是节点 1 的 sid 为 2224 的会话。联合之前 AWR 的剖析这里猜想是在 15:44:40 进行了日志切换,因为日志切换须要去读取管制文件。那么上面看看 blocking session 的信息。

SQL> select to_char(sample_time,'yyyy-mm-dd hh24:mi:ss'),program,session_id,event,seq#,BLOCKING_SESSION,BLOCKING_INST_ID from m_ash where session_id=2224 and inst_id=1 order by 1;
TO_CHAR(SAMPLE_TIME PROGRAM                   SESSION_ID EVENT                                SEQ# BLOCKING_SESSION BLOCKING_INST_ID
------------------- ------------------------- ---------- ------------------------------ ---------- ---------------- ----------------
...
2020-03-07 15:37:41 oracle@sxdb01 (M000)            2224 control file sequential read          479
2020-03-07 15:37:42 oracle@sxdb01 (M000)            2224                                       566
2020-03-07 15:37:43 oracle@sxdb01 (M000)            2224 control file sequential read          632
2020-03-07 15:37:44 oracle@sxdb01 (M000)            2224 control file sequential read          684
2020-03-07 15:37:45 oracle@sxdb01 (M000)            2224                                       736
2020-03-07 15:37:46 oracle@sxdb01 (M000)            2224                                       781
2020-03-07 15:37:47 oracle@sxdb01 (M000)            2224                                       824
2020-03-07 15:37:48 oracle@sxdb01 (M000)            2224                                       865
2020-03-07 15:37:49 oracle@sxdb01 (M000)            2224                                       915
2020-03-07 15:37:50 oracle@sxdb01 (M000)            2224                                      1031
2020-03-07 15:37:51 oracle@sxdb01 (M000)            2224                                      1183
2020-03-07 15:37:52 oracle@sxdb01 (M000)            2224 control file sequential read         1304
2020-03-07 15:37:53 oracle@sxdb01 (M000)            2224                                      1400
2020-03-07 15:37:54 oracle@sxdb01 (M000)            2224                                      1481
2020-03-07 15:37:55 oracle@sxdb01 (M000)            2224 control file sequential read         1631
2020-03-07 15:37:56 oracle@sxdb01 (M000)            2224 control file sequential read         1834
2020-03-07 15:37:57 oracle@sxdb01 (M000)            2224 control file sequential read         1947
2020-03-07 15:37:58 oracle@sxdb01 (M000)            2224 control file sequential read         2052
2020-03-07 15:37:59 oracle@sxdb01 (M000)            2224 control file sequential read         2159
2020-03-07 15:38:00 oracle@sxdb01 (M000)            2224 control file sequential read         2269
2020-03-07 15:38:01 oracle@sxdb01 (M000)            2224 control file sequential read         2404
2020-03-07 15:38:02 oracle@sxdb01 (M000)            2224 control file sequential read         2517
2020-03-07 15:38:03 oracle@sxdb01 (M000)            2224 control file sequential read         2672
2020-03-07 15:38:04 oracle@sxdb01 (M000)            2224                                      2801
2020-03-07 15:38:05 oracle@sxdb01 (M000)            2224                                      2857
2020-03-07 15:38:06 oracle@sxdb01 (M000)            2224                                      2866
2020-03-07 15:38:07 oracle@sxdb01 (M000)            2224 control file sequential read         2893
2020-03-07 15:38:08 oracle@sxdb01 (M000)            2224                                      3007
2020-03-07 15:38:09 oracle@sxdb01 (M000)            2224 control file sequential read         3111
2020-03-07 15:38:10 oracle@sxdb01 (M000)            2224                                      3184
2020-03-07 15:38:11 oracle@sxdb01 (M000)            2224 control file sequential read         3218
2020-03-07 15:38:12 oracle@sxdb01 (M000)            2224 control file sequential read         3263
2020-03-07 15:38:13 oracle@sxdb01 (M000)            2224                                      3304
..

这里看到阻塞过程为 1 节点的 M000,从 15:37 分就开始持有 CF 锁始终在读取管制文件,继续了十分久的工夫,导致 2 节点日志切换时,2 节点 LGWR 无奈持有 CF 锁。M000 为 MMON 过程的 slave 过程,对于 MMON 过程咱们晓得通常都是跟 AWR 无关,为什么会一直的读取管制文件呢?

联合 diag 产生的 systemstate dump 里去查看 M000 的 short_stack 信息:

Short stack dump: ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pread()+12<-pread()+112<-skgfqio()+532<-ksfd_skgfqio()+756<-ksfd_io()+676<-ksfdread()+640<-kfk_ufs_sync_io()+416<-kfk_submit_io()+260<-kfk_io1()+916<-kfk_transitIO()+2512<-kfioSubmitIO()+408<-kfioRequestPriv()+220<-kfioRequest()+472<-ksfd_kfioRequest()+444<-ksfd_osmio()+2956<-ksfd_io()+1868<-ksfdread()+640<-kccrbp()+496<-kccrec_rbl()+296<-kccrec_read_write()+1680<-kccrrc()+1072<-krbm_cleanup_map()+28<-kgghstmap()+92<-krbm_cleanup_backup_records()+1100<-kraalac_slave_action()+1016<-kebm_slave_main()+744<-ksvrdp()+1928<-opirip()+1972<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380
kraalac_slave_action ->krbm_cleanup_backup_records-> krbm_cleanup_map -> kccrrc

这里能够看到 M000 过后在清理备份记录而去读取的管制文件,从 M000 trace 或者 ash 里 M000 的 action name 能够看到过后 m000 的 action 为 Monitor FRA Space,阐明是 MMON 发动 slave 去做 Monitor FRA Space,而 FRA 空间有余所以触发的清理一些 FRA 里的备份记录从而读取的管制文件。
晓得了 M000 为何要去读取管制文件,那么下一个问题就是为什么会读那么久呢?管制文件过大?管制文件读取过慢?还是从 systemstate dump 中找到了答案:

 SO: 0x22e13908e0, type: 10, owner: 0x23012ace68, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
     proc=0x23012ace68, name=FileOpenBlock, file=ksfd.h LINE:6688 ID:, pg=0
    (FOB) 22e13908e0 flags=2560 fib=22fde0d288 incno=0 pending i/o cnt=0
     fname=+FASTDG/sxboss/controlfile/current.256.907844431
     fno=1 lblksz=16384 fsiz=118532

管制文件大小为 lblksz fsiz=16384118532=1.8g,为何管制文件会那么大呢?

因为 control_file_record_keep_time 设置为了 365 天。

解决方案:

1. 依据备份策略正当设置 control_file_record_keep_time;

2. 重建管制文件。

该案例失去的播种就是平时对 MMON 过程的作用理解过少,通过 KST trace 跟踪 MMON,发现 MMON 的作用十分十分多,并不只是与 AWR 相干。KST trace 跟踪到的 MMON 的 action 如下:

(MMON) : (infrequent action) : acnum=&#91;133] comment=&#91;deferred controlfile autobackup action]     
(MMON) : (infrequent action) : acnum=&#91;150] comment=&#91;recovery area alert action]     
(MMON) : (infrequent action) : acnum=&#91;167] comment=&#91;undo usage]       
(MMON) : (infrequent action) : acnum=&#91;171] comment=&#91;Block Cleanout Optim, Undo Segment Scan]   
(MMON) : (infrequent action) : acnum=&#91;175] comment=&#91;Flashback Archive RAC Health Check]    
(MMON) : (infrequent action) : acnum=&#91;178] comment=&#91;tune undo retention lob]     
(MMON) : (infrequent action) : acnum=&#91;179] comment=&#91;MMON Periodic LOB MQL Selector]    
(MMON) : (infrequent action) : acnum=&#91;180] comment=&#91;MMON Periodic LOB Spc Analyze ]   
(MMON) : (infrequent action) : acnum=&#91;183] comment=&#91;tablespace alert monitor]      
(MMON) : (infrequent action) : acnum=&#91;197] comment=&#91;OLS Cleanup]       
(MMON) : (infrequent action) : acnum=&#91;205] comment=&#91;Sample Shared Server Activity]     
(MMON) : (infrequent action) : acnum=&#91;212] comment=&#91;Compute cache stats in background]    
(MMON) : (infrequent action) : acnum=&#91;213] comment=&#91;SPM: Auto-purge expired SQL plan baselines]   
(MMON) : (infrequent action) : acnum=&#91;214] comment=&#91;SPM: Check SMB size]     
(MMON) : (infrequent action) : acnum=&#91;215] comment=&#91;SPM: Delete excess sqllog$ batches]    
(MMON) : (infrequent action) : acnum=&#91;219] comment=&#91;KSXM Advance DML Frequencies]     
(MMON) : (infrequent action) : acnum=&#91;220] comment=&#91;KSXM Broadcast DML Frequencies]     
(MMON) : (infrequent action) : acnum=&#91;225] comment=&#91;Cleanup client cache server state in background]  
(MMON) : (infrequent action) : acnum=&#91;226] comment=&#91;MMON TSM Cleanup]      
(MMON) : (infrequent action) : acnum=&#91;296] comment=&#91;alert message cleanup]      
(MMON) : (infrequent action) : acnum=&#91;297] comment=&#91;alert message purge]      
(MMON) : (infrequent action) : acnum=&#91;298] comment=&#91;AWR Auto Flush Task]     
(MMON) : (infrequent action) : acnum=&#91;299] comment=&#91;AWR Auto Purge Task]     
(MMON) : (infrequent action) : acnum=&#91;300] comment=&#91;AWR Auto DBFUS Task]     
(MMON) : (infrequent action) : acnum=&#91;301] comment=&#91;AWR Auto CPU USAGE Task]    
(MMON) : (infrequent action) : acnum=&#91;305] comment=&#91;Advisor delete expired tasks]     
(MMON) : (infrequent action) : acnum=&#91;313] comment=&#91;run-once action driver]      
(MMON) : (infrequent action) : acnum=&#91;319] comment=&#91;metrics monitoring]       
(MMON) : (infrequent action) : acnum=&#91;322] comment=&#91;sql tuning hard kill defense]    
(MMON) : (infrequent action) : acnum=&#91;323] comment=&#91;autotask status check]      
(MMON) : (infrequent action) : acnum=&#91;324] comment=&#91;Maintain AWR Baseline Thresholds Task]    
(MMON) : (infrequent action) : acnum=&#91;325] comment=&#91;WCR: Record Action Switcher]     
(MMON) : (infrequent action) : acnum=&#91;331] comment=&#91;WCR: Replay Action Switcher]     
(MMON) : (infrequent action) : acnum=&#91;338] comment=&#91;SQL Monitoring Garbage Collector]     
(MMON) : (infrequent action) : acnum=&#91;344] comment=&#91;Coordinator autostart timeout]
(MMON) : (infrequent action) : acnum=&#91;348] comment=&#91;ADR Auto Purge Task]     
(MMON) : (infrequent action) : acnum=&#91;41] comment=&#91;reload failed KSPD callbacks]     
(MMON) : (infrequent action) : acnum=&#91;75] comment=&#91;flashcache object keep monitor]     
(MMON) : (interrupt action) : acnum=&#91;108] comment=&#91;Scumnt mount lock]      
(MMON) : (interrupt action) : acnum=&#91;109] comment=&#91;Poll system events broadcast channel]    
(MMON) : (interrupt action) : acnum=&#91;20] comment=&#91;KSB action for ksbxic() calls]    
(MMON) : (interrupt action) : acnum=&#91;2] comment=&#91;KSB action for X-instance calls]    
(MMON) : (interrupt action) : acnum=&#91;306] comment=&#91;MMON Remote action Listener]     
(MMON) : (interrupt action) : acnum=&#91;307] comment=&#91;MMON Local action Listener]     
(MMON) : (interrupt action) : acnum=&#91;308] comment=&#91;MMON Completion Callback Dispatcher]     
(MMON) : (interrupt action) : acnum=&#91;309] comment=&#91;MMON set edition interrupt action]    
(MMON) : (interrupt action) : acnum=&#91;341] comment=&#91;Check for sync messages from other instances]  
(MMON) : (interrupt action) : acnum=&#91;343] comment=&#91;Check for autostart messages from other instances]  
(MMON) : (interrupt action) : acnum=&#91;350] comment=&#91;Process staged incidents]      
(MMON) : (interrupt action) : acnum=&#91;351] comment=&#91;DDE MMON action to schedule async action slaves] 
(MMON) : (interrupt action) : acnum=&#91;39] comment=&#91;MMON request for RLB metrics]    
(MMON) : (requested action) : acnum=&#91;314] comment=&#91;shutdown MMON]       
(MMON) : (requested action) : acnum=&#91;315] comment=&#91;MMON DB open]      
(MMON) : (requested action) : acnum=&#91;321] comment=&#91;ADDM (KEH)]       
(MMON) : (requested action) : acnum=&#91;347] comment=&#91;Job Autostart action force]     
(MMON) : (requested action) : acnum=&#91;349] comment=&#91;Schedule slave to update incident meter]   
(MMON) : (requested action) : acnum=&#91;63] comment=&#91;SGA memory tuning parameter update]    
(MMON) : (timeout action) : acnum=&#91;0] comment=&#91;Monitor Cleanup]       
(MMON) : (timeout action) : acnum=&#91;11] comment=&#91;Update shared pool advice stats]    
(MMON) : (timeout action) : acnum=&#91;154] comment=&#91;Flashback Marker]       
(MMON) : (timeout action) : acnum=&#91;172] comment=&#91;Block Cleanout Optim, Rac specific code]   
(MMON) : (timeout action) : acnum=&#91;173] comment=&#91;BCO:]        
(MMON) : (timeout action) : acnum=&#91;1] comment=&#91;Update KGSTM Translation]      
(MMON) : (timeout action) : acnum=&#91;3] comment=&#91;KSB action for bast checking]    
(MMON) : (timeout action) : acnum=&#91;42] comment=&#91;reconfiguration MMON action]      
(MMON) : (timeout action) : acnum=&#91;63] comment=&#91;SGA memory tuning parameter update]    
(MMON) : (timeout action) : acnum=&#91;69] comment=&#91;SGA memory tuning]

。。。
大略有几百种 action,平时对 MMON 关注还是太少了。

墨天轮原文链接:https://www.modb.pro/db/17494…(复制链接至浏览器查看)

对于作者
李翔宇,云和恩墨西区交付技术顾问,长期服务挪动运营商行业客户,相熟 Oracle 性能优化,故障诊断,非凡复原。

退出移动版