关于数据库:Oracle案例SMON回滚异常导致实例Crash

某省电力系统的一个 4 节点 rac，2 节点在早上的时候 crash。

Wed Nov 03 07:59:05 2021
SMON: Restarting fast_start parallel rollback
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0x505E9323] &#091;PC:0x98214CA, kgegpa()+40] &#091;flags: 0x0, count: 1]
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0x505E9323] &#091;PC:0x981FF3E, kgebse()+776] &#091;flags: 0x2, count: 2]
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0x505E9323] &#091;PC:0x981FF3E, kgebse()+776] &#091;flags: 0x2, count: 2]
Wed Nov 03 07:59:07 2021
PMON (ospid: 100111): terminating the instance due to error 474
Wed Nov 03 07:59:07 2021
System state dump requested by (instance=2, osid=100111 (PMON)), summary=&#091;abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_diag_100469_20211103075907.trc
Instance terminated by PMON, pid = 100111

当 SMON 从新发动并行回滚时，实例被 PMON 终止，这里有一个暗藏谬误，经常被忽视，PMON (ospid: 100111): terminating the instance due to error 474

&#091;oracle@dbtest ~]$ oerr ora 474
00474, 00000, "SMON process terminated with error"
// *Cause:  The system cleanup process died
// *Action: Warm start instance

阐明是 PMON 发现 SMON 异样而终止的实例。那么看看 SMON trace 文件：

*** 2021-11-03 07:59:05.420
Parallel Transaction recovery caught error 30317 
SMON: Restarting fast_start parallel rollback
 
*** 2021-11-03 07:59:06.283
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0x505E9323] &#091;PC:0x98214CA, kgegpa()+40] &#091;flags: 0x0, count: 1]
DDE previous invocation failed before phase II
DDE was called in a 'No Invocation Mode'
----- Start Diag Diagnostic Dump -----
Diag diagnostic dump is performed due to an error in the diagfw code during error handling.
DDE is switched to protected mode during the diagnostic dump to prevent recursive errors in the error hadnling code.
Dump error and call stack for the diagnostic dump:
 
*** 2021-11-03 07:59:06.306
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
 
*** 2021-11-03 07:59:06.306
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0x505E9323] &#091;PC:0x981FF3E, kgebse()+776] &#091;flags: 0x2, count: 2]
Not recording an ORA-7445 (previous attmpet failed)
Warning! DDE is invoked in protected mode. DDE call is aborted.
Warning! DDE is invoked in protected mode. DDE call is aborted.
 
*** 2021-11-03 07:59:06.308
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0x505E9323] &#091;PC:0x981FF3E, kgebse()+776] &#091;flags: 0x2, count: 2]
Not recording an ORA-7445 (previous attmpet failed)
ssexhd: crashing the process...
Background_Core_Dump = partial
ksdbgcra: writing core file to directory '/u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/cdump'

SMON 在从新发动回滚时异样之后过程被 crash 了，最终 PMON 终止了实例。持续往前查看 alert 日志发现从 6 点 52 分 46 秒开始报一些异样之后，就开始了 blockrecover 并随同一些 ora-00600[4193]谬误。

Wed Nov 03 06:51:58 2021
Archived Log entry 220723 added for thread 2 sequence 33832 ID 0xbc36136c dest 1:
Wed Nov 03 06:52:46 2021
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0xE559064] &#091;PC:0x98214CA, kgegpa()+40] &#091;flags: 0x0, count: 1]
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0xE559064] &#091;PC:0x981FF3E, kgebse()+776] &#091;flags: 0x2, count: 2]
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0xE559064] &#091;PC:0x981FF3E, kgebse()+776] &#091;flags: 0x2, count: 2]
Wed Nov 03 06:52:48 2021
Block recovery from logseq 33833, block 2030926 to scn 16125071688907
Recovery of Online Redo Log: Thread 2 Group 8 Seq 33833 Reading mem 0
  Mem# 0: +FDATADG/newepmfdp/onlinelog/group_8.278.1083929519
Block recovery completed at rba 33833.2031098.16, scn 3754.1764459724
Block recovery from logseq 33832, block 4457757 to scn 16125071688920
Recovery of Online Redo Log: Thread 2 Group 7 Seq 33832 Reading mem 0
  Mem# 0: +FDATADG/newepmfdp/onlinelog/group_7.279.1083929515
Block recovery interrupted at rba 33832.4613427.16
Block recovery from logseq 33832, block 4613427 to scn 16125071688920
Recovery of Online Redo Log: Thread 2 Group 7 Seq 33832 Reading mem 0
  Mem# 0: +FDATADG/newepmfdp/onlinelog/group_7.279.1083929515
Block recovery interrupted at rba 33832.4769190.16
Block recovery from logseq 33832, block 4769190 to scn 16125071688920
...
Wed Nov 03 07:00:04 2021
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_m000_84504.trc  (incident=320530):
ORA-00600: internal error code, arguments: &#091;4193], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;]
Incident details in: /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/incident/incdir_320530/epmfdp2_m000_84504_i320530.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Nov 03 07:06:09 2021
Sweep &#091;inc]&#091;320530]: completed
Wed Nov 03 07:06:16 2021
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/incident/incdir_320530/epmfdp2_m000_84504_i320530.trc:
ORA-12751: cpu time or run time policy violation
ORA-12751: cpu time or run time policy violation
ORA-00600: internal error code, arguments: &#091;4193], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;]
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/incident/incdir_320530/epmfdp2_m000_84504_i320530.trc:
ORA-12751: cpu time or run time policy violation
ORA-12751: cpu time or run time policy violation
ORA-00600: internal error code, arguments: &#091;4193], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;]
Wed Nov 03 07:06:22 2021
Dumping diagnostic data in directory=&#091;cdmp_20211103070622], requested by (instance=2, osid=84504 (M000)), summary=&#091;incident=320530].
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_m000_84504.trc:
ORA-12751: cpu time or run time policy violation
ORA-00600: internal error code, arguments: &#091;4193], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;]
Wed Nov 03 07:06:28 2021
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_m000_84504.trc:
ORA-12751: cpu time or run time policy violation
ORA-00600: internal error code, arguments: &#091;4193], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;], &#091;]
Wed Nov 03 07:11:57 2021
Sweep &#091;inc2]&#091;320530]: completed
Wed Nov 03 07:31:57 2021
Suspending MMON action 'undo usage' for 82800 seconds

对于 ora-00600[4193]，常常搞复原的太相熟不过了。不过个别在异样断电强制 open 数据库的时候遇到，失常运行的环境很少见，其含意如下：

Meaning: seq# mismatch while adding an undo record to an undo block. This 
         is done by the application of redo. 
---------------------------------------------------------------------------
Argument Description:
 
    a. (ktubhseq): undo record seq# - this is the seq# of the block that 
                                      this undo record WILL BE APPLIED TO. 
                                      This is from the Undo Block. It is 
                                      NOT the seq# of the undo block itself.
                                       
    b. (ktudbseq): redo RECORD seq# - this is the seq# number in the block 
                                      that this redo WILL BE APPLIED TO. 
                                      This is from the Redo Record. 
                                      
When we try to apply redo to an undo block (forward changes are made by 
the application of redo to a block) we check that the seq# in the undo 
record matches the seq# in the redo record. These seq# should be the 
same because when we apply a redo record we must apply it to the 
correct version of the block. We can only apply a redo record to a 
block that contains the same seq# as in the redo record. 
 
If the seq# do not match then this error is raised. This implies some 
kind of block corruption in either the redo or the undo block. 
                                      
ksesic2(OERI(4193), ksenrg(ubh-&gt;ktubhseq), ksenrg(db-&gt;ktudbseq));
 
struct ktubh
{
  kxid  ktubhxid;      /* txid of tx currently using or last used this block */
  ub2   ktubhseq;                              /* undo block sequence number */
  ub1   ktubhcnt;    /* high water mark record index, number of undo entries */
  ub1   ktubhirb;  /* rollback record index, rec index to start the rollback */
  ub1   ktubhicl;  /* collecting record index, rec index to start retrieving col info */
  ub1   ktubhflg;                                                 /* dummy */
  ub2   ktubhidx&#091;1];     /* byte offset of record in block, grows at runtime */
};
 
struct ktudb   Kernel Transaction Undo Data operation Block (redo)
{
  ub2    ktudbsiz;                                          /* size of entry */
  ub2    ktudbspc;                 /* verification: space left in undo block */
  ub2    ktudbflg;            /* flag to indicate the kind of redo operation */
  kxid   ktudbxid;                                          /* current tx id */
  ub2    ktudbseq;                                  /* block sequence number */
  ub1    ktudbrec;                       /* new record index for this change */
};

其实质就是 ktubhseq 和 ktudbseq 不统一，如果是 ktudbrec 和 ktubhcnt 不统一则为 ORA-00600[4194]，ktubh 是 undo block 中的构造，ktudb 是 redo 记录 undo 块操作时会记录的信息。


 Undo Data
 
 struct kcbh, 20 bytes                      @0       
    ub1 type_kcbh                           @0       
    ub1 frmt_kcbh                           @1       
    ub1 spare1_kcbh                         @2       
    ub1 spare2_kcbh                         @3       
    ub4 rdba_kcbh                           @4       
    ub4 bas_kcbh                            @8       
    ub2 wrp_kcbh                            @12      
    ub1 seq_kcbh                            @14      
    ub1 flg_kcbh                            @15      
    ub2 chkval_kcbh                         @16      
    ub2 spare3_kcbh                         @18      
 
 struct ktubh, 142 bytes                    @20      
    struct ktubhxid, 8 bytes                @20      
    ub2 ktubhseq                            @28      
    ub1 ktubhcnt                            @30      
    ub1 ktubhirb                            @31      
    ub1 ktubhicl                            @32      
    ub1 ktubhflg                            @33      
    ub2 ktubhidx&#091;64]                        @34      
 
 ub1 freespace&#091;38]                          @162     
 
 ub1 undodata&#091;7988]                         @200     
 
 ub4 tailchk                                @8188

回到案例，从 trace 中能够看到 ktubh 和 ktudb 的信息。

ktudb 记录如下：

xid: 0x750f.003.0093843a

uba: 0x1cd49528.ddee.01

seq: 0xddee

rec: 0x02

*** ktuc_diag_dmp: dump of current change vector
ktudb redo: siz: 232 spc: 8068 flg: 0x0012 seq: 0xddee rec: 0x02
            xid:  0x750f.003.0093843a  
ktubl redo: slt: 3 rci: 0 opc: 11.1 &#091;objn: 6729 objd: 6729 tsn: 1]
Undo type:  Regular undo        Begin trans    Last buffer split:  No 
Temp Object:  No 
Tablespace Undo:  No 
             0x00000000  prev ctl uba: 0x1cd49528.ddee.01 
prev ctl max cmt scn:  0x0eaa.692b5862  prev tx cmt scn:  0x0eaa.692b58c4 
txn start scn:  0xffff.ffffffff  logon user: 0  prev brb: 0  prev bcl: 0 BuExt idx: 0 flg2: 0
KDO undo record:
KTB Redo 
op: 0x04  ver: 0x01  
compat bit: 4 (post-11) padding: 1
op: L  itl: xid:  0x7607.013.00713adb uba: 0x1d001423.538a.15
                      flg: C---    lkc:  0     scn: 0x0eaa.68c33db2
KDO Op code: URP row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x008011c0  hdba: 0x00801a02
itli: 1  ispac: 0  maxfr: 4858
tabn: 0 slot: 78(0x4e) flag: 0x2c lock: 0 ckix: 23
ncol: 10 nnew: 1 size: 0
col  8: &#091; 6]  c5 02 0b 2a 4d 5c

uba 指向的 undo 块 ktudb 信息如下：

xid: 0x750d.009.00abc7f4

seq: 0xd24

cnt: 0x4d

********************************************************************************
UNDO BLK:  
xid: 0x750d.009.00abc7f4  seq: 0xd24 cnt: 0x4d  irb: 0x1   icl: 0x0   flg: 0x0000
 
 Rec Offset      Rec Offset      Rec Offset      Rec Offset      Rec Offset
---------------------------------------------------------------------------
0x01 0x1f8c     0x02 0x1f18     0x03 0x1ebc     0x04 0x1e48     0x05 0x1dec     
0x06 0x1d78     0x07 0x1d1c     0x08 0x1cc0     0x09 0x1c64     0x0a 0x1bf0     
0x0b 0x1b94     0x0c 0x1b20     0x0d 0x1ac4     0x0e 0x1a50     0x0f 0x19f4     
0x10 0x1980     0x11 0x1924     0x12 0x18b0     0x13 0x1854     0x14 0x17e0     
0x15 0x1784     0x16 0x1710     0x17 0x16b4     0x18 0x1640     0x19 0x15e4     
0x1a 0x1570     0x1b 0x1514     0x1c 0x14a0     0x1d 0x144c     0x1e 0x13d8     
0x1f 0x137c     0x20 0x1320     0x21 0x12c4     0x22 0x1250     0x23 0x11f4     
0x24 0x1180     0x25 0x1124     0x26 0x10b0     0x27 0x1054     0x28 0x0fe0     
0x29 0x0f84     0x2a 0x0f10     0x2b 0x0eb4     0x2c 0x0e40     0x2d 0x0de4     
0x2e 0x0d70     0x2f 0x0d14     0x30 0x0ca0     0x31 0x0c44     0x32 0x0be0     
0x33 0x0b84     0x34 0x0b10     0x35 0x0ab4     0x36 0x0a40     0x37 0x09e4     
0x38 0x0970     0x39 0x0914     0x3a 0x08a0     0x3b 0x0844     0x3c 0x07d0     
0x3d 0x0774     0x3e 0x0700     0x3f 0x06a4     0x40 0x0630     0x41 0x05d4     
0x42 0x0560     0x43 0x0504     0x44 0x0490     0x45 0x0434     0x46 0x03c0     
0x47 0x0364     0x48 0x02f0     0x49 0x0294     0x4a 0x0220     0x4b 0x01c4     
0x4c 0x0150     0x4d 0x00f4

这里看到 undo 块中曾经有一个流动事务了，这就不仅仅是 seq 不统一的问题了，更重大的是一个新事务竟然去找了一个存在流动事务的 undo 块来应用。咱们晓得 undo 段头会预调配最多 5 个 block 到 free block pool 里作为下一个事务筹备，当须要调配 undo 块时，会去寻找 free block pool 外面的块。

看看报错事务指向的 undo 段头块(段头等为 0x750f)：

  TRN CTL:: seq: 0xddee chd: 0x0003 ctl: 0x001d inc: 0x00000000 nfb: 0x0001
            mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
            uba: 0x1cd49528.ddee.01 scn: 0x0eaa.692b5862
Version: 0x01
  FREE BLOCK POOL::
    uba: 0x1cd49528.ddee.01 ext: 0x16a6 spc: 0x1f84  
    uba: 0x00000000.ddc3.3e ext: 0x167b spc: 0x4e2   
    uba: 0x00000000.dc0b.4b ext: 0x47ab spc: 0x320   
    uba: 0x00000000.7727.48 ext: 0xd13 spc: 0x326   
    uba: 0x00000000.0000.00 ext: 0x0  spc: 0x0     
  TRN TBL::
 
  index  state cflags  wrap#    uel         scn            dba            parent-xid    nub     stmt_num    cmt
  ------------------------------------------------------------------------------------------------
   0x00    9    0x00  0x9416bd  0x0014  0x0eaa.692b7f1d  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893564
   0x01    9    0x00  0x93546c  0x000f  0x0eaa.692b5bde  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893563
   0x02    9    0x00  0x9454fb  0x000b  0x0eaa.692b622a  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893563
   0x03    9    0x00  0x93842a  0x0019  0x0eaa.692b58c4  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893563
   0x04    9    0x00  0x932b09  0x0007  0x0eaa.692b7254  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893564
   0x05    9    0x00  0x931008  0x0010  0x0eaa.692b5d5f  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893563
   0x06    9    0x00  0x948b77  0x0021  0x0eaa.692b59f4  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893563
   0x07    9    0x00  0x93c1e6  0x0009  0x0eaa.692b7540  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893564
   0x08    9    0x00  0x93c4b5  0x000d  0x0eaa.692b5d8a  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893563
   0x09    9    0x00  0x94ac14  0x0000  0x0eaa.692b79f1  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893564
   0x0a    9    0x00  0x944b33  0x0012  0x0eaa.692b704e  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893564
   0x0b    9    0x00  0x9453f2  0x000c  0x0eaa.692b667a  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893563
   0x0c    9    0x00  0x9468e1  0x001e  0x0eaa.692b6a9d  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893563
   0x0d    9    0x00  0x94d5e0  0x0020  0x0eaa.692b5f85  0x00000000  0x0000.000.00000000  0x00000000   0x00000000  1635893563

能够看到 free block pool 里有一个块正好是 0x1cd49528。综上所述，undo 块 0x1cd49528 理论曾经调配给 undo 段头 0x750d，并且曾经有事务 xid: 0x750d.009.00abc7f4 正在应用该 undo 块，然而 undo 段头 0x750f 外面的 FREE BLOCK POOL 并没有把这个块清理掉，导致了 ORA-00600[4193]产生。

个别对于事务异样的问题会触发 blockrecover 操作，那 6 点 52 分 46 秒就呈现了 blockrecover，是否问题就是从这里开始的呢？PMON trace 显示 blockrecover 是从 6 点 52 分 49 秒开始，file 2608 block 144 即为 undo 段头 0x750f 的块地址：

*** 2021-11-03 06:52:49.027
Block recovery interrupted at rba 33832.4613427.16
Resuming block recovery (PMON) for file 2608 block 144
Block recovery from logseq 33832, block 4613427 to scn 16125071688920
 
*** 2021-11-03 06:52:49.031
Recovery of Online Redo Log: Thread 2 Group 7 Seq 33832 Reading mem 0
Block recovery interrupted at rba 33832.4769190.16
Resuming block recovery (PMON) for file 2608 block 144
Block recovery from logseq 33832, block 4769190 to scn 16125071688920
...

从一些前台过程 trace 里能够看到，6 点 52 分 51 秒的事务 TX-750F0003-0093843A 就曾经无奈进行。过程被 crash。

*** 2021-11-03 06:52:51.294
*** SESSION ID:(2049.549) 2021-11-03 06:52:51.294
*** CLIENT ID:() 2021-11-03 06:52:51.294
*** SERVICE NAME:(SYS$USERS) 2021-11-03 06:52:51.294
*** MODULE NAME:(OGG-REPYCJ-OPEN_DATA_SOURCE) 2021-11-03 06:52:51.294
*** ACTION NAME:() 2021-11-03 06:52:51.294
 
ksqgtl:newtx: res: 321161d7b0 TX-750F0003-0093843A flag: 0x47
 
*** 2021-11-03 06:52:52.294
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0x505E9323] &#091;PC:0x98214CA, kgegpa()+40] &#091;flags: 0x0, count: 1]
DDE previous invocation failed before phase II
DDE was called in a 'No Invocation Mode'
----- Start Diag Diagnostic Dump -----
Diag diagnostic dump is performed due to an error in the diagfw code during error handling.
DDE is switched to protected mode during the diagnostic dump to prevent recursive errors in the error hadnling code.
Dump error and call stack for the diagnostic dump:
 
*** 2021-11-03 06:52:52.319
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
 
*** 2021-11-03 06:52:52.319
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0x505E9323] &#091;PC:0x981FF3E, kgebse()+776] &#091;flags: 0x2, count: 2]
Not recording an ORA-7445 (previous attmpet failed)
Warning! DDE is invoked in protected mode. DDE call is aborted.
Warning! DDE is invoked in protected mode. DDE call is aborted.
 
*** 2021-11-03 06:52:52.321
Exception &#091;type: SIGSEGV, Address not mapped to object] &#091;ADDR:0x505E9323] &#091;PC:0x981FF3E, kgebse()+776] &#091;flags: 0x2, count: 2]
Not recording an ORA-7445 (previous attmpet failed)
ssexhd: crashing the process...
Shadow_Core_Dump = PARTIAL

该事务 id 和之前剖析报错的 ORA-00600[4193]是一样的。阐明从 6 点 52 分 46 秒开始，undo 段头 0x750F 就无奈绑定事务了，起因就是 free block pool 里的 undo 块曾经不属于该 undo 段了，被事务 0x750d.009.00abc7f4 应用了。通过 redodump 能够验证这一点。

REDO RECORD - Thread:2 RBA: 0x008429.001ef4a5.0184 LEN: 0x00bc VLD: 0x01
SCN: 0x0eaa.692b839b SUBSCN:  6 11/03/2021 06:52:46
CHANGE #1 TYP:0 CLS:59949 AFN:2608 DBA:0x8c800090 OBJ:4294967295 SCN:0x0eaa.692b832d SEQ:1 OP:14.4 ENC:0 RBL:0
kteop redo - redo operation on extent map
   SETSTAT: exts:24996 blks:219215 lastmap:0x8c91f248 mapcnt:56
   UPDXNT: extent:5798 add:TRUE
CHANGE #2 TYP:0 CLS:7 AFN:2608 DBA:0x8c859230 OBJ:4294967295 SCN:0x0eaa.692b832d SEQ:1 OP:14.4 ENC:0 RBL:0
kteop redo - redo operation on extent map
   ADD: dba:0x1cd49528 len:8 at offset:1
  ADDRET: offset:1 ctime:0
CHANGE #3 TYP:0 CLS:13 AFN:115 DBA:0x1cc00002 OBJ:4294967295 SCN:0x0eaa.692b839b SEQ:1 OP:22.2 ENC:0 RBL:0
ktfbhredo - File Space Header Redo:
Header Opcode:
Save: No Pending Op
...
REDO RECORD - Thread:2 RBA: 0x008429.001efbb0.0050 LEN: 0x00bc VLD: 0x01
SCN: 0x0eaa.692b8484 SUBSCN:  4 11/03/2021 06:52:46
CHANGE #1 TYP:0 CLS:59945 AFN:2608 DBA:0x8c800080 OBJ:4294967295 SCN:0x0eaa.692b8427 SEQ:1 OP:14.4 ENC:0 RBL:0
kteop redo - redo operation on extent map
   SETSTAT: exts:25381 blks:221582 lastmap:0x8c8d46a8 mapcnt:57
   UPDXNT: extent:13691 add:TRUE
CHANGE #2 TYP:0 CLS:7 AFN:115 DBA:0x1cd3aa10 OBJ:4294967295 SCN:0x0eaa.692b8425 SEQ:1 OP:14.4 ENC:0 RBL:0
kteop redo - redo operation on extent map
   ADD: dba:0x1cd49528 len:8 at offset:189
  ADDRET: offset:189 ctime:0
CHANGE #3 TYP:0 CLS:13 AFN:115 DBA:0x1cc00002 OBJ:4294967295 SCN:0x0eaa.692b8484 SEQ:1 OP:22.2 ENC:0 RBL:0
ktfbhredo - File Space Header Redo:
Header Opcode:
Save: No Pending Op
...
REDO RECORD - Thread:2 RBA: 0x008429.001efd6f.0010 LEN: 0x02a4 VLD: 0x01
SCN: 0x0eaa.692b84b8 SUBSCN:  6 11/03/2021 06:52:46
CHANGE #1 TYP:0 CLS:59946 AFN:115 DBA:0x1cd49528 OBJ:4294967295 SCN:0x0eaa.692b84b8 SEQ:3 OP:5.1 ENC:0 RBL:0
ktudb redo: siz: 92 spc: 6690 flg: 0x0022 seq: 0x0d24 rec: 0x0f
            xid:  0x750d.009.00abc7f4
ktubu redo: slt: 9 rci: 14 opc: 11.1 objn: 6049924 objd: 6049924 tsn: 9
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  No
             0x00000000
KDO undo record:
KTB Redo
op: 0x02  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C  uba: 0x1cd49528.0d24.07
KDO Op code: DRP row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x45707392  hdba: 0x09ab6222
itli: 2  ispac: 0  maxfr: 4858
tabn: 0 slot: 20(0x14)

该 undo 块一系列操作都是在 11/03/2021 06:52:46 实现，这显著不合乎 undo 块分配原则，狐疑是 bug，然而 mos 没有找到相似文档：

SCN: 0x0eaa.692b839b，undo 块 0x1cd49528 调配给到 undo 段头地址为 0x8c800090 的 undo 段(段头等为 0x750f)

SCN: 0x0eaa.692b8484，undo 块 0x1cd49528 调配到 undo 段头地址为 0x8c800080 的 undo 段(段头等为 0x750d)

SCN: 0x0eaa.692b84b8，事务 0x750d.009.00abc7f4 应用了该 undo 块

undo 块分配原则：

1.If the current extent has more free blocks, return the next free block.
2.Otherwise if the next extent has expired, wrap into the next extent and return the first block in that extent.
3.Otherwise get space from the Undo tablespace. If a free extent is available, allocate the extent to the transaction table and return the first block in the new extent.
4.Steal from an offlined transaction table. If a free extent is available, deallocate it from the offlined transaction table and add it to the current transaction table. Return the first free block from the new extent.
5.Steal from an onlined transaction table. The procedure is similar to (4).
6.Extend the file in the undo tablespace.If the files can be extended, add an extent to the current transaction table and return a block to the requestor
7.Reuse unexpired extents from own transaction table. If all extents are currently busy (i.e.contain uncommitted undo), go to the next step. Otherwise wrap into the next extent
8.Steal unexpired extents from offlined transaction tables. If fail try the onlined transaction tables.
9.If all else fails, report the ORA-1562:“Failed to extend rollback segment number %s”error.
 
Note:Steps 7 and 8 are skipped if the undo retention guarantee (10g and higher) is specified.

综上剖析咱们失去的解决办法有如下几种：

重建该实例的 undo 表空间

删除异常的 undo 段(段头等为 0x750f)

bbed 清空 undo 段 (段头等为 0x750f) 的 free block pool

那么故障的起因和解决方案剖析完了，还有一个很异样的问题，为什么简直同一时间 undo 块间断调配给两个 undo 段，当 undo block 调配给其余 undo 段应用的时候，以前的 undo 段的 free block pool 里没有清理掉呢？联合过后的 awr 来看有比较严重的 undo 争用，undo extent steal 也很频繁，再往前翻看 alert 发现频繁的 ORA-01628：

undo global data    ktusm_stealext_2    0    1,029,644    827,502
undo global data    ktusmupst: KSLBEGIN    0    44,653    83,762

Wed Nov 03 06:38:58 2021
Archived Log entry 220714 added for thread 2 sequence 33828 ID 0xbc36136c dest 1:
Wed Nov 03 06:39:51 2021
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU30353_261227539$ 
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU30353_261227539$ 
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU30353_261227539$ 
......
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU30353_261227539$ 
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU30353_261227539$ 
ORA-1628: max # extents  32765 reached for rollback segment
Wed Nov 03 06:41:12 2021
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU30292_4090763577$ 
Wed Nov 03 06:41:27 2021
Thread 2 advanced to log sequence 33830 (LGWR switch)
  Current log# 11 seq# 33830 mem# 0: +FDATADG/newepmfdp/onlinelog/group_11.305.1083929527
Wed Nov 03 06:41:31 2021
Archived Log entry 220716 added for thread 2 sequence 33829 ID 0xbc36136c dest 1:
Wed Nov 03 06:41:48 2021
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU30343_1212715771$ 
Errors in file /u01/app/oracle/diag/rdbms/newepmfdp/epmfdp2/trace/epmfdp2_j002_77822.trc:
ORA-12012: error on auto execute of job 1230
ORA-12008: error in materialized view refresh path
ORA-01628: max # extents (32765) reached for rollback segment _SYSSMU30343_1212715771$
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2809
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 3025
ORA-06512: at "SYS.DBMS_IREFRESH", line 689
ORA-06512: at "SYS.DBMS_REFRESH", line 195
ORA-06512: at line 1
Wed Nov 03 06:43:45 2021
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU30359_11512129$ 
Wed Nov 03 06:43:45 2021
ORA-1628: max # extents  32765 reached for rollback segment _SYSSMU30359_11512129$

我狐疑是频繁的 undo extent 扩大、膨胀、偷取导致的 undo 块调配错乱。因为没有对系统平时的负载做剖析，这里临时先给出一些粗略的后续倡议：

undo 设置为 undo retention guarantee

正当设置 undo 相干参数

正当设计事务，防止大事务产生

剖析 dba_undo_extents，很可能有过多的十分小的 undo extent

墨天轮原文链接：https://www.modb.pro/db/17514…（复制链接至浏览器查看)

对于作者
李翔宇，云和恩墨西区交付技术顾问，长期服务挪动运营商行业客户，相熟 Oracle 性能优化，故障诊断，非凡复原。