关于数据库:Oracle案例ohasd-crash-on-aix

本案例是福建某客户一套待上线零碎，操作系统版本为 AIX 7.2，集群版本为 12.2。几天前 2 节点 GI 忽然重启，本文将剖析其产生的起因。
因为是 GI 忽然重启，那么先看看 GI alert 日志：

2021-11-09 21:21:57.050 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:00.066 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:03.070 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:05.093 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:05.151 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:06.081 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:06.093 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:06.100 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:06.105 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:06.109 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:09.091 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:12.097 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:15.100 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:18.105 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:21.109 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:24.113 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:27.118 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:30.122 &#91;GPNPD(7537210)]CRS-2340: Errors occurred while processing received gpnp message. 
2021-11-09 21:22:33.126 &#91;OHASD(22348120)]Can't find  message for prod&#91;CRS], fac&#91;CRS], and msg id &#91;2119]
2021-11-09 21:22:33.130 &#91;ORAAGENT(17629684)]CRS-5822: Agent '/app/grid/product/12.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:2:1632} in /app/grid/base/diag/crs/p1bemdg2/crs/trace/ohasd_oraagent_grid.trc.
2021-11-09 21:22:33.130 &#91;CSSDMONITOR(13042246)]CRS-5822: Agent '/app/grid/product/12.2.0/grid/bin/cssdmonitor_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:21405} in /app/grid/base/diag/crs/p1bemdg2/crs/trace/ohasd_cssdmonitor_root.trc.
2021-11-09 21:22:33.130 &#91;ORAROOTAGENT(38535474)]CRS-5822: Agent '/app/grid/product/12.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:1420} in /app/grid/base/diag/crs/p1bemdg2/crs/trace/ohasd_orarootagent_root.trc.
2021-11-09 21:22:33.130 &#91;CSSDAGENT(14418484)]CRS-5822: Agent '/app/grid/product/12.2.0/grid/bin/cssdagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:4:21329} in /app/grid/base/diag/crs/p1bemdg2/crs/trace/ohasd_cssdagent_root.trc.

能够看到从 21 点 21 分开始呈现 gpnp 通信异样，22 分 33 秒时 ohasd 栈的 4 个 agent 全副同时挂掉，这其实给了咱们很大的提醒，ohasd_orarootagent_root、ohasd_oraagent_grid、ohasd_orarootagent_root、ohasd_cssdagent_root 正好是 ohasd.bin 守护过程启动之后启动的 4 个 agent，它们负责保护(启动、敞开等等)ohasd 层面的资源，如果呈现同时挂掉的话，那么大概率是 ohasd.bin 呈现了异样 crash 掉了，所以下一步间接看 ohasd 的 trace 文件。

2021-11-09 21:21:50.060 :  OCRUTL:3085: prou_add_log_list_int:proath_new_entry is NULL for 2021-11-09 21:21:50.060: th_process_request: Completed op &#91;13] return &#91;12]
2021-11-09 21:21:50.060 :  OCRSRV:3085: th_select_w_f_r: Error processing request &#91;12]
2021-11-09 21:21:50.060 :  OCRUTL:3085: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;1040]
2021-11-09 21:21:50.060 :  OCRUTL:3085: prou_add_log_list_int:proath_new_entry is NULL for 2021-11-09 21:21:50.060: th_select_w_f_t: Processed request constr &#91;0000000000000069] retval &#91;12]
2021-11-09 21:21:57.032 :  OCRUTL:3342: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;4120]
  OCRSRV:3342: proas_get_value: Failure is malloc &#91;12]
  OCRAPI:2828: SLOS : SLOS: cat=-2, opn=malloc, dep=12, loc=addgrpnm1
unable to add group (system)
 
  OCRUTL:2828: u_fill_errorbuf: Error Info : &#91;&#91;SLOS: cat=-2, opn=malloc, dep=12, loc=addgrpnm1
unable to add group (system)

从 gpnpd 出现异常的工夫点，能够看到 ohasd 动态分配 4120 的内存失败，从 gipcd.trc 也能够看出通信异样的起因也是无奈分配内存。

2021-11-09 21:21:57.034 :  OCRCLI:1286: proac_get_value:&#91;SYSTEM.GPnP.profiles.peer.best]: Response message failed &#91;12] 
2021-11-09 21:21:57.034 :  OCRAPI:1286: a_read: failed to read key &#91;12]
2021-11-09 21:21:57.034 :    GPNP:1286:  procr_get_value_ext: OLR api procr_get_value_ext failed for key SYSTEM.GPnP.profiles.peer.best
2021-11-09 21:21:57.034 :    GPNP:1286:  procr_get_value_ext: OLR current boot level : 7
2021-11-09 21:21:57.034 :    GPNP:1286:  procr_get_value_ext: OLR error code    : 12
2021-11-09 21:21:57.034 :    GPNP:1286:  procr_get_value_ext: OLR error message : PROCL-12: Out of heap memory.
2021-11-09 21:21:57.034 :    GPNP:1286:  clsgpnp_getProfileEx: &#91;at clsgpnp.c:602] Result: (2) CLSGPNP_MEMORY. (:GPNP00140:) get cache profile failed, will try gpnpd call
2021-11-09 21:21:57.045 :  OCRCLI:1286: proac_get_value:&#91;SYSTEM.version.activeversion]: Response message failed &#91;12] 
2021-11-09 21:21:57.045 :  OCRAPI:1286: a_read: failed to read key &#91;12]
 default:1286: Failure 12 in trying to get AV value SYSTEM.version.activeversion
 
 default:1286: procr_get_value error 12 errorbuf : PROCL-12: Out of heap memory.

持续往下看 ohasd 的 trace：

2021-11-09 21:22:18.103 :  OCRUTL:2571: prou_add_log_list_int:proath_new_entry is NULL for 2021-11-09 21:22:18.102: th_select_w_f_t: Processed request constr &#91;0000000000000069] retval &#91;0]
  OCRAPI:3342: SLOS : SLOS: cat=-2, opn=malloc, dep=12, loc=addgrpnm1
unable to add group (system)
 
  OCRUTL:3342: u_fill_errorbuf: Error Info : &#91;&#91;SLOS: cat=-2, opn=malloc, dep=12, loc=addgrpnm1
unable to add group (system)
]]
2021-11-09 21:22:18.103 :  OCRUTL:2828: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;4120]
  OCRSRV:2828: proas_get_value: Failure is malloc &#91;12]
2021-11-09 21:22:18.104 :  OCRUTL:3085: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;4120]
  OCRSRV:3085: proas_get_value: Failure is malloc &#91;12]
2021-11-09 21:22:18.104 :  OCRSRV:2571: proas_cache_locked: Value not cached for key &#91;SYSTEM.network] : retval &#91;84]
  OCRAPI:3342: SLOS : SLOS: cat=-2, opn=malloc, dep=12, loc=addgrpnm1
unable to add group (system)
 
  OCRUTL:3342: u_fill_errorbuf: Error Info : &#91;&#91;SLOS: cat=-2, opn=malloc, dep=12, loc=addgrpnm1
unable to add group (system)
]]
2021-11-09 21:22:18.105 :  OCRUTL:3085: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;4120]
  OCRSRV:3085: proas_get_value: Failure is malloc &#91;12]
2021-11-09 21:22:18.107 :  OCRUTL:2571: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;280]
2021-11-09 21:22:18.107 :  OCRRAW:2571: prcreatei: could not allocate memory 3
...
2021-11-09 21:22:33.126 :  OCRUTL:2571: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;4120]
  OCRSRV:2571: proas_get_value: Failure is malloc &#91;12]
2021-11-09 21:22:33.126 :GIPCXCPT:2057:  gipclibMalloc: failed to allocate 40960 bytes, cowork 11234c978, ret gipcretOutOfMemory (28)
2021-11-09 21:22:33.126 :  OCRUTL:2828: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;4120]
  OCRSRV:2828: proas_get_value: Failure is malloc &#91;12]
2021-11-09 21:22:33.126 :GIPCXCPT:2057:  gipcWaitF &#91;prom_select : prom.c : 4591]: EXCEPTION&#91; ret gipcretOutOfMemory (28) ]  failed to wait on obj 112592dd0 &#91;000000000000002a] {gipcContainer : numObj 11, wobj 1125933b0, objFlags 0x0}, reqList 11234d420, nreq 1, creq 11234d4e8 timeout INFINITE, flags 0x0
2021-11-09 21:22:33.126 :  OCRUTL:2828: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;1040]
2021-11-09 21:22:33.126 :  OCRMSG:2057: prom_select: Failed to wait on container &#91;28]
2021-11-09 21:22:33.126 :  OCRUTL:2828: prou_add_log_list_int:proath_new_entry is NULL for 2021-11-09 21:22:33.126: prom_send: Before gipcSend
2021-11-09 21:22:33.126 :  OCRMSG:2057: GIPC error &#91;28] msg &#91;gipcretOutOfMemory]
2021-11-09 21:22:33.126 :  OCRUTL:2571: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;1040]
2021-11-09 21:22:33.126 :  OCRUTL:2571: prou_add_log_list_int:proath_new_entry is NULL for 2021-11-09 21:22:33.126: th_select_w_f_t: Processed request constr &#91;0000000000000069] retval &#91;0]
2021-11-09 21:22:33.126 :  OCRUTL:2057: u_allocm: metactx &#91;111a7c2d0], ptr returned is NULL, size to lmmcalloc &#91;2088]
2021-11-09 21:22:33.126 :  OCRUTL:2057: u_insert_error: Error allocating memory for hashelement.
  OCRSRV:2057: th_select: Failed to select &#91;24]
  OCRSRV:2057: th_select: Can no longer process local connections.  Setting OCR context to invalid
2021-11-09 21:22:33.126 :  OCRAPI:2057: procr_ctx_set_invalid: ctx is in state &#91;10].
2021-11-09 21:22:33.126 :  OCRAPI:2057: procr_ctx_set_invalid: ctx set to invalid
2021-11-09 21:22:33.126 :  OCRAPI:2057: procr_ctx_set_invalid: Aborting...
Trace file /app/grid/base/diag/crs/p1bemdg2/crs/trace/ohasd.trc
Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved.
    CLSB:1: Argument count (argc) for this daemon is 2
    CLSB:1: Argument 0 is: /app/grid/product/12.2.0/grid/bin/ohasd.bin
    CLSB:1: Argument 1 is: restart

能够看到大量的内存调配失败之后，ohasd.bin 被 abort 了，这就是导致 GI 重启的起因。因为 ohasd.bin 是 root 来保护的，所以查看一下 root 的 ulimit：

这里能够看到大量谬误的设置，包含导致此次故障的 memory limit。

墨天轮原文链接：https://www.modb.pro/db/17495…（复制链接至浏览器或点击文末浏览原文查看)

对于作者
李翔宇，云和恩墨西区交付技术顾问，长期服务挪动运营商行业客户，相熟 Oracle 性能优化，故障诊断，非凡复原。