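Environment:
Software: Oracle 11.2.0.4 RAC, operating system AIX 6107, HACMP 5.5
Hardware: 2 midrange servers, 2 storage arrays, 2 networks
Architecture: The Oracle RAC ASM disk groups are built as redundant disk groups across the two storage arrays. The Oracle RAC vote/OCR disks use HACMP shared volume groups, on which the disk heartbeat is created.
Problem: Recently, errors 3D32B80D and 3C81E43F have been reported about once every 1 to 2 weeks.
The error entries are as follows: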
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
LABEL: TS_NIM_ERROR_STUCK_
Date/Time: Wed Mar 16 03:08:20 CST 2022
Sequence Number: 50626
Machine Id: 00F83B6A4C00
Node Id: newhisvhfs1
Class: S
Type: PERM
WPAR: Global
Resource Name: topsvcs
Description
NIM thread blocked
Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU
The system clock was set forward
User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention
The system clock was manually set forward
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.41,7916
ERROR ID
6BUfAx.YECAW/uyc01cU08....................
REFERENCE CODE
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
27900
Interface name
rhdisk33
---------------------------------------------------------------------------
LABEL: TS_NIM_ERROR_STUCK_
IDENTIFIER: 3D32B80D
Date/Time: Wed Mar 16 03:08:20 CST 2022
Sequence Number: 50625
Machine Id: 00F83B6A4C00
Node Id: newhisvhfs1
Class: S
Type: PERM
WPAR: Global
Resource Name: topsvcs
Description
NIM thread blocked
Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU
The system clock was set forward
User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention
The system clock was manually set forward
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.41,7916
ERROR ID
6BUfAx.YECAW/fd/01cU08....................
REFERENCE CODE
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
27899
Interface name
---------------------------------------------------------------------------
LABEL: TS_NIM_ERROR_STUCK_
IDENTIFIER: 3D32B80D
Date/Time: Wed Mar 16 03:08:20 CST 2022
Sequence Number: 50624
Machine Id: 00F83B6A4C00
Node Id: newhisvhfs1
Class: S
Type: PERM
WPAR: Global
Resource Name: topsvcs
Description
NIM thread blocked
Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU
The system clock was set forward
User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention
The system clock was manually set forward
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O
Recommended Actions
Examine I/O and memory activity on the system
Reduce load on the system
Tune virtual memory parameters
Call IBM Service if problem persists
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.41,7916
ERROR ID
6BUfAx.YECAW/eBh.1cU08....................
REFERENCE CODE
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
27900
Interface name
en15
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
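To put the "Interval in seconds during which process was blocked" values in perspective, they can be converted to hours and minutes. The following is a minimal sketch in Python that uses only the sequence numbers and intervals copied from the three entries above; the helper name `as_hms` is illustrative and not part of any AIX or RSCT tooling.

```python
from datetime import timedelta

# Copied from the three errpt entries above:
# sequence number -> blocked interval in seconds.
blocked_seconds = {50626: 27900, 50625: 27899, 50624: 27900}


def as_hms(seconds: int) -> str:
    """Format a duration in seconds as H:MM:SS."""
    return str(timedelta(seconds=seconds))


if __name__ == "__main__":
    for seq, secs in sorted(blocked_seconds.items()):
        print(f"sequence {seq}: {secs} s = {as_hms(secs)}")
```

All three entries work out to about 7 hours 45 minutes and carry the same timestamp, so it may be worth checking the "system clock was set forward" cause listed in the entries alongside the I/O and memory causes.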
The cluster heartbeat status is as follows.
Both the network and the disk heartbeats show missed packets:
/#lssrc -ls topsvcs
Subsystem Group PID Status
topsvcs topsvcs 7143780 active
Network Name Indx Defd Mbrs St Adapter ID Group ID
net_ether_01_0 [ 0] 2 2 S 10.10.5.114 10.10.5.115
net_ether_01_0 [ 0] en17 0x419b608b 0x419b65c8
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 15 Current group: 5
Packets sent : 17778053 ICMP 8 Errors: 0 No mbuf: 0
Packets received: 23130783 ICMP 51 Dropped: 0
NIM's PID: 5308798
net_ether_02_0 [ 1] 2 2 S 10.10.10.3 10.10.10.4
net_ether_02_0 [ 1] en15 0x419b608d 0x419b65cc
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 32 Current group: 15
Packets sent : 17778315 ICMP 8 Errors: 0 No mbuf: 0
Packets received: 23111569 ICMP 52 Dropped: 0
NIM's PID: 6225934
diskhb_0 [ 2] 2 2 S 255.255.10.1 255.255.10.3
diskhb_0 [ 2] rhdisk33 0x8129fcfe 0x819b65cd
HB Interval = 2.000 secs. Sensitivity = 4 missed beats
Missed HBs: Total: 1727 Current group: 982
Packets sent : 8471066 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 8931503 ICMP 0 Dropped: 0
NIM's PID: 7864588
diskhb_1 [ 3] 2 2 S 255.255.10.0 255.255.10.2
diskhb_1 [ 3] rhdisk32 0x8129fcff 0x819b65ce
HB Interval = 2.000 secs. Sensitivity = 4 missed beats
Missed HBs: Total: 1716 Current group: 982
Packets sent : 8470911 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 8931415 ICMP 0 Dropped: 0
NIM's PID: 5963920
2 locally connected Clients with PIDs:
haemd(7930352) hagsd(7274814)
Fast Failure Detection available but off.
Dead Man Switch Enabled:
reset interval = 1 seconds
trip interval = 20 seconds
Client Heartbeating Disabled.
Configuration Instance = 4
Daemon employs no security
Segments pinned: Text Data.
Text segment size: 862 KB. Static data segment size: 1497 KB.
Dynamic data segment size: 5953. Number of outstanding malloc: 167
User time 1246 sec. System time 988 sec.
Number of page faults: 148. Process swapped out 0 times.
Number of nodes up: 2. Number of nodes down: 0.
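To compare the missed heartbeat counts across adapters, the output above can be reduced to one line per adapter. This is a rough sketch in Python, assuming the output keeps the line layout shown here; `parse_topsvcs` and `missed_per_100k` are hypothetical helper names, and "missed per 100,000 received packets" is an ad hoc indicator chosen for this illustration, not a metric defined by HACMP or RSCT.

```python
import re

# Trimmed copy of the `lssrc -ls topsvcs` output above
# (only the lines the parser actually reads).
SAMPLE = """\
net_ether_01_0 [ 0] en17 0x419b608b 0x419b65c8
Missed HBs: Total: 15 Current group: 5
Packets received: 23130783 ICMP 51 Dropped: 0
net_ether_02_0 [ 1] en15 0x419b608d 0x419b65cc
Missed HBs: Total: 32 Current group: 15
Packets received: 23111569 ICMP 52 Dropped: 0
diskhb_0 [ 2] rhdisk33 0x8129fcfe 0x819b65cd
Missed HBs: Total: 1727 Current group: 982
Packets received: 8931503 ICMP 0 Dropped: 0
diskhb_1 [ 3] rhdisk32 0x8129fcff 0x819b65ce
Missed HBs: Total: 1716 Current group: 982
Packets received: 8931415 ICMP 0 Dropped: 0
"""

# Matches the second line of each network block: name, index, adapter, two hex IDs.
ADAPTER_RE = re.compile(
    r"^(\S+)\s+\[\s*\d+\]\s+(\S+)\s+0x[0-9a-fA-F]+\s+0x[0-9a-fA-F]+"
)
MISSED_RE = re.compile(r"Missed HBs:\s*Total:\s*(\d+)\s+Current group:\s*(\d+)")
RECEIVED_RE = re.compile(r"Packets received:\s*(\d+)")


def parse_topsvcs(text):
    """Return {adapter: {network, missed_total, missed_current_group, received}}."""
    stats = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = ADAPTER_RE.match(line)
        if m:
            current = stats.setdefault(m.group(2), {"network": m.group(1)})
            continue
        if current is None:
            continue  # skip anything before the first adapter line
        m = MISSED_RE.search(line)
        if m:
            current["missed_total"] = int(m.group(1))
            current["missed_current_group"] = int(m.group(2))
            continue
        m = RECEIVED_RE.search(line)
        if m:
            current["received"] = int(m.group(1))
    return stats


def missed_per_100k(entry):
    """Ad hoc indicator: missed heartbeats per 100,000 received packets."""
    return entry["missed_total"] / entry["received"] * 100_000


if __name__ == "__main__":
    for adapter, entry in parse_topsvcs(SAMPLE).items():
        print(
            f"{adapter:10s} {entry['network']:16s} "
            f"missed={entry['missed_total']:5d} "
            f"per100k={missed_per_100k(entry):8.2f}"
        )
```

Run against a saved copy of the real command output, a summary like this makes it easier to see that the two disk heartbeat paths miss far more beats than the two Ethernet paths.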
The system nmon logs show no obvious I/O or memory usage anomalies at the time the problem occurred.