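Two IBM minicomputers run the office mail system, and both servers run HA version 5.3. The customer reports that when both hosts, gsoamail1 (10.52.4.105) and gsoamail2 (10.52.4.106), are up, resource groups can be switched between them normally, but when one host goes down, the other host does not automatically take over the failed host's resource groups.
Checking the hosts turned up the following: errpt on gsoamail1 (10.52.4.105) shows the errors below, while gsoamail2 (10.52.4.106) shows none.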
gsoamail1/tmp/apl #errpt|more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
0873CF9F   0131162513 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131162313 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131162013 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161813 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161613 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161413 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161213 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161013 T S tty0           TTYHOG OVER-RUN
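To see how often the error recurs, the TIMESTAMP column of the errpt summary can be decoded; errpt prints it as MMDDhhmmYY. The following is a minimal Python sketch, not part of the original post: the function names are made up for illustration, the timestamps are copied from the listing above, and the two-digit year is assumed to fall in the 2000s.

```python
from datetime import datetime

# Timestamps copied from the errpt summary above (format: MMDDhhmmYY).
ERRPT_TIMESTAMPS = [
    "0131162513", "0131162313", "0131162013", "0131161813",
    "0131161613", "0131161413", "0131161213", "0131161013",
]


def parse_errpt_timestamp(stamp: str) -> datetime:
    """Convert an errpt summary timestamp (MMDDhhmmYY) to a datetime.

    Assumption: the two-digit year belongs to the 2000s.
    """
    month, day = int(stamp[0:2]), int(stamp[2:4])
    hour, minute = int(stamp[4:6]), int(stamp[6:8])
    year = 2000 + int(stamp[8:10])
    return datetime(year, month, day, hour, minute)


def gaps_in_minutes(stamps):
    """Return the gaps (in minutes) between consecutive entries, oldest first."""
    times = sorted(parse_errpt_timestamp(s) for s in stamps)
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for s in ERRPT_TIMESTAMPS:
        print(s, "->", parse_errpt_timestamp(s).strftime("%Y-%m-%d %H:%M"))
    print("gaps between entries (minutes):", gaps_in_minutes(ERRPT_TIMESTAMPS))
```

With the eight entries shown, a new TTYHOG entry on tty0 is logged every two to three minutes.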
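gsoamail1/tmp/apl #errpt -a|more
---------------------------------------------------------------------------
LABEL:           TTY_TTYHOG
IDENTIFIER:      0873CF9F

Date/Time:       Thu Jan 31 16:25:08 2013
Sequence Number: 57660
Machine Id:      00C2FF704C00
Node Id:         gsoamail1
Class:           S
Type:            TEMP
Resource Name:   tty0

Description
TTYHOG OVER-RUN

Failure Causes
EXCESSIVE LOAD ON PROCESSOR

        Recommended Actions
        REDUCE SYSTEM LOAD.
        REDUCE SERIAL PORT BAUD RATE

Duplicates
Number of duplicates
999
Time of first duplicate
Thu Jan 31 16:23:02 2013
Time of last duplicate
Thu Jan 31 16:25:08 2013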
Running lssrc -ls topsvcs shows the following:
gsoamail1/tmp/apl #lssrc -ls topsvcs
Subsystem         Group            PID          Status
 topsvcs          topsvcs          385494       active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
net_ether_01_0 [ 0] 2     2     S    192.168.17.13   192.168.17.14
net_ether_01_0 [ 0] en5              0x38c95c0b      0x38c9dbd1
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 15 Current group: 15
Packets sent    : 4233130 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6618604 ICMP 0 Dropped: 0
NIM's PID: 532624
net_ether_01_1 [ 1] 2     2     S    192.168.18.13   192.168.18.14
net_ether_01_1 [ 1] en6              0x38c95c0c      0x38c9dbd2
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 16 Current group: 16
Packets sent    : 4232608 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6618145 ICMP 0 Dropped: 0
NIM's PID: 86210
net_ether_02_0 [ 2] 2     2     S    192.168.20.13   192.168.20.14
net_ether_02_0 [ 2] en4              0x38c95c0d      0x38c9dbd3
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 10 Current group: 10
Packets sent    : 4232883 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6618019 ICMP 0 Dropped: 0
NIM's PID: 451000
net_ether_02_1 [ 3] 2     2     S    192.168.19.13   192.168.19.14
net_ether_02_1 [ 3] en3              0x38c95c0e      0x38c9dbd4
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 11 Current group: 11
Packets sent    : 4233009 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6617901 ICMP 0 Dropped: 0
NIM's PID: 200830
rs232_0        [ 4] 2     0     D    255.255.0.0     # abnormal state
rs232_0        [ 4] tty0             Adapter state unknown
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
  2 locally connected Clients with PIDs:
haemd(184696) hagsd(442672)
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 20 seconds
  Client Heartbeating Disabled.
  Configuration Instance = 9
  Daemon employs no security
  Segments pinned: Text Data.
  Text segment size: 809 KB. Static data segment size: 1520 KB.
  Dynamic data segment size: 4545. Number of outstanding malloc: 257
  User time 391 sec. System time 254 sec.
  Number of page faults: 117. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
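The per-network summary rows of this output can be checked mechanically. Below is a minimal Python sketch, not part of the original post, that flags any heartbeat network whose member count (Mbrs) is below the defined count (Defd) or whose state (St) is not S. The sample text is abridged from the output above, the function name is hypothetical, and reading S as "stable" and D as "disabled" is my understanding of the Topology Services status codes and should be confirmed against the RSCT documentation.

```python
import re

# Abridged network rows from the gsoamail1 output above.
SAMPLE = """\
net_ether_01_0 [ 0] 2 2 S 192.168.17.13 192.168.17.14
net_ether_01_0 [ 0] en5 0x38c95c0b 0x38c9dbd1
net_ether_01_1 [ 1] 2 2 S 192.168.18.13 192.168.18.14
net_ether_02_0 [ 2] 2 2 S 192.168.20.13 192.168.20.14
net_ether_02_1 [ 3] 2 2 S 192.168.19.13 192.168.19.14
rs232_0 [ 4] 2 0 D 255.255.0.0
rs232_0 [ 4] tty0 Adapter state unknown
"""

# Matches only the first row of each network: name, index, Defd, Mbrs, St.
# The second row (interface name and hex IDs) has no numeric Defd/Mbrs
# columns, so it does not match.
ROW = re.compile(
    r"^(?P<name>\S+)\s+\[\s*(?P<idx>\d+)\]\s+"
    r"(?P<defd>\d+)\s+(?P<mbrs>\d+)\s+(?P<st>[A-Z])\b"
)


def unhealthy_networks(text):
    """Return (name, defd, mbrs, state) for networks that look unhealthy."""
    flagged = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        defd, mbrs = int(match["defd"]), int(match["mbrs"])
        if match["st"] != "S" or mbrs < defd:
            flagged.append((match["name"], defd, mbrs, match["st"]))
    return flagged


if __name__ == "__main__":
    for name, defd, mbrs, state in unhealthy_networks(SAMPLE):
        print(f"{name}: defined={defd} members={mbrs} state={state}")
```

On the sample this reports only rs232_0 (2 adapters defined, 0 members, state D); the four Ethernet heartbeat networks pass the check.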
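For comparison, on a system where the two-node cluster works normally, the status looks like this (the relevant lines were marked in red in the original post):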
[app1:root:/#]lssrc -ls topsvcs
Subsystem         Group            PID          Status
 topsvcs          topsvcs          241690       active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
net_ether_01_0 [ 0] 2     2     S    192.168.18.11   192.168.18.12
net_ether_01_0 [ 0] en6              0x38c579e7      0x38c57aa5
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 181 Current group: 181
Packets sent    : 4490908 ICMP 112 Errors: 0 No mbuf: 0
Packets received: 7021195 ICMP 144 Dropped: 142
NIM's PID: 208988
net_ether_01_1 [ 1] 2     2     S    192.168.17.11   192.168.17.12
net_ether_01_1 [ 1] en5              0x38c579e8      0x38fd9425
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 227 Current group: 39
Packets sent    : 4490691 ICMP 130 Errors: 0 No mbuf: 0
Packets received: 7021515 ICMP 89 Dropped: 203
NIM's PID: 180690
net_ether_02_0 [ 2] 2     2     S    192.168.20.11   192.168.20.12
net_ether_02_0 [ 2] en4              0x38c579e9      0x38f067a4
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 226 Current group: 128
Packets sent    : 4490887 ICMP 135 Errors: 0 No mbuf: 0
Packets received: 7020968 ICMP 168 Dropped: 198
NIM's PID: 160190
net_ether_02_1 [ 3] 2     2     S    192.168.19.11   192.168.19.12
net_ether_02_1 [ 3] en3              0x38c579ea      0x38c57aa6
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 311 Current group: 311
Packets sent    : 4491083 ICMP 200 Errors: 0 No mbuf: 0
Packets received: 7020754 ICMP 77 Dropped: 270
NIM's PID: 156052
rs232_0        [ 4] 2     2     S    255.255.0.0     255.255.0.1
rs232_0        [ 4] tty0             0x80f0673a      0x8109687a
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
Missed HBs: Total: 124 Current group: 0
Packets sent    : 3087618 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 3201908 ICMP 0 Dropped: 21
NIM's PID: 204938
  2 locally connected Clients with PIDs:
haemd(245990) hagsd(213044)
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 20 seconds
  Client Heartbeating Disabled.
  Configuration Instance = 18
  Daemon employs no security
  Segments pinned: Text Data.
  Text segment size: 809 KB. Static data segment size: 1520 KB.
  Dynamic data segment size: 4737. Number of outstanding malloc: 269
  User time 407 sec. System time 282 sec.
  Number of page faults: 1353. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
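The HB Interval and Sensitivity values that appear in both outputs give a rough idea of how long the cluster needs to declare a heartbeat path dead. The sketch below is my own addition and rests on an assumption: that the HACMP rule of thumb "failure detection time = heartbeat rate x failure cycle x 2" applies here, with HB Interval as the heartbeat rate and Sensitivity as the failure cycle. The function name is hypothetical.

```python
def failure_detection_time(hb_interval_secs: float, sensitivity: int) -> float:
    """Approximate failure detection time in seconds.

    Assumed rule of thumb: heartbeat interval * missed-beat sensitivity * 2.
    """
    return hb_interval_secs * sensitivity * 2


# (HB Interval, Sensitivity) pairs as printed by lssrc -ls topsvcs above.
NETWORK_SETTINGS = {
    "net_ether_*": (1.0, 10),
    "rs232_0": (2.0, 5),
}

if __name__ == "__main__":
    for name, (interval, sensitivity) in NETWORK_SETTINGS.items():
        seconds = failure_detection_time(interval, sensitivity)
        print(f"{name}: about {seconds:.0f} s to detect a failure")
```

Under that assumption both the Ethernet and the RS232 heartbeat networks come out at about 20 seconds, so the tuning values themselves look identical on the failing and the healthy cluster.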
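1. Is there a problem with the serial (RS232) heartbeat?
2. What causes the errors reported in errpt, and how can they be eliminated?
3. Suppose the serial heartbeat is down and only the network heartbeat is used. Given that resource groups can be switched while both hosts are up, should the surviving host in theory be able to take over the resource groups when one host goes down?
4. Where does the likely cause of the current problem lie?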