IBM HA 5.3: resource group does not fail over automatically when a host goes down [Solved]

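Two IBM midrange servers run the office mail system, both with HA version 5.3. The customer reports that when both hosts, gsoamail1 (10.52.4.105) and gsoamail2 (10.52.4.106), are up, a manual resource group move works normally. When one host goes down, however, the other host does not automatically take over the failed host's resource groups.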

Checking the hosts turned up the following: errpt on gsoamail1 (10.52.4.105) contains the errors below, while gsoamail2 (10.52.4.106) has none.

gsoamail1/tmp/apl #errpt|more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
0873CF9F   0131162513 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131162313 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131162013 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161813 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161613 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161413 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161213 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161013 T S tty0           TTYHOG OVER-RUN
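The errpt summary timestamps use the MMDDhhmmYY layout, so the spacing between the tty0 errors can be worked out directly. The following is a minimal sketch in Python; the function names are illustrative and not part of any AIX tooling, and the timestamps are copied from the listing above.

```python
from datetime import datetime

def parse_errpt_timestamp(stamp: str) -> datetime:
    """Parse an errpt summary timestamp written as MMDDhhmmYY."""
    return datetime.strptime(stamp, "%m%d%H%M%y")

def gaps_in_minutes(stamps):
    """Minutes between consecutive entries; errpt lists the newest entry first."""
    times = [parse_errpt_timestamp(s) for s in stamps]
    return [int((newer - older).total_seconds() // 60)
            for newer, older in zip(times, times[1:])]

# Timestamps of the eight TTYHOG entries shown in the thread
STAMPS = ["0131162513", "0131162313", "0131162013", "0131161813",
          "0131161613", "0131161413", "0131161213", "0131161013"]

if __name__ == "__main__":
    print(parse_errpt_timestamp(STAMPS[0]))
    print(gaps_in_minutes(STAMPS))
```

The entries recur every two to three minutes, which suggests an ongoing condition on tty0 rather than a one-off event.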

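gsoamail1/tmp/apl #errpt -a|more
---------------------------------------------------------------------------
LABEL:           TTY_TTYHOG
IDENTIFIER:      0873CF9F

Date/Time:       Thu Jan 31 16:25:08 2013
Sequence Number: 57660
Machine Id:      00C2FF704C00
Node Id:         gsoamail1
Class:           S
Type:            TEMP
Resource Name:   tty0

Description
TTYHOG OVER-RUN

Failure Causes
EXCESSIVE LOAD ON PROCESSOR

        Recommended Actions
        REDUCE SYSTEM LOAD.
        REDUCE SERIAL PORT BAUD RATE

Duplicates
Number of duplicates
999
Time of first duplicate
Thu Jan 31 16:23:02 2013
Time of last duplicate
Thu Jan 31 16:25:08 2013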

Running lssrc -ls topsvcs shows the following:

gsoamail1/tmp/apl #lssrc -ls topsvcs
Subsystem         Group            PID     Status
 topsvcs          topsvcs          385494  active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
net_ether_01_0 [ 0] 2     2     S    192.168.17.13   192.168.17.14
net_ether_01_0 [ 0] en5              0x38c95c0b      0x38c9dbd1
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 15 Current group: 15
Packets sent    : 4233130 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6618604 ICMP 0 Dropped: 0
NIM's PID: 532624
net_ether_01_1 [ 1] 2     2     S    192.168.18.13   192.168.18.14
net_ether_01_1 [ 1] en6              0x38c95c0c      0x38c9dbd2
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 16 Current group: 16
Packets sent    : 4232608 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6618145 ICMP 0 Dropped: 0
NIM's PID: 86210
net_ether_02_0 [ 2] 2     2     S    192.168.20.13   192.168.20.14
net_ether_02_0 [ 2] en4              0x38c95c0d      0x38c9dbd3
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 10 Current group: 10
Packets sent    : 4232883 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6618019 ICMP 0 Dropped: 0
NIM's PID: 451000
net_ether_02_1 [ 3] 2     2     S    192.168.19.13   192.168.19.14
net_ether_02_1 [ 3] en3              0x38c95c0e      0x38c9dbd4
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 11 Current group: 11
Packets sent    : 4233009 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6617901 ICMP 0 Dropped: 0
NIM's PID: 200830
rs232_0        [ 4] 2     0     D    255.255.0.0     # abnormal state
rs232_0        [ 4] tty0             Adapter state unknown
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
  2 locally connected Clients with PIDs:
haemd(184696) hagsd(442672)
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 20 seconds
  Client Heartbeating Disabled.
  Configuration Instance = 9
  Daemon employs no security
  Segments pinned: Text Data.
  Text segment size: 809 KB. Static data segment size: 1520 KB.
  Dynamic data segment size: 4545. Number of outstanding malloc: 257
  User time 391 sec. System time 254 sec.
  Number of page faults: 117. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
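Spotting the bad heartbeat ring in this output by eye is easy to get wrong, so a small filter can help. The sketch below, in Python, scans lssrc -ls topsvcs text for network entries whose state is not S or whose member count (Mbrs) is lower than the defined count (Defd). It assumes the usual column order (name, index, Defd, Mbrs, St) and that S means a stable group; the function name and the embedded sample are illustrative only.

```python
import re

# First line of each network entry, for example:
# "rs232_0        [ 4] 2     0     D    255.255.0.0"
ENTRY = re.compile(r"^(\S+)\s+\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+([A-Z])\b")

def unhealthy_networks(lssrc_output: str):
    """Return (name, defined, members, state) for entries that are not in
    state S or that have fewer members than defined adapters."""
    problems = []
    for line in lssrc_output.splitlines():
        match = ENTRY.match(line.strip())
        if not match:
            continue  # second line of an entry, counters, or trailer text
        name, _index, defined, members, state = match.groups()
        if state != "S" or int(members) < int(defined):
            problems.append((name, int(defined), int(members), state))
    return problems

# Excerpt of the gsoamail1 output from the thread
SAMPLE = """\
net_ether_01_0 [ 0] 2     2     S    192.168.17.13   192.168.17.14
net_ether_01_0 [ 0] en5              0x38c95c0b      0x38c9dbd1
rs232_0        [ 4] 2     0     D    255.255.0.0
rs232_0        [ 4] tty0             Adapter state unknown
"""

if __name__ == "__main__":
    for problem in unhealthy_networks(SAMPLE):
        print(problem)
```

On the excerpt above, only rs232_0 is reported: two adapters are defined, none are members, and the state is D.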

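By contrast, on a system where the cluster works correctly, the status looks like this (the original post marked the relevant part in red, presumably the rs232_0 entry):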

[app1:root:/#]lssrc -ls topsvcs
Subsystem         Group            PID     Status
 topsvcs          topsvcs          241690  active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
net_ether_01_0 [ 0] 2     2     S    192.168.18.11   192.168.18.12
net_ether_01_0 [ 0] en6              0x38c579e7      0x38c57aa5
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 181 Current group: 181
Packets sent    : 4490908 ICMP 112 Errors: 0 No mbuf: 0
Packets received: 7021195 ICMP 144 Dropped: 142
NIM's PID: 208988
net_ether_01_1 [ 1] 2     2     S    192.168.17.11   192.168.17.12
net_ether_01_1 [ 1] en5              0x38c579e8      0x38fd9425
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 227 Current group: 39
Packets sent    : 4490691 ICMP 130 Errors: 0 No mbuf: 0
Packets received: 7021515 ICMP 89 Dropped: 203
NIM's PID: 180690
net_ether_02_0 [ 2] 2     2     S    192.168.20.11   192.168.20.12
net_ether_02_0 [ 2] en4              0x38c579e9      0x38f067a4
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 226 Current group: 128
Packets sent    : 4490887 ICMP 135 Errors: 0 No mbuf: 0
Packets received: 7020968 ICMP 168 Dropped: 198
NIM's PID: 160190
net_ether_02_1 [ 3] 2     2     S    192.168.19.11   192.168.19.12
net_ether_02_1 [ 3] en3              0x38c579ea      0x38c57aa6
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 311 Current group: 311
Packets sent    : 4491083 ICMP 200 Errors: 0 No mbuf: 0
Packets received: 7020754 ICMP 77 Dropped: 270
NIM's PID: 156052
rs232_0        [ 4] 2     2     S    255.255.0.0     255.255.0.1
rs232_0        [ 4] tty0             0x80f0673a      0x8109687a
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
Missed HBs: Total: 124 Current group: 0
Packets sent    : 3087618 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 3201908 ICMP 0 Dropped: 21
NIM's PID: 204938
  2 locally connected Clients with PIDs:
haemd(245990) hagsd(213044)
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 20 seconds
  Client Heartbeating Disabled.
  Configuration Instance = 18
  Daemon employs no security
  Segments pinned: Text Data.
  Text segment size: 809 KB. Static data segment size: 1520 KB.
  Dynamic data segment size: 4737. Number of outstanding malloc: 269
  User time 407 sec. System time 282 sec.
  Number of page faults: 1353. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
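The HB Interval and Sensitivity values in both outputs determine how long Topology Services waits before declaring an adapter dead. A commonly cited rule of thumb for HACMP is that detection time is roughly the heartbeat interval times the sensitivity times two. That rule is an assumption here, not something stated in the thread. A minimal worked example in Python:

```python
def detection_time(interval_secs: float, sensitivity: int) -> float:
    """Approximate failure detection time in seconds, using the rule of thumb
    interval * sensitivity * 2 (an assumption, not taken from the thread)."""
    return interval_secs * sensitivity * 2

# (interval, sensitivity) pairs as shown in the lssrc output
SETTINGS = {
    "net_ether_*": (1.0, 10),
    "rs232_0": (2.0, 5),
}

if __name__ == "__main__":
    for name, (interval, sensitivity) in SETTINGS.items():
        print(name, detection_time(interval, sensitivity), "seconds")
```

Under that assumption, the Ethernet and serial networks both work out to about 20 seconds, so the two heartbeat paths are tuned to detect a failure on the same time scale.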

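1. Is there a problem with the serial heartbeat?

2. What causes the errors in errpt, and how can they be eliminated?

3. Suppose the serial heartbeat is down and only the network heartbeat is in use. Resource groups can be moved when both hosts are up. When one host goes down, should the other host, in theory, still be able to take over the resource groups?

4. Where might the fault lie at this point?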

8 answers

zjttt1981, QA engineer, 123
dingdingdingdingdingdingdingdingding
IT training and education · 2014-01-03
mdl9630, system O&M engineer, 西安民航凯亚科技
Learned something new, thanks!
Rail transit · 2013-12-24
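hai314615910, system O&M engineer, xxxxxxxxxx
The serial port did have a problem. After powering down and reseating the cable, the serial heartbeat returned to normal, but failover still did not work.
Further checking showed the cause: the customer had extended a cluster file system on one host without synchronizing the change to the other node.
Systems integration · 2013-09-24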
flm20080704, system engineer, XXXX
On my side, I stopped the cluster and then checked: the serial communication test passed and topsvcs looked normal. Restarting the cluster went fine, the cluster status was normal, and the error stopped being logged.
IT (other) · 2013-08-15
mailboyvip, student, 北航
Has the problem been solved?
Internet services · 2013-06-14
flm20080704, system engineer, XXXX
Bumping this; I have run into a similar situation. For now I cannot stop the cluster to check serial communication. smitty hacmp -> problem tools -> View Current State shows the resource groups and networks as normal, but the serial network shows Network Name: net_rs232_01  State: DOWN. tty0 is currently set to 19200 baud, and the 5723 async adapter shows as available.
LABEL:          TS_NIM_ERROR_STUCK_
IDENTIFIER:     3D32B80D

Date/Time:       Sun Jun  9 11:15:45 BEIST 2013
Sequence Number: 4672
Machine Id:      00C895704C00
Node Id:         ERCC_A
Class:           S
Type:            PERM
Resource Name:   topsvcs         

Description
NIM thread blocked

Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU

User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.21,5943            
ERROR ID
6BUfAx.VBzgF/YCR./gQ.8....................
REFERENCE CODE
                                          
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
          20
Interface name
tty0
IT (other) · 2013-06-13
zhanghhai, system engineer, 银科控股有限公司
First confirm that the serial cable is physically connected properly, then run lsattr -El tty0 on each host to check that the device state is normal. If it is, test serial communication between the two hosts. If possible, swap the serial cable and test again.
Securities · 2013-03-05
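When comparing lsattr -El tty0 on the two hosts, a mismatch in line settings is one thing worth ruling out. The sketch below, in Python, parses lsattr-style output (attribute name in the first column, value in the second) and reports settings that differ between nodes. The attribute names speed, parity, bpc and stops, and both sample outputs, are assumptions for illustration and were not posted in the thread.

```python
def parse_lsattr(output: str) -> dict:
    """Map attribute name to value from `lsattr -El <device>` style text."""
    attrs = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            attrs[parts[0]] = parts[1]
    return attrs

def mismatches(node_a: str, node_b: str,
               keys=("speed", "parity", "bpc", "stops")) -> dict:
    """Return {attribute: (value_on_a, value_on_b)} for settings that differ."""
    a, b = parse_lsattr(node_a), parse_lsattr(node_b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Hypothetical sample output, not taken from the thread
NODE_A = """\
speed  9600  BAUD rate            True
parity none  PARITY               True
bpc    8     BITS per character   True
stops  1     Number of STOP BITS  True
"""
NODE_B = """\
speed  19200 BAUD rate            True
parity none  PARITY               True
bpc    8     BITS per character   True
stops  1     Number of STOP BITS  True
"""

if __name__ == "__main__":
    print(mismatches(NODE_A, NODE_B))
```

Matching settings do not prove the link works, so the cable check and the end-to-end serial test described above are still needed.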
2008010755, software development engineer, 交通运输部规划研究院
I honestly don't know this one.
Internet services · 2013-02-06

Asked by

hai314615910
System O&M engineer, xxxxxxxxxx
Areas of expertise: storage, networking, servers

Question status

  • Published: 2013-01-31
  • Last answer: 2014-01-03