Tags: system integration, fault diagnosis, system maintenance, AIX 5.3

IBM HA 5.3: resource group does not fail over automatically when a host goes down [Solved]

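Two IBM midrange servers run an office mail system, and the HA version on both is 5.3. The customer reports that when both hosts, gsoamail1 (10.52.4.105) and gsoamail2 (10.52.4.106), are up, the resource group can be moved between them without problems. When one host goes down, however, the other host does not automatically take over the failed host's resource group.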

Checking the hosts turned up the following: errpt on gsoamail1 (10.52.4.105) shows the errors below, while gsoamail2 (10.52.4.106) shows none.

gsoamail1/tmp/apl # errpt | more

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
0873CF9F   0131162513 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131162313 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131162013 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161813 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161613 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161413 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161213 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161013 T S tty0           TTYHOG OVER-RUN
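To see at a glance how often the error recurs, the summary listing above can be tallied with a short script. The following is only a minimal Python sketch: the rows are transcribed from the listing (with the description given in English), and the function name `parse_errpt_summary` and the dictionary keys are illustrative choices, not part of any AIX tool. It assumes the usual errpt summary timestamp layout of MMDDhhmmYY.

```python
from collections import Counter
from datetime import datetime

# Rows transcribed from the errpt summary in the post.
ERRPT_SUMMARY = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
0873CF9F   0131162513 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131162313 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131162013 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161813 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161613 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161413 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161213 T S tty0           TTYHOG OVER-RUN
0873CF9F   0131161013 T S tty0           TTYHOG OVER-RUN
"""


def parse_errpt_summary(text):
    """Return one dict per data row; header and blank lines are skipped."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        # A data row has six fields and a 10-digit timestamp in field two.
        if len(parts) < 6 or len(parts[1]) != 10 or not parts[1].isdigit():
            continue
        ident, stamp, err_type, err_class, resource, description = parts
        # Timestamp layout is MMDDhhmmYY; assume years 2000-2099.
        when = datetime(
            2000 + int(stamp[8:10]),
            int(stamp[0:2]),
            int(stamp[2:4]),
            int(stamp[4:6]),
            int(stamp[6:8]),
        )
        entries.append(
            {
                "id": ident,
                "time": when,
                "type": err_type,
                "class": err_class,
                "resource": resource,
                "description": description,
            }
        )
    return entries


entries = parse_errpt_summary(ERRPT_SUMMARY)
counts = Counter((e["id"], e["resource"]) for e in entries)
times = sorted(e["time"] for e in entries)
span_minutes = (times[-1] - times[0]).total_seconds() / 60

for (ident, resource), n in counts.items():
    print(f"{ident} on {resource}: {n} entries over {span_minutes:.0f} minutes")
```

For the rows shown, this reports eight entries for 0873CF9F on tty0 within 15 minutes, roughly one every two minutes.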

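gsoamail1/tmp/apl # errpt -a | more

---------------------------------------------------------------------------
LABEL:           TTY_TTYHOG
IDENTIFIER:      0873CF9F

Date/Time:       Thu Jan 31 16:25:08 2013
Sequence Number: 57660
Machine Id:      00C2FF704C00
Node Id:         gsoamail1
Class:           S
Type:            TEMP
Resource Name:   tty0

Description
TTYHOG OVER-RUN

Failure Causes
EXCESSIVE LOAD ON PROCESSOR

        Recommended Actions
        REDUCE SYSTEM LOAD.
        REDUCE SERIAL PORT BAUD RATE

Duplicates
Number of duplicates
        999
Time of first duplicate
        Thu Jan 31 16:23:02 2013
Time of last duplicate
        Thu Jan 31 16:25:08 2013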

The lssrc -ls topsvcs command shows the following:

gsoamail1/tmp/apl # lssrc -ls topsvcs

Subsystem         Group            PID     Status
 topsvcs          topsvcs          385494  active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
net_ether_01_0 [ 0] 2     2     S    192.168.17.13   192.168.17.14
net_ether_01_0 [ 0] en5              0x38c95c0b      0x38c9dbd1
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 15 Current group: 15
Packets sent    : 4233130 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6618604 ICMP 0 Dropped: 0
NIM's PID: 532624
net_ether_01_1 [ 1] 2     2     S    192.168.18.13   192.168.18.14
net_ether_01_1 [ 1] en6              0x38c95c0c      0x38c9dbd2
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 16 Current group: 16
Packets sent    : 4232608 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6618145 ICMP 0 Dropped: 0
NIM's PID: 86210
net_ether_02_0 [ 2] 2     2     S    192.168.20.13   192.168.20.14
net_ether_02_0 [ 2] en4              0x38c95c0d      0x38c9dbd3
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 10 Current group: 10
Packets sent    : 4232883 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6618019 ICMP 0 Dropped: 0
NIM's PID: 451000
net_ether_02_1 [ 3] 2     2     S    192.168.19.13   192.168.19.14
net_ether_02_1 [ 3] en3              0x38c95c0e      0x38c9dbd4
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 11 Current group: 11
Packets sent    : 4233009 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6617901 ICMP 0 Dropped: 0
NIM's PID: 200830
rs232_0        [ 4] 2     0     D    255.255.0.0     # abnormal state
rs232_0        [ 4] tty0             Adapter state unknown
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
  2 locally connected Clients with PIDs:
haemd(184696) hagsd(442672)
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 20 seconds
  Client Heartbeating Disabled.
  Configuration Instance = 9
  Daemon employs no security
  Segments pinned: Text Data.
  Text segment size: 809 KB. Static data segment size: 1520 KB.
  Dynamic data segment size: 4545. Number of outstanding malloc: 257
  User time 391 sec. System time 254 sec.
  Number of page faults: 117. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
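The abnormal rs232_0 entry can also be picked out mechanically. The sketch below is a hedged illustration in Python: it scans the per-network rows transcribed from the listing above and flags any network whose member count (Mbrs) is below the defined count (Defd) or whose state is not `S`. Treating `S` as the healthy state is an assumption based on the healthy listing in this thread; the helper name `unhealthy_networks` is made up for this example.

```python
import re

# Network rows transcribed from the gsoamail1 listing (statistics lines omitted).
TOPSVCS_OUTPUT = """\
net_ether_01_0 [ 0] 2 2 S 192.168.17.13 192.168.17.14
net_ether_01_0 [ 0] en5 0x38c95c0b 0x38c9dbd1
net_ether_01_1 [ 1] 2 2 S 192.168.18.13 192.168.18.14
net_ether_01_1 [ 1] en6 0x38c95c0c 0x38c9dbd2
net_ether_02_0 [ 2] 2 2 S 192.168.20.13 192.168.20.14
net_ether_02_0 [ 2] en4 0x38c95c0d 0x38c9dbd3
net_ether_02_1 [ 3] 2 2 S 192.168.19.13 192.168.19.14
net_ether_02_1 [ 3] en3 0x38c95c0e 0x38c9dbd4
rs232_0 [ 4] 2 0 D 255.255.0.0
rs232_0 [ 4] tty0 Adapter state unknown
"""

# Matches only the first row of each network: name, [index], Defd, Mbrs, St, adapter.
# The second row (interface name and group IDs) has no numeric Defd, so it is skipped.
ROW = re.compile(r"^(\S+)\s+\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+([A-Z])\s+(\S+)")


def unhealthy_networks(text):
    """Return (name, defd, mbrs, state) for networks that look unhealthy."""
    problems = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        name, _index, defd, mbrs, state, _adapter = match.groups()
        # Assumption: 'S' is the healthy state and Mbrs should equal Defd.
        if int(mbrs) < int(defd) or state != "S":
            problems.append((name, int(defd), int(mbrs), state))
    return problems


flagged = unhealthy_networks(TOPSVCS_OUTPUT)
for name, defd, mbrs, state in flagged:
    print(f"{name}: {mbrs} of {defd} members, state {state}")
```

Only rs232_0 is flagged for this listing, which matches the note on that row.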

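For comparison, the status on a system where the cluster is healthy is shown below (the relevant part was marked in red in the original post):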

[app1:root:/#]lssrc -ls topsvcs

Subsystem         Group            PID     Status
 topsvcs          topsvcs          241690  active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
net_ether_01_0 [ 0] 2     2     S    192.168.18.11   192.168.18.12
net_ether_01_0 [ 0] en6              0x38c579e7      0x38c57aa5
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 181 Current group: 181
Packets sent    : 4490908 ICMP 112 Errors: 0 No mbuf: 0
Packets received: 7021195 ICMP 144 Dropped: 142
NIM's PID: 208988
net_ether_01_1 [ 1] 2     2     S    192.168.17.11   192.168.17.12
net_ether_01_1 [ 1] en5              0x38c579e8      0x38fd9425
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 227 Current group: 39
Packets sent    : 4490691 ICMP 130 Errors: 0 No mbuf: 0
Packets received: 7021515 ICMP 89 Dropped: 203
NIM's PID: 180690
net_ether_02_0 [ 2] 2     2     S    192.168.20.11   192.168.20.12
net_ether_02_0 [ 2] en4              0x38c579e9      0x38f067a4
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 226 Current group: 128
Packets sent    : 4490887 ICMP 135 Errors: 0 No mbuf: 0
Packets received: 7020968 ICMP 168 Dropped: 198
NIM's PID: 160190
net_ether_02_1 [ 3] 2     2     S    192.168.19.11   192.168.19.12
net_ether_02_1 [ 3] en3              0x38c579ea      0x38c57aa6
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 311 Current group: 311
Packets sent    : 4491083 ICMP 200 Errors: 0 No mbuf: 0
Packets received: 7020754 ICMP 77 Dropped: 270
NIM's PID: 156052
rs232_0        [ 4] 2     2     S    255.255.0.0     255.255.0.1
rs232_0        [ 4] tty0             0x80f0673a      0x8109687a
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
Missed HBs: Total: 124 Current group: 0
Packets sent    : 3087618 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 3201908 ICMP 0 Dropped: 21
NIM's PID: 204938
  2 locally connected Clients with PIDs:
haemd(245990) hagsd(213044)
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 20 seconds
  Client Heartbeating Disabled.
  Configuration Instance = 18
  Daemon employs no security
  Segments pinned: Text Data.
  Text segment size: 809 KB. Static data segment size: 1520 KB.
  Dynamic data segment size: 4737. Number of outstanding malloc: 269
  User time 407 sec. System time 282 sec.
  Number of page faults: 1353. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
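The "HB Interval" and "Sensitivity" lines in both listings determine how long Topology Services waits before declaring an adapter failed. As a worked example, the Python sketch below applies the commonly cited rule of thumb that detection time is roughly interval × sensitivity × 2. That factor of 2 is an assumption taken from general RSCT/HACMP tuning guidance, not something stated in this thread, and the function name `detection_time` is illustrative.

```python
import re

# One representative line per network type, copied from the listings above.
HB_LINES = {
    "net_ether_01_0": "HB Interval = 1.000 secs. Sensitivity = 10 missed beats",
    "rs232_0": "HB Interval = 2.000 secs. Sensitivity = 5 missed beats",
}

HB = re.compile(r"HB Interval = ([\d.]+) secs\. Sensitivity = (\d+) missed beats")


def detection_time(line):
    """Estimate failure detection time in seconds from an 'HB Interval' line."""
    match = HB.search(line)
    if match is None:
        raise ValueError(f"not an HB Interval line: {line!r}")
    interval = float(match.group(1))
    sensitivity = int(match.group(2))
    # Assumed rule of thumb: interval * sensitivity * 2.
    return interval * sensitivity * 2


estimates = {name: detection_time(line) for name, line in HB_LINES.items()}
for name, seconds in estimates.items():
    print(f"{name}: about {seconds:.0f} s to detect a failure")
```

Under that assumption, both the Ethernet and the RS232 heartbeat networks in these listings work out to about 20 seconds.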

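1. Is there a problem with the serial heartbeat?

2. What causes the errors in errpt, and how can they be eliminated?

3. Suppose the serial heartbeat is down and heartbeating goes over the network. The resource group can be moved when both hosts are up, so when one host goes down, should the other host in theory be able to take over the resource group?

4. Where is the likely cause of the current problem?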

8 peer answers

zjttt1981, QA engineer, 123

dingdingdingdingdingdingdingdingding

IT training and education · 2014-01-03

mdl9630, system operations engineer, 西安民航凯亚科技

Learned something new, thanks!

Rail transit · 2013-12-24

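hai314615910, system operations engineer, xxxxxxxxxx

The serial port did have a problem. After powering off and reseating the cable, the serial heartbeat returned to normal, but failover still did not work.
Further checking showed the cause: the customer had extended a cluster file system on one host without synchronizing the change across the cluster.

System integration · 2013-09-24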
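The root cause reported above, a file system extended on one node without a cluster synchronization, is the kind of drift that a simple node-to-node comparison can surface. The following Python sketch is purely illustrative: the mount points and sizes are hypothetical placeholders (the thread does not name the file system involved), and `diff_definitions` is a made-up helper. In practice the per-node values would have to be collected separately, for instance from lsfs output on each node.

```python
def diff_definitions(node_a, node_b):
    """Return {name: (value_on_a, value_on_b)} for entries that differ.

    A name missing on one node is reported with None for that node.
    """
    diffs = {}
    for name in sorted(set(node_a) | set(node_b)):
        a_value = node_a.get(name)
        b_value = node_b.get(name)
        if a_value != b_value:
            diffs[name] = (a_value, b_value)
    return diffs


# Hypothetical file system sizes (in 512-byte blocks) as each node sees them.
# These names and numbers are placeholders, not data from the thread.
gsoamail1_fs = {"/mail": 209715200, "/maillog": 20971520}
gsoamail2_fs = {"/mail": 104857600, "/maillog": 20971520}

drift = diff_definitions(gsoamail1_fs, gsoamail2_fs)
for name, (on_a, on_b) in drift.items():
    print(f"{name}: gsoamail1={on_a} gsoamail2={on_b}")
```

Any entry reported here would be a prompt to run the cluster verification and synchronization step before relying on failover.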

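flm20080704, systems engineer, XXXX

In my case I stopped the cluster and checked: serial communication tested fine and topsvcs looked normal. After restarting the cluster, it came up normally, the cluster status was normal, and the errors stopped recurring.

IT (other) · 2013-08-15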

mailboyvip, student, Beihang University

Has the problem been solved?

Internet services · 2013-06-14

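flm20080704, systems engineer, XXXX

Bumping this; I have run into a similar situation. For now I cannot stop the cluster to check serial communication. smitty hacmp -> problem tools -> View Current State shows the resource groups and networks as normal, but the serial network shows Network Name: net_rs232_01    State: DOWN. tty0 is currently set to 19200 baud, and the asynchronous adapter (5723) shows as Available.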
LABEL:       TS_NIM_ERROR_STUCK_
IDENTIFIER:    3D32B80D

Date/Time:    Sun Jun  9 11:15:45 BEIST 2013
Sequence Number: 4672
Machine Id:    00C895704C00
Node Id:       ERCC_A
Class:          S
Type:          PERM
Resource Name: topsvcs

Description
NIM thread blocked

Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU

User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention

      Recommended Actions
      Examine I/O and memory activity on the system
      Reduce load on the system
      Tune virtual memory parameters
      Call IBM Service if problem persists

Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O

      Recommended Actions
      Examine I/O and memory activity on the system
      Reduce load on the system
      Tune virtual memory parameters
      Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.21,5943
ERROR ID
6BUfAx.VBzgF/YCR./gQ.8....................
REFERENCE CODE

Thread which was blocked
receive thread
Interval in seconds during which process was blocked
      20
Interface name
tty0

IT (other) · 2013-06-13

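zhanghhai, systems engineer, 银科控股有限公司

First confirm that the serial cable is physically connected properly, then run lsattr -El tty0 on each host to check that the tty's state looks normal. If it does, test serial communication between the two hosts. If you can, swap the serial cable and test again.

Securities · 2013-03-05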

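2008010755, software development engineer, Transport Planning and Research Institute, Ministry of Transport

I honestly don't know how to handle this one.

Internet services · 2013-02-06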
