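IBM P620 + 7133: Handling an HACMP Failover Failure and SSA Loop Errors

A customer case I handled recently: an HACMP failover failure together with SSA loop errors.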

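Environment:

Two P620 F85 servers and one 7133 disk array;

AIX 4.3.3, HACMP 4.3.3

Sybase database application

Symptoms:

1. The P620 host logged 625E6B9A-SSA_LINK_OPEN errors on consecutive days from the 18th to the 22nd;

2. HACMP failover between the primary and standby nodes failed, and the application could not be reached on the standby;

Fault analysis and handling:

I. 625E6B9A-SSA_LINK_OPEN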

# errpt -dH
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
625E6B9A   0819100711 P H ssa0           ADAPTER DETECTED OPEN SERIAL LINK

#errpt -aj 625E6B9A

LABEL:          SSA_LINK_OPEN
IDENTIFIER:     625E6B9A

Date/Time:       Fri Aug 19 10:16:51
Sequence Number: 59983
Machine Id:      000281AF4C00
Node Id:         localhosts
Class:           H
Type:            PERM
Resource Name:   ssa0
Resource Class:  adapter
Resource Type:   ssa160
Location:        27-08
VPD:
        Part Number................. 09L5695
        FRU Number.................. 34L5388
        Serial Number...............S2295106
        EC Level....................    E27782
        Manufacturer................IBM053
        ROS Level and ID............BB00    0000
        Loadable Microcode Level....05
        Device Driver Level.........00
        Displayable Message.........SSA-ADAPTER
        Device Specific.(Z0)........SDRAM=064
        Device Specific.(Z1)........CACHE=00
        Device Specific.(Z2)........UID=006094C1000012E8

Description
ADAPTER DETECTED OPEN SERIAL LINK

Probable Causes
SSA DEVICE POWERED OFF OR FAILED

User Causes
ANOTHER SYSTEM ON THE LINK HAS BEEN POWERED OFF OR AN SSA DEVICE HAS BEEN REMOVED FROM THE LINK

        Recommended Actions
        POWER ON THE OTHER SYSTEM OR REPLACE THE REMOVED DISK DRIVE

Install Causes
CABLING OR POWER FAULT

        Recommended Actions
        CHECK LINK CONFIGURATION

Failure Causes
STORAGE DEVICE CABLE
DISK DRIVE
ADAPTER

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ERROR CODE
2450 0000 0004 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------

# ssa_ela
ssa0 SRN 45000

# diag

# diag > Task Selection (Diagnostics, Advanced Diagnostics, Service Aids, etc.) > SSA Service Aids >

Link Verification > GOOD

Link speed > GOOD

Certify Disk > GOOD

# smitty ssaraid > List All Defined SSA RAID Arrays

The disk array status is GOOD; the physical and logical status of all 4 disks is GOOD.



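Based on the information above, and on material I looked up about the 625E6B9A error on 7133 SSA adapters, a hardware failure of the SSA adapter itself seemed unlikely. The most probable cause was a poor contact in the cabling, at a disk, or at a port, which opened the SSA loop and produced the error.

Resolution: reseated and secured the cables between the P620 and the 7133, then kept monitoring; the error did not recur.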
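
The follow-up monitoring can be made a little more systematic by counting how many times the identifier shows up per day in the error log. The following is only a sketch: the function name `count_by_day` is hypothetical, and it assumes the `errpt` summary layout shown earlier, where TIMESTAMP has the form MMDDhhmmYY.

```shell
#!/bin/sh
# Count error-log entries per day for one error identifier.
# Reads errpt summary lines on stdin; column 1 is IDENTIFIER,
# column 2 is TIMESTAMP in MMDDhhmmYY form (assumed layout).
count_by_day() {
    id="$1"
    awk -v id="$id" '
        $1 == id {
            day = substr($2, 1, 4)   # keep MMDD only
            n[day]++
        }
        END { for (d in n) print d, n[d] }
    ' | sort
}

# Intended use on the AIX host (not executed here):
#   errpt -j 625E6B9A | count_by_day 625E6B9A
```

If the output stops gaining new days after the cables were secured, that supports the poor-contact diagnosis; it does not by itself rule out an intermittent adapter fault.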

II. HACMP failover failure

1. Phase one: started troubleshooting HACMP by checking the heartbeat link, using the commands:

posdb1# cat /etc/hosts >/dev/tty0

posdb2# cat
Result: normal;

# stty
# stty
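Result: normal;

2. Phase two: after the customer's POS business had been stopped, shut down HACMP on POSDB1 and ran a synchronization check;

# smitty hacmp > cluster configure > cluster verify: the HACMP synchronization check reported errors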

Contacting node posdb2 ...
HACMPnim ODM on node posdb2 verified.

Error: HACMPsubnet ODM on the local node is different from that on the remote node posdb2.

Contacting node posdb2 ...
Error: HACMPadapter ODM on the local node is different from that on the remote node posdb2.

Verifying ATM Configuration...
Verifying Cluster Resources...
WARNING: The service IP label (posdb_serv2) on node posdb2 is not configured
to be part of a resource group. Therefore it will not be acquired and used as
a service address by any node.

Remember to redo automatic error notification if configuration has changed.
clconfig: Error(s) have been detected.

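3. Copied the HACMPsubnet and HACMPadapter files from the /etc/objrepos directory on the primary node POSDB1 to the same directory on POSDB2 via FTP, then ran cluster verify again;

4. The second cluster verify passed, so moved on to the next phase: failover testing;

Before starting the test, confirmed that HACMP was running on both the primary and the standby: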

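# lssrc -g cluster (confirm that the three HACMP daemons are running)

# /usr/sbin/cluster/clstat -a (HACMP node status)

# lsvg -o (confirm the state of the shared volume group datavg: varied on / varied off)

# netstat -in (confirm the network adapter status and the boot_ip, services_ip and standby_ip addresses)

# ps -ef, # df -k (confirm the application processes and that /data and /data1 are mounted)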

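5. On POSDB1:

# tail -f /tmp/hacmp.out (opened in a new window on both POSDB1 and POSDB2 to watch HACMP activity in real time)

# smitty clstop, selecting takeover as the shutdown mode,

Once output in the hacmp.out window stopped, confirmed:

# lsvg -o (the shared volume group datavg has been released; only rootvg is listed)

# netstat -i (services_ip has been released; the Address column shows boot_ip and standby_ip)

# ps -ef (the number of sybase processes dropped from 4 to 2; the application has been killed)

# df -k (/data and /data1 are no longer listed)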

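6. On POSDB2, lsvg -o, netstat -in, ps -ef and so on showed that service_ip was active and the application resources had been taken over; business access from the client side was normal; the manual takeover succeeded;

7. Started HACMP on the primary node posdb1,

# smitty clstart, changing the last option to true and leaving the other options unchanged;

The primary node automatically reacquired service_ip, and the application automatically switched back to run on the primary.

(Note: the customer's HACMP is configured as a Cascading hot-standby pair, with POSDB1 having the higher priority)

HACMP was back to normal.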
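
The mismatch reported by cluster verify in step 2 can also be checked by hand before copying any ODM files across. The sketch below assumes each node has dumped the class to a text file with `odmget` (for example `odmget HACMPadapter > /tmp/HACMPadapter.posdb1`) and that both dumps have been brought to one machine; the file names and the function name `odm_dumps_match` are hypothetical.

```shell
#!/bin/sh
# Compare text dumps of one ODM class taken on two cluster nodes.
# Prints "match" and returns 0 if the dumps are byte-identical;
# otherwise prints "differ" plus the diff and returns 1.
odm_dumps_match() {
    local_dump="$1"
    remote_dump="$2"
    if cmp -s "$local_dump" "$remote_dump"; then
        echo "match"
        return 0
    fi
    echo "differ"
    diff "$local_dump" "$remote_dump"
    return 1
}

# Intended use (paths are assumptions, not executed here):
#   odm_dumps_match /tmp/HACMPadapter.posdb1 /tmp/HACMPadapter.posdb2
```

A byte-for-byte comparison may flag dumps whose stanzas are merely ordered differently, so a "differ" result is a prompt to read the diff rather than proof of a configuration error.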

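Results:

1. After the cables were secured, the system error 625E6B9A-SSA_LINK_OPEN did not recur;

2. The HACMP fault, in which the application could not fail over to the standby, was resolved; failover between the two nodes works normally again;

3. # df -k showed that /var was 85% full; advised the customer to clean up /var by deleting junk files and files that are no longer needed;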
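
A check like the one that caught /var can be scripted. This is a minimal sketch that assumes the AIX `df -k` column layout (Filesystem, 1024-blocks, Free, %Used, Iused, %Iused, Mounted on); the function name `over_threshold` and the 80% limit in the usage line are illustrative choices, not values from the case.

```shell
#!/bin/sh
# Print mount points whose %Used is at or above a limit.
# Reads `df -k` output on stdin and assumes the AIX layout:
#   Filesystem 1024-blocks Free %Used Iused %Iused Mounted on
over_threshold() {
    limit="$1"
    awk -v limit="$limit" '
        NR > 1 && $4 ~ /%$/ {        # skip header and rows without a percentage
            used = $4
            sub(/%$/, "", used)
            if (used + 0 >= limit) print $7, $4
        }
    '
}

# Intended use on the AIX host (not executed here):
#   df -k | over_threshold 80
```

On systems whose `df -k` uses a different column order, the field numbers would need adjusting.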
Posted by: hotmail (software development engineer)
Published: 2011-11-28