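Environment: the mobile banking system runs on two P6 550 servers in a PowerHA cluster. One night, operations noticed an alert: one of the hosts had gone down unexpectedly. After getting the call I rushed to the site and found the amber light already glaring on the P6 front panel. To preserve the scene, I left that machine untouched and first checked whether the other host held any clues about the crash.
1. Related errpt errors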
Description
Possible malfunction on local adapter
Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured
Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured
Recommended Actions
Verify adapter configuration
Verify network connectivity
2. PowerHA error log
May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT START: node_down SJbank2
May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: node_down SJbank2 0
May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT START: node_down_complete SJbank2
May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: node_down_complete SJbank2 0
May 14 23:38:34 SJbank1 daemon:notice topsvcs[181226]: (Recorded using libct_ffdc.a cv 2):::Error ID:6zV5DL.myHL9/i0x/6LF.4....................:::Reference ID: :::Template ID:173c787f:::Details File: :::Location:rsct,nim_control.C,1.39.1.18,4303 :::TS_LOC_DOWN_ST Possible malfunction on local adapter Adapter interface name tty0 Adapter offset 2 Adapter IP address 255.255.0.0
May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT START: network_down minus 1 net_rs232_01
May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: network_down minus 1 net_rs232_01 0
May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT START: network_down_complete minus 1 net_rs232_01
May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: network_down_complete minus 1 net_rs232_01 0
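The logs above were enough to essentially pin down the culprit:
an anomaly in the PowerHA serial-port (RS232) heartbeat at that moment was what led to one host going down.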
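The event timeline behind this reading can be pulled out of the log mechanically. Below is a minimal Python sketch, assuming the HACMP event lines look like the ones quoted above. The names `SAMPLE_LOG`, `EVENT_RE`, `parse_events` and `first_start` are hypothetical helpers written for this illustration and are not part of PowerHA or AIX. The sample lines are copied from the log above, with a space restored between the hostname and the syslog facility.

```python
import re

# Sample lines copied from the cluster log quoted above
# (hostname/facility spacing normalized for readability).
SAMPLE_LOG = """\
May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT START: node_down SJbank2
May 14 23:38:15 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: node_down SJbank2 0
May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT START: network_down minus 1 net_rs232_01
May 14 23:38:36 SJbank1 user:notice HACMP for AIX: EVENT COMPLETED: network_down minus 1 net_rs232_01 0
"""

# Hypothetical pattern: syslog timestamp, anything up to the HACMP marker,
# then the event phase, the event name and the remaining arguments.
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+.*?"
    r"HACMP for AIX: EVENT (?P<phase>START|COMPLETED): "
    r"(?P<event>\S+)\s*(?P<args>.*)$"
)


def parse_events(text):
    """Return (timestamp, phase, event, args) tuples for each HACMP event line."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(
                (
                    match.group("ts"),
                    match.group("phase"),
                    match.group("event"),
                    match.group("args").strip(),
                )
            )
    return events


def first_start(events, name):
    """Return the timestamp of the first START record for the named event."""
    for ts, phase, event, _args in events:
        if phase == "START" and event == name:
            return ts
    return None


if __name__ == "__main__":
    for ts, phase, event, args in parse_events(SAMPLE_LOG):
        print(f"{ts}  {phase:<9}  {event}  {args}")
```

Run against the excerpt, this lists `node_down SJbank2` at 23:38:15 and `network_down` on `net_rs232_01` at 23:38:36, which makes the order of the events easy to check when correlating them with the errpt entry.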
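With the cause identified, the next step was to bring the host back up. Then came a surprise: the host would not boot and finally stalled at code 11002630. This looked like a hardware problem, so I quickly called in the original vendor.
The vendor said the failure was caused by the CPU Regulator. A spare part was brought in, and once it was replaced the host booted normally.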
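From the community discussion event "Getting to the Bottom of Outages: An In-Depth Analysis of What Really Causes Crashes" (起底宕机事故-深度剖析宕机真相)
Posted by community member "hp_hp"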