Locating Hardware Problems with the IBM Hardware Information Center

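This post walks through one hardware troubleshooting session on an AIX server to introduce an approach to fault diagnosis. Criticism and corrections are welcome.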

# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
BFE4C025   0416192308 P H sysplanar0     UNDETERMINED ERROR
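
The summary line above packs six fields into one row, and the TIMESTAMP column uses the compact MMDDhhmmYY layout. As a minimal sketch (the function name parse_errpt_summary is made up for illustration and is not part of AIX), the line can be split and the timestamp decoded like this:

```python
from datetime import datetime


def parse_errpt_summary(line):
    """Split one line of errpt summary output into named fields."""
    # The description may contain spaces, so split at most five times.
    ident, stamp, etype, eclass, resource, desc = line.split(None, 5)
    return {
        "identifier": ident,
        # errpt prints the timestamp as MMDDhhmmYY.
        "timestamp": datetime.strptime(stamp, "%m%d%H%M%y"),
        "type": etype,        # P = permanent
        "class": eclass,      # H = hardware
        "resource": resource,
        "description": desc.strip(),
    }


entry = parse_errpt_summary(
    "BFE4C025   0416192308 P H sysplanar0     UNDETERMINED ERROR"
)
print(entry["timestamp"], entry["resource"], entry["description"])
```

The decoded value should agree with the Date/Time field of the detailed report for the same identifier.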

# errpt -aj BFE4C025
---------------------------------------------------------------------------
LABEL:          SCAN_ERROR_CHRP
IDENTIFIER:     BFE4C025

Date/Time:       Wed Apr 16 19:23:10 2008
Sequence Number: 120
Machine Id:      000599F6D700
Node Id:         PEKAX019
Class:           H
Type:            PERM
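Resource Name:   sysplanar0
Resource Class:  planar
Resource Type:   sysplanar_rspc
Location:

# Note: this is a system planar error. From experience, first review the
# related logs with: diag -d sysplanar0 -v -e
# Then run lsmcode -A to check whether the microcode is out of date.
# If the microcode is fine, a hardware fault is the likely cause.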

Description
UNDETERMINED ERROR

Failure Causes
UNDETERMINED

        Recommended Actions
        RUN SYSTEM DIAGNOSTICS.

Detail Data
PROBLEM DATA
0644 00E0 0000 01B4 8E00 8E00 0000 0000 0000 0000 4942 4D00 5048 0030 0100 EA10

... (some lines omitted)
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Diagnostic Analysis
Diagnostic Log sequence number: 104
Resource tested:        sysplanar0
Resource Description:   System Planar
Location:            
SRC:                    B17CE433  
Description:            Surveillance Error Predictive Error, general. Refer to
                        the system service documentation for more information.
Additional Words:       2-030000F0 3-53B71510 4-C13920FF 5-400000FF
                        6-00000000 7-000007F7 8-00000800 9-00000000
Possible FRUs:
    Priority: H Maintainence Procedure: FSPSP33
    Location: n/a
    Priority: M Maintainence Procedure: FSPSP04
    Location: n/a
    Priority: L FRU: 32N1272 S/N: YL1126327097 CCIN: 293A
    Location: U787F.001.DPM2DCM-P1-C7

---------------------------------------------------------------------------
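
The values to look up next all come from the Diagnostic Analysis part of the report: the SRC and the Possible FRUs callouts. The following is a rough sketch of pulling them out of saved errpt -a text, assuming the layout shown above; parse_callouts and SAMPLE are illustrative names, and the field spelling "Maintainence Procedure" is copied from the report as printed.

```python
import re

SAMPLE = """\
SRC:                    B17CE433
Possible FRUs:
    Priority: H Maintainence Procedure: FSPSP33
    Location: n/a
    Priority: M Maintainence Procedure: FSPSP04
    Location: n/a
    Priority: L FRU: 32N1272 S/N: YL1126327097 CCIN: 293A
    Location: U787F.001.DPM2DCM-P1-C7
"""


def parse_callouts(text):
    """Return (src, callouts) from the Diagnostic Analysis text."""
    src = None
    callouts = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("SRC:"):
            src = line.split(":", 1)[1].strip()
        elif line.startswith("Priority:"):
            m = re.match(r"Priority:\s*(\w)\s+(.*)", line)
            if m is None:
                continue  # skip a callout line that does not fit the layout
            entry = {"priority": m.group(1)}
            # The rest of the line is a run of "Key: value" pairs.
            for key, value in re.findall(r"([A-Za-z/ ]+?):\s*(\S+)", m.group(2)):
                entry[key.strip()] = value
            callouts.append(entry)
        elif line.startswith("Location:") and callouts:
            # A Location line belongs to the callout just before it.
            callouts[-1]["Location"] = line.split(":", 1)[1].strip()
    return src, callouts


src, callouts = parse_callouts(SAMPLE)
print("SRC:", src)
for c in callouts:
    print(c)
```

The callouts come out in the order the report lists them (H, M, L), which is also the order in which they are looked up in the rest of this post.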

Open the IBM Hardware Information Center:

http://publib.boulder.ibm.com/in ... ys/v3r1m5/index.jsp

Search for each of the following items.

1) SRC  B17CE433

A System Reference Code (SRC) is a code that identifies a system error.

Explanation
This error log entry is generated when the HMC fails to send its heartbeat message within the allotted time. The reason could be network issues, or the Ethernet cable is disconnected.
Response
If this is a tracking event, no service actions are required. Otherwise, use the FRU and procedure callouts detailed with the SRC to determine service actions.


2)FSPSP33:
A problem has been detected in the connection with the HMC.
    Ensure that the cable connectors to the network from the HMC, managed system, managed system partitions, and other HMCs are securely connected. If the connections are not secure, plug the cables back into the proper spots and make sure that the connections are good.
    Check to see if the HMC is working correctly or if the HMC was disconnected incorrectly from the managed system, managed system partitions, and other HMCs. If either has happened, reboot the HMC. For more information, see Shutting down, rebooting, and logging off the HMC.
    Verify that the network connection between the HMC, managed system, managed system partitions, and other HMCs is working properly. If you have a high performance switch (HPS) network, verify that the network connection to the CSM Management Server is also working. If the connection is not working properly, contact the customer network support to correct the problems.
    If applicable, service the next FRU.
    If the problem continues to persist, contact your next level of support. This ends the procedure.


3)FSPSP04:
A problem has been detected in the service processor firmware.


4)FRU:32N1272

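Field Replaceable Unit (FRU)

An FRU is a replaceable part of a machine. Mainly to save cost, vendors divide equipment into multiple FRUs that are swapped out rather than repaired. (A search for this FRU number returned no results; sometimes that is simply how it goes!)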


5)CCIN:293A

Customer Card Identification Number (CCIN)


6)Location: U787F.001.DPM2DCM-P1-C7

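This is the physical location code of the part. U787F.001.DPM2DCM identifies the enclosure (unit type, model, and serial number), and P1-C7 identifies the position inside it (planar 1, card slot 7).

Combining the Location with the FRU and CCIN pinpoints the actual device. When doing so, cross-check against the Maintenance Procedure callouts to avoid identifying the wrong part.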
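
The location code from item 6 can be taken apart mechanically. This small sketch assumes the usual IBM physical location code layout, in which the part before the first hyphen is unit type, model, and serial number separated by dots, and the remaining hyphen-separated labels give the position (P for planar, C for card slot). The helper name split_location_code is invented for illustration.

```python
def split_location_code(code):
    """Split a physical location code into enclosure and position labels."""
    # Everything before the first "-" names the enclosure.
    unit, *labels = code.split("-")
    unit_type, model, serial = unit.split(".")
    return {
        "unit_type": unit_type,
        "model": model,
        "serial": serial,
        "labels": labels,  # e.g. planar and card-slot labels
    }


loc = split_location_code("U787F.001.DPM2DCM-P1-C7")
print(loc)
```

The labels list is what gets matched against the slot markings on the chassis, while the serial confirms that the right enclosure is being opened.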

Location result

5616035720110914180155019.jpg



Related notes

561603572011091418022005.jpg
