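After an unexpected AIX reboot, errpt keeps logging errors. Could you help check whether a memory DIMM has failed? How do I identify which DIMM is bad?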

The server rebooted unexpectedly early this morning. The conslog output:

alog -f /var/adm/ras/conslog -o 

         0 Sat May 23 01:50:37 GMT+08:00 2020
         0 Sat May 23 01:50:37 GMT+08:00 2020 Starting Desktop Login on display :0...
         0 Sat May 23 01:50:37 GMT+08:00 2020 Wait for the Desktop Login screen before logging in.
         0 Sat May 23 01:50:37 GMT+08:00 2020
         0 Sat May 23 01:50:48 GMT+08:00 2020
         0 Sat May 23 01:50:48 GMT+08:00 2020 Saving Base Customize Data to boot disk
         0 Sat May 23 01:50:49 GMT+08:00 2020 Starting the sync daemon
         0 Sat May 23 01:50:49 GMT+08:00 2020 Mounting the platform dump file system, /var/adm/ras/platform
         0 Sat May 23 01:50:49 GMT+08:00 2020 Starting the error daemon
         0 Sat May 23 01:50:53 GMT+08:00 2020
         0 Sat May 23 01:50:53 GMT+08:00 2020 System initialization completed.
         0 Sat May 23 01:50:53 GMT+08:00 2020 Sat May 23 01:50:53 GMT+08:00 2020
         0 Sat May 23 01:50:53 GMT+08:00 2020 Automatic Error Log Analysis for sysplanar0 has detected a problem.
             The Service Request Number is B123E504: Memory subsystem including external cache Predictive Error,
             general. Refer to the system service documentation for more information.
             Additional Words: 2-030000F0 3-2BFC0110 4-C13920FF 5-400000FF
                               6-81032E40 7-00000303 8-0FFF0024 9-A9008270.
         0 Sat May 23 01:50:53 GMT+08:00 2020
         0 Sat May 23 01:50:53 GMT+08:00 2020 in sinpolhndlr OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 TE=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 CHKEXEC=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 CHKSHLIB=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 CHKSCRIPT=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 CHKKERNEXT=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 STOP_UNTRUSTD=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 STOP_ON_CHKFAIL=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 LOCK_KERN_POLICIES=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 TSD_FILES_LOCK=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 TSD_LOCK=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 TEP=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 TLP=OFF
         0 Sat May 23 01:50:53 GMT+08:00 2020 Successfully updated the Kernel Authorization Table.
         0 Sat May 23 01:50:53 GMT+08:00 2020 Successfully updated the Kernel Role Table.
         0 Sat May 23 01:50:53 GMT+08:00 2020 Successfully updated the Kernel Command Table.
         0 Sat May 23 01:50:53 GMT+08:00 2020 Successfully updated the Kernel Device Table.
         0 Sat May 23 01:50:53 GMT+08:00 2020 Successfully updated the Kernel Object Domain Table.
         0 Sat May 23 01:50:53 GMT+08:00 2020 Successfully updated the Kernel  Domains Table.
         0 Sat May 23 01:50:53 GMT+08:00 2020 OPERATIONAL MODE Security Flags
         0 Sat May 23 01:50:53 GMT+08:00 2020 ROOT                      :   DISABLED
         0 Sat May 23 01:50:53 GMT+08:00 2020 TRACEAUTH                 :   DISABLED
         0 Sat May 23 01:50:53 GMT+08:00 2020 System runtime mode is now OPERATIONAL MODE.
         0 Sat May 23 01:50:54 GMT+08:00 2020 Setting tunable parameters...
         0 Sat May 23 01:50:55 GMT+08:00 2020 complete
         0 Sat May 23 01:50:55 GMT+08:00 2020 Starting Multi-user Initialization
         0 Sat May 23 01:50:55 GMT+08:00 2020  Performing auto-varyon of Volume Groups
         0 Sat May 23 01:50:56 GMT+08:00 2020  Activating all paging spaces
         0 Sat May 23 01:50:57 GMT+08:00 2020 0517-075 swapon: Paging device /dev/hd6 is already active.
         0 Sat May 23 01:50:59 GMT+08:00 2020
         0 Sat May 23 01:50:59 GMT+08:00 2020 The current volume is: /dev/hd1
         0 Sat May 23 01:50:59 GMT+08:00 2020 Primary superblock is valid.
         0 Sat May 23 01:50:59 GMT+08:00 2020
         0 Sat May 23 01:50:59 GMT+08:00 2020 The current volume is: /dev/hd10opt
         0 Sat May 23 01:50:59 GMT+08:00 2020 Primary superblock is valid.
         0 Sat May 23 01:50:59 GMT+08:00 2020  Performing all automatic mounts
         0 Sat May 23 01:50:59 GMT+08:00 2020 Replaying log for /dev/oracle_lv.
         0 Sat May 23 01:52:05 GMT+08:00 2020 Multi-user initialization completed
         0 Sat May 23 01:52:05 GMT+08:00 2020 Checking for srcmstr active...
         0 Sat May 23 01:52:06 GMT+08:00 2020 success
         0 Sat May 23 01:52:06 GMT+08:00 2020 complete
         0 Sat May 23 01:52:06 GMT+08:00 2020 Starting tcpip daemons:
         0 Sat May 23 01:52:07 GMT+08:00 2020 success
         0 Sat May 23 01:52:07 GMT+08:00 2020 success
         0 Sat May 23 01:52:11 GMT+08:00 2020 0513-059 The syslogd Subsystem has been started. Subsystem PID is 4980904.
         0 Sat May 23 01:52:11 GMT+08:00 2020 0513-059 The sendmail Subsystem has been started. Subsystem PID is 5964030.
         0 Sat May 23 01:52:11 GMT+08:00 2020 0513-059 The portmap Subsystem has been started. Subsystem PID is 6750392.
         0 Sat May 23 01:52:11 GMT+08:00 2020 0513-059 The inetd Subsystem has been started. Subsystem PID is 6881532.
         0 Sat May 23 01:52:11 GMT+08:00 2020 0513-029 The snmpd Subsystem is already active.
             Multiple instances are not supported.
         0 Sat May 23 01:52:12 GMT+08:00 2020 0513-059 The aixmibd Subsystem has been started. Subsystem PID is 5767168.
         0 Sat May 23 01:52:12 GMT+08:00 2020 0513-059 The snmpmibd Subsystem has been started. Subsystem PID is 6029326.
         0 Sat May 23 01:52:12 GMT+08:00 2020 0513-059 The hostmibd Subsystem has been started. Subsystem PID is 7274730.
         0 Sat May 23 01:52:12 GMT+08:00 2020 Finished starting tcpip daemons.
         0 Sat May 23 01:52:12 GMT+08:00 2020 nsmb0 Available
         0 Sat May 23 01:52:12 GMT+08:00 2020 Starting NFS services:
         0 Sat May 23 01:52:12 GMT+08:00 2020 0513-059 The biod Subsystem has been started. Subsystem PID is 6291654.
         0 Sat May 23 01:52:14 GMT+08:00 2020 0513-059 The rpc.statd Subsystem has been started. Subsystem PID is 8126468.
         0 Sat May 23 01:52:14 GMT+08:00 2020 0513-059 The rpc.lockd Subsystem has been started. Subsystem PID is 7471342.
         0 Sat May 23 01:52:14 GMT+08:00 2020 Completed NFS services.
         0 Sat May 23 01:52:19 GMT+08:00 2020 success
         0 Sat May 23 01:52:20 GMT+08:00 2020 success
         0 Sat May 23 01:52:33 GMT+08:00 2020 0513-059 The ctrmc Subsystem has been started. Subsystem PID is 8388648.
         0 Sat May 23 01:52:50 GMT+08:00 2020
         0 Sat May 23 01:52:50 GMT+08:00 2020 Sat May 23 01:52:50 GMT+08:00 2020
         0 Sat May 23 01:52:50 GMT+08:00 2020 Automatic Error Log Analysis for sysplanar0 has detected a problem.
             The Service Request Number is B123E504: Memory subsystem including external cache Predictive Error,
             general. Refer to the system service documentation for more information.
             Additional Words: 2-030000F0 3-2BFC0110 4-C13920FF 5-400000FF
                               6-81032E40 7-00000303 8-0FFF0024 9-A9008270.
         0 Sat May 23 01:52:50 GMT+08:00 2020
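
The boot log is long, and the only lines relevant to the memory question are the Automatic Error Log Analysis entries. As a rough sketch, the distinct Service Request Numbers could be pulled out of the conslog output like this. The function name conslog_srcs is made up for illustration, and the pattern assumes the "Service Request Number is" wording seen in the log above.

```sh
# Hypothetical helper: print each distinct Service Request Number found in
# conslog text read from stdin. Assumes the "Service Request Number is <SRC>"
# wording shown in the log above.
conslog_srcs() {
    awk '{
        line = $0
        # A single line may contain several matches, so scan it repeatedly.
        while (match(line, /Service Request Number is +[0-9A-F]+/)) {
            hit = substr(line, RSTART, RLENGTH)
            sub(/^.* /, "", hit)          # keep only the trailing SRC token
            print hit
            line = substr(line, RSTART + RLENGTH)
        }
    }' | sort -u
}

# Usage on the AIX host:
#   alog -f /var/adm/ras/conslog -o | conslog_srcs
```

For the log in this post it should print only B123E504, since both analysis entries report the same SRC.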

errpt shows new error entries being logged every minute:

errpt

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
51E537B5   0523134820 P H sysplanar0     platform_dump saved to file
291D64C3   0523134820 I H sysplanar0     Platform dump data
BFE4C025   0523134820 P H sysplanar0     UNDETERMINED ERROR
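
When entries arrive every minute, the summary listing gets long quickly. A small sketch for counting entries per identifier is shown below. The function name errpt_counts is made up, and it assumes the column layout of the errpt summary shown above (identifier in the first column, at least five fields per entry).

```sh
# Hypothetical helper: count errpt summary entries per IDENTIFIER, most
# frequent first. Reads the errpt summary listing from stdin.
errpt_counts() {
    awk '$1 != "IDENTIFIER" && NF >= 5 { n[$1]++ }      # skip header and blank lines
         END { for (id in n) print n[id], id }' | sort -rn
}

# Usage on the AIX host:
#   errpt | errpt_counts
```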

Details for error 51E537B5:

errpt -aj 51E537B5      

LABEL:          PLAT_DUMP_COMPLETE
IDENTIFIER:     51E537B5
Date/Time:       Sat May 23 13:48:34 GMT+08:00 2020
Sequence Number: 20925
Machine Id:      00F6945B4C00
Node Id:         szzd_db1
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   sysplanar0      
Resource Class:  
Resource Type:   
Location:        

Description
platform_dump saved to file

Detail Data
platform_dump indicator event
...... 

Diagnostic Analysis
Diagnostic Log sequence number: 13401
Resource tested:        sysplanar0
Menu Number:            651303
Description:

The following informational event was reported by Platform Firmware.

Platform Firmware Dump Notification.

Details for error BFE4C025:

errpt -aj BFE4C025

LABEL:          SCAN_ERROR_CHRP
IDENTIFIER:     BFE4C025

Date/Time:       Sat May 23 13:48:33 GMT+08:00 2020
Sequence Number: 20923
Machine Id:      00F6945B4C00
Node Id:         szzd_db1
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   sysplanar0      
Resource Class:  
Resource Type:   
Location:        

Description
UNDETERMINED ERROR

Failure Causes
UNDETERMINED

        Recommended Actions
        RUN SYSTEM DIAGNOSTICS.

Detail Data
PROBLEM DATA
......

Diagnostic Analysis
Diagnostic Log sequence number: 13399
Resource tested:        sysplanar0
Resource Description:   System Planar
Location:               
SRC:                    B123E504
Description:            Memory subsystem including external cache Predictive
                        Error, general. Refer to the system service
                        documentation for more information.
Additional Words:       2-030000F0 3-2BFC0110 4-C13920FF 5-400000FF
                        6-81032E40 7-00000303 8-09A00029 9-A9008070
Possible FRUs:
    Priority: H FRU: 77P8784  S/N: n/a          CCIN: 31C5 
    Location: U78AA.001.WZSGD13-P1-C17-C7
    Priority: H FRU: 77P8784  S/N: n/a          CCIN: 31C5 
    Location: U78AA.001.WZSGD13-P1-C17-C9
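
The "Possible FRUs" block is what answers the "which DIMM" question: each Location line is the physical location code of a suspect part. As a minimal sketch, the codes could be extracted from the detailed report as follows. The function name fru_locations is made up, and it assumes location codes begin with "U" as in the output above.

```sh
# Hypothetical helper: list the unique physical location codes from
# "errpt -a" output read on stdin.
fru_locations() {
    # Keep "Location:" lines whose value starts with U (a physical location
    # code); the empty "Location:" header fields are skipped.
    awk '$1 == "Location:" && $2 ~ /^U/ { print $2 }' | sort -u
}

# Usage on the AIX host:
#   errpt -aj BFE4C025 | fru_locations
```

Because the same error repeats every minute, the de-duplication keeps the list down to the distinct slots, here P1-C17-C7 and P1-C17-C9.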

4 answers

张文正, Systems Engineer, dcits
Priority: H FRU: 77P8784 S/N: n/a CCIN: 31C5

Location: U78AA.001.WZSGD13-P1-C17-C7

Priority: H FRU: 77P8784 S/N: n/a CCIN: 31C5

Location: U78AA.001.WZSGD13-P1-C17-C9

These are the locations. The memory FRU part number is 77P8784. Replace them.

 2020-05-25
Views: 597
lipeng9239, System Operations Engineer, 北京智控美信
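It is normal for the reported hardware location of a memory error to change during swap testing. Depending on whether it has one or two processors, an 8202 machine can be configured with one to four memory riser cards, each with 8 DIMM slots. The number of memory cards and the size and number of DIMMs are all subject to specific slot plugging rules. Look up the memory plugging rules on the official website and compare them with your machine's actual configuration. Once you understand them, you will understand the problem you are seeing.

 2020-06-30
Views: 178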
xueliangjia, Systems Engineer, 神州数码
Priority: H FRU: 77P8784 S/N: n/a CCIN: 31C5

Location: U78AA.001.WZSGD13-P1-C17-C7
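This entry contains the part number of the memory spare you need to order. Use a DIMM with the same CCIN. If the error persists after the replacement, cross-test by swapping DIMMs between slots to determine whether the DIMM or the slot is at fault, and then decide whether the memory card itself needs to be replaced.

 2020-06-22
Views: 198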
ldq003, System Operations Engineer, it
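The DIMM pair at C17-C7 and C17-C9 was replaced on Monday.
In the HMC I found that the other memory card was entirely deconfigured, so I set it back to configured.
After power-on, C18-C2 and C18-C4 reported errors.
I shut down, pulled C18-C2 and C18-C4, and powered on again. Then C18-C1 and C18-C3 started reporting errors. Next I moved the DIMMs from C18-C8 and C18-C10 into the C18-C2 and C18-C4 slots. After power-on, C18-C7 and C9 reported errors.
I was close to giving up. Finally I pulled C18-C7 and C9 and powered on again, and no further errors were reported.
During today's routine log check I found that on Tuesday afternoon
the three locations C18, C18-C1, and C18-C3 reported an error once.
I will go on site next week to test again. I suspect the C18 memory card is faulty.

 2020-05-30
Views: 517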

Asked by

ldq003, System Operations Engineer, it


Question status

  • Posted: 2020-05-25
  • Followers: 4
  • Views: 1304
  • Latest answer: 2020-06-30