Author: myciciy · 2017-02-28 13:38
IT consultant · a fintech company

How to Use KDB to Analyze the Cause of a PowerHA Node Crash

1. Environment

Two IBM 9117-MMA servers running AIX 6100-01-01-0823 with PowerHA 5.4. One day one of the two hosts crashed and the workload failed over to the standby node.

    Processor Type: PowerPC_POWER6
    Processor Implementation Mode: POWER 6
    Processor Version: PV_6
    Number Of Processors: 16
    Processor Clock Speed: 4400 MHz
    CPU Type: 64-bit
    Kernel Type: 64-bit
    LPAR Info: 1 06-4C966
    Memory Size: 189440 MB
    Good Memory Size: 189440 MB
    Platform Firmware level: EM350_071
    Firmware Version: IBM,EM350_071

2. Fault analysis

# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A6DF45AA   0720131411 I O RMCdaemon      The daemon is started.
D221BD55   0720131411 I O perftune       RESTRICTED TUNABLES MODIFIED AT REBOOT
67145A39   0720131311 U S SYSDUMP        SYSTEM DUMP
F48137AC   0720131211 U O minidump       COMPRESSED MINIMAL DUMP
9D035E4D   0720131211 P S SYSVMM         DATA STORAGE INTERRUPT, PROCESSOR
8CC2A219   0720131211 P S fwadump        Firmware-assisted system dump initializa
9DBCFDEE   0720131311 T O errdemon       ERROR LOGGING TURNED ON
# errpt -aj 9D035E4D
LABEL:          DSI_PROC
IDENTIFIER:     9D035E4D
Date/Time:       Wed Jul 20 13:12:14 PAKST 2011
Sequence Number: 180810
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSVMM
Description
DATA STORAGE INTERRUPT, PROCESSOR
Probable Causes
SOFTWARE PROGRAM
Failure Causes
SOFTWARE PROGRAM
Recommended Actions
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
DATA STORAGE INTERRUPT STATUS REGISTER
0000 0000 0220 0000
SEGMENT REGISTER, SEGREG
0000 7FFF FFFF D080
DATA STORAGE INTERRUPT ADDRESS REGISTER
F100 0180 1199 B328
EXVAL
0000 0000 0000 010E
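
To make the error log easier to read, the summary lines can be parsed programmatically. The following is a minimal sketch in Python, not part of the original analysis. It assumes the standard errpt summary layout (IDENTIFIER, TIMESTAMP in MMDDhhmmyy form, type T, class C, resource name, description) and that the two-digit year means 20yy. The helper names `parse_errpt_line` and `join_hex_words` are hypothetical.

```python
from datetime import datetime

# Summary lines copied from the errpt output above.
ERRPT_SUMMARY = """\
A6DF45AA   0720131411 I O RMCdaemon      The daemon is started.
D221BD55   0720131411 I O perftune       RESTRICTED TUNABLES MODIFIED AT REBOOT
67145A39   0720131311 U S SYSDUMP        SYSTEM DUMP
F48137AC   0720131211 U O minidump       COMPRESSED MINIMAL DUMP
9D035E4D   0720131211 P S SYSVMM         DATA STORAGE INTERRUPT, PROCESSOR
8CC2A219   0720131211 P S fwadump        Firmware-assisted system dump initializa
9DBCFDEE   0720131311 T O errdemon       ERROR LOGGING TURNED ON
"""


def parse_errpt_line(line):
    """Split one errpt summary line into its six columns (hypothetical helper)."""
    ident, stamp, etype, eclass, resource, description = line.split(None, 5)
    # TIMESTAMP is MMDDhhmmyy; the two-digit year is assumed to mean 20yy.
    month, day, hour, minute, year = (int(stamp[i:i + 2]) for i in range(0, 10, 2))
    return {
        "identifier": ident,
        "time": datetime(2000 + year, month, day, hour, minute),
        "type": etype,
        "class": eclass,
        "resource": resource,
        "description": description.strip(),
    }


def join_hex_words(words):
    """Turn a Detail Data value such as 'F100 0180 1199 B328' into an integer."""
    return int(words.replace(" ", ""), 16)


entries = [parse_errpt_line(l) for l in ERRPT_SUMMARY.splitlines() if l.strip()]
# Type P marks permanent errors, which are usually the first ones to inspect.
permanent = [e for e in entries if e["type"] == "P"]
# DATA STORAGE INTERRUPT ADDRESS REGISTER from the detailed DSI_PROC entry.
dar = join_hex_words("F100 0180 1199 B328")

if __name__ == "__main__":
    for e in sorted(permanent, key=lambda e: e["time"]):
        print(e["time"], e["identifier"], e["resource"], e["description"])
    print("DAR = 0x%016X" % dar)
```

Filtering on type P narrows the log to the DSI_PROC entry from SYSVMM and the fwadump entry, which is why the detailed report is requested for identifier 9D035E4D.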

The system generated a dump file:

# sysdumpdev -L
Device name:         /dev/lvcorefiles
Major device number: 10
Minor device number: 13
Size:                6082147840 bytes
Uncompressed Size:   23820959312 bytes
Date/Time:           Wed Jul 20 12:56:31 PAKST 2011
Dump status:         0
Type of dump:        traditional
dump completed successfully
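
The two size fields give a rough idea of how much space the dump needs in compressed and uncompressed form, which matters if the dump is copied elsewhere for analysis. The sketch below is an illustration only. It assumes the "key: value" layout shown above, and the helper name `parse_sysdumpdev` is hypothetical.

```python
# Output of `sysdumpdev -L`, copied from above.
SYSDUMPDEV_OUTPUT = """\
Device name:         /dev/lvcorefiles
Major device number: 10
Minor device number: 13
Size:                6082147840 bytes
Uncompressed Size:   23820959312 bytes
Date/Time:           Wed Jul 20 12:56:31 PAKST 2011
Dump status:         0
Type of dump:        traditional
dump completed successfully
"""


def parse_sysdumpdev(text):
    """Collect the 'key: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so the time of day stays intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


fields = parse_sysdumpdev(SYSDUMPDEV_OUTPUT)
compressed = int(fields["Size"].split()[0])
uncompressed = int(fields["Uncompressed Size"].split()[0])
ratio = uncompressed / compressed
GIB = 2 ** 30

if __name__ == "__main__":
    print("compressed:   %.2f GiB" % (compressed / GIB))
    print("uncompressed: %.2f GiB" % (uncompressed / GIB))
    print("compression ratio: %.2f : 1" % ratio)
```

For this dump the compressed image is a little under 6 GiB and the uncompressed data a little over 22 GiB, so the target file system should be sized for the larger figure.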

Analyzing the dump file with kdb:

# kdb ./vmcore.0 /unix
./vmcore.0 mapped from @ 700000000000000 to @ 70000058d7f6149
Preserving 1692070 bytes of symbol table [/unix]
Component Names:
1) minidump [2 entries]
2) dmp_minimal [10 entries]
3) proc [1867 entries]
4) thrd [13858 entries]
5) mtrc [97 entries]
6) lfs [8 entries]
7) bos [5 entries]
8) vmm [13 entries]
9) alloc_kheap [2306 entries]
...
(30)>
(30)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_6 machine with 32 available CPU(s) (64-bit registers)

SYSTEM STATUS:
sysname... AIX
nodename.. cbp6a
release... 1
version... 6
build date Jan 16 2009
build time 19:37:08
label..... b2009_03D0
machine... 00C4C9664C00
nid....... C4C9664C
time of crash: Wed Jul 20 12:56:30 2011
age of system: 146 day, 2 hr., 53 min., 44 sec.
xmalloc debug: enabled
FRRs active... 0
FRRs started.. 0

CRASH INFORMATION:
CPU 30 CSA F100041580553D00 at time of crash, error code for LEDs: 30000000
pvthread+3800400 STACK:
[0426A04C]pofTimer+00006C (0000000000000005 [??])
[00189AC8]watchdog+000388 ()
[000750D4]sys_timer+000154 (??)
[00075C00]clock+0002E0 (??)
[00175C00]i_softmod+000360 ()
[0019651C]flih_util+000248 ()
Exception (F00000002FF47600)
iar : 000000000042B028 msr : 8000000000009032 cr : 24208028
lr : 0000000000053030 ctr : 0000000000000000 xer : 00000000
mq : 00000000 asr : 0000000ADA40D001 amr : F3FCFFFC00000000
r0 : 0000000000000000 r1 : 0FFFFFFFF3FFFDF0 r2 : 0000000002C25AA8
r3 : 0000000000000000 r4 : 0000000000000080 r5 : 0000000000000000
r6 : 0000000000000080 r7 : 0000000000000000 r8 : 0000000000000000
r9 : 0000000000000000 r10 : 0000000000000000 r11 : 0000000000000001
r12 : 0000000000000000 r13 : F100013807F8C400 r14 : 0000000000000002
r15 : 000000001464A750 r16 : 0000000000000000 r17 : 0000000002084C26
r18 : 00000000007CE608 r19 : 00000000000034C8 r20 : 0000000002911240
r21 : 00162CAC685565B4 r22 : F100080713800478 r23 : F100010010066C00
r24 : 00000000020C17B4 r25 : 0000000000000000 r26 : 0000000000000000
r27 : 0000000002551D00 r28 : 0000000002551CFE r29 : 00000000020C1730
r30 : F10001C041FD1000 r31 : 00000000020C0A80
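
The stack trace in the stat output is the part that is usually quoted when the case is handed to IBM support. As a hedged illustration, the following Python sketch splits each frame into address, symbol, and offset. It assumes the bracketed value is the instruction address within the frame and that the value after "+" is the offset from the start of the named function, so that address minus offset gives the function entry point. `Frame` and `parse_stack` are hypothetical names.

```python
import re
from typing import NamedTuple

# Stack frames copied from the CRASH INFORMATION section above.
KDB_STACK = """\
[0426A04C]pofTimer+00006C (0000000000000005 [??])
[00189AC8]watchdog+000388 ()
[000750D4]sys_timer+000154 (??)
[00075C00]clock+0002E0 (??)
[00175C00]i_softmod+000360 ()
[0019651C]flih_util+000248 ()
"""

# [address]symbol+offset, all numbers in hexadecimal.
FRAME_RE = re.compile(r"^\[([0-9A-Fa-f]+)\](\w+)\+([0-9A-Fa-f]+)")


class Frame(NamedTuple):
    address: int  # value shown in brackets
    symbol: str   # function name resolved from the /unix symbol table
    offset: int   # offset from the start of the function

    @property
    def entry(self):
        """Assumed start address of the function (address minus offset)."""
        return self.address - self.offset


def parse_stack(text):
    """Return the frames in the order kdb printed them (innermost first)."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            address, symbol, offset = match.groups()
            frames.append(Frame(int(address, 16), symbol, int(offset, 16)))
    return frames


frames = parse_stack(KDB_STACK)

if __name__ == "__main__":
    for depth, frame in enumerate(frames):
        print("#%d %-10s entry=0x%08X +0x%X"
              % (depth, frame.symbol, frame.entry, frame.offset))
```

Read from the bottom up, the call chain goes from flih_util through i_softmod, clock, sys_timer, and watchdog to pofTimer, the innermost frame at the time of the crash.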

After consulting IBM, it was confirmed that the crash was caused by a bug in the AIX operating system.

This article is part of the column:

AIX System Fault Case Collection
Technical case studies on IBM Power, AIX, PowerHA, PowerVM, PowerVC, IBM FlashSystem, SVC, Storage, and related topics
