
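How to analyze the cause of an abnormal RAC crash

The environment is as follows: one database, two RAC clusters, four instances. Node A went down with a system dump, and node B went down right after it. There are no hardware error entries in the log. The error log contents are shown below.


Error log on node A: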

3C81E43F   0621000016 P U topsvcs        Late in sending heartbeat

A6DF45AA   0620000916 I O RMCdaemon      The daemon is started.

67145A39   0620000716 U S SYSDUMP        SYSTEM DUMP

F48137AC   0620000516 U O minidump       COMPRESSED MINIMAL DUMP

225E3B63   0620000516 T S PANIC          SOFTWARE PROGRAM ABNORMALLY TERMINATED

9DBCFDEE   0620000916 T O errdemon       ERROR LOGGING TURNED ON

90EDB0A5   0619235916 P S topsvcs        Dead Man Switch being allowed to expire.

BA6A5ED2   0619231616 I S rmt6           CONFIGURATION MISMATCH
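
The errpt summary prints its TIMESTAMP column as MMDDhhmmYY and lists newer entries first, so the order of events is easy to misread. The following is a minimal sketch, not part of the original post, that sorts the node A summary lines into a timeline. The helper name `parse_summary` is hypothetical, and the code assumes two-digit years belong to the 2000s.

```python
from datetime import datetime

# Summary lines copied from the node A listing above.
ERRPT_SUMMARY = """\
3C81E43F   0621000016 P U topsvcs        Late in sending heartbeat
A6DF45AA   0620000916 I O RMCdaemon      The daemon is started.
67145A39   0620000716 U S SYSDUMP        SYSTEM DUMP
F48137AC   0620000516 U O minidump       COMPRESSED MINIMAL DUMP
225E3B63   0620000516 T S PANIC          SOFTWARE PROGRAM ABNORMALLY TERMINATED
9DBCFDEE   0620000916 T O errdemon       ERROR LOGGING TURNED ON
90EDB0A5   0619235916 P S topsvcs        Dead Man Switch being allowed to expire.
BA6A5ED2   0619231616 I S rmt6           CONFIGURATION MISMATCH
"""


def parse_summary(text):
    """Return (datetime, identifier, resource, description) tuples, oldest first."""
    events = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        ident, stamp, _etype, _eclass, resource, desc = parts
        # TIMESTAMP is MMDDhhmmYY; assumption: the year is 20YY.
        when = datetime(
            2000 + int(stamp[8:10]),  # YY
            int(stamp[0:2]),          # MM
            int(stamp[2:4]),          # DD
            int(stamp[4:6]),          # hh
            int(stamp[6:8]),          # mm
        )
        events.append((when, ident, resource, desc))
    # sorted() is stable, so entries within the same minute keep listing order.
    return sorted(events, key=lambda e: e[0])


timeline = parse_summary(ERRPT_SUMMARY)
for when, ident, resource, desc in timeline:
    print(when.strftime("%Y-%m-%d %H:%M"), ident, resource, desc)
```

Read oldest first, the Dead Man Switch warning (23:59 on June 19) comes before the PANIC and dump entries (00:05 to 00:07 on June 20), and the late-heartbeat entry is from the following night. The summary only has minute resolution, so the detailed entries are still needed for exact ordering.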


# errpt -aj 3C81E43F
---------------------------------------------------------------------------
LABEL:          TS_LATEHB_PE
IDENTIFIER:     3C81E43F

Date/Time:       Tue Jun 21 00:00:07 GMT+08:00 2016
Sequence Number: 490017
Machine Id:      00CC37E64C00
Node Id:         hzypa
Class:           U
Type:            PERF
WPAR:            Global
Resource Name:   topsvcs         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Late in sending heartbeat

Probable Causes
Heavy CPU load
Severe physical memory shortage
Heavy I/O activities

Failure Causes
Daemon can not get required system resource

        Recommended Actions
        Reduce the system load

Detail Data
DETECTING MODULE
rsct,bootstrp.C,1.215.1.10,5366               
ERROR ID
6zESUw.5A/OL/lXb.t2pZ8....................
REFERENCE CODE
                                          
A heartbeat is late by the following number of seconds
           8
#
#

# errpt -aj 67145A39
---------------------------------------------------------------------------
LABEL:          DUMP_STATS
IDENTIFIER:     67145A39

Date/Time:       Mon Jun 20 00:07:04 GMT+08:00 2016
Sequence Number: 489865
Machine Id:      00CC37E64C00
Node Id:         hzypa
Class:           S
Type:            UNKN
WPAR:            Global
Resource Name:   SYSDUMP         

Description
SYSTEM DUMP

Probable Causes
UNEXPECTED SYSTEM HALT

User Causes
SYSTEM DUMP REQUESTED BY USER

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
UNEXPECTED SYSTEM HALT

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
DUMP DEVICE
/dev/lg_dumplv
DUMP SIZE
            1773494272
TIME
Sun Jun 19 23:59:26 2016
DUMP TYPE (1 = PRIMARY, 2 = SECONDARY)
           1
DUMP STATUS
           0
ERROR CODE
0000 0000 0000 0000
DUMP INTEGRITY
Compressed dump - Run dmpfmt with -c flag on dump after uncompressing.

FILE NAME

PROCESSOR ID
           0
#

LABEL:          TS_DMS_EXPIRING_EM
IDENTIFIER:     90EDB0A5

Date/Time:       Sun Jun 19 23:59:16 GMT+08:00 2016
Sequence Number: 489861
Machine Id:      00CC37E64C00
Node Id:         hzypa
Class:           S
Type:            PEND
WPAR:            Global
Resource Name:   topsvcs         

Description
Dead Man Switch being allowed to expire.
If a TS_DMS_RESTORED_TE error appears after this, that will indicate this
condition has been recovered from.  Otherwise, a DMS-triggered node failure
should be expected to occur after the time indicated in the Detail Data.

Probable Causes
Topology Services has detected blockage that puts it in danger of suffering
a sundered network.  This is due to all viable NIM processes experiencing
blockage, or the daemon's main thread being hung for too long.

User Causes
Excessive I/O load is causing high I/O interrupt traffic
Excessive memory consumption is causing high memory contention

        Recommended Actions
        Reduce application load on the system
        Change (relax) Topology Services tunable parameters
        Call IBM Service if problem persists

Failure Causes
Problem in Operating System prevents processes from running
Excessive I/O interrupt traffic prevents processes from running
Excessive virtual memory activity prevents Topology Services from making progress

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Change (relax) Topology Services tunable parameters
        Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.34,4890            
ERROR ID
6Z0PvE0I3gNL/wtD/t2pZ8....................
REFERENCE CODE
                                          
Time remaining until DMS triggers (in msec)
       10000
DMS trigger interval (in msec)
       20000
---------------------------------------------------------------------------
LABEL:          SC_TAPE_ERR7
IDENTIFIER:     BA6A5ED2

Date/Time:       Sun Jun 19 23:16:41 GMT+08:00 2016
Sequence Number: 489860
Machine Id:      00CC37E64C00
Node Id:         hzypa
Class:           S
Type:            INFO
WPAR:            Global
Resource Name:   rmt6            

Description
CONFIGURATION MISMATCH

Probable Causes
CONFIGURATION
CONFIGURATION PARAMETER MISMATCH
DEVICE CONFIGURATION DATABASE

Failure Causes
SOFTWARE DEVICE DRIVER

        Recommended Actions
        VERIFY SYSTEM CONFIGURATION IS VALID
        CORRECT CONFIGURATION
        REFER TO PRODUCT DOCUMENTATION FOR ADDITIONAL INFORMATION

Detail Data
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001
0007 7298 0000 0000 007B 0C00 0007 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0019
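
One check that can be made from the node A entries alone is whether the dump time lines up with the Dead Man Switch (DMS) countdown. The sketch below is an illustration added here, not part of the original post. It takes the Date/Time of the TS_DMS_EXPIRING_EM entry, adds the "Time remaining until DMS triggers" value, and compares the result with the TIME field of the DUMP_STATS entry. The variable names are hypothetical.

```python
from datetime import datetime, timedelta

# Values copied from the TS_DMS_EXPIRING_EM (90EDB0A5) entry above.
dms_warning_logged = datetime(2016, 6, 19, 23, 59, 16)  # Date/Time
time_remaining_msec = 10000  # "Time remaining until DMS triggers (in msec)"

# Value copied from the DUMP_STATS (67145A39) entry above.
dump_time = datetime(2016, 6, 19, 23, 59, 26)  # TIME field in Detail Data

# When the DMS would fire if the condition was never recovered.
expected_dms_fire = dms_warning_logged + timedelta(milliseconds=time_remaining_msec)
gap = (dump_time - expected_dms_fire).total_seconds()

print("expected DMS trigger:", expected_dms_fire)
print("dump TIME field:     ", dump_time)
print("difference (s):      ", gap)
```

The two times coincide, and no TS_DMS_RESTORED_TE entry appears in the listing shown. That is consistent with the DMS having caused the node A dump, but the error log alone does not prove it; the dump itself would have to be examined to confirm.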





Error log on node B:

A6DF45AA   0620001416 I O RMCdaemon      The daemon is started.

67145A39   0620001216 U S SYSDUMP        SYSTEM DUMP

F48137AC   0620001116 U O minidump       COMPRESSED MINIMAL DUMP

AB59ABFF   0620001116 U U LIBLVM         Remote node Concurrent Volume Group fail

9DBCFDEE   0620001416 T O errdemon       ERROR LOGGING TURNED ON

AB59ABFF   0620000016 U U LIBLVM         Remote node Concurrent Volume Group fail


# errpt -aj AB59ABFF
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Mon Jun 20 00:11:12 GMT+08:00 2016
Sequence Number: 371773
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A08 A77F
MAJOR/MINOR DEVICE NUMBER
002C 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Mon Jun 20 00:00:42 GMT+08:00 2016
Sequence Number: 371771
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A08 A77F
MAJOR/MINOR DEVICE NUMBER
002C 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Mon Jun 20 00:00:42 GMT+08:00 2016
Sequence Number: 371770
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A0E B016
MAJOR/MINOR DEVICE NUMBER
002D 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Mon Jun 20 00:00:42 GMT+08:00 2016
Sequence Number: 371769
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A0B BD73
MAJOR/MINOR DEVICE NUMBER
002E 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Mon Jun 20 00:00:41 GMT+08:00 2016
Sequence Number: 371768
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A08 A77F
MAJOR/MINOR DEVICE NUMBER
002C 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Mon Jun 20 00:00:41 GMT+08:00 2016
Sequence Number: 371767
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A0B BD73
MAJOR/MINOR DEVICE NUMBER
002E 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Mon Jun 20 00:00:41 GMT+08:00 2016
Sequence Number: 371766
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A0E B016
MAJOR/MINOR DEVICE NUMBER
002D 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Wed Jun  1 00:00:42 GMT+08:00 2016
Sequence Number: 370819
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A08 A77F
MAJOR/MINOR DEVICE NUMBER
002C 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Wed Jun  1 00:00:42 GMT+08:00 2016
Sequence Number: 370818
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A0E B016
MAJOR/MINOR DEVICE NUMBER
002D 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Wed Jun  1 00:00:42 GMT+08:00 2016
Sequence Number: 370817
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A08 A77F
MAJOR/MINOR DEVICE NUMBER
002C 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Wed Jun  1 00:00:42 GMT+08:00 2016
Sequence Number: 370816
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A0E B016
MAJOR/MINOR DEVICE NUMBER
002D 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Wed Jun  1 00:00:42 GMT+08:00 2016
Sequence Number: 370815
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A0B BD73
MAJOR/MINOR DEVICE NUMBER
002E 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------
LABEL:          LVM_GS_RLEAVE
IDENTIFIER:     AB59ABFF

Date/Time:       Wed Jun  1 00:00:42 GMT+08:00 2016
Sequence Number: 370814
Machine Id:      00CC37F64C00
Node Id:         hzypb
Class:           U
Type:            UNKN
WPAR:            Global
Resource Name:   LIBLVM         
Resource Class:  NONE
Resource Type:   NONE
Location:        

Description
Remote node Concurrent Volume Group failure detected

Probable Causes
Remote node Concurrent Volume Group forced offline

Failure Causes
Remote node left VGSA/VGDA groups due to failure

        Recommended Actions
        Examine error log on identified remote node

Detail Data
Remote Node Name
hzypa
Volume Group ID
00CC 37E6 0000 4C00 0000 0141 0A0B BD73
MAJOR/MINOR DEVICE NUMBER
002E 0000
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
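
The LVM_GS_RLEAVE entries on node B are easier to read when grouped by day and volume group. The following minimal sketch, added here for illustration, does that with the sequence numbers, timestamps, and volume group IDs copied from the entries above. Only the last two words of each Volume Group ID are kept, because the leading words are identical in every entry.

```python
from collections import defaultdict
from datetime import datetime

# (sequence number, Date/Time, last two words of Volume Group ID),
# copied from the LVM_GS_RLEAVE entries in the node B log above.
ENTRIES = [
    (371773, "2016-06-20 00:11:12", "0A08 A77F"),
    (371771, "2016-06-20 00:00:42", "0A08 A77F"),
    (371770, "2016-06-20 00:00:42", "0A0E B016"),
    (371769, "2016-06-20 00:00:42", "0A0B BD73"),
    (371768, "2016-06-20 00:00:41", "0A08 A77F"),
    (371767, "2016-06-20 00:00:41", "0A0B BD73"),
    (371766, "2016-06-20 00:00:41", "0A0E B016"),
    (370819, "2016-06-01 00:00:42", "0A08 A77F"),
    (370818, "2016-06-01 00:00:42", "0A0E B016"),
    (370817, "2016-06-01 00:00:42", "0A08 A77F"),
    (370816, "2016-06-01 00:00:42", "0A0E B016"),
    (370815, "2016-06-01 00:00:42", "0A0B BD73"),
    (370814, "2016-06-01 00:00:42", "0A0B BD73"),
]

# day -> volume group suffix -> number of entries
by_day = defaultdict(lambda: defaultdict(int))
for _seq, stamp, vg in ENTRIES:
    day = datetime.fromisoformat(stamp).date()
    by_day[day][vg] += 1

for day in sorted(by_day):
    print(day, "entries:", sum(by_day[day].values()))
    for vg, count in sorted(by_day[day].items()):
        print("   VG ...%s: %d" % (vg, count))
```

Grouped this way, the same three concurrent volume groups reported node hzypa leaving on June 1 at 00:00:42 as well as on June 20 just after midnight. This may point to a recurring event around midnight, but the log does not say what it is; node A's error log for June 1 would need to be checked.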


3 answers from peers

powerhelper, Systems Engineer, Digital China · 2016-06-21

In this situation we usually adjust the HA heartbeat check frequency, tuning it within the permitted range. We generally set it to 60 times per minute.

myciciy, IT Consultant, a financial technology company · 2016-06-21

This is a split-brain.

wailon, Database Administrator, elegps · 2016-08-22

AIX 6.1, Oracle 10.2.0.5.

I ran into the same problem, and in my case the two RAC clusters restarted one after the other.

Asked by

GDUTHJQ
Systems Engineer, css

Question status

  • Posted: 2016-06-21
  • Latest answer: 2016-08-22