Author: hunksty · 2011-01-11 15:55
Storage architect · Financial industry

DMS (deadman switch)

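1. Introduction to DMS

DMS (deadman switch) is the name of a system kernel extension. It can bring the system down before it crashes and produces a dump file for later inspection.
DMS exists to protect shared external disks and the data on them. When a system stays hung for longer than a set time limit, DMS automatically brings that system down and the HACMP backup node takes over. This protects the data, keeps the business running, and avoids potential problems, especially with external disk arrays.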

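2. Causes of DMS

DMS is typically triggered for the following reasons:
a. An application runs at a higher priority than the clstrmgr daemon, so clstrmgr cannot reset the DMS counter in time.
b. Heavy I/O activity on the system leaves the CPU no time to service the clstrmgr daemon.
c. Memory leak or overflow problems.
d. Heavy system error-log activity (for example, token-ring beaconing problems).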
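The counter-reset mechanism described above can be illustrated with a toy model. This is only a sketch of the general deadman-switch idea, not the HACMP kernel extension: the names DMS_TIMEOUT, dms_reset and dms_expired, the 5-second limit, and the simulated timeline are all assumptions made for illustration.

```shell
#!/bin/sh
# Toy model of a deadman switch; NOT the HACMP kernel extension.
# DMS_TIMEOUT, dms_reset and dms_expired are hypothetical names.

DMS_TIMEOUT=5      # seconds tolerated without a reset (illustrative value)
dms_last_reset=0   # time of the most recent reset, in seconds

# Called by a healthy "cluster manager": re-arm the switch at time $1.
dms_reset() {
    dms_last_reset=$1
}

# Called from a "clock tick" at time $1: returns 0 (true) when the
# switch has gone without a reset for longer than DMS_TIMEOUT.
dms_expired() {
    [ $(( $1 - dms_last_reset )) -gt "$DMS_TIMEOUT" ]
}

# Simulated timeline: resets arrive at t=0 and t=3, then the manager
# is starved (for example by a higher-priority process) and stops.
dms_reset 0
dms_reset 3
for now in 4 6 8 9; do
    if dms_expired "$now"; then
        echo "t=$now: timeout, node would be halted"
    else
        echo "t=$now: ok"
    fi
done
```

Each cause listed above is, in this model, just a different reason why dms_reset stops being called on time.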

3. How to check whether a DMS event has occurred

One way is to analyze the dump file, for example:

# crash /dev/lv00
Using /unix as the default namelist file.
> cpu
Selected cpu number : 0
> stat
      sysname: AIX
      nodename: sp13
      release: 3
      version: 4
      machine: 00091968A400
      time of crash: Sat Aug 31 04:36:52 EDT 2002
      age of system: 5 day, 21 hr., 6 min.
      xmalloc debug: disabled
      abend code: 700
      csa: 0x438eb0
      exception struct:
      0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
      panic: HACMP for AIX dms timeout - ha
.
> status
CPU   TID  TSLOT   PID  PSLOT  STOPPED  PROC_NAME
0     205      2   204      2      yes  wait
1     307      3   306      3      yes  wait
2     409      4   408      4      yes  wait
3     50b      5   50a      5      yes  wait
4     60d      6   60c      6      yes  wait
5    1867     24  125a     18      yes  errdemon
6     811      8   810      8      yes  wait
7     913      9   912      9      yes  wait
> t -mk
Skipping first MST
.
MST STACK TRACE:
0x00438eb0 (excpt=00000000:00000000:00000000:00000000:00000000)
(intpri=5)
IAR:      .panic_trap+0 (00012678): tweq r1,r1
LR:       .[dms:dead_man_sw_handler]+18 (0171335c)
00438d40: .[dms:timeout_end]+4c (01713b98)
00438d80: .clock+134 (0002e9a8)
00438de0: .i_softmod+2a8 (0001c3b0)
00438e70: flih_603_patch+cc (00028b74)
.
0x2ff3b400 (excpt=00000000:00000000:00000000:00000000:00000000)
(intpri=11)
IAR:      .waitproc_find_run_queue+c0 (000255e0): addic r3,r0,-4
LR:       .waitproc+a0 (00025aa4)
2ff3b328: .waitproc+a0 (00025aa4)
2ff3b388: .procentry+14 (00098288)
2ff3b3c8: .low+0 (00000000)
.
> symptom
PIDS/5765C3403 LVLS/430 PCSS/SPI1 MS/700 FLDS/panic_tra VALU/7c810808
FLDS/[dms:dead VALU/18

Alternatively, check errpt, for example:

errpt -a
-------
LABEL:           KERNEL_PANIC
IDENTIFIER:      225E3B63

Date/Time:       Thu Apr 25 21:26:16
Sequence Number: 609
Machine Id:      0040613A4C00
Node Id:         localhost
Class:           S
Type:            TEMP
Resource Name:   PANIC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ASSERT STRING

PANIC STRING
HACMP for AIX dms timeout - halting hung node
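The errpt check can be scripted by searching the detailed report for the panic string shown above. The following is a minimal sketch: has_dms_panic is a hypothetical helper name, and the demo feeds it a fragment of the sample report rather than calling errpt, so that it runs anywhere.

```shell
#!/bin/sh
# Sketch: detect the DMS panic string in error-report text read from stdin.
# has_dms_panic is a hypothetical helper name, not an AIX command.
has_dms_panic() {
    grep -q 'HACMP for AIX dms timeout'
}

# Self-contained demo using a fragment of the report shown above.
sample='LABEL: KERNEL_PANIC
PANIC STRING
HACMP for AIX dms timeout - halting hung node'

if printf '%s\n' "$sample" | has_dms_panic; then
    echo "DMS timeout found in error report"
fi
```

On an AIX node the same helper would presumably be fed from the real report, for example `errpt -a | has_dms_panic`.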


4. Ways to avoid DMS

a. Tune the system's I/O pacing.
For example, run # smitty chsys and adjust the high- and low-water marks as follows:

Maximum number of PROCESSES allowed per user         [128]
Maximum number of pages in block I/O BUFFER CACHE    [20]
Maximum Kbytes of real memory allowed for MBUFS      [0]
Automatically REBOOT system after a crash            false
Continuously maintain DISK I/O history               false
HIGH water mark for pending write I/Os per file      [33]
LOW water mark for pending write I/Os per file       [24]
Amount of usable physical memory in Kbytes           262144
State of system keylock at boot time                 normal
Enable full CORE dump                                false
Use pre-430 style CORE dump                          false
Enable CPU Guard                                     disable
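The two water marks in this panel correspond to the maxpout and minpout attributes of sys0, so the same change can be made from the command line with chdev. The sketch below only builds and prints the command instead of running it. pacing_cmd is a hypothetical helper, and the check that the low mark is below the high mark is a sanity check added here, not something stated in the article.

```shell
#!/bin/sh
# Sketch: build (but do not run) the chdev command for I/O pacing on sys0.
# pacing_cmd is a hypothetical helper; it only prints the command line.
pacing_cmd() {
    high=$1   # HIGH water mark (maxpout)
    low=$2    # LOW water mark (minpout)
    # Assumed sanity check: the low mark should be below the high mark.
    if [ "$low" -ge "$high" ]; then
        echo "low-water mark must be below high-water mark" >&2
        return 1
    fi
    echo "chdev -l sys0 -a maxpout=$high -a minpout=$low"
}

# Values taken from the panel above.
pacing_cmd 33 24
```

Printing the command first makes it easy to review the values before applying them as root on the node.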
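b. Shorten the syncd interval (the system default is 60 seconds).

If HACMP 4.4.0 or later is installed, the interval can be set directly from the HACMP menus.
A value of 10 is suggested.

smitty cm_tuning_parms_chsyncd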
Type or select values in entry fields.
Press Enter AFTER making all desired changes.

syncd frequency (in seconds)     [60]

Esc+1=Help   Esc+2=Refresh   Esc+3=Cancel   Esc+4=List
Esc+5=Reset  Esc+6=Command   Esc+7=Edit     Esc+8=Image
Esc+9=Shell  Esc+0=Exit      Enter=Do

If the HACMP version is older, edit the value passed to syncd in the /sbin/rc.boot file.

For example, in these lines:
echo "Starting the sync daemon" | alog -t boot
nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &
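The edit to the syncd line can be previewed with sed before touching the real file. This is a sketch only: set_syncd_interval is a hypothetical helper that reads rc.boot-style text on stdin and writes the modified text to stdout, leaving any file on disk unchanged.

```shell
#!/bin/sh
# Sketch: rewrite the syncd interval in rc.boot-style text read from stdin.
# set_syncd_interval is a hypothetical helper; it writes to stdout only.
set_syncd_interval() {
    # Replace the number that follows /usr/sbin/syncd with the new interval.
    sed "s|\(/usr/sbin/syncd\) [0-9][0-9]*|\1 $1|"
}

# Demo on the line shown above, using the suggested value of 10.
printf '%s\n' 'nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &' | set_syncd_interval 10
```

A cautious way to use it would be to run it against a copy of /sbin/rc.boot, compare the result with the original, and keep a backup before replacing the file.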

c. Slow down the heartbeat failure detection rate:

smitty cm_config_networks.chg_pre.select

Change a Cluster Network Module using Predefined Values

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                    [Entry Fields]
* Network Module Name               IP
  New Network Module Name           []
  Description                       [Generic IP]
  Failure Detection Rate            Normal >> slow

d. Adjust the network parameters:
# no -a
extendednetstats = 0
thewall = 6048
sockthresh = 85
sb_max = 1048576
somaxconn = 1024
clean_partial_conns = 0
net_malloc_police = 1
net_malloc_frag_mask = 0
rto_low = 1

# no -o thewall=131052

# no -a

extendednetstats = 0
thewall = 131052
sockthresh = 85
sb_max = 1048576
somaxconn = 1024
clean_partial_conns = 0
net_malloc_police = 1
net_malloc_frag_mask = 0
rto_low = 1
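To confirm that a tunable such as thewall really changed, the value can be pulled out of the `no -a` listing instead of being read by eye. The sketch below assumes the "name = value" layout shown above. get_tunable is a hypothetical helper, and the demo uses a pasted fragment of the listing rather than calling no.

```shell
#!/bin/sh
# Sketch: extract one tunable from "name = value" text read from stdin.
# get_tunable is a hypothetical helper, not part of the no command.
get_tunable() {
    awk -v name="$1" '$1 == name && $2 == "=" { print $3 }'
}

# Demo on a fragment of the listing shown above.
sample='extendednetstats = 0
thewall = 131052
sockthresh = 85'

printf '%s\n' "$sample" | get_tunable thewall
```

On the node itself this would presumably be used as `no -a | get_tunable thewall`, once before and once after the change.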

e. If HACMP is installed and a DMS event has still occurred, check whether the machine is running power management software. If it is, disable power management. For example:
smitty pm

                               Power Management

Move cursor to desired item and press Enter.

Enable / Disable Power Management State Transition
Configure / Unconfigure Power Management
System State Transition from Enable State
Change / Show Characteristics of Power Management
Power Management Timer
Display Power Management
Power Management Characteristics of Each Device
Battery
