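Two vlpar partitions run PowerHA in mutual-takeover mode, configured with IP aliases. Heartbeat is network plus disk. During resource failover testing I ran into the following problems:
1. Unplugging host1's service network cable moves the address to the other adapter; but unplugging host1's one remaining cable crashes node 2 outright.
2. Unplugging host2's service network cable moves the address to the other adapter; but unplugging host2's one remaining cable crashes node 1 outright.
3. Testing with halt -q works normally, and moving resources manually also works.
From the hacmp.out log it looks related to split-brain, but the disk heartbeat and network heartbeat configuration are both normal. The system level is 6100-09-03-1415 and the HA version is 6.1.0.11.
Could anyone help analyze this?
The log is as follows: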
May 9 10:28:01 EVENT START: node_down_complete host2
:node_down_complete[83] version=1.2.12.1
:node_down_complete[85] cl_get_path
:node_down_complete[85] HA_DIR=es
:node_down_complete[87] NODENAME=host2
:node_down_complete[87] export NODENAME
:node_down_complete[88] PARAM=''
:node_down_complete[88] export PARAM
:node_down_complete[90] VSD_PROG=/usr/lpp/csd/bin/hacmp_vsd_down2
:node_down_complete[91] HPS_PROG=/usr/es/sbin/cluster/events/utils/cl_HPS_init
:node_down_complete[92] NODE_HALT_CONTROL_FILE=/usr/es/sbin/cluster/etc/ha_nodehalt.lock
:node_down_complete[101] STATUS=0
:node_down_complete[103] [[ -z '' ]]
:node_down_complete[105] EMULATE=REAL
:node_down_complete[108] set -u
:node_down_complete[110] (( 1 < 1 ))
:node_down_complete[116] [[ '' == forced ]]
:node_down_complete[128] : if RG_DEPENDENCIES is set to false by the cluster manager,
:node_down_complete[129] : then resource groups will be processed via clsetenvgrp
:node_down_complete[131] [[ FALSE == FALSE ]]
:node_down_complete[134] : Set the RESOURCE_GROUPS environment variable with the names
:node_down_complete[135] : of all Resource Groups participating in this event, and export
:node_down_complete[136] : them to all successive scripts
:node_down_complete[138] set -a
:node_down_complete[139] clsetenvgrp host2 node_down_complete
:clsetenvgrp[+50] [[ high = high ]]
:clsetenvgrp[+50] version=1.16
:clsetenvgrp[+52] usingVer=clSetenvgrp
:clsetenvgrp[+57] clSetenvgrp host2 node_down_complete
executing clSetenvgrp
clSetenvgrp completed successfully
:clsetenvgrp[+58] exit 0
:node_down_complete[139] eval FORCEDOWN_GROUPS='""' RESOURCE_GROUPS='""' HOMELESS_GROUPS='""' HOMELESS_FOLLOWER_GROUPS='""' ERRSTATE_GROUPS='""' PRINCIPAL_ACTIONS='""' ASSOCIATE_ACTIONS='""' AUXILLIARY_ACTIONS='""'
:node_down_complete[1] FORCEDOWN_GROUPS=''
:node_down_complete[1] RESOURCE_GROUPS=''
:node_down_complete[1] HOMELESS_GROUPS=''
:node_down_complete[1] HOMELESS_FOLLOWER_GROUPS=''
:node_down_complete[1] ERRSTATE_GROUPS=''
:node_down_complete[1] PRINCIPAL_ACTIONS=''
:node_down_complete[1] ASSOCIATE_ACTIONS=''
:node_down_complete[1] AUXILLIARY_ACTIONS=''
:node_down_complete[140] RC=0
:node_down_complete[141] set +a
:node_down_complete[142] : exit status of clsetenvgrp host2 node_down_complete is: 0
:node_down_complete[143] (( 0 != 0 ))
:node_down_complete[151] : Process_Resources for parallel-processed resource groups
:node_down_complete[153] [[ FALSE == FALSE ]]
:node_down_complete[155] process_resources
:process_resources[2538] version=1.132.1.2
:process_resources[2541] STATUS=0
:process_resources[2542] sddsrv_off=FALSE
:process_resources[2544] true
:process_resources[2546] : call rgpa, and it will tell us what to do next
:process_resources[2548] set -a
:process_resources[2549] clRGPA
:clRGPA[+49] [[ high = high ]]
:clRGPA[+49] version=1.16
:clRGPA[+51] usingVer=clrgpa
:clRGPA[+56] clrgpa
2016-05-09T10:28:02.060250 clrgpa
:clRGPA[+57] exit 0
:process_resources[2549] eval JOB_TYPE=ERROR RESOURCE_GROUPS='"host2_rg"'
:process_resources[1] JOB_TYPE=ERROR
:process_resources[1] RESOURCE_GROUPS=host2_rg
:process_resources[2550] RC=0
:process_resources[2551] set +a
:process_resources[2553] (( 0 != 0 ))
:process_resources[2559] RESOURCE_GROUPS=host2_rg
+host2_rg:process_resources[2560] GROUPNAME=host2_rg
+host2_rg:process_resources[2560] export GROUPNAME
+host2_rg:process_resources[2830] set_resource_group_state ERROR
+host2_rg:process_resources[69] PS4_FUNC=set_resource_group_state
+host2_rg:process_resources[69] typeset PS4_FUNC
+host2_rg:process_resources[70] [[ high == high ]]
+host2_rg:process_resources[70] set -x
+host2_rg:process_resources[71] STAT=0
+host2_rg:process_resources[72] new_status=ERROR
+host2_rg:process_resources[76] export GROUPNAME
+host2_rg:process_resources[77] [[ ERROR != DOWN ]]
+host2_rg:process_resources[79] clchdaemons -d clstrmgr_scripts -t resource_locator -n host1 -o host2_rg -v ERROR
+host2_rg:process_resources[87] : Resource Manager Updates
+host2_rg:process_resources[107] cl_RMupdate rg_error host2_rg process_resources
2016-05-09T10:28:02.113488
2016-05-09T10:28:02.121127
Reference string: Mon.May.9.10:28:02.CST.2016.process_resources.host2_rg.ref
+host2_rg:process_resources[132] return 0
+host2_rg:process_resources[2544] true
+host2_rg:process_resources[2546] : call rgpa, and it will tell us what to do next
+host2_rg:process_resources[2548] set -a
+host2_rg:process_resources[2549] clRGPA
+host2_rg:clRGPA[+49] [[ high = high ]]
+host2_rg:clRGPA[+49] version=1.16
+host2_rg:clRGPA[+51] usingVer=clrgpa
+host2_rg:clRGPA[+56] clrgpa
2016-05-09T10:28:02.150158 clrgpa
+host2_rg:clRGPA[+57] exit 0
+host2_rg:process_resources[2549] eval JOB_TYPE=NONE
+host2_rg:process_resources[1] JOB_TYPE=NONE
+host2_rg:process_resources[2550] RC=0
+host2_rg:process_resources[2551] set +a
+host2_rg:process_resources[2553] (( 0 != 0 ))
+host2_rg:process_resources[2559] RESOURCE_GROUPS=host2_rg
+host2_rg:process_resources[2560] GROUPNAME=host2_rg
+host2_rg:process_resources[2560] export GROUPNAME
+host2_rg:process_resources[2864] break
+host2_rg:process_resources[2875] : If sddsrv was turned off above, turn it back on again
+host2_rg:process_resources[2877] [[ FALSE == TRUE ]]
+host2_rg:process_resources[2883] exit 0
:node_down_complete[156] RC=0
:node_down_complete[157] : exit status of process_resources is: 0
:node_down_complete[158] (( 0 != 0 ))
:node_down_complete[166] : VSD hook up
:node_down_complete[168] [[ '' != forced ]]
:node_down_complete[168] [[ -f /usr/lpp/csd/bin/hacmp_vsd_down2 ]]
:node_down_complete[180] : Determine whether this node contains an HACMP-controlled SP switch network
:node_down_complete[182] grep hps
:node_down_complete[182] clodmget '-qnodename = host1' -f type -n HACMPadapter
:node_down_complete[182] SP_SWITCH=''
:node_down_complete[184] SWITCH_TYPE=''
:node_down_complete[185] FED_TYPE=''
:node_down_complete[186] [[ -n '' ]]
:node_down_complete[227] : For each participating resource group, serially process the resources
:node_down_complete[229] LOCALCOMP=N
:node_down_complete[232] : if RG_DEPENDENCIES is set to false by the cluster manager,
:node_down_complete[233] : then resource groups will be processed via clsetenvgrp
:node_down_complete[235] [[ FALSE == FALSE ]]
:node_down_complete[274] [[ host2 == host1 ]]
:node_down_complete[361] : Refresh clcomd, FWIW
:node_down_complete[363] refresh -s clcomd
0513-095 The request for subsystem refresh was completed successfully.
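1. What is a vlpar? Do you mean an LPAR under VIOS?
2. The HA version is 6.1.0.11; I suggest updating to the latest service pack. SP11 is a fairly old patch level, so problems are not surprising.
3. Check the path status of the heartbeat disk. After unplugging both network cables on one host, check the heartbeat disk path status on the host that stayed up.
4. Throughout the whole test, keep pinging the relevant IP addresses continuously (-t) to monitor their status. An IP address conflict can produce this behavior.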
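When going through a long hacmp.out, it can help to pull out just the event starts and any point where clRGPA hands back JOB_TYPE=ERROR for a resource group, as it does for host2_rg in the excerpt above. Below is a minimal sketch in Python, assuming the log lines look like the ones pasted in this thread; the function name scan_hacmp_out and both regular expressions are illustrative and not part of PowerHA.

```python
import re

# Patterns modelled on the lines quoted in this thread (an assumption,
# other PowerHA levels may format these lines differently).
EVENT_RE = re.compile(r"EVENT START: (\S+)(?: (.*))?$")
JOB_RE = re.compile(r"""eval JOB_TYPE=(\w+)(?: RESOURCE_GROUPS='"([^"]*)"')?""")


def scan_hacmp_out(lines):
    """Return (events, errors) found in hacmp.out-style lines.

    events: list of "event_name args" strings, in order of appearance.
    errors: list of (event, resource_groups) tuples for each
            JOB_TYPE=ERROR result seen while that event was running.
    """
    events = []
    errors = []
    current = None  # most recent EVENT START seen so far
    for raw in lines:
        line = raw.rstrip("\n")
        m = EVENT_RE.search(line)
        if m:
            current = (m.group(1) + " " + (m.group(2) or "")).strip()
            events.append(current)
            continue
        m = JOB_RE.search(line)
        if m and m.group(1) == "ERROR":
            errors.append((current, m.group(2) or ""))
    return events, errors


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python scan_hacmp.py /path/to/hacmp.out
    with open(sys.argv[1], errors="replace") as fh:
        found_events, found_errors = scan_hacmp_out(fh)
    for ev in found_events:
        print("event:", ev)
    for ev, rgs in found_errors:
        print("ERROR during", ev, "for", rgs)
```

Run against the excerpt in this thread, it would list node_down_complete host2 and flag host2_rg as ERROR during that event. It only narrows down where to look; it does not say why the surviving node went down.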