PowerHA split-brain problem


Two vLPAR partitions run PowerHA in mutual-takeover mode, configured with IP aliasing and with both network and disk heartbeats. During resource-takeover testing we hit the following problems:

1. Pull the service network cable on host1: the address fails over to the other adapter. But pulling host1's one remaining cable immediately brings down node 2.

2. Pull the service network cable on host2: the address fails over to the other adapter. But pulling host2's one remaining cable immediately brings down node 1.

3. Testing with halt -q behaves normally every time, and manual resource moves also work fine.

Analysis of the hacmp.out log points to split-brain, yet both the disk-heartbeat and network-heartbeat configurations appear correct. The OS level is 6100-09-03-1415 and the HA level is 6.1.0.11.
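For reference, the levels and heartbeat state cited above can be re-verified on each node with standard AIX/PowerHA 6.1 utilities (a sketch only; `hdisk2` is a placeholder for the actual heartbeat disk):

```shell
# Confirm the exact OS and PowerHA fix levels in question
oslevel -s                      # should report 6100-09-03-1415
lslpp -l cluster.es.server.rte  # should report 6.1.0.11

# Topology Services view of every heartbeat ring, including the diskhb network
lssrc -ls topsvcs

# MPIO path state of the heartbeat disk (hdisk2 is a placeholder)
lspath -l hdisk2

# Cluster topology (networks, adapters) as PowerHA sees it
cltopinfo
```

In the `lssrc -ls topsvcs` output, the diskhb ring should show both members and a non-zero heartbeat count; a ring stuck at zero would explain why losing the last network path escalates to a node halt.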

Could anyone help analyze this?

The log is as follows:

May  9 10:28:01 EVENT START: node_down_complete host2

:node_down_complete[83] version=1.2.12.1

:node_down_complete[85] cl_get_path

:node_down_complete[85] HA_DIR=es

:node_down_complete[87] NODENAME=host2

:node_down_complete[87] export NODENAME

:node_down_complete[88] PARAM=''

:node_down_complete[88] export PARAM

:node_down_complete[90] VSD_PROG=/usr/lpp/csd/bin/hacmp_vsd_down2

:node_down_complete[91] HPS_PROG=/usr/es/sbin/cluster/events/utils/cl_HPS_init

:node_down_complete[92] NODE_HALT_CONTROL_FILE=/usr/es/sbin/cluster/etc/ha_nodehalt.lock

:node_down_complete[101] STATUS=0

:node_down_complete[103] [[ -z '' ]]

:node_down_complete[105] EMULATE=REAL

:node_down_complete[108] set -u

:node_down_complete[110] (( 1 < 1 ))

:node_down_complete[116] [[ '' == forced ]]

:node_down_complete[128] : if RG_DEPENDENCIES is set to false by the cluster manager,

:node_down_complete[129] : then resource groups will be processed via clsetenvgrp

:node_down_complete[131] [[ FALSE == FALSE ]]

:node_down_complete[134] : Set the RESOURCE_GROUPS environment variable with the names

:node_down_complete[135] : of all Resource Groups participating in this event, and export

:node_down_complete[136] : them to all successive scripts

:node_down_complete[138] set -a

:node_down_complete[139] clsetenvgrp host2 node_down_complete

:clsetenvgrp[+50] [[ high = high ]]

:clsetenvgrp[+50] version=1.16

:clsetenvgrp[+52] usingVer=clSetenvgrp

:clsetenvgrp[+57] clSetenvgrp host2 node_down_complete

executing clSetenvgrp

clSetenvgrp completed successfully

:clsetenvgrp[+58] exit 0

:node_down_complete[139] eval FORCEDOWN_GROUPS='""' RESOURCE_GROUPS='""' HOMELESS_GROUPS='""' HOMELESS_FOLLOWER_GROUPS='""' ERRSTATE_GROUPS='""' PRINCIPAL_ACTIONS='""' ASSOCIATE_ACTIONS='""' AUXILLIARY_ACTIONS='""'

:node_down_complete[1] FORCEDOWN_GROUPS=''

:node_down_complete[1] RESOURCE_GROUPS=''

:node_down_complete[1] HOMELESS_GROUPS=''

:node_down_complete[1] HOMELESS_FOLLOWER_GROUPS=''

:node_down_complete[1] ERRSTATE_GROUPS=''

:node_down_complete[1] PRINCIPAL_ACTIONS=''

:node_down_complete[1] ASSOCIATE_ACTIONS=''

:node_down_complete[1] AUXILLIARY_ACTIONS=''

:node_down_complete[140] RC=0

:node_down_complete[141] set +a

:node_down_complete[142] : exit status of clsetenvgrp host2 node_down_complete is: 0

:node_down_complete[143] (( 0 != 0 ))

:node_down_complete[151] : Process_Resources for parallel-processed resource groups

:node_down_complete[153] [[ FALSE == FALSE ]]

:node_down_complete[155] process_resources

:process_resources[2538] version=1.132.1.2

:process_resources[2541] STATUS=0

:process_resources[2542] sddsrv_off=FALSE

:process_resources[2544] true

:process_resources[2546] : call rgpa, and it will tell us what to do next

:process_resources[2548] set -a

:process_resources[2549] clRGPA

:clRGPA[+49] [[ high = high ]]

:clRGPA[+49] version=1.16

:clRGPA[+51] usingVer=clrgpa

:clRGPA[+56] clrgpa

2016-05-09T10:28:02.060250 clrgpa

:clRGPA[+57] exit 0

:process_resources[2549] eval JOB_TYPE=ERROR RESOURCE_GROUPS='"host2_rg"'

:process_resources[1] JOB_TYPE=ERROR

:process_resources[1] RESOURCE_GROUPS=host2_rg

:process_resources[2550] RC=0

:process_resources[2551] set +a

:process_resources[2553] (( 0 != 0 ))

:process_resources[2559] RESOURCE_GROUPS=host2_rg

+host2_rg:process_resources[2560] GROUPNAME=host2_rg

+host2_rg:process_resources[2560] export GROUPNAME

+host2_rg:process_resources[2830] set_resource_group_state ERROR

+host2_rg:process_resources[69] PS4_FUNC=set_resource_group_state

+host2_rg:process_resources[69] typeset PS4_FUNC

+host2_rg:process_resources[70] [[ high == high ]]

+host2_rg:process_resources[70] set -x

+host2_rg:process_resources[71] STAT=0

+host2_rg:process_resources[72] new_status=ERROR

+host2_rg:process_resources[76] export GROUPNAME

+host2_rg:process_resources[77] [[ ERROR != DOWN ]]

+host2_rg:process_resources[79] clchdaemons -d clstrmgr_scripts -t resource_locator -n host1 -o host2_rg -v ERROR

+host2_rg:process_resources[87] : Resource Manager Updates

+host2_rg:process_resources[107] cl_RMupdate rg_error host2_rg process_resources

2016-05-09T10:28:02.113488

2016-05-09T10:28:02.121127

Reference string: Mon.May.9.10:28:02.CST.2016.process_resources.host2_rg.ref

+host2_rg:process_resources[132] return 0

+host2_rg:process_resources[2544] true

+host2_rg:process_resources[2546] : call rgpa, and it will tell us what to do next

+host2_rg:process_resources[2548] set -a

+host2_rg:process_resources[2549] clRGPA

+host2_rg:clRGPA[+49] [[ high = high ]]

+host2_rg:clRGPA[+49] version=1.16

+host2_rg:clRGPA[+51] usingVer=clrgpa

+host2_rg:clRGPA[+56] clrgpa

2016-05-09T10:28:02.150158 clrgpa

+host2_rg:clRGPA[+57] exit 0

+host2_rg:process_resources[2549] eval JOB_TYPE=NONE

+host2_rg:process_resources[1] JOB_TYPE=NONE

+host2_rg:process_resources[2550] RC=0

+host2_rg:process_resources[2551] set +a

+host2_rg:process_resources[2553] (( 0 != 0 ))

+host2_rg:process_resources[2559] RESOURCE_GROUPS=host2_rg

+host2_rg:process_resources[2560] GROUPNAME=host2_rg

+host2_rg:process_resources[2560] export GROUPNAME

+host2_rg:process_resources[2864] break

+host2_rg:process_resources[2875] : If sddsrv was turned off above, turn it back on again

+host2_rg:process_resources[2877] [[ FALSE == TRUE ]]

+host2_rg:process_resources[2883] exit 0

:node_down_complete[156] RC=0

:node_down_complete[157] : exit status of process_resources is: 0

:node_down_complete[158] (( 0 != 0 ))

:node_down_complete[166] : VSD hook up

:node_down_complete[168] [[ '' != forced ]]

:node_down_complete[168] [[ -f /usr/lpp/csd/bin/hacmp_vsd_down2 ]]

:node_down_complete[180] : Determine whether this node contains an HACMP-controlled SP switch network

:node_down_complete[182] grep hps

:node_down_complete[182] clodmget '-qnodename = host1' -f type -n HACMPadapter

:node_down_complete[182] SP_SWITCH=''

:node_down_complete[184] SWITCH_TYPE=''

:node_down_complete[185] FED_TYPE=''

:node_down_complete[186] [[ -n '' ]]

:node_down_complete[227] : For each participating resource group, serially process the resources

:node_down_complete[229] LOCALCOMP=N

:node_down_complete[232] : if RG_DEPENDENCIES is set to false by the cluster manager,

:node_down_complete[233] : then resource groups will be processed via clsetenvgrp

:node_down_complete[235] [[ FALSE == FALSE ]]

:node_down_complete[274] [[ host2 == host1 ]]

:node_down_complete[361] : Refresh clcomd, FWIW

:node_down_complete[363] refresh -s clcomd

0513-095 The request for subsystem refresh was completed successfully.


Answer from hong2611 (System Engineer, Beijing Yinxin Changyuan Technology Co., Ltd.):

1. What is a "vLPAR"? Presumably an LPAR under VIOS.

2. The HA version is 6.1.0.11; I would recommend updating to the latest fix level. SP11 is a fairly old service pack, so problems at that level are not surprising.

3. Check the path state of the heartbeat disk. After pulling both network cables on one host, check the heartbeat-disk path state on the host that stays up.

4. Throughout the whole test, continuously ping the related IP addresses (a `ping -t` style loop) to monitor IP state. An IP address conflict can produce exactly this behavior.
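A minimal sketch of the continuous ping check in point 4 (the loopback address below is a placeholder for the real service/boot/persistent IPs of both nodes; `ping -t` is the Windows form of a continuous ping, while on AIX/Linux a loop like this does the same job):

```shell
#!/bin/sh
# Reachability probe for the cluster's addresses, per point 4 above.
# The address list is a placeholder -- substitute the service, boot,
# and persistent IPs of both nodes. ping option names differ slightly
# between AIX and Linux; -c (count) is common to both.

check_ip() {
    if ping -c 1 "$1" >/dev/null 2>&1; then
        echo "$1 reachable"
    else
        echo "$1 UNREACHABLE"
    fi
}

# A real test would wrap this in "while true; do ...; sleep 1; done"
# and log timestamps, so the moment an address flaps or duplicates
# during the cable-pull can be correlated with hacmp.out.
for ip in 127.0.0.1; do
    check_ip "$ip"
done
```

If an address suddenly answers while it should have failed over (or answers from two MACs), that is the IP conflict the answer warns about.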

System Integration · 2016-05-12
