Errors during HACMP configuration synchronization?

Environment:

AIX 6.1 + HACMP 6.1, with two nodes configured as follows.

Node 1:

hostname

LPAR1

The two network adapters, en0 and en1, are configured as follows:
en0:

  • HOSTNAME [LPAR1]
  • Internet ADDRESS (dotted decimal) [192.168.10.111]
    Network MASK (dotted decimal) [255.255.255.0]
  • Network INTERFACE en0
    NAMESERVER

         Internet ADDRESS (dotted decimal)         []
         DOMAIN Name                               []

    Default Gateway

     Address (dotted decimal or symbolic name)     [192.168.1.1]

    en1:

  • HOSTNAME [LPAR1]
  • Internet ADDRESS (dotted decimal) [192.168.20.111]
    Network MASK (dotted decimal) [255.255.255.0]
  • Network INTERFACE en1
    NAMESERVER

         Internet ADDRESS (dotted decimal)         []
         DOMAIN Name                               []

    Default Gateway

     Address (dotted decimal or symbolic name)     [192.168.1.1]
     Cost                                          [0]          

Node 2 configuration:

hostname

LPAR2

The two network adapters, en0 and en1, are configured as follows:
en0:

  • HOSTNAME [LPAR2]
  • Internet ADDRESS (dotted decimal) [192.168.10.112]
    Network MASK (dotted decimal) [255.255.255.0]
  • Network INTERFACE en0
    NAMESERVER

         Internet ADDRESS (dotted decimal)         []
         DOMAIN Name                               []

    Default Gateway

     Address (dotted decimal or symbolic name)     [192.168.1.1]
     Cost                                          [0]                                                                                     #
     Do Active Dead Gateway Detection?              no                                                                                    +

    Your CABLE Type N/A +
    START Now no

en1:

  • HOSTNAME [LPAR2]
  • Internet ADDRESS (dotted decimal) [192.168.20.112]
    Network MASK (dotted decimal) [255.255.255.0]
  • Network INTERFACE en1
    NAMESERVER

         Internet ADDRESS (dotted decimal)         []
         DOMAIN Name                               []

    Default Gateway

     Address (dotted decimal or symbolic name)     [192.168.1.1]
     Cost                                          [0]                                                                                     #
     Do Active Dead Gateway Detection?              no                                                                                    +

    Your CABLE Type N/A +
    START Now no

The /etc/hosts file on both nodes contains:

boot ip

192.168.10.111 LPAR1_boot
192.168.10.112 LPAR2_boot

standby ip

192.168.20.111 LPAR1_standby
192.168.20.112 LPAR2_standby

persistent ip

192.168.1.111 LPAR1
192.168.1.112 LPAR2

service ip

192.168.1.110 LPAR_srv

The problems I am running into:
1. After finishing the HACMP configuration, I ran smit hacmp -> Extended Configuration -> Extended Verification and Synchronization with the following options:

                                                    [Entry Fields]
  • Verify, Synchronize or Both [Both] +
  • Automatically correct errors found during [Yes] +
    verification?
  • Force synchronization if verification fails? [No] +
  • Verify changes only? [No] +
  • Logging [Standard]
    The final result was: OK
    However, the log below the result showed these errors:
    rshexec: cannot connect to node LPAR1
    Could not run clfilecollection -u on node LPAR1.
    rshexec: cannot connect to node LPAR2
    Could not run clfilecollection -u on node LPAR2.

Verification has completed normally.
rshexec: cannot connect to node LPAR1
ERROR: Cannot refresh clcomdES subsystem on node LPAR1
rshexec: cannot connect to node LPAR2
ERROR: Cannot refresh clcomdES subsystem on node LPAR2

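Does this error affect the HACMP configuration, and how can I fix it?
2. After running the step above, I found that /etc/hosts had been automatically rewritten as follows: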

boot ip

192.168.10.112 LPAR2_boot

standby ip

192.168.20.112 LPAR2_standby

persistent ip

192.168.1.111 LPAR1
192.168.1.112 LPAR2

service ip

192.168.1.110 LPAR_srv
192.168.10.111 LPAR1_boot LPAR1
192.168.20.111 LPAR1_standby LPAR1
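Aliases were added. What mechanism adds these aliases?
3. After the configuration above, I ran smit clstart and chose to start both nodes.
The result was OK, but the log below it showed: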
migcheck[475]: cl_connect() error, nodename=LPAR1, rc=-1
migcheck[475]: cl_connect() error, nodename=LPAR2, rc=-1

WARNING: A communication error was encountered trying to get the VRMF from remote nodes. Please make sure clcomd is running
Following the hint, I checked clcomd:

lssrc -s clcomd

Subsystem Group PID Status
clcomd caa 4980916 active
Both nodes show it as active. If it is active, why does the warning above appear?

After starting the services as in step 3, I checked the IP addresses.
On node LPAR1:

ifconfig -a|more

en0: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.10.111 netmask 0xffffff00 broadcast 192.168.10.255
    inet 192.168.1.111 netmask 0xffffff00 broadcast 192.168.1.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.20.111 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.110 netmask 0xffffff00 broadcast 192.168.1.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN>

    inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
    inet6 ::1%1/0
     tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
     

On node LPAR2:

ifconfig -a|more

en0: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.10.112 netmask 0xffffff00 broadcast 192.168.10.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.20.112 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.112 netmask 0xffffff00 broadcast 192.168.1.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

(...)
The IP addresses show nothing abnormal.
Running smit hacmp -> System Management (C-SPOC) -> HACMP Services -> Show Cluster Services
shows the services running as follows:

Status of the RSCT subsystems used by HACMP:
Subsystem Group PID Status
topsvcs topsvcs 9633858 active
grpsvcs grpsvcs 13172936 active
grpglsm grpsvcs inoperative
emsvcs emsvcs 7733330 active
emaixos emsvcs inoperative
ctrmc rsct 5112004 active

Status of the HACMP subsystems:
Subsystem Group PID Status
clcomdES clcomdES 4063414 active
clstrmgrES cluster 6815944 active

Status of the optional HACMP subsystems:
Subsystem Group PID Status
clinfoES cluster 4128932 active
At first glance these states all look normal, but stopping cluster services on LPAR1 fails with:
Command: failed stdout: yes stderr: no
cl_clstop: ERROR: Node LPAR1 has 1 event(s) outstanding as reported by command 'lssrc -ls clstrmgrES' and cannot be stopped until all outstanding events have completed. The stop request has been aborted for all nodes. Please wait for all nodes to stabalize before attempting to stop cluster services again.
Following the hint, I ran lssrc -ls clstrmgrES, with this result:

lssrc -ls clstrmgrES

Current state: ST_RP_FAILED
sccsid = "@(#)36 1.135.6.5 src/43haes/usr/sbin/cluster/hacmprd/main.C, hacmp.pe, 53haes_r610, 1442A_hacmp610 9/11/14 13:15:08"
i_local_nodeid 0, i_local_siteid -1, my_handle 1
ml_idx[1]=0 ml_idx[2]=1
tp is 20459278
Events on event queue:
te_type 4, te_nodeid 1, te_network -1
There are 0 events on the Ibcast queue
There are 0 events on the RM Ibcast queue
CLversion: 11
local node vrmf is 6111
cluster fix level is "1"
The following timer(s) are currently active:
Event error node list: LPAR1
Current DNP values
DNP Values for NodeId - 1 NodeName - LPAR1

PgSpFree = 128613  PvPctBusy = 0  PctTotalTimeIdle = 99.652258

DNP Values for NodeId - 2 NodeName - LPAR2

PgSpFree = 128973  PvPctBusy = 0  PctTotalTimeIdle = 99.790585

What is causing this?
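
When comparing nodes, it can help to reduce the long lssrc -ls clstrmgrES listing to the two fields that matter here: the "Current state" line and the "Event error node list" line. The following is only a minimal sketch; the function name clstrmgr_summary is hypothetical, and the sample text is copied from the output above rather than read from a live cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of HACMP): read `lssrc -ls clstrmgrES`
# output on stdin and print the cluster manager state and the nodes
# listed as having an event error.
clstrmgr_summary() {
    awk -F': ' '
        /^Current state:/         { print "state=" $2 }
        /^Event error node list:/ { print "error_nodes=" $2 }
    '
}

# Sample lines copied from the output in the question.
sample='Current state: ST_RP_FAILED
Events on event queue:
te_type 4, te_nodeid 1, te_network -1
Event error node list: LPAR1'

summary=$(printf '%s\n' "$sample" | clstrmgr_summary)
echo "$summary"
```

On a real node the same function could be fed from the command itself, for example lssrc -ls clstrmgrES | clstrmgr_summary.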


crystalwmagic · systems engineer, China Zheshang Bank (浙商银行)

Did you perhaps skip configuring the /usr/es/sbin/cluster/etc/rhosts file during setup?

2018-01-16
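  • Both sides are configured. On LPAR1, # more /usr/es/sbin/cluster/etc/rhosts shows 192.168.10.112 and 192.168.20.112; on LPAR2 the file contains 192.168.10.111 and 192.168.20.111.
    2018-01-16
  • Did you restart the clcomd service after adding the IP addresses?
    2018-01-16
  • I suggest adding every cluster-related address to the rhosts file (boot IPs, persistent IPs, and the service IP) rather than only the peer node's boot addresses on each LPAR. After changing the rhosts and hosts files, restart the clcomd and clcomdES services with a stop followed by a start, not a refresh.
    2018-01-17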
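
The suggestion above can be checked mechanically. The sketch below lists which cluster addresses are absent from an rhosts-style file (one address per line). The function name missing_rhosts_entries is hypothetical, the address list is taken from the /etc/hosts listing in the question, and the example runs against a scratch file that mirrors LPAR1's reported rhosts rather than the real /usr/es/sbin/cluster/etc/rhosts.

```shell
#!/bin/sh
# Hypothetical helper: print every expected cluster address that is
# missing from an rhosts-style file (one address per line).
missing_rhosts_entries() {
    rhosts_file=$1
    shift
    for ip in "$@"; do
        # -x: match the whole line, -F: fixed string, -q: no output
        grep -qxF "$ip" "$rhosts_file" || echo "$ip"
    done
}

# Boot, standby, persistent, and service addresses from the question.
CLUSTER_IPS="192.168.10.111 192.168.10.112 192.168.20.111 192.168.20.112 192.168.1.111 192.168.1.112 192.168.1.110"

# Scratch copy mirroring what LPAR1's rhosts was reported to contain.
demo=$(mktemp)
printf '%s\n' 192.168.10.112 192.168.20.112 > "$demo"

# CLUSTER_IPS is left unquoted on purpose so it splits into arguments.
missing=$(missing_rhosts_entries "$demo" $CLUSTER_IPS)
echo "$missing"
rm -f "$demo"
```

After adding the missing addresses to the real file on both nodes, the stop-then-start restart recommended above would, assuming the standard AIX SRC commands and the clcomdES subsystem name shown in the service listing, be stopsrc -s clcomdES followed by startsrc -s clcomdES.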
