Errors when synchronizing an HACMP configuration?

Environment:

AIX 6.1 + HACMP 6.1, two nodes configured as follows:

Node 1:

hostname

LPAR1

The two network adapters, en0 and en1, are configured as follows:
en0:

  • HOSTNAME [LPAR1]
  • Internet ADDRESS (dotted decimal) [192.168.10.111]
    Network MASK (dotted decimal) [255.255.255.0]
  • Network INTERFACE en0
    NAMESERVER

         Internet ADDRESS (dotted decimal)         []
         DOMAIN Name                               []

    Default Gateway

     Address (dotted decimal or symbolic name)     [192.168.1.1]

    en1:

  • HOSTNAME [LPAR1]
  • Internet ADDRESS (dotted decimal) [192.168.20.111]
    Network MASK (dotted decimal) [255.255.255.0]
  • Network INTERFACE en1
    NAMESERVER

         Internet ADDRESS (dotted decimal)         []
         DOMAIN Name                               []

    Default Gateway

     Address (dotted decimal or symbolic name)     [192.168.1.1]
     Cost                                          [0]          

Node 2 configuration:

hostname

LPAR2

The two network adapters, en0 and en1, are configured as follows:
en0:

  • HOSTNAME [LPAR2]
  • Internet ADDRESS (dotted decimal) [192.168.10.112]
    Network MASK (dotted decimal) [255.255.255.0]
  • Network INTERFACE en0
    NAMESERVER

         Internet ADDRESS (dotted decimal)         []
         DOMAIN Name                               []

    Default Gateway

     Address (dotted decimal or symbolic name)     [192.168.1.1]
     Cost                                          [0]                                                                                     #
     Do Active Dead Gateway Detection?              no                                                                                    +

    Your CABLE Type N/A +
    START Now no

en1:

  • HOSTNAME [LPAR2]
  • Internet ADDRESS (dotted decimal) [192.168.20.112]
    Network MASK (dotted decimal) [255.255.255.0]
  • Network INTERFACE en1
    NAMESERVER

         Internet ADDRESS (dotted decimal)         []
         DOMAIN Name                               []

    Default Gateway

     Address (dotted decimal or symbolic name)     [192.168.1.1]
     Cost                                          [0]                                                                                     #
     Do Active Dead Gateway Detection?              no                                                                                    +

    Your CABLE Type N/A +
    START Now no

The /etc/hosts file on both nodes contains:

boot ip

192.168.10.111 LPAR1_boot
192.168.10.112 LPAR2_boot

standby ip

192.168.20.111 LPAR1_standby
192.168.20.112 LPAR2_standby

persistent ip

192.168.1.111 LPAR1
192.168.1.112 LPAR2

service ip

192.168.1.110 LPAR_srv
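
A quick way to see how these addresses relate to one another is to group them by subnet. The sketch below is only an illustration in Python and is not part of the original setup. It assumes the 255.255.255.0 mask from the SMIT panels applies to every address, and the helper names are made up for this example.

```python
import ipaddress

# Labels and addresses copied from the /etc/hosts listing in the question.
HOSTS = {
    "LPAR1_boot": "192.168.10.111",
    "LPAR2_boot": "192.168.10.112",
    "LPAR1_standby": "192.168.20.111",
    "LPAR2_standby": "192.168.20.112",
    "LPAR1": "192.168.1.111",
    "LPAR2": "192.168.1.112",
    "LPAR_srv": "192.168.1.110",
}
# Default gateway entered in the SMIT panels for both adapters.
GATEWAY = "192.168.1.1"


def subnet_of(ip, prefix=24):
    """Return the network an address belongs to (prefix 24 is an assumption)."""
    return ipaddress.ip_interface(f"{ip}/{prefix}").network


def group_by_subnet(hosts, prefix=24):
    """Map each subnet (as a string) to the labels whose address lies in it."""
    groups = {}
    for label, ip in hosts.items():
        groups.setdefault(str(subnet_of(ip, prefix)), []).append(label)
    return groups


def labels_sharing_gateway_subnet(hosts, gateway, prefix=24):
    """Return the labels whose address is in the same subnet as the gateway."""
    gw_net = subnet_of(gateway, prefix)
    return sorted(
        label for label, ip in hosts.items() if subnet_of(ip, prefix) == gw_net
    )


if __name__ == "__main__":
    for net, labels in group_by_subnet(HOSTS).items():
        print(net, labels)
    print("same subnet as gateway:", labels_sharing_gateway_subnet(HOSTS, GATEWAY))
```

With the values from the question, the default gateway 192.168.1.1 falls in the same /24 as the persistent and service addresses, and not in the subnets of the boot or standby addresses. Whether that plays any part in the errors reported in this thread is not established here.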

The problems I am running into:
1. After finishing the HACMP configuration, I ran smit hacmp->Extended Configuration->Extended Verification and Synchronization with these options:

                                                    [Entry Fields]
  • Verify, Synchronize or Both [Both] +
  • Automatically correct errors found during [Yes] +
    verification?
  • Force synchronization if verification fails? [No] +
  • Verify changes only? [No] +
  • Logging [Standard]
    The final result was: OK
    But the log below the result showed these errors:
    rshexec: cannot connect to node LPAR1
    Could not run clfilecollection -u on node LPAR1.
    rshexec: cannot connect to node LPAR2
    Could not run clfilecollection -u on node LPAR2.

Verification has completed normally.
rshexec: cannot connect to node LPAR1
ERROR: Cannot refresh clcomdES subsystem on node LPAR1
rshexec: cannot connect to node LPAR2
ERROR: Cannot refresh clcomdES subsystem on node LPAR2

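Does this error affect the HACMP configuration, and how can it be fixed?
2. After running the step above, I found that /etc/hosts had been modified automatically to look like this: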

boot ip

192.168.10.112 LPAR2_boot

standby ip

192.168.20.112 LPAR2_standby

persistent ip

192.168.1.111 LPAR1
192.168.1.112 LPAR2

service ip

192.168.1.110 LPAR_srv
192.168.10.111 LPAR1_boot LPAR1
192.168.20.111 LPAR1_standby LPAR1
Aliases were added. What mechanism adds these aliases?
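
The rewritten file now maps the name LPAR1 to three different addresses. The thread does not establish what added the aliases, but the effect can be made visible with a small check. This is a minimal Python sketch with hypothetical helper names; it lists every name that appears on more than one address.

```python
import ipaddress
from collections import defaultdict

# The modified /etc/hosts from the question; label lines are kept to show
# that the parser skips them.
MODIFIED_HOSTS = """\
boot ip
192.168.10.112 LPAR2_boot
standby ip
192.168.20.112 LPAR2_standby
persistent ip
192.168.1.111 LPAR1
192.168.1.112 LPAR2
service ip
192.168.1.110 LPAR_srv
192.168.10.111 LPAR1_boot LPAR1
192.168.20.111 LPAR1_standby LPAR1
"""


def is_address(token):
    """True if the token parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def ambiguous_names(hosts_text):
    """Return {name: [addresses]} for names listed on more than one address."""
    seen = defaultdict(list)
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if not fields or not is_address(fields[0]):
            continue  # blank line, comment, or a label such as "boot ip"
        address, names = fields[0], fields[1:]
        for name in names:
            if address not in seen[name]:
                seen[name].append(address)
    return {name: addrs for name, addrs in seen.items() if len(addrs) > 1}


if __name__ == "__main__":
    for name, addrs in ambiguous_names(MODIFIED_HOSTS).items():
        print(name, "->", ", ".join(addrs))
```

Run against the original file from the question, the same check reports nothing, since each name there appears on exactly one address.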
3. After the configuration above, I ran smit clstart and chose to start both nodes.
The result was OK, but the log below showed:
migcheck[475]: cl_connect() error, nodename=LPAR1, rc=-1
migcheck[475]: cl_connect() error, nodename=LPAR2, rc=-1

WARNING: A communication error was encountered trying to get the VRMF from remote nodes. Please make sure clcomd is running
Following the hint, I checked clcomd:

lssrc -s clcomd

Subsystem Group PID Status
clcomd caa 4980916 active
Both nodes show active. If it is active, why does the warning above appear?
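
One detail worth noting: the command above queries a subsystem named clcomd in group caa, while the Show Cluster Services output later in this post lists the HACMP communication daemon as clcomdES. Whether that difference explains the warning is not established in this thread. The Python sketch below (function name hypothetical) just parses lssrc-style tables so the subsystem that was actually checked is easy to compare.

```python
# Output copied from the question: the manual check in step 3 ...
CLCOMD_OUTPUT = """\
Subsystem Group PID Status
clcomd caa 4980916 active
"""

# ... and an excerpt of what Show Cluster Services reported.
HACMP_OUTPUT = """\
Subsystem Group PID Status
grpglsm grpsvcs inoperative
clcomdES clcomdES 4063414 active
clstrmgrES cluster 6815944 active
"""


def parse_lssrc(text):
    """Parse lssrc table rows into {subsystem: {group, pid, status}}.

    Rows for inoperative subsystems have no PID column, so the PID is
    only read when four fields are present.
    """
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Subsystem":
            continue  # skip blank lines and the header row
        rows[fields[0]] = {
            "group": fields[1] if len(fields) >= 3 else "",
            "pid": fields[2] if len(fields) == 4 else None,
            "status": fields[-1],
        }
    return rows


if __name__ == "__main__":
    checked = parse_lssrc(CLCOMD_OUTPUT)
    hacmp = parse_lssrc(HACMP_OUTPUT)
    print("checked manually:", sorted(checked))
    print("listed by HACMP: ", sorted(hacmp))
    print("clcomdES checked manually?", "clcomdES" in checked)
```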

After starting the services as in step 3, I checked the IP addresses.
On node LPAR1:

ifconfig -a|more

en0: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.10.111 netmask 0xffffff00 broadcast 192.168.10.255
    inet 192.168.1.111 netmask 0xffffff00 broadcast 192.168.1.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.20.111 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.110 netmask 0xffffff00 broadcast 192.168.1.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN>

    inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
    inet6 ::1%1/0
     tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
     

On node LPAR2:

ifconfig -a|more

en0: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.10.112 netmask 0xffffff00 broadcast 192.168.10.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.20.112 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.112 netmask 0xffffff00 broadcast 192.168.1.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

(...)
The IP addresses show nothing abnormal.
Using smit hacmp->System Management (C-SPOC)--> HACMP Services-->Show Cluster Services,
the services are shown running as follows:

Status of the RSCT subsystems used by HACMP:
Subsystem Group PID Status
topsvcs topsvcs 9633858 active
grpsvcs grpsvcs 13172936 active
grpglsm grpsvcs inoperative
emsvcs emsvcs 7733330 active
emaixos emsvcs inoperative
ctrmc rsct 5112004 active

Status of the HACMP subsystems:
Subsystem Group PID Status
clcomdES clcomdES 4063414 active
clstrmgrES cluster 6815944 active

Status of the optional HACMP subsystems:
Subsystem Group PID Status
clinfoES cluster 4128932 active
At first glance these states all look normal, but stopping cluster services on LPAR1 fails with:
Command: failed stdout: yes stderr: no
cl_clstop: ERROR: Node LPAR1 has 1 event(s) outstanding as reported by command 'lssrc -ls clstrmgrES' and cannot be stopped until all outstanding events have completed. The stop request has been aborted for all nodes. Please wait for all nodes to stabalize before attempting to stop cluster services again.
Following the hint, I ran lssrc -ls clstrmgrES, with this result:

lssrc -ls clstrmgrES

Current state: ST_RP_FAILED
sccsid = "@(#)36 1.135.6.5 src/43haes/usr/sbin/cluster/hacmprd/main.C, hacmp.pe, 53haes_r610, 1442A_hacmp610 9/11/14 13:15:08"
i_local_nodeid 0, i_local_siteid -1, my_handle 1
ml_idx[1]=0 ml_idx[2]=1
tp is 20459278
Events on event queue:
te_type 4, te_nodeid 1, te_network -1
There are 0 events on the Ibcast queue
There are 0 events on the RM Ibcast queue
CLversion: 11
local node vrmf is 6111
cluster fix level is "1"
The following timer(s) are currently active:
Event error node list: LPAR1
Current DNP values
DNP Values for NodeId - 1 NodeName - LPAR1

PgSpFree = 128613  PvPctBusy = 0  PctTotalTimeIdle = 99.652258

DNP Values for NodeId - 2 NodeName - LPAR2

PgSpFree = 128973  PvPctBusy = 0  PctTotalTimeIdle = 99.790585

What is the cause of this?
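
The lines of that output most relevant to the failed stop are the current state, the queued event, and the event error node list. As a reading aid only, here is a minimal Python sketch (function name hypothetical) that pulls those three items out of the text. It does not interpret them.

```python
import re

# Excerpt of the lssrc -ls clstrmgrES output shown in the question.
CLSTRMGR_OUTPUT = """\
Current state: ST_RP_FAILED
i_local_nodeid 0, i_local_siteid -1, my_handle 1
Events on event queue:
te_type 4, te_nodeid 1, te_network -1
There are 0 events on the Ibcast queue
There are 0 events on the RM Ibcast queue
local node vrmf is 6111
Event error node list: LPAR1
"""


def summarize_clstrmgr(text):
    """Extract the state, queued events, and error node list from the output."""
    state = re.search(r"^Current state:\s*(\S+)", text, re.M)
    # [ \t]* rather than \s* so an empty list does not swallow the next line.
    nodes = re.search(r"^Event error node list:[ \t]*(.*)$", text, re.M)
    events = re.findall(r"^te_type\s+(\d+),\s*te_nodeid\s+(\d+)", text, re.M)
    return {
        "state": state.group(1) if state else None,
        "error_nodes": nodes.group(1).split() if nodes else [],
        "queued_events": [(int(t), int(n)) for t, n in events],
    }


if __name__ == "__main__":
    print(summarize_clstrmgr(CLSTRMGR_OUTPUT))
```

For the output in the question this yields the state ST_RP_FAILED, one queued event (type 4 on node id 1), and LPAR1 in the error node list, which matches the "1 event(s) outstanding" in the cl_clstop message.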

wangmj, system operations engineer, CES

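The alias problem in the hosts file is probably caused by the node names you used when configuring HA.
After the services came up, your two persistent addresses ended up on adapters in different VLANs, which also does not look quite right.
As for the clcomd service, I suggest reading the part of the official documentation that covers how to configure it. I have not configured version 6, so I am not sure of the details.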
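
The point about the persistent addresses can be checked against the ifconfig output in the question. The Python sketch below is an illustration with hypothetical helper names. It parses abbreviated copies of that output (flags shortened) and reports which adapter holds a given address.

```python
import re

# Abbreviated from the ifconfig -a output in the question (flags shortened).
LPAR1_IFCONFIG = """\
en0: flags=1e084863,c0<UP,BROADCAST,RUNNING>
    inet 192.168.10.111 netmask 0xffffff00 broadcast 192.168.10.255
    inet 192.168.1.111 netmask 0xffffff00 broadcast 192.168.1.255
en1: flags=1e084863,c0<UP,BROADCAST,RUNNING>
    inet 192.168.20.111 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.110 netmask 0xffffff00 broadcast 192.168.1.255
"""

LPAR2_IFCONFIG = """\
en0: flags=1e084863,c0<UP,BROADCAST,RUNNING>
    inet 192.168.10.112 netmask 0xffffff00 broadcast 192.168.10.255
en1: flags=1e084863,c0<UP,BROADCAST,RUNNING>
    inet 192.168.20.112 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.112 netmask 0xffffff00 broadcast 192.168.1.255
"""


def inet_by_interface(text):
    """Return {interface: [IPv4 addresses]} from ifconfig -a style text."""
    result = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(\w+): flags=", line)
        if header:
            current = header.group(1)
            result[current] = []
            continue
        inet = re.match(r"^\s+inet (\d+\.\d+\.\d+\.\d+) ", line)
        if inet and current:
            result[current].append(inet.group(1))
    return result


def interface_holding(text, address):
    """Return the interface name that carries the address, or None."""
    for ifname, addrs in inet_by_interface(text).items():
        if address in addrs:
            return ifname
    return None


if __name__ == "__main__":
    print("LPAR1 persistent on", interface_holding(LPAR1_IFCONFIG, "192.168.1.111"))
    print("LPAR2 persistent on", interface_holding(LPAR2_IFCONFIG, "192.168.1.112"))
```

On the data from the question, 192.168.1.111 sits on en0 of LPAR1 while 192.168.1.112 sits on en1 of LPAR2, which is the asymmetry the answer refers to.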

Banking · 2018-01-16
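  • So which part of the configuration is wrong? Please point me in the right direction, thanks.
    2018-01-16
  • I get the impression you have not yet worked out what the boot, standby, and persistent IPs in HA are for. There is no need to apply them mechanically; they are only names. I suggest downloading the official Redbook and going through the PowerHA 6 configuration steps once. I have only used versions 5 and 7, so I cannot say exactly how version 6 should be configured.
    2018-01-16
  • This is a test environment, and I chose those names only to tell the addresses apart. I spent a long time beforehand working out what each IP is for. The boot IP is the address the machine comes up with, that is, the one set in the adapter configuration. Standby can be understood as a backup boot-type IP that serves the same purpose, mainly for communication inside the cluster. The persistent IP is bound to a node and can move between adapters on that node, mainly for node management. The service IP is the address that serves clients and can move between the cluster nodes to keep the service available. I followed the PowerHA documentation for the configuration and simply cannot tell where it went wrong.
    2018-01-16
  • Since it is a test environment, I suggest deleting the existing setup and configuring it again. When planning the hosts file, map the original boot IPs directly to the host names, like this:
    boot ip
    192.168.10.111 LPAR1
    192.168.10.112 LPAR2
    standby ip
    192.168.20.111 LPAR1_st
    192.168.20.112 LPAR2_st
    persistent ip
    192.168.1.111 LPAR1_per
    192.168.1.112 LPAR2_per
    service ip
    192.168.1.110 LPAR_srv
    Make sure that resolving a host name through the hosts file returns the correct address. Because the persistent IP is defined inside HA, it is not reachable before the first HA synchronization, so the names used to define the nodes should have a working IP from the start. Once the hosts file is set up this way, restart clcomd and verify that it is OK. As for why the cluster would not stop after starting, read your hacmp.out log carefully.
    2018-01-17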
