Although HACMP provides an automated test tool that is fairly simple to use, a complete HACMP test is in my view a complex undertaking. The tool has been around for quite a while, but it still does not feel entirely trustworthy, and it cannot simulate failures such as a switch outage. It can therefore only assist the effort: do not rely on it exclusively, and treat its results as reference only.
2.1. Test methods
1. ping test: initiated from the client, 1024-byte packets, sustained for 10 minutes (a sample command follows this list).
2. Long ping test: 1024-byte packets, sustained for 24 hours.
3. Application test: use an automated test tool such as LoadRunner to keep connecting to the application service from the client and running queries.
4. Long application test: run the application test continuously for 48 hours.
5. telnet test: connect via telnet and verify as the situation requires.
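A minimal example of how the ping test can be driven (an illustration only; host1_l2_svc stands for whichever service label you are testing, and on AIX the size and count can also be given positionally as "ping host1_l2_svc 1024 600"):
[client][root][/]>ping -s 1024 -c 600 host1_l2_svc      # 1024-byte payload, 600 packets at the default 1 s interval, about 10 minutes
For the 24-hour long test, raise the count accordingly (for example -c 86400) or omit it and stop the test by hand.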
2.2. Standard test
This test is mandatory. Every network segment must be tested once; it is normally performed at the initial-configuration and final-configuration stages of installation, and again during scheduled maintenance windows.
2.2.1. Standard test table
Note: after each step, use clstat to confirm that HACMP is back in the STABLE state before performing the next step. This is especially important for the recovery steps (steps 4 and 10 actually consist of 3 small sub-steps each); it is best to wait 120-300 s between steps, otherwise HACMP may not have had time to reach a decision while its state is still unstable, and anomalies can occur. (A quick way to script this check is sketched after the table.)
No. | Test step | System result | Application result |
1 | Unplug host1's service network cable | The address swaps to the other adapter | ~30 s interruption, then service continues |
2 | Unplug host1's remaining network cable | Failover takes place | ~5 min interruption, then service continues |
3 | Unplug host2's service network cable | All service addresses swap to the other adapter | ~30 s interruption, then service continues |
4 | Reconnect all network cables | The addresses join; clstat shows everything up | No impact |
5 | Run halt -q on host2 | host2 goes down; failover to host1 | ~5 min interruption, then service continues |
6 | Boot host2 and manually run smit clstart on host2 to rejoin the cluster | host2's resources and services running on host1 move back to host2; the cluster returns to its designed state | ~5 min interruption, then service continues |
7 | Unplug host2's service network cable | The address swaps to the other adapter | ~30 s interruption, then service continues |
8 | Unplug host2's remaining network cable | Failover takes place | ~5 min interruption, then service continues |
9 | Unplug host1's service network cable | All service addresses swap to the other adapter | ~30 s interruption, then service continues |
10 | Reconnect all network cables | The addresses join; clstat shows everything up | No impact |
11 | Run halt -q on host1 | host1 goes down; failover to host2 | ~5 min interruption, then service continues |
12 | Boot host1 and manually run smit clstart on host1 to rejoin the cluster | host1's resources and services running on host2 move back to host1; the cluster returns to its designed state | ~5 min interruption, then service continues |
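Between steps, a quick way to confirm the cluster has settled back to STABLE (a sketch; on the HACMP levels I have used, the cluster manager reports its internal state via lssrc, and clstat -o gives a one-shot text snapshot):
[host1][root][/]>lssrc -ls clstrmgrES | grep -i "Current state"    # expect ST_STABLE
[host1][root][/]>/usr/sbin/cluster/clstat -o                       # one-shot, non-interactive status display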
The following is a partial analysis of the /var/hacmp/log/hacmp.out log, for reference during your own tests:
Step 1: unplug host1's service network cable
Sep 16 14:53:10 EVENT START: swap_adapter host1 net_ether_02 10.2.12.1 10.2.200.1
Sep 16 14:53:12 EVENT START: swap_aconn_protocols en3 en1
Sep 16 14:53:12 EVENT COMPLETED: swap_aconn_protocols en3 en1 0
Sep 16 14:53:12 EVENT COMPLETED: swap_adapter host1 net_ether_02 10.2.12.1 10.2.200.1 0
Sep 16 14:53:12 EVENT START: swap_adapter_complete host1 net_ether_02 10.2.12.1 10.2.200.1
Sep 16 14:53:13 EVENT COMPLETED: swap_adapter_complete host1 net_ether_02 10.2.12.1 10.2.200.1 0
Step 2: unplug host1's remaining network cable
Sep 16 14:53:14 EVENT START: fail_interface host1 10.2.2.1
Sep 16 14:53:14 EVENT COMPLETED: fail_interface host1 10.2.2.1 0
Sep 16 14:53:55 EVENT START: network_down host1 net_ether_02
Sep 16 14:53:56 EVENT COMPLETED: network_down host1 net_ether_02 0
Sep 16 14:53:56 EVENT START: network_down_complete host1 net_ether_02
Sep 16 14:53:56 EVENT COMPLETED: network_down_complete host1 net_ether_02 0
Sep 16 14:54:03 EVENT START: rg_move_release host1 1
Sep 16 14:54:03 EVENT START: rg_move host1 1 RELEASE
Sep 16 14:54:03 EVENT START: node_down_local
Sep 16 14:54:03 EVENT START: stop_server host2_app host1_app
Sep 16 14:54:04 EVENT COMPLETED: stop_server host2_app host1_app 0
Sep 16 14:54:04 EVENT START: release_vg_fs ALL host1vg
Sep 16 14:54:06 EVENT COMPLETED: release_vg_fs ALL host1vg 0
Sep 16 14:54:06 EVENT START: release_service_addr host1_l1_svc1 host1_l1_svc2 host1_l2_svc
Sep 16 14:54:11 EVENT COMPLETED: release_service_addr host1_l1_svc1 host1_l1_svc2 host1_l2_svc 0
Sep 16 14:54:11 EVENT COMPLETED: node_down_local 0
Sep 16 14:54:11 EVENT COMPLETED: rg_move host1 1 RELEASE 0
Sep 16 14:54:11 EVENT COMPLETED: rg_move_release host1 1 0
Sep 16 14:54:13 EVENT START: rg_move_fence host1 1
Sep 16 14:54:14 EVENT COMPLETED: rg_move_fence host1 1 0
Sep 16 14:54:14 EVENT START: rg_move_acquire host1 1
Sep 16 14:54:14 EVENT START: rg_move host1 1 ACQUIRE
Sep 16 14:54:14 EVENT COMPLETED: rg_move host1 1 ACQUIRE 0
Sep 16 14:54:14 EVENT COMPLETED: rg_move_acquire host1 1 0
Sep 16 14:54:24 EVENT START: rg_move_complete host1 1
Sep 16 14:54:25 EVENT START: node_up_remote_complete host1
Sep 16 14:54:25 EVENT COMPLETED: node_up_remote_complete host1 0
Sep 16 14:54:25 EVENT COMPLETED: rg_move_complete host1 1 0
Step 4: reconnect all network cables
Sep 16 14:55:49 EVENT START: network_up host1 net_ether_02
Sep 16 14:55:49 EVENT COMPLETED: network_up host1 net_ether_02 0
Sep 16 14:55:50 EVENT START: network_up_complete host1 net_ether_02
Sep 16 14:55:50 EVENT COMPLETED: network_up_complete host1 net_ether_02 0
Sep 16 14:56:00 EVENT START: join_interface host1 10.2.12.1
Sep 16 14:56:00 EVENT COMPLETED: join_interface host1 10.2.12.1 0
Step 5: run halt -q on host2
Sep 16 14:58:56 EVENT START: node_down host2
Sep 16 14:58:57 EVENT START: acquire_service_addr
Sep 16 14:58:58 EVENT START: acquire_aconn_service en0 net_ether_01
Sep 16 14:58:59 EVENT COMPLETED: acquire_aconn_service en0 net_ether_01 0
Sep 16 14:59:00 EVENT START: acquire_aconn_service en2 net_ether_01
Sep 16 14:59:00 EVENT COMPLETED: acquire_aconn_service en2 net_ether_01 0
Sep 16 14:59:01 EVENT START: acquire_aconn_service en1 net_ether_02
Sep 16 14:59:01 EVENT COMPLETED: acquire_aconn_service en1 net_ether_02 0
Sep 16 14:59:01 EVENT COMPLETED: acquire_service_addr 0
Sep 16 14:59:02 EVENT START: acquire_takeover_addr
Sep 16 14:59:05 EVENT COMPLETED: acquire_takeover_addr 0
Sep 16 14:59:11 EVENT COMPLETED: node_down host2 0
Sep 16 14:59:11 EVENT START: node_down_complete host2
Sep 16 14:59:12 EVENT START: start_server host1_app host2_app
Sep 16 14:59:12 EVENT START: start_server host2_app
Sep 16 14:59:12 EVENT COMPLETED: start_server host1_app host2_app 0
Sep 16 14:59:12 EVENT COMPLETED: start_server host2_app 0
Sep 16 14:59:13 EVENT COMPLETED: node_down_complete host2 0
Step 6: fail back (rejoin the cluster)
Sep 16 15:10:25 EVENT START: node_up host2
Sep 16 15:10:27 EVENT START: acquire_service_addr
Sep 16 15:10:28 EVENT START: acquire_aconn_service en0 net_ether_01
Sep 16 15:10:28 EVENT COMPLETED: acquire_aconn_service en0 net_ether_01 0
Sep 16 15:10:29 EVENT START: acquire_aconn_service en2 net_ether_01
Sep 16 15:10:29 EVENT COMPLETED: acquire_aconn_service en2 net_ether_01 0
Sep 16 15:10:31 EVENT START: acquire_aconn_service en1 net_ether_02
Sep 16 15:10:31 EVENT COMPLETED: acquire_aconn_service en1 net_ether_02 0
Sep 16 15:10:31 EVENT COMPLETED: acquire_service_addr 0
Sep 16 15:10:36 EVENT COMPLETED: node_up host2 0
Sep 16 15:10:36 EVENT START: node_up_complete host2
Sep 16 15:10:36 EVENT START: start_server host2_app
Sep 16 15:10:37 EVENT COMPLETED: start_server host2_app 0
Sep 16 15:10:37 EVENT COMPLETED: node_up_complete host2 0
Sep 16 15:10:41 EVENT START: network_up host2 net_diskhbmulti_01
Sep 16 15:10:42 EVENT COMPLETED: network_up host2 net_diskhbmulti_01 0
Sep 16 15:10:42 EVENT START: network_up_complete host2 net_diskhbmulti_01
Sep 16 15:10:42 EVENT COMPLETED: network_up_complete host2 net_diskhbmulti_01 0
Step 7: unplug host2's service network cable
Sep 16 15:20:36 EVENT START: swap_adapter host2 net_ether_02 10.2.12.2 10.2.200.2
Sep 16 15:20:38 EVENT START: swap_aconn_protocols en3 en1
Sep 16 15:20:38 EVENT COMPLETED: swap_aconn_protocols en3 en1 0
Sep 16 15:20:38 EVENT COMPLETED: swap_adapter host2 net_ether_02 10.2.12.2 10.2.200.2 0
Sep 16 15:20:39 EVENT START: swap_adapter_complete host2 net_ether_02 10.2.12.2 10.2.200.2
Sep 16 15:20:39 EVENT COMPLETED: swap_adapter_complete host2 net_ether_02 10.2.12.2 10.2.200.2 0
Step 8: unplug host2's remaining network cable
Sep 16 15:20:40 EVENT START: fail_interface host2 10.2.2.2
Sep 16 15:20:40 EVENT COMPLETED: fail_interface host2 10.2.2.2 0
Sep 16 15:21:40 EVENT START: network_down host2 net_ether_02
Sep 16 15:21:40 EVENT COMPLETED: network_down host2 net_ether_02 0
Sep 16 15:21:40 EVENT START: network_down_complete host2 net_ether_02
Sep 16 15:21:41 EVENT COMPLETED: network_down_complete host2 net_ether_02 0
Sep 16 15:21:47 EVENT START: rg_move_release host2 2
Sep 16 15:21:47 EVENT START: rg_move host2 2 RELEASE
Sep 16 15:21:48 EVENT START: node_down_local
Sep 16 15:21:48 EVENT START: stop_server host2_app
Sep 16 15:21:48 EVENT COMPLETED: stop_server host2_app 0
Sep 16 15:21:48 EVENT START: release_vg_fs ALL host2vg
Sep 16 15:21:50 EVENT COMPLETED: release_vg_fs ALL host2vg 0
Sep 16 15:21:50 EVENT START: release_service_addr host2_l1_svc1 host2_l1_svc2 host2_l2_svc
Sep 16 15:21:55 EVENT COMPLETED: release_service_addr host2_l1_svc1 host2_l1_svc2 host2_l2_svc 0
Sep 16 15:21:55 EVENT COMPLETED: node_down_local 0
Sep 16 15:21:55 EVENT COMPLETED: rg_move host2 2 RELEASE 0
Sep 16 15:21:55 EVENT COMPLETED: rg_move_release host2 2 0
Sep 16 15:21:57 EVENT START: rg_move_fence host2 2
Sep 16 15:21:58 EVENT COMPLETED: rg_move_fence host2 2 0
Sep 16 15:21:58 EVENT START: rg_move_acquire host2 2
Sep 16 15:21:58 EVENT START: rg_move host2 2 ACQUIRE
Sep 16 15:21:58 EVENT COMPLETED: rg_move host2 2 ACQUIRE 0
Sep 16 15:21:58 EVENT COMPLETED: rg_move_acquire host2 2 0
Sep 16 15:22:08 EVENT START: rg_move_complete host2 2
Sep 16 15:22:08 EVENT START: node_up_remote_complete host2
Sep 16 15:22:09 EVENT COMPLETED: node_up_remote_complete host2 0
Sep 16 15:22:09 EVENT COMPLETED: rg_move_complete host2 2 0
Step 9: unplug host1's service network cable
Sep 16 15:43:42 EVENT START: swap_adapter host1 net_ether_02 10.2.2.1 10.2.200.2
Sep 16 15:43:43 EVENT COMPLETED: swap_adapter host1 net_ether_02 10.2.2.1 10.2.200.2 0
Sep 16 15:43:45 EVENT START: swap_adapter_complete host1 net_ether_02 10.2.2.1 10.2.200.2
Sep 16 15:43:45 EVENT COMPLETED: swap_adapter_complete host1 net_ether_02 10.2.2.1 10.2.200.2 0
Sep 16 15:43:47 EVENT START: fail_interface host1 10.2.12.1
Sep 16 15:43:47 EVENT COMPLETED: fail_interface host1 10.2.12.1 0
Step 10: reconnect all network cables
Sep 16 15:45:07 EVENT START: network_up host2 net_ether_02
Sep 16 15:45:08 EVENT COMPLETED: network_up host2 net_ether_02 0
Sep 16 15:45:08 EVENT START: network_up_complete host2 net_ether_02
Sep 16 15:45:08 EVENT COMPLETED: network_up_complete host2 net_ether_02 0
Sep 16 15:45:43 EVENT START: join_interface host2 10.2.12.2
Sep 16 15:45:43 EVENT COMPLETED: join_interface host2 10.2.12.2 0
Sep 16 15:47:05 EVENT START: join_interface host1 10.2.12.1
Sep 16 15:47:05 EVENT COMPLETED: join_interface host1 10.2.12.1 0
Step 11: run halt -q on host1
Sep 16 15:48:48 EVENT START: node_down host1
Sep 16 15:48:49 EVENT START: acquire_service_addr
Sep 16 15:48:50 EVENT START: acquire_aconn_service en0 net_ether_01
Sep 16 15:48:50 EVENT COMPLETED: acquire_aconn_service en0 net_ether_01 0
Sep 16 15:48:51 EVENT START: acquire_aconn_service en2 net_ether_01
Sep 16 15:48:51 EVENT COMPLETED: acquire_aconn_service en2 net_ether_01 0
Sep 16 15:48:53 EVENT START: acquire_aconn_service en1 net_ether_02
Sep 16 15:48:53 EVENT COMPLETED: acquire_aconn_service en1 net_ether_02 0
Sep 16 15:48:53 EVENT COMPLETED: acquire_service_addr 0
Sep 16 15:48:53 EVENT START: acquire_takeover_addr
Sep 16 15:48:57 EVENT COMPLETED: acquire_takeover_addr 0
Sep 16 15:49:02 EVENT COMPLETED: node_down host1 0
Sep 16 15:49:02 EVENT START: node_down_complete host1
Sep 16 15:49:03 EVENT START: start_server host1_app host2_app
Sep 16 15:49:03 EVENT START: start_server host2_app
Sep 16 15:49:03 EVENT COMPLETED: start_server host1_app host2_app 0
Sep 16 15:49:03 EVENT COMPLETED: start_server host2_app 0
Sep 16 15:49:04 EVENT COMPLETED: node_down_complete host1 0
2.3. Full test
The full test is carried out when there is enough test time and the necessary conditions (for example, the switches can take part in the test); it is normally scheduled for the week before the system goes live.
Note: to keep the table below generic, two situations are not broken out in detail and deserve attention.
1. When one network carries two service IP addresses, they are placed on boot1 and boot2 respectively for load balancing, so whichever adapter fails, an address swap will occur.
2. The application interruption times do not include application reconnection time; for example, when the Oracle DB address moves, Tuxedo actually has to be restarted before it can reconnect, and this must be handled by the start/stop scripts.
In addition, real environments may differ and be even more complex, so treat this table as a reference only; it lays out the bulk of the scenarios mainly as a reminder not to leave anything out.
2.3.1. Full test table
No. | Test scenario | System result | Application result | Reference duration |
 | Function tests | | | |
1 | Start HA on host2 | host2 service IP addresses take effect; VGs and file systems come online | host2 app (DB) starts OK | 120s |
2 | Stop HA on host2 | host2 service IP addresses and VGs are released cleanly | host2 app stops | 15s |
3 | Start HA on host1 | host1 service IP addresses take effect; VGs and file systems come online | host1 app starts OK | 120s |
4 | Stop HA on host1 | host1 adapters and VGs are released cleanly | host1 app stops | 15s |
5 | host2 takeover to host1 | host2's service addresses move to host1's boot2, together with the VGs etc. | host2 app briefly interrupted | 30s |
 | host2 clstart | Fails back | host2 app briefly interrupted | 120s |
6 | host1 takeover to host2 | host1's service addresses move to host2's boot2, together with the VGs etc. | host1 app briefly interrupted | 30s |
 | host1 clstart | Fails back | host1 app briefly interrupted | 120s |
 | Adapter failure tests | | | |
1 | Unplug host2's boot1 cable | host2's service IP swaps from boot1 to boot2 | host2 app briefly interrupted | 30s |
 | Reconnect host2's boot1 cable | host2 boot1 joins | No impact | 40s |
2 | Unplug host2's boot2 cable | host2's service IP swaps from boot2 to boot1 | host2 app briefly interrupted | 30s |
 | Reconnect host2's boot2 cable | host2 boot2 joins | No impact | 40s |
3 | Unplug host2's boot1 and boot2 cables | host2's service addresses move to host1's boot2; the VGs etc. move to host1 | host2 app briefly interrupted | 210s |
 | Then unplug host1's boot2 cable | host2's service IPs swap to host1's boot1 | host2 app briefly interrupted | 30s |
 | Reconnect host2's boot1 and boot2 cables | host2 boot1 and boot2 join | No impact | 30s |
 | host2 clstart | Fails back | host2 app briefly interrupted | 120s |
4 | Unplug host1's boot1 and boot2 cables | host1's service addresses move to host2's boot2; the VGs etc. move to host2 | host1 app briefly interrupted | 210s |
 | Then unplug host2's boot2 cable | host1's service IPs swap to host2's boot1 | host1 app briefly interrupted | 30s |
 | Reconnect host1's boot1 and boot2 cables | host1 boot1 and boot2 join | No impact | 30s |
 | host1 clstart | Fails back | host1 app briefly interrupted | 120s |
5 | host2 force clstop | Cluster services stop; IP and VG resources are untouched | No impact | 20s |
 | host2 clstart | Back to normal operation | No impact | 20s |
6 | host1 force clstop | Cluster services stop; IP and VG resources are untouched | No impact | 20s |
 | host1 clstart | Back to normal operation | No impact | 20s |
7 | Unplug both hosts' boot2 cables for 30 min | boot2 reported failed on both | No impact | 20s |
 | Reconnect both hosts' boot2 cables | Both boot2 join | No impact | 20s |
8 | Unplug both hosts' boot1 cables for 30 min | All service IP addresses swap to boot2 | host1 and host2 apps briefly interrupted | 30s |
 | Reconnect both hosts' boot1 cables | Both boot1 join | No impact | 20s |
 | Node crash tests | | | |
1 | host2 crashes suddenly (halt -q) | host2's service addresses move to host1's boot2, together with the VGs etc. | host2 app briefly interrupted | 30s |
 | host2 clstart | Fails back | host2 app briefly interrupted | 120s |
2 | host1 crashes suddenly (halt -q) | host1's service addresses move to host2's boot2, together with the VGs etc. | host1 app briefly interrupted | 30s |
 | host1 clstart | Fails back | host1 app briefly interrupted | 120s |
 | Switch failure tests | | | |
1 | Power off SwitchA | All service IP addresses swap to boot2 | host1 and host2 apps briefly interrupted | 50s |
 | SwitchA restored | Both boot1 join | No impact | 40s |
 | Power off SwitchB | All service IP addresses swap back to boot1 | host1 and host2 apps briefly interrupted | 50s |
 | SwitchB restored | Both boot2 join | No impact | 40s |
2 | Power off SwitchB | boot2 reported failed | No impact | 50s |
 | SwitchB restored | Both boot2 join | No impact | 40s |
 | Power off SwitchA | All service IP addresses swap to boot2 | host1 and host2 apps briefly interrupted | 50s |
 | SwitchA restored | Both boot1 join | No impact | 40s |
3 | Power off SwitchA and SwitchB for 10 min | Network reported down; nothing else moves | host1 and host2 apps interrupted | 10min |
 | SwitchA and SwitchB restored | boot1 and boot2 join | Service recovers automatically | 50s |
4 | Power off SwitchA | All service IP addresses swap to boot2 | host1 and host2 apps briefly interrupted | 50s |
 | Power off SwitchB 30 s later | Nothing moves | host1 and host2 apps interrupted | 50s |
 | SwitchA and SwitchB restored | Both boot1 join | Recovers automatically | 40s |
5 | Power off SwitchB | boot2 reported failed | No impact | 50s |
 | Power off SwitchA 30 s later | Network reported down; nothing else moves | host1 and host2 apps interrupted | 50s |
 | SwitchA and SwitchB restored | Both boot1 join | Recovers automatically | 40s |
6 | SwitchA failure (broadcast storm triggered by looping a cable) | Hosts themselves normal, but the network is unreachable | host1 and host2 apps interrupted | 20s |
 | SwitchA restored | Everything returns to normal after recovery | Recovers automatically | |
7 | SwitchB failure (broadcast storm triggered by looping a cable) | Hosts themselves normal, but the network is unreachable | host1 and host2 apps interrupted | 20s |
 | SwitchB restored | Everything returns to normal after recovery | Recovers automatically | |
8 | SwitchA and SwitchB fail at the same time (broadcast storm) | Hosts themselves normal, but severe packet loss on the network | host1 and host2 apps interrupted | 10s |
 | SwitchA and SwitchB restored | Everything returns to normal after recovery | Recovers automatically | 20s |
 | Stability tests | | | |
1 | Start HA on both host2 and host1 | | Normal service for more than 48 hours | |
2 | host2 takeover to host1 | | Normal service for more than 48 hours | |
3 | host1 takeover to host2 | | Normal service for more than 48 hours | |
2.4. Operations switchover test
The operations switchover test is performed during routine operations to guarantee high availability; we recommend doing it once a year. Such a test is really a drill: it uncovers problems in every area in time and provides an effective guarantee that the switchover will succeed during a real failure.
I have long heard users and colleagues complain that testing went perfectly yet the cluster failed to switch over at the critical moment. Apart from shortcomings in day-to-day operations (see the operations chapter), insufficient testing is another cause, so I now strongly recommend that environments which can afford it run operations switchover tests regularly.
In the past the standby was usually configured lower than the production host for cost reasons, or was heavily used for development and testing, which made such tests hard to carry out. As Power machines have become more capable, however, it is increasingly rare to install only one AIX instance per physical machine, and with HA active the resources of mutually backing LPARs can be adjusted between LPARs in real time, which makes this kind of swap test feasible.
2.4.1. Operations switchover test table
Scenario | Swap performed | Recommended duration | Switchover approach |
Active/standby (run->dev) | Production host and standby host swapped | >10 days | Stop the standby's dev/test environment, or temporarily modify the HA configuration |
 | Production LPAR and standby LPAR swapped | >30 days | Increase the standby LPAR's resources and reduce the production LPAR's; stop dev/test or temporarily modify the HA configuration |
Mutual standby (app<->db, app<->app, db<->db) | The two nodes swapped | >30 days | Manually start each other's resource groups crosswise |
Switching the production host to the standby:
There are two ways:
Ø Use takeover (move Resource Groups). Because of the extra load and to prevent mis-operation, the development/test environment on the standby generally has to be stopped.
Ø Alternatively, modify the HA configuration and add the running node to the standby resource group's node list. The development/test environment can then stay in use during the switchover test, but this not only means changing HA; it also requires that the standby's development/test environment was placed in a shared VG rather than on local disks when it was first configured, and that the development/test environment is synchronized to the production machine. It is best to plan for this at design time.
Manual cross-switchover:
Take the resource groups offline:
smitty hacmp->System Management (C-SPOC)
-> Resource Group and Applications
->Bring a Resource Group Offline, select host2_RG, host2
Bring a Resource Group Offline
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
Resource Group to Bring Offline host2_RG
Node On Which to Bring Resource Group Offline host2
Persist Across Cluster Reboot? false
Take host1_RG offline in the same way.
Swap the resource groups:
smitty HACMP->System Management (C-SPOC)
-> Resource Group and Applications
->Bring a Resource Group Online, select host2_RG, host1
Resource Group to Bring Online host2_RG
Node on Which to Bring Resource Group Online host1
Answer No to Persist Across Cluster Reboot.
That is, start host2's resource group on host1, and in the same way start host1's resource group on host2; the two machines have then swapped roles. (For reference, a command-line sketch of the same operation follows.)
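The same offline/online sequence can also be driven from the command line with the clRGmove utility that the C-SPOC panels call; a minimal sketch (verify the exact options against your HACMP level before relying on it):
[host1][root][/]>/usr/es/sbin/cluster/utilities/clRGmove -g host2_RG -n host2 -d    # bring host2_RG offline on host2
[host1][root][/]>/usr/es/sbin/cluster/utilities/clRGmove -g host1_RG -n host1 -d    # bring host1_RG offline on host1
[host1][root][/]>/usr/es/sbin/cluster/utilities/clRGmove -g host2_RG -n host1 -u    # bring host2_RG online on host1
[host2][root][/]>/usr/es/sbin/cluster/utilities/clRGmove -g host1_RG -n host2 -u    # bring host1_RG online on host2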
Note: because a mutual switchover requires manual intervention, and failing back also requires manual intervention, the running state must be closely monitored during the switchover period so that any anomaly can be handled by hand immediately.
Swap the crontabs and related background scripts:
Background jobs in crontab, such as backups, differ between the two nodes and therefore also need to be swapped. With our approach (see the HA synchronization script in the scripts chapter), it is enough to copy the corresponding crontab:
[host1][root][/]>cp -rp /home/scripts/host2/crontab_host2 /var/spool/cron/crontabs/root
Fix the file ownership and permissions:
[host1][root][/]>chown root:cron /var/spool/cron/crontabs/root
[host1][root][/]>chmod 600 /var/spool/cron/crontabs/root
Restart cron:
[host1][root][/]> ps -ef|grep cron
root 278688 1 0 Dec 19 - 0:02 /usr/sbin/cron
[host1][root][/]>kill -9 278688
If you do not use our script-based approach, then besides copying the peer's crontab, remember to synchronize the corresponding scripts as well.
Swap the backup policies:
Backup methods differ, so the adjustments needed also differ and must be handled system by system. In the lab environment backups run as background jobs and nothing further is needed. A real environment may use backup software; since the hosts have swapped, you must confirm whether the backup policies are still effective, and correct them if they are not.
Once configuration and testing have been passed, the system goes live with high availability assured. But do not forget that HACMP itself needs careful maintenance if it is to act at the most critical moment; otherwise it becomes mere decoration, and operators may even drop their guard with the attitude that "HACMP is installed, so it will naturally do its job when the time comes."
3.1. HACMP switchover problems and handling
We have compiled the unsuccessful and spurious switchovers encountered in the past, and summarized in the table below why switchovers that succeeded in testing later failed, together with countermeasures:
3.1.1. HACMP switchover problem table
Symptom | Cause | Root cause | Countermeasure |
Cannot switch over (1) | After running for a while the two nodes' configurations are inconsistent and out of sync | Users, file systems and other system changes were not made through HACMP's facilities (including C-SPOC) | Establish and follow standards, check regularly, and fix promptly at scheduled maintenance |
Cannot switch over (2) | The application will not stop, causing a timeout; file systems cannot be unmounted | The stop script was not thorough enough | Standardize; add a kill_vg_user script (see the sketch after this table) |
Switchover succeeds but the application is abnormal (1) | The application fails to start | The application changed; the stop script left it abnormally stopped, or the start script is incorrect | Standardize the start/stop scripts and keep them up to date |
Switchover succeeds but the application is abnormal (2) | The standby's configuration does not meet the production requirements | Various system and software parameters are unsuitable | Draft a checking standard and confirm it through operations switchover tests |
Switchover succeeds but communication is abnormal (1) | A network route is unreachable | Network configuration | Correct and test the routes; confirm through operations switchover tests |
Switchover succeeds but communication is abnormal (2) | Communication software configuration problem | One host now carries two service addresses in the same subnet, and messages go out from the other IP address, causing errors | Correct the configuration and bind to the designated service IP |
Spurious switchover | DMS (dead man switch) problem | Sustained excessive system load | |
Note: remember that, from the customer's point of view and regardless of the reason, "if the application is interrupted for more than 5-10 minutes, the HACMP switchover has failed", and all the preceding work has been wasted; the importance of maintenance work is therefore self-evident.
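The kill_vg_user script mentioned above belongs to the scripts chapter of this series. Purely to illustrate the idea (a minimal sketch, not the author's actual script), assuming lsvgfs can list the VG's file systems and it is acceptable to kill whatever still holds them open:
#!/bin/ksh
# kill_vg_user <vgname>: free a VG's file systems before HACMP tries to umount them
VG=$1
for FS in $(lsvgfs $VG); do
    echo "killing processes still using $FS"
    fuser -kuxc $FS     # -c: treat argument as a mount point, -u: show owning users, -k: send SIGKILL
done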
3.1.2. Stopping HACMP in forced mode
There are three ways to stop HACMP:
Bring Resource Groups Offline (normal stop)
Move Resource Groups (manual switchover)
Unmanage Resource Groups (force-stop HACMP without stopping the resource groups)
Much of the maintenance work below requires HACMP to be force-stopped; the resource groups are not released. The benefit is that the IP addresses, file systems and so on are completely unaffected and only HACMP itself is stopped, so the application can keep serving users, which makes online inspection and modification of HACMP possible.
[host1][root][/]>smitty clstop
Stop Cluster Services
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Stop now, on system restart or both now
Stop Cluster Services on these nodes [host1]
BROADCAST cluster shutdown? false
* Select an Action on Resource Groups Unmanage Resource Group
Remember that as a rule this has to be done on every node.
cldump then shows the following:
......
Cluster Name: test_cluster
Resource Group Name: rg_diskhbmulti_01
Startup Policy: Online On All Available Nodes
Fallover Policy: Bring Offline (On Error Node Only)
Fallback Policy: Never Fallback
Site Policy: ignore
Node Group State
---------------------------- ---------------
host1 UNMANAGED
host2 UNMANAGED
Resource Group Name: host1_RG
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Node Group State
---------------------------- ---------------
host1 UNMANAGED
host2 UNMANAGED
Resource Group Name: host2_RG
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Node Group State
---------------------------- ---------------
host2 UNMANAGED
host1 UNMANAGED
3.1.3. Starting HACMP after a forced stop
After modifying the HACMP configuration, in most cases you need to restart it so that resources are re-acquired; only then does the new HACMP configuration take effect.
[host1][root][/]>smitty clstart
Note: to be safe, set Startup Cluster Information Daemon? to true.
To maintain HACMP well, routine checks and handling are indispensable. Unless otherwise stated, the checks and procedures below can all be carried out without shutting down the machine or stopping the application, and do not affect users; still, examine the current state carefully before acting.
Of course, the most convincing check and verification is the operations switchover test; see the testing chapter.
3.2.1. clverify check
This check covers the synchronization state of most of the HACMP configuration, including LVM, and is the main way to verify that HACMP is in sync.
smitty clverify ->Verify HACMP Configuration
The result should be OK. If inconsistencies are found, treat them case by case. For non-LVM errors, in most cases the application does not have to be stopped and the following steps resolve them:
1. First stop the HACMP services in forced mode.
Stop the HACMP services on host2 in the same way.
2. Correct the problems found and synchronize:
smitty hacmp -> Extended Configuration
-> Extended Verification and Synchronization
Because the HACMP services are stopped at this point, automatic correction and forced synchronization can be included.
LVM errors are usually caused by a file system, LV or VG having been changed on one node without HACMP's C-SPOC function, which leaves the VG timestamps inconsistent. In that case, even if you correct the other side by hand (usually not possible anyway because the application is using it) and choose synchronization with automatic correction, it will still report failed. The only remedy is to stop the application and follow the "clean up the VG" section of the initial-cleanup chapter.
3.2.2. Process check
1) Check the services and processes; at least the following three should be active:
[host1][root][/]#lssrc -a|grep ES
clcomdES clcomdES 10027064 active
clstrmgrES cluster 9109532 active
clinfoES cluster 5767310 active
2) /var holds the HACMP logs; confirm that it still has free space (for example with df, as sketched below).
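A trivial way to check (the acceptable threshold is up to you):
[host1][root][/]#df -g /var      # make sure /var, which holds hacmp.out and the other cluster logs, is not close to full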
3.2.3. cldump check
cldump can also be invoked from the HACMP menus, with the same effect.
cldump takes a snapshot of the current HACMP state; confirm that it shows UP and STABLE, otherwise analyze and handle it according to the actual situation.
[host1][root][/]>/usr/sbin/cluster/utilities/cldump
Obtaining information via SNMP from Node: host1...
_____________________________________________________________________________
Cluster Name: test_cluster
Cluster State: UP
Cluster Substate: STABLE
_____________________________________________________________________________
Node Name: host1 State: UP
Network Name: net_diskhbmulti_01 State: UP
Address: Label: host1_1 State: UP
Network Name: net_ether_01 State: UP
Address: 10.2.100.1 Label: host1_l1_svc1 State: UP
Address: 10.2.101.1 Label: host1_l1_svc2 State: UP
Address: 10.2.11.1 Label: host1_l1_boot2 State: UP
Address: 10.2.1.21 Label: host1_l1_boot1 State: UP
Network Name: net_ether_02 State: UP
Address: 10.2.12.1 Label: host1_l2_boot2 State: UP
Address: 10.2.2.1 Label: host1_l2_boot1 State: UP
Address: 10.2.200.1 Label: host1_l2_svc State: UP
Node Name: host2 State: UP
Network Name: net_diskhbmulti_01 State: UP
Address: Label: host2_2 State: UP
Network Name: net_ether_01 State: UP
Address: 10.2.100.2 Label: host2_l1_svc1 State: UP
Address: 10.2.101.2 Label: host2_l1_svc2 State: UP
Address: 10.2.11.2 Label: host2_l1_boot2 State: UP
Address: 10.2.1.22 Label: host2_l1_boot1 State: UP
Network Name: net_ether_02 State: UP
Address: 10.2.12.2 Label: host2_l2_boot2 State: UP
Address: 10.2.2.2 Label: host2_l2_boot1 State: UP
Address: 10.2.200.2 Label: host2_l2_svc State: UP
Cluster Name: test_cluster
Resource Group Name: rg_diskhbmulti_01
Startup Policy: Online On All Available Nodes
Fallover Policy: Bring Offline (On Error Node Only)
Fallback Policy: Never Fallback
Site Policy: ignore
Node Group State
---------------------------- ---------------
host1 ONLINE
host2 ONLINE
Resource Group Name: host1_RG
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Node Group State
---------------------------- ---------------
host1 ONLINE
host2 OFFLINE
Resource Group Name: host2_RG
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Node Group State
---------------------------- ---------------
host2 ONLINE
host1 OFFLINE
3.2.4. clstat check
clstat monitors the HACMP state in real time; confirm promptly that it shows UP and STABLE, otherwise analyze and handle it according to the actual situation.
[host1][root][/]>/usr/sbin/cluster/clstat
clstat - HACMP Cluster Status Monitor
-------------------------------------
Cluster: test_cluster (1572117373)
Mon Sep 16 13:38:31 GMT+08:00 2013
State: UP Nodes: 2
SubState: STABLE
Node: host1 State: UP
Interface: host1_l2_boot1 (2) Address: 10.2.2.1
State: UP
Interface: host1_l1_boot2 (1) Address: 10.2.11.1
State: UP
Interface: host1_l2_boot2 (2) Address: 10.2.12.1
State: UP
Interface: host1_l1_boot1 (1) Address: 10.2.1.21
State: UP
Interface: host1_1 (0) Address: 0.0.0.0
State: UP
Interface: host1_l1_svc1 (1) Address: 10.2.100.1
State: UP
Interface: host1_l1_svc2 (1) Address: 10.2.101.1
State: UP
Interface: host1_l2_svc (2) Address: 10.2.200.1
State: UP
Resource Group: host1_RG State: On line
Resource Group: rg_diskhbmulti_01 State: On line
Node: host2 State: UP
Interface: host2_l2_boot1 (2) Address: 10.2.2.2
State: UP
Interface: host2_l1_boot2 (1) Address: 10.2.11.2
State: UP
Interface: host2_l2_boot2 (2) Address: 10.2.12.2
State: UP
Interface: host2_l1_boot1 (1) Address: 10.2.1.22
State: UP
Interface: host2_2 (0) Address: 0.0.0.0
State: UP
Interface: host2_l1_svc1 (1) Address: 10.2.100.2
State: UP
Interface: host2_l1_svc2 (1) Address: 10.2.101.2
State: UP
Interface: host2_l2_svc (2) Address: 10.2.200.2
State: UP
Resource Group: host2_RG State: On line
Resource Group: rg_diskhbmulti_01 State: On line
************************ f/forward, b/back, r/refresh, q/quit *****************
3.2.5. cldisp check
This looks at the cluster from the resource point of view and shows whether the resource group information is correct; again, all states should be up, stable and online.
[host1][root][/]#/usr/es/sbin/cluster/utilities/cldisp
Cluster: test_cluster
Cluster services: active
State of cluster: up
Substate: stable
#############
APPLICATIONS
#############
Cluster test_cluster provides the following applications: host1_app host2_app
Application: host1_app
host1_app is started by /usr/sbin/cluster/app/start_host1
host1_app is stopped by /usr/sbin/cluster/app/stop_host1
No application monitors are configured for host1_app.
This application is part of resource group 'host1_RG'.
Resource group policies:
Startup: on home node only
Fallover: to next priority node in the list
Fallback: if higher priority node becomes available
State of host1_app: online
Nodes configured to provide host1_app: host1 {up} host2 {up}
Node currently providing host1_app: host1 {up}
The node that will provide host1_app if host1 fails is: host2
Resources associated with host1_app:
Service Labels
host1_l1_svc1(10.2.100.1) {online}
Interfaces configured to provide host1_l1_svc1:
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l1_svc2(10.2.101.1) {online}
Interfaces configured to provide host1_l1_svc2:
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l2_svc(10.2.200.1) {online}
Interfaces configured to provide host1_l2_svc:
host1_l2_boot1 {up}
with IP address: 10.2.2.1
on interface: en1
on node: host1 {up}
on network: net_ether_02 {up}
host1_l2_boot2 {up}
with IP address: 10.2.12.1
on interface: en3
on node: host1 {up}
on network: net_ether_02 {up}
host2_l2_boot2 {up}
with IP address: 10.2.12.2
on interface: en3
on node: host2 {up}
on network: net_ether_02 {up}
host2_l2_boot1 {up}
with IP address: 10.2.2.2
on interface: en1
on node: host2 {up}
on network: net_ether_02 {up}
Shared Volume Groups:
host1vg
Application: host2_app
host2_app is started by /usr/sbin/cluster/app/start_host2
host2_app is stopped by /usr/sbin/cluster/app/stop_host2
No application monitors are configured for host2_app.
This application is part of resource group 'host1_RG'.
Resource group policies:
Startup: on home node only
Fallover: to next priority node in the list
Fallback: if higher priority node becomes available
State of host2_app: online
Nodes configured to provide host2_app: host1 {up} host2 {up}
Node currently providing host2_app: host1 {up}
The node that will provide host2_app if host1 fails is: host2
Resources associated with host2_app:
Service Labels
host1_l1_svc1(10.2.100.1) {online}
Interfaces configured to provide host1_l1_svc1:
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l1_svc2(10.2.101.1) {online}
Interfaces configured to provide host1_l1_svc2:
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l2_svc(10.2.200.1) {online}
Interfaces configured to provide host1_l2_svc:
host1_l2_boot1 {up}
with IP address: 10.2.2.1
on interface: en1
on node: host1 {up}
on network: net_ether_02 {up}
host1_l2_boot2 {up}
with IP address: 10.2.12.1
on interface: en3
on node: host1 {up}
on network: net_ether_02 {up}
host2_l2_boot2 {up}
with IP address: 10.2.12.2
on interface: en3
on node: host2 {up}
on network: net_ether_02 {up}
host2_l2_boot1 {up}
with IP address: 10.2.2.2
on interface: en1
on node: host2 {up}
on network: net_ether_02 {up}
Shared Volume Groups:
host1vg
This application is part of resource group 'host2_RG'.
Resource group policies:
Startup: on home node only
Fallover: to next priority node in the list
Fallback: if higher priority node becomes available
State of host2_app: online
Nodes configured to provide host2_app: host2 {up} host1 {up}
Node currently providing host2_app: host2 {up}
The node that will provide host2_app if host2 fails is: host1
Resources associated with host2_app:
Service Labels
host2_l1_svc1(10.2.100.2) {online}
Interfaces configured to provide host2_l1_svc1:
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_svc2(10.2.101.2) {online}
Interfaces configured to provide host2_l1_svc2:
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l2_svc(10.2.200.2) {online}
Interfaces configured to provide host2_l2_svc:
host2_l2_boot2 {up}
with IP address: 10.2.12.2
on interface: en3
on node: host2 {up}
on network: net_ether_02 {up}
host2_l2_boot1 {up}
with IP address: 10.2.2.2
on interface: en1
on node: host2 {up}
on network: net_ether_02 {up}
host1_l2_boot1 {up}
with IP address: 10.2.2.1
on interface: en1
on node: host1 {up}
on network: net_ether_02 {up}
host1_l2_boot2 {up}
with IP address: 10.2.12.1
on interface: en3
on node: host1 {up}
on network: net_ether_02 {up}
Shared Volume Groups:
host2vg
#############
TOPOLOGY
#############
test_cluster consists of the following nodes: host1 host2
host1
Network interfaces:
host1_1 {up}
device: /dev/mndhb_lv_01
on network: net_diskhbmulti_01 {up}
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on network: net_ether_01 {up}
host1_l2_boot1 {up}
with IP address: 10.2.2.1
on interface: en1
on network: net_ether_02 {up}
host1_l2_boot2 {up}
with IP address: 10.2.12.1
on interface: en3
on network: net_ether_02 {up}
host2
Network interfaces:
host2_2 {up}
device: /dev/mndhb_lv_01
on network: net_diskhbmulti_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on network: net_ether_01 {up}
host2_l2_boot2 {up}
with IP address: 10.2.12.2
on interface: en3
on network: net_ether_02 {up}
host2_l2_boot1 {up}
with IP address: 10.2.2.2
on interface: en1
on network: net_ether_02 {up}
[host1][root][/]#
3.2.6. /etc/hosts check
Normally the /etc/hosts files of two mutually backing nodes should be identical; in an active/standby setup the standby may of course have a few extra IP addresses and host names. Comparing the two files shows whether there is a problem.
[host1][root][/]>rsh host2 cat /etc/hosts >/tmp/host2_hosts
[host1][root][/]>diff /etc/hosts /tmp/host2_hosts
3.2.7. Script check
Pay attention to the following:
1. When the application changes, the scripts must be corrected promptly, the scripts on both nodes must be kept in sync, and test time must be requested promptly.
2. The previous point requires the operators to communicate fully with the application staff; any change to the production environment must go through the operators.
3. Operators should get into the habit of starting and stopping the application with these scripts and avoid doing it by hand.
[host1][root][/home/scripts]>rsh host2 "cd /home/scripts;ls -l host1 host2 comm" >/tmp/host2_scripts
[host1][root][/home/scripts]>ls -l host1 host2 comm >/tmp/host1_scripts
[host1][root][/]>diff /tmp/host1_scripts /tmp/host2_scripts
3.2.8. User check
Normally the users involved in HA on two mutually backing nodes should be consistent; in an active/standby setup the standby may of course have a few extra users. Comparing the configuration of the two nodes shows whether there is a problem.
[host1][root][/]>rsh host2 lsuser -f orarun,orarunc,tuxrun,bsx1,xcom >/tmp/host2_users
[host1][root][/]>lsuser -f orarun,orarunc,tuxrun,bsx1,xcom >/tmp/host1_users
[host1][root][/]>diff /tmp/host1_users /tmp/host2_users
Note: the two sides will inevitably differ somewhat (last login time and so on); it is enough that the main parts match.
Also compare the .profile files and the user environments on both sides:
[host1][root][/]>rsh host2 su - orarun -c set >/tmp/host2.set
[host1][root][/]> su - orarun -c set >/tmp/host1.set
[host1][root][/]>diff /tmp/host1.set /tmp/host2.set
3.2.9. Heartbeat check
Because the heartbeat links are held by HACMP for as long as it is running, HACMP needs to be force-stopped before checking them directly.
1) Check the heartbeat services:
topsvcs shows the state of the networks, including the heartbeat networks; the error count should be zero, or the error ratio far below 1%.
[host2][root][/]#lssrc -ls topsvcs
Subsystem Group PID Status
topsvcs topsvcs 9371838 active
Network Name Indx Defd Mbrs St Adapter ID Group ID
net_ether_01_0 [ 0] 2 2 S 10.2.1.22 10.2.1.22
net_ether_01_0 [ 0] en0 0x42366504 0x42366d24
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 15690 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 18345 ICMP 0 Dropped: 0
NIM's PID: 7929856
net_ether_01_1 [ 1] 2 2 S 10.2.11.2 10.2.11.2
net_ether_01_1 [ 1] en2 0x42366505 0x42366d25
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 15690 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 18347 ICMP 0 Dropped: 0
NIM's PID: 9044088
net_ether_02_0 [ 2] 2 2 S 10.2.2.2 10.2.2.2
net_ether_02_0 [ 2] en1 0x42366506 0x42366d26
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 15688 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 18345 ICMP 0 Dropped: 0
NIM's PID: 6881402
net_ether_02_1 [ 3] 2 2 S 10.2.12.2 10.2.12.2
net_ether_02_1 [ 3] en3 0x42366507 0x42366d27
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 15687 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 18344 ICMP 0 Dropped: 0
NIM's PID: 6684902
diskhbmulti_0 [ 4] 2 2 S 255.255.10.1 255.255.10.1
diskhbmulti_0 [ 4] rmndhb_lv_01.2_1 0x8236653e 0x82366d48
HB Interval = 3.000 secs. Sensitivity = 6 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 5021 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 4754 ICMP 0 Dropped: 0
NIM's PID: 6553654
2 locally connected Clients with PIDs:
haemd(7602388) hagsd(9699456)
Fast Failure Detection available but off.
Dead Man Switch Enabled:
reset interval = 1 seconds
trip interval = 36 seconds
Client Heartbeating Disabled.
Configuration Instance = 1
Daemon employs no security
Segments pinned: Text Data.
Text segment size: 862 KB. Static data segment size: 1497 KB.
Dynamic data segment size: 8897. Number of outstanding malloc: 269
User time 1 sec. System time 0 sec.
Number of page faults: 151. Process swapped out 0 times.
Number of nodes up: 2. Number of nodes down: 0.
2) Serial (tty) heartbeat check:
u Check the tty speed
Confirm that the speed does not exceed 9600.
[host1][root][/]>stty -a </dev/tty0
[host2][root][/]>cat /etc/hosts >/dev/tty0
host1 shows:
speed 9600 baud; 0 rows; 0 columns;
eucw 1:1:0:0, scrw 1:1:0:0:
….
u Check the connection and configuration
[host1][root][/]>cat /etc/hosts >/dev/tty0
[host2][root][/]>cat </dev/tty0
On host2 you should see the contents of host1's /etc/hosts.
Run the same check in the reverse direction.
3) Disk heartbeat check:
Use dhb_read to confirm the disk heartbeat link
[host1][root][/]#/usr/sbin/rsct/bin/dhb_read -p hdisk5 -r
DHB CLASSIC MODE
First node byte offset: 61440
Second node byte offset: 62976
Handshaking byte offset: 65024
Test byte offset: 64512
Receive Mode:
Waiting for response . . .
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Link operating normally
[host2][root][/]#/usr/sbin/rsct/bin/dhb_read -p hdisk5 -r
DHB CLASSIC MODE
First node byte offset: 61440
Second node byte offset: 62976
Handshaking byte offset: 65024
Test byte offset: 64512
Receive Mode:
Waiting for response . . .
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
....
Magic number = 0x87654321
Magic number = 0x87654321
Link operating normally
[host1][root][/]#/usr/sbin/rsct/bin/dhb_read -p hdisk5 -t
DHB CLASSIC MODE
First node byte offset: 61440
Second node byte offset: 62976
Handshaking byte offset: 65024
Test byte offset: 64512
Transmit Mode:
Magic number = 0x87654321
Detected remote utility in receive mode. Waiting for response . . .
Magic number = 0x87654321
Magic number = 0x87654321
Link operating normally
It should end with "Link operating normally"; run the same check in the reverse direction as well.
3.2.10. errpt check
Even with all the checks above, do not neglect errpt, which we look at most often, because some of the errors it reports need attention. HACMP adds a line like this to crontab:
0 0 * * * /usr/es/sbin/cluster/utilities/clcycle 1>/dev/null 2>/dev/null # HACMP for AIX Logfile rotation
That is, at midnight every day the system automatically runs HACMP's housekeeping; if problems are found, they show up in errpt.
Besides errors reported by the HACMP checks, errors can also appear during normal operation, mostly caused by heartbeat connectivity problems or by load so high that the HACMP processes cannot respond in time; these need attention and case-by-case analysis. (A simple errpt filter is sketched below.)
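A quick filter when scanning errpt for cluster-related entries (the error labels vary by level, so treat the grep patterns as an assumption rather than a definitive list):
[host1][root][/]>errpt | more
[host1][root][/]>errpt | egrep -i "TS_|GS_|HACMP|clstrmgr" | more    # topology/group services and cluster manager related labels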
3.3. Changes
The situations that arise during maintenance are far more complex than those during implementation, and even the Redbooks cannot cover them all. Here we only explain the common cases; for more complex or rarer situations please consult the Redbooks, and if all else fails, a planned outage and reconfiguration may be a clumsy but quick way to solve the problem.
In principle these changes are meant to avoid an outage; in practice, although some HACMP changes support DARE (dynamic reconfiguration) and some can be done after a forced stop, we still recommend performing them during a planned outage if at all possible.
As for dynamic DARE, I am not keen on using it, because used improperly it can leave the cluster in an uncontrollable state, which is more dangerous. I generally prefer to force-stop HACMP first, perform the operations below, synchronize and verify, and only then start HACMP again.
3.3.1. Volume group change - adding a disk to a VG in use
Note: the PVID must be recognized on all nodes first, otherwise the disk will be missing or behave abnormally.
1. Run cfgmgr on every node of the cluster and set the PVID
[host1][root][/]>cfgmgr
[host1][root][/]>lspv
….
hdisk2 00f6f1569990a1ef host1vg
hdisk3 00f6f1569990a12c host2vg
hdisk4 none none
[host1][root][/]>chdev -l hdisk2 -a pv=yes
[host1][root][/]>lspv
….
hdisk4 00c1eedffc677bfe none
Do the same on host2.
2. Use C-SPOC to add the disk to host2vg:
smitty hacmp->System Management (C-SPOC)
-> Storage
-> Volume Groups
-> Set Characteristics of a Volume Group
-> Add a Volume to a Volume Group
Select the VG and the disk to add.
Add a Volume to a Volume Group
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
VOLUME GROUP name host2vg
Resource Group Name host2_RG
Node List host1,host2
VOLUME names hdisk4
Physical Volume IDs 00f6f1562fd2853e
After completion, both nodes show:
hdisk3 00f6f1569990a12c host2vg active
hdisk4 00f6f1562fd2853e host2vg active
3.3.2. Logical volume (LV) changes
1) Changes to the LV itself:
Adding LV copies, shrinking, growing and renaming are currently supported.
Here is an example of growing a raw-device LV:
smitty hacmp->System Management (C-SPOC)
-> Storage
-> Shared Logical Volumes
>Set Characteristics of a Logical Volume
-> Increase the Size of a Logical Volume
2) LV attribute changes
The effect is the same as on a standalone system, but operate with caution and consider fully what impact the change will have on the business:
smitty hacmp->System Management (C-SPOC)
-> Storage
->Logical Volume
->Change a Logical Volume
->Change a Logical Volume on the Cluster, then select the LV
Volume Group Name host2vg
Resource Group Name host2_RG
* Logical volume NAME ora11runlv
Logical volume TYPE [jfs2]
POSITION on physical volume outer_middle
RANGE of physical volumes minimum
MAXIMUM NUMBER of PHYSICAL VOLUMES [32]
to use for allocation
Allocate each logical partition copy yes
on a SEPARATE physical volume?
RELOCATE the logical volume during yes
reorganization?
Logical volume LABEL [/ora11run]
MAXIMUM NUMBER of LOGICAL PARTITIONS [512]
SCHEDULING POLICY for writing logical parallel
partition copies
PERMISSIONS read/write
Enable BAD BLOCK relocation? yes
Enable WRITE VERIFY? no
Mirror Write Consistency? active
Serialize I/O? no
3.3.3. File system changes
smitty hacmp->System Management (C-SPOC)
-> Storage
- >File Systems
->Change / Show Characteristics of a File System
Volume group name host1vg
Resource Group Name host1_RG
* Node Names host2,host1
* File system name /ora11runc
NEW mount point [/ora11runc] /
SIZE of file system
Unit Size 512bytes +
Number of Units [10485760] #
Mount GROUP []
Mount AUTOMATICALLY at system restart? no +
PERMISSIONS read/write +
Mount OPTIONS [] +
Start Disk Accounting? no +
Block Size (bytes) 4096
Inline Log? no
Inline Log size (MBytes) [0] #
Extended Attribute Format [v1]
ENABLE Quota Management? no +
Allow Small Inode Extents? [yes] +
Logical Volume for Log host1_loglv
3.3.4. Adding a service IP address (DARE only)
1) Modify /etc/hosts and add the following lines
10.66.201.1 host1_l2_svc2
10.66.201.2 host2_l2_svc2
Note: add them on both sides.
2) Add the service address
smitty hacmp->Extended Configuration
-> HACMP Extended Resources Configuration
-> Configure HACMP Service IP Labels/Addresses
-> Add a Service IP Label/Address
-> Configurable on Multiple Nodes, select the network
-> Add a Service IP Label/Address configurable on Multiple Nodes (extended)
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* IP Label/Address host1_svc2
* Network Name net_ether_02
Alternate HW Address to accompany IP Label/Address []
Add host2_svc2 in the same way.
3) Update the resource group
smitty hacmp->Extended Configuration
->Extended Resource Configuration
->HACMP Extended Resource Group Configuration
->Change/Show Resources and Attributes for a Resource Group
->Change/Show All Resources and Attributes for a Resource Group
4) Synchronize HACMP
This triggers the new service IP to take effect.
Running netstat -in now shows the address active.
3.3.5. Changing a service IP address
If the address is used by the application service, the application naturally has to be stopped for the change. For example, to change the address from 10.2.200.x to 10.2.201.x and the route to 10.2.201.254, proceed as follows:
1. Stop HACMP normally
smitty clstop ->Bring Resource Groups offline
2. On all nodes, modify /etc/hosts and change the service addresses to the new ones
10.2.201.1 host1_l2_svc host1
10.2.201.2 host2_l2_svc host2
Note: correct /usr/es/sbin/cluster/etc/clhosts at the same time.
3. Modify the route portion of the start script (if needed; see the sketch below)
GATEWAY=10.2.201.254
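In the start script this variable typically feeds a default-route update; a minimal sketch of that fragment (an illustration only, not the author's actual script from the scripts chapter):
route delete 0 >/dev/null 2>&1     # remove the old default route if one is present
route add 0 $GATEWAY               # on AIX, destination 0 means the default route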
4. On one node, modify the HACMP configuration
smitty hacmp->Extended Configuration
-> Extended Resource Configuration
->HACMP Extended Resources Configuration
->Configure HACMP Service IP Labels/Addresses
->Change/Show a Service IP Label/Address, select host1_l2_svc
Make no changes and simply press Enter; modify host2_l2_svc in the same way.
smitty hacmp->Extended Configuration
->Extended Resource Configuration
->HACMP Extended Resource Group Configuration
->Change/Show Resources and Attributes for a Resource Group
->Change/Show All Resources and Attributes for a Resource Group
Select host1_RG
Make no changes and simply press Enter; do the same for host2_RG.
5. Synchronize HACMP
6. Restart HACMP and verify
This triggers the new service IP addresses to take effect.
Note: if the address being changed is not used by the application service, or service on that address can be paused during the change, step 1 can be replaced by a forced stop and step 7 added; the whole procedure can then be done without stopping the application service.
7. Remove the old service IP address
Use netstat -in to find the adapter holding that service IP, for example en2:
ifconfig en2 alias delete 10.2.200.1
3.3.6. Changing a boot address
1. Use smitty tcpip to change the adapter's address
2. Modify the boot address in /etc/hosts
Note: correct /usr/es/sbin/cluster/etc/clhosts at the same time.
3. Modify the HACMP configuration
smitty hacmp ->Extended Configuration
-> Extended Topology Configuration
-> Extended Topology Configuration
Change/Show a Communication Interface
Node Name [bgbcb04]
Network Interface en1
IP Label/Address host1_boot1
Network Type ether
* Network Name [net_ether_01]
Make no changes and simply press Enter; modify the other boot addresses in the same way.
4. Synchronize HACMP
5. Restart HACMP and verify
Note: adjust the startup options so that resources are re-acquired at start, which triggers the new boot IP addresses to take effect; otherwise the boot addresses shown by clstat will be down.
3.3.7. User changes
Changing a user's password
For security-policy reasons the system may require password changes; doing this through HACMP is much more convenient, and avoids the annoyance of failing over long afterwards only to find nobody remembers the password and it has to be forcibly reset.
The only design flaw is that you must be root to use this function.
smitty HACMP ->Extended Configuration
-> Security and Users Configuration
-> Passwords in an HACMP cluster
-> Change a User's Password in the Cluster
Selection nodes by resource group host2_RG
*** No selection means all nodes! ***
* User NAME [orarun]
User must change password on first login? false
At this point you are prompted to enter the new password:
COMMAND STATUS
Command: running stdout: no stderr: no
Before command completion, additional instructions may appear below.
orarun's New password:
Enter the new password again:
OK means it succeeded.
Changing user attributes
The following steps change a user's attributes. Note that although the UID can be changed directly, just as on a standalone operating system the ownership of the user's existing files and directories is not updated automatically and must be fixed by hand afterwards, so it is best to plan UIDs sensibly at the planning stage.
smitty HACMP ->Extended Configuration
-> Security and Users Configuration
->Users in an HACMP cluster
-> Change / Show Characteristics of a User in the Cluster
Select the resource group and the user
Apart from the first line, everything is used exactly as on a standalone operating system.
Change User Attributes on the Cluster
Resource group eai1d0_RG
* User NAME test
User ID [301]
ADMINISTRATIVE USER? false
….