Although HACMP provides an automated test tool that is fairly simple to use, a complete HACMP test is in my view a complex undertaking. The tool has been around for quite a while, but it still does not feel entirely trustworthy, and it cannot simulate failures such as a switch outage. It can therefore only assist the effort: do not rely on it exclusively, and treat its results as reference only.
2.1. Test methods
1. ping test: initiated from the client, 1024-byte packets, sustained for 10 minutes (a sample command follows this list).
2. Long ping test: 1024-byte packets, sustained for 24 hours.
3. Application test: use an automated test tool such as LoadRunner to keep connecting to the application service from the client and running queries.
4. Long application test: run the application test continuously for 48 hours.
5. telnet test: connect via telnet and verify as the situation requires.
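A minimal example of how the ping test can be driven (an illustration only; host1_l2_svc stands for whichever service label you are testing, and on AIX the size and count can also be given positionally as "ping host1_l2_svc 1024 600"):
[client][root][/]>ping -s 1024 -c 600 host1_l2_svc      # 1024-byte payload, 600 packets at the default 1 s interval, about 10 minutes
For the 24-hour long test, raise the count accordingly (for example -c 86400) or omit it and stop the test by hand.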
2.2. Standard test
This test is mandatory. Every network segment must be tested once; it is normally performed at the initial-configuration and final-configuration stages of installation, and again during scheduled maintenance windows.
2.2.1. Standard test table
Note: after each step, use clstat to confirm that HACMP is back in the STABLE state before performing the next step. This is especially important for the recovery steps (steps 4 and 10 actually consist of 3 small sub-steps each); it is best to wait 120-300 s between steps, otherwise HACMP may not have had time to reach a decision while its state is still unstable, and anomalies can occur. (A quick way to script this check is sketched after the table.)
No. | Test step | System result | Application result |
1 | Unplug host1's service network cable | The address swaps to the other adapter | ~30 s interruption, then service continues |
2 | Unplug host1's remaining network cable | Failover takes place | ~5 min interruption, then service continues |
3 | Unplug host2's service network cable | All service addresses swap to the other adapter | ~30 s interruption, then service continues |
4 | Reconnect all network cables | The addresses join; clstat shows everything up | No impact |
5 | Run halt -q on host2 | host2 goes down; failover to host1 | ~5 min interruption, then service continues |
6 | Boot host2 and manually run smit clstart on host2 to rejoin the cluster | host2's resources and services running on host1 move back to host2; the cluster returns to its designed state | ~5 min interruption, then service continues |
7 | Unplug host2's service network cable | The address swaps to the other adapter | ~30 s interruption, then service continues |
8 | Unplug host2's remaining network cable | Failover takes place | ~5 min interruption, then service continues |
9 | Unplug host1's service network cable | All service addresses swap to the other adapter | ~30 s interruption, then service continues |
10 | Reconnect all network cables | The addresses join; clstat shows everything up | No impact |
11 | Run halt -q on host1 | host1 goes down; failover to host2 | ~5 min interruption, then service continues |
12 | Boot host1 and manually run smit clstart on host1 to rejoin the cluster | host1's resources and services running on host2 move back to host1; the cluster returns to its designed state | ~5 min interruption, then service continues |
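Between steps, a quick way to confirm the cluster has settled back to STABLE (a sketch; on the HACMP levels I have used, the cluster manager reports its internal state via lssrc, and clstat -o gives a one-shot text snapshot):
[host1][root][/]>lssrc -ls clstrmgrES | grep -i "Current state"    # expect ST_STABLE
[host1][root][/]>/usr/sbin/cluster/clstat -o                       # one-shot, non-interactive status display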
The following is a partial analysis of the /var/hacmp/log/hacmp.out log, for reference during your own tests:
Step 1: unplug host1's service network cable
Sep 16 14:53:10 EVENT START: swap_adapter host1 net_ether_02 10.2.12.1 10.2.200.1
Sep 16 14:53:12 EVENT START: swap_aconn_protocols en3 en1
Sep 16 14:53:12 EVENT COMPLETED: swap_aconn_protocols en3 en1 0
Sep 16 14:53:12 EVENT COMPLETED: swap_adapter host1 net_ether_02 10.2.12.1 10.2.200.1 0
Sep 16 14:53:12 EVENT START: swap_adapter_complete host1 net_ether_02 10.2.12.1 10.2.200.1
Sep 16 14:53:13 EVENT COMPLETED: swap_adapter_complete host1 net_ether_02 10.2.12.1 10.2.200.1 0
Step 2: unplug host1's remaining network cable
Sep 16 14:53:14 EVENT START: fail_interface host1 10.2.2.1
Sep 16 14:53:14 EVENT COMPLETED: fail_interface host1 10.2.2.1 0
Sep 16 14:53:55 EVENT START: network_down host1 net_ether_02
Sep 16 14:53:56 EVENT COMPLETED: network_down host1 net_ether_02 0
Sep 16 14:53:56 EVENT START: network_down_complete host1 net_ether_02
Sep 16 14:53:56 EVENT COMPLETED: network_down_complete host1 net_ether_02 0
Sep 16 14:54:03 EVENT START: rg_move_release host1 1
Sep 16 14:54:03 EVENT START: rg_move host1 1 RELEASE
Sep 16 14:54:03 EVENT START: node_down_local
Sep 16 14:54:03 EVENT START: stop_server host2_app host1_app
Sep 16 14:54:04 EVENT COMPLETED: stop_server host2_app host1_app 0
Sep 16 14:54:04 EVENT START: release_vg_fs ALL host1vg
Sep 16 14:54:06 EVENT COMPLETED: release_vg_fs ALL host1vg 0
Sep 16 14:54:06 EVENT START: release_service_addr host1_l1_svc1 host1_l1_svc2 host1_l2_svc
Sep 16 14:54:11 EVENT COMPLETED: release_service_addr host1_l1_svc1 host1_l1_svc2 host1_l2_svc 0
Sep 16 14:54:11 EVENT COMPLETED: node_down_local 0
Sep 16 14:54:11 EVENT COMPLETED: rg_move host1 1 RELEASE 0
Sep 16 14:54:11 EVENT COMPLETED: rg_move_release host1 1 0
Sep 16 14:54:13 EVENT START: rg_move_fence host1 1
Sep 16 14:54:14 EVENT COMPLETED: rg_move_fence host1 1 0
Sep 16 14:54:14 EVENT START: rg_move_acquire host1 1
Sep 16 14:54:14 EVENT START: rg_move host1 1 ACQUIRE
Sep 16 14:54:14 EVENT COMPLETED: rg_move host1 1 ACQUIRE 0
Sep 16 14:54:14 EVENT COMPLETED: rg_move_acquire host1 1 0
Sep 16 14:54:24 EVENT START: rg_move_complete host1 1
Sep 16 14:54:25 EVENT START: node_up_remote_complete host1
Sep 16 14:54:25 EVENT COMPLETED: node_up_remote_complete host1 0
Sep 16 14:54:25 EVENT COMPLETED: rg_move_complete host1 1 0
Step 4: reconnect all network cables
Sep 16 14:55:49 EVENT START: network_up host1 net_ether_02
Sep 16 14:55:49 EVENT COMPLETED: network_up host1 net_ether_02 0
Sep 16 14:55:50 EVENT START: network_up_complete host1 net_ether_02
Sep 16 14:55:50 EVENT COMPLETED: network_up_complete host1 net_ether_02 0
Sep 16 14:56:00 EVENT START: join_interface host1 10.2.12.1
Sep 16 14:56:00 EVENT COMPLETED: join_interface host1 10.2.12.1 0
Step 5: run halt -q on host2
Sep 16 14:58:56 EVENT START: node_down host2
Sep 16 14:58:57 EVENT START: acquire_service_addr
Sep 16 14:58:58 EVENT START: acquire_aconn_service en0 net_ether_01
Sep 16 14:58:59 EVENT COMPLETED: acquire_aconn_service en0 net_ether_01 0
Sep 16 14:59:00 EVENT START: acquire_aconn_service en2 net_ether_01
Sep 16 14:59:00 EVENT COMPLETED: acquire_aconn_service en2 net_ether_01 0
Sep 16 14:59:01 EVENT START: acquire_aconn_service en1 net_ether_02
Sep 16 14:59:01 EVENT COMPLETED: acquire_aconn_service en1 net_ether_02 0
Sep 16 14:59:01 EVENT COMPLETED: acquire_service_addr 0
Sep 16 14:59:02 EVENT START: acquire_takeover_addr
Sep 16 14:59:05 EVENT COMPLETED: acquire_takeover_addr 0
Sep 16 14:59:11 EVENT COMPLETED: node_down host2 0
Sep 16 14:59:11 EVENT START: node_down_complete host2
Sep 16 14:59:12 EVENT START: start_server host1_app host2_app
Sep 16 14:59:12 EVENT START: start_server host2_app
Sep 16 14:59:12 EVENT COMPLETED: start_server host1_app host2_app 0
Sep 16 14:59:12 EVENT COMPLETED: start_server host2_app 0
Sep 16 14:59:13 EVENT COMPLETED: node_down_complete host2 0
Step 6: fail back (rejoin the cluster)
Sep 16 15:10:25 EVENT START: node_up host2
Sep 16 15:10:27 EVENT START: acquire_service_addr
Sep 16 15:10:28 EVENT START: acquire_aconn_service en0 net_ether_01
Sep 16 15:10:28 EVENT COMPLETED: acquire_aconn_service en0 net_ether_01 0
Sep 16 15:10:29 EVENT START: acquire_aconn_service en2 net_ether_01
Sep 16 15:10:29 EVENT COMPLETED: acquire_aconn_service en2 net_ether_01 0
Sep 16 15:10:31 EVENT START: acquire_aconn_service en1 net_ether_02
Sep 16 15:10:31 EVENT COMPLETED: acquire_aconn_service en1 net_ether_02 0
Sep 16 15:10:31 EVENT COMPLETED: acquire_service_addr 0
Sep 16 15:10:36 EVENT COMPLETED: node_up host2 0
Sep 16 15:10:36 EVENT START: node_up_complete host2
Sep 16 15:10:36 EVENT START: start_server host2_app
Sep 16 15:10:37 EVENT COMPLETED: start_server host2_app 0
Sep 16 15:10:37 EVENT COMPLETED: node_up_complete host2 0
Sep 16 15:10:41 EVENT START: network_up host2 net_diskhbmulti_01
Sep 16 15:10:42 EVENT COMPLETED: network_up host2 net_diskhbmulti_01 0
Sep 16 15:10:42 EVENT START: network_up_complete host2 net_diskhbmulti_01
Sep 16 15:10:42 EVENT COMPLETED: network_up_complete host2 net_diskhbmulti_01 0
Step 7: unplug host2's service network cable
Sep 16 15:20:36 EVENT START: swap_adapter host2 net_ether_02 10.2.12.2 10.2.200.2
Sep 16 15:20:38 EVENT START: swap_aconn_protocols en3 en1
Sep 16 15:20:38 EVENT COMPLETED: swap_aconn_protocols en3 en1 0
Sep 16 15:20:38 EVENT COMPLETED: swap_adapter host2 net_ether_02 10.2.12.2 10.2.200.2 0
Sep 16 15:20:39 EVENT START: swap_adapter_complete host2 net_ether_02 10.2.12.2 10.2.200.2
Sep 16 15:20:39 EVENT COMPLETED: swap_adapter_complete host2 net_ether_02 10.2.12.2 10.2.200.2 0
Step 8: unplug host2's remaining network cable
Sep 16 15:20:40 EVENT START: fail_interface host2 10.2.2.2
Sep 16 15:20:40 EVENT COMPLETED: fail_interface host2 10.2.2.2 0
Sep 16 15:21:40 EVENT START: network_down host2 net_ether_02
Sep 16 15:21:40 EVENT COMPLETED: network_down host2 net_ether_02 0
Sep 16 15:21:40 EVENT START: network_down_complete host2 net_ether_02
Sep 16 15:21:41 EVENT COMPLETED: network_down_complete host2 net_ether_02 0
Sep 16 15:21:47 EVENT START: rg_move_release host2 2
Sep 16 15:21:47 EVENT START: rg_move host2 2 RELEASE
Sep 16 15:21:48 EVENT START: node_down_local
Sep 16 15:21:48 EVENT START: stop_server host2_app
Sep 16 15:21:48 EVENT COMPLETED: stop_server host2_app 0
Sep 16 15:21:48 EVENT START: release_vg_fs ALL host2vg
Sep 16 15:21:50 EVENT COMPLETED: release_vg_fs ALL host2vg 0
Sep 16 15:21:50 EVENT START: release_service_addr host2_l1_svc1 host2_l1_svc2 host2_l2_svc
Sep 16 15:21:55 EVENT COMPLETED: release_service_addr host2_l1_svc1 host2_l1_svc2 host2_l2_svc 0
Sep 16 15:21:55 EVENT COMPLETED: node_down_local 0
Sep 16 15:21:55 EVENT COMPLETED: rg_move host2 2 RELEASE 0
Sep 16 15:21:55 EVENT COMPLETED: rg_move_release host2 2 0
Sep 16 15:21:57 EVENT START: rg_move_fence host2 2
Sep 16 15:21:58 EVENT COMPLETED: rg_move_fence host2 2 0
Sep 16 15:21:58 EVENT START: rg_move_acquire host2 2
Sep 16 15:21:58 EVENT START: rg_move host2 2 ACQUIRE
Sep 16 15:21:58 EVENT COMPLETED: rg_move host2 2 ACQUIRE 0
Sep 16 15:21:58 EVENT COMPLETED: rg_move_acquire host2 2 0
Sep 16 15:22:08 EVENT START: rg_move_complete host2 2
Sep 16 15:22:08 EVENT START: node_up_remote_complete host2
Sep 16 15:22:09 EVENT COMPLETED: node_up_remote_complete host2 0
Sep 16 15:22:09 EVENT COMPLETED: rg_move_complete host2 2 0
Step 9: unplug host1's service network cable
Sep 16 15:43:42 EVENT START: swap_adapter host1 net_ether_02 10.2.2.1 10.2.200.2
Sep 16 15:43:43 EVENT COMPLETED: swap_adapter host1 net_ether_02 10.2.2.1 10.2.200.2 0
Sep 16 15:43:45 EVENT START: swap_adapter_complete host1 net_ether_02 10.2.2.1 10.2.200.2
Sep 16 15:43:45 EVENT COMPLETED: swap_adapter_complete host1 net_ether_02 10.2.2.1 10.2.200.2 0
Sep 16 15:43:47 EVENT START: fail_interface host1 10.2.12.1
Sep 16 15:43:47 EVENT COMPLETED: fail_interface host1 10.2.12.1 0
Step 10: reconnect all network cables
Sep 16 15:45:07 EVENT START: network_up host2 net_ether_02
Sep 16 15:45:08 EVENT COMPLETED: network_up host2 net_ether_02 0
Sep 16 15:45:08 EVENT START: network_up_complete host2 net_ether_02
Sep 16 15:45:08 EVENT COMPLETED: network_up_complete host2 net_ether_02 0
Sep 16 15:45:43 EVENT START: join_interface host2 10.2.12.2
Sep 16 15:45:43 EVENT COMPLETED: join_interface host2 10.2.12.2 0
Sep 16 15:47:05 EVENT START: join_interface host1 10.2.12.1
Sep 16 15:47:05 EVENT COMPLETED: join_interface host1 10.2.12.1 0
Step 11: run halt -q on host1
Sep 16 15:48:48 EVENT START: node_down host1
Sep 16 15:48:49 EVENT START: acquire_service_addr
Sep 16 15:48:50 EVENT START: acquire_aconn_service en0 net_ether_01
Sep 16 15:48:50 EVENT COMPLETED: acquire_aconn_service en0 net_ether_01 0
Sep 16 15:48:51 EVENT START: acquire_aconn_service en2 net_ether_01
Sep 16 15:48:51 EVENT COMPLETED: acquire_aconn_service en2 net_ether_01 0
Sep 16 15:48:53 EVENT START: acquire_aconn_service en1 net_ether_02
Sep 16 15:48:53 EVENT COMPLETED: acquire_aconn_service en1 net_ether_02 0
Sep 16 15:48:53 EVENT COMPLETED: acquire_service_addr 0
Sep 16 15:48:53 EVENT START: acquire_takeover_addr
Sep 16 15:48:57 EVENT COMPLETED: acquire_takeover_addr 0
Sep 16 15:49:02 EVENT COMPLETED: node_down host1 0
Sep 16 15:49:02 EVENT START: node_down_complete host1
Sep 16 15:49:03 EVENT START: start_server host1_app host2_app
Sep 16 15:49:03 EVENT START: start_server host2_app
Sep 16 15:49:03 EVENT COMPLETED: start_server host1_app host2_app 0
Sep 16 15:49:03 EVENT COMPLETED: start_server host2_app 0
Sep 16 15:49:04 EVENT COMPLETED: node_down_complete host1 0
2.3. Full test
The full test is carried out when there is enough test time and the necessary conditions (for example, the switches can take part in the test); it is normally scheduled for the week before the system goes live.
Note: to keep the table below generic, two situations are not broken out in detail and deserve attention.
1. When one network carries two service IP addresses, they are placed on boot1 and boot2 respectively for load balancing, so whichever adapter fails, an address swap will occur.
2. The application interruption times do not include application reconnection time; for example, when the Oracle DB address moves, Tuxedo actually has to be restarted before it can reconnect, and this must be handled by the start/stop scripts.
In addition, real environments may differ and be even more complex, so treat this table as a reference only; it lays out the bulk of the scenarios mainly as a reminder not to leave anything out.
2.3.1. Full test table
No. | Test scenario | System result | Application result | Reference duration |
 | Function tests | | | |
1 | Start HA on host2 | host2 service IP addresses take effect; VGs and file systems come online | host2 app (DB) starts OK | 120s |
2 | Stop HA on host2 | host2 service IP addresses and VGs are released cleanly | host2 app stops | 15s |
3 | Start HA on host1 | host1 service IP addresses take effect; VGs and file systems come online | host1 app starts OK | 120s |
4 | Stop HA on host1 | host1 adapters and VGs are released cleanly | host1 app stops | 15s |
5 | host2 takeover to host1 | host2's service addresses move to host1's boot2, together with the VGs etc. | host2 app briefly interrupted | 30s |
 | host2 clstart | Fails back | host2 app briefly interrupted | 120s |
6 | host1 takeover to host2 | host1's service addresses move to host2's boot2, together with the VGs etc. | host1 app briefly interrupted | 30s |
 | host1 clstart | Fails back | host1 app briefly interrupted | 120s |
 | Adapter failure tests | | | |
1 | Unplug host2's boot1 cable | host2's service IP swaps from boot1 to boot2 | host2 app briefly interrupted | 30s |
 | Reconnect host2's boot1 cable | host2 boot1 joins | No impact | 40s |
2 | Unplug host2's boot2 cable | host2's service IP swaps from boot2 to boot1 | host2 app briefly interrupted | 30s |
 | Reconnect host2's boot2 cable | host2 boot2 joins | No impact | 40s |
3 | Unplug host2's boot1 and boot2 cables | host2's service addresses move to host1's boot2; the VGs etc. move to host1 | host2 app briefly interrupted | 210s |
 | Then unplug host1's boot2 cable | host2's service IPs swap to host1's boot1 | host2 app briefly interrupted | 30s |
 | Reconnect host2's boot1 and boot2 cables | host2 boot1 and boot2 join | No impact | 30s |
 | host2 clstart | Fails back | host2 app briefly interrupted | 120s |
4 | Unplug host1's boot1 and boot2 cables | host1's service addresses move to host2's boot2; the VGs etc. move to host2 | host1 app briefly interrupted | 210s |
 | Then unplug host2's boot2 cable | host1's service IPs swap to host2's boot1 | host1 app briefly interrupted | 30s |
 | Reconnect host1's boot1 and boot2 cables | host1 boot1 and boot2 join | No impact | 30s |
 | host1 clstart | Fails back | host1 app briefly interrupted | 120s |
5 | host2 force clstop | Cluster services stop; IP and VG resources are untouched | No impact | 20s |
 | host2 clstart | Back to normal operation | No impact | 20s |
6 | host1 force clstop | Cluster services stop; IP and VG resources are untouched | No impact | 20s |
 | host1 clstart | Back to normal operation | No impact | 20s |
7 | Unplug both hosts' boot2 cables for 30 min | boot2 reported failed on both | No impact | 20s |
 | Reconnect both hosts' boot2 cables | Both boot2 join | No impact | 20s |
8 | Unplug both hosts' boot1 cables for 30 min | All service IP addresses swap to boot2 | host1 and host2 apps briefly interrupted | 30s |
 | Reconnect both hosts' boot1 cables | Both boot1 join | No impact | 20s |
 | Node crash tests | | | |
1 | host2 crashes suddenly (halt -q) | host2's service addresses move to host1's boot2, together with the VGs etc. | host2 app briefly interrupted | 30s |
 | host2 clstart | Fails back | host2 app briefly interrupted | 120s |
2 | host1 crashes suddenly (halt -q) | host1's service addresses move to host2's boot2, together with the VGs etc. | host1 app briefly interrupted | 30s |
 | host1 clstart | Fails back | host1 app briefly interrupted | 120s |
 | Switch failure tests | | | |
1 | Power off SwitchA | All service IP addresses swap to boot2 | host1 and host2 apps briefly interrupted | 50s |
 | SwitchA restored | Both boot1 join | No impact | 40s |
 | Power off SwitchB | All service IP addresses swap back to boot1 | host1 and host2 apps briefly interrupted | 50s |
 | SwitchB restored | Both boot2 join | No impact | 40s |
2 | Power off SwitchB | boot2 reported failed | No impact | 50s |
 | SwitchB restored | Both boot2 join | No impact | 40s |
 | Power off SwitchA | All service IP addresses swap to boot2 | host1 and host2 apps briefly interrupted | 50s |
 | SwitchA restored | Both boot1 join | No impact | 40s |
3 | Power off SwitchA and SwitchB for 10 min | Network reported down; nothing else moves | host1 and host2 apps interrupted | 10min |
 | SwitchA and SwitchB restored | boot1 and boot2 join | Service recovers automatically | 50s |
4 | Power off SwitchA | All service IP addresses swap to boot2 | host1 and host2 apps briefly interrupted | 50s |
 | Power off SwitchB 30 s later | Nothing moves | host1 and host2 apps interrupted | 50s |
 | SwitchA and SwitchB restored | Both boot1 join | Recovers automatically | 40s |
5 | Power off SwitchB | boot2 reported failed | No impact | 50s |
 | Power off SwitchA 30 s later | Network reported down; nothing else moves | host1 and host2 apps interrupted | 50s |
 | SwitchA and SwitchB restored | Both boot1 join | Recovers automatically | 40s |
6 | SwitchA failure (broadcast storm triggered by looping a cable) | Hosts themselves normal, but the network is unreachable | host1 and host2 apps interrupted | 20s |
 | SwitchA restored | Everything returns to normal after recovery | Recovers automatically | |
7 | SwitchB failure (broadcast storm triggered by looping a cable) | Hosts themselves normal, but the network is unreachable | host1 and host2 apps interrupted | 20s |
 | SwitchB restored | Everything returns to normal after recovery | Recovers automatically | |
8 | SwitchA and SwitchB fail at the same time (broadcast storm) | Hosts themselves normal, but severe packet loss on the network | host1 and host2 apps interrupted | 10s |
 | SwitchA and SwitchB restored | Everything returns to normal after recovery | Recovers automatically | 20s |
 | Stability tests | | | |
1 | Start HA on both host2 and host1 | | Normal service for more than 48 hours | |
2 | host2 takeover to host1 | | Normal service for more than 48 hours | |
3 | host1 takeover to host2 | | Normal service for more than 48 hours | |
2.4. Operations switchover test
The operations switchover test is performed during routine operations to guarantee high availability; we recommend doing it once a year. Such a test is really a drill: it uncovers problems in every area in time and provides an effective guarantee that the switchover will succeed during a real failure.
I have long heard users and colleagues complain that testing went perfectly yet the cluster failed to switch over at the critical moment. Apart from shortcomings in day-to-day operations (see the operations chapter), insufficient testing is another cause, so I now strongly recommend that environments which can afford it run operations switchover tests regularly.
In the past the standby was usually configured lower than the production host for cost reasons, or was heavily used for development and testing, which made such tests hard to carry out. As Power machines have become more capable, however, it is increasingly rare to install only one AIX instance per physical machine, and with HA active the resources of mutually backing LPARs can be adjusted between LPARs in real time, which makes this kind of swap test feasible.
2.4.1. Operations switchover test table
Scenario | Swap performed | Recommended duration | Switchover approach |
Active/standby (run->dev) | Production host and standby host swapped | >10 days | Stop the standby's dev/test environment, or temporarily modify the HA configuration |
 | Production LPAR and standby LPAR swapped | >30 days | Increase the standby LPAR's resources and reduce the production LPAR's; stop dev/test or temporarily modify the HA configuration |
Mutual standby (app<->db, app<->app, db<->db) | The two nodes swapped | >30 days | Manually start each other's resource groups crosswise |
Switching the production host to the standby:
There are two ways:
Ø Use takeover (move Resource Groups). Because of the extra load and to prevent mis-operation, the development/test environment on the standby generally has to be stopped.
Ø Alternatively, modify the HA configuration and add the running node to the standby resource group's node list. The development/test environment can then stay in use during the switchover test, but this not only means changing HA; it also requires that the standby's development/test environment was placed in a shared VG rather than on local disks when it was first configured, and that the development/test environment is synchronized to the production machine. It is best to plan for this at design time.
Manual cross-switchover:
Take the resource groups offline:
smitty hacmp->System Management (C-SPOC)
-> Resource Group and Applications
->Bring a Resource Group Offline, select host2_RG, host2
Bring a Resource Group Offline
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
Resource Group to Bring Offline host2_RG
Node On Which to Bring Resource Group Offline host2
Persist Across Cluster Reboot? false
Take host1_RG offline in the same way.
Swap the resource groups:
smitty HACMP->System Management (C-SPOC)
-> Resource Group and Applications
->Bring a Resource Group Online, select host2_RG, host1
Resource Group to Bring Online host2_RG
Node on Which to Bring Resource Group Online host1
Answer No to Persist Across Cluster Reboot.
That is, start host2's resource group on host1, and in the same way start host1's resource group on host2; the two machines have then swapped roles. (For reference, a command-line sketch of the same operation follows.)
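The same offline/online sequence can also be driven from the command line with the clRGmove utility that the C-SPOC panels call; a minimal sketch (verify the exact options against your HACMP level before relying on it):
[host1][root][/]>/usr/es/sbin/cluster/utilities/clRGmove -g host2_RG -n host2 -d    # bring host2_RG offline on host2
[host1][root][/]>/usr/es/sbin/cluster/utilities/clRGmove -g host1_RG -n host1 -d    # bring host1_RG offline on host1
[host1][root][/]>/usr/es/sbin/cluster/utilities/clRGmove -g host2_RG -n host1 -u    # bring host2_RG online on host1
[host2][root][/]>/usr/es/sbin/cluster/utilities/clRGmove -g host1_RG -n host2 -u    # bring host1_RG online on host2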
Note: because a mutual switchover requires manual intervention, and failing back also requires manual intervention, the running state must be closely monitored during the switchover period so that any anomaly can be handled by hand immediately.
Swap the crontabs and related background scripts:
Background jobs in crontab, such as backups, differ between the two nodes and therefore also need to be swapped. With our approach (see the HA synchronization script in the scripts chapter), it is enough to copy the corresponding crontab:
[host1][root][/]>cp -rp /home/scripts/host2/crontab_host2 /var/spool/cron/crontabs/root
Fix the file ownership and permissions:
[host1][root][/]>chown root:cron /var/spool/cron/crontabs/root
[host1][root][/]>chmod 600 /var/spool/cron/crontabs/root
Restart cron:
[host1][root][/]> ps -ef|grep cron
root 278688 1 0 Dec 19 - 0:02 /usr/sbin/cron
[host1][root][/]>kill -9 278688
If you do not use our script-based approach, then besides copying the peer's crontab, remember to synchronize the corresponding scripts as well.
Swap the backup policies:
Backup methods differ, so the adjustments needed also differ and must be handled system by system. In the lab environment backups run as background jobs and nothing further is needed. A real environment may use backup software; since the hosts have swapped, you must confirm whether the backup policies are still effective, and correct them if they are not.
Once configuration and testing have been passed, the system goes live with high availability assured. But do not forget that HACMP itself needs careful maintenance if it is to act at the most critical moment; otherwise it becomes mere decoration, and operators may even drop their guard with the attitude that "HACMP is installed, so it will naturally do its job when the time comes."
3.1. HACMP switchover problems and handling
We have compiled the unsuccessful and spurious switchovers encountered in the past, and summarized in the table below why switchovers that succeeded in testing later failed, together with countermeasures:
3.1.1. HACMP switchover problem table
Symptom | Cause | Root cause | Countermeasure |
Cannot switch over (1) | After running for a while the two nodes' configurations are inconsistent and out of sync | Users, file systems and other system changes were not made through HACMP's facilities (including C-SPOC) | Establish and follow standards, check regularly, and fix promptly at scheduled maintenance |
Cannot switch over (2) | The application will not stop, causing a timeout; file systems cannot be unmounted | The stop script was not thorough enough | Standardize; add a kill_vg_user script (see the sketch after this table) |
Switchover succeeds but the application is abnormal (1) | The application fails to start | The application changed; the stop script left it abnormally stopped, or the start script is incorrect | Standardize the start/stop scripts and keep them up to date |
Switchover succeeds but the application is abnormal (2) | The standby's configuration does not meet the production requirements | Various system and software parameters are unsuitable | Draft a checking standard and confirm it through operations switchover tests |
Switchover succeeds but communication is abnormal (1) | A network route is unreachable | Network configuration | Correct and test the routes; confirm through operations switchover tests |
Switchover succeeds but communication is abnormal (2) | Communication software configuration problem | One host now carries two service addresses in the same subnet, and messages go out from the other IP address, causing errors | Correct the configuration and bind to the designated service IP |
Spurious switchover | DMS (dead man switch) problem | Sustained excessive system load | |
Note: remember that, from the customer's point of view and regardless of the reason, "if the application is interrupted for more than 5-10 minutes, the HACMP switchover has failed", and all the preceding work has been wasted; the importance of maintenance work is therefore self-evident.
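The kill_vg_user script mentioned above belongs to the scripts chapter of this series. Purely to illustrate the idea (a minimal sketch, not the author's actual script), assuming lsvgfs can list the VG's file systems and it is acceptable to kill whatever still holds them open:
#!/bin/ksh
# kill_vg_user <vgname>: free a VG's file systems before HACMP tries to umount them
VG=$1
for FS in $(lsvgfs $VG); do
    echo "killing processes still using $FS"
    fuser -kuxc $FS     # -c: treat argument as a mount point, -u: show owning users, -k: send SIGKILL
done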
3.1.2. Stopping HACMP in forced mode
There are three ways to stop HACMP:
Bring Resource Groups Offline (normal stop)
Move Resource Groups (manual switchover)
Unmanage Resource Groups (force-stop HACMP without stopping the resource groups)
Much of the maintenance work below requires HACMP to be force-stopped; the resource groups are not released. The benefit is that the IP addresses, file systems and so on are completely unaffected and only HACMP itself is stopped, so the application can keep serving users, which makes online inspection and modification of HACMP possible.
[host1][root][/]>smitty clstop
Stop Cluster Services
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* Stop now, on system restart or both now
Stop Cluster Services on these nodes [host1]
BROADCAST cluster shutdown? false
* Select an Action on Resource Groups Unmanage Resource Group
Remember that as a rule this has to be done on every node.
cldump then shows the following:
......
Cluster Name: test_cluster
Resource Group Name: rg_diskhbmulti_01
Startup Policy: Online On All Available Nodes
Fallover Policy: Bring Offline (On Error Node Only)
Fallback Policy: Never Fallback
Site Policy: ignore
Node Group State
---------------------------- ---------------
host1 UNMANAGED
host2 UNMANAGED
Resource Group Name: host1_RG
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Node Group State
---------------------------- ---------------
host1 UNMANAGED
host2 UNMANAGED
Resource Group Name: host2_RG
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Node Group State
---------------------------- ---------------
host2 UNMANAGED
host1 UNMANAGED
3.1.3. Starting HACMP after a forced stop
After modifying the HACMP configuration, in most cases you need to restart it so that resources are re-acquired; only then does the new HACMP configuration take effect.
[host1][root][/]>smitty clstart
Note: to be safe, set Startup Cluster Information Daemon? to true.
To maintain HACMP well, routine checks and handling are indispensable. Unless otherwise stated, the checks and procedures below can all be carried out without shutting down the machine or stopping the application, and do not affect users; still, examine the current state carefully before acting.
Of course, the most convincing check and verification is the operations switchover test; see the testing chapter.
3.2.1. clverify check
This check covers the synchronization state of most of the HACMP configuration, including LVM, and is the main way to verify that HACMP is in sync.
smitty clverify ->Verify HACMP Configuration
The result should be OK. If inconsistencies are found, treat them case by case. For non-LVM errors, in most cases the application does not have to be stopped and the following steps resolve them:
1. First stop the HACMP services in forced mode.
Stop the HACMP services on host2 in the same way.
2. Correct the problems found and synchronize:
smitty hacmp -> Extended Configuration
-> Extended Verification and Synchronization
Because the HACMP services are stopped at this point, automatic correction and forced synchronization can be included.
LVM errors are usually caused by a file system, LV or VG having been changed on one node without HACMP's C-SPOC function, which leaves the VG timestamps inconsistent. In that case, even if you correct the other side by hand (usually not possible anyway because the application is using it) and choose synchronization with automatic correction, it will still report failed. The only remedy is to stop the application and follow the "clean up the VG" section of the initial-cleanup chapter.
3.2.2. Process check
1) Check the services and processes; at least the following three should be active:
[host1][root][/]#lssrc -a|grep ES
clcomdES clcomdES 10027064 active
clstrmgrES cluster 9109532 active
clinfoES cluster 5767310 active
2) /var holds the HACMP logs; confirm that it still has free space (for example with df, as sketched below).
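A trivial way to check (the acceptable threshold is up to you):
[host1][root][/]#df -g /var      # make sure /var, which holds hacmp.out and the other cluster logs, is not close to full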
3.2.3. cldump check
cldump can also be invoked from the HACMP menus, with the same effect.
cldump takes a snapshot of the current HACMP state; confirm that it shows UP and STABLE, otherwise analyze and handle it according to the actual situation.
[host1][root][/]>/usr/sbin/cluster/utilities/cldump
Obtaining information via SNMP from Node: host1...
_____________________________________________________________________________
Cluster Name: test_cluster
Cluster State: UP
Cluster Substate: STABLE
_____________________________________________________________________________
Node Name: host1 State: UP
Network Name: net_diskhbmulti_01 State: UP
Address: Label: host1_1 State: UP
Network Name: net_ether_01 State: UP
Address: 10.2.100.1 Label: host1_l1_svc1 State: UP
Address: 10.2.101.1 Label: host1_l1_svc2 State: UP
Address: 10.2.11.1 Label: host1_l1_boot2 State: UP
Address: 10.2.1.21 Label: host1_l1_boot1 State: UP
Network Name: net_ether_02 State: UP
Address: 10.2.12.1 Label: host1_l2_boot2 State: UP
Address: 10.2.2.1 Label: host1_l2_boot1 State: UP
Address: 10.2.200.1 Label: host1_l2_svc State: UP
Node Name: host2 State: UP
Network Name: net_diskhbmulti_01 State: UP
Address: Label: host2_2 State: UP
Network Name: net_ether_01 State: UP
Address: 10.2.100.2 Label: host2_l1_svc1 State: UP
Address: 10.2.101.2 Label: host2_l1_svc2 State: UP
Address: 10.2.11.2 Label: host2_l1_boot2 State: UP
Address: 10.2.1.22 Label: host2_l1_boot1 State: UP
Network Name: net_ether_02 State: UP
Address: 10.2.12.2 Label: host2_l2_boot2 State: UP
Address: 10.2.2.2 Label: host2_l2_boot1 State: UP
Address: 10.2.200.2 Label: host2_l2_svc State: UP
Cluster Name: test_cluster
Resource Group Name: rg_diskhbmulti_01
Startup Policy: Online On All Available Nodes
Fallover Policy: Bring Offline (On Error Node Only)
Fallback Policy: Never Fallback
Site Policy: ignore
Node Group State
---------------------------- ---------------
host1 ONLINE
host2 ONLINE
Resource Group Name: host1_RG
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Node Group State
---------------------------- ---------------
host1 ONLINE
host2 OFFLINE
Resource Group Name: host2_RG
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Node Group State
---------------------------- ---------------
host2 ONLINE
host1 OFFLINE
3.2.4. clstat check
clstat monitors the HACMP state in real time; confirm promptly that it shows UP and STABLE, otherwise analyze and handle it according to the actual situation.
[host1][root][/]>/usr/sbin/cluster/clstat
clstat - HACMP Cluster Status Monitor
-------------------------------------
Cluster: test_cluster (1572117373)
Mon Sep 16 13:38:31 GMT+08:00 2013
State: UP Nodes: 2
SubState: STABLE
Node: host1 State: UP
Interface: host1_l2_boot1 (2) Address: 10.2.2.1
State: UP
Interface: host1_l1_boot2 (1) Address: 10.2.11.1
State: UP
Interface: host1_l2_boot2 (2) Address: 10.2.12.1
State: UP
Interface: host1_l1_boot1 (1) Address: 10.2.1.21
State: UP
Interface: host1_1 (0) Address: 0.0.0.0
State: UP
Interface: host1_l1_svc1 (1) Address: 10.2.100.1
State: UP
Interface: host1_l1_svc2 (1) Address: 10.2.101.1
State: UP
Interface: host1_l2_svc (2) Address: 10.2.200.1
State: UP
Resource Group: host1_RG State: On line
Resource Group: rg_diskhbmulti_01 State: On line
Node: host2 State: UP
Interface: host2_l2_boot1 (2) Address: 10.2.2.2
State: UP
Interface: host2_l1_boot2 (1) Address: 10.2.11.2
State: UP
Interface: host2_l2_boot2 (2) Address: 10.2.12.2
State: UP
Interface: host2_l1_boot1 (1) Address: 10.2.1.22
State: UP
Interface: host2_2 (0) Address: 0.0.0.0
State: UP
Interface: host2_l1_svc1 (1) Address: 10.2.100.2
State: UP
Interface: host2_l1_svc2 (1) Address: 10.2.101.2
State: UP
Interface: host2_l2_svc (2) Address: 10.2.200.2
State: UP
Resource Group: host2_RG State: On line
Resource Group: rg_diskhbmulti_01 State: On line
************************ f/forward, b/back, r/refresh, q/quit *****************
3.2.5. cldisp check
This looks at the cluster from the resource point of view and shows whether the resource group information is correct; again, all states should be up, stable and online.
[host1][root][/]#/usr/es/sbin/cluster/utilities/cldisp
Cluster: test_cluster
Cluster services: active
State of cluster: up
Substate: stable
#############
APPLICATIONS
#############
Cluster test_cluster provides the following applications: host1_app host2_app
Application: host1_app
host1_app is started by /usr/sbin/cluster/app/start_host1
host1_app is stopped by /usr/sbin/cluster/app/stop_host1
No application monitors are configured for host1_app.
This application is part of resource group 'host1_RG'.
Resource group policies:
Startup: on home node only
Fallover: to next priority node in the list
Fallback: if higher priority node becomes available
State of host1_app: online
Nodes configured to provide host1_app: host1 {up} host2 {up}
Node currently providing host1_app: host1 {up}
The node that will provide host1_app if host1 fails is: host2
Resources associated with host1_app:
Service Labels
host1_l1_svc1(10.2.100.1) {online}
Interfaces configured to provide host1_l1_svc1:
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l1_svc2(10.2.101.1) {online}
Interfaces configured to provide host1_l1_svc2:
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l2_svc(10.2.200.1) {online}
Interfaces configured to provide host1_l2_svc:
host1_l2_boot1 {up}
with IP address: 10.2.2.1
on interface: en1
on node: host1 {up}
on network: net_ether_02 {up}
host1_l2_boot2 {up}
with IP address: 10.2.12.1
on interface: en3
on node: host1 {up}
on network: net_ether_02 {up}
host2_l2_boot2 {up}
with IP address: 10.2.12.2
on interface: en3
on node: host2 {up}
on network: net_ether_02 {up}
host2_l2_boot1 {up}
with IP address: 10.2.2.2
on interface: en1
on node: host2 {up}
on network: net_ether_02 {up}
Shared Volume Groups:
host1vg
Application: host2_app
host2_app is started by /usr/sbin/cluster/app/start_host2
host2_app is stopped by /usr/sbin/cluster/app/stop_host2
No application monitors are configured for host2_app.
This application is part of resource group 'host1_RG'.
Resource group policies:
Startup: on home node only
Fallover: to next priority node in the list
Fallback: if higher priority node becomes available
State of host2_app: online
Nodes configured to provide host2_app: host1 {up} host2 {up}
Node currently providing host2_app: host1 {up}
The node that will provide host2_app if host1 fails is: host2
Resources associated with host2_app:
Service Labels
host1_l1_svc1(10.2.100.1) {online}
Interfaces configured to provide host1_l1_svc1:
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l1_svc2(10.2.101.1) {online}
Interfaces configured to provide host1_l1_svc2:
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l2_svc(10.2.200.1) {online}
Interfaces configured to provide host1_l2_svc:
host1_l2_boot1 {up}
with IP address: 10.2.2.1
on interface: en1
on node: host1 {up}
on network: net_ether_02 {up}
host1_l2_boot2 {up}
with IP address: 10.2.12.1
on interface: en3
on node: host1 {up}
on network: net_ether_02 {up}
host2_l2_boot2 {up}
with IP address: 10.2.12.2
on interface: en3
on node: host2 {up}
on network: net_ether_02 {up}
host2_l2_boot1 {up}
with IP address: 10.2.2.2
on interface: en1
on node: host2 {up}
on network: net_ether_02 {up}
Shared Volume Groups:
host1vg
This application is part of resource group 'host2_RG'.
Resource group policies:
Startup: on home node only
Fallover: to next priority node in the list
Fallback: if higher priority node becomes available
State of host2_app: online
Nodes configured to provide host2_app: host2 {up} host1 {up}
Node currently providing host2_app: host2 {up}
The node that will provide host2_app if host2 fails is: host1
Resources associated with host2_app:
Service Labels
host2_l1_svc1(10.2.100.2) {online}
Interfaces configured to provide host2_l1_svc1:
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l1_svc2(10.2.101.2) {online}
Interfaces configured to provide host2_l1_svc2:
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on node: host2 {up}
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on node: host2 {up}
on network: net_ether_01 {up}
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on node: host1 {up}
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on node: host1 {up}
on network: net_ether_01 {up}
host2_l2_svc(10.2.200.2) {online}
Interfaces configured to provide host2_l2_svc:
host2_l2_boot2 {up}
with IP address: 10.2.12.2
on interface: en3
on node: host2 {up}
on network: net_ether_02 {up}
host2_l2_boot1 {up}
with IP address: 10.2.2.2
on interface: en1
on node: host2 {up}
on network: net_ether_02 {up}
host1_l2_boot1 {up}
with IP address: 10.2.2.1
on interface: en1
on node: host1 {up}
on network: net_ether_02 {up}
host1_l2_boot2 {up}
with IP address: 10.2.12.1
on interface: en3
on node: host1 {up}
on network: net_ether_02 {up}
Shared Volume Groups:
host2vg
#############
TOPOLOGY
#############
test_cluster consists of the following nodes: host1 host2
host1
Network interfaces:
host1_1 {up}
device: /dev/mndhb_lv_01
on network: net_diskhbmulti_01 {up}
host1_l1_boot1 {up}
with IP address: 10.2.1.21
on interface: en0
on network: net_ether_01 {up}
host1_l1_boot2 {up}
with IP address: 10.2.11.1
on interface: en2
on network: net_ether_01 {up}
host1_l2_boot1 {up}
with IP address: 10.2.2.1
on interface: en1
on network: net_ether_02 {up}
host1_l2_boot2 {up}
with IP address: 10.2.12.1
on interface: en3
on network: net_ether_02 {up}
host2
Network interfaces:
host2_2 {up}
device: /dev/mndhb_lv_01
on network: net_diskhbmulti_01 {up}
host2_l1_boot2 {up}
with IP address: 10.2.11.2
on interface: en2
on network: net_ether_01 {up}
host2_l1_boot1 {up}
with IP address: 10.2.1.22
on interface: en0
on network: net_ether_01 {up}
host2_l2_boot2 {up}
with IP address: 10.2.12.2
on interface: en3
on network: net_ether_02 {up}
host2_l2_boot1 {up}
with IP address: 10.2.2.2
on interface: en1
on network: net_ether_02 {up}
[host1][root][/]#
3.2.6. /etc/hosts check
Normally the /etc/hosts files of two mutually backing nodes should be identical; in an active/standby setup the standby may of course have a few extra IP addresses and host names. Comparing the two files shows whether there is a problem.
[host1][root][/]>rsh host2 cat /etc/hosts >/tmp/host2_hosts
[host1][root][/]>diff /etc/hosts /tmp/host2_hosts
3.2.7. Script check
Pay attention to the following:
1. When the application changes, the scripts must be corrected promptly, the scripts on both nodes must be kept in sync, and test time must be requested promptly.
2. The previous point requires the operators to communicate fully with the application staff; any change to the production environment must go through the operators.
3. Operators should get into the habit of starting and stopping the application with these scripts and avoid doing it by hand.
[host1][root][/home/scripts]>rsh host2 "cd /home/scripts;ls -l host1 host2 comm" >/tmp/host2_scripts
[host1][root][/home/scripts]>ls -l host1 host2 comm >/tmp/host1_scripts
[host1][root][/]>diff /tmp/host1_scripts /tmp/host2_scripts
3.2.8. User check
Normally the users involved in HA on two mutually backing nodes should be consistent; in an active/standby setup the standby may of course have a few extra users. Comparing the configuration of the two nodes shows whether there is a problem.
[host1][root][/]>rsh host2 lsuser -f orarun,orarunc,tuxrun,bsx1,xcom >/tmp/host2_users
[host1][root][/]>lsuser -f orarun,orarunc,tuxrun,bsx1,xcom >/tmp/host1_users
[host1][root][/]>diff /tmp/host1_users /tmp/host2_users
Note: the two sides will inevitably differ somewhat (last login time and so on); it is enough that the main parts match.
Also compare the .profile files and the user environments on both sides:
[host1][root][/]>rsh host2 su - orarun -c set >/tmp/host2.set
[host1][root][/]> su - orarun -c set >/tmp/host1.set
[host1][root][/]>diff /tmp/host1.set /tmp/host2.set
3.2.9. Heartbeat check
Because the heartbeat links are held by HACMP for as long as it is running, HACMP needs to be force-stopped before checking them directly.
1) Check the heartbeat services:
topsvcs shows the state of the networks, including the heartbeat networks; the error count should be zero, or the error ratio far below 1%.
[host2][root][/]#lssrc -ls topsvcs
Subsystem Group PID Status
topsvcs topsvcs 9371838 active
Network Name Indx Defd Mbrs St Adapter ID Group ID
net_ether_01_0 [ 0] 2 2 S 10.2.1.22 10.2.1.22
net_ether_01_0 [ 0] en0 0x42366504 0x42366d24
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 15690 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 18345 ICMP 0 Dropped: 0
NIM's PID: 7929856
net_ether_01_1 [ 1] 2 2 S 10.2.11.2 10.2.11.2
net_ether_01_1 [ 1] en2 0x42366505 0x42366d25
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 15690 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 18347 ICMP 0 Dropped: 0
NIM's PID: 9044088
net_ether_02_0 [ 2] 2 2 S 10.2.2.2 10.2.2.2
net_ether_02_0 [ 2] en1 0x42366506 0x42366d26
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 15688 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 18345 ICMP 0 Dropped: 0
NIM's PID: 6881402
net_ether_02_1 [ 3] 2 2 S 10.2.12.2 10.2.12.2
net_ether_02_1 [ 3] en3 0x42366507 0x42366d27
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 15687 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 18344 ICMP 0 Dropped: 0
NIM's PID: 6684902
diskhbmulti_0 [ 4] 2 2 S 255.255.10.1 255.255.10.1
diskhbmulti_0 [ 4] rmndhb_lv_01.2_1 0x8236653e 0x82366d48
HB Interval = 3.000 secs. Sensitivity = 6 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 5021 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 4754 ICMP 0 Dropped: 0
NIM's PID: 6553654
2 locally connected Clients with PIDs:
haemd(7602388) hagsd(9699456)
Fast Failure Detection available but off.
Dead Man Switch Enabled:
reset interval = 1 seconds
trip interval = 36 seconds
Client Heartbeating Disabled.
Configuration Instance = 1
Daemon employs no security
Segments pinned: Text Data.
Text segment size: 862 KB. Static data segment size: 1497 KB.
Dynamic data segment size: 8897. Number of outstanding malloc: 269
User time 1 sec. System time 0 sec.
Number of page faults: 151. Process swapped out 0 times.
Number of nodes up: 2. Number of nodes down: 0.
2) Serial (tty) heartbeat check:
u Check the tty speed
Confirm that the speed does not exceed 9600.
[host1][root][/]>stty -a </dev/tty0
[host2][root][/]>cat /etc/hosts >/dev/tty0
host1 shows:
speed 9600 baud; 0 rows; 0 columns;
eucw 1:1:0:0, scrw 1:1:0:0:
….
u Check the connection and configuration
[host1][root][/]>cat /etc/hosts >/dev/tty0
[host2][root][/]>cat </dev/tty0
On host2 you should see the contents of host1's /etc/hosts.
Run the same check in the reverse direction.
3) Disk heartbeat check:
Use dhb_read to confirm the disk heartbeat link
[host1][root][/]#/usr/sbin/rsct/bin/dhb_read -p hdisk5 -r
DHB CLASSIC MODE
First node byte offset: 61440
Second node byte offset: 62976
Handshaking byte offset: 65024
Test byte offset: 64512
Receive Mode:
Waiting for response . . .
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
Link operating normally
[host2][root][/]#/usr/sbin/rsct/bin/dhb_read -p hdisk5 -r
DHB CLASSIC MODE
First node byte offset: 61440
Second node byte offset: 62976
Handshaking byte offset: 65024
Test byte offset: 64512
Receive Mode:
Waiting for response . . .
Magic number = 0x87654321
Magic number = 0x87654321
Magic number = 0x87654321
....
Magic number = 0x87654321
Magic number = 0x87654321
Link operating normally
[host1][root][/]#/usr/sbin/rsct/bin/dhb_read -p hdisk5 -t
DHB CLASSIC MODE
First node byte offset: 61440
Second node byte offset: 62976
Handshaking byte offset: 65024
Test byte offset: 64512
Transmit Mode:
Magic number = 0x87654321
Detected remote utility in receive mode. Waiting for response . . .
Magic number = 0x87654321
Magic number = 0x87654321
Link operating normally
It should end with "Link operating normally"; run the same check in the reverse direction as well.
3.2.10. errpt check
Even with all the checks above, do not neglect errpt, which we look at most often, because some of the errors it reports need attention. HACMP adds a line like this to crontab:
0 0 * * * /usr/es/sbin/cluster/utilities/clcycle 1>/dev/null 2>/dev/null # HACMP for AIX Logfile rotation
That is, at midnight every day the system automatically runs HACMP's housekeeping; if problems are found, they show up in errpt.
Besides errors reported by the HACMP checks, errors can also appear during normal operation, mostly caused by heartbeat connectivity problems or by load so high that the HACMP processes cannot respond in time; these need attention and case-by-case analysis. (A simple errpt filter is sketched below.)
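A quick filter when scanning errpt for cluster-related entries (the error labels vary by level, so treat the grep patterns as an assumption rather than a definitive list):
[host1][root][/]>errpt | more
[host1][root][/]>errpt | egrep -i "TS_|GS_|HACMP|clstrmgr" | more    # topology/group services and cluster manager related labels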
3.3. Changes
The situations that arise during maintenance are far more complex than those during implementation, and even the Redbooks cannot cover them all. Here we only explain the common cases; for more complex or rarer situations please consult the Redbooks, and if all else fails, a planned outage and reconfiguration may be a clumsy but quick way to solve the problem.
In principle these changes are meant to avoid an outage; in practice, although some HACMP changes support DARE (dynamic reconfiguration) and some can be done after a forced stop, we still recommend performing them during a planned outage if at all possible.
As for dynamic DARE, I am not keen on using it, because used improperly it can leave the cluster in an uncontrollable state, which is more dangerous. I generally prefer to force-stop HACMP first, perform the operations below, synchronize and verify, and only then start HACMP again.
3.3.1. Volume group change - adding a disk to a VG in use
Note: the PVID must be recognized on all nodes first, otherwise the disk will be missing or behave abnormally.
1. Run cfgmgr on every node of the cluster and set the PVID
[host1][root][/]>cfgmgr
[host1][root][/]>lspv
….
hdisk2 00f6f1569990a1ef host1vg
hdisk3 00f6f1569990a12c host2vg
hdisk4 none none
[host1][root][/]>chdev -l hdisk2 -a pv=yes
[host1][root][/]>lspv
….
hdisk4 00c1eedffc677bfe none
Do the same on host2.
2. Use C-SPOC to add the disk to host2vg:
smitty hacmp->System Management (C-SPOC)
-> Storage
-> Volume Groups
-> Set Characteristics of a Volume Group
-> Add a Volume to a Volume Group
Select the VG and the disk to add.
Add a Volume to a Volume Group
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
VOLUME GROUP name host2vg
Resource Group Name host2_RG
Node List host1,host2
VOLUME names hdisk4
Physical Volume IDs 00f6f1562fd2853e
After completion, both nodes show:
hdisk3 00f6f1569990a12c host2vg active
hdisk4 00f6f1562fd2853e host2vg active
3.3.2. Logical volume (LV) changes
1) Changes to the LV itself:
Adding LV copies, shrinking, growing and renaming are currently supported.
Here is an example of growing a raw-device LV:
smitty hacmp->System Management (C-SPOC)
-> Storage
-> Shared Logical Volumes
>Set Characteristics of a Logical Volume
-> Increase the Size of a Logical Volume
2) LV attribute changes
The effect is the same as on a standalone system, but operate with caution and consider fully what impact the change will have on the business:
smitty hacmp->System Management (C-SPOC)
-> Storage
->Logical Volume
->Change a Logical Volume
->Change a Logical Volume on the Cluster, then select the LV
Volume Group Name host2vg
Resource Group Name host2_RG
* Logical volume NAME ora11runlv
Logical volume TYPE [jfs2]
POSITION on physical volume outer_middle
RANGE of physical volumes minimum
MAXIMUM NUMBER of PHYSICAL VOLUMES [32]
to use for allocation
Allocate each logical partition copy yes
on a SEPARATE physical volume?
RELOCATE the logical volume during yes
reorganization?
Logical volume LABEL [/ora11run]
MAXIMUM NUMBER of LOGICAL PARTITIONS [512]
SCHEDULING POLICY for writing logical parallel
partition copies
PERMISSIONS read/write
Enable BAD BLOCK relocation? yes
Enable WRITE VERIFY? no
Mirror Write Consistency? active
Serialize I/O? no
3.3.3. File system changes
smitty hacmp->System Management (C-SPOC)
-> Storage
- >File Systems
->Change / Show Characteristics of a File System
Volume group name host1vg
Resource Group Name host1_RG
* Node Names host2,host1
* File system name /ora11runc
NEW mount point [/ora11runc] /
SIZE of file system
Unit Size 512bytes +
Number of Units [10485760] #
Mount GROUP []
Mount AUTOMATICALLY at system restart? no +
PERMISSIONS read/write +
Mount OPTIONS [] +
Start Disk Accounting? no +
Block Size (bytes) 4096
Inline Log? no
Inline Log size (MBytes) [0] #
Extended Attribute Format [v1]
ENABLE Quota Management? no +
Allow Small Inode Extents? [yes] +
Logical Volume for Log host1_loglv
3.3.4. Adding a service IP address (DARE only)
1) Modify /etc/hosts and add the following lines
10.66.201.1 host1_l2_svc2
10.66.201.2 host2_l2_svc2
Note: add them on both sides.
2) Add the service address
smitty hacmp->Extended Configuration
-> HACMP Extended Resources Configuration
-> Configure HACMP Service IP Labels/Addresses
-> Add a Service IP Label/Address
-> Configurable on Multiple Nodes, select the network
-> Add a Service IP Label/Address configurable on Multiple Nodes (extended)
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
* IP Label/Address host1_svc2
* Network Name net_ether_02
Alternate HW Address to accompany IP Label/Address []
Add host2_svc2 in the same way.
3) Update the resource group
smitty hacmp->Extended Configuration
->Extended Resource Configuration
->HACMP Extended Resource Group Configuration
->Change/Show Resources and Attributes for a Resource Group
->Change/Show All Resources and Attributes for a Resource Group
4) Synchronize HACMP
This triggers the new service IP to take effect.
Running netstat -in now shows the address active.
3.3.5. Changing a service IP address
If the address is used by the application service, the application naturally has to be stopped for the change. For example, to change the address from 10.2.200.x to 10.2.201.x and the route to 10.2.201.254, proceed as follows:
1. Stop HACMP normally
smitty clstop ->Bring Resource Groups offline
2. On all nodes, modify /etc/hosts and change the service addresses to the new ones
10.2.201.1 host1_l2_svc host1
10.2.201.2 host2_l2_svc host2
Note: correct /usr/es/sbin/cluster/etc/clhosts at the same time.
3. Modify the route portion of the start script (if needed; see the sketch below)
GATEWAY=10.2.201.254
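In the start script this variable typically feeds a default-route update; a minimal sketch of that fragment (an illustration only, not the author's actual script from the scripts chapter):
route delete 0 >/dev/null 2>&1     # remove the old default route if one is present
route add 0 $GATEWAY               # on AIX, destination 0 means the default route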
4. On one node, modify the HACMP configuration
smitty hacmp->Extended Configuration
-> Extended Resource Configuration
->HACMP Extended Resources Configuration
->Configure HACMP Service IP Labels/Addresses
->Change/Show a Service IP Label/Address, select host1_l2_svc
Make no changes and simply press Enter; modify host2_l2_svc in the same way.
smitty hacmp->Extended Configuration
->Extended Resource Configuration
->HACMP Extended Resource Group Configuration
->Change/Show Resources and Attributes for a Resource Group
->Change/Show All Resources and Attributes for a Resource Group
Select host1_RG
Make no changes and simply press Enter; do the same for host2_RG.
5. Synchronize HACMP
6. Restart HACMP and verify
This triggers the new service IP addresses to take effect.
Note: if the address being changed is not used by the application service, or service on that address can be paused during the change, step 1 can be replaced by a forced stop and step 7 added; the whole procedure can then be done without stopping the application service.
7. Remove the old service IP address
Use netstat -in to find the adapter holding that service IP, for example en2:
ifconfig en2 alias delete 10.2.200.1
3.3.6. Changing a boot address
1. Use smitty tcpip to change the adapter's address
2. Modify the boot address in /etc/hosts
Note: correct /usr/es/sbin/cluster/etc/clhosts at the same time.
3. Modify the HACMP configuration
smitty hacmp ->Extended Configuration
-> Extended Topology Configuration
-> Extended Topology Configuration
Change/Show a Communication Interface
Node Name [bgbcb04]
Network Interface en1
IP Label/Address host1_boot1
Network Type ether
* Network Name [net_ether_01]
Make no changes and simply press Enter; modify the other boot addresses in the same way.
4. Synchronize HACMP
5. Restart HACMP and verify
Note: adjust the startup options so that resources are re-acquired at start, which triggers the new boot IP addresses to take effect; otherwise the boot addresses shown by clstat will be down.
3.3.7. User changes
Changing a user's password
For security-policy reasons the system may require password changes; doing this through HACMP is much more convenient, and avoids the annoyance of failing over long afterwards only to find nobody remembers the password and it has to be forcibly reset.
The only design flaw is that you must be root to use this function.
smitty HACMP ->Extended Configuration
-> Security and Users Configuration
-> Passwords in an HACMP cluster
-> Change a User's Password in the Cluster
Selection nodes by resource group host2_RG
*** No selection means all nodes! ***
* User NAME [orarun]
User must change password on first login? false
At this point you are prompted to enter the new password:
COMMAND STATUS
Command: running stdout: no stderr: no
Before command completion, additional instructions may appear below.
orarun's New password:
Enter the new password again:
OK means it succeeded.
Changing user attributes
The following steps change a user's attributes. Note that although the UID can be changed directly, just as on a standalone operating system the ownership of the user's existing files and directories is not updated automatically and must be fixed by hand afterwards, so it is best to plan UIDs sensibly at the planning stage.
smitty HACMP ->Extended Configuration
-> Security and Users Configuration
->Users in an HACMP cluster
-> Change / Show Characteristics of a User in the Cluster
Select the resource group and the user
Apart from the first line, everything is used exactly as on a standalone operating system.
Change User Attributes on the Cluster
Resource group eai1d0_RG
* User NAME test
User ID [301]
ADMINISTRATIVE USER? false
….