There are several main ways to implement high availability and disaster recovery for an IBM DB2 database:
1. The built-in high availability components (configured with the db2haicu tool shipped in DB2 V9.5 and V9.7);
2. Third-party cluster software (such as HACMP);
3. DB2 HADR, a high availability disaster recovery feature similar to Oracle Data Guard;
4. A combined solution of cluster software plus HADR.
This article walks through the complete implementation and testing of an automated software failover solution, built with the DB2 V9.7 high availability (HA) feature and the DB2 High Availability Instance Configuration Utility (db2haicu), for a single-partition DB2 instance whose database lives on disk storage shared between two machines.
The test environment consists of two virtual machines built with VMware Workstation, a shared storage disk, and Red Hat Linux; clients access the database server through a virtual IP address.
Note: there is no need to install SA MP manually; it is bundled with DB2 9.7.
2 System environment
2.1 Test environment
OS: Red Hat EL 5.3
Database software: DB2 V9.7
Virtualization software: VMware Workstation 6.5
2.2 Virtual machine environment
1. Configure two virtual machines
2. Install the Red Hat system
3. Clone a second virtual machine
Install two virtual hosts in VMware Workstation (alternatively, install and configure one machine, then duplicate it with the VM → Clone function).
4. Add a shared disk
5. Shared file systems:
The following file systems are planned for the instance and database and reside on the shared storage disk:
/db2home
/hafs01
/hafs02
/hafs03
2.3 Logical topology
2.4 Network configuration
2.4.1 Host IP configuration
Node 1:
linux1 130.30.3.252 (255.255.255.0)
Node 2:
linux2 130.30.3.253 (255.255.255.0)
Virtual IP address (VIP):
130.30.3.251
2.4.2 /etc/hosts configuration
/etc/hosts on linux1:
130.30.3.252 linux1 db2host
130.30.3.253 linux2
/etc/hosts on linux2:
130.30.3.252 linux1
130.30.3.253 linux2 db2host
2.4.3 Verify network connectivity
Verify that the network is reachable and the configuration is correct:
ping linux1
ping linux2
ping 130.30.3.252
ping 130.30.3.253
3 DB2 software installation
3.1 Kernel parameter tuning
Log in as root and edit /etc/sysctl.conf to adjust the kernel parameters, adding the following:
kernel.sem=250 256000 32 1024
kernel.shmmax=268435456
kernel.shmall=8388608
kernel.msgmax=65535
kernel.msgmnb=65535
Run sysctl -p to load the settings from /etc/sysctl.conf.
Run ipcs -l to display and verify the current kernel parameter settings:
# ipcs -l
------ Shared Memory Limits --------
max number of segments = 4096 // SHMMNI
max seg size (kbytes) = 32768 // SHMMAX
max total shared memory (kbytes) = 8388608 // SHMALL
min seg size (bytes) = 1
------ Semaphore Limits --------
max number of arrays = 1024 // SEMMNI
max semaphores per array = 250 // SEMMSL
max semaphores system wide = 256000 // SEMMNS
max ops per semop call = 32 // SEMOPM
semaphore max value = 32767
------ Messages: Limits --------
max queues system wide = 1024 // MSGMNI
max size of message (bytes) = 65536 // MSGMAX
default max size of queue (bytes) = 65536 // MSGMNB
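As a quick sanity check, the recommended values from section 3.1 can be compared against the running kernel. The sketch below inlines a sample snapshot of sysctl output (an assumption for illustration); on a real host you would capture it with `sysctl -a 2>/dev/null` instead:

```shell
# Sample snapshot of current kernel settings (on a real host: sysctl -a)
snapshot="kernel.shmmax = 268435456
kernel.shmall = 8388608
kernel.msgmax = 65535"

# Look up one parameter in the snapshot
get() { printf '%s\n' "$snapshot" | awk -F' = ' -v k="$1" '$1==k {print $2}'; }

shmmax=$(get kernel.shmmax)
if [ "$shmmax" -ge 268435456 ]; then
  echo "kernel.shmmax OK ($shmmax)"
else
  echo "kernel.shmmax too small ($shmmax)"
fi
```

The same `get` lookup can be repeated for each parameter in the table above.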
3.2 Adjusting user resource limits
Raise the operating system hard limits on the DB2 instance user's data, nofiles, and fsize resources to large values or unlimited.
This can be done by editing /etc/security/limits.conf:
* soft nproc 3000
* hard nproc 16384
* soft nofile 65536
* hard nofile 65536
The current hard limits can be queried with ulimit -H -f -n -d.
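A small sketch of how the limits.conf entries above can be checked programmatically. The file content is inlined as a sample here; on a real host you would read /etc/security/limits.conf itself:

```shell
# Sample limits.conf content (on a real host: read the file directly)
limits="* soft nproc 3000
* hard nproc 16384
* soft nofile 65536
* hard nofile 65536"

# Extract the hard nofile limit and confirm it meets the recommendation
hard_nofile=$(printf '%s\n' "$limits" | awk '$2=="hard" && $3=="nofile" {print $4}')
if [ "$hard_nofile" -ge 65536 ]; then
  echo "nofile hard limit OK ($hard_nofile)"
else
  echo "nofile hard limit too small ($hard_nofile)"
fi
```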
3.3 Creating the file systems
Create the shared file systems on the primary node with the following steps; once done, import them on the standby machine.
Create the PV and volume group
[root@linux1 db2]# pvcreate /dev/sdb1
[root@linux1 db2]# vgcreate datavg /dev/sdb1
Create the LVs and file systems
[root@linux1 ~]# lvcreate -n db2homelv -L 1G /dev/datavg
Logical volume "db2homelv" created
[root@linux1 ~]# lvcreate -n hafs01lv -L 2G /dev/datavg
Logical volume "hafs01lv" created
[root@linux1 ~]# lvcreate -n hafs02lv -L 500M /dev/datavg
Logical volume "hafs02lv" created
[root@linux1 ~]# lvcreate -n hafs03lv -L 500M /dev/datavg
Logical volume "hafs03lv" created
# mkfs.ext3 /dev/datavg/db2homelv
# mkfs.ext3 /dev/datavg/hafs01lv
# mkfs.ext3 /dev/datavg/hafs02lv
# mkfs.ext3 /dev/datavg/hafs03lv
Create the mount points
# mkdir /db2home
# mkdir /hafs01
# mkdir /hafs02
# mkdir /hafs03
Mount the file systems
# mount /dev/datavg/db2homelv /db2home
# mount /dev/datavg/hafs01lv /hafs01
# mount /dev/datavg/hafs02lv /hafs02
# mount /dev/datavg/hafs03lv /hafs03
Add mount entries
Edit /etc/fstab and add the following entries. The noauto option keeps the system from mounting them at boot, since the cluster manager is responsible for mounting them on whichever node is active:
/dev/datavg/db2homelv /db2home ext3 noauto 0 0
/dev/datavg/hafs01lv /hafs01 ext3 noauto 0 0
/dev/datavg/hafs02lv /hafs02 ext3 noauto 0 0
/dev/datavg/hafs03lv /hafs03 ext3 noauto 0 0
Verify mounting on the standby machine
Mount the shared file systems on the standby machine as well, confirming that they mount correctly and that file system permissions are intact after mounting.
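The mount check can be scripted so it is easy to repeat on either node before starting the instance. The sketch below simulates /proc/mounts with a sample snapshot in which /hafs02 and /hafs03 are deliberately missing; on a real node you would read /proc/mounts itself:

```shell
# Sample /proc/mounts snapshot (on a real node: read /proc/mounts)
mounts="/dev/mapper/datavg-db2homelv /db2home ext3 rw 0 0
/dev/mapper/datavg-hafs01lv /hafs01 ext3 rw 0 0"

# Collect any shared file system that is not currently mounted
missing=""
for fs in /db2home /hafs01 /hafs02 /hafs03; do
  printf '%s\n' "$mounts" | awk -v m="$fs" '$2==m {found=1} END {exit !found}' \
    || missing="$missing $fs"
done
echo "missing:$missing"
```

An empty `missing` list means all four shared file systems are mounted and the instance can be started safely.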
3.4 Creating the groups and users
Create the following groups and users on both the primary and standby machines, making sure both machines use exactly the same UIDs and GIDs.
Create the groups
Create one group for the instance owner (for example, db2iadm1), one for users that will run UDFs or stored procedures (for example, db2fadm1), and one for the administration server (for example, dasadm1) by entering the following commands:
groupadd -g 999 db2iadm1
groupadd -g 998 db2fadm1
groupadd -g 997 dasadm1
Create the users
Create a user for each group created in the previous step with the following commands. Each user's home directory sits under the shared DB2 home file system (/db2home) created earlier:
useradd -u 999 -g db2iadm1 -m -d /db2home/db2inst1 db2inst1
useradd -u 998 -g db2fadm1 -m -d /db2home/db2fenc1 db2fenc1
useradd -u 997 -g dasadm1 -m -d /db2home/dasusr1 dasusr1
Set initial passwords for the new users
# passwd db2inst1
# passwd db2fenc1
# passwd dasusr1
Change the file system ownership
# chown db2inst1:db2iadm1 /db2home
# chown db2inst1:db2iadm1 /hafs01
# chown db2inst1:db2iadm1 /hafs02
# chown db2inst1:db2iadm1 /hafs03
3.5 Installing the DB2 software
Install the DB2 database software on the primary and standby nodes in turn. Installing as root is generally recommended; a non-root installation carries some restrictions.
Extract the downloaded installation media
[root@linux1 db2]# tar zxvf v9.7_linuxia32_server.tar.gz
Run the installation with the db2_install command
[root@linux1 server]# ./db2_install
Choose the DB2 installation directory; the default on Linux is /opt/ibm/db2/V9.7:
Default directory for installation - /opt/ibm/db2/V9.7
***********************************************************
Do you want to choose a different directory to install? [yes/no]
no
Select ESE as the DB2 product to install
Specify one of the following keywords to install DB2 products.
ESE
CONSV
WSE
EXP
PE
CLIENT
RTCL
Enter "help" to redisplay product names.
Enter "quit" to exit.
***********************************************************
ESE
Wait for the installation to complete; a success message is displayed at the end:
DB2 installation is being initialized.
Total number of tasks to be performed: 46
Total estimated time for all tasks to be performed: 1876
Task #1 start
Description: Checking license agreement acceptance
Estimated time 1 second(s)
Task #1 end
…
The execution completed successfully.
For more information see the DB2 installation log at "/tmp/db2_install.log.18940".
3.6 Configuring NTP time synchronization
Synchronizing the time and date across the cluster nodes is recommended, though not required.
4 DB2 instance installation and configuration
4.1 Creating the instance on the shared disk
1. Confirm that the instance user's UID and GID are the same on both nodes
[db2inst1@linux1 ~]$ id
uid=999(db2inst1) gid=999(db2iadm1) groups=999(db2iadm1) context=root:system_r:unconfined_t:SystemLow-SystemHigh
[db2inst1@linux2 ~]$ id
uid=999(db2inst1) gid=999(db2iadm1) groups=999(db2iadm1) context=root:system_r:unconfined_t:SystemLow-SystemHigh
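The two `id` outputs above can be compared automatically, which helps when more nodes or users are involved. In this sketch the outputs are inlined as sample strings (an assumption for illustration); on real hosts you would capture them with `id db2inst1` on each node:

```shell
# id output captured on each node (inlined samples; on real hosts: id db2inst1)
id_linux1="uid=999(db2inst1) gid=999(db2iadm1)"
id_linux2="uid=999(db2inst1) gid=999(db2iadm1)"

# Identical UID/GID on both nodes is required for the shared instance home
if [ "$id_linux1" = "$id_linux2" ]; then
  echo "UID/GID match"
else
  echo "UID/GID mismatch"
fi
```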
2. Mount the shared file systems on the primary node
mount /db2home
mount /hafs01
mount /hafs02
mount /hafs03
3. Create the instance
[root@linux1 ~]# cd /opt/ibm/db2/V9.7/instance
[root@linux1 instance]# ./db2icrt -u db2fenc1 db2inst1
DBI1070I Program db2icrt completed successfully.
[root@linux1 instance]# ./db2ilist
db2inst1
Update the related database manager parameters
[root@linux1 instance]# su - db2inst1
[db2inst1@linux1 ~]$ db2start
01/27/2011 17:37:46 0 0 SQL1063N DB2START processing was successful.
SQL1063N DB2START processing was successful.
[db2inst1@linux1 ~]$ db2set db2comm=tcpip
[db2inst1@linux1 ~]$ cat /etc/services |grep DB2_
DB2_db2inst1 60000/tcp
DB2_db2inst1_1 60001/tcp
DB2_db2inst1_2 60002/tcp
DB2_db2inst1_END 60003/tcp
[db2inst1@linux1 ~]$ db2 update dbm cfg using svcename DB2_db2inst1
DB20000I The UPDATE DATABASE MANAGER CONFIGURATION command completed successfully.
4.2 Creating the administration server
[root@linux1 ~]# cd /opt/ibm/db2/V9.7/instance
[root@linux1 instance]# ./dascrt -u dasusr1
SQL4406W The DB2 Administration Server was started successfully.
DBI1070I Program dascrt completed successfully.
[root@linux1 instance]# su - dasusr1
[dasusr1@linux1 ~]$ db2admin start
SQL4409W The DB2 Administration Server is already active.
4.3 Configuring /etc/services
Creating the instance adds these entries automatically on the node where it runs, but on the standby machine they must be added manually; make sure the entries are identical on the primary and standby nodes.
[db2inst1@linux1 ~]$ cat /etc/services|grep DB2_
DB2_db2inst1 60000/tcp
DB2_db2inst1_1 60001/tcp
DB2_db2inst1_2 60002/tcp
DB2_db2inst1_END 60003/tcp
[root@linux2 ~]# cat /etc/services|grep DB2_
DB2_db2inst1 60000/tcp
DB2_db2inst1_1 60001/tcp
DB2_db2inst1_2 60002/tcp
DB2_db2inst1_END 60003/tcp
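The consistency requirement above can be verified with a simple comparison. The two /etc/services excerpts are inlined as samples here; on real hosts you would collect them with `grep '^DB2_' /etc/services` on each node (for example via ssh):

```shell
# DB2_ service entries from each node (inlined samples;
# on real hosts: grep '^DB2_' /etc/services)
svc_linux1="DB2_db2inst1 60000/tcp
DB2_db2inst1_END 60003/tcp"
svc_linux2="DB2_db2inst1 60000/tcp
DB2_db2inst1_END 60003/tcp"

# Any difference here would break instance startup after a failover
if [ "$svc_linux1" = "$svc_linux2" ]; then
  echo "service entries consistent"
else
  echo "service entries differ"
fi
```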
4.4 Modifying /etc/hosts
Edit /etc/hosts to add a db2host host alias; on each node the alias points to the local machine.
/etc/hosts on node linux1:
130.30.3.252 linux1 db2host
130.30.3.253 linux2
/etc/hosts on node linux2:
130.30.3.252 linux1
130.30.3.253 linux2 db2host
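The key property here is that db2host resolves to the local node's own address on each machine. A sketch of that check, with the hosts file content and local IP inlined as samples for node linux1 (on a real node you would read /etc/hosts and the configured interface address):

```shell
# /etc/hosts content and local IP for node linux1 (inlined samples)
hosts="130.30.3.252 linux1 db2host
130.30.3.253 linux2"
local_ip="130.30.3.252"

# Find the address the db2host alias resolves to
alias_ip=$(printf '%s\n' "$hosts" | awk '{for(i=2;i<=NF;i++) if ($i=="db2host") print $1}')
if [ "$alias_ip" = "$local_ip" ]; then
  echo "db2host points to the local node"
else
  echo "db2host misconfigured ($alias_ip)"
fi
```

Run the same check on linux2 with its own hosts content and 130.30.3.253 as `local_ip`.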
4.5 Modifying db2nodes.cfg
Edit db2nodes.cfg and change the host name to db2host:
[db2inst1@linux1 ~]$ db2stop
2011-01-28 10:06:11 0 0 SQL1064N DB2STOP processing was successful.
SQL1064N DB2STOP processing was successful.
[db2inst1@linux1 ~]$ vi sqllib/db2nodes.cfg
[db2inst1@linux1 ~]$ cat sqllib/db2nodes.cfg
0 db2host 0
[db2inst1@linux1 ~]$ db2start
01/28/2011 10:22:37 0 0 SQL1063N DB2START processing was successful.
SQL1063N DB2START processing was successful.
5 Creating a test database
5.1 Create the database
Create the SAMPLE database for testing:
[db2inst1@linux1 ~]$ db2sampl -dbpath /db2home/db2inst1/sample
5.2 Create tablespaces
[db2inst1@linux1 hafs03]$ db2 "create tablespace tbs1 managed by database using (file '/hafs01/tbs1.1' 500) "
DB20000I The SQL command completed successfully.
[db2inst1@linux1 hafs03]$ db2 "create tablespace tbs2 managed by database using (file '/hafs01/tbs2.1' 500)"
DB20000I The SQL command completed successfully.
[db2inst1@linux1 hafs03]$ db2 "alter tablespace tbs2 add (file '/hafs02/tbs2.2' 500)"
DB20000I The SQL command completed successfully.
6 Configuring the cluster with db2haicu
6.1 Preparation
Before running the db2haicu tool, the primary and standby nodes must be prepared so that they can communicate securely.
As root, run the following commands on every node:
/usr/sbin/rsct/install/bin/recfgct
/usr/sbin/rsct/bin/preprpnode linux1 linux2
/usr/sbin/rsct/install/bin/recfgct resets the node ID; /usr/sbin/rsct/bin/preprpnode initializes the environment and must be run.
[root@linux1 bin]# /usr/sbin/rsct/install/bin/recfgct
0513-071 The ctcas Subsystem has been added.
0513-071 The ctrmc Subsystem has been added.
0513-059 The ctrmc Subsystem has been started. Subsystem PID is 18535.
[root@linux1 bin]# /usr/sbin/rsct/bin/preprpnode linux1 linux2
[root@linux2 ITM]# /usr/sbin/rsct/install/bin/recfgct
0513-071 The ctcas Subsystem has been added.
0513-071 The ctrmc Subsystem has been added.
0513-059 The ctrmc Subsystem has been started. Subsystem PID is 12725.
[root@linux2 ITM]# /usr/sbin/rsct/bin/preprpnode linux1 linux2
In this case the standby node's operating system was a cloned Linux virtual machine image, which produced an error during cluster configuration; running /usr/sbin/rsct/install/bin/recfgct resets the node ID and resolves it.
6.2 Configuring the cluster
db2haicu must run on the node that hosts the DB2 instance, i.e. the primary node. First mount the file systems and start the database manager:
[root@linux1 ~]# mount /db2home
[root@linux1 ~]# mount /hafs01
[root@linux1 ~]# mount /hafs02
[root@linux1 ~]# mount /hafs03
[root@linux1 ~]# su - db2inst1
[db2inst1@linux1 ~]$ db2start
02/11/2011 09:48:13 0 0 SQL1063N DB2START processing was successful.
SQL1063N DB2START processing was successful.
Run the db2haicu command to configure the DB2 HA cluster:
[db2inst1@linux1 ~]$ db2haicu
Welcome to the DB2 High Availability Instance Configuration Utility (db2haicu).
You can find detailed diagnostic information in the DB2 server diagnostic log file called db2diag.log. Also, you can use the utility called db2pd to query the status of the cluster domains you create.
For more information about configuring your clustered environment using db2haicu, see the topic called 'DB2 High Availability Instance Configuration Utility (db2haicu)' in the DB2 Information Center.
db2haicu determined the current DB2 database manager instance is db2inst1. The cluster configuration that follows will apply to this instance.
db2haicu is collecting information on your current setup. This step may take some time as db2haicu will need to activate all databases for the instance to discover all paths ...
When you use db2haicu to configure your clustered environment, you create cluster domains. For more information, see the topic 'Creating a cluster domain with db2haicu' in the DB2 Information Center. db2haicu is searching the current machine for an existing active cluster domain ...
db2haicu did not find a cluster domain on this machine. db2haicu will now query the system for information about cluster nodes to create a new cluster domain ...
db2haicu did not find a cluster domain on this machine. To continue configuring your clustered environment for high availability, you must create a cluster domain; otherwise, db2haicu will exit.
Create a domain and enter a name for it:
Create a domain and continue? [1]
1. Yes
2. No
Create a unique name for the new domain:
ha_domain
Nodes must now be added to the new domain.
Add the nodes; enter the number of nodes and their host names as prompted:
How many cluster nodes will the domain ha_domain contain?
2
Enter the host name of a machine to add to the domain:
linux1
Enter the host name of a machine to add to the domain:
linux2
db2haicu can now create a new domain containing the 2 machines that you specified. If you choose not to create a domain now, db2haicu will exit.
Create the domain now? [1]
1. Yes
2. No
Creating domain ha_domain in the cluster ...
Creating domain ha_domain in the cluster was successful.
You can now configure a quorum device for the domain. For more information, see the topic "Quorum devices" in the DB2 Information Center. If you do not configure a quorum device for the domain, then a human operator will have to manually intervene if subsets of machines in the cluster lose connectivity.
Configure a quorum device. Network quorum is used here: supply a network address that both nodes can always ping; the nodes' gateway is chosen here:
Configure a quorum device for the domain called ha_domain? [1]
1. Yes
2. No
The following is a list of supported quorum device types:
1. Network Quorum
Enter the number corresponding to the quorum device type to be used: [1]
Specify the network address of the quorum device:
130.30.3.129
Configuring quorum device for domain ha_domain ...
Configuring quorum device for domain ha_domain was successful.
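A common way to pick the quorum address is to use the default gateway, since both nodes must be able to reach it at all times. In this sketch the output of `ip route` is inlined as a sample; on a real node you would run the command directly:

```shell
# Sample routing table (on a real node: ip route)
route_out="default via 130.30.3.129 dev eth0
130.30.3.0/24 dev eth0 proto kernel scope link"

# The default gateway is the quorum device candidate
gateway=$(printf '%s\n' "$route_out" | awk '$1=="default" {print $3; exit}')
echo "quorum candidate: $gateway"
```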
The cluster manager found 2 network interface cards on the machines in the domain. You can use db2haicu to create networks for these network interface cards. For more information, see the topic 'Creating networks with db2haicu' in the DB2 Information Center.
Create the networks:
Create networks for these network interface cards? [1]
1. Yes
2. No
Enter the name of the network for the network interface card: eth0 on cluster node: linux1
1. Create a new public network for this network interface card.
2. Create a new private network for this network interface card.
Enter selection:
1
Are you sure you want to add the network interface card eth0 on cluster node linux1 to the network db2_public_network_0? [1]
1. Yes
2. No
Adding network interface card eth0 on cluster node linux1 to the network db2_public_network_0 ...
Adding network interface card eth0 on cluster node linux1 to the network db2_public_network_0 was successful.
Enter the name of the network for the network interface card: eth0 on cluster node: linux2
1. db2_public_network_0
2. Create a new public network for this network interface card.
3. Create a new private network for this network interface card.
Enter selection:
1
Are you sure you want to add the network interface card eth0 on cluster node linux2 to the network db2_public_network_0? [1]
1. Yes
2. No
Adding network interface card eth0 on cluster node linux2 to the network db2_public_network_0 ...
Adding network interface card eth0 on cluster node linux2 to the network db2_public_network_0 was successful.
Configure high availability:
Retrieving high availability configuration parameter for instance db2inst1 ...
The cluster manager name configuration parameter (high availability configuration parameter) is not set. For more information, see the topic "cluster_mgr - Cluster manager name configuration parameter" in the DB2 Information Center. Do you want to set the high availability configuration parameter?
TSA is selected here:
The following are valid settings for the high availability configuration parameter:
1.TSA
2.Vendor
Enter a value for the high availability configuration parameter: [1]
Setting a high availability configuration parameter for instance db2inst1 to TSA.
Now you need to configure the failover policy for the instance db2inst1. The failover policy determines the machines on which the cluster manager will restart the database manager if the database manager is brought offline unexpectedly.
Choose the failover policy; Active/Passive is selected here:
The following are the available failover policies:
1. Local Restart -- during failover, the database manager will restart in place on the local machine
2. Round Robin -- during failover, the database manager will restart on any machine in the cluster domain
3. Active/Passive -- during failover, the database manager will restart on a specific machine
4. M+N -- during failover, the database partitions on one machine will failover to any other machine in the cluster domain (used with DPF instances)
5. Custom -- during failover, the database manager will restart on a machine from a user-specified list
Enter your selection:
3
Decide whether to designate any noncritical mount points; noncritical mount points are not failed over to the other machine:
You can identify mount points that are noncritical for failover. For more information, see the topic 'Identifying mount points that are noncritical for failover' in the DB2 Information Center. Are there any mount points that you want to designate as noncritical? [2]
1. Yes
2. No
Configure the active and passive nodes:
Active/Passive failover policy was chosen. You need to specify the host names of an active/passive pair.
Enter the host name for the active cluster node:
linux1
Enter the host name for the passive cluster node:
linux2
Adding DB2 database partition 0 to the cluster ...
Adding DB2 database partition 0 to the cluster was successful.
Configure the virtual IP address:
Do you want to configure a virtual IP address for the DB2 partition: 0? [2]
1. Yes
2. No
1
Enter the virtual IP address:
130.30.3.251
Enter the subnet mask for the virtual IP address 130.30.3.251: [255.255.255.0]
255.255.255.0
Select the network for the virtual IP 130.30.3.251:
1. db2_public_network_0
Enter selection:
1
Adding virtual IP address 130.30.3.251 to the domain ...
Adding virtual IP address 130.30.3.251 to the domain was successful.
The database created earlier is detected automatically; make it highly available:
The following databases can be made highly available:
Database: SAMPLE
Do you want to make all active databases highly available? [1]
1. Yes
2. No
Adding database SAMPLE to the cluster domain ...
Adding database SAMPLE to the cluster domain was successful.
All cluster configurations have been completed successfully. db2haicu exiting ...
6.3 Checking the cluster status
Once configuration is complete, the HA status can be inspected with the lssam or db2pd commands:
[db2inst1@linux1 hafs03]$ lssam
Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
|- Online IBM.Application:db2_db2inst1_0-rs
|- Online IBM.Application:db2_db2inst1_0-rs:linux1
'- Offline IBM.Application:db2_db2inst1_0-rs:linux2
|- Online IBM.Application:db2mnt-db2home-rs
|- Online IBM.Application:db2mnt-db2home-rs:linux1
'- Offline IBM.Application:db2mnt-db2home-rs:linux2
|- Online IBM.Application:db2mnt-hafs01-rs
|- Online IBM.Application:db2mnt-hafs01-rs:linux1
'- Offline IBM.Application:db2mnt-hafs01-rs:linux2
|- Online IBM.Application:db2mnt-hafs02-rs
|- Online IBM.Application:db2mnt-hafs02-rs:linux1
'- Offline IBM.Application:db2mnt-hafs02-rs:linux2
'- Online IBM.ServiceIP:db2ip_130_30_3_251-rs
|- Online IBM.ServiceIP:db2ip_130_30_3_251-rs:linux1
'- Offline IBM.ServiceIP:db2ip_130_30_3_251-rs:linux2
[db2inst1@linux1 hafs03]$ db2pd -ha
DB2 HA Status
Instance Information:
Instance Name = db2inst1
Number Of Domains = 1
Number Of RGs for instance = 1
Domain Information:
Domain Name = ha_domain
Cluster Version = 2.5.1.4
Cluster State = Online
Number of nodes = 2
Node Information:
Node Name State
--------------------- -------------------
linux1 Online
linux2 Online
Resource Group Information:
Resource Group Name = db2_db2inst1_0-rg
Resource Group LockState = Unlocked
Resource Group OpState = Online
Resource Group Nominal OpState = Online
Number of Group Resources = 5
Number of Allowed Nodes = 2
Allowed Nodes
-------------
linux1
linux2
Member Resource Information:
Resource Name = db2mnt-hafs02-rs
Resource State = Online
Resource Type = Mount
Mount Resource Path = /hafs02
Number of Allowed Nodes = 2
Allowed Nodes
-------------
linux1
linux2
Resource Name = db2mnt-hafs01-rs
Resource State = Online
Resource Type = Mount
Mount Resource Path = /hafs01
Number of Allowed Nodes = 2
Allowed Nodes
-------------
linux1
linux2
Resource Name = db2mnt-db2home-rs
Resource State = Online
Resource Type = Mount
Mount Resource Path = /db2home
Number of Allowed Nodes = 2
Allowed Nodes
-------------
linux1
linux2
Resource Name = db2_db2inst1_0-rs
Resource State = Online
Resource Type = DB2 Partition
DB2 Partition Number = 0
Number of Allowed Nodes = 2
Allowed Nodes
-------------
linux1
linux2
Resource Name = db2ip_130_30_3_251-rs
Resource State = Online
Resource Type = IP
Network Information:
Network Name Number of Adapters
----------------------- ------------------
db2_public_network_0 2
Node Name Adapter Name
----------------------- ------------------
linux1 eth0
linux2 eth0
Quorum Information:
Quorum Name Quorum State
------------------------------------ --------------------
db2_Quorum_Network_130_30_3_129:9_58_50 Online
Fail Offline
Operator Offline
7 Testing the high availability cluster
7.1 Client cataloging
Catalog the database from a client against the cluster's virtual server (VIP) address to verify connectivity:
db2 catalog tcpip node HADB remote 130.30.3.251 server 60000
db2 catalog db sample as HADB at node HADB
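The port used in the catalog command must match the instance's SVCENAME mapping in /etc/services. The sketch below inlines both as samples; on a real system you would read them from `db2 get dbm cfg` and /etc/services:

```shell
# SVCENAME and /etc/services content (inlined samples)
svcename="DB2_db2inst1"
services="DB2_db2inst1 60000/tcp
DB2_db2inst1_END 60003/tcp"
catalog_port=60000

# Resolve the SVCENAME to its port and compare with the cataloged port
svc_port=$(printf '%s\n' "$services" | awk -v s="$svcename" '$1==s {split($2,a,"/"); print a[1]}')
if [ "$svc_port" = "$catalog_port" ]; then
  echo "catalog port matches SVCENAME ($svc_port)"
else
  echo "port mismatch: SVCENAME=$svc_port catalog=$catalog_port"
fi
```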
7.2 Extending a tablespace
Extend the tablespace, creating containers on the /hafs03 file system:
[db2inst1@linux1 hafs03]$ db2 "alter tablespace tbs2 add (file '/hafs03/tbs2.3' 500,file '/hafs03/tbs2.4' 500)"
DB20000I The SQL command completed successfully.
[db2inst1@linux1 hafs03]$
The /hafs03 file system is added to the cluster automatically:
[db2inst1@linux1 hafs03]$ lssam
Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
|- Online IBM.Application:db2_db2inst1_0-rs
|- Online IBM.Application:db2_db2inst1_0-rs:linux1
'- Offline IBM.Application:db2_db2inst1_0-rs:linux2
|- Online IBM.Application:db2mnt-db2home-rs
|- Online IBM.Application:db2mnt-db2home-rs:linux1
'- Offline IBM.Application:db2mnt-db2home-rs:linux2
|- Online IBM.Application:db2mnt-hafs01-rs
|- Online IBM.Application:db2mnt-hafs01-rs:linux1
'- Offline IBM.Application:db2mnt-hafs01-rs:linux2
|- Online IBM.Application:db2mnt-hafs02-rs
|- Online IBM.Application:db2mnt-hafs02-rs:linux1
'- Offline IBM.Application:db2mnt-hafs02-rs:linux2
|- Online IBM.Application:db2mnt-hafs03-rs
|- Online IBM.Application:db2mnt-hafs03-rs:linux1
'- Offline IBM.Application:db2mnt-hafs03-rs:linux2
'- Online IBM.ServiceIP:db2ip_130_30_3_251-rs
|- Online IBM.ServiceIP:db2ip_130_30_3_251-rs:linux1
'- Offline IBM.ServiceIP:db2ip_130_30_3_251-rs:linux2
7.3 Failover testing
This test reboots the primary node, observes the cluster state transition and the standby takeover, and verifies that the database is usable after the takeover. Other failures, such as network outages or host crashes, behave similarly.
7.3.1 Reboot the primary node
Log in to the primary node linux1 as root and reboot the machine.
7.3.2 Watch the standby take over
Log in to the standby machine and check the cluster status; the VIP has been taken over by the standby node:
[root@linux2 ~]# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:0C:29:34:5B:37
inet addr:130.30.3.253 Bcast:130.30.3.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fe34:5b37/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:11839 errors:0 dropped:0 overruns:0 frame:0
TX packets:4740 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1177870 (1.1 MiB) TX bytes:688513 (672.3 KiB)
Interrupt:67 Base address:0x2024
eth0:0 Link encap:Ethernet HWaddr 00:0C:29:34:5B:37
inet addr:130.30.3.251 Bcast:130.30.3.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:67 Base address:0x2024
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:87 errors:0 dropped:0 overruns:0 frame:0
TX packets:87 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:25395 (24.7 KiB) TX bytes:25395 (24.7 KiB)
sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Check the cluster status: the resource group has completed its takeover on the standby, and the original linux1 node's resources now show as 'Failed offline':
[root@linux2 ~]# lssam
Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
|- Online IBM.Application:db2_db2inst1_0-rs
|- Failed offline IBM.Application:db2_db2inst1_0-rs:linux1 Node=Offline
'- Online IBM.Application:db2_db2inst1_0-rs:linux2
|- Online IBM.Application:db2mnt-db2home-rs
|- Failed offline IBM.Application:db2mnt-db2home-rs:linux1 Node=Offline
'- Online IBM.Application:db2mnt-db2home-rs:linux2
|- Online IBM.Application:db2mnt-hafs01-rs
|- Failed offline IBM.Application:db2mnt-hafs01-rs:linux1 Node=Offline
'- Online IBM.Application:db2mnt-hafs01-rs:linux2
|- Online IBM.Application:db2mnt-hafs02-rs
|- Failed offline IBM.Application:db2mnt-hafs02-rs:linux1 Node=Offline
'- Online IBM.Application:db2mnt-hafs02-rs:linux2
|- Online IBM.Application:db2mnt-hafs03-rs
|- Failed offline IBM.Application:db2mnt-hafs03-rs:linux1 Node=Offline
'- Online IBM.Application:db2mnt-hafs03-rs:linux2
'- Online IBM.ServiceIP:db2ip_130_30_3_251-rs
|- Failed offline IBM.ServiceIP:db2ip_130_30_3_251-rs:linux1 Node=Offline
'- Online IBM.ServiceIP:db2ip_130_30_3_251-rs:linux2
7.3.3 linux1 rejoins the cluster
After linux1 finishes rebooting, the cluster software detects it and rejoins it to the cluster automatically. The linux1 node's state changes from 'Failed offline' to 'Offline'.
[root@linux2 ~]# lssam
Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
|- Online IBM.Application:db2_db2inst1_0-rs
|- Offline IBM.Application:db2_db2inst1_0-rs:linux1
'- Online IBM.Application:db2_db2inst1_0-rs:linux2
|- Online IBM.Application:db2mnt-db2home-rs
|- Offline IBM.Application:db2mnt-db2home-rs:linux1
'- Online IBM.Application:db2mnt-db2home-rs:linux2
|- Online IBM.Application:db2mnt-hafs01-rs
|- Offline IBM.Application:db2mnt-hafs01-rs:linux1
'- Online IBM.Application:db2mnt-hafs01-rs:linux2
|- Online IBM.Application:db2mnt-hafs02-rs
|- Offline IBM.Application:db2mnt-hafs02-rs:linux1
'- Online IBM.Application:db2mnt-hafs02-rs:linux2
|- Online IBM.Application:db2mnt-hafs03-rs
|- Offline IBM.Application:db2mnt-hafs03-rs:linux1
'- Online IBM.Application:db2mnt-hafs03-rs:linux2
'- Online IBM.ServiceIP:db2ip_130_30_3_251-rs
|- Offline IBM.ServiceIP:db2ip_130_30_3_251-rs:linux1
'- Online IBM.ServiceIP:db2ip_130_30_3_251-rs:linux2
7.3.4 Verify database availability
Testing confirms that a client command line tool can connect normally to the HADB database through the VIP.
7.3.5 The takeover in the system log
The following system log entries show the complete DB2 HA failure detection and standby takeover sequence: the cluster software detects the node failure, the cluster manager performs recovery, the resource group is taken over by the standby, and the predefined scripts restart and recover the database.
Feb 15 09:21:15 linux2 ConfigRM[2954]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,PeerDomain.C,1.99.16.20,16887 :::CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to PENDING_QUORUM. This state usually indicates that exactly half of the nodes defined in the peer domain are online. In this state cluster resources cannot be recovered although none will be explicitly stopped.
Feb 15 09:21:15 linux2 RecoveryRM[4181]: (Recorded using libct_ffdc.a cv 2):::Error ID: 825....9IRKB/dvv.0ul.x1...................:::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,Protocol.C,1.54.1.23,2555 :::RECOVERYRM_INFO_4_ST A member has left. Node number = 1
Feb 15 09:21:16 linux2 samtb_net[26475]: op=reserve ip=130.30.3.129 rc=0 log=1 count=2
Feb 15 09:21:46 linux2 samtb_net[26807]: op=heartbeat ip=130.30.3.129 rc=0 log=1 count=2
Feb 15 09:21:46 linux2 ConfigRM[2954]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,PeerDomain.C,1.99.16.20,16883 :::CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM. In this state, cluster resources may be recovered and controlled as needed by management applications.
Feb 15 09:21:48 linux2 GblResRM[4182]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,ServiceIP.C,1.60,1867 :::GBLRESRM_IPONLINE IBM.ServiceIP has assigned the address to a device. IBM.ServiceIP 130.30.3.251 eth0:0
Feb 15 09:21:48 linux2 mountV97_start.ksh[26829]: Entered (/hafs03)
Feb 15 09:21:48 linux2 mountV97_start.ksh[26831]: Entered (/hafs02)
Feb 15 09:21:48 linux2 mountV97_start.ksh[26832]: Entered (/hafs01)
Feb 15 09:21:48 linux2 mountV97_start.ksh[26830]: Entered (/db2home)
Feb 15 09:21:48 linux2 avahi-daemon[4764]: Registering new address record for 130.30.3.251 on eth0.
Feb 15 09:21:48 linux2 avahi-daemon[4764]: Withdrawing address record for 130.30.3.251 on eth0.
Feb 15 09:21:48 linux2 avahi-daemon[4764]: Registering new address record for 130.30.3.251 on eth0.
Feb 15 09:21:49 linux2 mountV97_start.ksh[26897]: Returning 0 for /db2home
Feb 15 09:21:49 linux2 db2V97_start.ksh[26923]: Entered /usr/sbin/rsct/sapolicies/db2/db2V97_start.ksh, db2inst1, 0
Feb 15 09:21:50 linux2 kernel: kjournald starting. Commit interval 5 seconds
Feb 15 09:21:50 linux2 kernel: EXT3 FS on dm-7, internal journal
Feb 15 09:21:50 linux2 kernel: EXT3-fs: recovery complete.
Feb 15 09:21:50 linux2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Feb 15 09:21:50 linux2 kernel: kjournald starting. Commit interval 5 seconds
Feb 15 09:21:50 linux2 kernel: EXT3 FS on dm-6, internal journal
Feb 15 09:21:50 linux2 kernel: EXT3-fs: recovery complete.
Feb 15 09:21:50 linux2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Feb 15 09:21:50 linux2 kernel: kjournald starting. Commit interval 5 seconds
Feb 15 09:21:50 linux2 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
Feb 15 09:21:50 linux2 kernel: EXT3 FS on dm-9, internal journal
Feb 15 09:21:50 linux2 kernel: EXT3-fs: recovery complete.
Feb 15 09:21:50 linux2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Feb 15 09:21:50 linux2 mountV97_start.ksh[26934]: Returning 0 for /hafs03
Feb 15 09:21:50 linux2 kernel: kjournald starting. Commit interval 5 seconds
Feb 15 09:21:50 linux2 kernel: EXT3 FS on dm-8, internal journal
Feb 15 09:21:50 linux2 kernel: EXT3-fs: recovery complete.
Feb 15 09:21:50 linux2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Feb 15 09:21:50 linux2 mountV97_start.ksh[27011]: Returning 0 for /hafs01
Feb 15 09:21:51 linux2 mountV97_start.ksh[27014]: Returning 0 for /hafs02
Feb 15 09:22:23 linux2 samtb_net[27563]: op=heartbeat ip=130.30.3.129 rc=0 log=1 count=2
Feb 15 09:22:51 linux2 db2V97_start.ksh[28254]: Returning 0 from /usr/sbin/rsct/sapolicies/db2/db2V97_start.ksh ( db2inst1, 0)
Feb 15 09:22:53 linux2 samtb_net[28262]: op=heartbeat ip=130.30.3.129 rc=0 log=1 count=2
Feb 15 09:23:23 linux2 samtb_net[28674]: op=heartbeat ip=130.30.3.129 rc=0 log=1 count=2
Feb 15 09:23:31 linux2 samtb_net[28825]: op=release ip=130.30.3.129 rc=0 log=1 count=2
Feb 15 09:23:39 linux2 RecoveryRM[4181]: (Recorded using libct_ffdc.a cv 2):::Error ID: 825....PKRKB/TT9/0ul.x1...................:::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,Protocol.C,1.54.1.23,2543 :::RECOVERYRM_INFO_3_ST A new member has joined. Node number = 1
8 Other issues
8.1 DB2 HA restrictions and support
1. File system restrictions
The database manager automatically determines, from the file system type, whether a file system used by the DB2 software needs to be defined to the cluster as a mount resource. There are also restrictions on which file system types can be made highly available.
Only local file systems can be made highly available, for example:
jfs2
ext2
ext3
zfs
The following file systems cannot be made highly available:
Shared file systems, such as NFS
Cluster file systems, such as GPFS and CFS
Any file system mounted on the root (/) directory
Any virtual file system, such as /proc
2. No federation support
CREATE/DROP WRAPPER statements do not add or remove cluster manager mount resources for wrapper library paths.
3. No db2relocatedb support
This solution provides no explicit db2relocatedb support. The db2haicu tool must be rerun to create mount resources for the new database paths and to remove mount resources no longer in use.
4. If multiple domain XML files are supplied to db2haicu, only the portions that apply to the domain running on the local node are processed.
5. The DB2 high availability feature does not support multiple instances in the same resource group, and DB2 resource groups should not depend on each other; such relationships cause unwanted, unexpected behavior between the cluster manager and the DB2 software.
6. If a database is cleaned up manually by deleting all of its storage paths and database directories, the cluster manager does not delete the corresponding mount resources. The db2haicu tool must be used either to remove the highly available database (option 3) or to remove the entire cluster and recreate it (option 1).
8.2 Troubleshooting
For debugging and troubleshooting purposes, the relevant data is recorded in two files, the syslog and the DB2 server diagnostic log (db2diag.log), from which failure causes can be analyzed.
The corresponding paths are $HOME/sqllib/db2dump/db2diag.log and /var/log/messages.
Note:
1. You may receive an error message like this:
"2632-044 The domain cannot be created due to the following errors that were detected while harvesting information from the target nodes:
node1: 2632-068 This node has the same internal identifier as node2 and cannot be included in the domain definition."
This error frequently occurs when a Linux image has been cloned. The cluster configuration is in error and should be reset. To resolve it, run the /usr/sbin/rsct/install/bin/recfgct command on the node named in the error message to reset its node ID,
then continue the setup starting from the preprpnode command.
2. You may also receive an error message like this:
“2632-044 The domain cannot be created due to the following errors that were detected while harvesting information from the target nodes:
node1: 2610-418 Permission is denied to access the resources or resource class specified in this command.”
To resolve this problem, check host name resolution: make sure that all entries for every cluster node and the name server entries are identical in the local /etc/hosts file on all nodes.
To move the resource group to the other node manually:
rgreq -o move -n linux2 db2_db2inst1_0-rg
8.3 References
1. IBM Redbook: High Availability and Disaster Recovery Options for DB2 on Linux, UNIX, and Windows
http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg247363.html
2. IBM Tivoli System Automation for Multiplatforms (Version 2 Release 2) product documentation:
http://publib.boulder.ibm.com/tividd/td/IBMTivoliSystemAutomationforMultiplatforms2.2.html
3. Reliable Scalable Cluster Technology (RSCT) Administration Guide
http://publib.boulder.ibm.com/infocenter/clresctr
4. IBM DB2 9.5 and DB2 9.7 Information Centers for Linux, UNIX, and Windows
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp
9 Appendix
9.1 SA MP installation requirements
For details about licensing of the SA MP bundled with IBM data servers, see the license terms for using IBM Tivoli System Automation for Multiplatforms (SA MP) integrated with IBM data servers.
For more information about software and hardware supported by SA MP, see the supported software and hardware list for IBM Tivoli System Automation for Multiplatforms (SA MP).
If the IBM data server is installed as a non-root user, SA MP can be installed separately from the IBM data server installation media; root authority is still required to install SA MP itself.
9.2 SA MP configuration and management reference
9.2.1 Commands for configuring a TSA cluster domain
preprpnode: prepares the security settings for the nodes to be included in the cluster. When issued, public keys are exchanged among the nodes and the RMC access control list (ACL) is modified so that all nodes of the cluster can access cluster resources.
mkrpdomain: creates a new cluster definition, specifying the cluster name and the list of nodes to add to it.
lsrpdomain: lists information about the cluster to which the node running the command belongs.
startrpdomain / stoprpdomain: bring the cluster online and offline, respectively.
addrpnode: adds a new node to the cluster after the cluster has been defined and is running.
startrpnode / stoprpnode: bring individual nodes in the cluster online and offline, respectively. These commands are often used during system maintenance: the node is stopped, repaired or maintained, then restarted, at which point it rejoins the cluster.
lsrpnode: shows the list of nodes defined in the cluster and the operational state (OpState) of each. Note that this command works only on nodes that are online in the cluster; on an offline node it does not display the node list.
rmrpdomain: removes a defined cluster.
rmrpnode: removes one or more nodes from the cluster definition.
For detailed descriptions of these commands, refer to the following manuals, all of which can be found on the IBM TSA CD:
IBM Reliable Scalable Cluster Technology for Linux, Administration Guide, SA22-7892
IBM Reliable Scalable Cluster Technology for Linux, Technical Reference, SA22-7893
IBM Reliable Scalable Cluster Technology for AIX 5L: Administration Guide, SA22-7889
IBM Reliable Scalable Cluster Technology for AIX 5L: Technical Reference, SA22-7890
9.2.2 Defining and managing a cluster
The following scenario shows how to create a cluster, add nodes to it, and check the status of the IBM TSA daemon (IBM.RecoveryRM).
9.2.2.1 Create a two-node TSA cluster domain
To create the cluster, perform the following steps:
1. Log in as root on every node in the cluster.
2. On each node, set the environment variable CT_MANAGEMENT_SCOPE=2:
export CT_MANAGEMENT_SCOPE=2
3. Issue the preprpnode command on all nodes so that the cluster nodes can communicate with each other:
preprpnode node01 node02
4. Now the cluster, named "SA_Domain" and running on node01 and node02, can be created. The following command can be issued from either node:
mkrpdomain SA_Domain node01 node02
Note: when creating an RSCT peer domain (cluster) with the mkrpdomain command, the peer domain name may contain only the following ASCII characters: A-Z, a-z, 0-9, . (period), and _ (underscore).
5. To view the status of SA_Domain, issue the lsrpdomain command:
Output:
Name       OpState  RSCTActiveVersion  MixedVersions  TSPort  GSPort
SA_Domain  Offline  2.3.3.0            No             12347   12348
The cluster is defined but still offline.
6. Issue the startrpdomain command to bring the cluster online:
startrpdomain SA_Domain
When lsrpdomain is run again, the cluster is still starting up: its OpState is Pending Online.
(output omitted)
After the two-node cluster has been created, a third node can be added to SA_Domain as follows:
1. As root, issue the lsrpdomain command to check whether the cluster is online:
(output omitted)
2. Issue the lsrpnode command to see which nodes are online:
(output omitted)
3. As root, issue the following preprpnode commands so that the existing nodes and the new node can communicate with each other.
Log in to Node3 as root and enter:
preprpnode node01 node02
Log in to Node2 as root and enter:
preprpnode node03
Log in to Node1 as root and enter:
preprpnode node03
Make sure the preprpnode command is executed on all nodes; this is strongly recommended.
4. To add Node3 to the cluster definition, issue the addrpnode command as root on Node1 or Node2 (both of which should already be online in the cluster):
addrpnode node03
As root, issue the lsrpnode command to check the status of all nodes:
(output omitted)
5. As root, start Node3 from an online node:
startrpnode node03
After a short delay, Node3 should be online as well.