In a DB2 pureScale environment, simulating a primary CF failure: why does the secondary CF fail to take over?


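A DB2 pureScale test environment built on four physical machines.
System environment:
Physical host resources: CPU: 16 cores, MEM: 32 GB
Operating system version: SUSE 11 SP4
Database version: v10.5fp10_linuxx64_universal_fixpack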

Architecture diagram: (images not preserved)

########### Status before the test ########
db2inst1@db2sr01:~> db2instance -list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME


0 MEMBER STARTED db2sr02 db2sr02 NO 0 0 db2sr02
1 MEMBER STARTED db2sr04 db2sr04 NO 0 0 db2sr04
128 CF PRIMARY db2sr01 db2sr01 NO - 0 db2sr01
129 CF PEER db2sr03 db2sr03 NO - 0 db2sr03

HOSTNAME STATE INSTANCE_STOPPED ALERT


db2sr03 ACTIVE NO NO
db2sr01 ACTIVE NO NO
db2sr04 ACTIVE NO NO
db2sr02 ACTIVE NO NO

Manually running db2stop CF 128: db2sr03 takes over as primary CF normally, and MEM1 and MEM2 can read and write the database normally.

db2inst1@db2sr02:~> db2instance -list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME


0 MEMBER STARTED db2sr02 db2sr02 NO 0 0 db2sr02
1 MEMBER STARTED db2sr04 db2sr04 NO 0 0 db2sr04
128 CF PEER db2sr01 db2sr01 NO - 0 db2sr01
129 CF PRIMARY db2sr03 db2sr03 NO - 0 db2sr03

HOSTNAME STATE INSTANCE_STOPPED ALERT


db2sr03 ACTIVE NO NO
db2sr01 ACTIVE NO NO
db2sr04 ACTIVE NO NO
db2sr02 ACTIVE NO NO
db2inst1@db2sr02:~>

############### Test: db2sr01 goes down #####
db2inst1@db2sr01:~> db2instance -list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME


0 MEMBER STARTED db2sr02 db2sr02 NO 0 0 db2sr02
1 MEMBER STARTED db2sr04 db2sr04 NO 0 0 db2sr04
128 CF PRIMARY db2sr01 db2sr01 NO - 0 db2sr01
129 CF PEER db2sr03 db2sr03 NO - 0 db2sr03

HOSTNAME STATE INSTANCE_STOPPED ALERT


db2sr03 ACTIVE NO NO
db2sr01 ACTIVE NO NO
db2sr04 ACTIVE NO NO
db2sr02 ACTIVE NO NO
db2inst1@db2sr01:~>

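After pulling the power on db2sr01, the state is as follows: MEM1 and MEM2 cannot read or write the database, and db2sr03 does not take over as primary CF.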

db2inst1@db2sr03:~> db2instance -list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME


0 MEMBER STARTED db2sr02 db2sr02 NO 0 0 db2sr02
1 MEMBER STARTED db2sr04 db2sr04 NO 0 0 db2sr04
128 CF ERROR db2sr01 db2sr01 YES - 0 db2sr01
129 CF PEER db2sr03 db2sr03 NO - 0 db2sr03

HOSTNAME STATE INSTANCE_STOPPED ALERT


db2sr03 ACTIVE NO NO
db2sr02 ACTIVE NO NO
db2sr04 ACTIVE NO NO
db2sr01 INACTIVE NO YES
There is currently an alert for a member, CF, or host in the data-sharing instance. For more information on the alert, its impact, and how to clear it, run the following command: 'db2cluster -cm -list -alert'.
db2inst1@db2sr03:~>
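The failure state above can also be checked mechanically. The following is a minimal sketch, assuming the whitespace-separated column layout shown in the `db2instance -list` output in this post; the function names `parse_db2instance` and `has_primary_cf` are hypothetical helpers, not part of DB2.

```python
# Sketch: parse `db2instance -list` text and check whether any CF is PRIMARY.
# Assumes the column order shown in this post; helper names are hypothetical.

SAMPLE = """\
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
0 MEMBER STARTED db2sr02 db2sr02 NO 0 0 db2sr02
1 MEMBER STARTED db2sr04 db2sr04 NO 0 0 db2sr04
128 CF ERROR db2sr01 db2sr01 YES - 0 db2sr01
129 CF PEER db2sr03 db2sr03 NO - 0 db2sr03
HOSTNAME STATE INSTANCE_STOPPED ALERT
db2sr03 ACTIVE NO NO
db2sr02 ACTIVE NO NO
db2sr04 ACTIVE NO NO
db2sr01 INACTIVE NO YES
"""


def parse_db2instance(text):
    """Return (instances, hosts) as lists of dicts."""
    instances, hosts = [], []
    for line in text.splitlines():
        fields = line.split()
        # Instance rows: ID TYPE STATE HOME_HOST CURRENT_HOST ALERT ...
        if len(fields) >= 6 and fields[0].isdigit() and fields[1] in ("MEMBER", "CF"):
            instances.append({
                "id": int(fields[0]),
                "type": fields[1],
                "state": fields[2],
                "home_host": fields[3],
                "current_host": fields[4],
                "alert": fields[5],
            })
        # Host rows: HOSTNAME STATE INSTANCE_STOPPED ALERT
        elif len(fields) == 4 and fields[1] in ("ACTIVE", "INACTIVE"):
            hosts.append({"host": fields[0], "state": fields[1], "alert": fields[3]})
    return instances, hosts


def has_primary_cf(instances):
    """True if at least one CF row reports the PRIMARY state."""
    return any(i["type"] == "CF" and i["state"] == "PRIMARY" for i in instances)


instances, hosts = parse_db2instance(SAMPLE)
cf_states = {i["id"]: i["state"] for i in instances if i["type"] == "CF"}
inactive_hosts = [h["host"] for h in hosts if h["state"] == "INACTIVE"]
print("CF states:", cf_states)
print("primary CF present:", has_primary_cf(instances))
print("inactive hosts:", inactive_hosts)
```

Run against the output captured after the power pull, this reports no PRIMARY CF at all, which matches the symptom: CF 129 stays in PEER instead of being promoted.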

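While db2sr01 stayed powered off, the CF remained in the ERROR state and the database was unavailable. After db2sr01 was powered back on and had booted normally, the state was as follows: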

db2inst1@db2sr02:~> db2instance -list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME


0 MEMBER STARTED db2sr02 db2sr02 NO 0 0 db2sr02
1 MEMBER STARTED db2sr04 db2sr04 NO 0 0 db2sr04
128 CF PEER db2sr01 db2sr01 NO - 0 db2sr01
129 CF PRIMARY db2sr03 db2sr03 NO - 0 db2sr03

HOSTNAME STATE INSTANCE_STOPPED ALERT


db2sr03 ACTIVE NO NO
db2sr01 ACTIVE NO NO
db2sr04 ACTIVE NO NO
db2sr02 ACTIVE NO NO

Whenever any one of the four hosts reboots, the shared GPFS file system hangs for roughly 10 seconds or more.
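To put a number on the hang rather than estimating it, one option is to time small synchronous writes on the shared file system while a host is rebooted. This is only a sketch: `probe_write_latency` is a hypothetical helper, and the demo below runs against a temporary directory; on the cluster the directory would be a path on the GPFS mount.

```python
# Sketch: measure how long small fsync'ed writes take in a directory.
# A stall of the shared file system shows up as one very large latency sample.
import os
import tempfile
import time


def probe_write_latency(directory, samples=5, interval=0.0):
    """Write and fsync one byte `samples` times; return latencies in seconds."""
    path = os.path.join(directory, ".io_probe.tmp")
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        with open(path, "wb") as fh:
            fh.write(b"x")
            fh.flush()
            os.fsync(fh.fileno())  # force the write through to storage
        latencies.append(time.monotonic() - start)
        time.sleep(interval)
    os.remove(path)
    return latencies


# Demo against a temporary directory (stand-in for the GPFS mount point).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_latencies = probe_write_latency(demo_dir, samples=3)
    print("max latency: %.3f s" % max(demo_latencies))
```

Running this with a longer sample count and a one-second interval during a host reboot would show whether the stall is really around 10 seconds and whether it lines up with the GPFS recovery messages.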

Logs are as follows: (screenshots not preserved)

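######################################## My own guesses at the cause:
1. Could the GPFS file system be what prevents the switchover from succeeding?

The back-end iSCSI storage is also built on an x86 server: partitions carved from its local disks are mapped to the four database servers. The local disks do not support the SCSI-3 protocol, so when the GPFS file system hits a failure it cannot switch over quickly.

/usr/lpp/mmfs/bin/mmchconfig usePersistentReserve=yes (changing this parameter fails with an error saying the disks do not support it)

########## Is there any way to solve this? Does a parameter need to be changed? I would appreciate help analyzing the cause. ##########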


xukaishi, Systems Engineer, Guangxi integrator

Thank you very much. This afternoon I will post the relevant logs.

Powered off db2sr01.

Logs collected on db2sr03:

tail -f /var/adm/ras/mmfs.log.latest

Thu Jan 3 13:39:09.812 2019: [D] Leave protocol detail info: LA: 65 LFLG: 17184422 LFLG delta: 65
Thu Jan 3 13:39:09.815 2019: [I] Recovering nodes: 10.20.30.11
Thu Jan 3 13:39:09.817 2019: [I] Recovery: db2instance, delay 4 sec. for safe recovery.
Thu Jan 3 13:39:13.833 2019: [I] Recovered 1 nodes for file system db2data.
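The GPFS side of the recovery can be timed directly from these lines. A minimal sketch, assuming the timestamp format shown in this mmfs.log excerpt; `parse_mmfs_line` and `recovery_seconds` are hypothetical helper names.

```python
# Sketch: compute how long GPFS node recovery took from mmfs.log lines.
# Assumes the "Thu Jan 3 13:39:09.812 2019: [I] ..." layout shown above.
from datetime import datetime

LOG_LINES = [
    "Thu Jan 3 13:39:09.812 2019: [D] Leave protocol detail info: LA: 65 LFLG: 17184422 LFLG delta: 65",
    "Thu Jan 3 13:39:09.815 2019: [I] Recovering nodes: 10.20.30.11",
    "Thu Jan 3 13:39:09.817 2019: [I] Recovery: db2instance, delay 4 sec. for safe recovery.",
    "Thu Jan 3 13:39:13.833 2019: [I] Recovered 1 nodes for file system db2data.",
]


def parse_mmfs_line(line):
    """Split a log line into (datetime, message)."""
    stamp, _, message = line.partition(": ")
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S.%f %Y"), message


def recovery_seconds(lines):
    """Seconds between the first 'Recovering nodes' and the first 'Recovered' line."""
    start = end = None
    for line in lines:
        when, message = parse_mmfs_line(line)
        if start is None and "Recovering nodes" in message:
            start = when
        elif start is not None and "Recovered " in message:
            end = when
            break
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


elapsed = recovery_seconds(LOG_LINES)
print("GPFS recovery took %.3f s" % elapsed)
```

For this excerpt the result is about 4 seconds, most of which is the "delay 4 sec. for safe recovery" wait, so in this log GPFS itself reports recovering the file system fairly quickly.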

db2instance -list

db2inst1@db2sr03:~> db2instance -list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME


0 MEMBER STARTED db2sr02 db2sr02 NO 0 0 db2sr02
1 MEMBER STARTED db2sr04 db2sr04 NO 0 0 db2sr04
128 CF ERROR db2sr01 db2sr01 YES - 0 db2sr01
129 CF PEER db2sr03 db2sr03 NO - 0 db2sr03

HOSTNAME STATE INSTANCE_STOPPED ALERT


db2sr03 ACTIVE NO NO
db2sr02 ACTIVE NO NO
db2sr04 ACTIVE NO NO
db2sr01 INACTIVE NO YES
There is currently an alert for a member, CF, or host in the data-sharing instance. For more information on the alert, its impact, and how to clear it, run the following command: 'db2cluster -cm -list -alert'.
db2inst1@db2sr03:~>

lssam

db2sr03:/var/ct/db2domain_20181224120019/log/mc/IBM.GblResRM # lssam
Online IBM.ResourceGroup:ca_db2inst1_0-rg Control=MemberInProblemState Nominal=Online
'- Online IBM.Application:ca_db2inst1_0-rs Control=MemberInProblemState
|- Failed offline IBM.Application:ca_db2inst1_0-rs:db2sr01 Node=Offline
'- Online IBM.Application:ca_db2inst1_0-rs:db2sr03
Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst1_0-rs
|- Online IBM.Application:db2_db2inst1_0-rs:db2sr02
'- Offline IBM.Application:db2_db2inst1_0-rs:db2sr04
Online IBM.ResourceGroup:db2_db2inst1_1-rg Nominal=Online
'- Online IBM.Application:db2_db2inst1_1-rs
|- Offline IBM.Application:db2_db2inst1_1-rs:db2sr02
'- Online IBM.Application:db2_db2inst1_1-rs:db2sr04
Online IBM.ResourceGroup:db2mnt-db2data-rg Nominal=Online
'- Online IBM.Application:db2mnt-db2data-rs
|- Online IBM.Application:db2mnt-db2data-rs:db2sr02
'- Online IBM.Application:db2mnt-db2data-rs:db2sr04
Online IBM.ResourceGroup:db2mnt-db2instance-rg Control=MemberInProblemState Nominal=Online
'- Online IBM.Application:db2mnt-db2instance-rs Control=MemberInProblemState
|- Failed offline IBM.Application:db2mnt-db2instance-rs:db2sr01 Node=Offline
|- Online IBM.Application:db2mnt-db2instance-rs:db2sr02
|- Online IBM.Application:db2mnt-db2instance-rs:db2sr03
'- Online IBM.Application:db2mnt-db2instance-rs:db2sr04
Online IBM.ResourceGroup:idle_db2inst1_997_db2sr02-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_997_db2sr02-rs
'- Online IBM.Application:idle_db2inst1_997_db2sr02-rs:db2sr02
Online IBM.ResourceGroup:idle_db2inst1_997_db2sr04-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_997_db2sr04-rs
'- Online IBM.Application:idle_db2inst1_997_db2sr04-rs:db2sr04
Online IBM.ResourceGroup:idle_db2inst1_998_db2sr02-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_998_db2sr02-rs
'- Online IBM.Application:idle_db2inst1_998_db2sr02-rs:db2sr02
Online IBM.ResourceGroup:idle_db2inst1_998_db2sr04-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_998_db2sr04-rs
'- Online IBM.Application:idle_db2inst1_998_db2sr04-rs:db2sr04
Online IBM.ResourceGroup:idle_db2inst1_999_db2sr02-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_999_db2sr02-rs
'- Online IBM.Application:idle_db2inst1_999_db2sr02-rs:db2sr02
Online IBM.ResourceGroup:idle_db2inst1_999_db2sr04-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_999_db2sr04-rs
'- Online IBM.Application:idle_db2inst1_999_db2sr04-rs:db2sr04
Pending online IBM.ResourceGroup:primary_db2inst1_900-rg Control=MemberInProblemState Nominal=Online
'- Offline IBM.Application:primary_db2inst1_900-rs Control=MemberInProblemState
|- Failed offline IBM.Application:primary_db2inst1_900-rs:db2sr01 Node=Offline
'- Offline IBM.Application:primary_db2inst1_900-rs:db2sr03
Online IBM.Equivalency:ca_db2inst1_0-rg_group-equ
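The lssam tree is long, so it helps to filter it down to the entries that are not simply Online or Offline. A minimal sketch, assuming the "state class:name" line layout shown above; `problem_resources` is a hypothetical helper and the sample is the primary-role group copied from this output.

```python
# Sketch: list lssam entries whose state is neither Online nor Offline.
# Assumes each line contains "<State> IBM.<Class>:<name>" as in this post.
import re

LSSAM_SAMPLE = """\
Pending online IBM.ResourceGroup:primary_db2inst1_900-rg Control=MemberInProblemState Nominal=Online
'- Offline IBM.Application:primary_db2inst1_900-rs Control=MemberInProblemState
|- Failed offline IBM.Application:primary_db2inst1_900-rs:db2sr01 Node=Offline
'- Offline IBM.Application:primary_db2inst1_900-rs:db2sr03
"""

ENTRY = re.compile(
    r"(Failed offline|Pending online|Pending offline|Stuck online|Online|Offline|Unknown)"
    r"\s+(IBM\.\w+):(\S+)"
)


def problem_resources(text):
    """Return (state, class, name) for entries not plainly Online/Offline."""
    found = []
    for line in text.splitlines():
        match = ENTRY.search(line)
        if match and match.group(1) not in ("Online", "Offline"):
            found.append(match.groups())
    return found


problems = problem_resources(LSSAM_SAMPLE)
for state, cls, name in problems:
    print(state, cls, name)
```

On the sample this leaves the primary-role resource group stuck in Pending online and its constituent on db2sr01 in Failed offline, while the constituent on db2sr03 is merely Offline, that is, the primary role was never brought online there.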

TSA log

01/03/19 13:21:07.426482 T(4123798384) _GBD Monitor detect OpState change for resource Name=primary_db2inst1_900-rs OldOpState=5 NewOpState=1 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff7 0x08efa5f8
01/03/19 13:29:45.748125 T(4104928112) _GBD Monitor detect OpState change for resource Name=cacontrol_db2inst1_129_db2sr03 OldOpState=1 NewOpState=2 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff5 0xdb520678
01/03/19 13:29:45.766898 T(4103945072) _GBD Taking application resource offline: Name=primary_db2inst1_900-rs Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff7 0x08efa5f8
01/03/19 13:29:45.766986 T(4103748464) _GBD Monitor detect OpState change for resource Name=primary_db2inst1_900-rs OldOpState=1 NewOpState=6 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff7 0x08efa5f8
01/03/19 13:29:47.062209 T(4123798384) _GBD STOP command for application resource "primary_db2inst1_900-rs" (handle 0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff7 0x08efa5f8) succeeded with exit code 0
01/03/19 13:29:48.032666 T(4104928112) _GBD Monitor detect OpState change for resource Name=primary_db2inst1_900-rs OldOpState=6 NewOpState=2 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff7 0x08efa5f8
01/03/19 13:29:48.050650 T(4103945072) _GBD Taking application resource offline: Name=ca_db2inst1_0-rs Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff6 0x09c38568
01/03/19 13:29:48.050725 T(4103748464) _GBD Monitor detect OpState change for resource Name=ca_db2inst1_0-rs OldOpState=1 NewOpState=6 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff6 0x09c38568
01/03/19 13:29:49.473035 T(4123798384) _GBD STOP command for application resource "ca_db2inst1_0-rs" (handle 0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff6 0x09c38568) succeeded with exit code 0
01/03/19 13:29:50.142250 T(4104928112) _GBD Monitor detect OpState change for resource Name=ca_db2inst1_0-rs OldOpState=6 NewOpState=2 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff6 0x09c38568
01/03/19 13:31:39.917030 T(4104928112) _GBD Monitor detect OpState change for resource Name=cacontrol_db2inst1_129_db2sr03 OldOpState=2 NewOpState=1 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff5 0xdb520678
01/03/19 13:31:40.007051 T(4103945072) _GBD Bringing application resource online: Name=ca_db2inst1_0-rs Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff6 0x09c38568
01/03/19 13:31:40.007552 T(4103748464) _GBD Monitor detect OpState change for resource Name=ca_db2inst1_0-rs OldOpState=2 NewOpState=5 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff6 0x09c38568
01/03/19 13:31:45.062421 T(4123798384) _GBD START command for application resource "ca_db2inst1_0-rs" (handle 0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff6 0x09c38568) succeeded with exit code 0
01/03/19 13:31:45.062549 T(4123798384) _GBD Monitor detect OpState change for resource Name=ca_db2inst1_0-rs OldOpState=5 NewOpState=1 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff6 0x09c38568
01/03/19 13:38:12.155458 T(4103748464) _GBD Running cleanup command "/db2home/db2inst1/sqllib/adm/db2rocme 1 PRIMARY db2inst1 900 CLEANUP" for resource 0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff7 0x08efa5f8. Supporter: 0x0000 0x0000 0x00000000 0x00000000 0x00000000 0x00000000.
01/03/19 13:38:12.157403 T(4103748464) _GBD Running cleanup command "/db2home/db2inst1/sqllib/adm/db2rocme 1 CF db2inst1 128 CLEANUP" for resource 0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff6 0x09c38568. Supporter: 0x6028 0xffff 0xf6b0c7f8 0x181dbce7 0x15732ff3 0xd40ef300.
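The OpState transitions for one resource are easier to follow as a timeline. A minimal sketch, assuming the log line layout above; `opstate_timeline` is a hypothetical helper, and the numeric-to-name mapping is my assumption based on the usual RSCT OpState convention (1 Online, 2 Offline, 5 Pending online, 6 Pending offline), which should be checked against the RSCT documentation.

```python
# Sketch: extract OpState transitions for one resource from the TSA trace.
# OPSTATE names are an assumption (common RSCT convention), not from this log.
import re

OPSTATE = {
    0: "Unknown", 1: "Online", 2: "Offline", 3: "Failed offline",
    4: "Stuck online", 5: "Pending online", 6: "Pending offline",
}

TRACE_LINES = [
    "01/03/19 13:21:07.426482 T(4123798384) _GBD Monitor detect OpState change for resource Name=primary_db2inst1_900-rs OldOpState=5 NewOpState=1 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff7 0x08efa5f8",
    "01/03/19 13:29:45.766986 T(4103748464) _GBD Monitor detect OpState change for resource Name=primary_db2inst1_900-rs OldOpState=1 NewOpState=6 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff7 0x08efa5f8",
    "01/03/19 13:29:48.032666 T(4104928112) _GBD Monitor detect OpState change for resource Name=primary_db2inst1_900-rs OldOpState=6 NewOpState=2 Handle=0x6028 0xffff 0x6b8042d8 0xa01d9c44 0x15732ff7 0x08efa5f8",
]

CHANGE = re.compile(
    r"^(\S+ \S+) .*?OpState change for resource Name=(\S+) "
    r"OldOpState=(\d+) NewOpState=(\d+)"
)


def opstate_timeline(lines, resource):
    """Return [(timestamp, old_state, new_state)] for the given resource name."""
    timeline = []
    for line in lines:
        match = CHANGE.match(line)
        if match and match.group(2) == resource:
            old = OPSTATE.get(int(match.group(3)), match.group(3))
            new = OPSTATE.get(int(match.group(4)), match.group(4))
            timeline.append((match.group(1), old, new))
    return timeline


timeline = opstate_timeline(TRACE_LINES, "primary_db2inst1_900-rs")
for stamp, old, new in timeline:
    print(stamp, old, "->", new)
```

Under that mapping, the last transition for primary_db2inst1_900-rs in the excerpt is to Offline at 13:29:48, and no later transition back to Online appears in the lines posted here.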

Attached is the detailed TSA log trace.8.sp.txt

/var/ct/db2domain_20181224120019/log/mc/IBM.RecoveryRM/trace.3.sp

Attachments:

trace.8.sp.txt (2.06 MB)

trace.3.sp.txt (1.46 MB)

 2019-01-03
  • What is the output of db2cluster -cm -list -alert? Also check the db2diag logs of the CF and the members to see what they were processing at the time.
    2019-01-03
  • I checked db2diag and found no notable errors.
    2019-01-03
  • db2sdin1@suse2:~> db2cluster -cm -list -alert
    1. Alert: Cluster node 'suse1' is not responding and has been placed in the INACTIVE state.
    Action: Correct the condition causing this alert by performing the following troubleshooting steps:
    1) Verify that the host machine is powered on.
    2) Verify the status of the cluster manager by issuing the 'db2cluster -cm -list -host -state' command on the host machine. If the cluster manager is stopped on the host machine, start the cluster manager by issuing the 'db2cluster -cm -start -host <hostname>' command.
    3) Verify the connectivity of the host machine.
    This alert will automatically clear itself when the host is ACTIVE.
    Impact: While the host is INACTIVE, the DB2 members on this host will be in restart light mode on other hosts and will be in the WAITING_FOR_FAILBACK state. Any CF defined on the host will not be able to start, and the host will not be available as a target for restart light.
    2019-01-03
