最近很忙,很长时间没上了,最近DB2 处理的一个case,分享一下!
DB2 EDU代理异常crash导致分区宕库分析处理过程
1.问题:
8点15接到经分应用电话,delete某表大量记录过程程序报错,数据库诊断日志有报错。让检查一下数据库是否正常。
2.分析
登上主机查看日志发现18:24分,数据库5节点报进程号2044578的EDU调度进程crash了,其他节点无法正常与5节点通讯,FCM连接关闭。
2010-05-06-18.24.38.239533+480 E304503685A337 LEVEL: Error
PID : 5501506 TID : 1 PROC : db2sysc 5
INSTANCE: db2inst1 NODE : 005
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:15
DATA #1 :
An EDU crashed. Process id: 2044578, OSS term code: 0x101, signal: 6
2010-05-06-18.24.38.310633+480 I304504023A334 LEVEL: Error
PID : 5501506 TID : 1 PROC : db2sysc 5
INSTANCE: db2inst1 NODE : 005
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:20
MESSAGE : Return code from removing fmp from pool:
DATA #1 : signed integer, 4 bytes
-888
2010-05-06-18.24.38.326266+480 I304504358A321 LEVEL: Error
PID : 5501506 TID : 1 PROC : db2sysc 5
INSTANCE: db2inst1 NODE : 005
FUNCTION: DB2 UDB, base sys utilities, sqleRunSysCtlr, probe:600
DATA #1 :
Reason code for abrupt termination of system controller: 0
2010-05-06-18.24.38.396691+480 I304505011A316 LEVEL: Error
PID : 1778550 TID : 1 PROC : db2fcmr 1
[40;1H[K[7mStandard input[m[40;1H[40;1H[KINSTANCE: db2inst1 NODE : 001
FUNCTION: DB2 UDB, fast comm manager, sqkfTcpServices::handleRecvBufferError, probe:20
MESSAGE : Recv err -2124873621 for node 5; link state 5.
2010-05-06-18.24.38.441904+480 I304505328A316 LEVEL: Error
PID : 2212140 TID : 1 PROC : db2fcmr 6
INSTANCE: db2inst1 NODE : 006
FUNCTION: DB2 UDB, fast comm manager, sqkfTcpServices::handleRecvBufferError, probe:20
MESSAGE : Recv err -2124873621 for node 5; link state 5.
2010-05-06-18.24.38.444594+480 I304505645A316 LEVEL: Error
PID : 176512 TID : 1 PROC : db2fcmr 2
INSTANCE: db2inst1 NODE : 002
FUNCTION: DB2 UDB, fast comm manager, sqkfTcpServices::handleRecvBufferError, probe:20
MESSAGE : Recv err -2124873621 for node 5; link state 5.
2010-05-06-18.24.38.757313+480 I304506870A340 LEVEL: Error
PID : 151936 TID : 1 PROC : db2fcms 3
INSTANCE: db2inst1 NODE : 003
FUNCTION: DB2 UDB, fast comm manager, sqkfSendConduit::ValidateConnectedLinks, probe:100
RETCODE : ZRC=0x8159006B=-2124873621=SQLKF_CONN_CLOSED "FCM connection closed"
检查分布在多个分区的分区表是否可正常访问,查询报错:
db2 => select count(*) from pdw.t_dwu_user_stat_m1004
SQL1229N The current transaction has been rolled back because of a system
error. SQLSTATE=40504
db2 => select dbpartitionnum(CITY_CDE) as dbpartition_num,count(*) as rows from MARTRPT.TB_RPT_MART_USER_M group by dbpartitionnum(CITY_CDE) order by dbpartitionnum(CITY_CDE) desc
SQL1229N The current transaction has been rolled back because of a system
error. SQLSTATE=40504
手工指定连接第5个分区节点,抛出错误,无法连接到130.30.6.38的指定分区:
[DWE3:/db2home/db2inst1/sqllib/db2dump]export DB2NODE=5
[DWE3:/db2home/db2inst1/sqllib/db2dump]db2 terminate
DB20000I The TERMINATE command completed successfully.
[DWE3:/db2home/db2inst1/sqllib/db2dump]db2 connect to bssdb
SQL30081N A communication error has been detected. Communication protocol
being used: "TCP/IP". Communication API being used: "SOCKETS". Location
where the error was detected: "130.30.6.38". Communication function detecting
the error: "connect". Protocol specific error code(s): "79", "*", "*".
SQLSTATE=08001
登录到6.38主机,检查发现pid为2044578的edu进程不存在,db2inst1用户的所有进程都已经退出。
[DWE1:/db2home/db2inst1]ps -ef|grep 2044578
db2inst1 614516 520914 0 21:05:41 pts/3 0:00 grep 2044578
[DWE1:/db2home/db2inst1]ps -ef|grep db2inst1
db2inst1 1298520 520914 3 21:06:12 pts/3 0:00 ps -ef
db2inst1 520914 360492 0 21:05:06 pts/3 0:00 -ksh
db2inst1 5456618 520914 0 21:06:12 pts/3 0:00 grep db2inst1
检查发现其他主机上各个分区正常,考虑手动单独启动130.30.6.38上有4,5两个分区节点,看能否恢复,不行就要重新停掉整个实例,重启数据库管理器。收起