Aminy
作者Aminy·2017-07-03 21:55
项目经理·德州银行

DB2千万级数据去重

字数 4120阅读 551评论 0赞 0

db2 数据库版本8.2,表FACT_WBCKTJB数据为每日增量,日数据量千万,字段rq+zh的索引。发现6月19日数据包含大量完全重复的数据,需要删除重复记录中多余的记录,留下一条。

本来是个很简单的问题,一条sql搞定。

delete from (select rq,zh,row_number() over(partition by rq,zh, order by rq,zh) as row_num from FACT_WBCKTJB)  where row_num >1;

实际考虑到数据量太大,delete+分组+排序的操作导致实例hang住。计划用临时表先挑出重复的数据,再去重导入原表。类似于一下步骤。

--1.创建临时表
create table FACT_WBCKTJB_TMP like FACT_WBCKTJB;
--2.查找重复项
--select * from FACT_WBCKTJB where (rq,zh)) in(select rq,zh from FACT_WBCKTJB group by rq,zh having count(*)>1);
insert into FACT_WBCKTJB_TMP select * from FACT_WBCKTJB where (rq,zh)) in(select rq,zh from FACT_WBCKTJB group by rq,zh having count(*)>1);
--3.删除原表重复数据
delete from FACT_WBCKTJB where (rq,zh)) in(select rq,zh from FACT_WBCKTJB group by rq,zh having count(*)>1);
--4.导入去重后数据
INSERT INTO FACT_WBCKTJB SELECT RQ, ZH, XM, JGH, BZ, YE_YB, YE_MY, YJS, YRJ, NJS, NRJ   FROM  FACT_WBCKTJB_TMP GROUP BY  RQ, ZH, XM, JGH, BZ, YE_YB, YE_MY, YJS, YRJ, NJS, NRJ;

实际操作时,第2步就过不去了,数据量太大,将第2步换成导入19日后全部数据。

insert into FACT_WBCKTJB_TMP select * from FACT_WBCKTJB where rq>='2017/06/19';

这边命令执行了一个小时,坑啊——,万幸没把数据库搞挂。
其实执行到四十分钟时候,数据就已经全部插到TMP表了。

select count(1) from  FACT_WBCKTJB where rq='2017/07/02';
select count(1) from  FACT_WBCKTJB_TMP where rq='2017/07/02';

上面两个条数在四十分钟的时候已经一致了。以为是事务日志的原因,关了事务日志。

db2 commit;
db2 alter table FACT_WBCKTJB activate not logged initially;
db2 commit;

重新执行又是一个小时,没用。临时表还hang住了,drop不掉,查询没数据。8.2没有db2top,用db2pd找到对应的application,然后force掉。

db2pd -db bankdm -locks -transactions -applications -dynamic;
db2 force application 11111;
db2 drop table FACT_WBCKTJB_TMP;

换成用游标的方式把数据导出,为了避免前面步骤的第3步delete操作再卡住,直接把全部数据导出来,最终执行的步骤如下。

--1.创建临时表
--19日以前的正确数据
create table FACT_WBCKTJB_18 like FACT_WBCKTJB;
--19日以后的重复数据
create table FACT_WBCKTJB_19 like FACT_WBCKTJB;
--2.将数据以6月19日为界,导入两个临时表
db2 connect to bankdm ;
db2 "declare C1 cursor for  SELECT *  FROM  DB2INST1.FACT_WBCKTJB  where  rq<'2017/06/19' with ur ";
db2 "LOAD FROM  C1 of cursor insert INTO   DB2INST1.FACT_WBCKTJB_18 NONRECOVERABLE" >>./load.log
db2 "declare C2 cursor for  SELECT *  FROM  DB2INST1.FACT_WBCKTJB  where  rq
>='2017/06/19' with ur ";
db2 "LOAD FROM  C2 of cursor insert INTO   DB2INST1.FACT_WBCKTJB_19 NONRECOVERABLE" >>./load.log
db2 connect reset;
--3.将19日之后的数据去重后导入原表
db2 connect to bankdm ;
db2 "declare C1 cursor for  SELECT RQ, ZH, XM, JGH, BZ, YE_YB, YE_MY, YJS, YRJ, NJS, NRJ   FROM  DB2INST1.FACT_WBCKTJB_19 where  rq>='2017/06/19' and rq<'2017/06/24' GROUP BY  RQ, ZH, XM, JGH, BZ, YE_YB, YE_MY, YJS, YRJ, NJS, NRJ  with ur ";
db2 "LOAD FROM  C1 of cursor insert INTO   DB2INST1.FACT_WBCKTJB NONRECOVERABLE" >>./load.log
db2 "declare C2 cursor for  SELECT RQ, ZH, XM, JGH, BZ, YE_YB, YE_MY, YJS, YRJ, NJS, NRJ   FROM  DB2INST1.FACT_WBCKTJB_19 where  rq>='2017/06/24' and rq<'2017/06/29' GROUP BY  RQ, ZH, XM, JGH, BZ, YE_YB, YE_MY, YJS, YRJ, NJS, NRJ  with ur ";
db2 "LOAD FROM  C2 of cursor insert INTO   DB2INST1.FACT_WBCKTJB NONRECOVERABLE" >>./load.log
db2 "declare C3 cursor for  SELECT RQ, ZH, XM, JGH, BZ, YE_YB, YE_MY, YJS, YRJ, NJS, NRJ   FROM  DB2INST1.FACT_WBCKTJB_19 where  rq>='2017/06/29' GROUP BY  RQ, ZH, XM, JGH, BZ, YE_YB, YE_MY, YJS, YRJ, NJS, NRJ  with ur ";
db2 "LOAD FROM  C3 of cursor insert INTO   DB2INST1.FACT_WBCKTJB NONRECOVERABLE" >>./load.log
db2 connect reset;
--上面的语句放shell里执行了,按日期分了三个批次,执行的时候仅仅用了三四分钟,可能不分批也没事。
--4.导入19日以前的数据
db2 connect to bankdm ;
db2 "declare C1 cursor for  SELECT * FROM  DB2INST1.FACT_WBCKTJB_18 with ur ";
db2 "LOAD FROM  C1 of cursor insert INTO   DB2INST1.FACT_WBCKTJB NONRECOVERABLE" >>./load.log
--5.清理临时表,直接贴执行过程。
$ touch 1.del;
$ db2 connect to bankdm

   Database Connection Information

 Database server        = DB2/AIX64 8.2.0
 SQL authorization ID   = ODS
 Local database alias   = BANKDM

$ db2 "load from  1.del of del   replace into  db2inst1.FACT_WBCKTJB_19 nonrecoverable"
SQL3109N  The utility is beginning to load data from file "/ods/1.del".

SQL2036N  The path for the file or device "/ods/1.del" is not valid.

SQL3107W  There is at least one warning message in the message file.


Number of rows read         = 0
Number of rows skipped      = 0
Number of rows loaded       = 0
Number of rows rejected     = 0
Number of rows deleted      = 0
Number of rows committed    = 0

$ db2 drop table db2inst1.FACT_WBCKTJB_18
$ db2 "load from  1.del of del   replace into  db2inst1.FACT_WBCKTJB_19 nonrecoverable"
SQL3109N  The utility is beginning to load data from file "/ods/1.del".

SQL2036N  The path for the file or device "/ods/1.del" is not valid.

SQL3107W  There is at least one warning message in the message file.


Number of rows read         = 0
Number of rows skipped      = 0
Number of rows loaded       = 0
Number of rows rejected     = 0
Number of rows deleted      = 0
Number of rows committed    = 0

$ db2 drop table db2inst1.FACT_WBCKTJB_18
$ db2 connect reset

总结。
1.cursor是个好东西,最好的数据数据搬运车,这两年工作多亏他了;
2.insert的执行过程还需要细细研究;
3.db2该学习oracle的rowid这种唯一列标识;
4.db2 V8明年终于可以sayounara了;
5.有空的时候需要锻炼锻炼颈椎和腰椎了。

如果觉得我的文章对您有用,请点赞。您的支持将鼓励我继续创作!

0

添加新评论0 条评论

Ctrl+Enter 发表

相关文章

相关问题

相关资料

X社区推广