GreenPlum高效去除表重复数据

时间:2024-03-03 06:59:12

1.针对PostgreSQL数据库表的去重复方法基本有三种,这是在网上查找的方法,在附录1给出。但是这些方法对GreenPlum来说都不管用。

 

2.数据表分布在不同的节点上,每个节点的ctid是唯一的,但是不同的节点就有ctid重复的可能,因此GreenPlum必须借助gp_segment_id来进行去重复处理。

 

3.在网上找到了一个相对繁琐的方法,在附录2给出:

 

4.最终的方法是:

delete from test where (gp_segment_id, ctid) not in (select gp_segment_id, min(ctid) from test group by x, gp_segment_id);

 

验证通过。

 

附录1:PostgreSQL数据表去重复的三种方法:

引用自:http://my.oschina.net/swuly302/blog/144933

 

采用PostgreSQL 9.2 官方文档例子为例: 

CREATE TABLE weather (
city      varchar(80),
temp_lo   int,          -- low temperature
temp_hi   int,          -- high temperature
prcp      real,         -- precipitation
date      date
);

INSERT INTO weather VALUES
(\'San Francisco\', 46, 50, 0.25, \'1994-11-27\'),
(\'San Francisco\', 43, 57, 0, \'1994-11-29\'),
(\'Hayward\', 37, 54, NULL, \'1994-11-29\'),
(\'Hayward\', 37, 54, NULL, \'1994-11-29\');   --- duplicated row

 

这里有3中方法: 

第一种:替换法 

-- 剔除重复行的数据转存到新表weather_temp
SELECT DISTINCT city, temp_lo, temp_hi, prcp, date 
INTO weather_temp 
FROM weather; 
-- 删除原表
DROP TABLE weather;
-- 将新表重命名为weather
ALTER TABLE weather_temp RENAME TO weather;
或者 

-- 创建与weather一样的表weather_temp
CREATE TABLE weather_temp (LIKE weather INCLUDING CONSTRAINTS);
-- 用剔除重复行的数据填充到weather_temp中
INSERT INTO weather_temp SELECT DISTINCT * FROM weather;
-- 删除原表
DROP TABLE weather;
-- 将新重命名为weather.
ALTER TABLE weather_temp RENAME TO weather;
通俗易懂,有很多毁灭性的操作如DROP,而且当数据量大时,耗时耗空间。不推荐。 

第二种: 添加字段法
-- 添加一个新字段,类型为serial
ALTER TABLE weather ADD COLUMN id SERIAL;
-- 删除重复行
DELETE FROM weather WHERE id 
NOT IN (
SELECT max(id) 
FROM weather 
GROUP BY city, temp_lo, temp_hi, prcp, date
);
-- 删除添加的字段
ALTER TABLE weather DROP COLUMN id;
需要添加字段,「暂时不知道Postgres是如何处理添加字段的,是直接在原表追加呢,还是复制原表组成新表呢?」,如果是原表追加,可能就会因为新字段的加入而导致分页(一般block: 8k),如果是复制的话那就罪过了。不好。 

第三种:系统字段[查看 System Columns] 

DELETE FROM weather 
WHERE ctid 
NOT IN (
SELECT max(ctid) 
FROM weather 
GROUP BY city, temp_lo, temp_hi, prcp, date
);
针对性强[Postgres独有],但是简单。

 

 

 

----------------但是对GreenPlum的表来说,表分割在各个节点上,不能单纯的用ctid来做去重复处理。

 

附录2:

https://discuss.pivotal.io/hc/zh-cn/community/posts/206428018-What-is-the-most-efficient-way-of-deleting-duplicate-records-from-a-table-

What is the most efficient way of deleting duplicate records from a table?

Currently we use Primary Keys to avoid loading duplicate data into our tables, but PK brings many restrictions. Since we can’t easily identify or prevent duplicates arriving from the variety of 3rd party upstream systems, we wanted to investigate the ‘load everything, remove duplicates afterwards’ approach.

In Postgres, you can use an efficient method such as:

DELETE FROM test
WHERE ctid NOT IN (
SELECT min(ctid)
FROM test
GROUP BY x); 
(where \'x\' is the unique column list)

 

However in Greenplum ‘ctid’ is only unique per segment.

One approach would be:

DELETE FROM test USING 
(select gp_segment_id, ctid from 
(select gp_segment_id, ctid, rank() over (partition by x order by gp_segment_id, ctid) as rk from test ) foo 
WHERE rk <> 1) rows_to_delete 
WHERE test.gp_segment_id=rows_to_delete.gp_segment_id 
AND test.ctid=rows_to_delete.ctid;

 

But the use of window functions, subqueries etc. feels pretty inefficient.

Is there a better form?

Note that in our use case our unique column list varies up to ~10 columns so we don’t have a single unique key field – hence the RANK in the example. I suppose adding a sequence column could be used, but how much overhead does this add when doing bulk data loading?