Amazon Redshift keys are not enforced - how to prevent duplicate data?

Date: 2022-04-07 01:51:48

Just testing out AWS Redshift. Having discovered some duplicate data on an insert that I'd hoped would simply fail on duplication in the key column, reading the docs reveals that primary key constraints aren't "enforced".

Has anyone figured out how to prevent duplication on a primary key (per the "traditional" expectation)?

Thanks to any Redshift pioneers!

6 Answers

#1


8  

I assign UUIDs when the records are created. If the record is inherently unique, I use type 4 UUIDs (random); when it isn't, I use type 5 (SHA-1 hash) with the natural keys as input.
Then you can follow the instructions from AWS very easily to perform UPSERTs. If your input has duplicates, you should be able to clean up by running SQL like the following against your staging table:

CREATE TABLE cleaned AS
SELECT
  pk_field,
  field_1,
  field_2,
  ...
FROM (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY pk_field ORDER BY pk_field) AS r,
    t.*
  FROM table1 t
) x
WHERE x.r = 1;
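
For reference, the UPSERT pattern AWS describes is a staging-table merge, since Redshift won't trigger an upsert off a key collision. A minimal sketch, assuming a target_table and a staging table that share the key column pk_field (all names here are placeholders):

BEGIN;
-- Remove target rows that are about to be replaced by staged rows
DELETE FROM target_table
USING staging
WHERE target_table.pk_field = staging.pk_field;
-- Insert the already de-duplicated staging contents
INSERT INTO target_table
SELECT * FROM staging;
COMMIT;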

#2


6  

If it is too late to add an identity column to use as a rowid (ALTER won't allow you to add an IDENTITY column in Redshift), you can do this:

  • Fetch all duplicate rows into a temporary table (use DISTINCT to get rid of the dupes)
  • Delete those rows from the main table
  • Reinsert the rows into the main table

Here's a sample (let's assume id is the key you check dupes against, and data_table is your table):

-- Collect the ids that occur more than once
CREATE TEMP TABLE delete_dupe_row_list AS
    SELECT t.id FROM data_table t WHERE t.id IS NOT NULL GROUP BY t.id HAVING COUNT(t.id) > 1;
-- Keep one de-duplicated copy of each affected row
CREATE TEMP TABLE delete_dupe_rows AS
    SELECT DISTINCT d.* FROM data_table d JOIN delete_dupe_row_list l ON l.id = d.id;
-- Swap the duplicates out atomically
START TRANSACTION;
DELETE FROM data_table USING delete_dupe_row_list l WHERE l.id = data_table.id;
INSERT INTO data_table SELECT * FROM delete_dupe_rows;
COMMIT;
DROP TABLE delete_dupe_rows;
DROP TABLE delete_dupe_row_list;

#3


1  

No, you can't do that. For the time being, I think you should just insert the duplicate data (basically duplicate keys) with an extra timestamp column. The table will then hold every version of that particular row, since an update is also an insert; when you query Redshift, make sure you pick the latest one.
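
A minimal sketch of that read-side query, assuming the key column is id, the extra timestamp column is loaded_at, and the table is versioned_table (all hypothetical names):

SELECT id, field_1, field_2
FROM (
  SELECT t.*,
         -- the newest version of each id gets r = 1
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) AS r
  FROM versioned_table t
) x
WHERE x.r = 1;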

#4


1  

A quick and dirty way is to use GROUP BY:

SELECT MAX(<column_a>), MAX(<column_b>), <pk_column1>, <pk_column2>
FROM <table_name>
GROUP BY <pk_column1>, <pk_column2>

#5


1  

Confirmed, they don't enforce it:

Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.

For example, the query planner uses primary and foreign keys in certain statistical computations, to infer uniqueness and referential relationships that affect subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins.

The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.

Amazon Redshift does enforce NOT NULL column constraints.

http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
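
A quick illustration of the behavior described above, using a hypothetical table (the second insert succeeds even though it violates the declared key):

CREATE TABLE demo (
    id  INT PRIMARY KEY,  -- informational only; Redshift will not enforce it
    val VARCHAR(16)
);
INSERT INTO demo VALUES (1, 'first');
INSERT INTO demo VALUES (1, 'second');  -- no error is raised
SELECT COUNT(*) FROM demo;              -- returns 2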

#6


-1  

I'm using IDENTITY to auto-increment my primary key.
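
For context, a minimal table definition using Redshift's IDENTITY(seed, step) clause (table and column names are placeholders):

CREATE TABLE events (
    event_id BIGINT IDENTITY(1, 1),  -- generated automatically; unique, but not guaranteed consecutive
    payload  VARCHAR(256)
);
-- Omit the IDENTITY column and Redshift fills in the value
INSERT INTO events (payload) VALUES ('example row');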

Here is a question I asked on the AWS forums:

https://forums.aws.amazon.com/message.jspa?messageID=450157#450157
