如何在所有列中找到由键重复而非重复的行?

时间:2021-12-09 23:54:19

I am working with a table which is an extract of a set of other tables. All of the rows of the extract table should be unique according to keys D1, D2 and D3. They are not. It appears that an earlier developer attempted to solve this problem by using a SELECT DISTINCT across all columns being queried from this table. This will work, but only if every row which is a duplicate on (D1, D2, D3) is also a duplicate across the non-key columns (ignoring the IDENTITY column that was added to the extract table).

我正在处理一个表,它是一组其他表的摘录。根据键D1、D2和D3,提取表的所有行都应该是唯一的。他们不是。前面的开发人员似乎试图通过在从该表查询的所有列中使用SELECT DISTINCT来解决这个问题。这将起作用,但只有在(D1、D2、D3)上的每一行都是跨非键列的副本(忽略被添加到提取表的标识列)。

In other words, given rows as follows:

换句话说,给定行如下:

D1  D2  D3  C4  C5  C6
=== === === === === ===
A   B   C   X1  X2  X3
A   B   C   X1  X2  X3

then

然后

SELECT DISTINCT D1, D2, D3, C4, C5, C6
FROM BAD_TABLE

will "work", as there's no difference between the rows which are duplicated on (D1,D2,D3). But if the table contained

将“工作”,因为在(D1、D2、D3)上复制的行之间没有区别。但如果桌子上有

D1  D2  D3  C4  C5  C6
=== === === === === ===
A   B   C   X1  X2  X3
A   B   C   X1  X2  X4

then SELECT DISTINCT would return two rows for the key (A,B,C). Furthermore, we would have to decide which of X3 or X4 was the "correct" value.

然后SELECT DISTINCT将为键返回两行(A、B、C)。此外,我们还必须决定X3和X4中哪个是“正确的”值。

I know how to find the duplicates on (D1,D2,D3). I even know how to find the duplicates across all the columns (other than the IDENTITY column):

我知道如何找到副本(D1,D2,D3)。我甚至知道如何在所有列(除了标识列)中查找重复的内容:

;
WITH DUPLICATES(D1,D2,D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*)>1
)
SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATES D
    ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3
ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6

The question is, how do I find the subset of the above resultset which are duplicates on (D1,D2,D3), but not duplicates on (D1,D2,D3,C4,C5,C6)?

问题是,如何找到上面的结果集的子集,它们是在(D1,D2,D3)上重复的,而不是在(D1,D2,D3,C4,C5,C6)上重复的?

5 个解决方案

#1


4  

You can do it by joining the table on itself, saying the D's are all equal and at least one of the C's are not equal.

你可以把它和表连在一起,说D都是相等的而且至少有一个C是不相等的。

CREATE TABLE #Source (
    D1 VARCHAR(2),
    D2 VARCHAR(2),
    D3 VARCHAR(2),
    C4 VARCHAR(2),
    C5 VARCHAR(2),
    C6 VARCHAR(2) );

INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X4');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');

SELECT S1.D1, S1.D2, S1.D3, S1.C4 C4_1, S2.C4 C4_2, S1.C5 C5_1, S2.C5 C5_2, S1.C6 C6_1, S2.C6 C6_2
FROM
    #Source S1
            INNER JOIN
    #Source S2
            ON
        (       S1.D1 = S2.D1 
            AND S1.D2 = S2.D2
            AND S1.D3 = S2.D3
            AND (   S1.C4 <> S2.C4
                 OR S1.C5 <> S2.C5
                 OR S1.C6 <> S2.C6
                 )
        );

DROP TABLE #Source;

Gives the following results:

给下面的结果:

D1   D2   D3   C4_1 C4_2 C5_1 C5_2 C6_1 C6_2
---- ---- ---- ---- ---- ---- ---- ---- ----
A    B    C    X1   X1   X2   X2   X4   X3
A    B    C    X1   X1   X2   X2   X3   X4

Also note that this is compatible with MS SQL 2000 as you later indicated is required in How to Convert a SQL Query using Common Table Expressions to One Without (for SQL Server 2000).

还要注意,这与MS SQL 2000是兼容的,您稍后会指出,如何使用公共表表达式将SQL查询转换为没有的SQL查询(对于SQL Server 2000)。

#2


3  

I haven't had a chance to try Conrad's answer yet, but came up with one of my own. It's rather a "duh" moment.

我还没有机会尝试康拉德的答案,但我想到了我自己的答案。这是一个相当“废话”的时刻。

So, if you want to find all the rows in set A except for those that are in set B, you use the EXCEPT operator:

所以,如果你想要找到集合A中的所有行除了集合B中的所有行,你可以使用除运算符:

; 
WITH KEYDUPLICATES(D1,D2,D3) AS 
( 
    SELECT D1, D2, D3 
    FROM SOURCE 
    GROUP BY D1, D2, D3 
    HAVING COUNT(*)>1 
),
KEYDUPLICATEROWS AS
( 
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S 
    INNER JOIN KEYDUPLICATES D 
        ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3 
),
FULLDUPLICATES AS
(
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S
    GROUP BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    HAVING COUNT(*)>1
)
SELECT KR.D1, KR.D2, KR.D3, KR.C4, KR.C5, KR.C6
FROM KEYDUPLICATEROWS AS KR
EXCEPT
SELECT FD.D1, FD.D2, FD.D3, FD.C4, FD.C5, FD.C6
FROM FULLDUPLICATES AS FD
ORDER BY D1, D2, D3, C4, C5, C6

This seems to be showing me 1500 rows which are duplicates across (D1,D2,D3), but which are only duplicates across a subset of (D1,D2,D3,C4,C5,C6). In fact, it appears they are duplicates across (D1,D2,D3,C4,C5).

这似乎显示了1500行是重复的(D1,D2,D3),但是它们只是在一个子集(D1,D2,D3,C4,C5,C6)中重复。事实上,它们似乎是跨(D1,D2,D3,C4,C5)的重复。

How to confirm that will be the subject of another question.

如何确认这将是另一个问题的主题。

#3


2  

Any reason you don't just create another table expression to cover more fields and join to that one?

有什么理由不创建另一个表表达式来覆盖更多的字段并加入到其中?

WITH DUPLICATEKEY(D1,D2,D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*)>1
)
WITH NODUPES(D1,D2,D3,C4,C5,C6) AS
(
SELECT 
S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
GROUP BY
 S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
HAVING COUNT(*)=1
)

SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATEKEY D
    ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3

INNER JOIN NODUPES D2
    ON S.D1 = D2.D1 AND S.D2 = D2.D2 AND S.D3 = D2.D3

ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6

#4


2  

This would have performance limitations, but is much easier to understand:

这将有性能上的限制,但更容易理解:

SELECT D1, D2, D3
FROM TEST_KEY TK
WHERE (D1, D2, D3) IN
        (SELECT D1, D2, D3 FROM TEST_KEY TK2
         GROUP BY D1, D2, D3
         HAVING COUNT(*) > 1)
  AND (D1, D2, D3) IN
        (SELECT D1, D2, D3 FROM TEST_KEY TK2
         GROUP BY D1, D2, D3, C4, C5, C6
         HAVING COUNT(*) < 2)

Unable to test on SQL-Server, hope the syntax is good.

无法在SQL-Server上进行测试,希望语法良好。

Again, not sure if you have analytic functions in SQL-Server, but this one works in Oracle and might be faster:

同样,不确定SQL-Server中是否有解析函数,但是这个函数在Oracle中工作,可能更快:

WITH BAD_DUP AS (
SELECT TK.*,
       COUNT(1) OVER (PARTITION BY D1, D2, D3, C4, C5, C6 ORDER BY D1) FULL_DUP,
       COUNT(1) OVER (PARTITION BY D1, D2, D3 ORDER BY D1) KEY_DUP
FROM TEST_KEY TK)
SELECT * FROM BAD_DUP
WHERE FULL_DUP < KEY_DUP

Would like to get it down to a single query....

想把它弄下来一个查询....

#5


2  

I know this is an old question, but I saw activity on the question and the technique I always use for these is not presented here as an answer, and it's really quite simple, so I figured I'd present it.

我知道这是一个老问题,但我看到了这个问题的活动,我经常使用的技巧并没有在这里作为答案,它真的很简单,所以我想我应该把它呈现出来。

SELECT D1, D2, D3, MIN(C4), MAX(C4), MIN(C5), MAX(C5), MIN(C6), MAX(C6)
FROM BAD_TABLE
GROUP BY D1, D2, D3
HAVING MIN(C4) <> MAX(C4)
    OR MIN(C5) <> MAX(C5)
    OR MIN(C6) <> MAX(C6)

This will show all the keys of duplicates on key but with differences on non-keys, with the range of differences duplicates.

这将显示在key上的重复键的所有键,但是在非键上的差异,以及差异重复的范围。

To see all the rows within that, you would need to join back to BAD_TABLE as your example in the original question.

要查看其中的所有行,您需要返回BAD_TABLE作为原始问题中的示例。

#1


4  

You can do it by joining the table on itself, saying the D's are all equal and at least one of the C's are not equal.

你可以把它和表连在一起,说D都是相等的而且至少有一个C是不相等的。

CREATE TABLE #Source (
    D1 VARCHAR(2),
    D2 VARCHAR(2),
    D3 VARCHAR(2),
    C4 VARCHAR(2),
    C5 VARCHAR(2),
    C6 VARCHAR(2) );

INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X4');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');

SELECT S1.D1, S1.D2, S1.D3, S1.C4 C4_1, S2.C4 C4_2, S1.C5 C5_1, S2.C5 C5_2, S1.C6 C6_1, S2.C6 C6_2
FROM
    #Source S1
            INNER JOIN
    #Source S2
            ON
        (       S1.D1 = S2.D1 
            AND S1.D2 = S2.D2
            AND S1.D3 = S2.D3
            AND (   S1.C4 <> S2.C4
                 OR S1.C5 <> S2.C5
                 OR S1.C6 <> S2.C6
                 )
        );

DROP TABLE #Source;

Gives the following results:

给下面的结果:

D1   D2   D3   C4_1 C4_2 C5_1 C5_2 C6_1 C6_2
---- ---- ---- ---- ---- ---- ---- ---- ----
A    B    C    X1   X1   X2   X2   X4   X3
A    B    C    X1   X1   X2   X2   X3   X4

Also note that this is compatible with MS SQL 2000 as you later indicated is required in How to Convert a SQL Query using Common Table Expressions to One Without (for SQL Server 2000).

还要注意,这与MS SQL 2000是兼容的,您稍后会指出,如何使用公共表表达式将SQL查询转换为没有的SQL查询(对于SQL Server 2000)。

#2


3  

I haven't had a chance to try Conrad's answer yet, but came up with one of my own. It's rather a "duh" moment.

我还没有机会尝试康拉德的答案,但我想到了我自己的答案。这是一个相当“废话”的时刻。

So, if you want to find all the rows in set A except for those that are in set B, you use the EXCEPT operator:

所以,如果你想要找到集合A中的所有行除了集合B中的所有行,你可以使用除运算符:

; 
WITH KEYDUPLICATES(D1,D2,D3) AS 
( 
    SELECT D1, D2, D3 
    FROM SOURCE 
    GROUP BY D1, D2, D3 
    HAVING COUNT(*)>1 
),
KEYDUPLICATEROWS AS
( 
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S 
    INNER JOIN KEYDUPLICATES D 
        ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3 
),
FULLDUPLICATES AS
(
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S
    GROUP BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    HAVING COUNT(*)>1
)
SELECT KR.D1, KR.D2, KR.D3, KR.C4, KR.C5, KR.C6
FROM KEYDUPLICATEROWS AS KR
EXCEPT
SELECT FD.D1, FD.D2, FD.D3, FD.C4, FD.C5, FD.C6
FROM FULLDUPLICATES AS FD
ORDER BY D1, D2, D3, C4, C5, C6

This seems to be showing me 1500 rows which are duplicates across (D1,D2,D3), but which are only duplicates across a subset of (D1,D2,D3,C4,C5,C6). In fact, it appears they are duplicates across (D1,D2,D3,C4,C5).

这似乎显示了1500行是重复的(D1,D2,D3),但是它们只是在一个子集(D1,D2,D3,C4,C5,C6)中重复。事实上,它们似乎是跨(D1,D2,D3,C4,C5)的重复。

How to confirm that will be the subject of another question.

如何确认这将是另一个问题的主题。

#3


2  

Any reason you don't just create another table expression to cover more fields and join to that one?

有什么理由不创建另一个表表达式来覆盖更多的字段并加入到其中?

WITH DUPLICATEKEY(D1,D2,D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*)>1
)
WITH NODUPES(D1,D2,D3,C4,C5,C6) AS
(
SELECT 
S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
GROUP BY
 S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
HAVING COUNT(*)=1
)

SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATEKEY D
    ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3

INNER JOIN NODUPES D2
    ON S.D1 = D2.D1 AND S.D2 = D2.D2 AND S.D3 = D2.D3

ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6

#4


2  

This would have performance limitations, but is much easier to understand:

这将有性能上的限制,但更容易理解:

SELECT D1, D2, D3
FROM TEST_KEY TK
WHERE (D1, D2, D3) IN
        (SELECT D1, D2, D3 FROM TEST_KEY TK2
         GROUP BY D1, D2, D3
         HAVING COUNT(*) > 1)
  AND (D1, D2, D3) IN
        (SELECT D1, D2, D3 FROM TEST_KEY TK2
         GROUP BY D1, D2, D3, C4, C5, C6
         HAVING COUNT(*) < 2)

Unable to test on SQL-Server, hope the syntax is good.

无法在SQL-Server上进行测试,希望语法良好。

Again, not sure if you have analytic functions in SQL-Server, but this one works in Oracle and might be faster:

同样,不确定SQL-Server中是否有解析函数,但是这个函数在Oracle中工作,可能更快:

WITH BAD_DUP AS (
SELECT TK.*,
       COUNT(1) OVER (PARTITION BY D1, D2, D3, C4, C5, C6 ORDER BY D1) FULL_DUP,
       COUNT(1) OVER (PARTITION BY D1, D2, D3 ORDER BY D1) KEY_DUP
FROM TEST_KEY TK)
SELECT * FROM BAD_DUP
WHERE FULL_DUP < KEY_DUP

Would like to get it down to a single query....

想把它弄下来一个查询....

#5


2  

I know this is an old question, but I saw activity on the question and the technique I always use for these is not presented here as an answer, and it's really quite simple, so I figured I'd present it.

我知道这是一个老问题,但我看到了这个问题的活动,我经常使用的技巧并没有在这里作为答案,它真的很简单,所以我想我应该把它呈现出来。

SELECT D1, D2, D3, MIN(C4), MAX(C4), MIN(C5), MAX(C5), MIN(C6), MAX(C6)
FROM BAD_TABLE
GROUP BY D1, D2, D3
HAVING MIN(C4) <> MAX(C4)
    OR MIN(C5) <> MAX(C5)
    OR MIN(C6) <> MAX(C6)

This will show all the keys of duplicates on key but with differences on non-keys, with the range of differences duplicates.

这将显示在key上的重复键的所有键,但是在非键上的差异,以及差异重复的范围。

To see all the rows within that, you would need to join back to BAD_TABLE as your example in the original question.

要查看其中的所有行,您需要返回BAD_TABLE作为原始问题中的示例。