A query that checks whether two fields have the same value in every row?

I have to maintain a scary legacy database that is very poorly designed. All the tables have more than 100 columns - one has 650. The database is very denormalized and I have found that often the same data is expressed in several columns in the same row.

For instance, here is a sample of columns for one of the tables:

[MEMBERADDRESS] [varchar](331) NULL,
[DISPLAYADDRESS] [varchar](max) NULL,

[MEMBERINLINEADDRESS] [varchar](max) NULL,
[DISPLAYINLINEADDRESS] [varchar](250) NULL,

[__HISTISDN] [varchar](25) NULL,
[HISTISDN] [varchar](25) NULL,
[MYDIRECTISDN] [varchar](25) NULL,
[MYISDN] [varchar](25) NULL,

[__HISTALT_PHONE] [varchar](25) NULL,
[HISTALT_PHONE] [varchar](25) NULL,

It turns out that MEMBERADDRESS and DISPLAYADDRESS have the same value for all rows in the table. The same is true for the other clusters of fields I have shown here.

It will be very difficult and time-consuming to identify all cases like this manually. Is it possible to create a query that would identify whether two fields have the same value in every row of a table?

If not, are there any existing tools that will help me identify these sorts of problems?

2 Solutions

#1

The following approach uses unpivot to create triples. It makes some assumptions: values are not null; each row has an id; and columns have compatible types.

select t.which, t2.which
from (select id, which, value
      from MyTable  -- placeholder name for the table being checked
      unpivot (value for which in (<list of columns here>)) up
     ) t full outer join
     (select id, which, value
      from MyTable
      unpivot (value for which in (<list of columns here>)) up
     ) t2
     on t.id = t2.id and t.which <> t2.which
group by t.which, t2.which
having sum(case when t.value = t2.value then 1 else 0 end) = count(*)

It works by creating a new table with three columns: id, which column, and the value in the column. It then does a self join on id (to keep comparisons within one row), pairing each column with every other column of that row. This self-join should always find a match, because the columns are the same in the two halves of the query.

The having clause then counts, for a given pair of columns, the rows in which the values are the same on both sides. When that count equals the total number of joined rows, the match is successful.
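
To make the logic concrete, here is a small Python model of roughly what the query does. It is only a sketch for illustration, not a replacement for the SQL: the function name `identical_column_pairs` and the `id_key` parameter are made up, and it assumes, like the query, that every row has an id and that null values are skipped.

```python
from collections import defaultdict
from itertools import permutations


def identical_column_pairs(rows, id_key="id"):
    """Return the ordered column pairs whose values agree in every joined row."""
    # "Unpivot": collect one (which, value) cell per non-null column, grouped by id.
    cells_by_id = defaultdict(list)
    for row in rows:
        for which, value in row.items():
            if which != id_key and value is not None:
                cells_by_id[row[id_key]].append((which, value))

    # "Self join on id": pair each column with every other column of the same row.
    matches = defaultdict(int)
    totals = defaultdict(int)
    for cells in cells_by_id.values():
        for (which1, value1), (which2, value2) in permutations(cells, 2):
            totals[(which1, which2)] += 1
            matches[(which1, which2)] += int(value1 == value2)

    # "Having": keep the pairs where the match count equals the row count.
    return sorted(pair for pair in totals if matches[pair] == totals[pair])
```

Each matching pair comes back in both orders, which is also what the `t.which <> t2.which` join condition produces.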

To get more complete information, you can also leave out the having clause and select the counts instead, with something like:

select t.which, t2.which, sum(case when t.value = t2.value then 1 else 0 end) as NumMatches,
       count(*) as NumRows

#2

There are two approaches I see to simplify this query:

Write a script that generates your queries - feed your script the name of the table and the suspected columns, and let it produce a query that checks each pair of columns for equality. This is the fastest approach to implement in a one-off situation like yours.

Write a query that "normalizes" your data, and search against it - self-join the query to itself, then filter out the duplicates.
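
The first approach could be sketched in Python as follows. This is only an illustration, assuming SQL Server style bracket quoting; the function names `quote` and `build_pair_check_query` and the output column aliases are hypothetical. It emits a single SELECT that counts, for each pair of columns, the rows in which the two values differ, treating two NULLs as equal.

```python
from itertools import combinations


def quote(name):
    """Bracket-quote an identifier, doubling any closing bracket inside it."""
    return "[" + name.replace("]", "]]") + "]"


def build_pair_check_query(table, columns):
    """Build one SELECT with a mismatch counter for every pair of columns."""
    counters = []
    for a, b in combinations(columns, 2):
        qa, qb = quote(a), quote(b)
        alias = quote(a + " vs " + b)  # hypothetical naming scheme for the result columns
        counters.append(
            f"    SUM(CASE WHEN {qa} = {qb} OR ({qa} IS NULL AND {qb} IS NULL) "
            f"THEN 0 ELSE 1 END) AS {alias}"
        )
    return "SELECT\n" + ",\n".join(counters) + f"\nFROM {quote(table)};"


if __name__ == "__main__":
    print(build_pair_check_query("MyTable", ["HISTISDN", "MYDIRECTISDN", "MYISDN"]))
```

A pair whose counter comes back as 0 holds the same value in every row. The number of pairs grows quadratically, so it is probably more practical to pass one suspected cluster of columns at a time than all 100+ columns at once.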

Here is a quick illustration of the second approach:

SELECT first.id, first.name AS first_name, second.name AS second_name, first.val FROM (
    SELECT id, MEMBERADDRESS as val,'MEMBERADDRESS' as name FROM MyTable
    UNION ALL
    SELECT id, DISPLAYADDRESS as val,'DISPLAYADDRESS' as name FROM MyTable
    UNION ALL
    SELECT id, MEMBERINLINEADDRESS as val,'MEMBERINLINEADDRESS' as name FROM MyTable
    UNION ALL
    ...
) first
JOIN (
    SELECT id, MEMBERADDRESS as val,'MEMBERADDRESS' as name FROM MyTable
    UNION ALL
    SELECT id, DISPLAYADDRESS as val,'DISPLAYADDRESS' as name FROM MyTable
    UNION ALL
    SELECT id, MEMBERINLINEADDRESS as val,'MEMBERINLINEADDRESS' as name FROM MyTable
    UNION ALL
    ...
) second ON first.id=second.id AND first.val=second.val
        AND first.name < second.name  -- skip self-matches and mirrored pairs

There is a lot of manual work for 100 columns (at least it does not grow as N^2, as in the first approach, but it is still a lot of manual typing). You may be better off generating the selects connected with UNION ALL using a small script.
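
Such a script might look like the following Python sketch. The names `build_unpivot_query`, `quote` and `sql_literal` are hypothetical, and it assumes an id column and columns of compatible types. It deviates from the query above in two ways: it writes the UNION ALL list once in a common table expression instead of repeating it, and it adds a GROUP BY with a HAVING clause so that only the pairs that agree in every row are returned (two NULLs count as equal).

```python
def quote(name):
    """Bracket-quote an identifier, doubling any closing bracket inside it."""
    return "[" + name.replace("]", "]]") + "]"


def sql_literal(text):
    """Render text as a single-quoted SQL string literal."""
    return "'" + text.replace("'", "''") + "'"


def build_unpivot_query(table, id_column, columns):
    """Build a query listing the column pairs that hold equal values in every row."""
    # One SELECT per column, turning each cell into an (id, name, val) triple.
    selects = [
        f"    SELECT {quote(id_column)} AS id, {sql_literal(col)} AS name, "
        f"{quote(col)} AS val FROM {quote(table)}"
        for col in columns
    ]
    union = "\n    UNION ALL\n".join(selects)
    return (
        "WITH unpivoted AS (\n"
        f"{union}\n"
        ")\n"
        "SELECT a.name AS first_column, b.name AS second_column\n"
        "FROM unpivoted AS a\n"
        "JOIN unpivoted AS b ON a.id = b.id AND a.name < b.name\n"
        "GROUP BY a.name, b.name\n"
        "HAVING SUM(CASE WHEN a.val = b.val\n"
        "                  OR (a.val IS NULL AND b.val IS NULL)\n"
        "                THEN 0 ELSE 1 END) = 0;"
    )


if __name__ == "__main__":
    print(build_unpivot_query(
        "MyTable", "id",
        ["MEMBERADDRESS", "DISPLAYADDRESS", "MEMBERINLINEADDRESS"],
    ))
```

The generated text still compares every pair of columns at run time, so the saving is in typing, not in the work the database does.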
