How do I get the unique values of each column based on a condition?

Date: 2022-11-24 22:04:24

I have been trying to find an optimal way to select the unique values from each column. My problem is that I don't know the column names in advance, since different tables have different numbers of columns. So first I have to find the column names, which I can do with the query below:

select column_name from information_schema.columns
where table_name='m0301010000_ds' and column_name like 'c%' 

Sample output for column names:


c1, c2a, c2b, c2c, c2d, c2e, c2f, c2g, c2h, c2i, c2j, c2k, ...

Then I would use the returned column names to get the unique/distinct values in each column, not just distinct rows.

I know the simplest, lousy way is to write select distinct column_name from table where column_name = 'something' for every single column (around 20-50 times), which is also very time consuming. Since I can't use more than one distinct per column_name, I am stuck with this old-school solution.

I am sure there is a faster, more elegant way to achieve this; I just couldn't figure out how. I would really appreciate any help with this.

2 Solutions

#1


3  

You can't just return rows, since the distinct values of different columns no longer belong together.

You could return arrays, which is simpler than you may have expected:

SELECT array_agg(DISTINCT c1)  AS c1_arr
      ,array_agg(DISTINCT c2a) AS c2a_arr
      ,array_agg(DISTINCT c2b) AS c2b_arr
      , ...
FROM   m0301010000_ds;

This returns distinct values per column. One array (possibly big) for each column. All connections between values in columns (what used to be in the same row) are lost in the output.
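To make the shape of the result concrete, here is a minimal sketch on a throwaway table. The table name dist_demo and its rows are made up for illustration, and the ORDER BY inside each aggregate is only there to make the array order deterministic.

```sql
-- Hypothetical demo table; not part of the original question.
CREATE TEMP TABLE dist_demo (c1 int, c2a text);
INSERT INTO dist_demo VALUES (1, 'x'), (1, 'y'), (2, 'x');

-- One row comes back, holding one array per column.
SELECT array_agg(DISTINCT c1  ORDER BY c1)  AS c1_arr
      ,array_agg(DISTINCT c2a ORDER BY c2a) AS c2a_arr
FROM   dist_demo;
-- c1_arr = {1,2}, c2a_arr = {x,y}
```

Three input rows collapse into a single output row, so the pairing of 1 with x or y is gone. If one row per value is needed for a given column, unnest() can expand an array again.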


Build SQL automatically

CREATE OR REPLACE FUNCTION f_build_sql_for_dist_vals(_tbl regclass)
  RETURNS text AS
$func$
SELECT 'SELECT ' || string_agg(format('array_agg(DISTINCT %1$I) AS %1$I_arr'
                                     , attname)
                              , E'\n      ,' ORDER  BY attnum)
        || E'\nFROM   ' || _tbl
FROM   pg_attribute
WHERE  attrelid = _tbl            -- valid, visible table name 
AND    attnum >= 1                -- exclude tableoid & friends
AND    NOT attisdropped           -- exclude dropped columns
$func$  LANGUAGE sql;

Call:


SELECT f_build_sql_for_dist_vals('public.m0301010000_ds');

Returns an SQL string as displayed above.
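The function only builds the statement; it still has to be run. One way, assuming the psql client in version 9.6 or later, is the \gexec meta-command, which sends each value of the query result back to the server as a statement of its own:

```sql
-- psql only: \gexec is a client meta-command, not server-side SQL.
SELECT f_build_sql_for_dist_vals('public.m0301010000_ds') \gexec
```

From other clients, the usual route is to fetch the string and submit it as a second query.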


I use the system catalog pg_attribute instead of the information schema, and the object identifier type regclass for the table name. More explanation in this related answer:
PLpgSQL function to find columns with only NULL values in a given table
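The question only wants columns whose names match 'c%', while the function above covers every column. A possible variant is sketched below. The function name f_build_sql_for_dist_vals_like and the _pattern parameter are hypothetical additions, not part of the answer, and the alias is built as a single quoted identifier so that unusual column names still produce valid SQL.

```sql
-- Sketch only: hypothetical variant that filters columns by a LIKE pattern.
CREATE OR REPLACE FUNCTION f_build_sql_for_dist_vals_like(_tbl regclass, _pattern text)
  RETURNS text AS
$func$
SELECT 'SELECT '
    || string_agg(format('array_agg(DISTINCT %I) AS %I'
                        , attname, attname::text || '_arr')
                , E'\n      ,' ORDER BY attnum)
    || E'\nFROM   ' || _tbl::text
FROM   pg_attribute
WHERE  attrelid = _tbl            -- table resolved via regclass
AND    attnum >= 1                -- exclude system columns
AND    NOT attisdropped           -- exclude dropped columns
AND    attname LIKE _pattern      -- keep only matching column names
$func$  LANGUAGE sql;

-- Same filter as the information_schema query in the question.
SELECT f_build_sql_for_dist_vals_like('public.m0301010000_ds', 'c%');
```

If no column matches the pattern, string_agg() yields NULL and so does the whole function, which the caller should check before executing the result.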


#2


3  

If you need this in "real time", you won't be able to achieve it with SQL that needs to do a full table scan.

I would advise you to create a separate table containing the distinct values for each column (initialized with the SQL from @Erwin Brandstetter ;) and maintain it using a trigger on the original table.

Your new table will have one column per field. The number of rows will equal the maximum number of distinct values for any one field.

On insert: for each field to maintain, check whether that value is already there. If not, add it.

On update: for each field to maintain whose old value differs from the new value, check whether the new value is already there. If not, add it. As for the old value, check whether any other row still has it and, if not, remove it from the list (set the field to null).

On delete: for each field to maintain, check whether any other row still has that value and, if not, remove it from the list (set the value to null).

This way the load is mainly moved to the trigger, and SQL on the value list table will be super fast.
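The insert/update/delete rules above could be sketched roughly as follows. This is not the exact layout described: it stores (column name, value) pairs instead of one column per field, so that removing a value is a plain DELETE rather than setting a field to null. All names (demo_ds, demo_ds_vals, trg_demo_ds_vals, demo_ds_vals_sync) are hypothetical, only two columns are covered, NULLs are not tracked, ON CONFLICT assumes PostgreSQL 9.5 or later, and concurrent writers are not handled.

```sql
-- Hypothetical source table and value list table.
CREATE TABLE demo_ds (c1 text, c2a text);

CREATE TABLE demo_ds_vals (
   col_name text NOT NULL
 , val      text NOT NULL
 , PRIMARY KEY (col_name, val)
);

CREATE OR REPLACE FUNCTION trg_demo_ds_vals()
  RETURNS trigger AS
$func$
BEGIN
   -- Insert / update: add new values that are not listed yet.
   IF TG_OP IN ('INSERT', 'UPDATE') THEN
      INSERT INTO demo_ds_vals (col_name, val)
      SELECT v.col_name, v.val
      FROM  (VALUES ('c1', NEW.c1), ('c2a', NEW.c2a)) v(col_name, val)
      WHERE  v.val IS NOT NULL
      ON     CONFLICT DO NOTHING;
   END IF;

   -- Update / delete: drop old values that no remaining row carries.
   IF TG_OP IN ('UPDATE', 'DELETE') THEN
      DELETE FROM demo_ds_vals d
      WHERE (d.col_name = 'c1'  AND d.val = OLD.c1
             AND NOT EXISTS (SELECT 1 FROM demo_ds WHERE c1 = OLD.c1))
         OR (d.col_name = 'c2a' AND d.val = OLD.c2a
             AND NOT EXISTS (SELECT 1 FROM demo_ds WHERE c2a = OLD.c2a));
   END IF;

   RETURN NULL;  -- the return value is ignored for AFTER row triggers
END
$func$ LANGUAGE plpgsql;

CREATE TRIGGER demo_ds_vals_sync
AFTER INSERT OR UPDATE OR DELETE ON demo_ds
FOR EACH ROW EXECUTE PROCEDURE trg_demo_ds_vals();
```

Because the trigger fires AFTER the change, the NOT EXISTS checks see the table without the deleted row (or with the updated row holding its new value), which is what "any other row has that value" requires. Reading the list for one column is then SELECT val FROM demo_ds_vals WHERE col_name = 'c1'.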


P.S.: Make sure to run all the SQL from your trigger through explain plan to make sure it uses the best possible index and execution plan. For update/deletion, just check whether the old value exists (limit 1).
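In PostgreSQL the command for this is EXPLAIN. A minimal sketch of checking the existence test for one column, with hypothetical table and index names:

```sql
-- Hypothetical table and index names, for illustration only.
CREATE TABLE IF NOT EXISTS demo_ds (c1 text, c2a text);
CREATE INDEX IF NOT EXISTS demo_ds_c1_idx ON demo_ds (c1);

-- Existence check as a trigger would run it; LIMIT 1 lets the scan stop at the first match.
EXPLAIN
SELECT 1 FROM demo_ds WHERE c1 = 'some value' LIMIT 1;
```

On a table of realistic size the plan would ideally show an index scan on the column's index; on a tiny or empty table the planner may still choose a sequential scan, so the check is best done against production-like data.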

