Is there a standard approach for dealing with unordered arrays (sets) in PostgreSQL?

Date: 2022-10-28 21:59:48

I have a table that contains pairs of words in two separate columns. The order of the words is often important, but there are times when I simply want to aggregate based on the two words, regardless of order. Is there a simple way to treat two rows with the same words but with different orders (one row the opposite of the other) as the same "set"? In other words, treat:

apple orange
orange apple

as:

(apple,orange)
(apple,orange)

1 Answer

#1


There's no built-in way at this time.

As arrays

If you consistently normalize them on save you can treat arrays as sets, by always storing them sorted and de-duplicated. It'd be great if PostgreSQL had a built-in C function to do this, but it doesn't. I took a look at writing one but the C array API is horrible, so even though I've written a bunch of extensions I just backed carefully away from this one.

If you don't mind moderately icky performance you can do it in SQL:

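-- Return the input array sorted, with duplicates removed.
-- Note: array_agg over zero rows yields NULL, so an empty input array returns NULL.
CREATE OR REPLACE FUNCTION array_uniq_sort(anyarray) RETURNS anyarray AS $$
SELECT array_agg(DISTINCT f ORDER BY f) FROM unnest($1) f;
$$ LANGUAGE sql IMMUTABLE;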

Then wrap all saves in calls to array_uniq_sort, or enforce it with a trigger. You can then simply compare your arrays for equality. You could avoid the array_uniq_sort calls for data coming from the app if you did the sort and de-duplication on the app side instead.
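
As a rough sketch of the trigger option, assuming a hypothetical table word_sets with a text[] column words (neither name comes from the question), a BEFORE trigger could normalize every array on the way in. The trigger function and trigger names are also made up for illustration, and array_uniq_sort is repeated so the snippet runs on its own.

```sql
-- Same helper as in the answer, repeated so this snippet is self-contained.
CREATE OR REPLACE FUNCTION array_uniq_sort(anyarray) RETURNS anyarray AS $$
SELECT array_agg(DISTINCT f ORDER BY f) FROM unnest($1) f;
$$ LANGUAGE sql IMMUTABLE;

-- Hypothetical table: each row stores one "set" of words.
CREATE TABLE word_sets (
    id    serial PRIMARY KEY,
    words text[] NOT NULL
);

-- Normalize the array before it is stored.
-- coalesce() maps the NULL that array_uniq_sort returns for an
-- empty array back to an empty array.
CREATE OR REPLACE FUNCTION word_sets_normalize() RETURNS trigger AS $$
BEGIN
    NEW.words := coalesce(array_uniq_sort(NEW.words), '{}');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER word_sets_normalize_trg
    BEFORE INSERT OR UPDATE ON word_sets
    FOR EACH ROW EXECUTE PROCEDURE word_sets_normalize();

INSERT INTO word_sets (words) VALUES
    (ARRAY['apple', 'orange']),
    (ARRAY['orange', 'apple']);

-- Both rows now hold the same sorted array, so they fall into one group.
SELECT words, count(*) FROM word_sets GROUP BY words;
```

With the trigger in place, plain equality and GROUP BY on the column behave like set comparison, at the cost of running the SQL function on every write.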

If you do this please store your "sets" as array columns, like text[], not comma- or space-delimited text. See this question for some of the reasons.
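
If the data already lives in delimited text, a one-off conversion might look like the sketch below. The table legacy_words and its columns are hypothetical; string_to_array splits the text, and the subquery applies the same sort and de-duplicate step that array_uniq_sort performs.

```sql
-- Hypothetical legacy table holding comma-delimited "sets".
CREATE TABLE legacy_words (
    id        serial PRIMARY KEY,
    words_csv text NOT NULL
);

INSERT INTO legacy_words (words_csv) VALUES
    ('orange,apple'),
    ('apple,orange,apple');

-- Add a real array column and fill it with the normalized form.
ALTER TABLE legacy_words ADD COLUMN words text[];

UPDATE legacy_words
SET words = (
    SELECT array_agg(DISTINCT w ORDER BY w)
    FROM unnest(string_to_array(words_csv, ',')) AS w
);

-- Both rows now compare equal as arrays and group together.
SELECT words, count(*) FROM legacy_words GROUP BY words;
```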

You need to watch out for a few things, like the fact that casts between arrays are stricter than casts between their base types. E.g.:

regress=> SELECT 'a' = 'a'::varchar, 'b' = 'b'::varchar;
 ?column? | ?column? 
----------+----------
 t        | t
(1 row)

regress=> SELECT ARRAY['a','b'] = ARRAY['a','b']::varchar[];
ERROR:  operator does not exist: text[] = character varying[]
LINE 1: SELECT ARRAY['a','b'] = ARRAY['a','b']::varchar[];
                              ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
regress=> SELECT ARRAY['a','b']::varchar[] = ARRAY['a','b']::varchar[];
 ?column? 
----------
 t
(1 row)

Such columns are GIN-indexable for operations like array-contains or array-overlaps (GiST support for arrays is only available for integer arrays, via the intarray extension); see the PostgreSQL documentation on array indexing.
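
A minimal sketch of indexing an array column for containment and overlap queries, using the GIN operator class that core PostgreSQL provides for arrays. The table tagged_items and its columns are hypothetical names chosen for illustration.

```sql
-- Hypothetical table with an array-valued "set" column.
CREATE TABLE tagged_items (
    id   serial PRIMARY KEY,
    tags text[] NOT NULL
);

-- GIN index using the default array operator class.
CREATE INDEX tagged_items_tags_gin ON tagged_items USING gin (tags);

INSERT INTO tagged_items (tags) VALUES
    (ARRAY['apple', 'orange']),
    (ARRAY['apple', 'pear']);

-- array-contains: rows whose tags include every listed element.
SELECT id, tags FROM tagged_items WHERE tags @> ARRAY['apple', 'orange'];

-- array-overlaps: rows whose tags share at least one listed element.
SELECT id, tags FROM tagged_items WHERE tags && ARRAY['pear', 'kiwi'];
```

On a table this small the planner will most likely choose a sequential scan; the index only pays off on larger data.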

As normalized rows

The other option is to just store normalized rows with a suitable key. I'd still use array_agg for sorting and comparing them, as SQL set operations can be clunky to use for this (especially given the lack of an XOR / double-sided set difference operation).

This is generally known as EAV (entity-attribute-value), except that here you'd be using it without the value component. I'm not a fan myself, but it does have its place occasionally.

You create a table:

CREATE TABLE item_attributes (
    item_id integer references items(id),
    attribute_name text,
    primary key(item_id, attribute_name)
);

and insert a row for each set entry for each item, instead of having each item have an array-valued column. The unique constraint enforced by the primary key ensures that no item may have duplicates of a given attribute. Attribute ordering is irrelevant/undefined.

Comparisons can be done with SQL set operators like EXCEPT, or by using array_agg(attribute_name ORDER BY attribute_name) to form consistently sorted arrays for comparison.
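
Both comparison styles can be sketched as follows. The items table is referenced but never defined in the answer, so the one-column definition here is an assumption, as is the sample data.

```sql
-- Assumed minimal parent table; the answer only references items(id).
CREATE TABLE IF NOT EXISTS items (id integer PRIMARY KEY);

CREATE TABLE IF NOT EXISTS item_attributes (
    item_id integer REFERENCES items(id),
    attribute_name text,
    PRIMARY KEY (item_id, attribute_name)
);

INSERT INTO items (id) VALUES (1), (2), (3);
INSERT INTO item_attributes (item_id, attribute_name) VALUES
    (1, 'apple'), (1, 'orange'),
    (2, 'orange'), (2, 'apple'),
    (3, 'apple'), (3, 'pear');

-- Style 1: build a consistently sorted array per item, then group
-- items whose arrays are equal. Items 1 and 2 share one group.
SELECT attrs, array_agg(item_id ORDER BY item_id) AS item_ids
FROM (
    SELECT item_id,
           array_agg(attribute_name ORDER BY attribute_name) AS attrs
    FROM item_attributes
    GROUP BY item_id
) AS per_item
GROUP BY attrs;

-- Style 2: two-sided set difference between items 1 and 3,
-- spelled out as two EXCEPT queries. An empty result means
-- the two sets are equal.
(SELECT attribute_name FROM item_attributes WHERE item_id = 1
 EXCEPT
 SELECT attribute_name FROM item_attributes WHERE item_id = 3)
UNION ALL
(SELECT attribute_name FROM item_attributes WHERE item_id = 3
 EXCEPT
 SELECT attribute_name FROM item_attributes WHERE item_id = 1);
```

The second query shows why the set-operator route gets clunky: one logical comparison needs four scans of the table.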

Indexing is limited to determining whether a given item has/doesn't have a given attribute.

Personally I'd use arrays over this approach.

hstore

You can also use hstore values with empty values to store sets, since hstore de-duplicates keys. The jsonb type added in PostgreSQL 9.4 also works for this.

regress=# create extension hstore;
CREATE EXTENSION
regress=# SELECT hstore('a => 1, b => 1') = hstore('b => 1, a => 1, b => 1');
 ?column? 
----------
 t
(1 row)
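
The jsonb variant can be sketched the same way, assuming PostgreSQL 9.4 or later. Here each set member is an object key with a null value; that layout is one possible convention rather than anything the answer prescribes.

```sql
-- jsonb drops duplicate object keys and ignores key order on input,
-- so these two values compare as equal.
SELECT '{"apple": null, "orange": null}'::jsonb
     = '{"orange": null, "apple": null, "orange": null}'::jsonb AS same_set;

-- Membership test with the key-exists operator.
SELECT '{"apple": null, "orange": null}'::jsonb ? 'apple' AS has_apple;

-- List the members of the set, one per row.
SELECT jsonb_object_keys('{"apple": null, "orange": null}'::jsonb) AS member;
```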

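It's only really useful for text types, though, because hstore compares keys as text, so numerically equal keys such as 1.0 and 1.00 are treated as different. E.g.: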

regress=# SELECT hstore('"1.0" => 1, "2.0" => 1') = hstore('"1.00" => 1, "1.000" => 1, "2.0" => 1');
 ?column? 
----------
 f
(1 row)

and I think it's ugly. So again, I'd favour arrays.

For integer arrays only

The intarray extension provides useful, fast functions for treating arrays as sets. They're only available for integer arrays but they're really useful.
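
A short sketch of a few of those functions and operators, assuming the intarray contrib module is available for installation on the server:

```sql
CREATE EXTENSION IF NOT EXISTS intarray;

-- Normalize: sort, then drop adjacent duplicates.
SELECT uniq(sort(ARRAY[3, 1, 2, 3, 1])) AS normalized;   -- {1,2,3}

-- Set union of two integer arrays.
SELECT ARRAY[1, 2, 3] | ARRAY[3, 4] AS set_union;

-- Set intersection.
SELECT ARRAY[1, 2, 3] & ARRAY[2, 3, 4] AS set_intersection;

-- Remove the elements of the right array from the left array.
SELECT ARRAY[1, 2, 3] - ARRAY[2] AS set_difference;
```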
