How to get elements/bag of elements from the Hive GROUP BY operator?

I want to group by a given field and get, for each group, the values that were grouped together. Below is an example of what I am trying to achieve:

Imagine a table named 'sample_table' with two columns, F1 and F2, as below:

I want to write a Hive query that will give the output below:

001 [111, 222, 123]
002 [222, 333]
003 [555]

In Pig, this can be achieved very easily with something like this:

grouped_relation = GROUP sample_table BY F1;

Can somebody please suggest a simple way to do this in Hive? The only thing I can think of is to write a User Defined Function (UDF), but that may be a very time-consuming option.

2 Solutions

#1

The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:

SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1

Unfortunately, it also removes duplicate elements, and I imagine this isn't your desired behavior. I find it odd that collect_set exists but there is no version that keeps duplicates. Someone else apparently thought the same thing. It looks like the top and second answers there will give you the UDAF you need.
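For readers on a newer release: Hive 0.13 and later also ship a built-in collect_list aggregate, which behaves like collect_set but keeps duplicates, so a custom UDAF may not be needed. A minimal sketch, assuming Hive 0.13+ and the sample_table, F1 and F2 names from the question (the f2_values alias is only illustrative):

```sql
-- Assumes Hive 0.13 or later, where collect_list is available.
-- Collects every F2 value per F1 group, duplicates included.
SELECT F1, collect_list(F2) AS f2_values
FROM sample_table
GROUP BY F1;
```

The order of elements inside each array is not guaranteed, so do not rely on it matching the order of the input rows.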

#2

collect_set actually works as expected, since a set is by definition a collection of well-defined, distinct objects, i.e. each object occurs exactly once or not at all within a set.
