DISTINCT vs GROUP BY: which is better?

Time: 2022-01-27 16:32:40

For the simplest case we all refer to:

select id from mytbl 
group by id

and

select distinct id from mytbl

As we know, they generate the same query plan, which has been repeatedly mentioned in questions such as "Which is better: Distinct or Group By".
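As a quick sanity check that the two forms return the same rows, here is a minimal sketch. It uses SQLite through Python's sqlite3 module as a stand-in engine, which is an assumption of this example; the sample rows are made up, and it says nothing about Hive's execution plan or reducer count.

```python
import sqlite3

# SQLite stands in for a real engine; this compares results only, not plans.
conn = sqlite3.connect(":memory:")
conn.execute("create table mytbl (id integer)")
conn.executemany(
    "insert into mytbl (id) values (?)",
    [(1,), (2,), (2,), (3,), (3,), (3,)],  # made-up sample data with duplicates
)


def ids(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


group_by_ids = ids("select id from mytbl group by id")
distinct_ids = ids("select distinct id from mytbl")

# Both queries should yield each id exactly once.
print(group_by_ids == distinct_ids)
```

Whether the two forms also share a plan depends on the engine and its version, so comparing the EXPLAIN output on the actual system is still worthwhile.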

In Hive, however, the former has only one reduce task while the latter has many.

From experiments, I found that GROUP BY is 10+ times faster than DISTINCT.

They are different.

So what I learned is:

GROUP BY is in any case no worse than DISTINCT, and sometimes it is better.

I would like to know:

1. Whether this conclusion is true.

2. If true, I would treat DISTINCT as an approach for logical convenience, but why doesn't DISTINCT adopt GROUP BY's better implementation?

3. If false, I would be very eager to know its proper usage in big-data situations.

Thank you very much!! :)

1 answer

#1



Your experience is interesting. I have not seen the single-reducer effect for DISTINCT versus GROUP BY. Perhaps there is some subtle difference in the optimizer between the two constructs.

A "famous" example in Hive is:

select count(distinct id)
from mytbl;

versus

select count(*)
from (select distinct id
      from mytbl
     ) t;

The former uses only one reducer, while the latter operates in parallel. I have seen this in my own experience, and it is documented and discussed (for example, on slides 26 and 27 in this presentation). So DISTINCT can definitely take advantage of parallelism.
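One caveat when swapping one form for the other: COUNT(DISTINCT id) ignores NULLs, while SELECT DISTINCT keeps NULL as one row, so the two counts differ when id is nullable. A minimal sketch, again assuming SQLite via Python's sqlite3 as a stand-in engine and made-up sample rows:

```python
import sqlite3

# SQLite stands in for a real engine; the sample data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("create table mytbl (id integer)")
conn.executemany(
    "insert into mytbl (id) values (?)",
    [(1,), (2,), (2,), (None,), (None,)],  # two distinct ids plus NULLs
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# count(distinct ...) skips NULLs.
count_distinct = scalar("select count(distinct id) from mytbl")

# The subquery keeps NULL as one distinct row, so count(*) includes it.
count_subquery = scalar(
    "select count(*) from (select distinct id from mytbl) t"
)

# Filtering NULLs inside the subquery makes the rewrite match count(distinct ...).
count_subquery_no_nulls = scalar(
    "select count(*) from (select distinct id from mytbl where id is not null) t"
)

print(count_distinct, count_subquery, count_subquery_no_nulls)
```

If id can never be NULL, the filter is unnecessary and the two forms in the answer return the same count.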

I imagine that as Hive matures, such problems will be fixed. However, it is ironic that Postgres has a similar performance issue with COUNT(DISTINCT), although I think the underlying reason is a little bit different.
