Hive: order by, distribute by, and cluster by

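Most of us already know what order by does:

It sorts the entire output by the given column. To guarantee a total order, Hive sends all rows through a single reducer, so the query is slow on large inputs.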
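As a small illustration, here is a sketch of a global sort against the salary table used in the rest of this post. The limit 10 is an illustrative value, not something from the original example; it is included because Hive in strict mode (hive.mapred.mode=strict) rejects an order by without a limit.

```sql
-- Global sort: every row passes through one reducer.
-- "limit 10" is an illustrative value; strict mode requires a limit with order by.
select department_id, name, salary
from salary
order by salary desc
limit 10;
```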

So what do Hive's other two clauses, distribute by and cluster by, mean?

Let's go straight to an example. Hive has a salary table named salary, with the columns department_id (department ID), name (employee name), and salary (pay).
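To follow along, the table could be created roughly as below. This is only a sketch: the column types and the sample rows are assumptions, since the post names the columns but not their types or contents.

```sql
-- Column types are assumptions; the post only names the columns.
create table if not exists salary (
  department_id int,
  name string,
  salary double
);

-- Hypothetical sample rows (insert ... values needs Hive 0.14 or later).
insert into table salary values
  (1, 'alice', 9000.0),
  (1, 'bob', 7000.0),
  (2, 'carol', 8000.0),
  (2, 'dave', 6500.0),
  (3, 'erin', 7200.0);
```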

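We want to partition the rows by department, then sort the employees within each department by salary from high to low. sort by is ascending by default, so desc is needed:

select * from salary distribute by department_id sort by salary desc;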

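The result is partitioned by department, and the rows of each department are sorted by salary.

Looking at the output files, all rows of a given department are stored in the same file.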

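Hive picks the reducer for each row by hashing the distribute by column modulo the number of reducers, so every row of a department goes to the same reducer and each reducer writes one file. To give each department its own file, the number of reducers should be at least the number of departments, although two department IDs can still hash to the same reducer.

With fewer reducers than departments the query does not fail; several departments simply share a reducer, and their rows are mixed together in salary order within that file.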

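Addendum:

How to set the number of reducers:

set mapreduce.job.reduces=-1;

This is the default value; with -1, Hive decides the number of reducers itself.

After setting it, check the current value:

set mapreduce.job.reduces;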


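(If there are more reducers than departments, the reducers that receive no rows produce empty files. If there are fewer reducers than departments, the query still succeeds, but some files contain more than one department.)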
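One way to check how the output files relate to the reducer count is to write the result to a directory and look at the files. The sketch below assumes 3 reducers and a local path /tmp/salary_by_dept; both are illustrative choices, not values from the original example. It also adds department_id as the first sort by key so that each department stays contiguous when a file holds more than one.

```sql
-- 3 reducers is an illustrative value.
set mapreduce.job.reduces=3;

-- '/tmp/salary_by_dept' is a hypothetical local path; existing contents are overwritten.
insert overwrite local directory '/tmp/salary_by_dept'
select * from salary
distribute by department_id
sort by department_id, salary desc;
```

Listing that directory afterwards should show one file per reducer, some of which may be empty.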

Note:

When distribute by ... sort by use the same column, you can replace them with cluster by.

However, cluster by only sorts in ascending order; you cannot specify asc or desc.
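As a sketch of that equivalence on the salary table, the two queries below distribute and sort rows by the same column, in ascending order:

```sql
-- Shorthand form: distribute and sort (ascending) by department_id.
select * from salary cluster by department_id;

-- Spelled-out form with the same meaning.
select * from salary distribute by department_id sort by department_id;
```

If a descending order is needed, the spelled-out form with sort by department_id desc has to be used instead.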
