How to set up dynamic partitioning where a column key becomes the partition

I have a table A and a table B, where the data in table A was inserted from table B. Table A is essentially the same as table B; the only difference is that table A has a date_partition column, which table B does not have. The table B schema is: ID int, school_bg_dt string, log_on_count int, active_count int

The table A schema is: ID int, school_bg_dt bigint, log_on_count int, active_count int, date_partition string

Here is my query for inserting table B into table A, which fails with an error I couldn't figure out:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE A PARTITION(date_partition=school_bg_dt)
SELECT ID, cast(school_bg_dt as BIGINT), log_on_count, active_count FROM B;

However, I get an error saying that the input is not recognized near date_partition. I'm not sure what to do here, so please help. The design is to make each school_bg_dt key a partition, since there is a lot of distinct data under that key.

1 Answer

#1

From the Hive documentation:

In the dynamic partition inserts, users can give partial partition specifications, which means just specifying the list of partition column names in the PARTITION clause. The column values are optional. If a partition column value is given, we call this a static partition, otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement. This means that the dynamic partition creation is determined by the value of the input column. The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.
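
To make the static versus dynamic distinction concrete, here is a minimal sketch that mixes both in one insert. The table A_by_country and its country partition column are hypothetical and not part of the question; the sketch only assumes a table partitioned by two columns, with the static one listed first.

```sql
-- Hypothetical target table with two partition columns (not from the question).
CREATE TABLE A_by_country (
  ID INT,
  school_bg_dt BIGINT,
  log_on_count INT,
  active_count INT
)
PARTITIONED BY (country STRING, date_partition STRING);

SET hive.exec.dynamic.partition=true;

-- country is given a value, so it is a static partition.
-- date_partition has no value, so it is dynamic and is filled from the
-- last column of the SELECT list.
INSERT OVERWRITE TABLE A_by_country PARTITION (country = 'US', date_partition)
SELECT ID, cast(school_bg_dt AS BIGINT), log_on_count, active_count,
       school_bg_dt AS date_partition
FROM B;
```

Because at least one partition column is static here, this form should not need nonstrict mode; an insert where every partition column is dynamic, as in the question, does.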

So, try:

FROM B
INSERT OVERWRITE TABLE A PARTITION(date_partition)
SELECT ID, cast(school_bg_dt as BIGINT), log_on_count, active_count, school_bg_dt as date_partition;
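
To check that the insert created one partition per school_bg_dt value, something like the following could be run afterwards. This is only a sketch; the row_count alias is made up for illustration.

```sql
-- List the partitions that now exist on table A.
SHOW PARTITIONS A;

-- Count the rows that landed in each partition.
SELECT date_partition, count(*) AS row_count
FROM A
GROUP BY date_partition;
```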

Also, note that if you're creating many partitions, you should update the following conf settings:

hive.exec.max.dynamic.partitions.pernode - Maximum number of dynamic partitions allowed to be created in each mapper/reducer node (default = 100)

hive.exec.max.dynamic.partitions - Maximum number of dynamic partitions allowed to be created in total (default = 1000)

hive.exec.max.created.files - Maximum number of HDFS files created by all mappers/reducers in a MapReduce job (default = 100000)
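
As a sketch, these limits can be raised for the current session before running the insert. The values below are placeholders chosen for illustration, not recommendations; pick them based on how many distinct school_bg_dt values the data actually has.

```sql
-- Example values only; adjust to the real number of distinct partition keys.
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.created.files=200000;
```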

