Hive Table Partitions

Date: 2023-03-10 03:46:10

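A partition of a Hive table is simply a directory under the table's directory. The partition column is declared separately in PARTITIONED BY and must not repeat any column in the table's column list.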

Create a partitioned table:

create table tb_partition(id string, name string)
PARTITIONED BY (month string)
row format delimited fields terminated by '\t';

Loading data into a Hive partitioned table

Method 1: load the data with load

load data local inpath '/home/hadoop/files/nameinfo.txt' overwrite into table tb_partition partition(month='');

Method 2: insert ... select

insert overwrite table tb_partition partition(month='201707') select id, name from name;
hive> insert into table tb_partition partition(month='201707') select id, name from name;
Query ID = hadoop_20170918222525_7d074ba1-bff9-44fc-a664-508275175849
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator

Method 3: upload the file manually into the partition directory

hdfs dfs -mkdir /user/hive/warehouse/tb_partition/month=201710
hdfs dfs -put nameinfo.txt /user/hive/warehouse/tb_partition/month=201710

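Although Method 3 places the file in the partition directory, a query on the table does not return that data, because the metastore does not yet know the partition exists. The metadata has to be updated first.

Two ways to update the metadata:

Method 1: msck repair table <table name>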

hive> msck repair table tb_partition;
OK
Partitions not in metastore: tb_partition:month=201710
Repair: Added partition to metastore tb_partition:month=201710
Time taken: 0.265 seconds, Fetched: 2 row(s)

Method 2: alter table tb_partition add partition(month='201708');

hive> alter table tb_partition add partition(month='201708');
OK
Time taken: 0.126 seconds
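
The difference between the two methods can be sketched with a toy model in Python. This is not how Hive is implemented: the metastore is reduced to a set of partition names, the HDFS directories to a second set, and the starting state (one registered partition, one hand-uploaded directory) is an assumption made for illustration. The function names are hypothetical.

```python
# Toy model, not Hive's implementation: the metastore is a set of registered
# partition names and the filesystem is a set of partition directories.

def visible_partitions(metastore, directories):
    """A query only sees partitions that are registered AND exist on disk."""
    return sorted(metastore & directories)

def msck_repair(metastore, directories):
    """Like MSCK REPAIR TABLE: register every directory the metastore lacks."""
    missing = sorted(d for d in directories if d not in metastore)
    metastore.update(missing)
    return missing

def add_partition(metastore, partition):
    """Like ALTER TABLE ... ADD PARTITION: register one named partition."""
    metastore.add(partition)

# Assumed starting state: month=201710 was uploaded by hand and is unregistered.
metastore = {"month=201709"}
directories = {"month=201709", "month=201710"}

print(visible_partitions(metastore, directories))  # ['month=201709']
print(msck_repair(metastore, directories))         # ['month=201710']
print(visible_partitions(metastore, directories))  # ['month=201709', 'month=201710']

add_partition(metastore, "month=201708")
print(sorted(metastore))  # ['month=201708', 'month=201709', 'month=201710']
```

In this model msck repair discovers whatever is on disk, while add partition registers exactly the partition you name, whether or not data is already there.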

Query the table data:

hive> select * from tb_partition;
OK
1 Lily 201708
2 Andy 201708
3 Tom 201708
1 Lily 201709
2 Andy 201709
3 Tom 201709
1 Lily 201710
2 Andy 201710
3 Tom 201710
Time taken: 0.161 seconds, Fetched: 9 row(s)
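
The month column in the output above is not stored in nameinfo.txt; Hive takes its value from the partition directory name. A minimal Python sketch of that idea follows. It is a toy model rather than Hive's reader, the function name is hypothetical, and the file content is assumed from the query output (three tab-delimited rows of id and name).

```python
def read_partition(partition_dir, file_lines):
    """Yield rows for one partition: the file's columns, then the partition value."""
    # The partition value comes from the directory name, e.g. "month=201710".
    _, value = partition_dir.split("=", 1)
    for line in file_lines:
        yield line.rstrip("\n").split("\t") + [value]

# Assumed content of nameinfo.txt, inferred from the query output.
file_lines = ["1\tLily\n", "2\tAndy\n", "3\tTom\n"]

for row in read_partition("month=201710", file_lines):
    print(row)
# ['1', 'Lily', '201710']
# ['2', 'Andy', '201710']
# ['3', 'Tom', '201710']
```

This is also why the same file can be loaded into several partitions and show up with a different month each time.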

List the partitions: show partitions <table name>

hive> show partitions tb_partition;
OK
month=201708
month=201709
month=201710
Time taken: 0.154 seconds, Fetched: 3 row(s)

Check the file layout in HDFS:

[hadoop@node11 files]$ hdfs dfs -ls /user/hive/warehouse/tb_partition/
17/09/18 22:33:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
drwxr-xr-x - hadoop supergroup 0 2017-09-18 22:25 /user/hive/warehouse/tb_partition/month=201707
drwxr-xr-x - hadoop supergroup 0 2017-09-18 22:15 /user/hive/warehouse/tb_partition/month=201708
drwxr-xr-x - hadoop supergroup 0 2017-09-18 05:55 /user/hive/warehouse/tb_partition/month=201709
drwxr-xr-x - hadoop supergroup 0 2017-09-18 22:03 /user/hive/warehouse/tb_partition/month=201710

Creating multi-level partitions

create table tb_mul_partition(id string, name string)
PARTITIONED BY (month string, code string)
row format delimited fields terminated by '\t';

Load data:

load data local inpath '/home/hadoop/files/nameinfo.txt' into table tb_mul_partition partition(month='201709',code='10000');
load data local inpath '/home/hadoop/files/nameinfo.txt' into table tb_mul_partition partition(month='201710',code='10000');

Query the data:

hive> select * from tb_mul_partition where code='10000';
OK
1 Lily 201709 10000
2 Andy 201709 10000
3 Tom 201709 10000
1 Lily 201710 10000
2 Andy 201710 10000
3 Tom 201710 10000
Time taken: 0.208 seconds, Fetched: 6 row(s)

Now try specifying only one of the partition columns:

hive> load data local inpath '/home/hadoop/files/nameinfo.txt' into table tb_mul_partition partition(month='201708');
FAILED: SemanticException [Error 10006]: Line 1:95 Partition not found ''201708''
hive> load data local inpath '/home/hadoop/files/nameinfo.txt' into table tb_mul_partition partition(code='20000');
FAILED: SemanticException [Error 10006]: Line 1:95 Partition not found ''20000''

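Because the table was created with multi-level partitions, a load must give a value for every partition column; specifying only one of them fails.

Check how the data is stored in HDFS: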

[hadoop@node11 files]$ hdfs dfs -ls /user/hive/warehouse/tb_mul_partition/month=201710
drwxr-xr-x - hadoop supergroup 0 2017-09-18 22:36 /user/hive/warehouse/tb_mul_partition/month=201710/code=10000
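
The nesting seen in the listing follows the order of the columns in PARTITIONED BY. The Python sketch below models how a complete partition spec maps to a directory and why an incomplete one cannot. It is an illustration only: the function name is hypothetical, the warehouse root is the one that appears in this post, and the ValueError stands in for Hive's own "Partition not found" error.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # warehouse root as it appears in this post

def partition_path(table, partition_cols, spec):
    """Map a full partition spec to its directory; reject incomplete specs."""
    missing = [col for col in partition_cols if col not in spec]
    if missing:
        # Toy stand-in for Hive's error; Hive reports "Partition not found".
        raise ValueError(f"missing partition columns: {missing}")
    # One directory level per partition column, in PARTITIONED BY order.
    levels = [f"{col}={spec[col]}" for col in partition_cols]
    return posixpath.join(WAREHOUSE, table, *levels)

cols = ["month", "code"]
print(partition_path("tb_mul_partition", cols, {"month": "201710", "code": "10000"}))
# /user/hive/warehouse/tb_mul_partition/month=201710/code=10000

try:
    partition_path("tb_mul_partition", cols, {"month": "201708"})
except ValueError as err:
    print(err)  # missing partition columns: ['code']
```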

Dynamic partitions

Recall the earlier insert into a partition:

insert overwrite table tb_partition partition(month='201707') select id, name from name;

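That statement has to name the concrete partition value '201707'. With dynamic partitioning, Hive instead takes the partition value of each row from the data being inserted.

Create a new table: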

hive> create table tb_copy_partition like tb_partition;
OK
Time taken: 0.118 seconds

Check the table structure:

hive> desc tb_copy_partition;
OK
id string
name string
month string

# Partition Information
# col_name data_type comment

month string
Time taken: 0.127 seconds, Fetched: 8 row(s)

Next, insert data into tb_copy_partition using dynamic partitioning:

insert into table tb_copy_partition partition(month) select id, name, month from tb_partition;

Note that the partition column month must come last in the select list.

hive> insert into table tb_copy_partition partition(month) select id, name, month from tb_partition;
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

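The statement fails: in strict mode, a dynamic partition insert needs at least one static partition column. The error message says how to turn this off: set hive.exec.dynamic.partition.mode=nonstrict

Apply the setting the error message suggests: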

hive> set hive.exec.dynamic.partition.mode=nonstrict;

Check the setting to confirm it took effect:

hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=nonstrict

Run the insert again:

hive> insert into table tb_copy_partition partition(month) select id, name, month from tb_partition;
Query ID = hadoop_20170918230808_0bf202da-279f-4df3-a153-ece0e457c905
...
Starting Job = job_1505785612206_0002, Tracking URL = http://node11:8088/proxy/application_1505785612206_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.-cdh5.10.0/bin/hadoop job -kill job_1505785612206_0002
...
Ended Job = job_1505785612206_0002
Moving data to: hdfs://cluster1/user/hive/warehouse/tb_copy_partition/.hive-staging_hive_2017-09-18_23-08-01_475_7542657053989652968-1/-ext-10000
Loading data to table default.tb_copy_partition partition (month=null)
Loading partition {month=201707}
Loading partition {month=201708}
Loading partition {month=201709}
Loading partition {month=201710}
...
OK
Time taken: 28.932 seconds

Query the data:

hive> select * from tb_copy_partition;
OK
1 Lily 201707
2 Andy 201707
3 Tom 201707
1 Lily 201708
2 Andy 201708
3 Tom 201708
1 Lily 201709
2 Andy 201709
3 Tom 201709
1 Lily 201710
2 Andy 201710
3 Tom 201710
Time taken: 0.121 seconds, Fetched: 12 row(s)
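
What the dynamic insert did can be sketched in Python as a toy model. It is not Hive's implementation: the function name and the sample rows are hypothetical, and the function only models an insert in which every partition column is dynamic. Each row's last value picks the partition, which is why month has to come last in the select list.

```python
from collections import defaultdict

def dynamic_insert(rows, partition_col="month", mode="strict"):
    """Route each row to a partition chosen by its last value."""
    if mode == "strict":
        # Mirrors Error 10096: an all-dynamic spec is rejected in strict mode.
        raise ValueError("strict mode requires at least one static partition column")
    partitions = defaultdict(list)
    for *data, key in rows:
        # The last select column is the partition value, not stored data.
        partitions[f"{partition_col}={key}"].append(tuple(data))
    return dict(partitions)

# Hypothetical sample rows in the shape (id, name, month).
rows = [("1", "Lily", "201707"), ("2", "Andy", "201708"), ("3", "Tom", "201708")]

try:
    dynamic_insert(rows)  # default mode is strict
except ValueError as err:
    print(err)  # strict mode requires at least one static partition column

print(dynamic_insert(rows, mode="nonstrict"))
# {'month=201707': [('1', 'Lily')], 'month=201708': [('2', 'Andy'), ('3', 'Tom')]}
```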

Done.