Apache Hive Storage Formats and Compression Formats

Date: 2023-02-07 12:10:59

Summary:

A quick walk-through of Apache Hive storage formats and compression formats.

1. Text File

hive> create external table tab_textfile (
host string comment 'client ip address',
local_time string comment 'client access time',
api string comment 'request api',
request_type string comment 'request method, http version',
http_code int, body_bytes int, request_body map<string, string>,
referer string, user_agent string, upstr string, response_time string, request_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '=';
OK
Time taken: 0.162 seconds

# Create an external table stored as Text File, with no compression

hive> load data local inpath '/data/logs/api/201711/tvlog_20171101/bftvapi.20171101.log' overwrite into table tab_textfile;
Loading data to table tmpdb.tab_textfile
OK
Time taken: 1015.974 seconds

# The source file is 9.8 GB, and loading it into the table took 1015.974 seconds (this can be optimized: skip the load statement and put the file directly into the table's directory, as sketched below)
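# A minimal sketch of that optimization, assuming the table sits under the default warehouse
# path /user/hive/warehouse/tmpdb.db/tab_textfile/ (check describe formatted for the real location):

hive> dfs -put /data/logs/api/201711/tvlog_20171101/bftvapi.20171101.log /user/hive/warehouse/tmpdb.db/tab_textfile/;
hive> select count(*) from tab_textfile;

# A file placed under the table directory is picked up by queries immediately; no separate load data step is needed.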

hive> select count(*) from tab_textfile;
...
Stage-Stage-: Map: Reduce: Cumulative CPU: 269.51 sec HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: minutes seconds msec
OK
Time taken: 95.68 seconds, Fetched: row(s)

# 27,199,202 rows in total, in 95.68 seconds
# Tuning knobs: set [ hive.exec.reducers.bytes.per.reducer=<number>, hive.exec.reducers.max=<number>, mapreduce.job.reduces=<number> ], as in the example below
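# For example, to bound the reducer count for this kind of aggregation (the values here are
# purely illustrative, not recommendations):

hive> set hive.exec.reducers.bytes.per.reducer=268435456;
hive> set hive.exec.reducers.max=64;
hive> select count(*) from tab_textfile;

# Setting mapreduce.job.reduces forces an exact reducer count and overrides the two settings above.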

2. ORC File

# Official docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

# ORC docs: https://orc.apache.org/docs

hive> create external table tab_orcfile (
host string comment 'client ip address',
local_time string comment 'client access time',
api string comment 'request api',
request_type string comment 'request method, http version',
http_code int, body_bytes int, request_body map<string, string>,
referer string, user_agent string, upstr string, response_time string, request_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '='
STORED AS ORC tblproperties ("orc.compress"="NONE");
OK
Time taken: 0.058 seconds

# Create an external table stored as ORC File, with no compression

hive> insert overwrite table tab_orcfile select * from tab_textfile;
...
Stage-Stage-: Map: Cumulative CPU: 2290.24 sec HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: minutes seconds msec
OK
Time taken: 289.954 seconds

# Load the data into tab_orcfile. Note: an ORC File table cannot be populated directly with load data!

# Instead, create a Text File staging table first, upload the data into that table's directory by hand, and then convert it into ORC File format (to verify the result, see the check below).
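# A quick way to confirm what a table is actually stored as (and which orc.compress value it
# carries) is describe formatted:

hive> describe formatted tab_orcfile;

# The output should list InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat and, for
# this table, the table property orc.compress=NONE.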

hive> select count(*) from tab_orcfile;
OK
Time taken: 2.555 seconds, Fetched: row(s)

# Huh, the same statement took 95.68 seconds above and only 2.555 seconds now. No MapReduce job even ran, most likely because the insert left row-count statistics on the ORC table, so count(*) could be answered from metadata.
# So let's test it differently: run count(host) against tab_orcfile first, then against tab_textfile, so both queries have to read real column data.
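# If you want the count(*) benchmark to really scan the data, you can disable the stats shortcut
# (assuming your Hive version exposes this property):

hive> set hive.compute.query.using.stats=false;
hive> select count(*) from tab_orcfile;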

hive> select count(host) from tab_orcfile;
...
Stage-Stage-: Map: Reduce: Cumulative CPU: 81.02 sec HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: minutes seconds msec
OK
Time taken: 33.55 seconds, Fetched: row(s)

# ORC File took 33.55 seconds

hive> select count(host) from tab_textfile;
...
Stage-Stage-: Map: Reduce: Cumulative CPU: 349.77 sec HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: minutes seconds msec
OK
Time taken: 87.308 seconds, Fetched: row(s)

# Text File took 87.308 seconds. The difference speaks for itself!

3. Enabling Compression

# ORC docs: https://orc.apache.org/docs/hive-config.html

hive> create external table tab_orcfile_zlib (
host string comment 'client ip address',
local_time string comment 'client access time',
api string comment 'request api',
request_type string comment 'request method, http version',
http_code int, body_bytes int, request_body map<string, string>,
referer string, user_agent string, upstr string, response_time string, request_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '='
STORED AS ORC;

# ORC's default compression is ZLIB; LZO, SNAPPY, and others are also supported (see the SNAPPY example below)
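# To pick a different codec, set orc.compress explicitly. tab_orcfile_snappy below is a
# hypothetical table shown only for illustration; the rest of this section keeps using the
# default ZLIB table:

hive> create external table tab_orcfile_snappy (
host string, local_time string, api string, request_type string,
http_code int, body_bytes int, request_body map<string, string>,
referer string, user_agent string, upstr string, response_time string, request_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '='
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");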

hive> insert overwrite table tab_orcfile_zlib select * from tab_textfile;
...
Stage-Stage-: Map: Cumulative CPU: 2344.68 sec HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: minutes seconds msec
OK
Time taken: 299.204 seconds

# Data load complete

hive> select count(host) from tab_orcfile_zlib;
...
Stage-Stage-: Map: Reduce: Cumulative CPU: 43.66 sec HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: seconds msec
OK
Time taken: 31.369 seconds, Fetched: row(s)

# Query speed is not affected

hive> dfs -ls -h /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/;
Found items
-rwxrwxrwx root supergroup 24.6 M -- : /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000000_0
-rwxrwxrwx root supergroup 23.0 M -- : /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000001_0
-rwxrwxrwx root supergroup 25.9 M -- : /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000002_0
-rwxrwxrwx root supergroup 26.5 M -- : /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000003_0

# The data is split into 39 files averaging about 25 MB each, less than 1 GB in total, versus a 9.8 GB source file. How's that for a compression ratio?
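# To compare the sizes directly, summarise each table directory (paths assume the default
# warehouse layout shown above):

hive> dfs -du -s -h /user/hive/warehouse/tmpdb.db/tab_textfile/;
hive> dfs -du -s -h /user/hive/warehouse/tmpdb.db/tab_orcfile/;
hive> dfs -du -s -h /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/;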