基于Hadoop2.7.3集群数据仓库Hive1.2.2的部署及使用

HBase是一种分布式、面向列的NoSQL数据库，基于HDFS存储，以表的形式存储数据，表由行和列组成，列划分到列族中。HBase不提供类SQL查询语言，要想像SQL这样查询数据，可以使用Phonix，让SQL查询转换成hbase的扫描和对应的操作，也可以使用现在说讲Hive仓库工具，让HBase作为Hive存储。

Hive是运行在Hadoop之上的数据仓库，将结构化的数据文件映射为一张数据库表，提供简单类SQL查询语言，称为HQL，并将SQL语句转换成MapReduce任务运算。有利于利用SQL语言查询、分析数据，适于处理不频繁变动的数据。Hive底层可以是HBase或者HDFS存储的文件。
两者都是基于Hadoop上不同的技术，相互结合使用，可处理企业中不同类型的业务，利用Hive处理非结构化离线分析统计，利用HBase处理在线查询。

1.安装hive通过二进制包安装
下载地址:http://mirrors.shuosc.org/apache/hive/stable/apache-hive-1.2.2-bin.tar.gz
tar -zxf apache-hive-1.2.2-bin.tar.gz

配置环境变量

# vi /etc/profile

HIVE_HOME=/data/yunva/apache-hive-1.2.-bin

PATH=$PATH:$HIVE_HOME/bin

export HIVE_NAME PATH

# source /etc/profile

2.安装mysql，存储hive相关的信息(此处因为资源使用问题，mysql安装在了另外的服务器中)

# yum install -y mariadb mariadb-server

# systemctl start mariadb

在MySQL创建Hive元数据存放库和连接用户

mysql>create database hive;

mysql>grant all on *.* to'hive'@'%' identified by 'hive';

mysql>flush privileges;

3.配置hive

cd /data/yunva/apache-hive-1.2.-bin/conf

cp hive-default.xml.template hive-default.xml

配置hive连接mysql的信息
# vim hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <property>

        <name>javax.jdo.option.ConnectionURL</name>

        <value>jdbc:mysql://10.10.11.214:3306/hive?createDatabaseIfNotExist=true</value>

        <description>JDBC connect string for a JDBC metastore</description>

    </property>

    <property>

        <name>javax.jdo.option.ConnectionDriverName</name>

        <value>com.mysql.jdbc.Driver</value>

        <description>Driver class name for a JDBC metastore</description>

    </property>          

    <property>

        <name>javax.jdo.option.ConnectionUserName</name>

        <value>hive</value>

        <description>username to use against metastore database</description>

    </property>

    <property>

        <name>javax.jdo.option.ConnectionPassword</name>

        <value>hive</value>

        <description>password to use against metastore database</description>

    </property>

</configuration>

4.安装java连接mysql的驱动
下载地址：https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.45.tar.gz
将解压的mysql-connector-java-5.1.45-bin.jar放到/data/yunva/apache-hive-1.2.2-bin/lib目录

5.启动Hive服务

# hive --service metastore &

[root@test3 apache-hive-1.2.-bin]# ps -ef|grep hive

root          : pts/    :: /usr/java/jdk1..0_65/bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/data/yunva/hadoop-2.7./logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/data/yunva/hadoop-2.7. -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /data/yunva/apache-hive-1.2.-bin/lib/hive-service-1.2..jar org.apache.hadoop.hive.metastore.HiveMetaStore

root           : pts/    :: grep hive

[root@test3 apache-hive-1.2.-bin]# jps

 HRegionServer

 Jps

 RunJa # hive会启动叫做RunJa的程序

客户端配置，需要集成Hadoop环境
scp -P 48490 -r apache-hive-1.2.2-bin 10.10.114.112:/data/yunva

配置环境变量：
vim /etc/profile

# hive client
HIVE_HOME=/data/yunva/apache-hive-1.2.2-bin
PATH=$PATH:$HIVE_HOME/bin
export HIVE_NAME PATH

# vi hive-site.xml(或者直接使用原有配置不变，此时hive就有两个服务端了)

<configuration>

<!--通过thrift方式连接hive-->

   <property>

       <name>hive.metastore.uris</name>

        <value>thrift://hive_server_ip:9083</value>

   </property>

</configuration>

简单测试：
执行hive命令会进入命令界面：

[root@test3 apache-hive-1.2.-bin]# hive

Logging initialized using configuration in jar:file:/data/yunva/apache-hive-1.2.-bin/lib/hive-common-1.2..jar!/hive-log4j.properties

hive> show databases;

OK

default

Time taken: 1.158 seconds, Fetched:  row(s)

hive> create database yunvatest;

hive> use yunvatest;

OK

Time taken: 0.021 seconds

hive> show databases;

OK

default

yunvatest

Time taken: 0.225 seconds, Fetched:  row(s)

hive> create table table_test(id string,name string);

OK

Time taken: 0.417 seconds

hive> show tables;

OK

table_test

Time taken: 0.033 seconds, Fetched:  row(s)

6.Hive常用SQL命令
6.1先创建一个测试库

hive> create database test;

hive> use test;

创建tb1表，并指定字段分隔符为tab键（否则会插入NULL）

hive> create table tb1(id int,name string) row format delimited fields terminated by '\t';

如果想再创建一个表，而且表结构和tb1一样，可以这样：
hive> create table table2 like tb1;

查看下表结构：
hive> describe table2;
OK
id int
name string
Time taken: 0.126 seconds, Fetched: 2 row(s)

6.2从本地文件中导入数据到Hive表
先创建数据文件，键值要以tab键空格：

# cat seasons.txt

    spring

    summer

    autumn

    winter

再导入数据：
hive> load data local inpath'/root/seasons.txt' overwrite into table tb1;
查询是否导入成功

hive> select * from tb1;

OK

    spring

    summer

    autumn

    winter

6.3从HDFS中导入数据到Hive表:

列出hdfs文件系统根目录下的目录
hadoop fs -ls /

创建test根目录
hadoop fs -mkdir /test
put 命令向/test目录写入文件为siji.txt
hadoop fs -put /root/seasons.txt /test/siji.txt

查看siji.txt文件内容

# hadoop fs -cat /test/siji.txt

// :: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    spring

    summer

    autumn

    winte

hive> load data inpath '/test/siji.txt' overwrite into table table2;
Loading data to table test.table2
Table test.table2 stats: [numFiles=1, numRows=0, totalSize=36, rawDataSize=0]
OK
Time taken: 0.336 seconds

查询是否导入成功

hive> select * from table2;

OK

    spring

    summer

    autumn

    winter

Time taken: 0.074 seconds, Fetched:  row(s)

6.4上面是基本表的简单操作，为了提高处理性能，Hive引入了分区机制，那我们就了解分区表概念：

1>.分区表是在创建表时指定的分区空间
2>.一个表可以有一个或多个分区，意思把数据划分成块
3>.分区以字段的形式在表结构中，不存放实际数据内容
分区表优点：将表中数据根据条件分配到不同的分区中，缩小查询范围，提高检索速度和处理性能

6.5单分区表：
创建单分区表tb2（HDFS表目录下只有一级目录）：
hive> create table tb2(id int,name string) partitioned by (dt string) row format delimited fields terminated by '\t';

注：dt可以理解为分区名称。

从文件中把数据导入到Hive分区表，并定义分区信息(需要已经存在的表)

hive> load data local inpath '/root/seasons.txt' into table tb2 partition (dt='2017-12-06');
hive> load data local inpath '/root/seasons.txt' into table tb2 partition (dt='2017-12-07');

查看表数据

hive> select * from tb2;

OK

    spring    --

    summer    --

    autumn    --

    winter    --

    spring    --

    summer    --

    autumn    --

    winter    --

Time taken: 0.086 seconds, Fetched:  row(s)

查看HDFS仓库中表目录变化

[root@test4_haili_dev ~]# hadoop fs -ls -R /user/hive/warehouse/test.db/tb2

// :: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

drwxrwxrwx   - root supergroup           -- : /user/hive/warehouse/test.db/tb2/dt=--

-rwxrwxrwx    root supergroup          -- : /user/hive/warehouse/test.db/tb2/dt=--/seasons.txt

drwxrwxrwx   - root supergroup           -- : /user/hive/warehouse/test.db/tb2/dt=--

-rwxrwxrwx    root supergroup          -- : /user/hive/warehouse/test.db/tb2/dt=--/seasons.txt

可以看到tb2表导入的数据根据日期将数据划分到不同目录下

6.6多分区表：
创建多分区表tb3（HDFS表目录下有一级目录，一级目录下再有子级目录）

hive> create table table3(id int,name string) partitioned by (dt string,location string) row format delimited fields terminated by '\t';

从文件中把数据导入到Hive分区表，并定义分区信息

hive> load data local inpath '/root/seasons.txt' into table table3 partition (dt='2017-12-06',location='guangzhou');
hive> load data local inpath '/root/seasons.txt' into table table3 partition (dt='2017-12-07',location='shenzhen');

查看表数据

hive> select * from table3;

OK

    spring    --    guangzhou

    summer    --    guangzhou

    autumn    --    guangzhou

    winter    --    guangzhou

    spring    --    shenzhen

    summer    --    shenzhen

    autumn    --    shenzhen

    winter    --    shenzhen

查看HDFS仓库中表目录变化

[root@test3 yunva]# hadoop fs -ls -R /user/hive/warehouse/test.db/table3

// :: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

drwxrwxrwx   - root supergroup           -- : /user/hive/warehouse/test.db/table3/dt=--

drwxrwxrwx   - root supergroup           -- : /user/hive/warehouse/test.db/table3/dt=--/location=guangzhou

-rwxrwxrwx    root supergroup          -- : /user/hive/warehouse/test.db/table3/dt=--/location=guangzhou/seasons.txt

drwxrwxrwx   - root supergroup           -- : /user/hive/warehouse/test.db/table3/dt=--

drwxrwxrwx   - root supergroup           -- : /user/hive/warehouse/test.db/table3/dt=--/location=shenzhen

-rwxrwxrwx    root supergroup          -- : /user/hive/warehouse/test.db/table3/dt=--/location=shenzhen/seasons.txt

可以看到表中一级dt分区目录下又分成了location分区。

查看表分区信息
hive> show partitions table3;
OK
dt=2017-12-06/location=guangzhou
dt=2017-12-07/location=shenzhen
Time taken: 0.073 seconds, Fetched: 2 row(s)

根据分区查询数据

hive> select name from table3 where dt='2017-12-06';
OK
spring
summer
autumn
winter
Time taken: 0.312 seconds, Fetched: 4 row(s)

重命名分区
hive> alter table table3 partition (dt='2017-12-06',location='guangzhou') rename to partition(dt='20171206',location='shanghai');

删除分区
hive> alter table table3 drop partition(dt='2017-12-06',location='guangzhou');
OK
Time taken: 0.113 seconds
可以看到已经查不出来了
hive> select name from table3 where dt='2017-12-06';
OK
Time taken: 0.078 seconds

模糊搜索表
hive> show tables 'tb*';
OK
tb1
tb2

给表新添加一列

hive> alter table tb1 add columns (comment string);

OK

Time taken: 0.106 seconds

hive> describe tb1;

OK

id                      int

name                    string

comment                 string

Time taken: 0.079 seconds, Fetched:  row(s)

重命名表
hive> alter table tb1 rename to new_tb1;
OK
Time taken: 0.095 seconds
hive> show tables;
OK
new_tb1
table2
table3
tb2

删除表
hive> drop table new_tb1;
OK
Time taken: 0.094 seconds
hive> show tables;
OK
table2
table3
tb2

秒客网

基于Hadoop2.7.3集群数据仓库Hive1.2.2的部署及使用

相关文章