[Anya Loves BigData] "Hongya Cup" Common Data Analysis Hive SQL Special Contest: Full-Score Walkthrough ②

爱波吉的阿尼亚 · Published 2022-06-02 10:52:21

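Hi everyone, BigData-loving Anya is here! I hope you all enjoy Anya's articles!! Waku waku!!!

This time I'm bringing you part ② of the "Hongya Cup" Common Data Analysis Hive SQL Special Contest full-score walkthrough series, covering the "Hive Special Contest (1)" chapter!

The link to part ①, with the complete contest tasks, is below. If you want to read the full tasks, help yourselves:

[Anya Loves BigData] "Hongya Cup" Common Data Analysis Hive SQL Special Contest: Full-Score Walkthrough ① (爱波吉的阿尼亚's blog, CSDN)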

Hive Special Contest (1) (600 / 600 points)

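Initialize the Environment (200 / 200 points)

The environment is a single-node pseudo-cluster with JDK 1.8, Hadoop 2.7.7, MySQL 5.7, and Hive 2.3.4 already installed.

1. Hadoop is already installed at /root/software/hadoop-2.7.7. Format HDFS, start the cluster, and check the cluster status. (The HDFS port is 9000; all other ports are defaults.)

2. Hive is already installed at /root/software/apache-hive-2.3.4-bin. Start the MySQL service and initialize the metastore database, after which the Hive client can be started.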

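Format and Start the Cluster (200 / 200 points)

Grading criteria:

1. Format the cluster

Environment: Hive special contest environment

hadoop namenode -format      # format HDFS (Hadoop 2.x prints a deprecation notice and suggests hdfs namenode -format)

2. Start the cluster

Environment: Hive special contest environment

start-all.sh                 # start the HDFS and YARN daemons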
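The task also asks to check the cluster status, which the commands above do not show. Here is a minimal sketch of one way to do it, assuming the JDK's `jps` tool is on the PATH. `missing_daemons` is a hypothetical helper name (not part of Hadoop), and the daemon list is the usual set for a Hadoop 2.x pseudo-distributed node, which is an assumption about this environment.

```shell
# Hypothetical helper (not part of Hadoop): given the text printed by `jps`,
# print each expected Hadoop daemon that is NOT listed.
missing_daemons() {
  jps_output="$1"
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode"
    printf '%s\n' "$jps_output" | grep -qw "$daemon" || printf '%s\n' "$daemon"
  done
}

# On the contest machine (assumes jps and hdfs are on the PATH):
#   missing_daemons "$(jps)"     # no output means all five daemons are running
#   hdfs dfsadmin -report        # HDFS capacity and live DataNodes
```

If the helper prints a daemon name, that process did not come up, and its log under the Hadoop installation is the place to look next.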

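3. Start the MySQL service

Environment: Hive special contest environment

systemctl start mysqld.service           # start the MySQL service

4. Initialize the Hive metastore database and enter the Hive client

Environment: Hive special contest environment

schematool -dbType mysql -initSchema    # initialize the Hive metastore schema in MySQL

hive                                    # start the Hive client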

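Import the Population Data (100 / 100 points)

The data was extracted from a public census database. The class variable is whether annual income exceeds $50K; the attribute variables include age, work class, education level, and so on, and the goal is to analyze how each factor affects income. Data location: /college/person.csv

Create the Database and Table, Load Local Data (100 / 100 points)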

Grading criteria:

1. In Hive, create the person database, and create the external table person inside it

Environment: Hive special contest environment

create database if not exists person;

use person;

create external table if not exists person(age double,workclass string,
fnlwgt string,edu string,edu_num string,marital_status string,
occupation string,relationship string,race string,sex string,gain string,
loss string,hours double,native string,income string)
row format delimited fields terminated by ',';

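2. Load the local file "/root/college/person.csv" into the external table person. Pay attention to the field types, which you define yourself.

Environment: Hive special contest environment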

load data local inpath '/root/college/person.csv' into table person;
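Since the column types and count are up to you, it can help to confirm that the CSV really has as many fields per row as the table has columns (15 in the create statement above). The following is a minimal sketch of such a check. `count_bad_rows` is a hypothetical helper name, and it assumes no field contains an embedded comma.

```shell
# Hypothetical helper: count lines in a CSV whose number of comma-separated
# fields differs from the expected column count (default 15, matching the
# person table). Empty lines are counted as bad rows too.
count_bad_rows() {
  csv_file="$1"
  expected="${2:-15}"
  awk -F',' -v n="$expected" 'NF != n { bad++ } END { print bad + 0 }' "$csv_file"
}

# On the contest machine:
#   count_bad_rows /root/college/person.csv          # 0 means every row has 15 fields
#   hive -e 'select * from person.person limit 5;'   # eyeball the first loaded rows
```

A non-zero count would suggest that some rows will load with shifted or NULL columns, which is worth knowing before running the queries.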

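Simple Queries (300 / 300 points)

1. Use the count function to count all rows in the table.
2. Use the max function to find the maximum.
3. Use the min function to find the minimum.
4. Group by the sex column, use the avg function to get the average age of each group, and round it with the round function.
5. Range comparison: between ... and

Simple Queries (Count, Max, Min, Group By) (300 / 300 points)

Grading criteria:

1. Count the total number of rows in the table and write the result to the local directory /root/person00/.

Environment: Hive special contest environment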

insert overwrite local directory '/root/person00'
row format delimited fields terminated by '\t'
select count(*) from person;

2. Find the oldest person in the person table and write the result to the local directory /root/person01/.

Environment: Hive special contest environment

insert overwrite local directory '/root/person01'
row format delimited fields terminated by '\t'
select max(age) from person;

3. Find the youngest person in the person table and write the result to the local directory /root/person02/.

Environment: Hive special contest environment

insert overwrite local directory '/root/person02'
row format delimited fields terminated by '\t'
select min(age) from person;

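4. Compute the average age of men and women in the person table, grouped by sex and rounded, and write the result to the local directory /root/person03/.

Environment: Hive special contest environment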

insert overwrite local directory '/root/person03'
row format delimited fields terminated by '\t'
select round(avg(age)),sex from person group by sex;

5. Count the people aged 35 to 40 whose marital status is "Never-married", and write the result to the local directory /root/person04/.

Environment: Hive special contest environment

insert overwrite local directory '/root/person04'
row format delimited fields terminated by '\t'
select count(*) from person where age between 35 and 40 and marital_status = 'Never-married';
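In Hive, `between 35 and 40` is inclusive at both ends, so ages 35 and 40 are both counted. As a cross-check outside Hive, the same filter can be sketched with awk directly on the CSV. `count_age_marital` is a hypothetical helper name, and the field positions (age = field 1, marital_status = field 6) are taken from the create table statement. The helper also trims spaces around the status value, which is an assumption about possible padding in the file, not something the contest statement says.

```shell
# Hypothetical helper: count CSV rows with lo <= age <= hi and a given
# marital_status, mirroring the Hive WHERE clause with an inclusive range.
count_age_marital() {
  csv_file="$1"; lo="$2"; hi="$3"; status="$4"
  awk -F',' -v lo="$lo" -v hi="$hi" -v st="$status" '
    {
      m = $6
      gsub(/^[ \t]+|[ \t]+$/, "", m)   # trim padding; Hive string columns keep it
      if ($1 + 0 >= lo + 0 && $1 + 0 <= hi + 0 && m == st) n++
    }
    END { print n + 0 }' "$csv_file"
}

# On the contest machine:
#   count_age_marital /root/college/person.csv 35 40 Never-married
```

If this number is larger than what Hive wrote to /root/person04/, padded string values in the file are one possible cause worth checking.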

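6. Count the people who work 20 to 30 hours per week and whose occupation is "Tech-support", and write the result to the local directory /root/person05/.

Environment: Hive special contest environment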

insert overwrite local directory '/root/person05'
row format delimited fields terminated by '\t'
select count(*) from person where hours between 20 and 30 and occupation = 'Tech-support';
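Each `insert overwrite local directory` statement leaves its output as one or more files inside the target directory (typically with names like 000000_0), not as a single file with the directory's name. To look over all six answers at once, a small loop is enough. This is a minimal sketch, and `show_results` is a hypothetical helper name.

```shell
# Hypothetical helper: for person00..person05 under a base directory
# (default /root), print the files Hive wrote there, or MISSING if the
# directory does not exist or has no visible files.
show_results() {
  base="${1:-/root}"
  for i in 0 1 2 3 4 5; do
    dir="$base/person0$i"
    if [ -d "$dir" ] && [ -n "$(ls "$dir")" ]; then
      printf '== %s ==\n' "person0$i"
      cat "$dir"/*
    else
      printf '%s: MISSING\n' "person0$i"
    fi
  done
}

# On the contest machine:
#   show_results /root
```

A MISSING line points at a query that has not been run yet or that failed before writing its output.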