大数据离线阶段 09：HQL数据查询（DQL）语言-蒲公英云

1.基础查询

#

语法树

[WITH CommonTableExpression (, CommonTableExpression)*]
SELECT [ALL | DISTINCT] select_expr, select_expr, …
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]

table_reference指示查询的输入。它可以是普通物理表，视图，join查询结果或子查询结果。

表名和列名不区分大小写。

#

案例：美国Covid-19新冠select查询

下面来准备一下语法测试环境，在附件资料中有一份数据文件《us-covid19-counties.dat》，里面记录了2021-01-28美国各个县累计新冠确诊病例数和累计死亡病例数。

在Hive中创建表，加载该文件到表中：

--step1:创建普通表t_usa_covid19
drop table t_usa_covid19;
CREATE TABLE t_usa_covid19(
count_date string,
county string,
state string,
fips int,
cases int,
deaths int)
row format delimited fields terminated by “,”;
--将源数据load加载到t_usa_covid19表对应的路径下
load data local inpath ‘/root/hivedata/us-covid19-counties.dat’ into table t_usa_covid19;

--step2:创建一张分区表基于count_date日期,state州进行分区
CREATE TABLE itcast.t_usa_covid19_p(
county string,
fips int,
cases int,
deaths int)
partitioned by(count_date string,state string)
row format delimited fields terminated by “,”;

--step3:使用动态分区插入将数据导入t_usa_covid19_p中
set hive.exec.dynamic.partition.mode = nonstrict;

insert into table t_usa_covid19_p partition (count_date,state)
select county,fips,cases,deaths,count_date,state from t_usa_covid19;

#

select_expr

每个select_expr表示您要检索的列。必须至少有一个 select_expr。

--select_expr
--查询所有字段或者指定字段
select * from t_usa_covid19_p;
select county, cases, deaths from t_usa_covid19_p;

--查询匹配正则表达式的所有字段
SET hive.support.quoted.identifiers = none; —带反引号的名称被解释为正则表达式
select `^c.*` from t_usa_covid19_p;
--查询当前数据库
select current_database(); —省去from关键字
--查询使用函数
select count(county) from t_usa_covid19_p;

#

ALL 、DISTINCT

ALL和DISTINCT选项指定是否应返回重复的行。如果没有给出这些选项，则默认值为ALL（返回所有匹配的行）。DISTINCT指定从结果集中删除重复的行。

--ALL DISTINCT
--返回所有匹配的行
select state
from t_usa_covid19_p;
--相当于
select all state
from t_usa_covid19_p;
--返回所有匹配的行去除重复的结果
select distinct state
from t_usa_covid19_p;
--多个字段distinct 整体去重
select distinct county,state from t_usa_covid19_p;

#

WHERE

WHERE条件是一个布尔表达式。在WHERE表达式中，您可以使用Hive支持的任何函数和运算符，但聚合函数除外。

从Hive 0.13开始，WHERE子句支持某些类型的子查询。

select * from t_usa_covid19_p where state =”California” and deaths > 1000;
select * from t_usa_covid19_p where 1 > 2; — 1 > 2 返回false
select * from t_usa_covid19_p where 1 = 1; — 1 = 1 返回true

--where条件中使用函数找出州名字母超过10个
select * from t_usa_covid19_p where length(state) >10 ;

--WHERE子句支持子查询
SELECT *
FROM A
WHERE A.a IN (SELECT foo FROM B);

--where条件中不能使用聚合函数
--报错 SemanticException:Not yet supported place for UDAF ‘sum’
select state,sum(deaths)
from t_usa_covid19_p where sum(deaths) >100 group by state;

那么为什么不能在where子句中使用聚合函数呢？

因为聚合函数要使用它的前提是结果集已经确定。而where子句还处于“确定”结果集的过程中，因而不能使用聚合函数。

#

分区查询、分区裁剪

通常，SELECT查询将扫描整个表（所谓的全表扫描）。如果使用PARTITIONED BY子句创建的分区表，则在查询时可以指定分区查询，减少全表扫描，也叫做分区裁剪。

所谓分区裁剪指的是：对分区表进行查询时，会检查WHERE子句或JOIN中的ON子句中是否存在对分区字段的过滤，如果存在，则仅访问查询符合条件的分区，即裁剪掉没必要访问的分区。

--找出来自加州，累计死亡人数大于1000的县 state字段就是分区字段进行分区裁剪避免全表扫描
select * from t_usa_covid19_p where state =”California” and deaths > 1000;

--多分区裁剪
select * from t_usa_covid19_p where count_date = “2021-01-28” and state =”California” and deaths > 1000;

#

GROUP BY

GROUP BY 语句用于结合聚合函数，根据一个或多个列对结果集进行分组。需要注意的是，出现在GROUP BY中select_expr的字段：要么是GROUP BY分组的字段；要么是被聚合函数应用的字段。原因很简单，避免出现一个字段多个值的歧义。

分组字段出现select_expr中，一定没有歧义，因为就是基于该字段分组的，同一组中必相同；

被聚合函数应用的字段，也没歧义，因为聚合函数的本质就是多进一出，最终返回一个结果。

如上图所示，基于category进行分组，相同颜色的分在同一组中。

在select_expr中，如果出现category字段，则没有问题，因为同一组中category值一样，但是返回day就有问题了，day的结果不一样。

下面针对t_usa_covid19_p进行演示：

--根据state州进行分组

--SemanticException:Expression not in GROUP BY key ‘deaths’
--deaths不是分组字段报错
--state是分组字段可以直接出现在select_expr中
select state,deaths
from t_usa_covid19_p where count_date = “2021-01-28” group by state;

--被聚合函数应用
select state,count(deaths)
from t_usa_covid19_p where count_date = “2021-01-28” group by state;

#

HAVING

在SQL中增加HAVING子句原因是，WHERE关键字无法与聚合函数一起使用。

HAVING子句可以让我们筛选分组后的各组数据,并且可以在Having中使用聚合函数，因为此时where，group by已经执行结束，结果集已经确定。

--having
--统计死亡病例数大于10000的州
--where语句中不能使用聚合函数语法报错
select state,sum(deaths)
from t_usa_covid19_p
where count_date = “2021-01-28” and sum(deaths) >10000 group by state;

--先where分组前过滤（此处是分区裁剪），再进行group by分组（含聚合），分组后每个分组结果集确定再使用having过滤
select state,sum(deaths)
from t_usa_covid19_p
where count_date = “2021-01-28”
group by state
having sum(deaths) > 10000;

--这样写更好即在group by的时候聚合函数已经作用得出结果 having直接引用结果过滤不需要再单独计算一次了
select state,sum(deaths) as cnts
from t_usa_covid19_p
where count_date = “2021-01-28”
group by state
having cnts> 10000;

having与where的区别:

having是在分组后对数据进行过滤

where是在分组前对数据进行过滤

having后面可以使用聚合函数

where后面不可以使用聚合

#

LIMIT

LIMIT子句可用于约束SELECT语句返回的行数。

LIMIT接受一个或两个数字参数，这两个参数都必须是非负整数常量。

第一个参数指定要返回的第一行的偏移量（从 Hive 2.0.0开始），第二个参数指定要返回的最大行数。当给出单个参数时，它代表最大行数，并且偏移量默认为0。

--limit
--没有限制返回2021.1.28 加州的所有记录
select * from t_usa_covid19_p
where count_date = “2021-01-28”
and state =”California”;

--返回结果集的前5条
select * from t_usa_covid19_p
where count_date = “2021-01-28”
and state =”California”
limit 5;

--返回结果集从第1行开始共3行
select * from t_usa_covid19_p
where count_date = “2021-01-28”
and state =”California”
limit 2,3; —注意第一个参数偏移量是从0开始的

#

Hive SQL查询执行顺序

SELECT [ALL | DISTINCT] select_expr, select_expr, …
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]

在查询过程中执行顺序：from>where>group（含聚合）>having>order>select。

所以聚合语句(sum,min,max,avg,count)要比having子句优先执行，而where子句在查询过程中执行优先级别优先于聚合语句(sum,min,max,avg,count)。

结合下面SQL感受一下：

select state,sum(deaths) as cnts
from t_usa_covid19_p
where count_date = “2021-01-28”
group by state
having cnts> 10000;

#

2.高阶查询

#

SORT/ORDER/CLUSTER/DISTRIBUTE BY

#

ORDER BY

ORDER BY [ASC|DESC]

Hive SQL中的ORDER BY语法类似于SQL语言中的ORDER BY语法。会对输出的结果进行全局排序，因此底层使用MapReduce引擎执行的时候，只会有一个reducetask执行。也因此，如果输出的行数太大，会导致需要很长的时间才能完成全局排序。

默认排序顺序为升序（ASC），也可以指定为DESC降序。

在Hive 2.1.0和更高版本中，支持在“ order by”子句中为每个列指定null类型结果排序顺序。ASC顺序的默认空排序顺序为NULLS FIRST，而DESC顺序的默认空排序顺序为NULLS LAST。

-—order by
--根据字段进行排序
select * from t_usa_covid19_p
where count_date = “2021-01-28”
and state =”California”
order by deaths; —默认asc null first

select * from t_usa_covid19_p
where count_date = “2021-01-28”
and state =”California”
order by deaths desc; —指定desc null last

--强烈建议将LIMIT与ORDER BY一起使用。避免数据集行数过大
--当hive.mapred.mode设置为strict严格模式时，使用不带LIMIT的ORDER BY时会引发异常。
select * from t_usa_covid19_p
where count_date = “2021-01-28”
and state =”California”
order by deaths desc
limit 3;

#

CLUSTER BY

SELECT expression… FROM table CLUSTER BY col_name;

Hive SQL中的CLUSTER BY语法可以指定根据后面的字段将数据分组，每组内再根据这个字段正序排序（不允许指定排序规则），概况起来就是：根据同一个字段，分且排序。

分组的规则hash散列。hash_func(col_name) % reduce task nums

分为几组取决于reduce task的个数。下面在Hive beeline客户端中针对student表进行演示。

--cluster by
select * from student;
--不指定reduce task个数
--日志显示：Number of reduce tasks not specified. Estimated from input data size: 1
select * from student cluster by sno;

--手动设置reduce task个数
set mapreduce.job.reduces =2;
select * from student cluster by sno;

默认情况下，reduce task的个数由Hive在编译期间自己决定。

设置set mapreduce.job.reduces =2;

执行结果如下：分为两个部分，每个部分内正序排序。

假如说，现在想法如下：把学生表数据根据性别分为两个部分，每个分组内根据年龄的倒序排序。你会发现CLUSTER BY无法完成了。而order by更不能在这里使用，因为它是全局排序，一旦使用order by，编译期间就会强制把reduce task个数设置为1。无法满足分组的需求。

#

DISTRIBUTE BY +SORT BY

如果说CLUSTER BY的功能是分且排序（同一个字段），那么DISTRIBUTE BY +SORT BY就相当于把cluster by的功能一分为二：DISTRIBUTE BY负责分，SORT BY负责分组内排序，并且可以是不同的字段。如果DISTRIBUTE BY +SORT BY的字段一样，可以得出下列结论：

CLUSTER BY=DISTRIBUTE BY +SORT BY（字段一样）

--案例：把学生表数据根据性别分为两个部分，每个分组内根据年龄的倒序排序。
select * from student distribute by sex sort by sage desc;

--下面两个语句执行结果一样
select * from student distribute by sno sort by sno;
select * from student cluster by sno;

#

SORT/ORDER/CLUSTER/DISTRIBUTE BY

#

ORDER BY

ORDER BY [ASC|DESC]

默认排序顺序为升序（ASC），也可以指定为DESC降序。

-—order by
--根据字段进行排序
select * from t_usa_covid19_p
where count_date = “2021-01-28”
and state =”California”
order by deaths; —默认asc null first

select * from t_usa_covid19_p
where count_date = “2021-01-28”
and state =”California”
order by deaths desc; —指定desc null last

#

CLUSTER BY

SELECT expression… FROM table CLUSTER BY col_name;

分组的规则hash散列。hash_func(col_name) % reduce task nums

分为几组取决于reduce task的个数。下面在Hive beeline客户端中针对student表进行演示。

--cluster by
select * from student;
--不指定reduce task个数
--日志显示：Number of reduce tasks not specified. Estimated from input data size: 1
select * from student cluster by sno;

--手动设置reduce task个数
set mapreduce.job.reduces =2;
select * from student cluster by sno;

默认情况下，reduce task的个数由Hive在编译期间自己决定。

设置set mapreduce.job.reduces =2;

执行结果如下：分为两个部分，每个部分内正序排序。

#

DISTRIBUTE BY +SORT BY

CLUSTER BY=DISTRIBUTE BY +SORT BY（字段一样）

--案例：把学生表数据根据性别分为两个部分，每个分组内根据年龄的倒序排序。
select * from student distribute by sex sort by sage desc;

--下面两个语句执行结果一样
select * from student distribute by sno sort by sno;
select * from student cluster by sno;

总结

order by会对输入做全局排序，因此只有一个reducer，会导致当输入规模较大时，需要较长的计算时间。
sort by不是全局排序，其在数据进入reducer前完成排序。因此，如果用sort by进行排序，并且设置mapred.reduce.tasks>1，则sort by只保证每个reducer的输出有序，不保证全局有序。
distribute by(字段)根据指定字段将数据分到不同的reducer，分发算法是hash散列。
Cluster by(字段) 除了具有Distribute by的功能外，还会对该字段进行排序。

如果distribute和sort的字段是同一个时，此时，cluster by = distribute by + sort by

#

Union联合查询

UNION用于将来自多个SELECT语句的结果合并为一个结果集。语法如下：

select_statement UNION [ALL | DISTINCT]
select_statement UNION [ALL | DISTINCT]
select_statement …

使用DISTINCT关键字与只使用UNION默认值效果一样，都会删除重复行。

使用ALL关键字，不会删除重复行，结果集包括所有SELECT语句的匹配行（包括重复行）。

1.2.0之前的Hive版本仅支持UNION ALL，在这种情况下不会消除重复的行。

每个select_statement返回的列的数量和名称必须相同。

--union
--使用DISTINCT关键字与使用UNION默认值效果一样，都会删除重复行。
select num,name from student_local
UNION
select num,name from student_hdfs;
--和上面一样
select num,name from student_local
UNION DISTINCT
select num,name from student_hdfs;

--使用ALL关键字会保留重复行。
select num,name from student_local
UNION ALL
select num,name from student_hdfs;

--如果要将ORDER BY，SORT BY，CLUSTER BY，DISTRIBUTE BY或LIMIT应用于单个SELECT
--请将子句放在括住SELECT的括号内
SELECT sno,sname FROM (select sno,sname from student_local LIMIT 2) subq1
UNION
SELECT sno,sname FROM (select sno,sname from student_hdfs LIMIT 3) subq2

--如果要将ORDER BY，SORT BY，CLUSTER BY，DISTRIBUTE BY或LIMIT子句应用于整个UNION结果
--请将ORDER BY，SORT BY，CLUSTER BY，DISTRIBUTE BY或LIMIT放在最后一个之后。
select sno,sname from student_local
UNION
select sno,sname from student_hdfs
order by sno desc;

#

Common Table Expressions（CTE）

#

CTE介绍

公用表表达式（CTE）是一个临时结果集，该结果集是从WITH子句中指定的简单查询派生而来的，该查询紧接在SELECT或INSERT关键字之前。

CTE仅在单个语句的执行范围内定义。一个或多个CTE可以在Hive SELECT，INSERT， CREATE TABLE AS SELECT或CREATE VIEW AS SELECT语句中使用。

#

CTE案例

--选择语句中的CTE
with q1 as (select sno,sname,sage from student where sno = 95002)
select *
from q1;

-- from风格
with q1 as (select sno,sname,sage from student where sno = 95002)
from q1
select *;

-- chaining CTEs 链式
with q1 as ( select * from student where sno = 95002),
q2 as ( select sno,sname,sage from q1)
select * from (select sno from q2) a;

-- union案例
with q1 as (select * from student where sno = 95002),
q2 as (select * from student where sno = 95004)
select * from q1 union all select * from q2;

--视图，CTAS和插入语句中的CTE
-- insert
create table s1 like student;

with q1 as ( select * from student where sno = 95002)
from q1
insert overwrite table s1
select *;

select * from s1;

-- ctas
create table s2 as
with q1 as ( select * from student where sno = 95002)
select * from q1;

-- view
create view v1 as
with q1 as ( select * from student where sno = 95002)
select * from q1;

select * from v1;