Hive (21): A Comprehensive Hive Case Study: Computing TopN Metrics for a Video Site
Requirements
Compute common metrics for the YouTube video site, including various TopN statistics:
- Top 10 videos by view count
- Top 10 video categories by popularity
- Categories of the Top 20 most-viewed videos
- Category rank of the videos related to the Top 50 most-viewed videos
- Top 10 most popular videos in each category
- Top 10 videos by traffic in each category
- Top 10 users by number of uploaded videos, and the videos they uploaded
- Top 10 most-viewed videos in each category
Table Schemas
1. Video table
# The data looks roughly like this (two sample records):
LKh7zAJ4nwo TheReceptionist 653 Entertainment 424 13021 4.34 1305 744 DjdA-5oKYFQ NxTDlnOuybo c-8VuICzXtU DH56yrIO5nI W1Uo5DQTtzc E-3zXq_r4w0 1TCeoRPg5dE yAr26YhuYNY 2ZgXx72XmoE -7ClGo-YgZ0 vmdPOOd6cxI KRHfMQqSHpk pIM
fQShwYqGqsw lonelygirl15 736 People & Blogs 133 151763 3.01 666 765 fQShwYqGqsw LfAaY1p_2Is 5LELNIVyMqo vW6ZpqXjCE4 vPUAf43vc-Q ZllfQZCc2_M it2d7LaU_TA KGRx8TgZEeU aQWdqI1vd6o kzwa8NBlUeo X3ctuFCCF5k Ble9N2kDiGc R24FO
Field descriptions:
video id: unique video id, an 11-character string
uploader: username of the user who uploaded the video, string
age: integer number of days the video has been on the platform
category: category assigned to the video at upload time
length: video length, an integer
views: number of times the video has been viewed
rate: video rating, out of 5
ratings: video traffic, an integer
comments: number of comments on the video, an integer
related ids: ids of related videos, at most 20
2. User table
# The data looks roughly like this:
barelypolitical 151 5106
bonk65 89 144
camelcars 26 674
cubskickass34 13 126
boydism08 32 50
Field descriptions:
uploader: username of the uploader, string
videos: number of uploaded videos, int
friends: number of friends, int
ETL: Cleaning the Raw Data
Looking at the raw data, we can observe:
A video can belong to multiple categories (e.g. People & Blogs); the categories are separated by '&' with a space on each side.
A video can also have multiple related videos, and the related-video ids are separated by tab characters ("\t").
To make multi-valued fields easier to work with during analysis, we first restructure and clean the data: keep '&' as the category separator but strip the surrounding spaces, and join the multiple related-video ids with '&' as well.
A valid row of video data must have at least 9 fields, so the cleaning work consists of three things:
1. Drop rows with fewer than 9 fields (such rows are corrupt).
2. Remove the spaces inside the category field.
3. Use '&' as the separator between related-video ids.
Step 1: Create a Maven Project
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0-mr1-cdh5.14.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0-cdh5.14.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.6.0-cdh5.14.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.6.0-cdh5.14.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>RELEASE</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
<scope>compile</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
<!-- <verbal>true</verbal>-->
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<minimizeJar>true</minimizeJar>
</configuration>
</execution>
</executions>
</plugin>
<!-- <plugin>
<artifactId>maven-assembly-plugin </artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass></mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>-->
</plugins>
</build>
Step 2: Code: ETLUtil
public class VideoUtils {
    /**
     * Cleans one raw record:
     * splits the record on '\t' and drops it if it has fewer than 9 fields,
     * removes the spaces inside the category field,
     * and joins the related-video ids with '&'.
     * @param line one raw input record
     * @return the cleaned record, or null if the record is invalid
     * Sample cleaned record:
     * FM1KUDE3C3k renetto 736 News&Politics 1063 9062 4.57 525 488 LnMvSxl0o0A&IKMtzNuKQso&Bq8ubu7WHkY&Su0VTfwia1w&0SNRfquDfZs&C72NVoPsRGw
     */
    public static String washDatas(String line){
        if(null == line || "".equals(line)) {
            return null;
        }
        // drop the record if it has fewer than 9 fields
        String[] split = line.split("\t");
        if(split.length < 9){
            return null;
        }
        // remove the spaces inside the category field
        split[3] = split[3].replace(" ","");
        StringBuilder builder = new StringBuilder();
        for(int i = 0; i < split.length; i++){
            if(i < 9){
                // the first nine fields stay tab-separated
                builder.append(split[i]).append("\t");
            }else if(i >= 9 && i < split.length - 1){
                // related-video ids are joined with '&'
                builder.append(split[i]).append("&");
            }else if(i == split.length - 1){
                builder.append(split[i]);
            }
        }
        return builder.toString();
    }
}
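Before wiring the cleaning logic into MapReduce, it can be sanity-checked locally. The sketch below is a standalone copy of the washDatas logic (the class name WashDatasDemo and the sample field values are made up for illustration); it runs without any Hadoop dependency:

```java
public class WashDatasDemo {

    // Standalone copy of the cleaning rules: drop rows with fewer than 9 fields,
    // strip spaces from the category field (index 3), keep the first 9 fields
    // tab-separated, and join the remaining related-video ids with '&'.
    public static String washDatas(String line) {
        if (line == null || line.isEmpty()) {
            return null;
        }
        String[] split = line.split("\t");
        if (split.length < 9) {
            return null; // corrupt row: fewer than 9 fields
        }
        split[3] = split[3].replace(" ", "");
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < split.length; i++) {
            if (i < 9) {
                builder.append(split[i]).append("\t");
            } else if (i < split.length - 1) {
                builder.append(split[i]).append("&");
            } else {
                builder.append(split[i]);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // A made-up record: 9 fixed fields followed by two related-video ids.
        String raw = "id1\tuser\t10\tPeople & Blogs\t100\t500\t4.5\t30\t20\trelA\trelB";
        // Category becomes "People&Blogs", related ids become "relA&relB".
        System.out.println(washDatas(raw));
        // A record with too few fields is rejected (prints null).
        System.out.println(washDatas("too\tshort"));
    }
}
```

Note that, mirroring the class above, a row with exactly 9 fields ends with a trailing tab character.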
Step 3: Code: ETLMapper
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class VideoMapper extends Mapper<LongWritable,Text,Text,NullWritable> {
private Text key2 ;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
key2 = new Text();
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String s = VideoUtils.washDatas(value.toString());
if(null != s ){
key2.set(s);
context.write(key2, NullWritable.get());
}
}
}
Step 4: Code: ETLRunner
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class VideoMain extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job = Job.getInstance(super.getConf(), "washDatas");
job.setJarByClass(VideoMain.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job,new Path(args[0]));
job.setMapperClass(VideoMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job,new Path(args[1]));
//Note: no custom reducer is defined here, so the default (identity) reducer is used.
//Set the number of reduce tasks to 7 so that 7 output files are generated,
//which makes loading the data into the table easier later on.
job.setNumReduceTasks(7);
boolean b = job.waitForCompletion(true);
return b?0:1;
}
public static void main(String[] args) throws Exception {
int run = ToolRunner.run(new Configuration(), new VideoMain(), args);
System.exit(run);
}
}
Step 5: Package the Jar and Run It on the Cluster
# Upload the jar to any node of the cluster; node03 is used as the example here
[hadoop@node03 hivedatas]$ rz
original-MyHive-1.0-SNAPSHOT.jar
# Upload the video data to local Linux, then from local to HDFS
[hadoop@node03 hivedatas]$ mkdir videoData
[hadoop@node03 hivedatas]$ cd videoData
[hadoop@node03 videoData]$ rz
< 1.txt 2.txt 3.txt 4.txt
hdfs dfs -put videoData /
# Upload the user data to local Linux, then from local to HDFS
[hadoop@node03 hivedatas]$ mkdir userData;cd userData
[hadoop@node03 userData]$ rz
<
hdfs dfs -put userData /
# Run the jar
[hadoop@node03 hivedatas]$ hadoop jar original-MyHive-1.0-SNAPSHOT.jar com.jimmy.ETL.VideoMain /videoData/ /videoOut
Creating the Tables and Loading the Data
1. Create the plain tables
Next we will create bucketed tables. Bucketed tables cannot be loaded directly with load data, so we first create two corresponding plain tables to stage the data for the bucketed tables. The plain tables are stored as textfile.
youtubevideo_ori:
set hive.enforce.bucketing=true;
set mapreduce.job.reduces=-1;
create database youtube;
use youtube;
create table youtubevideo_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
youtubevideo_user_ori
create table youtubevideo_user_ori(
uploader string,
videos int,
friends int)
row format delimited fields terminated by "\t"
stored as textfile;
Create the bucketed tables
youtubevideo_orc
create table youtubevideo_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
clustered by (uploader) into 8 buckets
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;
youtubevideo_user_orc
create table youtubevideo_user_orc(
uploader string,
videos int,
friends int)
clustered by (uploader) into 24 buckets
row format delimited
fields terminated by "\t"
stored as orc;
2. Load the ETL output
youtubevideo_ori
load data inpath "/videoOut" into table youtubevideo_ori;
youtubevideo_user_ori:
load data inpath "/userData" into table youtubevideo_user_ori;
3. Insert the data into the ORC tables
youtubevideo_orc:
insert overwrite table youtubevideo_orc select * from youtubevideo_ori;
youtubevideo_user_orc:
insert into table youtubevideo_user_orc select * from youtubevideo_user_ori;
4. Inspect the table schemas
0: jdbc:hive2://node03:10000> desc youtubevideo_orc;
+------------+----------------+----------+--+
| col_name | data_type | comment |
+------------+----------------+----------+--+
| videoid | string | |
| uploader | string | |
| age | int | |
| category | array<string> | |
| length | int | |
| views | int | |
| rate | float | |
| ratings | int | |
| comments | int | |
| relatedid | array<string> | |
+------------+----------------+----------+--+
10 rows selected (0.084 seconds)
0: jdbc:hive2://node03:10000> desc youtubevideo_user_orc ;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| uploader | string | |
| videos | int | |
| friends | int | |
+-----------+------------+----------+--+
Business Analysis
1. Top 10 videos by view count
Approach: use order by for a global sort on the views column and limit the output to the first 10 rows.
select videoid,views
from youtubevideo_orc
order by views desc
limit 10;
+--------------+-----------+--+
| videoid | views |
+--------------+-----------+--+
| dMH0bHeiRNg | 42513417 |
| 0XxI-hvPRRA | 20282464 |
| 1dmVU08zVpA | 16087899 |
| RB-wUgnyGv0 | 15712924 |
| QjA5faZF1A8 | 15256922 |
| -_CSo1gOd48 | 13199833 |
| 49IDp76kjPw | 11970018 |
| tYnn51C3X_w | 11823701 |
| pv5zWaTEVkI | 11672017 |
| D2kJZOfq7zk | 11184051 |
+--------------+-----------+--+
2. Top 10 video categories by popularity
Approach:
That is, count how many videos each category contains and show the 10 categories containing the most videos.
We need to group by category and count the videoIds within each group.
Because the current schema maps one video to one or more categories, we must first explode the category array (columns to rows) before we can group by category; then we count, sort by popularity, and show the top 10.
select t1.category_col as category,count(t1.videoid) as hot
from
(select videoid,category_col
from youtubevideo_orc lateral view explode(category) t_cate as category_col) t1
group by t1.category_col
order by hot desc
limit 10;
+----------------+---------+--+
| category | hot |
+----------------+---------+--+
| Music | 179049 |
| Entertainment | 127674 |
| Comedy | 87818 |
| Animation | 73293 |
| Film | 73293 |
| Sports | 67329 |
| Gadgets | 59817 |
| Games | 59817 |
| Blogs | 48890 |
| People | 48890 |
+----------------+---------+--+
3. For the 20 most-viewed videos: their categories, and how many Top 20 videos each category contains
Approach:
First retrieve all columns of the 20 most-viewed videos, sorted in descending order of views.
Then explode the category field of those 20 rows (columns to rows).
Finally, query the category names and the number of Top 20 videos in each category.
Categories of the 20 most-viewed videos:
select videoid,views,category
from youtubevideo_orc
order by views desc
limit 20;
+--------------+-----------+---------------------+--+
| videoid | views | category |
+--------------+-----------+---------------------+--+
| dMH0bHeiRNg | 42513417 | ["Comedy"] |
| 0XxI-hvPRRA | 20282464 | ["Comedy"] |
| 1dmVU08zVpA | 16087899 | ["Entertainment"] |
| RB-wUgnyGv0 | 15712924 | ["Entertainment"] |
| QjA5faZF1A8 | 15256922 | ["Music"] |
| -_CSo1gOd48 | 13199833 | ["People","Blogs"] |
| 49IDp76kjPw | 11970018 | ["Comedy"] |
| tYnn51C3X_w | 11823701 | ["Music"] |
| pv5zWaTEVkI | 11672017 | ["Music"] |
| D2kJZOfq7zk | 11184051 | ["People","Blogs"] |
| vr3x_RRJdd4 | 10786529 | ["Entertainment"] |
| lsO6D1rwrKc | 10334975 | ["Entertainment"] |
| 5P6UU6m3cqk | 10107491 | ["Comedy"] |
| 8bbTtPL1jRs | 9579911 | ["Music"] |
| _BuRwH59oAo | 9566609 | ["Comedy"] |
| aRNzWyD7C9o | 8825788 | ["UNA"] |
| UMf40daefsI | 7533070 | ["Music"] |
| ixsZy2425eY | 7456875 | ["Entertainment"] |
| MNxwAU_xAMk | 7066676 | ["Comedy"] |
| RUCZJVJ_M8o | 6952767 | ["Entertainment"] |
+--------------+-----------+---------------------+--+
Number of Top 20 videos contained in each of those categories:
select t2.category_col as category,count(t2.videoid) as hot_videos
from
(select videoid,category_col
from
(select videoid,views,category
from youtubevideo_orc
order by views desc
limit 20) t1
lateral view explode(category) t_category as category_col) t2
group by t2.category_col
order by hot_videos desc;
+----------------+-------------+--+
| category | hot_videos |
+----------------+-------------+--+
| Entertainment | 6 |
| Comedy | 6 |
| Music | 5 |
| People | 2 |
| Blogs | 2 |
| UNA | 1 |
+----------------+-------------+--+
4. Category rank of the videos related to the Top 50 most-viewed videos
# Approach
# Step 1: get the 50 most-viewed videos; the select must include views because we sort by it
select videoid,relatedid,views
from youtubevideo_orc
order by views desc
limit 50
# Step 2: get the related videos of each of those top videos
select explode(relatedid) as videoid
from
(select videoid,relatedid,views
from youtubevideo_orc
order by views desc
limit 50) t1
# Step 3: deduplicate the videos, then look up each video's categories
select distinct(t2.videoid),t3.category
from
(select explode(relatedid) as videoid
from
(select videoid,relatedid,views
from youtubevideo_orc
order by views desc
limit 50) t1) t2 inner join youtubevideo_orc t3 on t3.videoid=t2.videoid
# Step 4: explode the category arrays
select videoid,category_col as category
from
(select distinct(t2.videoid),t3.category
from
(select explode(relatedid) as videoid
from
(select videoid,relatedid,views
from youtubevideo_orc
order by views desc
limit 50) t1) t2 inner join youtubevideo_orc t3 on t3.videoid=t2.videoid
)
t4 lateral view explode(category)t_category as category_col
# Step 5: group by category, count, and sort
select count(t5.videoid) as hot,category
from
(select videoid,category_col as category
from
(select distinct(t2.videoid),t3.category
from
(select explode(relatedid) as videoid
from
(select videoid,relatedid,views
from youtubevideo_orc
order by views desc
limit 50) t1) t2 inner join youtubevideo_orc t3 on t3.videoid=t2.videoid
) t4 lateral view explode(category)t_category as category_col
) t5
group by category
order by hot desc;
+------+----------------+--+
| hot | category |
+------+----------------+--+
| 232 | Comedy |
| 216 | Entertainment |
| 195 | Music |
| 51 | Blogs |
| 51 | People |
| 47 | Film |
| 47 | Animation |
| 22 | News |
| 22 | Politics |
| 20 | Games |
| 20 | Gadgets |
| 19 | Sports |
| 14 | Howto |
| 14 | DIY |
| 13 | UNA |
| 12 | Places |
| 12 | Travel |
| 11 | Animals |
| 11 | Pets |
| 4 | Autos |
| 4 | Vehicles |
+------+----------------+--+
5. Top 10 most popular videos in each category, using Music as the example
Option 1: query directly
select t1.category,t1.videoid,t1.views
from
(select videoid, category_col as category,views
from youtubevideo_orc lateral view explode (category) t_category as category_col
) t1 where t1.category="Music"
order by views desc
limit 10;
+--------------+--------------+-----------+--+
| t1.category | t1.videoid | t1.views |
+--------------+--------------+-----------+--+
| Music | QjA5faZF1A8 | 15256922 |
| Music | tYnn51C3X_w | 11823701 |
| Music | pv5zWaTEVkI | 11672017 |
| Music | 8bbTtPL1jRs | 9579911 |