SQL使用窗口函数计算百分位数
SQL使⽤窗⼝函数计算百分位数
百分位数:如果将⼀组数据从⼩到⼤排序,并计算相应的累计百分位,则某⼀百分位所对应数据的值就称为这⼀百分位的百分位数。可表⽰为:⼀组n个观测值按数值⼤⼩排列。如,处于p%位置的值称第p百分位数。
下⾯给出3种计算⽅式:
1. PERCENT_RANK() OVER(ORDER BY .....)
返回每⾏的百分⽐排序,返回值在0~1之间,使⽤此函数可以直接得出百分位数,等价于分组内当前⾏的RANK值-1/分组内总⾏数-1
2. RANK() OVER(ORDER BY .....) /COUNT(1) OVER()
使⽤rank()函数可以统计出当前⾏的排名,配合总数即可算出百分位数,总数使⽤COUNT(1) OVER() 即可得出
3. COUNT(1) OVER(ORDER BY .....RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW) /COUNT(1) OVER()
⼿动调整窗⼝范围,确认当前⾏的排名,配合总数即可算出百分位数,总数使⽤COUNT(1) OVER() 即可得出
下⾯将举例给出具体使⽤⽅法
举例场景:计算学⽣成绩的百分位数
注:本次测试在oracle环境下完成,不过使⽤到的函数⼤部分数据库都⽀持,⼤家有兴趣的话可以尝试⼀下其他数据库
创建学⽣成绩表:
CREATE TABLE TEST.STUDENT_SCORE(
name varchar(20),  --学⽣姓名
course varchar(20), --科⽬
score NUMBER(5,2) --成绩
);
写⼊测试数据:
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','政治',90.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('李四','政治',79.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('王五','政治',85.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('赵六','政治',93.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('⼩明','政治',92.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('⼩红','政治',88.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('⼩吕','政治',76.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('⼩⾼','政治',93.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','外语',87.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('李四','外语',92.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('王五','外语',69.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('赵六','外语',76.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('⼩⾼','外语',76.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','⾼数',95.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('李四','⾼数',70.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('王五','⾼数',65.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('赵六','⾼数',88.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','算法',59.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','数据结构',99.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('李四','数据结构',89.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('王五','数据结构',69.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('赵六','数据结构',90.5);
1.使⽤ PERCENT_RANK() OVER(ORDER BY .....) 计算各个科⽬的百分位数:
--⽤法⾮常简单,此处将百分位数乘100,使百分位数在0~100之间
SELECT
name 姓名,
course 科⽬,
score 成绩,
ROUND(PERCENT_RANK() OVER(PARTITION BY COURSE ORDER BY SCORE DESC)*100, 2) 百分位数FROM TEST.STUDENT_SCORE ;
--PARTITION BY COURSE ORDER BY SCORE DESC :按COURSE分区,使⽤SCORE降序排序
结果:
2.使⽤ RANK() OVER(ORDER BY .....)  /COUNT(1) OVER() 计算各个科⽬的百分位数:
--这种写法使⽤总⼈数和排名来计算百分位,复杂⼀些,但是算法可以⾃⼰修改
SELECT
name 姓名,
course 科⽬,
score 成绩,
score_rank 排名,
students 总⼈数,
CASE WHEN students > 1
THEN ROUND(score_rank * 100 / (students - 1), 2)
ELSE 0
END 百分位数
FROM (
SELECT
name,
course,
score,
RANK() over(PARTITION BY course ORDER BY score DESC)-1 score_rank, --当前⾏的排名
count(1) over(PARTITION BY course) students --当前科⽬的总⼈数
FROM TEST.STUDENT_SCORE
);
--RANK函数的排名从1开始,所以这⾥给他减⼀,便于后续计算
结果(排名从0开始):
3.使⽤COUNT(1) OVER(ORDER BY .....RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  /COUNT(1) OVER()  计算各个科⽬的百分位数:
--这种写法更复杂⼀些,效果与前⾯的是⼀样的,可以调整的地⽅更多⼀些
SELECT
name 姓名,
course 科⽬,
score 成绩,
students-score_rank 排名,
students 总⼈数,
CASE WHEN students > 1
THEN ROUND((students-score_rank) * 100 / (students - 1), 2)
ELSE 0
END 百分位数
FROM (
SELECT
name,
course,
score,
count(0) over(PARTITION BY course ORDER BY score RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) score_rank, --当前⾏的排名,倒序count(0) over(PARTITION BY course) students --当前科⽬的总⼈数
FROM TEST.STUDENT_SCORE
);
--PARTITION BY course ORDER BY score RANGE BETWEEN UNBOUNDED PRECEDING AND C
URRENT ROW :按COURSE分区,使⽤SCORE升序排序,窗⼝范结果(排名从0开始):
扩展知识:
PARTITION BY :分组⼦句,表⽰分析函数的计算范围,不同的组互不相⼲;
ORDER BY: 排序⼦句,表⽰分组后,组内的排序⽅式;
ROWS|RANGE :窗⼝⼦句,是在分组(PARTITION BY)后,组内的⼦分组(也称窗⼝),可以选择窗⼝的范围,需要配合ORDER BY⼦
句使⽤
注:⼀般ORDER BY⼦句后默认的窗⼝⼦句为 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
ROWS|RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW --从第⼀⾏到当前⾏
ROWS|RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING --从当前⾏到结尾⾏
ROWS BETWEEN CURRENT ROW AND n FOLLOWING --从当前⾏到随后的n⾏
ROWS BETWEEN n PRECEDING AND CURRENT ROW --从前n⾏到当前⾏
ROWS BETWEEN n FOLLOWING AND UNBOUNDED FOLLOWING --从下n⾏到结尾⾏
ROWS、RANGE的区别:
ROWS是物理窗⼝,即根据order by ⼦句排序后,取的前N⾏及后N⾏的数据计算(与当前⾏的值⽆关,只与排序后的⾏号相关)RANGE是逻辑窗⼝,是指定当前⾏对应值的范围取值,⾏数不固定,只要列值在范围内,对应⾏都包含在内
举例:
SELECT
name 姓名,
course 科⽬,
score 成绩,
count(0) over(PARTITION BY course) 总⼈数,
count(0) over(PARTITION BY course ORDER BY score RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  RANK1,
count(0) over(PARTITION BY course ORDER BY score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  RANK2
FROM TEST.STUDENT_SCORE ;
结果:
RANGE 关键字配合 n FOLLOWING 使⽤时发现的问题:
SELECTrank函数的用法
name 姓名,
course 科⽬,
score 成绩,
count(0) over(PARTITION BY course) 总⼈数,
count(0) over(PARTITION BY course ORDER BY score RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)  RANK1,
count(0) over(PARTITION BY course ORDER BY score ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)  RANK2
FROM TEST.STUDENT_SCORE ;
结果:
可以看见,使⽤RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING来选择窗⼝范围时,成绩为92.5时,与93的值居然是⼀样的,oracle、mysql环境都出现了这种情况,具体原因还没搞清楚

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。