sql方式实现连续n天登陆构造测试数据create table dwd.login_log asselect 1 as user_id, "2020-01-01" as login_dateunion
sql方式实现连续n天登陆
构造测试数据
create table dwd.login_log as
select 1 as user_id, "2020-01-01" as login_date
union all
select 1 as user_id, "2020-01-02" as login_date
union all
select 1 as user_id, "2020-01-07" as login_date
union all
select 1 as user_id, "2020-01-08" as login_date
union all
select 1 as user_id, "2020-01-09" as login_date
union all
select 1 as user_id, "2020-01-10" as login_date
union all
select 2 as user_id, "2020-01-01" as login_date
union all
select 2 as user_id, "2020-01-02" as login_date
union all
select 2 as user_id, "2020-01-04" as login_date
如果日期格式不规范,可以将其转换为标准格式
create table dwd.login_log as
select user_id,to_date(from_unixtime(unix_timestamp(login_date,'yyyy-mm-dd'))) as login_date
from tmp.login_log; -- tmp库为原始数据
1.使用lag&lead+datediff窗口函数
- 比如求连续三天登陆,可以将当天上一条数据和下一条数据都拿到,然后保证now-lag=lead-now=1即可;
- 如果是连续多天,可以取更多的数据,或者将数据全部更改为lag或者lead函数;
datediff(date1, date2) - returns the number of days between date1 and date2
select user_id
from
(select user_id
from
(select user_id,
lag(login_date,1) over(partition by user_id order by login_date) as lag_login_date,
login_date,
lead(login_date,1) over(partition by user_id order by login_date) as lead_login_date
from dwd.login_log)t1
where datediff(login_date,lag_login_date)=1 and datediff(lead_login_date,login_date)=1)t2
group by user_id;
2.使用date_add函数
- 通用的,先对user_id分区排序,然后将日期减去rank天,查看有多少条数据即可;
- 优点在于可以统计具体连续登陆多少天,以及连续登陆的实际情况;
date_add(start_date, num_days) - returns the date that is num_days after start_date
select user_id,con_login_date,count(*) nums
from
(select user_id,login_date,rk,date_add(login_date,1 - rk) as con_login_date
from
(select user_id,login_date,rank() over(partition by user_id order by login_date) rk
from dwd.login_log)t1
)t2
group by user_id,con_login_date
having count(*) >= 3;
| 用户id | 登陆时间 | 按照登陆时间组内排序 |
|---|
| 1 | 2020-01-01 | 1 |
| 1 | 2020-01-02 | 2 |
| 1 | 2020-01-07 | 3 |
| 1 | 2020-01-08 | 4 |
| 1 | 2020-01-09 | 5 |
| 1 | 2020-01-10 | 6 |
| 2 | 2020-01-01 | 1 |
| 2 | 2020-01-02 | 2 |
| 2 | 2020-01-04 | 3 |
- t2表的查询结果,归一化的日期(也就是上述取前
1 - rk)可以自己定义
| 用户id | 登陆时间 | 连续登陆的日期归一化的日期 |
|---|
| 1 | 2020-01-01 | 2020-01-01 |
| 1 | 2020-01-02 | 2020-01-01 |
| 1 | 2020-01-07 | 2020-01-05 |
| 1 | 2020-01-08 | 2020-01-05 |
| 1 | 2020-01-09 | 2020-01-05 |
| 1 | 2020-01-10 | 2020-01-05 |
| 2 | 2020-01-1 | 2020-01-01 |
| 2 | 2020-01-2 | 2020-01-01 |
| 2 | 2020-01-4 | 2020-01-02 |
- group by后的查询结果,第三列可以按照session内统计来理解,就是这批连续登陆内连续登陆的天数
| 用户id | 连续登陆的日期归一化的日期 | 用户此次连续登陆天数 |
|---|
| 1 | 2020-01-01 | 2 |
| 1 | 2020-01-05 | 4 |
| 2 | 2020-01-01 | 2 |
| 2 | 2020-01-02 | 1 |
代码实现思路
- 使用代码来实现连续n天登陆,核心逻辑就是
按照日期排序,新日期如果和旧日期相差1天就保留在hashmap里面,size超过n即可输出user_id,否则清空
package cn.lang.spark_core
import java.text.{parseexception, simpledateformat}
import java.util.calendar
import org.apache.spark.sql.sparksession
object continuouslogindays {
def main(args: array[string]): unit = {
// env
val spark: sparksession = sparksession
.builder()
.appname("continuouslogindays")
.master("local[*]")
.getorcreate()
val sc = spark.sparkcontext
// source,可以是load hive(开启hive支持)或者parquet列式文件(定义好schema)
val source = sc.textfile("/user/hive/warehouse/dwd/login_log")
case class login(uid: int, logintime: string) // 可以kryo序列化
/** get date last `abs(n)` days defore or after biz_date *
* example biz_date = 20200101 ,last_n = 1,return 20191231 */
def getlastndate(biz_date: string,
date_format: string = "yyyymmdd",
last_n: int = 1): string = {
val calendar: calendar = calendar.getinstance()
val sdf = new simpledateformat(date_format)
try
calendar.settime(sdf.parse(biz_date))
catch {
case e: parseexception => // omit
}
calendar.set(calendar.date, calendar.get(calendar.date) - last_n)
sdf.format(calendar.gettime)
}
// transform
val result = source
.map(_.split("\t"))
.map(iterm => login(iterm(0).toint, iterm(1)))
.groupby(_.uid) // rdd[(int, iterable[login])]
.map(iterm => {
// 用于给此uid标记是否符合要求
var continuous_login_n = false
val logins = iterm._2
.toseq
.sortwith((v1, v2) => v1.logintime.compareto(v2.logintime) > 0)
var lastlogintime: string = ""
var logindays: int = 0
logins
.foreach(iterm => {
if (lastlogintime == "") {
lastlogintime = iterm.logintime
logindays = 1
} else if (getlastndate(iterm.logintime) == lastlogintime) {
lastlogintime = iterm.logintime
logindays = 2
} else {
lastlogintime = iterm.logintime
logindays = 1
}
})
if (logindays > 3) continuous_login_n = true
/** 此处可以使用集合将连续登陆的情况保留,
* 也可以直接按照是否连续登陆n天进行标记
*/
(iterm._1, continuous_login_n)
})
.filter(_._2)
.map(_._1)
// sink
result.foreach(println(_))
}
}
总结
到此这篇关于hive实现连续n天登陆语法的文章就介绍到这了,更多相关hive连续n天登陆内容请搜索代码网以前的文章或继续浏览下面的相关文章希望大家以后多多支持代码网!
版权声明:本文内容由互联网用户贡献,该文观点仅代表作者本人。本站仅提供信息存储服务,不拥有所有权,不承担相关法律责任。
如发现本站有涉嫌抄袭侵权/违法违规的内容, 请发送邮件至 2386932994@qq.com 举报,一经查实将立刻删除。
发表评论