Several ways to write an anti join in Hive

The t_a table contains the following records:

| c1 |
| :--- |
| a |
| b |
| c |

The SQL that creates it:

create table t_a(c1 string);
insert into t_a values("a"),("b"),("c");

The t_b table contains the following records:

| c1 |
| :--- |
| b |
| m |

The SQL that creates it:

create table t_b(c1 string);
insert into t_b values("b"),("m");

We want the records that appear in t_a but do not appear in t_b.
The expected result is:

| c1 |
| :--- |
| a |
| c |

Method 1: use NOT IN

select * from t_a 
where c1 not in(select c1 from t_b);
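One caveat with NOT IN is worth spelling out. Under standard SQL three-valued logic, if the subquery returns even one NULL, `c1 not in (...)` can never evaluate to true and the query returns no rows. The sample t_b has no NULLs, so the query above is not affected. The sketch below shows two common workarounds; it assumes a Hive version that supports subqueries in the WHERE clause (0.13 or later, as far as I know).

```sql
-- Option A: keep NOT IN, but exclude NULLs from the subquery.
select * from t_a
where c1 not in (select c1 from t_b where c1 is not null);

-- Option B: correlated NOT EXISTS, which is not affected by NULLs in t_b.
select a.* from t_a a
where not exists (select 1 from t_b b where b.c1 = a.c1);
```

On the sample data both options return a and c. They differ when t_a.c1 itself is NULL: Option A drops such a row (as long as the subquery returns at least one row), while Option B keeps it.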

Method 2: use LEFT JOIN and discard the rows that matched
This form is harder to read.

select a.* from t_a a left join t_b b
on a.c1=b.c1
where b.c1 is null;

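Note that in this left join the predicate b.c1 is null cannot be pushed down below the join. It tests the NULLs that the outer join itself produces for unmatched rows, so it only has meaning after the join has run.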
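To see why the filter has to stay above the join, it can help to look at what happens if it is moved by hand. The two queries below are contrast cases only, not valid anti joins; they use the same t_a and t_b as above.

```sql
-- Contrast 1: the predicate is moved into the ON clause.
-- "a.c1 = b.c1 and b.c1 is null" can never be true, so no row matches
-- and the left join keeps every row of t_a.
select a.* from t_a a left join t_b b
on a.c1 = b.c1 and b.c1 is null;
-- returns a, b, c

-- Contrast 2: the predicate is applied to t_b before the join.
-- The subquery is empty for the sample data, so again nothing matches.
select a.* from t_a a
left join (select c1 from t_b where c1 is null) b
on a.c1 = b.c1;
-- returns a, b, c
```

Both variants return b as well, which is exactly the row the anti join is supposed to remove.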

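The execution plan generated for the left join query with where b.c1 is null is shown below. Note that _col1 is null is filtered only after the join. For rows that matched, _col1 is necessarily not null, so every matched row is removed.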

plan optimized by cbo.

vertex dependency in root stage
map 1 <- map 2 (broadcast_edge)

stage-0
  fetch operator
    limit:-1
    stage-1
      map 1 vectorized
      file output operator [fs_20]
        select operator [sel_19] (rows=1 width=93)
          output:["_col0"]
          filter operator [fil_18] (rows=1 width=93)
            predicate:_col1 is null
            map join operator [mapjoin_17] (rows=2 width=93)
              conds:sel_16._col0=rs_15._col0(left outer),output:["_col0","_col1"]
            <-map 2 [broadcast_edge] vectorized
              broadcast [rs_15]
                partitioncols:_col0
                select operator [sel_14] (rows=2 width=85)
                  output:["_col0"]
                  tablescan [ts_2] (rows=2 width=85)
                    ods@t_b,b,tbl:complete,col:none,output:["c1"]
            <-select operator [sel_16] (rows=2 width=85)
                output:["_col0"]
                tablescan [ts_0] (rows=2 width=85)
                  ods@t_a,a,tbl:complete,col:none,output:["c1"]

time taken: 0.159 seconds, fetched: 29 row(s)
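A plan like the one above can be obtained by prefixing the query with EXPLAIN. The exact layout depends on the execution engine and settings; the vertex and broadcast-edge lines suggest Tez, but that is an inference from the output rather than something the text states.

```sql
-- Print the execution plan instead of running the query.
explain
select a.* from t_a a left join t_b b
on a.c1=b.c1
where b.c1 is null;
```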

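Method 3: EXCEPT
This form runs relatively slowly. It is also a poor fit when each table has several columns but the comparison should use only a few of them, because EXCEPT compares entire rows.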

select * from t_a except select * from t_b;
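The multi-column limitation can be illustrated with a small sketch. The tables t_a2 and t_b2 and their column c2 are hypothetical and introduced only for this example; the sketch also assumes a Hive version that supports EXCEPT.

```sql
-- Hypothetical two-column tables, used only for illustration.
create table t_a2(c1 string, c2 string);
insert into t_a2 values('a','x'),('b','y');

create table t_b2(c1 string, c2 string);
insert into t_b2 values('b','z');

-- EXCEPT compares whole rows: ('b','y') differs from ('b','z'),
-- so the row with c1 = 'b' is NOT removed.
select * from t_a2 except select * from t_b2;
-- returns ('a','x') and ('b','y')

-- Restricting both sides to the key column gives the anti join on c1,
-- but the result then contains only c1, not the other columns.
select c1 from t_a2 except select c1 from t_b2;
-- returns 'a'
```

Plain EXCEPT also removes duplicate rows from its result, so it is not a drop-in replacement for methods 1 and 2 when t_a contains repeated rows that should be preserved.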


July 28, 2024 • Big Data