Hudi Series 13: Integrating Hudi with Hive

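1. Overview of Hudi and Hive integration

A Hudi source table corresponds to one set of data files on HDFS. Using Spark, Flink, or the Hudi CLI, you can map a Hudi table's data to a Hive external table. Through that external table, Hive can run real-time view queries, read-optimized view queries, and incremental queries.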

2. Steps for integrating Hudi with Hive

This walkthrough uses Hive 3.1.2 and Hudi 0.12.0.

2.1 Copy the JAR files

2.1.1 Copy the compiled Hudi JARs

Put hudi-hive-sync-bundle-0.12.0.jar and hudi-hadoop-mr-bundle-0.12.0.jar into the lib directory on the Hive node:

cd /home/hudi-0.12.0/packaging/hudi-hive-sync-bundle/target
cp ./hudi-hive-sync-bundle-0.12.0.jar /home/apache-hive-3.1.2-bin/lib/
cd /home/hudi-0.12.0/packaging/hudi-hadoop-mr-bundle/target
cp ./hudi-hadoop-mr-bundle-0.12.0.jar /home/apache-hive-3.1.2-bin/lib/

2.1.2 Copy the Hive JARs to the Flink lib directory

Copy the following JARs from Hive's lib directory to Flink's lib directory:

cd $HIVE_HOME/lib
cp ./hive-exec-3.1.2.jar $FLINK_HOME/lib/
cp ./libfb303-0.9.3.jar $FLINK_HOME/lib/

See the Flink Hive connector documentation: https://nightlies.apache.org/flink/flink-docs-release-1.15/zh/docs/connectors/table/hive/overview/

2.1.3 JARs for connecting Flink and Flink SQL to Hive

wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-hive_2.12/1.14.5/flink-connector-hive_2.12-1.14.5.jar
wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-hive-3.1.2_2.12/1.14.5/flink-sql-connector-hive-3.1.2_2.12-1.14.5.jar

2.2 Restart Hive

After copying the JARs, restart Hive.

2.3 Access Hive tables from Flink

2.3.1 Start the Flink SQL client

# Start a YARN session (as a non-root user)
/home/flink-1.14.5/bin/yarn-session.sh -d  2>&1 &

# Start the Flink SQL client in YARN session mode
/home/flink-1.14.5/bin/sql-client.sh embedded -s yarn-session

2.3.2 Create a Hive catalog

create catalog hive_catalog with (
    'type' = 'hive',
    'default-database' = 'default',
    'hive-conf-dir' = '/home/apache-hive-3.1.2-bin/conf'
);

2.3.3 Switch to the catalog

use catalog hive_catalog;
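
To confirm the switch took effect before going further, the standard Flink SQL SHOW statements can be run in the same session. This is only a quick sanity check; the comments describe what one would expect to see, not captured output:

```sql
-- Run in the Flink SQL client after "use catalog hive_catalog;"
show catalogs;          -- hive_catalog should be listed next to Flink's built-in default_catalog
show current catalog;   -- should now report hive_catalog
show databases;         -- databases registered in the Hive metastore
```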

2.3.4 Query a Hive table

use test;
show tables;
-- Flink can read Hive tables directly
select * from t1;

2.4 Sync from Flink to Hive

Flink Hive sync currently supports two sync modes: hms and jdbc. The hms mode only needs the metastore URIs; the jdbc mode needs both the JDBC properties and the metastore URIs.

Configuration template:
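
A minimal sketch of both modes, using the option names from the Hudi syncing_metastore documentation page (see References). The table names t_sync_demo_hms and t_sync_demo_jdbc, the column list, and every ${...} placeholder are illustrative assumptions and must be replaced with real values:

```sql
-- hms mode: besides the sync switch, only the metastore URIs are required
create table t_sync_demo_hms (              -- hypothetical table name
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = '${db_path}/t_sync_demo_hms',               -- placeholder path
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',                         -- turn on Hive sync
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://${ip}:9083'   -- placeholder host
);

-- jdbc mode: JDBC properties plus the metastore URIs
create table t_sync_demo_jdbc (             -- hypothetical table name
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = '${db_path}/t_sync_demo_jdbc',              -- placeholder path
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.mode' = 'jdbc',
  'hive_sync.metastore.uris' = 'thrift://${ip}:9083',
  'hive_sync.jdbc_url' = 'jdbc:hive2://${ip}:10000',   -- HiveServer2 endpoint
  'hive_sync.table' = '${table_name}',                 -- Hive table name
  'hive_sync.db' = '${db_name}',                       -- Hive database name
  'hive_sync.username' = '${user_name}',
  'hive_sync.password' = '${password}'
);
```

The examples in the following sections all use the hms mode.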

3. Hands-on examples (COW)

3.1 Create a Hudi table in memory (without a catalog)

Code:

-- Create the table
create table t_cow1 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_cow1',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_cow1',
  'hive_sync.db' = 'test',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf'
);

-- The sync to Hive is triggered only when data is written
insert into t_cow1 values (1,1,1);

Test record:
After the statements run in Flink SQL, a new table, t_cow1, appears in Hive's test database.

The data is then queried from the Hive side.
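
The Hive-side check can be sketched with ordinary Hive statements. The database and table names come from the example above; the comments describe what is expected rather than recorded output:

```sql
-- Run in the Hive CLI or Beeline, not in the Flink SQL client
use test;
show tables;             -- t_cow1 should be listed once the first insert has been committed
select * from t_cow1;    -- reads the Hudi files through the synced external table
```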

3.2 Create a Hudi table in a catalog

3.2.1 Table path outside the Hive warehouse directory

Code:

-- Create the catalog
create catalog hive_catalog with (
    'type' = 'hive',
    'default-database' = 'default',
    'hive-conf-dir' = '/home/apache-hive-3.1.2-bin/conf'
);

-- Switch to the catalog
use catalog hive_catalog;

use test;

-- Create the table
create table t_catalog_cow1 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_catalog_cow1',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_catalog_cow1',
  'hive_sync.db' = 'test',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf'
);

insert into t_catalog_cow1 values (1,1,1);

Test record:
The table is visible from Flink SQL.

Querying the data from Flink SQL also works.

On the Hive side the table is visible, but queries return no data.

Checking the CREATE TABLE statement on the Hive side reveals the problem:
after the COW table is synced from Hudi, the partition field is missing from the Hive table.
In other words, when a Hive catalog is used, a Hudi table created through Flink does not sync to Hive correctly.
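
The Hive-side inspection can be reproduced with standard Hive statements. This is a sketch of what to look at, not the author's recorded session:

```sql
-- Run in the Hive CLI or Beeline
use test;
show create table t_catalog_cow1;    -- check whether a PARTITIONED BY (num ...) clause is present
describe formatted t_catalog_cow1;   -- a partitioned table has a "# Partition Information" section
```

Comparing this output with the same statements run against t_cow1 from section 3.1 makes the missing partition column easy to spot.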

3.2.2 Table path inside the Hive warehouse directory

Code:

-- Create the catalog
create catalog hive_catalog with (
    'type' = 'hive',
    'default-database' = 'default',
    'hive-conf-dir' = '/home/apache-hive-3.1.2-bin/conf'
);

-- Switch to the catalog
use catalog hive_catalog;

use test;

-- Create the table
create table t_catalog_cow2 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/user/hive/warehouse/test.db/t_catalog_cow2',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_catalog_cow2',
  'hive_sync.db' = 'test',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf'
);

insert into t_catalog_cow2 values (1,1,1);

Test record:
The problem persists.

3.2.3 Specify the Hudi table partition through options

Code:

create table t_catalog_cow4 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_catalog_cow4',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_catalog_cow4',
  'hive_sync.db' = 'test',
  -- Java class names are case-sensitive
  'hoodie.datasource.write.keygenerator.class' = 'org.apache.hudi.keygen.ComplexAvroKeyGenerator',
  'hoodie.datasource.write.recordkey.field' = 'id',
  'hoodie.datasource.write.hive_style_partitioning' = 'true',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf',
  -- must name the table's partition column (num); the original listing had 'dt', which this table does not define
  'hive_sync.partition_fields' = 'num',
  'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.HiveStylePartitionValueExtractor'
);

insert into t_catalog_cow4 values (1,1,1);

Test record:

4. Hands-on examples (MOR)

4.1 Create a Hudi table in memory (without a catalog)

Code:

-- Create the table
create table t_mor1 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_mor1',
  'table.type' = 'MERGE_ON_READ',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_mor1',
  'hive_sync.db' = 'test',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf'
);

-- The sync to Hive is triggered only when data is written
-- Hive can only read the Parquet data; a MOR table does not produce Parquet files right away,
-- so insert a few more rows, or insert more rows with spark-sql
insert into t_mor1 values (1,1,1);

Test record:
HDFS:
There are only log files and no Parquet files.

insert into t_mor1 values (2,1,2);
insert into t_mor1 values (3,1,3);
insert into t_mor1 values (4,1,4);
insert into t_mor1 values (5,1,5);

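Flink web UI:

Several new tables appear:
t_mor1 is the Hudi table, which can be read and written through Flink.

t_mor1_ro and t_mor1_rt are Hive tables, which can be used from Hive and Spark.

Checking the data from Hive:
Because there are no Parquet files, no data is returned.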
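
For reference, a sketch of how the two synced tables are typically queried from Hive. The _ro table is the read-optimized view, which reads only Parquet base files; the _rt table is the real-time view, which is meant to merge base files with log files. The hive.input.format setting follows the Hudi documentation for Hive queries; whether the _rt query returns rows in this particular setup is not confirmed by the test record above:

```sql
-- Run in the Hive CLI or Beeline
use test;
set hive.input.format = org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

select * from t_mor1_ro;   -- read-optimized view: Parquet base files only
select * from t_mor1_rt;   -- real-time view: base files merged with log files
```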

Even after inserting a lot more test data, there are still only log files and no Parquet files.
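
In Hudi, the Parquet base files of a MOR table are produced by compaction, so the absence of Parquet files suggests that compaction never ran or never finished. One thing to try, as a sketch rather than a confirmed fix, is lowering the compaction trigger on the table. The table name t_mor_demo and its path are hypothetical; compaction.async.enabled, compaction.trigger.strategy, and compaction.delta_commits are Hudi Flink connector options, and whether they take effect for short single-row insert jobs like the ones above is an assumption that needs testing:

```sql
-- Run in the Flink SQL client
create table t_mor_demo (                    -- hypothetical table name
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_mor_demo',   -- hypothetical path
  'table.type' = 'MERGE_ON_READ',
  'compaction.async.enabled' = 'true',              -- run compaction inside the write job
  'compaction.trigger.strategy' = 'num_commits',    -- trigger by number of delta commits
  'compaction.delta_commits' = '1'                  -- compact after every delta commit
);

insert into t_mor_demo values (1,1,1);
```

After the insert, listing the table path on HDFS shows whether a Parquet file was written.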

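After logging out and back in:
The tables created earlier are no longer visible in the Flink SQL client.

On the Hive side, the tables still exist after logging out and back in.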
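
This matches how Flink handles tables created without a catalog: they are registered in the SQL client's default in-memory catalog, which lasts only as long as the session, while the Hudi data under the table path stays on HDFS. A sketch of re-attaching to the existing data in a new session, assuming the path is unchanged (the hive_sync options are omitted here for brevity):

```sql
-- New Flink SQL client session: re-declare the table over the existing Hudi path
create table t_mor1 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_mor1',   -- same path as in section 4.1
  'table.type' = 'MERGE_ON_READ'
);

select * from t_mor1;
```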

FAQ:

FAQ 1: NoClassDefFoundError ParquetInputFormat

Problem:
Querying the COW table in the Flink SQL client fails with:

[ERROR] Could not execute SQL statement. Reason:
java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/ParquetInputFormat

Solution:
Find the Parquet JAR that was used when Hudi was compiled and copy it to Flink's lib directory.

References:

https://hudi.apache.org/cn/docs/syncing_metastore/
https://dongkelun.com/2022/08/26/flinksqlclientqueryhive/
https://www.modb.pro/db/539792


August 4, 2024 • ar
