Configuring LZO Compression for Hadoop
I. Compilation
0. Environment preparation
- Maven (download and install it, set the environment variables, and add the Aliyun mirror to settings.xml; a sketch of the mirror entry follows this list)
- Install the following packages:
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
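A minimal sketch of the settings.xml mirror entry mentioned above (in ~/.m2/settings.xml, or conf/settings.xml under the Maven install), using the same Aliyun URL that the pom repository in 2.2 points at:
<!-- inside <mirrors>: route requests for central through the Aliyun mirror -->
<mirror>
<id>alimaven</id>
<mirrorOf>central</mirrorOf>
<name>Aliyun public mirror</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</mirror>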
1. Download, install, and compile LZO
- Download:
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
- Extract:
tar -zxvf lzo-2.10.tar.gz
- Configure, build, and install:
cd lzo-2.10
export CFLAGS=-m64
mkdir compile
./configure --enable-shared --prefix=/home/hadoop/source/lzo-source/lzo-2.10/compile/
make && make install
If the compile directory ends up with the three subdirectories include, lib, and share, the build succeeded.
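A quick sanity check of the install prefix used above:
ls /home/hadoop/source/lzo-source/lzo-2.10/compile/
# expected output: include  lib  share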
2. Compile the hadoop-lzo source
2.1 Download the hadoop-lzo source:
wget https://github.com/twitter/hadoop-lzo/archive/master.zip
2.2 After extracting, edit pom.xml under hadoop-lzo-master
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>2.6.0-cdh5.16.2</hadoop.current.version> <!-- change this to match your Hadoop version -->
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
Add a repository address for resolving the CDH dependencies:
<repository>
<id>alimaven</id>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</repository>
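The Aliyun public group may not carry the CDH-flavored Hadoop artifacts; if resolution fails, adding Cloudera's own repository alongside it should help (URL as published by Cloudera):
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>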
2.3 Declare a few temporary environment variables
cd hadoop-lzo-master/
export CFLAGS=-m64
export CXXFLAGS=-m64
export C_INCLUDE_PATH=/home/hadoop/source/lzo-source/lzo-2.10/compile/include/ # the include directory of the LZO build from step 1
export LIBRARY_PATH=/home/hadoop/source/lzo-source/lzo-2.10/compile/lib/ # the lib directory of the LZO build from step 1
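Before building, it is worth confirming that these paths really contain the LZO headers and libraries (LZO installs its headers under an lzo/ subdirectory):
ls $C_INCLUDE_PATH/lzo/lzo1x.h
ls $LIBRARY_PATH/liblzo2.*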
2.4 Run the build
From hadoop-lzo-master, run the Maven build:
mvn clean package -Dmaven.test.skip=true
BUILD SUCCESS in the output means the build worked.
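The artifacts collected in the next step should now exist under target/ (jar name follows the version in this build):
ls target/hadoop-lzo-0.4.21-SNAPSHOT.jar
ls target/native/linux-amd64-64/lib/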
2.5 Copy the jar and native libraries
[ruoze@rzdata001 hadoop-lzo-master]$ pwd
/home/ruoze/source/lzo-source/hadoop-lzo-master
[ruoze@rzdata001 hadoop-lzo-master]$ cd target/native/linux-amd64-64/
[ruoze@rzdata001 linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
./
./libgplcompression.so.0.0.0
./libgplcompression.so.0
./libgplcompression.so
./libgplcompression.la
./libgplcompression.a
[ruoze@rzdata001 linux-amd64-64]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/
[ruoze@rzdata001 linux-amd64-64]$
[ruoze@rzdata001 linux-amd64-64]$ cd ..
[ruoze@rzdata001 native]$ cd ..
[ruoze@rzdata001 target]$ cd ..
[ruoze@rzdata001 hadoop-lzo-master]$
[ruoze@rzdata001 hadoop-lzo-master]$ mkdir $HADOOP_HOME/extlib
[ruoze@rzdata001 hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
[ruoze@rzdata001 hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/mapreduce/lib
[ruoze@rzdata001 hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/extlib
This places hadoop-lzo-0.4.21-SNAPSHOT.jar from target/ onto Hadoop's classpath,
e.g. ${HADOOP_HOME}/share/hadoop/common.
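To confirm the jar is actually visible to Hadoop, listing the expanded runtime classpath is a quick check (--glob expands the wildcard entries; it is assumed to be available on this Hadoop version):
hadoop classpath --glob | tr ':' '\n' | grep -i lzo
# expect the hadoop-lzo-0.4.21-SNAPSHOT.jar copies made above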
2.6 Edit hadoop-env.sh
vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# add the lib directory of the LZO build from step 1
export LD_LIBRARY_PATH=/home/hadoop/source/lzo-source/lzo-2.10/compile/lib
2.7 Edit core-site.xml
Add the configuration that enables LZO compression:
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
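After the restart in 2.9, a quick way to confirm the codecs are registered is to stream an .lzo file through hadoop fs -text, which decompresses via the matching codec (the path below is the sample file uploaded in 3.2):
hadoop fs -text /user/ruoze/data/lzo/input/cdn.log.lzo | head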
2.8 Edit mapred-site.xml
The /home/ruoze/app/hadoop/extlib directory was already created in 2.5:
<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/home/ruoze/app/hadoop/extlib</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
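Note that mapred.child.env is the legacy MRv1-era name; on YARN the per-task properties below are the documented equivalents and can be used instead (shown as an alternative, not a required change):
<property>
<name>mapreduce.map.env</name>
<value>LD_LIBRARY_PATH=/home/ruoze/app/hadoop/extlib</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>LD_LIBRARY_PATH=/home/ruoze/app/hadoop/extlib</value>
</property>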
2.9 Restart and check the cluster
Restart:
sbin/stop-all.sh
sbin/start-all.sh
Check:
jps to inspect the daemons
Web UI: http://hadoop102:50070
hadoop fs commands to verify HDFS access
2.10 If startup fails:
Check the logs under /home/atguigu/module/hadoop-2.7.2/logs.
If HDFS is stuck in safe mode, leave it with hdfs dfsadmin -safemode leave.
As a last resort, stop all processes, delete the data and log directories, then re-run hdfs namenode -format.
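A minimal recovery sequence, assuming the data and log directories live under $HADOOP_HOME (this destroys all HDFS data, so it is only for a scratch cluster):
hdfs dfsadmin -safemode leave   # usually enough on its own
sbin/stop-all.sh                # last resort from here on
rm -rf $HADOOP_HOME/data $HADOOP_HOME/logs
hdfs namenode -format
sbin/start-all.sh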
3. Verification
3.1 Compress job output with LZO
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
/user/ruoze/data/wordcount/special/in \
/user/ruoze/data/wordcount/special/out9
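If the job is configured correctly, the output files should carry the .lzo extension (LzopCodec's default):
hadoop fs -ls /user/ruoze/data/wordcount/special/out9
# expect part files such as part-r-00000.lzo alongside _SUCCESS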
3.2 Use LZO-compressed input
3.2.1 Install lzop on Linux
yum install lzop
3.2.2 Prepare the data
[ruoze@rzdata001 sfile]$ ll
-rw-rw-r-- 1 ruoze ruoze 1349080023 Apr 11 09:13 cdn.log
[ruoze@rzdata001 sfile]$ lzop -v cdn.log
compressing cdn.log into cdn.log.lzo
[ruoze@rzdata001 sfile]$ ls -lh
-rw-rw-r-- 1 ruoze ruoze 1.3G Apr 11 09:13 cdn.log
-rw-rw-r-- 1 ruoze ruoze 305M Apr 11 09:13 cdn.log.lzo
[ruoze@rzdata001 sfile]$
[ruoze@rzdata001 sfile]$ hadoop fs -put cdn.log.lzo /user/ruoze/data/lzo/input
[ruoze@rzdata001 sfile]$
3.2.3 Without an index
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount \
/user/ruoze/data/lzo/input/cdn.log.lzo \
/user/ruoze/data/lzo/output1
In the job output, number of splits:1 shows that the LZO file was not split:
20/04/11 09:27:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/04/11 09:27:41 INFO input.FileInputFormat: Total input paths to process : 1
20/04/11 09:27:41 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
20/04/11 09:27:41 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 5dbdddb8cfb544e58b4e0b9664b9d1b66657faf5]
20/04/11 09:27:42 INFO mapreduce.JobSubmitter: number of splits:1
20/04/11 09:27:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1586566801628_0001
20/04/11 09:27:43 INFO impl.YarnClientImpl: Submitted application application_1586566801628_0001
3.2.4 Build the index
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/user/ruoze/data/lzo/input/cdn.log.lzo
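hadoop-lzo also ships a single-process indexer, com.hadoop.compression.lzo.LzoIndexer, which builds the same index without launching a MapReduce job; for a single moderately sized file it can be the simpler option:
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.LzoIndexer \
/user/ruoze/data/lzo/input/cdn.log.lzo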
An index file now sits next to the data:
[ruoze@rzdata001 sfile]$ hadoop fs -ls /user/ruoze/data/lzo/input
-rw-r--r-- 1 ruoze supergroup 319220023 2020-04-11 09:23 /user/ruoze/data/lzo/input/cdn.log.lzo
-rw-r--r-- 1 ruoze supergroup 41176 2020-04-11 09:57 /user/ruoze/data/lzo/input/cdn.log.lzo.index
[ruoze@rzdata001 sfile]$
3.2.5 Run against the indexed LZO file
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount \
/user/ruoze/data/lzo/input/cdn.log.lzo \
/user/ruoze/data/lzo/output2
The result is the same: still no splitting. Building the index alone is not enough; the job must also set its input format to LzoTextInputFormat,
otherwise the index file is treated as just another input file and a single map still processes everything.
Add -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat when submitting:
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount \
-Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
/user/ruoze/data/lzo/input/cdn.log.lzo \
/user/ruoze/data/lzo/output2
The job output shows that the 305 MB LZO file is now split across 3 map tasks:
20/04/11 10:02:59 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/04/11 10:02:59 INFO input.FileInputFormat: Total input paths to process : 1
20/04/11 10:03:00 INFO mapreduce.JobSubmitter: number of splits:3
20/04/11 10:03:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1586566801628_0004
20/04/11 10:03:00 INFO impl.YarnClientImpl: Submitted application application_1586566801628_0004