I. The Problem
Environment: JDK 1.8, Spark 3.2.1. Reading a GB18030-encoded file from Hadoop produces garbled text.
II. The Painful Journey
To solve the problem I tried many approaches, none of which worked.
1. textFile + Configuration: garbled
String filePath = "hdfs:///user/test.deflate";
String encoding = "GB18030";
// Create the SparkSession and SparkContext instances
SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("Spark example")
        .getOrCreate();
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
Configuration entries = sc.hadoopConfiguration();
entries.set("textinputformat.record.delimiter", "\n");
entries.set("mapreduce.input.fileinputformat.inputdir", filePath);
entries.set("mapreduce.input.fileinputformat.encoding", "GB18030");
JavaRDD<String> rdd = sc.textFile(filePath);
2. spark.read().option: garbled
Dataset<Row> load = spark.read().format("text").option("encoding", "GB18030").load(filePath);
load.foreach(row -> {
    System.out.println(row.toString());
    System.out.println(new String(row.toString().getBytes(encoding), "UTF-8"));
    System.out.println(new String(row.toString().getBytes(encoding), "GBK"));
});
3. newAPIHadoopFile + Configuration: garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD =
        sc.newAPIHadoopFile(filePath, TextInputFormat.class, LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
4. newAPIHadoopFile + a custom InputFormat: garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD =
        sc.newAPIHadoopFile(filePath, GbkInputFormat.class, LongWritable.class, Text.class, entries);
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(k._2);
});
GbkInputFormat.class here is a copy of TextInputFormat.class with its internal UTF-8 references changed to GB18030.
5. newAPIHadoopRDD + a custom InputFormat: garbled
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 =
        sc.newAPIHadoopRDD(entries, GbkInputFormat.class, LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD1 count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    System.out.println(k._2());
});
III. The Final Solution
None of the charsets specified above seemed to take effect, and I honestly do not know why; if anyone understands the reason, please enlighten me, thanks. (A likely explanation: Hadoop's Text class stores the raw bytes, but its toString() always decodes them as UTF-8, and sc.textFile() produces its Strings through that same UTF-8 path, so by the time you hold a String the misdecoding has already happened.)
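The failure mode can be reproduced without Spark or Hadoop at all. A minimal plain-Java sketch (the sample text is made up for illustration): decoding GB18030 bytes as UTF-8 yields mojibake, while decoding the very same raw bytes with GB18030 recovers the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbDecodeDemo {
    public static void main(String[] args) {
        String original = "中文编码测试";
        // The raw bytes as they would sit in the HDFS file
        byte[] raw = original.getBytes(Charset.forName("GB18030"));

        // What the UTF-8 decode path effectively does: mojibake
        String wrong = new String(raw, StandardCharsets.UTF_8);
        System.out.println(wrong);

        // Decoding the same raw bytes with GB18030 recovers the text
        String right = new String(raw, Charset.forName("GB18030"));
        System.out.println(right);
    }
}
```

Once the bytes have gone through the wrong decoder, re-encoding the resulting String (as attempt 2 tries with getBytes(encoding)) cannot restore them, because the invalid sequences were already replaced.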
The final working solution is as follows:
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD =
        sc.newAPIHadoopFile(filePath, TextInputFormat.class, LongWritable.class, Text.class, new Configuration());
System.out.println("longWritableTextJavaPairRDD count = " + longWritableTextJavaPairRDD.count());
longWritableTextJavaPairRDD.foreach(k -> {
    System.out.println(new String(k._2.copyBytes(), encoding));
});
JavaPairRDD<LongWritable, Text> longWritableTextJavaPairRDD1 =
        sc.newAPIHadoopRDD(entries, TextInputFormat.class, LongWritable.class, Text.class);
System.out.println("longWritableTextJavaPairRDD1 count = " + longWritableTextJavaPairRDD1.count());
longWritableTextJavaPairRDD1.foreach(k -> {
    // k._2() and k._2 are equivalent Tuple2 accessors
    System.out.println(new String(k._2().copyBytes(), encoding));
});
The key is new String(k._2().copyBytes(), encoding): instead of letting Text.toString() decode the bytes as UTF-8, take the raw bytes out of the Text value and decode them with the correct charset yourself.
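The same decode-the-raw-bytes-yourself principle applies to ordinary file I/O. A self-contained sketch (file name and sample lines are invented for the demo) that writes a GB18030 file and reads it back with an InputStreamReader pinned to the right charset, rather than the platform default:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class Gb18030FileRead {
    public static void main(String[] args) throws IOException {
        Charset gb18030 = Charset.forName("GB18030");

        // Simulate the HDFS file with a local GB18030-encoded temp file
        Path p = Files.createTempFile("gb18030-demo", ".txt");
        Files.write(p, "第一行\n第二行\n".getBytes(gb18030));

        // Pin the charset on the reader so each line decodes correctly
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(Files.newInputStream(p), gb18030))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
        Files.delete(p);
    }
}
```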