项目场景和问题描述:
fllink 批任务运行一段时间后出现如下错误:
java.nio.file.nosuchfileexception: /tmp/flink-netty-shuffle-c5222ebc-a7bb-4fa1-bfd2-c7b5c9bd9b67/3740ddaa0f56ec8bcce80927e4a05443.channel.shuffle.data
详细信息如下:
2024-07-18 14:27:54
org.apache.flink.runtime.jobexception: recovery is suppressed by norestartbackofftimestrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.executionfailurehandler.handlefailure(executionfailurehandler.java:139)
at org.apache.flink.runtime.executiongraph.failover.flip1.executionfailurehandler.getfailurehandlingresult(executionfailurehandler.java:83)
at org.apache.flink.runtime.scheduler.defaultscheduler.recordtaskfailure(defaultscheduler.java:256)
at org.apache.flink.runtime.scheduler.defaultscheduler.handletaskfailure(defaultscheduler.java:247)
at org.apache.flink.runtime.scheduler.defaultscheduler.ontaskfailed(defaultscheduler.java:240)
at org.apache.flink.runtime.scheduler.schedulerbase.ontaskexecutionstateupdate(schedulerbase.java:738)
at org.apache.flink.runtime.scheduler.schedulerbase.updatetaskexecutionstate(schedulerbase.java:715)
at org.apache.flink.runtime.scheduler.schedulerng.updatetaskexecutionstate(schedulerng.java:78)
at org.apache.flink.runtime.jobmaster.jobmaster.updatetaskexecutionstate(jobmaster.java:477)
at sun.reflect.generatedmethodaccessor91.invoke(unknown source)
at sun.reflect.delegatingmethodaccessorimpl.invoke(delegatingmethodaccessorimpl.java:43)
at java.lang.reflect.method.invoke(method.java:498)
at org.apache.flink.runtime.rpc.akka.akkarpcactor.lambda$handlerpcinvocation$1(akkarpcactor.java:309)
at org.apache.flink.runtime.concurrent.akka.classloadingutils.runwithcontextclassloader(classloadingutils.java:83)
at org.apache.flink.runtime.rpc.akka.akkarpcactor.handlerpcinvocation(akkarpcactor.java:307)
at org.apache.flink.runtime.rpc.akka.akkarpcactor.handlerpcmessage(akkarpcactor.java:222)
at org.apache.flink.runtime.rpc.akka.fencedakkarpcactor.handlerpcmessage(fencedakkarpcactor.java:84)
at org.apache.flink.runtime.rpc.akka.akkarpcactor.handlemessage(akkarpcactor.java:168)
at akka.japi.pf.unitcasestatement.apply(casestatements.scala:24)
at akka.japi.pf.unitcasestatement.apply(casestatements.scala:20)
at scala.partialfunction.applyorelse(partialfunction.scala:123)
at scala.partialfunction.applyorelse$(partialfunction.scala:122)
at akka.japi.pf.unitcasestatement.applyorelse(casestatements.scala:20)
at scala.partialfunction$orelse.applyorelse(partialfunction.scala:171)
at scala.partialfunction$orelse.applyorelse(partialfunction.scala:172)
at scala.partialfunction$orelse.applyorelse(partialfunction.scala:172)
at akka.actor.actor.aroundreceive(actor.scala:537)
at akka.actor.actor.aroundreceive$(actor.scala:535)
at akka.actor.abstractactor.aroundreceive(abstractactor.scala:220)
at akka.actor.actorcell.receivemessage(actorcell.scala:580)
at akka.actor.actorcell.invoke(actorcell.scala:548)
at akka.dispatch.mailbox.processmailbox(mailbox.scala:270)
at akka.dispatch.mailbox.run(mailbox.scala:231)
at akka.dispatch.mailbox.exec(mailbox.scala:243)
at java.util.concurrent.forkjointask.doexec(forkjointask.java:289)
at java.util.concurrent.forkjoinpool$workqueue.runtask(forkjoinpool.java:1056)
at java.util.concurrent.forkjoinpool.runworker(forkjoinpool.java:1692)
at java.util.concurrent.forkjoinworkerthread.run(forkjoinworkerthread.java:157)
caused by: java.io.ioexception: failed to create file writer.
at org.apache.flink.runtime.io.network.partition.sortmergeresultpartition.setupinternal(sortmergeresultpartition.java:185)
at org.apache.flink.runtime.io.network.partition.resultpartition.setup(resultpartition.java:161)
at org.apache.flink.runtime.taskmanager.task.setuppartitionsandgates(task.java:946)
at org.apache.flink.runtime.taskmanager.task.dorun(task.java:639)
at org.apache.flink.runtime.taskmanager.task.run(task.java:550)
at java.lang.thread.run(thread.java:748)
caused by: java.nio.file.nosuchfileexception: /tmp/flink-netty-shuffle-45d4cbd5-f22e-47d0-98ff-7f11dd98a0b7/07391f7b184c5f8b8e234abee6649daa.channel.shuffle.data
at sun.nio.fs.unixexception.translatetoioexception(unixexception.java:86)
at sun.nio.fs.unixexception.rethrowasioexception(unixexception.java:102)
at sun.nio.fs.unixexception.rethrowasioexception(unixexception.java:107)
at sun.nio.fs.unixfilesystemprovider.newfilechannel(unixfilesystemprovider.java:177)
at java.nio.channels.filechannel.open(filechannel.java:287)
at java.nio.channels.filechannel.open(filechannel.java:335)
at org.apache.flink.runtime.io.network.partition.partitionedfilewriter.openfilechannel(partitionedfilewriter.java:133)
at org.apache.flink.runtime.io.network.partition.partitionedfilewriter.<init>(partitionedfilewriter.java:121)
at org.apache.flink.runtime.io.network.partition.sortmergeresultpartition.setupinternal(sortmergeresultpartition.java:182)
... 5 more
原因分析:
flink任务按batch模式执行时,中间结果数据回进行落地,默认放在/tmp/目录下。因/tmp目录linux系统一般默认每10天进行定期删除。故如果shuffle目录被删除后,再执行任务就会报java.nio.file.nosuchfileexception: /tmp/flink-netty-shuffle-c5222ebc-a7bb-4fa1-bfd2-c7b5c9bd9b67/3740ddaa0f56ec8bcce80927e4a05443.channel.shuffle.data
的错误。
解决方案:
taskmanager.tmp.dirs:这个配置项允许你设置 taskmanager 使用的一个或多个临时文件目录列表。flink 会在这些目录之间均匀分配临时文件。
修改flink flink-conf.yaml配置文件,配置taskmanager.tmp.dirs 文件目录,避免使用/tmp目录,如:taskmanager.tmp.dirs: /opt/flink_shuffle/
发表评论