MR job failures caused by the dfs.datanode.max.xcievers parameter

A production job that had been running fine started failing with the following error:

Error: java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2207)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1439)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1361)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)

Seeing this exception, my first guess was that a file write had been interrupted mid-stream. Digging into the application logs turned up more detail:

2017-01-19 14:15:48,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1479462601457_166866_m_000003_1: Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/infoc/app/20170118_22/_temporary/1/_temporary/attempt_1479462601457_166866_m_000003_1/app_feed/app/20170117/00/part-m-00003 (inode 136826430): File does not exist. [Lease. Holder: DFSClient_attempt_1479462601457_166866_m_000003_1_9377521_1, pendingcreates: 2]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3561)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3358)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3214)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:663)

Sure enough, the cause was too many files being written concurrently. This job's map task is a log splitter: it partitions log records and writes them out to many different HDFS directories, so each map attempt keeps a large number of output streams open at the same time. The LeaseExpiredException above is a symptom rather than the root cause: the NameNode no longer held a write lease on that file for this client, which is what you typically see after a write pipeline collapses.
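The load is easy to underestimate: every open output stream occupies a write pipeline across `replication` DataNodes, so per-DataNode xceiver demand grows multiplicatively with maps and output files. A rough back-of-the-envelope sketch (all numbers below are illustrative assumptions, not values from this cluster):

```python
# Rough estimate of concurrent DataNode xceiver threads a job demands when
# each map task keeps many HDFS output streams open.
# All figures here are illustrative assumptions, not real cluster values.

def xceivers_per_datanode(maps, streams_per_map, replication, datanodes):
    """Each open stream holds one write pipeline spanning `replication`
    DataNodes; assuming writes spread evenly, estimate the per-node load."""
    total_pipelines = maps * streams_per_map * replication
    return total_pipelines / datanodes

# e.g. 200 concurrent maps x 30 output files each x 3-way replication,
# spread over 40 DataNodes:
load = xceivers_per_datanode(maps=200, streams_per_map=30, replication=3,
                             datanodes=40)
print(load)  # 450.0
```

Note that this one job's 450 threads per node share the xceiver pool with block reads, other jobs, and balancer traffic, so spikes past a 4096 limit are easier to hit than the single-job number suggests.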

Searching online pointed to the dfs.datanode.max.xcievers parameter being set too low, and the DataNode logs indeed confirmed it:

2017-01-19 10:37:49,511 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: shd-datanode-25:50010:DataXceiverServer:
java.io.IOException: Xceiver count 4098 exceeds the limit of concurrent xcievers: 4096
at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
at java.lang.Thread.run(Thread.java:745)
2017-01-19 10:37:49,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 788ms (threshold=300ms)
2017-01-19 10:37:49,556 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: shd-datanode-25:50010:DataXceiverServer:
java.io.IOException: Xceiver count 4098 exceeds the limit of concurrent xcievers: 4096
at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
at java.lang.Thread.run(Thread.java:745)

The fix was to raise the limit in hdfs-site.xml:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>

After restarting the DataNodes so the new value took effect, the job ran cleanly.
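One note for newer clusters: in Hadoop 2.x this property was renamed, and `dfs.datanode.max.xcievers` (note the historical misspelling) survives only as a deprecated alias for `dfs.datanode.max.transfer.threads`. The equivalent setting under the current name would be:

```xml
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
```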
