distcp from Hadoop to S3 fails: "No space available in any of the local directories"

Date: 2023-02-11 23:07:17

I'm trying to copy data from a local hadoop cluster to an S3 bucket using distcp.

Sometimes it "works", but some of the mappers fail with the stack trace below. Other times, so many mappers fail that the whole job is cancelled.

The error "No space available in any of the local directories." doesn't make sense to me. There is PLENTY of space on the edge node (where the distcp command is running), on the cluster, and in the S3 bucket.

Can anyone shed some light on this?

16/06/16 15:48:08 INFO mapreduce.Job: The url to track the job: <url>
16/06/16 15:48:08 INFO tools.DistCp: DistCp job-id: job_1465943812607_0208
16/06/16 15:48:08 INFO mapreduce.Job: Running job: job_1465943812607_0208
16/06/16 15:48:16 INFO mapreduce.Job: Job job_1465943812607_0208 running in uber mode : false
16/06/16 15:48:16 INFO mapreduce.Job:  map 0% reduce 0%
16/06/16 15:48:23 INFO mapreduce.Job:  map 33% reduce 0%
16/06/16 15:48:26 INFO mapreduce.Job: Task Id : attempt_1465943812607_0208_m_000001_0, Status : FAILED
Error: java.io.IOException: File copy failed: hdfs://<hdfs path>/000000_0 --> s3n://<bucket>/<s3 path>/000000_0
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:285)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:253)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://<hdfs path>/000000_0 to s3n://<bucket>/<s3 path>/000000_0
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:281)
        ... 10 more
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: No space available in any of the local directories.
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:366)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
        at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.newBackupFile(NativeS3FileSystem.java:263)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.<init>(NativeS3FileSystem.java:245)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:412)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:986)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:174)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:123)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
        ... 11 more

2 Answers

#1 (1 vote)

We ran into a similar exception while trying to save the results of an Apache Spark (version 1.5.2) run directly to S3; the exception was exactly the same. I'm not really sure what the core issue is - somehow the S3 upload doesn't seem to "play nice" with Hadoop's LocalDirAllocator class (version 2.7).

What finally solved it for us was a combination of:

  1. enabling S3A's "fast upload" by setting "fs.s3a.fast.upload" to "true" in the Hadoop configuration. This uses S3AFastOutputStream instead of S3AOutputStream and uploads data directly from memory, instead of first allocating local storage

  2. merging the results of the job into a single part before saving to S3 (in Spark that's called repartitioning/coalescing) - see the sketch after this list

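Putting both points together, a minimal Spark (Scala) sketch might look like this. It is only an illustration under the assumptions above (Spark 1.5.x with the hadoop-aws/S3A jars on the classpath); the application name is made up and the paths reuse the placeholders from the question:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3a-fast-upload-example"))

// 1. enable the (experimental in Hadoop 2.7) in-memory fast upload for the s3a connector
sc.hadoopConfiguration.set("fs.s3a.fast.upload", "true")

// 2. merge the output into a single part, then write through s3a:// rather than s3n://
val results = sc.textFile("hdfs://<hdfs path>/000000_0")
results.coalesce(1).saveAsTextFile("s3a://<bucket>/<s3 path>/output")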

Some caveats though:

  1. S3's fast upload is apparently marked "experimental" in Hadoop 2.7

  2. this workaround only applies to the newer s3a filesystem ("s3a://..."); it won't work for the older "native" s3n filesystem ("s3n://...")

Hope this helps.

#2 (0 votes)

Ideally you should use s3a rather than s3n, as s3n is deprecated.

With s3a, there is a parameter:

<property>
  <name>fs.s3a.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3a</value>
  <description>Comma separated list of directories that will be used to buffer file
uploads to. No effect if fs.s3a.fast.upload is true.</description>
</property>

When you get the local file error, it is most likely because the buffer directory has no free space.

While you can change this setting to point at a directory with more space, a better solution may be to set (again in S3a):

fs.s3a.fast.upload=true

This avoids buffering the data on local disk and should actually be faster too.

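If you are running distcp rather than Spark, the same property can be passed on the command line as a generic -D option; here is a rough sketch reusing the placeholder paths from the question:

hadoop distcp \
  -Dfs.s3a.fast.upload=true \
  hdfs://<hdfs path>/ \
  s3a://<bucket>/<s3 path>/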

The S3n buffer directory parameter should be:

fs.s3.buffer.dir

So if you stick with s3n, make sure that buffer directory has plenty of space, and that should hopefully resolve this issue.

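For example, a core-site.xml entry along these lines would point the s3n buffer at a volume with enough room (the path is just a placeholder):

<property>
  <name>fs.s3.buffer.dir</name>
  <value>/path/to/dir/with/plenty/of/space</value>
</property>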
