Writing to BigQuery from Dataflow - JSON files are not deleted after the job completes

Time: 2022-01-02 14:21:41

One of our Dataflow jobs writes its output to BigQuery. My understanding of how this is implemented under the hood is that Dataflow actually writes the results (sharded) to GCS in JSON format, and then kicks off a BigQuery load job to import that data.

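For anyone unfamiliar with that flow, it is roughly the same as doing the following by hand with gsutil and the bq CLI (just a sketch; the bucket, dataset, table, and file names are placeholders, and Dataflow picks its own shard names under its temp location):

# stage sharded newline-delimited JSON in GCS (Dataflow does this for you)
gsutil -m cp results-shard-*.json gs://<bucket>/tmp/

# kick off a BigQuery load job over those staged shards
bq load --source_format=NEWLINE_DELIMITED_JSON \
  <dataset>.<table> \
  "gs://<bucket>/tmp/results-shard-*.json" \
  ./schema.json

The leftover files this question is about are the staged JSON shards from the first step.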

However, we've noticed that some JSON files are not deleted after the job regardless of whether it succeeds or fails. There is no warning or suggestion in the error message that the files will not be deleted. When we noticed this, we had a look at our bucket and it had hundreds of large JSON files from failed jobs (mostly during development).


I would have thought that Dataflow should handle any cleanup, even if the job fails, and when it succeeds those files should definitely be deleted. Leaving these files around after the job has finished incurs significant storage costs!

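For reference, a quick way to see how much the leftovers add up to is to measure them with gsutil (a sketch; <bucket> is a placeholder, and this assumes the stray temp files all end in .json):

gsutil du -sh gs://<bucket>                       # total size of the bucket
gsutil du -ch "gs://<bucket>/**.json" | tail -1   # combined size of the JSON shards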

Is this a bug?


Example job id of a job that "succeeded" but left hundreds of large files in GCS: 2015-05-27_18_21_21-8377993823053896089


3 Answers

#1 (5 votes)

Because this is still happening, we decided to clean up after ourselves once the pipeline has finished executing. We run the following command to delete everything that is not a JAR or ZIP:


# keep only objects that do not end in .zip or .jar, then delete them recursively
gsutil ls -p <project_id> gs://<bucket> | grep -vE '\.(zip|jar)$' | xargs -n 1 gsutil -m rm -r
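
Note that xargs -n 1 launches a separate gsutil process per object, so the -m flag has little to parallelise; it works, but it is slow on large buckets. A narrower variant, if you only want to remove the leftover JSON shards rather than everything that is not a JAR or ZIP (a sketch; adjust the pattern to wherever your temp files actually land):

# delete only objects ending in .json, at any depth in the bucket
gsutil -m rm "gs://<bucket>/**.json"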

#2 (5 votes)

Another possible cause of leftover files is cancelled jobs. Currently, Dataflow does not delete files from cancelled jobs. In other cases, files should be cleaned up.


Also, the error listed in the first post, "Unable to delete temporary files", is the result of a logging issue on our side and should be resolved within a week or two. Until then, feel free to ignore these errors, as they do not indicate leftover files.

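If cancelled jobs are the culprit, one way to stop the bucket from filling up without manual cleanup is a GCS lifecycle rule that deletes objects after a few days (a sketch; the 7-day age is an arbitrary choice and assumes the bucket only holds transient staging/temp data):

# lifecycle.json - delete any object older than 7 days
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 7}}
  ]
}
EOF

# apply the rule to the bucket used as the Dataflow temp location
gsutil lifecycle set lifecycle.json gs://<bucket>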

#3 (2 votes)

This was a bug where the Dataflow service would sometimes fail to delete the temporary JSON files after a BigQuery import job completes. We have fixed the issue internally and rolled out a release with the fix.
