Debugging slow BigQuery reads on Google Cloud Dataflow

Time: 2021-12-27 14:08:16

Background: We have a really simple pipeline that reads some data from BigQuery (usually ~300 MB), filters/transforms it, and writes it back to BigQuery. In 99% of cases this pipeline finishes in 7-10 minutes and is then restarted to process a new batch.

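For context, a minimal sketch of that read -> transform -> write shape, written against a recent Beam Python SDK (the API differs considerably from the 0.6.0 SDK mentioned below); the project, bucket, tables, query, and transform logic are all hypothetical placeholders:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical project/bucket/table names; the real query and
    # transformation logic would go where the placeholders are.
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/tmp',
    )

    def filter_and_transform(row):
        # Placeholder for the actual filtering/transformation step.
        return {'id': row['id'], 'value': row['value'] * 2}

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(
               query='SELECT id, value FROM `my-project.dataset.source`',
               use_standard_sql=True)
         | 'Transform' >> beam.Map(filter_and_transform)
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               'my-project:dataset.destination',
               # Assumes the destination table already exists.
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))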

Problem: Recently, the job has started to take >3 hours once in a while, maybe 2 times a month out of 2000 runs. When I look at the logs, I can't see any errors, and in fact it's only the first step (the read from BigQuery) that is taking so long.

Does anyone have a suggestion on how to approach debugging such cases, especially since it's really the read from BQ and not any of our transformation code? We are using Apache Beam SDK for Python 0.6.0 (maybe that's the reason!?)

Is it maybe possible to define a timeout for the job?

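For what it's worth, the closest thing I could think of is enforcing a timeout from the launcher process. A rough sketch, assuming a newer Beam Python SDK where PipelineResult exposes wait_until_finish(duration) and cancel() (the 0.6.0 SDK may not have these):

    import apache_beam as beam
    from apache_beam.runners.runner import PipelineState

    p = beam.Pipeline(options=options)  # options as in the pipeline above
    # ... build the read/transform/write steps here ...

    result = p.run()
    # Block for at most 30 minutes (the duration argument is in milliseconds).
    result.wait_until_finish(duration=30 * 60 * 1000)
    if result.state not in (PipelineState.DONE,
                            PipelineState.FAILED,
                            PipelineState.CANCELLED):
        # Still not finished after the deadline: cancel the Dataflow job.
        result.cancel()

The job can also be cancelled externally, e.g. with gcloud dataflow jobs cancel JOB_ID.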

1 Solution

#1

This is an issue on either Dataflow side or BigQuery side depending on how one looks at it. When splitting the data for parallel processing, Dataflow relies on an estimate of the data size. The long runtime happens when BigQuery sporadically gives a severe under-estimate of the query result size, and Dataflow, as a consequence, severely over-splits the data and the runtime becomes bottlenecked by the overhead of reading lots and lots of tiny file chunks exported by BigQuery.

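To make the failure mode concrete, here is a purely illustrative back-of-the-envelope calculation; the numbers are hypothetical and this is not necessarily Dataflow's actual splitting heuristic, just a way to see how an under-estimate can inflate the chunk count:

    # Illustrative only: hypothetical numbers, not Dataflow's real heuristic.
    # A per-bundle size target derived from a bad estimate yields far more
    # (and far smaller) chunks than intended.
    estimated_total_bytes = 3 * 1024 * 1024      # BigQuery's under-estimate: ~3 MB
    target_bundles = 100                         # hypothetical desired parallelism
    desired_bundle_bytes = estimated_total_bytes // target_bundles   # ~30 KB per bundle

    actual_total_bytes = 300 * 1024 * 1024       # the real result: ~300 MB
    actual_chunks = actual_total_bytes // desired_bundle_bytes       # ~10,000 tiny chunks
    print(actual_chunks)  # per-chunk read overhead now dominates the runtime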

On one hand, this is the first time I've seen BigQuery produce such dramatically incorrect query result size estimates. However, as size estimates are inherently best-effort and can in general be arbitrarily off, Dataflow should control for that and prevent such oversplitting. We'll investigate and fix this.

The only workaround that comes to mind meanwhile is to use the Java SDK: it uses quite different code for reading from BigQuery that, as far as I recall, does not rely on query size estimates.
