从BigQuery读取小表时数据流OutOfMemoryError

时间:2022-12-20 15:25:48

We have a pipeline reading data from BigQuery and processing historical data for various calendar years. It fails with OutOfMemoryError errors if the input data is small (~500MB)

我们有一个从BigQuery读取数据的管道,并处理各个日历年的历史数据。如果输入数据很小(~500MB),它会因OutOfMemoryError错误而失败

On startup it reads from BigQuery about 10.000 elements/sec, after short time it slows down to hundreds elements/s then it hangs completely.

在启动时,它从BigQuery读取大约10.000个元素/秒,在短时间内它减慢到数百个元素/秒然后它完全挂起。

Observing 'Elements Added' on the next processing step (BQImportAndCompute), the value increases and then decreases again. That looks to me like some already loaded data is dropped and then loaded again.

在下一个处理步骤(BQImportAndCompute)上观察“添加元素”,该值会增加然后再次减小。在我看来,一些已经加载的数据被删除然后再次加载。

Stackdriver Logging console contains errors with various stack traces that contain java.lang.OutOfMemoryError, for example:

Stackdriver日志记录控制台包含包含java.lang.OutOfMemoryError的各种堆栈跟踪的错误,例如:

Error reporting workitem progress update to Dataflow service:

将工作项进度更新报告给Dataflow服务时出错:

"java.lang.OutOfMemoryError: Java heap space
    at com.google.cloud.dataflow.sdk.runners.worker.BigQueryAvroReader$BigQueryAvroFileIterator.getProgress(BigQueryAvroReader.java:145)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIteratorConcurrent(ReadOperation.java:397)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIterator(ReadOperation.java:389)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$1.run(ReadOperation.java:206)

I would suspect that there is a problem with topology of the pipe, but running the same pipeline

我怀疑管道的拓扑存在问题,但运行相同的管道

  1. locally with DirectPipelineRunner works fine
  2. 本地使用DirectPipelineRunner工作正常
  3. in cloud with DataflowPipelineRunner on large dataset (5GB, for another year) works fine
  4. 云中使用大数据集上的DataflowPipelineRunner(5GB,另一年)工作正常

I assume problem is how Dataflow parallelizes and distributes work in the pipeline. Are there any possibilities to inspect or influence it?

我假设问题是Dataflow如何在管道中并行化和分配工作。有没有可能检查或影响它?

2 个解决方案

#1


1  

The problem here doesn't seem to be related to the size of the BigQuery table, but likely the number of BigQuery sources being used and the rest of the pipeline.

这里的问题似乎与BigQuery表的大小无关,但可能与正在使用的BigQuery源的数量以及管道的其余部分有关。

  1. Instead of reading from multiple BigQuery sources and flattening them have you tried reading from a query that pulls in all the information? Doing that in a single step should simplify the pipeline and also allow BigQuery to execute better (one query against multiple tables vs. multiple queries against individual tables).

    您是否尝试从提取所有信息的查询中读取,而不是从多个BigQuery源读取并展平它们?在一个步骤中执行此操作应简化管道并允许BigQuery更好地执行(针对多个表的一个查询与针对各个表的多个查询)。

  2. Another possible problem is if there is a high degree of fan-out within or after the BQImportAndCompute operation. Depending on the computation being done there, you may be able to reduce the fan-out using clever CombineFns or WindowFns. If you want help figuring out how to improve that path, please share more details about what is happening after the BQImportAndCompute.

    另一个可能的问题是在BQImportAndCompute操作之内或之后是否存在高度扇出。根据在那里进行的计算,您可以使用聪明的CombineFns或WindowFns减少扇出。如果您需要帮助确定如何改进该路径,请分享有关BQImportAndCompute之后发生的事情的更多详细信息。

#2


0  

Have you tried debugging with Stackdriver?

你试过用Stackdriver调试吗?

https://cloud.google.com/blog/big-data/2016/04/debugging-data-transformations-using-cloud-dataflow-and-stackdriver-debugger

https://cloud.google.com/blog/big-data/2016/04/debugging-data-transformations-using-cloud-dataflow-and-stackdriver-debugger

#1


1  

The problem here doesn't seem to be related to the size of the BigQuery table, but likely the number of BigQuery sources being used and the rest of the pipeline.

这里的问题似乎与BigQuery表的大小无关,但可能与正在使用的BigQuery源的数量以及管道的其余部分有关。

  1. Instead of reading from multiple BigQuery sources and flattening them have you tried reading from a query that pulls in all the information? Doing that in a single step should simplify the pipeline and also allow BigQuery to execute better (one query against multiple tables vs. multiple queries against individual tables).

    您是否尝试从提取所有信息的查询中读取,而不是从多个BigQuery源读取并展平它们?在一个步骤中执行此操作应简化管道并允许BigQuery更好地执行(针对多个表的一个查询与针对各个表的多个查询)。

  2. Another possible problem is if there is a high degree of fan-out within or after the BQImportAndCompute operation. Depending on the computation being done there, you may be able to reduce the fan-out using clever CombineFns or WindowFns. If you want help figuring out how to improve that path, please share more details about what is happening after the BQImportAndCompute.

    另一个可能的问题是在BQImportAndCompute操作之内或之后是否存在高度扇出。根据在那里进行的计算,您可以使用聪明的CombineFns或WindowFns减少扇出。如果您需要帮助确定如何改进该路径,请分享有关BQImportAndCompute之后发生的事情的更多详细信息。

#2


0  

Have you tried debugging with Stackdriver?

你试过用Stackdriver调试吗?

https://cloud.google.com/blog/big-data/2016/04/debugging-data-transformations-using-cloud-dataflow-and-stackdriver-debugger

https://cloud.google.com/blog/big-data/2016/04/debugging-data-transformations-using-cloud-dataflow-and-stackdriver-debugger