"GC overhead limit exceeded" for a long-running streaming Dataflow job

Time: 2021-02-17 15:36:14

Running my streaming dataflow job for a longer period of time tends to end up in a "GC overhead limit exceeded" error which brings the job to a halt. How can I best proceed to debug this?


java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.google.cloud.dataflow.worker.repackaged.com.google.common.collect.HashBasedTable.create (HashBasedTable.java:76)
    at com.google.cloud.dataflow.worker.WindmillTimerInternals.<init> (WindmillTimerInternals.java:53)
    at com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.start (StreamingModeExecutionContext.java:490)
    at com.google.cloud.dataflow.worker.StreamingModeExecutionContext.start (StreamingModeExecutionContext.java:221)
    at com.google.cloud.dataflow.worker.StreamingDataflowWorker.process (StreamingDataflowWorker.java:1058)
    at com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000 (StreamingDataflowWorker.java:133)
    at com.google.cloud.dataflow.worker.StreamingDataflowWorker$8.run (StreamingDataflowWorker.java:841)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
    at java.lang.Thread.run (Thread.java:745)
  • Job ID: 2018-02-06_00_54_50-15974506330123401176
  • SDK: Apache Beam SDK for Java 2.2.0
  • Scio version: 0.4.7

1 Solution

#1



I've run into this issue a few times. My approach typically starts with isolating the transform step that is causing the memory error in Dataflow. It's a slow process, but you can usually make an educated guess about which transform is the problematic one. Remove the transform, execute the pipeline, and check whether the error persists.

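As a rough illustration of that bypass technique, here is a minimal Beam Java sketch. SuspectFn, the step names, and the Create-based input are hypothetical stand-ins for the real streaming pipeline, not the actual job:

    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class IsolateSuspectStep {

      // Hypothetical stand-in for the step suspected of causing the OOM.
      static class SuspectFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(c.element().toUpperCase());
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> input =
            p.apply("ReadInput", Create.of(Arrays.asList("a", "b", "c")));

        // Normal wiring: input -> SuspectStep -> downstream.
        PCollection<String> maybeFaulty =
            input.apply("SuspectStep", ParDo.of(new SuspectFn()));

        // To isolate the step, comment out the apply above and feed downstream
        // from `input` directly, then rerun and watch whether the OOM goes away:
        // PCollection<String> maybeFaulty = input;

        maybeFaulty.apply("Downstream", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(c.element()); // stand-in for the rest of the pipeline
          }
        }));

        p.run();
      }
    }

Redeploying the job with the suspect step bypassed and watching whether the OOM recurs is usually the quickest way to confirm or clear a given transform.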

Once I've identified the problematic transform, I look at its implementation for memory inefficiencies. These are usually related to object initialization (memory allocation), or to a design where a transform has a very high fan-out and produces a large amount of output. But it can be something as mundane as string manipulation.

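A common instance of the allocation problem, sketched as a Beam DoFn. MyEvent is a hypothetical type and Gson just stands in for any moderately expensive, reusable object; the point is allocating once per DoFn instance in @Setup rather than once per element:

    import com.google.gson.Gson;
    import java.io.Serializable;
    import org.apache.beam.sdk.transforms.DoFn;

    public class ParseEventFn extends DoFn<String, ParseEventFn.MyEvent> {

      // Hypothetical event type, used only for illustration.
      public static class MyEvent implements Serializable {
        String id;
        long timestamp;
      }

      // Inefficient variant: allocating the parser inside processElement creates
      // one short-lived object per element and keeps the GC busy on a hot path.
      //
      //   @ProcessElement
      //   public void processElement(ProcessContext c) {
      //     c.output(new Gson().fromJson(c.element(), MyEvent.class));
      //   }

      // Better: allocate once per DoFn instance and reuse it. Gson is not
      // Serializable, hence the transient field initialized in @Setup.
      private transient Gson gson;

      @Setup
      public void setup() {
        gson = new Gson();
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(gson.fromJson(c.element(), MyEvent.class));
      }
    }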

From here, it's just a matter of continuing to narrow down the issue. Dataflow does have memory limitations. You could move the workers onto larger Compute Engine instances (see the sketch below), but that isn't a scalable solution.

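If you do go the bigger-machines route, one way it could look with the Dataflow runner's pipeline options (assuming the workerMachineType option, which can also be passed as --workerMachineType on the command line; project and bucket names below are placeholders):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class HighMemWorkers {
      public static void main(String[] args) {
        // e.g. --runner=DataflowRunner --project=my-project --tempLocation=gs://my-bucket/tmp
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);

        // Same effect as passing --workerMachineType on the command line:
        // puts the streaming workers on machines with more memory per vCPU.
        options.setWorkerMachineType("n1-highmem-4");

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline as usual ...
        p.run();
      }
    }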

You should also consider implementing the pipeline using only Apache Beam Java. That rules out Scio as the cause, though Scio usually isn't the problem.

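To give a sense of what that port looks like, here is a trivial Scio pipeline and a plain Beam Java equivalent, with hypothetical paths; the real streaming job would of course be larger:

    // Scio (Scala):
    //   sc.textFile(input).map(_.toUpperCase).saveAsTextFile(output)
    //
    // The same trivial pipeline written directly against the Beam Java SDK:
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamOnlyPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(TextIO.read().from("gs://my-bucket/input/*"))        // hypothetical path
         .apply(MapElements.into(TypeDescriptors.strings())
                           .via((String line) -> line.toUpperCase()))
         .apply(TextIO.write().to("gs://my-bucket/output/result"));  // hypothetical path
        p.run();
      }
    }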
