Apache Beam Pipeline (Dataflow) - Interpreting execution time for unbounded data

Date: 2021-08-24 14:41:20

In the Dataflow Monitoring Interface for Beam Pipeline Executions, there is a time duration specified in each of the Transformation boxes (see https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf).

For bounded data, I understand this to be the estimated time the transformation takes to complete. However, for unbounded data, as in my streaming case, how do I interpret this number?

Some of my transforms have a duration significantly higher than the others, which means those transforms take more time. But what else does this uneven distribution imply for my execution, especially if I have windowing functions in play?

Also, is this related to autoscaling? For example, do more workers get spun up if the execution time exceeds certain thresholds? Or does autoscaling depend on the data volume at the input?

1 Answer

#1


2  

In both Batch and Streaming, this is a measure of how long those steps have spent active on each work thread. The number of threads per worker machine differs between Batch and Streaming and, as you note, more workers means more worker threads.
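To make the "time spent active on each work thread" idea concrete, here is a minimal sketch in plain Python. It does not call any Dataflow or Beam API. The step names, thread ids, and timings are made up for illustration, and it assumes that the number shown for a step is the sum of the time every thread spent in that step.

```python
from collections import defaultdict

# Hypothetical samples, not real Dataflow output:
# (step name, worker thread id, seconds that thread was active in the step).
samples = [
    ("ReadFromSource", "worker-1/thread-0", 12.0),
    ("ReadFromSource", "worker-1/thread-1", 11.5),
    ("ParseEvents", "worker-1/thread-0", 40.0),
    ("ParseEvents", "worker-1/thread-1", 42.5),
    ("ParseEvents", "worker-2/thread-0", 39.0),
    ("WindowAndCount", "worker-2/thread-0", 6.0),
]


def step_active_time(samples):
    """Sum active seconds per step across every thread that ran it."""
    totals = defaultdict(float)
    for step, _thread, seconds in samples:
        totals[step] += seconds
    return dict(totals)


totals = step_active_time(samples)

# List steps from most to least active time, like scanning the step boxes in the UI.
for step, seconds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{step}: {seconds:.1f}s")
```

Under that assumption, a step's number grows with the number of threads working on it, so it can exceed the elapsed clock time of the job and keeps growing for as long as a streaming job runs.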

There aren't any actual implications. These measurements are provided as a way of understanding what the work threads have spent most of their time doing. If the pipeline as a whole seems to be behaving reasonably, you don't need to do anything. If you think the pipeline is slower than you expect, or if one of the steps seems to be taking longer than you would expect, these numbers can act as a starting point for understanding performance.

In some sense these are similar to how a profile of time spent in various functions can be useful for improving the performance of a normal program. One function taking longer than another has no impact in itself, but it may be useful information to have.
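The profiling analogy can be sketched for an ordinary program as follows. This is only an illustration using the Python standard library; the `timed` decorator and the `parse` and `enrich` functions are hypothetical names invented for this example and have nothing to do with Beam itself.

```python
import time
from collections import defaultdict
from functools import wraps

# Function name -> accumulated seconds spent inside that function.
time_spent = defaultdict(float)


def timed(func):
    """Accumulate the wall-clock time spent in each call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            time_spent[func.__name__] += time.perf_counter() - start
    return wrapper


@timed
def parse(record):
    # Cheap step: split a comma-separated line into fields.
    return record.strip().split(",")


@timed
def enrich(fields):
    # Expensive step: the sleep stands in for a slow lookup.
    time.sleep(0.01)
    return fields + ["enriched"]


results = [enrich(parse(line)) for line in ["a,b\n", "c,d\n", "e,f\n"]]

# Report functions from most to least time spent.
for name, seconds in sorted(time_spent.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {seconds:.4f}s")
```

Here `enrich` dominates the report, which says nothing about correctness. It only tells you where to look first if the program as a whole is slower than you want.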
