Google云数据流中的条件迭代

时间:2022-01-14 15:03:47

I am looking at the opportunities for implementing a data analysis algorithm using Google Cloud Dataflow. Mind you, I have no experience with dataflow yet. I am just doing some research on whether it can fulfill my needs.

我正在研究使用Google Cloud Dataflow实施数据分析算法的机会。请注意,我还没有数据流的经验。我正在研究它是否能满足我的需求。

Part of my algorithm contains some conditional iterations, that is, continue until some condition is met:

我的算法的一部分包含一些条件迭代,即继续直到满足某些条件:

PCollection data  = ...
while(needsMoreWork(data)) {
  data = doAStep(data)
}

I have looked around in the documentation and as far as I can see I am only able to do "iterations" if I know the exact number of iterations before the pipeline starts. In this case my pipeline construction code can just create a sequential pipeline with fixed number of steps.

我在文档中查看过,据我所知,如果我知道管道启动前的确切迭代次数,我只能进行“迭代”。在这种情况下,我的管道构造代码可以只创建一个具有固定步数的顺序管道。

The only "solution" I can think of is to run each iteration in separate pipelines, store the intermediate data in some database, and then decide in my pipeline construction whether or not to launch a new pipeline for the next iteration. This seems to be an extremely inefficient solution!

我能想到的唯一“解决方案”是在单独的管道中运行每个迭代,将中间数据存储在某个数据库中,然后在我的管道构造中决定是否为下一次迭代启动新管道。这似乎是一个非常低效的解决方案!

Are there any good ways to perform this kind of additional iterations in Google cloud dataflow?

有没有什么好的方法可以在Google云数据流中执行这种额外的迭代?

Thanks!

谢谢!

1 个解决方案

#1


3  

For the time being, the two options you've mentioned are both reasonable. You could even combine the two approaches. Create a pipeline which does a few iterations (becoming a no-op if needsMoreWork is false), and then have a main Java program that submits that pipeline multiple times until needsMoreWork is false.

目前,您提到的两个选项都是合理的。你甚至可以将这两种方法结合起来。创建一个执行一些迭代的管道(如果needsMoreWork为false,则成为无操作),然后有一个主Java程序多次提交该管道,直到needsMoreWork为false。

We've seen this use case a few times and hope to address it natively in the future. Native support is being tracked in https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/50.

我们已经看过几次这个用例,并希望将来能够在本地解决它。在https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/50中跟踪本机支持。

#1


3  

For the time being, the two options you've mentioned are both reasonable. You could even combine the two approaches. Create a pipeline which does a few iterations (becoming a no-op if needsMoreWork is false), and then have a main Java program that submits that pipeline multiple times until needsMoreWork is false.

目前,您提到的两个选项都是合理的。你甚至可以将这两种方法结合起来。创建一个执行一些迭代的管道(如果needsMoreWork为false,则成为无操作),然后有一个主Java程序多次提交该管道,直到needsMoreWork为false。

We've seen this use case a few times and hope to address it natively in the future. Native support is being tracked in https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/50.

我们已经看过几次这个用例,并希望将来能够在本地解决它。在https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/50中跟踪本机支持。