How to run a Google Cloud Dataflow job from App Engine?

Time: 2023-01-11 23:13:28

After reading the Cloud Dataflow docs, I am still not sure how I can run my Dataflow job from App Engine. Is it possible? Does it matter whether my backend is written in Python or in Java? Thanks!

3 Answers

#1 (score: 3)

Yes, it is possible; you need to use the "Streaming execution" approach, as mentioned here.

Using Google Cloud Pub/Sub as a streaming source, you can use it as the "trigger" of your pipeline.

From App Engine, you can perform the "Pub" action against the Pub/Sub topic with the REST API.

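For illustration, here is a minimal sketch of that "Pub" step using the Java Pub/Sub client library (the project ID and topic name are placeholders, and using the client library rather than raw REST calls is my own assumption):

    import com.google.cloud.pubsub.v1.Publisher;
    import com.google.protobuf.ByteString;
    import com.google.pubsub.v1.PubsubMessage;
    import com.google.pubsub.v1.TopicName;

    public class PublishTrigger {
      public static void publishNewDataNotification() throws Exception {
        // Placeholder project and topic; the streaming Dataflow pipeline
        // would read from this same topic.
        TopicName topic = TopicName.of("my-project", "dataflow-input");
        Publisher publisher = Publisher.newBuilder(topic).build();

        PubsubMessage message = PubsubMessage.newBuilder()
            .setData(ByteString.copyFromUtf8("new data available"))
            .build();

        publisher.publish(message).get(); // block until the message ID comes back
        publisher.shutdown();
      }
    }

On the Dataflow side, the streaming pipeline would read from the same topic, so each published message effectively acts as the trigger.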

#2 (score: 0)

There might be a way to submit your Dataflow job from App Engine, but this is not something that is actively supported, as the lack of documentation suggests. App Engine's runtime environment makes some of the required operations more difficult, e.g. obtaining credentials and submitting Dataflow jobs.

#3 (score: 0)

One way would indeed be to use Pub/Sub from within App Engine to let Cloud Dataflow know when new data is available. The Cloud Dataflow job would then run continuously and App Engine would provide the data for processing.

A different approach would be to add the code that sets up the Cloud Dataflow pipeline to a class in App Engine (adding the Dataflow SDK to your GAE project) and to set the job options programmatically, as explained here:

https://cloud.google.com/dataflow/pipelines/specifying-exec-params

Make sure to set the 'runner' option to DataflowPipelineRunner, so the job executes asynchronously on Google Cloud Platform. Since the pipeline runner (which actually runs your pipeline) does not have to be the same as the code that initiates it, this code (up until pipeline.run()) could live in App Engine.

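A rough sketch of that programmatic setup, using the Dataflow Java SDK 1.x classes the answer refers to (the project ID, bucket, and I/O paths are placeholder assumptions):

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

    public class LaunchPipeline {
      public static void launch() {
        // Build the job options in code instead of parsing command-line args.
        DataflowPipelineOptions options =
            PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
        options.setRunner(DataflowPipelineRunner.class);       // execute asynchronously on GCP
        options.setProject("my-project");                      // placeholder project ID
        options.setStagingLocation("gs://my-bucket/staging");  // placeholder GCS bucket

        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.Read.from("gs://my-bucket/input/*"))        // placeholder input
         .apply(TextIO.Write.to("gs://my-bucket/output/results")); // placeholder output

        // Submits the job to the Dataflow service and returns without blocking.
        p.run();
      }
    }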

You can then add an endpoint or servlet to GAE that, when called, runs the code that sets up the pipeline.

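For example, a hypothetical servlet that kicks off the pipeline sketched above (the class name and URL mapping are assumptions; LaunchPipeline is the previous snippet):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class LaunchPipelineServlet extends HttpServlet {
      @Override
      protected void doGet(HttpServletRequest req, HttpServletResponse resp)
          throws IOException {
        LaunchPipeline.launch(); // submits the Dataflow job and returns immediately
        resp.setContentType("text/plain");
        resp.getWriter().println("Dataflow job submitted");
      }
    }

You would map it to a URL such as /launch-pipeline in web.xml as usual.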

To take scheduling even further, you could have a cron job in GAE that calls the endpoint that initiates the pipeline...

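As a sketch, the GAE cron configuration (cron.xml in a Java project) could hit that endpoint on a schedule; the URL and interval here are placeholders:

    <?xml version="1.0" encoding="UTF-8"?>
    <cronentries>
      <cron>
        <url>/launch-pipeline</url>
        <description>Kick off the Dataflow pipeline once a day</description>
        <schedule>every 24 hours</schedule>
      </cron>
    </cronentries>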
