How do I submit a Dataflow job from a jar?

Date: 2022-02-04 14:03:48

For reproducibility I want to be able to build jars containing dataflow jobs and then run them with different parameters (e.g. promote them through different accounts). This will also simplify rolling back because builds will be immutable.

I am currently running jobs with the DataflowPipelineRunner from Maven, but for the reasons above that works poorly with automated deployments.

How can I directly run a dataflow job from a jar?

2 answers

#1


2  

Ah it looks like I need templates.

#2


1  

I think templates are the most promising way to go, but currently, if you want a batch job that writes to BigQuery, you need to create a new template every time you want to run the job, which almost defeats the purpose of templates. (This is explained here)

As described in this GitHub README, you can create a bundled jar by running mvn package; something like the following should then work to submit a Dataflow job using the jar file.

java -cp target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar \
com.google.cloud.dataflow.examples.WordCount \
--project=<YOUR CLOUD PLATFORM PROJECT ID> \
--stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
--runner=BlockingDataflowPipelineRunner

This is the way I currently choose, since I need to interact with BigQuery.
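
To tie this back to the question (one immutable jar, promoted through different accounts), here is a rough sketch of a wrapper that builds the same submit command for each environment. The jar path, main class, and runner are taken from the command above; the environment names, project IDs, and bucket names are made-up placeholders, and the TARGET_ENV variable is just a convention invented for this sketch. It only prints the command (a dry run), so nothing is submitted until you run the printed line yourself.

```shell
#!/bin/sh
# Sketch: build the submit command for one immutable bundled jar per environment.
# JAR and MAIN_CLASS come from the command in the answer; everything else is hypothetical.
set -eu

JAR="target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar"
MAIN_CLASS="com.google.cloud.dataflow.examples.WordCount"

# Print the submit command.
# $1 = Cloud Platform project ID, $2 = Cloud Storage staging location
build_submit_cmd() {
  printf 'java -cp %s %s --project=%s --stagingLocation=%s --runner=BlockingDataflowPipelineRunner\n' \
    "$JAR" "$MAIN_CLASS" "$1" "$2"
}

# Map an environment name to its project and bucket (placeholder values).
cmd_for_env() {
  case "$1" in
    staging) build_submit_cmd "my-staging-project" "gs://my-staging-bucket/staging" ;;
    prod)    build_submit_cmd "my-prod-project" "gs://my-prod-bucket/staging" ;;
    *)       echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Dry run: print the command for the environment named in TARGET_ENV (default: staging).
cmd_for_env "${TARGET_ENV:-staging}"
```

For example, TARGET_ENV=prod sh submit.sh (the file name is arbitrary) prints the production command. Because only the parameters differ between environments, rolling back amounts to pointing JAR at an older build.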
