Getting/setting the BigQuery job ID while executing BigQueryIO.write()

Date: 2022-06-15 14:07:40

Is it possible to set the BigQuery job ID, or to get it while the batch pipeline is running?
I know it's possible using the BigQuery API, but is it possible when using BigQueryIO from Apache Beam? After writing to BigQuery, I need to send an acknowledgement that the load is complete.


1 solution

#1



Currently this is not possible. It is complicated by the fact that a single BigQueryIO.write() may use many BigQuery jobs under the hood (i.e. BigQueryIO.write() is a general-purpose API for writing data to BigQuery, rather than an API for working with a single specific BigQuery load job). For example (see the configuration sketch after this list):


  • In case the amount of data to be loaded is larger than the BigQuery limits for a single load job, BigQueryIO.write() will shard it into multiple load jobs.
  • In case you are using one of the destination-dependent write methods (e.g. DynamicDestinations), and are loading into multiple tables at the same time, there'll be at least 1 load job per table.
  • In case you are writing an unbounded PCollection using the BATCH_LOADS method, it will periodically issue load jobs for newly arrived data, subject to the notes above.
  • In case you're using the STREAMING_INSERTS method (it is allowed to use it even if you're writing a bounded PCollection), there will be no load jobs at all.
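
To make the method distinction concrete, here is a minimal sketch of choosing the write method; the destination table is hypothetical and assumed to already exist, and what the answer calls BATCH_LOADS corresponds to BigQueryIO.Write.Method.FILE_LOADS in the Beam Java SDK:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class WriteMethodSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of(new TableRow().set("name", "example"))
            .withCoder(TableRowJsonCoder.of()))
     // One write transform, but potentially many load jobs under the hood.
     .apply("WriteToBigQuery",
         BigQueryIO.writeTableRows()
             .to("my-project:my_dataset.my_table")  // hypothetical table
             // FILE_LOADS is the batch-load path (BATCH_LOADS in the answer);
             // STREAMING_INSERTS would issue no load jobs at all.
             .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
             // Assume the table already exists, so no schema is supplied.
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}
```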

You will need to use one of the typical workarounds for "doing something after something else is done": for example, wait until the entire pipeline finishes by calling pipeline.run().waitUntilFinish() in your main program, and then perform your second action.

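A minimal sketch of that workaround; the acknowledgement step is a hypothetical placeholder for your own notification logic:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AckAfterLoad {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // ... apply your transforms here, ending with BigQueryIO.write() ...

    // Blocks until the whole pipeline, and therefore every load job that
    // BigQueryIO.write() issued under the hood, has finished.
    PipelineResult.State state = pipeline.run().waitUntilFinish();

    if (state == PipelineResult.State.DONE) {
      // Hypothetical acknowledgement: replace with your own notification.
      System.out.println("BigQuery load complete, sending acknowledgement");
    }
  }
}
```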
