DataFlow Runner fails after upgrading to Beam 2.4.0

Time: 2022-10-23 15:38:17

I have a simple Dataflow test job that ran successfully with apache-beam 2.1.0. The code looks something like this:

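import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class App {
    public static void main(String[] args) throws Exception {
        // Configure the Dataflow runner, staging locations, and worker network.
        DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        dataflowOptions.setProject("MY_PROJECT_ID");
        dataflowOptions.setStagingLocation("gs://MY_STAGING_LOC");
        dataflowOptions.setTempLocation("gs://MY_TEMP_LOC");
        dataflowOptions.setFilesToStage(Collections.singletonList("MY_LOCAL_JAR_FILE.jar"));
        dataflowOptions.setRunner(DataflowRunner.class);
        dataflowOptions.setNetwork("SOME_NETWORK");
        dataflowOptions.setSubnetwork("regions/SOME_REGION/subnetworks/SOME_SUBNETWORK");
        dataflowOptions.setZone("SOME_ZONE");

        Pipeline p = Pipeline.create(dataflowOptions);

        // Trivial pipeline: a single in-memory element with an explicit coder.
        List<String> lines = Arrays.asList("foobar");
        p.apply(Create.of(lines)).setCoder(StringUtf8Coder.of());

        // Submit the job and block until it finishes.
        p.run().waitUntilFinish();
    }
}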

However, after migrating to apache-beam 2.4.0, I immediately get the following error when I try to submit a Dataflow job via the CLI:

Exception in thread "main" java.lang.RuntimeException: Error while staging packages
        at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:396)
        at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:273)
        at org.apache.beam.runners.dataflow.util.GcsStager.stageFiles(GcsStager.java:76)
        at org.apache.beam.runners.dataflow.util.GcsStager.stageDefaultFiles(GcsStager.java:64)
        at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:661)
        at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:174)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
        at com.company.app.App.main(App.java:48)
Caused by: java.io.IOException: Error executing batch GCS request
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:607)
        at org.apache.beam.sdk.util.GcsUtil.getObjects(GcsUtil.java:339)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.matchNonGlobs(GcsFileSystem.java:216)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.match(GcsFileSystem.java:85)
        at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:123)
        at org.apache.beam.sdk.io.FileSystems.matchSingleFileSpec(FileSystems.java:188)
        at org.apache.beam.runners.dataflow.util.PackageUtil.alreadyStaged(PackageUtil.java:160)
        at org.apache.beam.runners.dataflow.util.PackageUtil.stagePackageSynchronously(PackageUtil.java:184)
        at org.apache.beam.runners.dataflow.util.PackageUtil.lambda$stagePackage$1(PackageUtil.java:174)
        at org.apache.beam.sdk.util.MoreFutures.lambda$supplyAsync$0(MoreFutures.java:101)
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 404 Not Found
...

I haven't changed any configuration settings.

Debugging further, I found that it fails on a POST request to https://www.googleapis.com/null
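
A URL ending in /null is what Java produces when a null reference is concatenated onto a base URL string. The sketch below only illustrates that language behavior; the class and method names are hypothetical, and it is not the client library's actual code.

```java
public class NullUrlDemo {

    // Hypothetical helper: joins a root URL and a path by plain concatenation.
    public static String buildUrl(String rootUrl, String path) {
        // String concatenation renders a null reference as the text "null".
        return rootUrl + path;
    }

    public static void main(String[] args) {
        String unsetPath = null; // stands in for a path that was never populated
        System.out.println(buildUrl("https://www.googleapis.com/", unsetPath));
    }
}
```

If the request URL is assembled this way from an unset field, the server would see a request for /null, which is consistent with the 404 Not Found in the stack trace.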

2 Answers

#1 (score: 2)

This looks like a bug in google-api-java-client that was fixed in the dev branch on Feb 13. Hopefully the fix will be released soon:

Original Issue: https://github.com/google/google-api-java-client/issues/1073

Flawed Fix: https://github.com/google/google-api-java-client/pull/1087

Corrected Fix: https://github.com/google/google-api-java-client/pull/1096

#2 (score: 0)


You're hitting this issue: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/607

To fix it, force google-api-client to version 1.22.0. If you are using Gradle, add the following:

compile (group: 'com.google.api-client', name: 'google-api-client', version: '1.22.0') {
    force = true
}

Or Maven:

<dependency>
  <groupId>com.google.api-client</groupId>
  <artifactId>google-api-client</artifactId>
  <version>[1.22.0]</version>
</dependency>
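
After pinning the version, it can help to confirm which JAR the class is actually loaded from at runtime. The following is a minimal sketch using only the JDK; the class name ClasspathProbe is made up for this example, and the default class to look up (BatchRequest from google-api-client) is an assumption about what is relevant here.

```java
import java.net.URL;
import java.security.CodeSource;

public class ClasspathProbe {

    /** Returns where the named class was loaded from, or a short reason it could not be determined. */
    public static String describe(String className) {
        try {
            // Load without initializing, so no static initializers run.
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader have no code source.
            if (source == null || source.getLocation() == null) {
                return className + " -> (no code source; likely a JDK class)";
            }
            URL location = source.getLocation();
            return className + " -> " + location;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> not on the classpath";
        }
    }

    public static void main(String[] args) {
        // Assumed default for illustration; pass another class name as the first argument.
        String name = args.length > 0 ? args[0]
                : "com.google.api.client.googleapis.batch.BatchRequest";
        System.out.println(describe(name));
    }
}
```

Run it with the same classpath as the pipeline. If the printed JAR path does not contain 1.22.0, another dependency is probably still pulling in a newer google-api-client; mvn dependency:tree or gradle dependencies can show where it comes from.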
