ClassCastException when reading an XML file with XmlSource in Google Cloud Dataflow

Date: 2022-02-01 15:22:55

I am using XmlSource.from to read from an XML file stored in a Cloud Storage bucket.

 XmlSource<Data> source = XmlSource.<Data>from("gs://<my-url>/TestData.xml")
        .withRootElement("data")
        .withRecordElement("record")
        .withRecordClass(Data.class);

p.apply(Read.from(source))
        .apply(RemoveDuplicates.<Data>create())
        .apply(ParDo.of(new XMLPipeline.CreateItemQtyMapping()))
        .apply(Combine.<String, Integer>perKey(new SumIntegers()))
        .apply("FormatResults", MapElements.via(
                new SimpleFunction<KV<String, Integer>, String>() {
                  @Override
                  public String apply(KV<String, Integer> input) {
                    return input.getKey() + "," + input.getValue();
                  }
                }))
        .apply(TextIO.Write.to("gs://<my-url>.appspot.com/pos-pipeline-output/ItemCounts"));

p.run();

But I am getting this exception:

2017-01-09T14:01:31.107Z: Error:   (c88c756cabe0dbec): java.io.IOException: Failed to start reading from source: StaticValueProvider{value=gs://<my-url>/TestData.xml} range [48524, 97048)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:534)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:387)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:217)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:182)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:69)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:284)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:220)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:170)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:192)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:172)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:159)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: com.sun.xml.internal.stream.XMLInputFactoryImpl cannot be cast to org.codehaus.stax2.XMLInputFactory2
    at com.google.cloud.dataflow.sdk.io.XmlSource$XMLReader.setUpXMLParser(XmlSource.java:490)
    at com.google.cloud.dataflow.sdk.io.XmlSource$XMLReader.startReading(XmlSource.java:356)
    at com.google.cloud.dataflow.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
    at com.google.cloud.dataflow.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:281)
    at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:531)
    ... 14 more

These are the dependencies in my pom.xml:

<dependencies>
<dependency>
  <groupId>com.google.cloud.dataflow</groupId>
  <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
  <version>1.9.0</version>
</dependency>

<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-storage</artifactId>
  <version>0.7.0</version>
</dependency>

<dependency>
  <groupId>org.codehaus.woodstox</groupId>
  <artifactId>stax2-api</artifactId>
  <version>4.0.0</version>
</dependency>

I am not sure what is wrong here. Can someone please give some pointers?

Thanks,

Abhishek

2 Answers

#1


This is a bit subtle, but it looks like you also need to include the appropriate runtime dependency. According to https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/XmlSource, you want to:

  1. Explicitly declare a dependency on org.codehaus.woodstox:stax2-api

  2. Include a compatible implementation on the classpath at run-time, such as org.codehaus.woodstox:woodstox-core-asl

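It looks like you've correctly done step 1 but not step 2: the pom.xml declares stax2-api, but no implementation such as woodstox-core-asl, so the JDK's built-in StAX factory is picked up and the cast to XMLInputFactory2 fails.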
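
Step 2 could be sketched as one more entry in the `<dependencies>` section of the pom.xml from the question. The version number below is an assumption and does not come from the original post; check it against the stax2-api version you declare before relying on it.

```xml
<!-- StAX2 implementation needed at run-time by XmlSource.
     The version is an assumption, not taken from the original post. -->
<dependency>
  <groupId>org.codehaus.woodstox</groupId>
  <artifactId>woodstox-core-asl</artifactId>
  <version>4.4.1</version>
</dependency>
```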

#2


What resolved java.lang.ClassCastException: com.sun.xml.internal.stream.XMLInputFactoryImpl cannot be cast to org.codehaus.stax2.XMLInputFactory2 for me was to declare only the org.codehaus.woodstox:woodstox-core-asl dependency.

It already pulls in the StAX and StAX2 APIs as transitive dependencies (javax.xml.stream:stax-api and org.codehaus.woodstox:stax2-api).
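
To confirm which StAX implementation the JVM actually resolves, a small check along these lines may help. It is only a sketch and not part of the Dataflow SDK: the class name `StaxImplementationCheck` and its helper methods are hypothetical. The exception in the question suggests the JDK's built-in factory was being resolved; with Woodstox on the classpath, the first line printed would be expected to name a Woodstox class instead.

```java
import javax.xml.stream.XMLInputFactory;

public class StaxImplementationCheck {

    /** Returns the class name of the StAX input factory the JVM resolves. */
    public static String resolvedFactoryClassName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    /** Reports whether the named class can be found on the classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            // 'false' avoids running static initializers; we only need to locate the class.
            Class.forName(className, false, StaxImplementationCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("XMLInputFactory implementation: " + resolvedFactoryClassName());
        // The StAX2 API interface that XmlSource casts the factory to.
        System.out.println("stax2-api present: "
                + isOnClasspath("org.codehaus.stax2.XMLInputFactory2"));
        // Woodstox's factory class, which implements that interface.
        System.out.println("Woodstox present: "
                + isOnClasspath("com.ctc.wstx.stax.WstxInputFactory"));
    }
}
```

Running it with the same classpath as the pipeline shows whether both the API and an implementation are available before the job is submitted.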
