Google Dataflow: how to parse a large file containing a valid JSON array from FileIO.ReadableFile
In my pipeline, the FileIO.readMatches() transform reads a big JSON file (around 300-400 MB) containing a valid JSON array and returns a FileIO.ReadableFile object to...
Error when executing a Dataflow template from a Cloud Function
Getting the below error while trying to execute a custom Dataflow template using a Google Cloud Function. Error: "problem ...
How to increase Dataflow read parallelism from Cassandra
I am trying to export a lot of data (2 TB, 30kkk rows) from Cassandra to BigQuery. All my infrastructure is on GCP. My Cassandra cluster has 4 nodes ...
Sharing a BigTable Connection object between DataFlow DoFn sub-classes
I am setting up a Java Pipeline in DataFlow to read a .csv file and to create a bunch of BigTable rows based on the content of the file. I see in the ...
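A common answer to this kind of question is to create one heavyweight connection per worker and let every DoFn instance reuse it lazily. The sketch below illustrates the pattern in plain Python with a hypothetical `SharedConnection` holder and a fake connection factory; it is not the asker's code or any Bigtable client API.

```python
import threading

class SharedConnection:
    """Caches one shared connection per process (hypothetical sketch)."""
    _lock = threading.Lock()
    _conn = None

    @classmethod
    def get(cls, factory):
        # Double-checked locking: the factory runs once; later callers
        # get the already-opened connection.
        if cls._conn is None:
            with cls._lock:
                if cls._conn is None:
                    cls._conn = factory()
        return cls._conn

# Simulate two DoFn instances asking for the connection.
calls = []
def fake_factory():
    calls.append(1)        # record how many real connections were opened
    return object()

a = SharedConnection.get(fake_factory)
b = SharedConnection.get(fake_factory)
assert a is b and len(calls) == 1   # one connection, shared by both callers
```

In a real pipeline the equivalent place to run the factory would be a worker-lifecycle hook (such as a DoFn setup method), so the expensive connection is not reopened per element.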
How to ensure idempotence with DataFlow and Cloud Pub/Sub?
I'm curious about the best way to ensure idempotence when using Cloud DataFlow and PubSub. We currently hav...
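Since Pub/Sub offers at-least-once delivery, a typical building block for idempotence is deduplication on a stable message identifier before any side-effecting write. A minimal sketch, with made-up message IDs and no Beam or Pub/Sub APIs:

```python
def deduplicate(messages, seen=None):
    """Keep only the first occurrence of each message ID (sketch)."""
    seen = set() if seen is None else seen
    out = []
    for msg_id, payload in messages:
        if msg_id in seen:
            continue  # redelivered message: processing it twice would break idempotence
        seen.add(msg_id)
        out.append((msg_id, payload))
    return out

# "a" is delivered twice, as Pub/Sub is allowed to do.
msgs = [("a", 1), ("b", 2), ("a", 1)]
assert deduplicate(msgs) == [("a", 1), ("b", 2)]
```

The same idea also works downstream: making the write itself an upsert keyed by the message ID gives idempotence even if deduplication state is lost.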
How to match multiple files by name with TextIO.Read in Cloud Dataflow
I have a GCS folder as below: gs://<bucket-name>/<folder-name>/dt=2017-12-01/part-0000.tsv ...
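TextIO.Read accepts glob patterns in the file spec, so daily partitions laid out like this can usually be matched with a single wildcard such as `dt=*/part-*.tsv`. The matching semantics can be sketched with Python's `fnmatch`; the object names below are invented for illustration:

```python
from fnmatch import fnmatch

# Hypothetical object names modeled on the layout in the question.
names = [
    "folder-name/dt=2017-12-01/part-0000.tsv",
    "folder-name/dt=2017-12-02/part-0001.tsv",
    "folder-name/dt=2017-12-01/marker.txt",   # should NOT match
]
pattern = "folder-name/dt=*/part-*.tsv"
matched = [n for n in names if fnmatch(n, pattern)]
assert matched == names[:2]   # both .tsv parts match, the marker file does not
```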
Apache Beam Dataflow SDK example error
I'm trying one of the Beam Google Dataflow pipeline examples, but I'm bumping into an exception regarding MapElements and the methods SingleFunction / Seri...
Missing object or bucket in path when running on Dataflow
When trying to run a pipeline on the Dataflow service, I specify the staging and temp buckets (in GCS) on the command line. When the program executes,...
Custom source for reading Parquet files in Cloud Dataflow
I have a requirement to read a Parquet file in my Dataflow pipeline, written in Java, and upload it to BigQuery. As there is no out-of-the-box functionality given yet...
Cloud Dataflow BQ output hangs the job on a TLS handshake error
My Cloud Dataflow job hangs. Pipeline: Pipeline p = Pipeline.create(options); p.apply(TextIO.Read.named("ReadFiles").from(opt...
Cloud Pub/Sub and Cloud Dataflow: fixing workers to a topic
I have a Google Cloud Pub/Sub and Cloud Dataflow stream processing architecture, and I need guaranteed message ordering. Is it possible to set the subscri...
Applying multiple GroupByKey transforms in a DataFlow job causes the window to be applied multiple times
We have a DataFlow job that is subscribed to a PubSub stream of events. We have applied sliding windows of 1 hour with a 10-minute period. In our code...
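With a 1-hour sliding window and a 10-minute period, every element is assigned to six overlapping windows, which is why a second windowed GroupByKey can multiply results unexpectedly. The assignment logic can be sketched outside Beam like this (times in seconds, the function name is made up):

```python
def sliding_windows(ts, size, period):
    """Return the start times of all sliding windows containing timestamp ts."""
    first = (ts // period) * period      # latest window start at or before ts
    starts = []
    s = first
    while s > ts - size:                 # window [s, s + size) still covers ts
        starts.append(s)
        s -= period
    return starts

# 1-hour windows (3600 s) sliding every 10 minutes (600 s):
wins = sliding_windows(7200, 3600, 600)
assert len(wins) == 6                    # each element lands in 6 windows
assert wins[0] == 7200                   # newest window starts at the element's slot
```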
Data loss when draining a Dataflow job that reads from PubSub and writes to Google Cloud Storage
When putting a fixed number of strings (800,000 of 1 KB each, used to test) into a PubSub topic and running the following Apache Beam (2.1.0) job in Dataflow, e...
Google Cloud Dataflow BigQueryIO.Write fails with an unknown error (HTTP code 500)
Has anybody hit the same problem as me, where Google Cloud Dataflow BigQueryIO.Write fails with an unknown error (HTTP code 500)? ...
Some thoughts on expressing ControlFlow with DataFlow
1. Control flow. From the moment we first encounter procedural languages, programming with control flow is second nature. if (condition) { // do something } else { // do something else } Branches and loops are the most common forms of control flow. Because a controlling condition exists, one part of the code always executes while another part does not. In control flow...
GCP Dataflow Apache Beam write-output error handling
I need to apply error handling to my Dataflow for multiple inserts to Spanner with the same primary key. The logic being that an older message may be ...
Apache Beam DataFlow runner throws a setup error
We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but are getting the below error, ...
How to run a Cloud Dataflow pipeline with the Spark runner?
I have read that Google Cloud Dataflow pipelines, which are based on the Apache Beam SDK, can be run with Spark or Flink. ...
DataFlow runner fails after upgrading to Beam 2.4.0
I have a simple dataflow job for testing that ran successfully with apache-beam 2.1.0; the code looks something like: ...
Debugging a Dataflow template from GCS to BigQuery
I am getting some strange errors that are difficult to debug. I am running a simple UDF JavaScript mapper which maps the JSON data and imports it into...