BigQuery Apache Beam pipeline "hangs" when using the DirectRunner

Time: 2022-05-10 01:15:03

I was curious to hear if anyone else has encountered a similar problem with the Python Apache Beam Dataflow runner, as described below. (I'm not able to ship to the CloudRunner just yet.)


The query that is being executed returns just under 18 million rows. If I add a LIMIT to the query (e.g. 10000), the pipeline works as expected. Not included in the code snippet is the WriteToBleve sink, which is a custom sink that supports writing to a bleve index.


The Python SDK being used is 2.2.0, but I'm getting ready to spark up some Java....


The last log message I see when running the pipeline is:


WARNING:root:Dataset my-project:temp_dataset_7708fbe7e7694cd49b8b0de07af2470b does not exist so we will create it as temporary with location=None


The dataset is created correctly and populated, and when I debug into the pipeline I can see the results being iterated, but the pipeline never seems to reach the write stage.


    import apache_beam as beam

    # Pipeline options: run locally on the DirectRunner.
    options = {
        "project": "my-project",
        "staging_location": "gs://my-project/staging",
        "temp_location": "gs://my-project/temp",
        "runner": "DirectRunner"
    }
    pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)
    p = beam.Pipeline(options=pipeline_options)

    # Read the full query result from BigQuery and hand each row to the
    # custom bleve sink (WriteToBleve is not shown here).
    p | 'Read From Bigquery' >> beam.io.Read(beam.io.BigQuerySource(
        query=self.build_query(),
        use_standard_sql=True,
        validate=True,
        flatten_results=False,
    )) | 'Write to Bleve' >> WriteToBleve()

    result = p.run()
    result.wait_until_finish()

1 Answer

#1



The direct runner is meant to be used for local debugging and testing your pipeline on small amounts of data. It is not particularly optimized for performance and is not meant to be used for large amounts of data - this is the case both for Python and Java.

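One way to keep DirectRunner runs small, echoing the LIMIT observation in the question, is to cap the query only when running locally. A rough sketch, assuming a hypothetical table name and that build_query() can simply append a clause (the real build_query() is not shown in the question):

    # Hypothetical sketch: keep DirectRunner debugging runs small by capping the query.
    # The table name is a placeholder and the real build_query() is not shown above.
    def build_query(self, local_limit=10000):
        query = "SELECT * FROM `my-project.my_dataset.my_table`"
        if self.options["runner"] == "DirectRunner":  # assumed: the options dict is accessible here
            query += " LIMIT {}".format(local_limit)
        return query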

That said, currently some very serious improvements to the Python direct runner are in progress.


I recommend you try running on Dataflow and see if performance is still unsatisfactory.

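For reference, a rough sketch of what pointing the options from the question at the managed Dataflow service might look like (same project and buckets as above are assumed; depending on the SDK version, additional options such as a region may also be required):

    import apache_beam as beam

    # Same options as in the question, but targeting the Dataflow service
    # instead of the local DirectRunner.
    options = {
        "project": "my-project",
        "staging_location": "gs://my-project/staging",
        "temp_location": "gs://my-project/temp",
        "runner": "DataflowRunner"
    }
    pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)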

Also, if you can write in Java, I recommend doing that: it often performs orders of magnitude better than Python, especially when reading from BigQuery. Reading from BigQuery goes through a BigQuery export to Avro files, and the performance of the standard Python library for reading Avro files is notoriously bad; unfortunately, there is currently no well-performing and maintained replacement.

