Apache Beam Dataflow runner throws a setup error

Time: 2022-11-01 15:36:05

We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:

A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.

However, we could not find any detailed worker-startup logs.

We tried increasing the memory size, the worker count, etc., but we still get the same error.

Here is the command we use:

python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--requirements_file=requirements.txt \
--worker_machine_type n1-standard-8 \
--num_workers 2

Pipeline snippet:

import apache_beam as beam

# Read rows from BigQuery; each element is a dict keyed by column name.
data = pipeline | "load data" >> beam.io.Read(
    beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100")
)

# Keep only the rows whose 'column_name' matches the expected value.
data | "filter data" >> beam.Filter(lambda x: x.get('column_name') == value)

The pipeline above just loads data from BigQuery and filters it on a column value. It works like a charm with DirectRunner but fails on Dataflow.

Are we making any obvious setup mistake? Is anyone else getting the same error? We could use some help resolving the issue.

Update:

Our pipeline code is spread across multiple files, so we created a Python package. We solved the setup error by passing the --setup_file argument instead of --requirements_file.
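
For reference, the adjusted launch command would look roughly like this, assuming the package's setup.py sits next to run.py (the project and bucket values are the same placeholders as above):

python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--setup_file=./setup.py \
--worker_machine_type n1-standard-8 \
--num_workers 2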

1 Solution

#1

We resolved this setup error by sending a different set of arguments to Dataflow. Our code is spread across multiple files, so we had to create a package for it. If we use --requirements_file, the job will start but eventually fail, because the workers cannot find the package. The Beam Python SDK sometimes does not throw an explicit error message for this; instead, it retries the job and then fails. To get your code running as a package, you need to pass the --setup_file argument, with the dependencies listed in that file. Make sure the package created by the python setup.py sdist command includes all of the files required by your pipeline code.
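
As an illustration, a minimal setup.py for such a package might look like the sketch below; the package name, version, and dependency list are placeholders, not taken from the question:

# setup.py - minimal packaging sketch for a multi-file Beam pipeline.
import setuptools

setuptools.setup(
    name="my_beam_pipeline",  # placeholder package name
    version="0.0.1",
    # Include every sub-package so the sdist ships all pipeline modules.
    packages=setuptools.find_packages(),
    # Dependencies to install on the Dataflow workers along with the package.
    install_requires=[
        "apache-beam[gcp]",
    ],
)

With --setup_file pointing at this file, the SDK builds an sdist at submission time and installs the resulting package on each worker, which is why the workers can then import your pipeline modules.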

If you have a privately hosted Python package dependency, pass --extra_package with the path to the package .tar.gz file. A better way is to store it in a GCS bucket and pass that path here.
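
Roughly, that argument would be appended to the launch command like the line below (the file name is a placeholder; per the note above, a gs:// path to the staged tarball can be used instead of a local one):

--extra_package=./dist/private_package-0.1.0.tar.gz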

I have written an example project to get started with Apache Beam Python SDK on Dataflow - https://github.com/RajeshHegde/apache-beam-example

Read about it here - https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366
