External Python dependencies in a Dataflow pipeline

Date: 2023-01-25 15:34:36

Can Python dependencies be loaded into a Google Cloud Dataflow pipeline? I would like to use gensim's phrase modeler, which reads data line by line to automatically detect common phrases/bigrams (two words that frequently appear next to each other). The first pass through the pipeline would feed each sentence to this phrase modeler. The second pass would then apply the trained phrase modeler to each sentence to merge the detected phrases (if 'machine' and 'learning' frequently appear next to each other in the corpus, they would be transformed into the single token 'machine_learning'). Would this be possible to accomplish within Dataflow? Can a build/requirements file be passed that forces pip install gensim on the worker machines?

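For reference, a minimal local sketch of the two-pass flow described above, using gensim's Phrases/Phraser API (the toy corpus and the min_count/threshold settings are placeholder values chosen so the merge triggers on tiny data):

from gensim.models.phrases import Phrases, Phraser

# Toy corpus: each sentence is a list of tokens (placeholder data).
sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['machine', 'learning', 'models', 'need', 'data'],
    ['deep', 'machine', 'learning', 'research'],
]

# Pass 1: feed every sentence to the phrase model so it can count co-occurrences.
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram = Phraser(phrases)  # freeze the trained model for fast application

# Pass 2: apply the frozen model; frequent pairs are merged into single tokens.
for sent in sentences:
    print(bigram[sent])  # e.g. ['machine_learning', 'is', 'fun']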

1 solution

#1



You can check out this page for managing dependencies in your pipeline:


https://beam.apache.org/documentation/sdks/python-pipeline-dependencies

Example: for packages on PyPI, you can use a requirements file by adding the following command-line option:


--requirements_file requirements.txt

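For instance, a minimal sketch of passing that option when launching a Beam pipeline from Python (the project, region, and bucket names are placeholders, and requirements.txt is assumed to contain a line with gensim):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket values for illustration only.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
    '--requirements_file=requirements.txt',  # workers pip install from this file
])

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/corpus.txt')
     | 'Tokenize' >> beam.Map(lambda line: line.lower().split()))

Note that, per the linked page, --requirements_file covers PyPI packages; local or non-PyPI dependencies go through --extra_package, and multi-file projects through a setup.py with --setup_file.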
