How to specify multiple input paths for a Dataflow job

Time: 2022-09-01 23:18:04

I want to run a Dataflow job over multiple inputs from Google Cloud Storage, but the paths I want to pass to the job can't be specified with just the * glob operator.

Consider these paths:

gs://bucket/some/path/20160208/input1
gs://bucket/some/path/20160208/input2
gs://bucket/some/path/20160209/input1
gs://bucket/some/path/20160209/input2
gs://bucket/some/path/20160210/input1
gs://bucket/some/path/20160210/input2
gs://bucket/some/path/20160211/input1
gs://bucket/some/path/20160211/input2
gs://bucket/some/path/20160212/input1
gs://bucket/some/path/20160212/input2

I want my job to work on the files in the 20160209, 20160210 and 20160211 directories, but not on 20160208 (the first) and 20160212 (the last). In reality there are many more dates, and I want to be able to specify an arbitrary range of dates for my job to work on.

The docs for TextIO.Read say:

Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.

But I can't get this to work. There's a link to Java Filesystem glob patterns, which in turn links to getPathMatcher(String), which lists all the globbing options. One of them is {a,b,c}, which looks like exactly what I need. However, if I pass gs://bucket/some/path/201602{09,10,11}/* to TextIO.Read#from, I get "Unable to expand file pattern".

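For concreteness, the failing attempt looks roughly like this. This is only a minimal sketch against the pre-Beam Dataflow Java SDK 1.x API that TextIO.Read#from comes from; the pipeline options setup is assumed:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class BraceGlobAttempt {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The {09,10,11} alternation comes from the getPathMatcher(String) glob syntax,
    // but Dataflow rejects it with "Unable to expand file pattern".
    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://bucket/some/path/201602{09,10,11}/*"));

    p.run();
  }
}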

Maybe the docs mean that only *, ? and […] are supported, and if that is the case, how can I construct a glob that Dataflow will accept and that can match an arbitrary date range like the one I describe above?

Update: I've figured out that I can write a chunk of code so that I can pass in the path prefixes as a comma-separated list, create an input from each and use the Flatten transform, but that seems like a very inefficient way of doing it. It looks like the first step reads all input files and immediately writes them out again to a temporary location on GCS. Only when all the inputs have been read and written does the actual processing start. This step is completely unnecessary in the job I'm writing. I want the job to read the first file, start processing it, read the next, and so on. This has just caused a ton of other problems; I'll try to make it work, but it feels like a dead end because of the initial rewriting.

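For reference, the workaround described in this update looks roughly like this. It is only a sketch, again assuming the Dataflow Java SDK 1.x API; the prefix list is hard-coded here, while the real job would parse it from a comma-separated option:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

public class FlattenWorkaround {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The date directories to read; in the real job these arrive as a
    // comma-separated list and are split on ",".
    String[] prefixes = {
        "gs://bucket/some/path/20160209",
        "gs://bucket/some/path/20160210",
        "gs://bucket/some/path/20160211"
    };

    // One TextIO.Read per prefix, then Flatten everything into a single PCollection.
    PCollectionList<String> parts = null;
    for (String prefix : prefixes) {
      PCollection<String> lines = p.apply(TextIO.Read.from(prefix + "/*"));
      parts = (parts == null) ? PCollectionList.of(lines) : parts.and(lines);
    }
    PCollection<String> allLines = parts.apply(Flatten.<String>pCollections());

    // ... the actual processing of allLines goes here ...

    p.run();
  }
}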

1 solution

#1

The docs do, indeed, mean that only *, ?, and [...] are supported. This means that arbitrary subsets or ranges in alphabetical or numeric order cannot be expressed as a single glob.

Here are some approaches that might work for you:

  1. If the date represented in the file path is also present in the records in the files, then the simplest solution is to read them all and use a Filter transform to select the date range you are interested in.
  2. The approach you tried, with many reads in separate TextIO.Read transforms flattened together, is OK for small sets of files; our tf-idf example does this. You can express arbitrary numerical ranges with a small number of globs, so this need not be one read per file (for example, the two-character range "23 through 67" is 2[3-9] plus [3-5][0-9] plus 6[0-7]).
  3. If the subset of files is more arbitrary, then the number of globs/filenames may exceed the maximum graph size, and the last recommendation is to put the list of files into a PCollection and use a ParDo transform to read each file and emit its contents (a sketch of this follows the list).
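
A minimal sketch of the last approach, under assumptions not stated in the answer: it uses the pre-Beam Dataflow Java SDK 1.x API from the question, and the google-cloud-storage client library to read each object inside the DoFn (the answer only says to read each file and emit its contents, not how). Note also that for the concrete range in the question, approach 2 already needs only two globs: gs://bucket/some/path/20160209/* plus gs://bucket/some/path/2016021[01]/*.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class ReadFileListViaParDo {

  // Takes a full gs://bucket/object path and emits the object's lines.
  // A real job would create the Storage client once per worker rather than per element.
  static class ReadGcsFileFn extends DoFn<String, String> {
    @Override
    public void processElement(ProcessContext c) throws Exception {
      String path = c.element();                      // e.g. gs://bucket/some/path/20160209/input1
      String rest = path.substring("gs://".length());
      int slash = rest.indexOf('/');
      String bucket = rest.substring(0, slash);
      String object = rest.substring(slash + 1);

      Storage storage = StorageOptions.getDefaultInstance().getService();
      byte[] bytes = storage.readAllBytes(BlobId.of(bucket, object));
      for (String line : new String(bytes, StandardCharsets.UTF_8).split("\n")) {
        c.output(line);
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The explicit file list; in practice it would be generated from the date range.
    List<String> files = Arrays.asList(
        "gs://bucket/some/path/20160209/input1",
        "gs://bucket/some/path/20160209/input2",
        "gs://bucket/some/path/20160210/input1",
        "gs://bucket/some/path/20160210/input2",
        "gs://bucket/some/path/20160211/input1",
        "gs://bucket/some/path/20160211/input2");

    PCollection<String> lines =
        p.apply(Create.of(files))
         .apply(ParDo.of(new ReadGcsFileFn()));

    // ... downstream processing of lines ...

    p.run();
  }
}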

I hope this helps!
