BigQuery script cannot access a large file

Time: 2022-03-08 15:30:03

I am trying to load a JSON file into Google BigQuery using the script at https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/api/load_data_by_post.py with very little modification. I added


,chunksize=10*1024*1024, resumable=True))

to MediaFileUpload.

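For reference, the modified call looks roughly like the sketch below; the file path and MIME type are placeholders, not values taken from the linked sample.

```python
from googleapiclient.http import MediaFileUpload

# Resumable upload in 10 MB chunks (the modification described above).
# 'data.json' stands in for the actual newline-delimited JSON file.
media_body = MediaFileUpload(
    'data.json',
    mimetype='application/octet-stream',
    chunksize=10 * 1024 * 1024,
    resumable=True,
)
```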

The script works fine for a sample file with a few million records. The actual file is about 140 GB with approx 200,000,000 records. insert_request.execute() always fails with


socket.error: `[Errno 32] Broken pipe` 

after half an hour or so. How can this be fixed? Each row is less than 1 KB, so it shouldn't be a quota issue.


1 solution

#1

When handling large files, don't use streaming; use a batch load instead. Streaming will easily handle up to 100,000 rows per second. That's pretty good for streaming, but not a good fit for loading large files.

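For contrast, "streaming" here means the tabledata.insertAll path. The hedged sketch below (table ID and rows are made up) shows that pattern with the google-cloud-bigquery client, i.e. the approach being advised against for a 140 GB file:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Streaming inserts (tabledata.insertAll): fine for a steady trickle of rows,
# not for bulk-loading roughly 200,000,000 records in one go.
errors = client.insert_rows_json(
    'my_project.my_dataset.my_table',  # hypothetical destination table
    [{'id': 1, 'name': 'example'}],    # hypothetical rows
)
if errors:
    print('Streaming insert errors:', errors)
```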

The linked sample code is doing the right thing (batch load instead of streaming), so what we see is a different problem: the sample tries to push all of this data straight into BigQuery, but the upload over POST is the part that fails.


Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.

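A minimal sketch of that approach, using the newer google-cloud-storage and google-cloud-bigquery client libraries rather than the googleapiclient-based sample; the bucket, file, and table names are assumptions, and the file is assumed to be newline-delimited JSON:

```python
from google.cloud import bigquery, storage

# 1. Stage the file in Google Cloud Storage (bucket and object names are hypothetical).
storage_client = storage.Client()
bucket = storage_client.bucket('my-staging-bucket')
blob = bucket.blob('data.json')
blob.upload_from_filename('data.json')  # the client uses a resumable upload for large files

# 2. Tell BigQuery to load from GCS instead of POSTing the bytes itself.
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or pass an explicit schema
)
load_job = bq_client.load_table_from_uri(
    'gs://my-staging-bucket/data.json',
    'my_project.my_dataset.my_table',  # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
```

Staging in GCS also makes retries cheap: if something fails, only the upload to GCS has to be repeated, and the BigQuery load itself runs server-side.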

Update: After talking to the engineering team, POST should work if you try a smaller chunksize.

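If you stay on the POST route, the change amounts to lowering the chunksize argument shown earlier; the 1 MB value below is an illustrative guess, not a number from the engineering team.

```python
from googleapiclient.http import MediaFileUpload

# Same call as before, but with a smaller chunk size (1 MB instead of 10 MB).
media_body = MediaFileUpload(
    'data.json',
    mimetype='application/octet-stream',
    chunksize=1 * 1024 * 1024,
    resumable=True,
)
```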
