How can I read a large csv file into evenly sized chunks in Python?

Time: 2022-01-10 02:02:07

In a basic script I had the following process.

import csv
# Python 3: open csv files in text mode with newline=''
reader = csv.reader(open('huge_file.csv', newline=''))

for line in reader:
    process_line(line)

See this related question. I want to send the rows to be processed in batches of 100, to implement batch sharding.

The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len():

>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', newline=''))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is not subscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is not subscriptable

How can I solve this?


2 Answers

#1 (20 votes)

Just make your reader subscriptable by wrapping it in a list. Obviously this will break on really large files (see the alternatives in the updates below):

>>> reader = csv.reader(open('big.csv', newline=''))
>>> lines = list(reader)
>>> print(lines[:100])
...

Further reading: How do you split a list into evenly sized chunks in Python?

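For instance, a minimal sketch along the lines of that question's answers: once the rows are in a list, a slicing loop splits them into evenly sized batches (process_chunk stands in for whatever per-batch work you need, e.g. the one from Update 1 below):

def chunks(lst, n):
    """Yield successive n-sized slices from lst; the last one may be shorter."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

for batch in chunks(lines, 100):
    process_chunk(batch)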


Update 1 (list version): Another possible way is to just process each chunk as it arrives while iterating over the lines:

#!/usr/bin/env python

import csv

reader = csv.reader(open('4956984.csv', newline=''))

chunk, chunksize = [], 100

def process_chunk(chunk):
    print(len(chunk))
    # do something useful ...

for i, line in enumerate(reader):
    if i % chunksize == 0 and i > 0:
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)

# process the remainder, if any
if chunk:
    process_chunk(chunk)

Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:


#!/usr/bin/env python

import csv

reader = csv.reader(open('4956984.csv', newline=''))

def gen_chunks(reader, chunksize=100):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            # Rebind rather than del chunk[:]: clearing in place would also
            # clear the list a consumer may still be holding a reference to.
            chunk = []
        chunk.append(line)
    if chunk:  # don't yield an empty trailing chunk
        yield chunk

for chunk in gen_chunks(reader):
    print(chunk)  # process chunk

# test gen_chunks on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print(chunk)  # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]
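
For comparison, here is an alternative sketch (not from the original answer) that builds each chunk with itertools.islice from the standard library; it avoids the manual index bookkeeping and works on any row iterator, not just csv readers:

import csv
from itertools import islice

def gen_chunks(rows, chunksize=100):
    """Yield successive lists of up to `chunksize` rows from any iterator."""
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:  # iterator exhausted
            return
        yield chunk

with open('4956984.csv', newline='') as f:
    for chunk in gen_chunks(csv.reader(f)):
        print(len(chunk))  # process chunk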

#2 (1 vote)

There isn't a good way to do this for all .csv files. You should be able to divide the file into chunks using file.seek to skip a section of the file, but then you have to scan one byte at a time to find the end of the row. Then you can process the two chunks independently. Something like the following (untested) code should get you started.

import csv

file_one = open('foo.csv')
file_two = open('foo.csv')
file_two.seek(0, 2)      # seek to the end of the file
sz = file_two.tell()     # fetch the offset
file_two.seek(sz // 2)   # seek back to the middle (seek wants an integer)
ch = ''                  # renamed from `chr`, which would shadow the builtin
while ch != '\n':        # scan one character at a time for the end of the row
    ch = file_two.read(1)
# file_two is now positioned at the start of a record
segment_one = csv.reader(file_one)
segment_two = csv.reader(file_two)

I'm not sure how you can tell when you have finished traversing segment_one. If the CSV has a column that is a row id, then you can stop processing segment_one when you encounter the row id from the first row of segment_two.
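
A rough, untested sketch of that stopping rule, assuming the row id is the first column and reusing process_line from the question:

import csv

file_one = open('foo.csv')
file_two = open('foo.csv')
file_two.seek(0, 2)                  # seek to the end of the file
file_two.seek(file_two.tell() // 2)  # jump back to the middle
file_two.readline()                  # discard the partial line to reach a record boundary

segment_one = csv.reader(file_one)
segment_two = csv.reader(file_two)

# Row id of the first full row in the second half. Note this consumes that
# row, so processing of segment_two should resume from the *next* row.
boundary_id = next(segment_two)[0]

for row in segment_one:
    if row[0] == boundary_id:  # reached the start of segment_two
        break
    process_line(row)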
