What is the most efficient way to get the first and last lines of a text file?

Date: 2022-12-05 09:25:10

I have a text file which contains a timestamp on each line. My goal is to find the time range. All the times are in order, so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last lines. What would be the most efficient way to get these lines in Python?

Note: These files are relatively long, about 1-2 million lines each, and I have to do this for several hundred files.

11 Answers

#1


51  

See the docs for the io module.

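import os

with open(fname, 'rb') as fh:
    first = next(fh).decode()            # first line

    # Jump 1024 bytes back from the end; raises OSError if the file is shorter.
    fh.seek(-1024, os.SEEK_END)
    last = fh.readlines()[-1].decode()   # last line found in that tail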

The value 1024 here is arbitrary; I chose it only as an example. It stands in for the average line length. If you have an estimate of the average line length, you could just use twice that value.

If you have no idea whatsoever about the possible upper bound on the line length, the obvious solution is to loop over the file:

for line in fh:
    pass
last = line

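For this loop you don't need to bother with the binary flag; you could just use open(fname). The seek-based version above does need binary mode on Python 3, where text-mode files reject nonzero offsets relative to the end.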

ETA: Since you have many files to work on, you could use random.sample to pick a couple of dozen files and run this code on them, with an a priori large position shift (say 1 MB), to determine the length of the last line. This will help you estimate the value to use for the full run.
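The sampling idea could look roughly like the sketch below. It is only an illustration: the helper names last_line_length and estimate_shift, the sample size k=24, and the 1 MB window are assumptions, not part of the original answer.

```python
import os
import random


def last_line_length(path, window=1 << 20):
    """Return the byte length of the last line, reading at most `window` bytes.

    If the last line is longer than `window`, the result is an underestimate.
    """
    with open(path, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)            # file size in bytes
        fh.seek(-min(window, size), os.SEEK_END)  # never seek before the start
        lines = fh.readlines()
    return len(lines[-1]) if lines else 0


def estimate_shift(paths, k=24, seed=None):
    """Sample up to k files and return twice the longest last line seen."""
    paths = list(paths)
    rng = random.Random(seed)
    sample = rng.sample(paths, min(k, len(paths)))
    return 2 * max((last_line_length(p) for p in sample), default=0)
```

The returned value can then be used in place of the hard-coded 1024 when processing the full set of files.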

#2


54  

You could open the file for reading and read the first line using the builtin readline(), then seek to the end of the file and step backwards until you find the EOL preceding the last line, and read the last line from there.

import os

with open(file, "rb") as f:
    first = f.readline()        # Read the first line.
    f.seek(-2, os.SEEK_END)     # Jump to the second last byte.
    while f.read(1) != b"\n":   # Until EOL is found...
        f.seek(-2, os.SEEK_CUR) # ...jump back the read byte plus one more.
    last = f.readline()         # Read last line.

Jumping to the second last byte instead of the last one prevents the loop from stopping immediately on a trailing EOL. While stepping backwards you also need to step two bytes at a time, since reading and checking for EOL pushes the position forward one step.

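When using seek, the signature is seek(offset, whence=0), where whence specifies what the offset is relative to. Quote from docs.python.org (this passage describes text-mode streams, hence the reference to TextIOBase.tell(); the code above opens the file in binary mode, where nonzero offsets relative to the current position or the end are allowed):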

  • SEEK_SET or 0 = seek from the start of the stream (the default); offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
  • SEEK_CUR or 1 = “seek” to the current position; offset must be zero, which is a no-operation (all other values are unsupported).
  • SEEK_END or 2 = seek to the end of the stream; offset must be zero (all other values are unsupported).
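The difference between the two modes can be checked with a small experiment. This is a minimal sketch on a throwaway temporary file; the file contents and variable names are made up for illustration.

```python
import io
import os
import tempfile

# Write a small throwaway file to experiment on.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\nmiddle\nlast\n")
    path = tmp.name

# Text mode: a nonzero offset relative to the end is rejected.
with open(path, "r") as f:
    try:
        f.seek(-2, os.SEEK_END)
        text_mode_ok = True
    except io.UnsupportedOperation:
        text_mode_ok = False

# Binary mode: the same seek is allowed and lands on the second last byte.
with open(path, "rb") as f:
    pos = f.seek(-2, os.SEEK_END)
    byte = f.read(1)

os.remove(path)
```

On Python 3, text_mode_ok ends up False, while the binary-mode seek succeeds. That is why the snippets in this thread open the file with "rb".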

Running it through timeit 10k times on a file with 6k lines totalling 200kB gave me 1.62s vs 6.92s compared with the for-loop below, which was suggested earlier. On a 1.3GB file, still with 6k lines, a hundred runs took 8.93s vs 86.95s.

with open(file, "rb") as f:
    first = f.readline()     # Read the first line.
    for last in f: pass      # Loop through the whole file reading it all.
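A comparison like this could be reproduced roughly as follows. It is a sketch, not the original benchmark: the function names last_by_seek and last_by_loop, the generated test file, and the run count are assumptions made for illustration, and the timings will differ from the figures quoted above.

```python
import os
import tempfile
import timeit


def last_by_seek(path):
    """Find the last line by stepping backwards from the end of the file."""
    with open(path, "rb") as f:
        f.seek(-2, os.SEEK_END)
        while f.read(1) != b"\n":
            f.seek(-2, os.SEEK_CUR)
        return f.readline()


def last_by_loop(path):
    """Find the last line by iterating over every line in the file."""
    with open(path, "rb") as f:
        for last in f:
            pass
    return last


# Hypothetical test data: 6000 short timestamp-like lines.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    for i in range(6000):
        tmp.write(b"2022-12-05 09:25:10 line %d\n" % i)
    path = tmp.name

t_seek = timeit.timeit(lambda: last_by_seek(path), number=100)
t_loop = timeit.timeit(lambda: last_by_loop(path), number=100)
print("seek: %.3fs  loop: %.3fs" % (t_seek, t_loop))
```

Both functions assume a file with at least two lines, as in the answer above.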

#3


22  

Here's a modified version of SilentGhost's answer that will do what you want.


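import os

with open(fname, 'rb') as fh:
    first = next(fh)
    size = fh.seek(0, os.SEEK_END)   # file size in bytes
    offs = 100
    while True:
        offs = min(offs, size)       # never seek before the start of the file
        fh.seek(-offs, os.SEEK_END)
        lines = fh.readlines()
        # More than one line means the last one is complete;
        # offs == size means the whole file has been read.
        if len(lines) > 1 or offs == size:
            last = lines[-1]
            break
        offs *= 2                    # no complete line yet: double the window
    print(first)
    print(last)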

No need for an upper bound for line length here.


#4


8  

Can you use Unix commands? I think head -n 1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] to get the last, but that may take too much memory.
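If you want to call those commands from Python, something like the sketch below might do. It assumes a Unix-like system with head and tail on the PATH; the function name first_and_last_via_shell is made up for illustration.

```python
import subprocess


def first_and_last_via_shell(path):
    """Return (first, last) lines as bytes using the head and tail utilities."""
    first = subprocess.check_output(["head", "-n", "1", path])
    last = subprocess.check_output(["tail", "-n", "1", path])
    return first, last
```

Each call spawns a process, so over several hundred files the startup cost may outweigh the benefit compared with seeking inside Python.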

#5


3  

First open the file in read mode. Then use the readlines() method to read all the lines into a list. Now you can index the list to get the first and last lines of the file. Note that this loads the entire file into memory.

with open('file.txt', 'rb') as a:
    lines = a.readlines()
if lines:
    first_line = lines[0]    # lines[:1] would give a one-element list
    last_line = lines[-1]

#6


3  

with open('file.txt', 'r') as w:
    first = w.readline()
    print('first line is : ', first)
    x = first                   # fallback when the file has a single line
    for line in w:
        x = line
print('last line is : ', x)

The for loop runs through the lines and x gets the last line on the final iteration.


#7


3  

This is my solution, which is also compatible with Python 3. It handles edge cases, but it lacks UTF-16 support:

import os


def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """

    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""

                for pos in range(start_pos, -1, -1):
                    f.seek(pos)

                    char = f.read(1)

                    if char == b"\n":
                        break
                else:
                    # No newline found: the file is one line, so rewind to its start.
                    f.seek(0)

        return f.readline()

It's inspired by Trasp's answer and AnotherParker's comment.

#8


1  

with open("myfile.txt") as f:
    lines = f.readlines()   # loads the whole file into memory
    first_row = lines[0]
    print(first_row)
    last_row = lines[-1]
    print(last_row)

#9


1  

Here is an extension of @Trasp's answer with additional logic for handling the corner case of a file that has only one line. Handling this case may be useful if you repeatedly read the last line of a file that is continuously being updated. Without it, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()      # Read the first line.
        f.seek(-2, 2)             # Jump to the second last byte.
        while f.read(1) != b"\n": # Until EOL is found...
            try:
                f.seek(-2, 1)     # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()       # Read last line.
    return last

#10


1  

Nobody mentioned using reversed:


with open(file, "r") as f:
    r = reversed(f.readlines())   # note: still reads the whole file into memory
last_line_of_file = next(r)

#11


0  

Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek back that amount from SEEK_END, find the second-to-last line ending, and then readline() the last line.
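A minimal sketch of that idea using the low-level os functions is shown below. The function name last_line_lseek and the default bound of 1024 bytes are assumptions for illustration; if the last line is longer than the bound, the result is truncated.

```python
import os


def last_line_lseek(path, max_line_len=1024):
    """Return the last line as bytes, assuming no line exceeds max_line_len bytes."""
    # O_BINARY only exists on Windows; elsewhere the extra flag is 0.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        size = os.fstat(fd).st_size
        # Read enough to cover the last line plus the newline that precedes it.
        window = min(size, max_line_len + 1)
        os.lseek(fd, -window, os.SEEK_END)
        tail = os.read(fd, window)
    finally:
        os.close(fd)
    # Ignore a trailing newline, then cut at the previous line ending.
    body = tail[:-1] if tail.endswith(b"\n") else tail
    return tail[body.rfind(b"\n") + 1:]
```

If no earlier line ending falls inside the window, the whole window is returned, which is correct for a single-line file that fits within the bound.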
