Get the MD5 hash of large files in Python

Time: 2022-10-28 18:57:38

I have used hashlib (which replaces md5 in Python 2.6/3.0), and it worked fine if I opened a file and put its content into the hashlib.md5() function.

The problem is with very big files whose sizes could exceed the available RAM.

How to get the MD5 hash of a file without loading the whole file to memory?

11 solutions

#1


136  

Break the file into 128-byte chunks and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 processes the input in fixed-size blocks. Basically, when MD5 digest()s the file, this is exactly what it is doing.

If you make sure you free the memory on each iteration (i.e. not read the entire file to memory), this shall take no more than 128 bytes of memory.

One example is to read the chunks like so:

import hashlib

md5 = hashlib.md5()
with open(fileName, 'rb') as f:                    # 'rb' so the raw bytes are hashed
    for chunk in iter(lambda: f.read(128), b''):   # b'' sentinel stops the loop at EOF
        md5.update(chunk)
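
Once the loop finishes, the digest can be read out - a hypothetical continuation of the sketch above:

print(md5.hexdigest())   # 32-character hex string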

#2


202  

You need to read the file in chunks of suitable size:

import hashlib

def md5_for_file(f, block_size=2**20):
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()

NOTE: Make sure you open your file in binary mode (pass 'rb' to open) - otherwise you will get the wrong result.
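
To see why the mode matters, here is a small illustrative snippet (not part of the original answer): in Python 3, feeding a str to update() raises a TypeError outright, while Python 2 on Windows silently translates '\r\n' to '\n' in text mode before hashing.

import hashlib

data = b"line one\r\nline two\r\n"                  # pretend these are the file's raw bytes
translated = data.replace(b"\r\n", b"\n")           # what text mode would effectively hand you on Windows (Python 2)
print(hashlib.md5(data).hexdigest())                # digest of the real bytes
print(hashlib.md5(translated).hexdigest())          # a different digest - hence the "wrong result"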

So to do the whole lot in one method - use something like:

import hashlib
import os

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()

The update above was based on the comments provided by Frerich Raabe - I tested this and found it to be correct on my Python 2.7.2 Windows installation.

I cross-checked the results using the 'jacksum' tool.

jacksum -a md5 <filename>

http://www.jonelo.de/java/jacksum/

#3


96  

If you care about a more Pythonic (no 'while True') way of reading the file, check this code:

import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(8192), b''): 
            md5.update(chunk)
    return md5.digest()

Note that the iter() function needs an empty byte string for the returned iterator to halt at EOF, since read() returns b'' (not just '').
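
As a quick standalone illustration of the two-argument iter() form (this snippet is not from the answer above):

import io

buf = io.BytesIO(b"abcdefghij")
for chunk in iter(lambda: buf.read(4), b''):   # keeps calling the lambda until it returns b''
    print(chunk)                               # b'abcd', then b'efgh', then b'ij'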

#4


47  

Here's my version of @Piotr Czapla's method:

import hashlib

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
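
For reference, md5.block_size is 64 bytes, so the lambda above reads 128 * 64 = 8192 bytes per call - the same chunk size used in the previous answer:

import hashlib

print(hashlib.md5().block_size)        # 64
print(128 * hashlib.md5().block_size)  # 8192 bytes per read in the loop above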

#5


27  

Using multiple comments/answers in this thread, here is my solution:

import hashlib
def md5_for_file(path, block_size=256*128, hr=False):
    '''
    Block size directly depends on the block size of your filesystem
    to avoid performance issues
    Here I have blocks of 4096 octets (Default NTFS)
    '''
    md5 = hashlib.md5()
    with open(path,'rb') as f: 
        for chunk in iter(lambda: f.read(block_size), b''): 
             md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()
  • This is "pythonic"
  • This is a function
  • It avoids implicit values: always prefer explicit ones.
  • It allows (very important) performance optimizations

And finally,

- This has been built by a community, thanks all for your advice/ideas.
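
A quick usage sketch of md5_for_file defined above (the file path is hypothetical):

print(md5_for_file('/path/to/big_file.iso'))            # raw 16-byte digest (bytes)
print(md5_for_file('/path/to/big_file.iso', hr=True))   # 32-character hex string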

#6


6  

A Python 2/3 portable solution

To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you'll sum bytes values:

To be py27/py3 portable, you ought to use the io package, like this:

import hashlib
import io


def md5sum(src):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        content = fd.read()
        md5.update(content)
    return md5
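
Note that this returns the md5 object itself; to get a printable digest you would call, for example (the file name is hypothetical):

print(md5sum("some_file.bin").hexdigest())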

If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:

def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5

The trick here is to use the iter() function with a sentinel (the empty byte string).

The iterator created in this case will call o [the lambda function] with no arguments for each call to its next() method; if the value returned is equal to sentinel, StopIteration will be raised, otherwise the value will be returned.

If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the amount of calculated bytes:

def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5
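
A possible way to call the callback variant (the file name and progress format are made up for illustration):

import os

path = "big_file.bin"                                   # hypothetical file
total = os.path.getsize(path)

def progress(done):
    print("hashed %d of %d bytes" % (done, total))

print(md5sum(path, progress).hexdigest())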

#7


4  

A remix of Bastien Semene's code that takes Hawkwing's comment about generic hashing functions into consideration...

import hashlib  # note: hashlib.algorithms exists on Python 2.7; Python 3 offers hashlib.algorithms_guaranteed

def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
    """
    Block size directly depends on the block size of your filesystem
    to avoid performance issues
    Here I have blocks of 4096 octets (Default NTFS)

    Linux Ext4 block size
    sudo tune2fs -l /dev/sda5 | grep -i 'block size'
    > Block size:               4096

    Input:
        path: a path
        algorithm: an algorithm in hashlib.algorithms
                   ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
        block_size: a multiple of 128 corresponding to the block size of your filesystem
        human_readable: switch between digest() or hexdigest() output, default hexdigest()
    Output:
        hash
    """
    if algorithm not in hashlib.algorithms:
        raise NameError('The algorithm "{algorithm}" you specified is '
                        'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))

    hash_algo = hashlib.new(algorithm)  # According to hashlib documentation using new()
                                        # will be slower than calling the named
                                        # constructors, ex.: hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
             hash_algo.update(chunk)
    if human_readable:
        file_hash = hash_algo.hexdigest()
    else:
        file_hash = hash_algo.digest()
    return file_hash
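
A possible usage sketch, assuming Python 2.7 (where hashlib.algorithms exists) and hypothetical file paths:

print(hash_for_file('/path/to/file.bin'))                      # MD5 hex digest (hashlib.algorithms[0] is 'md5')
print(hash_for_file('/path/to/file.bin', algorithm='sha256'))  # SHA-256 hex digest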

#8


1  

You can't get its MD5 without reading the full content, but you can use the update() function to read the file's content block by block.
m.update(a); m.update(b) is equivalent to m.update(a+b)
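
A small sketch verifying that property:

import hashlib

a, b = b"hello ", b"world"

m1 = hashlib.md5()
m1.update(a)
m1.update(b)               # incremental updates...

m2 = hashlib.md5(a + b)    # ...give the same digest as hashing the concatenation
assert m1.hexdigest() == m2.hexdigest()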

#9


-1  

Implementation of the accepted answer for Django:

import hashlib
from django.db import models


class MyModel(models.Model):
    file = models.FileField()  # any field based on django.core.files.File

    def get_hash(self):
        hash = hashlib.md5()
        for chunk in self.file.chunks(chunk_size=8192):
            hash.update(chunk)
        return hash.hexdigest()

#10


-1  

import hashlib

with open('/home/parrot/pass.txt', 'r') as opened:
    for line in opened:
        stripped = line.strip('\n')
        hash_object = hashlib.md5(stripped.encode())
        print(hash_object.hexdigest())  # MD5 of each line, not of the whole file

#11


-3  

I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs on MySQL so I experimented with various file sizes and the straightforward Python approach, viz:

FileHash = hashlib.md5(FileData).hexdigest()

I could detect no noticeable performance difference with a range of file sizes from 2 KB to 20 MB, and therefore no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer's ability to keep it from doing so. As it happened, the problem had nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already there.
