How can I detect whether a file is binary (non-text) in Python?

How can I tell if a file is binary (non-text) in Python? I am searching through a large set of files in Python and keep getting matches in binary files, which makes the output look incredibly messy.

I know I could use grep -I, but I am doing more with the data than what grep allows for.

In the past I would have just searched for characters greater than 0x7f, but UTF-8 and the like make that impossible on modern systems. Ideally the solution would be fast, but any solution will do.

17 solutions

#1

You can also use the mimetypes module:

import mimetypes
...
mime = mimetypes.guess_type(file)

It's fairly easy to compile a list of binary MIME types. For example, Apache is distributed with a mime.types file that you could parse into two lists, binary and text, and then check which list the MIME type falls into.
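As a rough sketch of that idea without parsing Apache's file, you could lean on the table that mimetypes already ships with and treat anything under text/ as text. The function name and the True/False/None convention are my own assumptions, and this only looks at the file name, never at the contents:

```python
import mimetypes


def looks_like_text_by_name(filename):
    """Guess from the file name alone whether a file is text.

    Returns True or False when the extension maps to a known MIME type,
    and None when mimetypes cannot guess (a convention chosen for this sketch).
    """
    mime, encoding = mimetypes.guess_type(filename)
    if encoding is not None:
        # e.g. "notes.txt.gz": the payload is text, but the file itself is compressed
        return False
    if mime is None:
        return None
    return mime.startswith("text/")
```

A None result means you still have to fall back to one of the content-based checks in the other answers.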

#2

Yet another method based on file(1) behavior:

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False

#3

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while True:
            chunk = fin.read(CHUNKSIZE)
            # The file is opened in binary mode, so compare against a bytes literal
            if b'\0' in chunk:  # found null byte
                return True
            if len(chunk) < CHUNKSIZE:
                break  # done
    # Note: Python does not need an "except:" clause here; try/finally is enough.
    finally:
        fin.close()

    return False

#4

If it helps, many binary formats begin with a magic number. Here is a list of file signatures.
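A minimal sketch of what a signature check could look like. The table below holds only a handful of well-known signatures that I picked for illustration, and both KNOWN_SIGNATURES and sniff_signature are names I made up, so treat this as a starting point rather than a complete detector:

```python
# A few well-known file signatures; illustrative only, far from exhaustive.
KNOWN_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x7fELF": "ELF executable",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
}


def sniff_signature(path):
    """Return a label if the file starts with a known signature, else None."""
    longest = max(len(sig) for sig in KNOWN_SIGNATURES)
    with open(path, "rb") as f:
        head = f.read(longest)  # only the first few bytes are needed
    for sig, label in KNOWN_SIGNATURES.items():
        if head.startswith(sig):
            return label
    return None
```

A None result only means the file did not match this small table, not that it is text.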

#5

Here's a suggestion that uses the Unix file command:

import re
import subprocess

def istext(path):
    # On Python 3 the pipe yields bytes, so the pattern must be a bytes pattern too
    output = subprocess.Popen(["file", '-L', path],
                              stdout=subprocess.PIPE).communicate()[0]
    return re.search(br':.* text', output) is not None

Example usage:

>>> istext('/etc/motd') 
True
>>> istext('/vmlinuz') 
False
>>> open('/tmp/japanese', 'rb').read()
b'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True

It has the downsides of not being portable to Windows (unless you have something like the file command there), and having to spawn an external process for each file, which might not be palatable.

#6

Use the binaryornot library (GitHub).

It is very simple and based on the code found in this Stack Overflow question.

You can actually write this in 2 lines of code; however, this package saves you from having to write and thoroughly test those 2 lines against all sorts of weird file types across platforms.

#7

Usually you have to guess.

You can look at the extensions as one clue, if the files have them.

You can also recognise known binary formats, and ignore those.

Otherwise, see what proportion of non-printable ASCII bytes you have and take a guess from that.

You can also try decoding from UTF-8 and see if that produces sensible output.
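To make the last two suggestions concrete, here is one way they might be combined: try to decode a sample as UTF-8, then look at the share of control characters in the result. The function name, the 4096-byte sample and the 5% threshold are arbitrary assumptions on my part, so tune them for your data:

```python
import codecs


def guess_is_text(path, sample_size=4096, max_control_ratio=0.05):
    """Heuristic guess: does the start of the file look like UTF-8 text?"""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return True  # treat an empty file as text
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge
        text = decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False  # not valid UTF-8, so probably not text
    allowed = "\n\r\t\f\b"
    # Count control characters (including NUL) that ordinary text rarely contains
    control = sum(1 for ch in text if ord(ch) < 32 and ch not in allowed)
    return control / max(len(text), 1) <= max_control_ratio
```

Note that this will reject text in legacy 8-bit encodings such as Latin-1 whenever the sample contains bytes that are not valid UTF-8.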

#8

If you're not on Windows, you can use Python Magic to determine the file type. Then you can check whether it is a text/* MIME type.

#9

A shorter solution, with a UTF-16 warning:

def is_binary(filename):
    """ 
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False

#10

If you're using Python 3 with UTF-8, it is straightforward: just open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses str (Unicode) when handling files in text mode (and bytes in binary mode). If your encoding can't decode arbitrary files, it's quite likely that you will get a UnicodeDecodeError.

Example:

try:
    with open(filename, "r") as f:
        for l in f:
            process_line(l)
except UnicodeDecodeError:
    pass  # Found non-text data

#11

I came here looking for exactly the same thing: a comprehensive solution provided by the standard library to detect binary or text. After reviewing the options people suggested, the *nix file command looks to be the best choice (I'm only developing for Linux boxen). Some others posted solutions using file, but they are unnecessarily complicated in my opinion, so here's what I came up with:

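import subprocess

def test_file_isbinary(filename):
    # Pass the arguments as a list so file names with quotes or spaces are safe
    cmd = ["file", "-b", "-e", "soft", filename]
    # check_output returns bytes on Python 3, so compare against bytes prefixes
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True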

It should go without saying, but the code that calls this function should make sure the file is readable before testing it; otherwise the file will be mistakenly detected as binary.
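One possible way to do that readability check up front is sketched below. The helper name is_readable_file is hypothetical, and since permissions can change between the check and the actual read, this narrows the problem rather than eliminating it:

```python
import os


def is_readable_file(path):
    """True if path is an existing regular file that this process may read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)
```

Only call the binary test on paths for which this returns True, and report the rest separately.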

#12

I guess that the best solution is to use the guess_type function. It holds a list with several MIME types, and you can also include your own types. Here is the script that I wrote to solve my problem:

from mimetypes import guess_type
from mimetypes import add_type
from os import listdir

class TextFileFinder(object):  # hypothetical name; the original omitted the class statement
    def __init__(self):
        self.__addMimeTypes()

    def __addMimeTypes(self):
        add_type("text/plain", ".properties")

    def __listDir(self, path):
        try:
            return listdir(path)
        except OSError:
            print("The directory {0} could not be accessed".format(path))
            return []

    def getTextFiles(self, path):
        asciiFiles = []
        for name in self.__listDir(path):
            mime = guess_type(name)[0]  # None when the type cannot be guessed
            if mime is not None and mime.split("/")[0] == "text":
                asciiFiles.append(name)
        return asciiFiles

It is inside a class, as you can see from the structure of the code, but you can pretty much change whatever you need to fit it into your application. It's quite simple to use. The method getTextFiles returns a list of all the text files that reside in the directory you pass in the path variable.

#13

Here's a function that first checks if the file starts with a BOM and, if not, looks for a zero byte within the initial 8192 bytes:

import codecs


#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

Technically the check for the UTF-8 BOM is unnecessary because UTF-8 text should not contain zero bytes for all practical purposes. But as it is a very common encoding, it's quicker to check for the BOM at the beginning instead of scanning all 8192 bytes for 0.
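To see why the BOM check matters, here is a small illustration (the variable names are mine) showing that UTF-16 text is full of zero bytes while UTF-8 text is not. It also shows the remaining blind spot: UTF-16 written without a BOM still looks binary to a zero-byte test.

```python
import codecs

utf16_boms = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Python's "utf-16" codec prepends a BOM in the platform's byte order
with_bom = "hi".encode("utf-16")
with_bom_has_nul = b"\0" in with_bom
with_bom_has_bom = with_bom.startswith(utf16_boms)

# "utf-16-le" writes no BOM, so only the zero bytes remain as a hint
without_bom = "hi".encode("utf-16-le")
without_bom_has_nul = b"\0" in without_bom
without_bom_has_bom = without_bom.startswith(utf16_boms)

# Plain ASCII text encoded as UTF-8 contains no zero bytes at all
utf8_has_nul = b"\0" in "hi".encode("utf-8")
```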

#14

Are you on Unix? If so, then try:

import os

# Note the space after -b; name must already be safe to pass to the shell
isBinary = os.system("file -b " + name + " | grep text > /dev/null")

The shell return values are inverted (0 means success, so if grep finds "text" the command returns 0, which is a false value in Python).

#15

A simpler way is to check whether the file contains a NULL character (\x00) by using the in operator, for instance:

b'\x00' in open("foo.bar", 'rb').read()

See below the complete example:

#!/usr/bin/env python3
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

Sample usage:

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!

#16

On *nix:

If you have access to the `file` shell-command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import quote, split

filepath = realpath('rel/or/abs/path/to/file')
# check_output returns bytes, so lower-case the output (not the command) and match bytes
assert b'ascii' in check_output(split('file {}'.format(quote(filepath)))).lower()

Or, you could also stick that in a for-loop to get output for all files in the current dir using:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert b'ascii' in check_output(split('file {}'.format(quote(afile)))).lower()

or for all subdirs:

# os.walk yields (dirpath, dirnames, filenames) tuples; it cannot be indexed directly
for curdir, dirnames, filelist in os.walk('.'):
    for afile in filelist:
        path = os.path.join(curdir, afile)
        assert b'ascii' in check_output(split('file {}'.format(quote(path)))).lower()

#17

Most programs consider a file to be binary (that is, any file that is not "line-oriented") if it contains a NULL character.

Here is perl's version of pp_fttext() (pp_sys.c) implemented in Python:

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

Note also that this code was written to run on both Python 2 and Python 3 without changes.

Source: Perl's "guess if file is text or binary" implemented in Python
