How to distinguish binary files from text files in Linux

Date: 2022-08-29 22:57:21

The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool can also tell binary files from text files, and produces different output for each.

Is there a way to tell binary files from text files? All I want is a yes/no answer as to whether a given file is binary. Because it's difficult to define binary, let's say I want to know whether diff will attempt a text-based comparison.

To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.

8 answers

#1


6  

The diff manual states:

diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.

#2


11  

file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".

If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
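
A minimal sketch of that yes/no test, assuming a file implementation that supports the -b (brief output, no filename prefix) and -L (follow symbolic links) options. The function name is_text_by_file is made up for illustration and is not part of any tool:

```shell
# Hypothetical helper: succeeds (exit status 0) if file(1) describes
# the argument as some kind of text, and fails otherwise.
is_text_by_file() {
  # -b drops the "name:" prefix so the filename itself cannot match;
  # -L makes file describe a symlink's target rather than the link.
  case $(file -bL -- "$1") in
    *text*) return 0 ;;
    *)      return 1 ;;
  esac
}
```

It could be used as: is_text_by_file notes.xml && echo text || echo binary. The result simply inherits the heuristics of file, so anything file does not label as "text" (an empty file, for instance) counts as binary here.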

#3


6  

A quick-and-dirty way is to look for a NUL character (a zero byte) in the first kilobyte or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.

Update: According to the diff manual, this is exactly what diff does.
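
A rough sketch of this NUL check in shell. The 4096-byte window is an arbitrary stand-in for "the first kilobyte or two" and is not the value diff actually uses; the function name has_nul_byte is hypothetical; and head -c is assumed to be available (it is in GNU and BSD userlands, but it is not required by POSIX):

```shell
# Hypothetical helper: succeeds (exit status 0) if a NUL byte occurs
# within the first 4096 bytes of the file, i.e. the file looks binary.
has_nul_byte() {
  # head -c limits the scan to the start of the file; tr -dc deletes
  # every byte except NUL, so wc -c counts the NUL bytes that remain.
  nul_count=$(head -c 4096 < "$1" | tr -dc '\000' | wc -c)
  [ "$((nul_count))" -gt 0 ]
}
```

It could be used as: has_nul_byte somefile && echo binary || echo text. As noted above, UTF-16 and UTF-32 text will be reported as binary by this test.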

#4


3  

You could try running

strings yourfile

and comparing the size of its output with the size of the file. I'm not totally sure, but if they are the same, the file really is a text file.

#5


1  

These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.

See here for how Subversion does it.

#6


1  

This approach uses the same criteria as grep to determine whether a file is binary or text:

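# Succeeds (exit status 0) if grep treats the file as text;
# fails for binary files and for empty files.
is_text_file() {
  grep -qI -- '.' "$1"
}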

grep options used:

  • -q Quiet; exit immediately with zero status if any match is found
  • -I Process a binary file as if it did not contain matching data

grep pattern used:

  • '.' matches any single character other than a newline. Any file containing at least one such character (that is, almost any non-empty file) matches this pattern.

Notes

  • An empty file is not considered a text file according to this test.
  • Symbolic links are followed.
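
If empty files should count as text, one possible variant is sketched below. It assumes a grep that supports -I (GNU and BSD grep do), and the name is_text_or_empty is made up for illustration; it is not part of the original answer:

```shell
# The test from the answer above: text according to grep's own heuristic.
is_text_file() {
  grep -qI -- '.' "$1"
}

# Hypothetical variant: additionally accept empty regular files as text.
is_text_or_empty() {
  # Reject anything that is not an existing regular file
  # (symbolic links are followed by the -f test).
  [ -f "$1" ] || return 1
  # -s is true for a non-empty file, so an empty file succeeds here;
  # non-empty files fall through to the grep-based test.
  [ ! -s "$1" ] || is_text_file "$1"
}
```

A file that consists solely of newline characters is non-empty but still fails this variant, because '.' never matches a newline.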

#7


0  

A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The Type column will show you whether a file is text or binary.

#8


-1  

Commands like less and grep detect this quite easily (and quickly). You can have a look at their source code.
