How do I grep a text file that contains binary data?

Date: 2022-06-02 00:26:38

grep returns

Binary file test.log matches

For example

echo    "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log  # in zsh
echo -e "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log  # in bash
grep re test.log

I would like the result to show line1 and line3 (two lines in total).

Is it possible to use tr to convert the unprintable data into readable data, so that grep works again?

10 solutions

#1


53  

You could run the data file through cat -v, e.g.:

$ cat -v tmp/test.log | grep re
line1 re ^@^M
line3 re^M

This output could then be post-processed to remove the junk; it is the closest analogue to your question about using tr for the task.
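
As a rough sketch of that post-processing step, assuming the test.log from the question: cat -v prints NUL as ^@ and carriage return as ^M, so those two-character sequences can be stripped with sed after grep has run. The scratch directory and the result variable are only there for illustration.

```shell
# Recreate the sample file from the question in a scratch directory.
workdir=$(mktemp -d)
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > "$workdir/test.log"

# cat -v renders NUL as ^@ and CR as ^M, so grep sees plain text;
# sed then removes those literal caret sequences.
result=$(cat -v "$workdir/test.log" | grep re | sed 's/\^@//g; s/\^M$//')
printf '%s\n' "$result"

rm -rf "$workdir"
```

Keep in mind that a log line which genuinely contains the text ^@ would lose it too, so this is only suitable for eyeballing the matches.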

#2


80  

One way is to simply treat binary files as text anyway, with grep --text, but this may well result in binary information being sent to your terminal. That's not really a good idea if you're running a terminal that interprets the output stream (such as VT/DEC or many others).

Alternatively, you can send your file through tr with the following command:

tr '\000-\011\013-\037\177-\377' '.' <test.log | grep whatever

This will change every byte below the space character (except newline) and every byte above 126 into a . character, leaving only the printables. Carriage returns are replaced as well.
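
Here is a small sketch of the tr approach run against the test.log from the question. It assumes a tr that pads the shorter second set with its last character (GNU and BSD tr do this), and it writes the ranges without surrounding brackets, because a POSIX tr treats [ and ] as ordinary characters to translate. The scratch directory and the result variable are illustrative only.

```shell
# Recreate the sample file from the question in a scratch directory.
workdir=$(mktemp -d)
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > "$workdir/test.log"

# LC_ALL=C makes tr operate on single bytes regardless of the locale.
# Control bytes (except newline) and bytes above 126 become dots.
result=$(LC_ALL=C tr '\000-\011\013-\037\177-\377' '.' < "$workdir/test.log" | grep re)
printf '%s\n' "$result"

rm -rf "$workdir"
```

With the sample file, the NUL and the carriage returns show up as dots at the end of the matching lines.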

If you want every "illegal" character replaced by a different one, you can use something like the following C program, a classic standard input filter:

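#include <stdio.h>

int main (void) {
    int ch;
    /* Copy stdin to stdout one byte at a time. */
    while ((ch = getchar()) != EOF) {
        if ((ch == '\n') || ((ch >= ' ') && (ch <= '~'))) {
            /* Newlines and printable ASCII pass through unchanged. */
            putchar (ch);
        } else {
            /* Everything else is written as its two-digit hex code. */
            printf ("{{%02x}}", ch);
        }
    }
    return 0;
}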

This will give you {{NN}}, where NN is the hex code for the character. You can simply adjust the printf for whatever style of output you want.

You can see that program in action here:

pax$ printf 'Hello,\tBob\nGoodbye, Bob\n' | ./filterProg
Hello,{{09}}Bob
Goodbye, Bob

#3


70  

grep -a

It can't get simpler than that.
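
For what it's worth, here is a minimal sketch of grep -a on the test.log from the question. Piping the matches through tr -d is my own addition, so that the NUL and carriage-return bytes never reach the terminal; the scratch directory and the result variable are illustrative only.

```shell
# Recreate the sample file from the question in a scratch directory.
workdir=$(mktemp -d)
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > "$workdir/test.log"

# -a makes grep treat the file as text; tr -d then drops the NUL and
# CR bytes from the matching lines.
result=$(grep -a re "$workdir/test.log" | tr -d '\000\r')
printf '%s\n' "$result"

rm -rf "$workdir"
```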

#4


32  

You can use "strings" to extract strings from a binary file, for example:

strings binary.file | grep foo

#5


19  

You can force grep to look at binary files with:

grep --binary-files=text

You might also want to add -o (--only-matching) so you don't get tons of binary gibberish that will bork your terminal.
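
One way to combine the two options, sketched against the test.log from the question: with -o, a pattern that only allows printable characters around the search term keeps the surrounding text but stops at the first binary byte. The [[:print:]] wrapper is my assumption about what a useful pattern looks like, not part of the option itself, and the scratch directory and the result variable are illustrative only.

```shell
# Recreate the sample file from the question in a scratch directory.
workdir=$(mktemp -d)
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > "$workdir/test.log"

# LC_ALL=C keeps [[:print:]] limited to printable ASCII.
# -o prints only the matched part, so the NUL and CR bytes are left out.
result=$(LC_ALL=C grep --binary-files=text -o '[[:print:]]*re[[:print:]]*' "$workdir/test.log")
printf '%s\n' "$result"

rm -rf "$workdir"
```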

#6


11  

Starting with Grep 2.21, binary files are treated differently:

When searching binary data, grep now may treat non-text bytes as line terminators. This can boost performance significantly.

So what happens now is that with binary data, all non-text bytes (including newlines) are treated as line terminators. If you want to change this behavior, you can:

  • use --text. This will ensure that only newlines are line terminators

  • use --null-data. This will ensure that only null bytes are line terminators
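
To illustrate the --null-data case, here is a small sketch that assumes GNU grep. The three NUL-terminated records are made up for the example; grep also terminates each matching record with a NUL on output, so tr turns those into newlines for display.

```shell
# Three NUL-terminated records; only two of them contain "re".
# With --null-data, grep reads and writes NUL-terminated records.
result=$(printf 'alpha re\000beta\000gamma re\000' | grep --null-data re | tr '\000' '\n')
printf '%s\n' "$result"
```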

#7


3  

As James Selvakumar already said, grep -a does the trick. -a or --text forces grep to handle the input stream as text. See the man page: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep

Try:

cat test.log | grep -a somestring

#8


2  

You can do:

strings test.log | grep -i

This passes only the readable strings from the file to grep; put your search pattern after -i.

#9


0  

You can also try Word Extractor tool. Word Extractor can be used with any file in your computer to separate the strings that contain human text / words from binary code (exe applications, DLLs).

#10


0  

grep -a will force grep to search and output from a file that grep thinks is binary:

grep -a re test.log
