How can I detect whether a file has a UTF-8 BOM in Bash?

Time: 2021-02-20 22:20:50

I'm trying to write a script that will automatically remove UTF-8 BOMs from a file. I'm having trouble detecting whether the file has one in the first place or not. Here is my code:

function has-bom {
    # Test if the file starts with 0xEF, 0xBB, and 0xBF
    head -c 3 "$1" | grep -P '\xef\xbb\xbf'
    return $?
}

For some reason, head seems to be ignoring the BOM in front of the file. As an example, running this

printf '\xef\xbb\xbf' > file
head -c 3 file

won't print anything.

I tried looking for an option in head --help that would let me work around this, but no luck. Is there anything I can do to make this work?

2 answers

#1


11  

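First, let's demonstrate that head is actually working correctly. The three BOM bytes are written out, but a terminal typically renders nothing visible for them, so pipe the output through hexdump to see them: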

$ printf '\xef\xbb\xbf' >file
$ head -c 3 file 
$ head -c 3 file | hexdump -C
00000000  ef bb bf                                          |...|
00000003

Now, let's create a working function, has_bom. If your grep supports -P, then one option is the following, where LC_ALL=C makes grep match raw bytes rather than decode the input as UTF-8 characters:

$ has_bom() { head -c3 "$1" | LC_ALL=C grep -qP '\xef\xbb\xbf'; }
$ has_bom file && echo yes
yes

Note that -P (Perl-compatible regular expressions) is a GNU grep extension that POSIX does not specify, so other grep implementations may not support it.

Another option is to use bash's $'...':

$ has_bom() { head -c3 "$1" | grep -q $'\xef\xbb\xbf'; }
$ has_bom file && echo yes
yes

ksh and zsh also support $'...' but this construct is not POSIX and dash does not support it.
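
For shells without $'...' and greps without -P, the same check can be done by comparing a hex dump of the first three bytes. This is only a sketch, not part of the original answer: it assumes od and tr behave as POSIX describes, and it reads the bytes with od -N 3 instead of head -c.

```shell
# has_bom FILE: succeed (exit status 0) if FILE starts with the bytes EF BB BF.
has_bom() {
    # od -An -tx1 -N 3 dumps at most the first three bytes as hex without
    # an offset column; tr removes od's padding so the result can be
    # compared as a plain string.
    [ "$(od -An -tx1 -N 3 < "$1" | tr -d '[:space:]')" = "efbbbf" ]
}
```

It is called the same way as the versions above: has_bom file && echo yes.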

Notes:

  1. The use of an explicit return $? is optional. The function will, by default, return with the exit code of the last command run.

  2. I have used the POSIX form for defining functions. This is equivalent to the bash form but gives you one less problem to deal with if you ever have to run the function under another shell.

  3. bash does accept the use of the character - in a function name but this is a controversial feature. I replaced it with _ which is more widely accepted. (For more on this issue, see this answer.)

  4. The -q option to grep makes it quiet, meaning that it still sets a proper exit code but it does not send any characters to stdout.
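
Since the stated goal is to remove the BOM, here is one way the detection step might be combined with the removal. Treat it as a sketch under assumptions: strip_bom is a hypothetical name, mktemp is assumed to be available (it is common but not POSIX), and the has_bom shown is a byte-dump variant included only to keep the example self-contained. Any working has_bom could be used instead.

```shell
# has_bom FILE: succeed if FILE starts with the bytes EF BB BF.
has_bom() {
    [ "$(od -An -tx1 -N 3 < "$1" | tr -d '[:space:]')" = "efbbbf" ]
}

# strip_bom FILE: remove a leading UTF-8 BOM from FILE, if there is one.
# Note: tmp and status are ordinary (global) variables, since POSIX sh
# has no "local".
strip_bom() {
    has_bom "$1" || return 0        # no BOM: leave the file untouched
    tmp=$(mktemp) || return 1
    # tail -c +4 prints everything from the fourth byte onward, which
    # skips the three BOM bytes. Writing back with cat keeps the
    # original file's permissions.
    tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Files without a BOM are left alone, so running strip_bom on the same file twice is harmless.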

#2


0  

I applied the following to the first line read:

read c
if (( "$(printf "%d" "'${c:0:1}")" == 65279 ))  ; then c="${c:1}" ; fi

This simply removes the BOM from the variable.
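
The test above compares the first character's code point with 65279 (U+FEFF), so it depends on the shell running in a UTF-8 locale, where the three BOM bytes count as one character. A byte-oriented variant that should not depend on the locale could look like this. It is a sketch, and first_line is a hypothetical helper name:

```shell
# The UTF-8 BOM as a three-byte string (octal escapes are POSIX printf syntax).
bom=$(printf '\357\273\277')

# first_line FILE: print FILE's first line without a leading BOM.
first_line() {
    IFS= read -r c < "$1"
    # ${c#pattern} removes the shortest matching prefix; quoting "$bom"
    # makes the match literal. A line without a BOM passes through unchanged.
    printf '%s\n' "${c#"$bom"}"
}
```

The prefix removal does nothing when the line does not start with the BOM, so no separate test is needed.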
