Error: illegal byte sequence on Mac OS X

Time: 2022-04-30 08:44:55

I'm trying to replace a string in a Makefile on Mac OS X for cross-compiling to iOS. The string has embedded double quotes. The command is:

sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

And the error is:

sed: RE error: illegal byte sequence

I've tried escaping the double quotes, commas, dashes, and colons with no joy. For example:

sed -i "" 's|\"iphoneos-cross\"\,\"llvm-gcc\:\-O3|\"iphoneos-cross\"\,\"clang\:\-Os|g' Configure

I'm having a heck of a time debugging the issue. Does anyone know how to get sed to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?

5 answers

#1


223  

A sample command that exhibits the symptom: sed 's/./@/' <<<$'\xfc': this fails, because byte 0xfc is not a valid UTF-8 char.
Note that, by contrast, GNU sed (Linux, but also installable on macOS) simply passes the invalid byte through, without reporting an error.

Using the formerly accepted answer is an option if you don't mind losing support for your true locale (if you're on a US system and you never need to deal with foreign characters, that may be fine.)

However, the same effect can be had ad-hoc for a single command only:

LC_ALL=C sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

Note: What matters is an effective LC_CTYPE setting of C, so LC_CTYPE=C sed ... would normally also work, but if LC_ALL happens to be set (to something other than C), it will override individual LC_*-category variables such as LC_CTYPE. Thus, the most robust approach is to set LC_ALL.
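
A quick way to see this precedence, as a sketch: it assumes the POSIX locale utility is available, and en_US.UTF-8 is only an illustrative value (it need not be installed, since LC_ALL=C wins anyway).

```shell
# LC_ALL overrides the per-category LC_CTYPE variable for this one command.
effective=$(LC_ALL=C LC_CTYPE=en_US.UTF-8 locale | grep '^LC_CTYPE=')
printf '%s\n' "$effective"
```

The reported LC_CTYPE value is C, not en_US.UTF-8; locale shows it in double quotes because it is implied by LC_ALL rather than set directly.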

However, (effectively) setting LC_CTYPE to C treats strings as if each byte were its own character (no interpretation based on encoding rules is performed), with no regard for the - multibyte-on-demand - UTF-8 encoding that OS X employs by default, where foreign characters have multibyte encodings.
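
To make the "each byte is its own character" point concrete, here is a minimal sketch that counts the characters of a UTF-8 encoded 'voilà' under the C locale. The sample string and variable names are just for illustration.

```shell
# 'voilà' encoded as UTF-8: four ASCII bytes plus the two bytes 0xC3 0xA0.
s=$(printf 'voil\303\240')
# Byte count, independent of locale.
bytes=$(printf '%s' "$s" | wc -c | tr -d ' ')
# Character count under the C locale: every byte counts as one character.
chars_c=$(printf '%s' "$s" | LC_ALL=C wc -m | tr -d ' ')
echo "bytes=$bytes chars_in_C_locale=$chars_c"
```

Both numbers come out the same, whereas a UTF-8 locale would count the two-byte à as a single character.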

In a nutshell: setting LC_CTYPE to C causes the shell and utilities to only recognize basic English letters as letters (the ones in the 7-bit ASCII range), so that non-ASCII characters are not treated as letters, causing, for instance, upper-/lowercase conversions of such characters to fail.
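
A small sketch of that effect, using tr on a UTF-8 encoded 'voilà' (the string is only an example): under the C locale the ASCII letters are uppercased, while the two bytes of à are not recognized as a letter and pass through unchanged.

```shell
# Uppercase under the C locale, then dump the resulting bytes as hex.
out=$(printf 'voil\303\240' |
  LC_ALL=C tr '[:lower:]' '[:upper:]' |
  od -An -tx1 | tr -d ' \n')
echo "$out"
```

The hex dump shows 'VOIL' (56 4f 49 4c) followed by the untouched bytes c3 a0.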

Again, this may be fine if you needn't match multibyte-encoded characters such as é, and simply want to pass such characters through.

If this is insufficient and/or you want to understand the cause of the original error (including determining what input bytes caused the problem) and perform encoding conversions on demand, read on below.


The problem is that the input file's encoding does not match the shell's.
More specifically, the input file contains characters encoded in a way that is not valid in UTF-8 (as @Klas Lindbäck stated in a comment) - that's what the sed error message is trying to say by illegal byte sequence.

Most likely, your input file uses a single-byte 8-bit encoding such as ISO-8859-1, frequently used to encode "Western European" languages.

Example:

The accented letter à has Unicode codepoint 0xE0 (224) - the same as in ISO-8859-1. However, due to the nature of UTF-8 encoding, this single codepoint is represented as 2 bytes - 0xC3 0xA0, whereas trying to pass the single byte 0xE0 is invalid under UTF-8.
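
The byte values can be checked directly. This sketch feeds the single ISO-8859-1 byte 0xE0 (octal 340) through iconv and prints the resulting UTF-8 bytes in hex; it assumes an iconv that knows the names ISO-8859-1 and UTF-8, which the common implementations do.

```shell
# Convert the one-byte ISO-8859-1 'à' to UTF-8 and show the bytes as hex.
utf8_hex=$(printf '\340' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"
```

The output is the two-byte sequence c3 a0 described above.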

Here's a demonstration of the problem using the string voilà encoded as ISO-8859-1, with the à represented as one byte (via an ANSI-C-quoted bash string ($'...') that uses \x{e0} to create the byte):

Note that the sed command is effectively a no-op that simply passes the input through, but we need it to provoke the error:

  # -> 'illegal byte sequence': byte 0xE0 is not a valid char.
sed 's/.*/&/' <<<$'voil\x{e0}'

To simply ignore the problem, the above LC_CTYPE=C approach can be used:

  # No error, bytes are passed through ('à' will render as '?', though).
LC_CTYPE=C sed 's/.*/&/' <<<$'voil\x{e0}'

If you want to determine which parts of the input cause the problem, try the following:

  # Convert bytes in the 8-bit range (high bit set) to hex. representation.
  # -> 'voil\x{e0}'
iconv -f ASCII --byte-subst='\x{%02x}' <<<$'voil\x{e0}'

The output will show you all bytes that have the high bit set (bytes that exceed the 7-bit ASCII range) in hexadecimal form. (Note, however, that that also includes correctly encoded UTF-8 multibyte sequences - a more sophisticated approach would be needed to specifically identify invalid-in-UTF-8 bytes.)
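
One possible way to check specifically for invalid UTF-8 (a sketch, not part of the original answer) is to ask iconv to decode the data as UTF-8 and look at its exit status. The temporary file exists only for the demo, and the wording and position format of iconv's error message differ between implementations.

```shell
# Create a sample file holding 'voilà' in ISO-8859-1, which is invalid as UTF-8.
tmp=$(mktemp)
printf 'voil\340\n' > "$tmp"
# iconv exits non-zero if the input cannot be decoded as UTF-8;
# its message on stderr usually indicates where decoding stopped.
if iconv -f UTF-8 -t UTF-8 "$tmp" >/dev/null; then
  status=valid
else
  status=invalid
fi
echo "$status"
rm -f "$tmp"
```

Unlike the high-bit scan above, correctly encoded multibyte sequences do not trigger a failure here.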


Performing encoding conversions on demand:

Standard utility iconv can be used to convert to (-t) and/or from (-f) encodings; iconv -l lists all supported ones.

Examples:

Convert FROM ISO-8859-1 to the encoding in effect in the shell (based on LC_CTYPE, which is UTF-8-based by default), building on the above example:

  # Converts to UTF-8; output renders correctly as 'voilà'
sed 's/.*/&/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

Note that this conversion allows you to properly match foreign characters:

  # Correctly matches 'à' and replaces it with 'ü': -> 'voilü'
sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

To convert the input BACK to ISO-8859-1 after processing, simply pipe the result to another iconv command:

sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')" | iconv -t ISO-8859-1
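
The same round trip can be applied to a file rather than a here-string. The following is a sketch under a few assumptions: the file names are temporary placeholders, both encodings are named explicitly instead of relying on the locale, and the substitution is ASCII-only so that the result can be verified byte for byte.

```shell
src=$(mktemp); dst=$(mktemp)
# Sample input: 'voilà' in ISO-8859-1 (à is the single byte 0xE0).
printf 'voil\340\n' > "$src"
# Convert to UTF-8, edit with sed, convert back to the original encoding.
iconv -f ISO-8859-1 -t UTF-8 "$src" |
  sed 's/voil/VOIL/' |
  iconv -f UTF-8 -t ISO-8859-1 > "$dst"
# Dump the result as hex to confirm à is a single 0xE0 byte again.
result=$(od -An -tx1 "$dst" | tr -d ' \n')
echo "$result"
rm -f "$src" "$dst"
```

The hex dump ends in e0 0a, so the à survives the edit in its original one-byte form.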

#2


104  

Add the following lines to your ~/.bash_profile or ~/.zshrc file(s).

export LC_CTYPE=C 
export LANG=C

#3


4  

mklement0's answer is great, but I have some small tweaks.

It seems like a good idea to explicitly specify bash's encoding when using iconv. Also, we should prepend a byte-order mark (even though the Unicode standard doesn't recommend it), because without one UTF-8 can legitimately be confused with ASCII. Unfortunately, iconv doesn't prepend a byte-order mark when you explicitly specify an endianness (UTF-16BE or UTF-16LE), so we need to use UTF-16, which uses platform-specific endianness, and then use file --mime-encoding to discover the true endianness iconv used.

(I uppercase all my encodings because when you list all of iconv's supported encodings with iconv -l they are all uppercase.)

# Find out MY_FILE's encoding
# We'll convert back to this at the end
FILE_ENCODING="$( file --brief --mime-encoding MY_FILE )"
# Find out bash's encoding, with which we should encode
# MY_FILE so sed doesn't fail with 
# sed: RE error: illegal byte sequence
BASH_ENCODING="$( locale charmap | tr '[:lower:]' '[:upper:]' )"
# Convert to UTF-16 (unknown endianness) so iconv ensures
# we have a byte-order mark
iconv -f "$FILE_ENCODING" -t UTF-16 MY_FILE > MY_FILE.utf16_encoding
# Whether we're using UTF-16BE or UTF-16LE
UTF16_ENCODING="$( file --brief --mime-encoding MY_FILE.utf16_encoding )"
# Now we can use MY_FILE.bash_encoding with sed
iconv -f "$UTF16_ENCODING" -t "$BASH_ENCODING" MY_FILE.utf16_encoding > MY_FILE.bash_encoding
# sed!
sed 's/.*/&/' MY_FILE.bash_encoding > MY_FILE_SEDDED.bash_encoding
# now convert MY_FILE_SEDDED.bash_encoding back to its original encoding
iconv -f "$BASH_ENCODING" -t "$FILE_ENCODING" MY_FILE_SEDDED.bash_encoding > MY_FILE_SEDDED
# Now MY_FILE_SEDDED has been processed by sed, and is in the same encoding as MY_FILE

#4


0  

My workaround was to use GNU sed. It worked fine for my purposes.

#5


0  

My workaround was to use Perl:

find . -type f -print0 | xargs -0 perl -pi -e 's/was/now/g'
