What's the difference between UTF-8 and UTF-8 without a BOM?

Date: 2023-01-06 15:48:33

What's different between UTF-8 and UTF-8 without a BOM? Which is better?

20 answers

#1


558  

The UTF-8 BOM is a sequence of bytes (EF BB BF) that allows the reader to identify a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
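
As a minimal sketch of what "identifying" the signature means in practice, the Python below checks for the three bytes EF BB BF and removes them. The helper name strip_utf8_bom is made up for illustration; codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")

print(codecs.BOM_UTF8.hex(" "))                  # ef bb bf
print(strip_utf8_bom(with_bom).decode("utf-8"))  # héllo
# The "utf-8-sig" codec does the same stripping while decoding.
print(with_bom.decode("utf-8-sig"))              # héllo
```

Decoding with "utf-8-sig" also accepts input that has no BOM, so it is usually the simpler choice when a reader has to tolerate both variants.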

#2


182  

The other excellent answers already answered that:

  • There is no official difference between UTF-8 and BOM-ed UTF-8.
  • A BOM-ed UTF-8 string starts with the following three bytes: EF BB BF
  • Those bytes, if present, must be ignored when extracting the string from the file/stream.

But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...

For example, the data [EF BB BF 41 42 43] could either be:

  • The legitimate ISO-8859-1 string "ï»¿ABC"
  • The legitimate UTF-8 string "ABC"

So while it can be cool to recognize the encoding of a file's content by looking at the first bytes, you should not rely on this, as shown by the example above.

Encodings should be known, not divined.
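
To make the ambiguity concrete, here is a small Python sketch that decodes the same six bytes under both assumptions. The variable names are only illustrative.

```python
data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])

# Read as ISO-8859-1, every byte is one character, so the "BOM" is real text.
as_latin1 = data.decode("iso-8859-1")

# Read as UTF-8, the first three bytes form the single character U+FEFF.
as_utf8 = data.decode("utf-8")

# "utf-8-sig" treats a leading U+FEFF as a signature and drops it.
as_utf8_sig = data.decode("utf-8-sig")

print(as_latin1)        # ï»¿ABC
print(repr(as_utf8))    # '\ufeffABC'
print(as_utf8_sig)      # ABC
```

All three decodes succeed, which is the point: the bytes alone do not say which reading was intended.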

#3


99  

There are at least three problems with putting a BOM in UTF-8 encoded files.

  1. Files that hold no text are no longer empty because they always contain the BOM.
  2. Files that hold text within the ASCII subset of UTF-8 are no longer themselves ASCII because the BOM is not ASCII, which makes some existing tools break down, and it can be impossible for users to replace such legacy tools.
  3. It is not possible to concatenate several files together because each file now has a BOM at the beginning.

And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8:

  • It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
  • It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.

#4


50  

It's an old question with many good answers, but one thing should be added.

All answers are very general. What I'd like to add are examples of BOM usage that cause real problems, yet that many people don't know about.

BOM breaks scripts

Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter all start with a shebang line, which looks like one of these:

#!/bin/sh
#!/usr/bin/python
#!/usr/local/bin/perl
#!/usr/bin/env node

It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed out of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it had a different magic number and that can lead to problems.

See Wikipedia, article: Shebang, section: Magic number:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 and 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[14] for this reason and for wider interoperability and philosophical concerns. Additionally, a byte order mark is not necessary in UTF-8, as that encoding does not have endianness issues; it serves only to identify the encoding as UTF-8. [emphasis added]
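
A rough Python sketch of why the magic number stops matching: the check below compares the first two bytes with 0x23 0x21, which is an assumption about how a loader might test for a shebang, not a description of any particular kernel. The function name starts_with_shebang is made up.

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    """True if the very first two bytes are '#!' (0x23 0x21)."""
    return data[:2] == b"#!"

script = b"#!/bin/sh\necho hello\n"
bom_script = codecs.BOM_UTF8 + script

print(starts_with_shebang(script))      # True
print(starts_with_shebang(bom_script))  # False
print(bom_script[:5].hex(" "))          # ef bb bf 23 21
```

With the BOM in front, the "#!" bytes sit at offsets 3 and 4, so a check on the first two bytes no longer sees them.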

BOM is illegal in JSON

See RFC 7159, Section 8.1:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text.

BOM is redundant in JSON

Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and endianness used in any JSON stream (see this answer for details).

BOM breaks JSON parsers

Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627:

Determining the encoding and endianness of JSON, examining the first 4 bytes for the NUL byte:

00 00 00 xx - UTF-32BE
00 xx 00 xx - UTF-16BE
xx 00 00 00 - UTF-32LE
xx 00 xx 00 - UTF-16LE
xx xx xx xx - UTF-8

Now, if the file starts with a BOM, it will look like this:

00 00 FE FF - UTF-32BE
FE FF 00 xx - UTF-16BE
FF FE 00 00 - UTF-32LE
FF FE xx 00 - UTF-16LE
EF BB BF xx - UTF-8

Note that:

  1. UTF-32BE doesn't start with three NULs, so it won't be recognized
  2. In UTF-32LE, the first byte is not followed by three NULs, so it won't be recognized
  3. UTF-16BE has only one NUL in the first four bytes, so it won't be recognized
  4. UTF-16LE has only one NUL in the first four bytes, so it won't be recognized

Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.

Additionally if the implementation tests for valid JSON as I recommend, it will reject even the input that is indeed encoded as UTF-8 because it doesn't start with an ASCII character < 128 as it should according to the RFC.
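
The NUL-pattern table can be sketched as a small Python function to show the misdetection. This is an illustrative reading of the RFC 4627 heuristic; the function name detect_json_encoding is invented, and falling back to "utf-8" for any unmatched pattern is an assumption (as noted above, real implementations differ here).

```python
import codecs

def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding from which of the first four bytes are NUL."""
    nul = [byte == 0 for byte in data[:4]]
    if nul == [True, True, True, False]:
        return "utf-32-be"   # 00 00 00 xx
    if nul == [True, False, True, False]:
        return "utf-16-be"   # 00 xx 00 xx
    if nul == [False, True, True, True]:
        return "utf-32-le"   # xx 00 00 00
    if nul == [False, True, False, True]:
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"           # assumed fallback for everything else

doc = '{"a": 1}'
plain = doc.encode("utf-16-le")          # 7b 00 22 00 ...
with_bom = codecs.BOM_UTF16_LE + plain   # ff fe 7b 00 ...

print(detect_json_encoding(plain))     # utf-16-le
print(detect_json_encoding(with_bom))  # utf-8
```

The second call shows the failure mode: the BOM shifts the NUL pattern, so UTF-16LE input is taken for UTF-8.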

Other data formats

BOM in JSON is not needed, is illegal and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it then, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course anyone is free to use things like BOMs or anything else if they need it; just don't call it JSON then.

For data formats other than JSON, take a look at what the data really looks like. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs even as an optional feature would only make it more complicated and error-prone.

Other uses of BOM

As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization because it is an example of BOM characters causing real problems.

#5


42  

What's different between UTF-8 and UTF-8 without BOM?

Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

Long answer:

Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.

UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
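
For illustration, this short Python snippet encodes the character U+FEFF in each Unicode encoding scheme and shows how reading it with the wrong byte order yields the unassigned U+FFFE. The names BOM, encoded and swapped are arbitrary.

```python
BOM = "\ufeff"

encoded = {
    name: BOM.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex(' ')}")

# Big-endian BOM bytes read as little-endian give U+FFFE,
# which is how a reader can tell the byte order is wrong.
swapped = b"\xfe\xff".decode("utf-16-le")
print(hex(ord(swapped)))  # 0xfffe
```

Only the UTF-8 form is the same on every platform, which is why it carries no byte-order information there.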

Which is better?

Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.

A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
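
A validity check of this kind might look like the following in Python. It is a sketch: looks_like_utf8 is a made-up name, and the cp1252 sample is just one example of a legacy encoding.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_utf8("héllo".encode("utf-8")))   # True
print(looks_like_utf8("héllo".encode("cp1252")))  # False
print(looks_like_utf8(b"plain ascii"))            # True
```

Pure ASCII input passes as well, since ASCII is a subset of UTF-8; the check only becomes informative once non-ASCII bytes are present.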

#6


29  

UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.

If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8; or Notepad++ with UTF-8 with BOM), Excel opens it fine.
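
If the CSV is produced by a program rather than re-saved by hand, the same three bytes can be emitted at write time. Below is a minimal Python sketch; csv_bytes_for_excel is a hypothetical helper and the sample rows are invented.

```python
import csv
import io

def csv_bytes_for_excel(rows) -> bytes:
    """Serialize rows to CSV and encode as UTF-8 with a leading BOM."""
    out = io.StringIO(newline="")
    csv.writer(out).writerows(rows)
    # The "utf-8-sig" codec prepends EF BB BF when encoding.
    return out.getvalue().encode("utf-8-sig")

payload = csv_bytes_for_excel([["name", "city"], ["Zoë", "Köln"]])
print(payload[:3].hex(" "))  # ef bb bf
```

When writing straight to disk, opening the file with encoding="utf-8-sig" and newline="" should give the same result.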

Prepending the BOM character to UTF-8 text files is covered by RFC 3629: "UTF-8, a transformation format of ISO 10646", November 2003, at http://tools.ietf.org/html/rfc3629 (section 6, "Byte order mark (BOM)"), which describes its use as an optional signature rather than recommending it across the board (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html)

#7


15  

BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, HTML file, JSON response, RSS, etc.) and causes the kind of embarrassments like the recent encoding issue experienced during the talk of Obama on Twitter.

It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.

#8


11  

Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?

Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.

On the meaning of the BOM and UTF-8:

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.

Argument for NOT using a BOM:

The primary motivation for not using a BOM is backwards-compatibility with software that is not Unicode-aware... Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.

Argument FOR using a BOM:

The argument for using a BOM is that without it, heuristic analysis is required to determine what character encoding a file is using. Historically such analysis, to distinguish various 8-bit encodings, is complicated, error-prone, and sometimes slow. A number of libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode.

Programmers mistakenly assume that detection of UTF-8 is equally difficult (it is not, because the vast majority of byte sequences are invalid UTF-8, while the encodings these libraries are trying to distinguish allow all possible byte sequences). Therefore not all Unicode-aware programs perform such an analysis and instead rely on the BOM.

In particular, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.

On which is better, WITH or WITHOUT the BOM:

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”

My Conclusion:

Use the BOM only if compatibility with a software application is absolutely essential.

Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic, as it is for other applications.


† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.

#9


7  

I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.

I have been using multiple languages (even Cyrillic) on my pages for a long time, and when the files are saved without a BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters are corrupted.

Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.

I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.

#10


6  

Quoted at the bottom of the Wikipedia page on BOM: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"

#11


6  

UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.

The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.

Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.

#12


6  

When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.

But this is not the case when we have text, CSV and XML files, either on Windows or Linux.

For example, a text file on Windows or Linux, one of the simplest things imaginable, is (usually) not UTF-8.

Save it as XML and declare it as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

It will not display (it will not be read) correctly, even if it's declared as UTF-8.

I had a string of data containing French letters, that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in IDE and "Create New File") or adding the BOM at the beginning of the file

$file="\xEF\xBB\xBF".$string;

I was not able to save the French letters in an XML file.

#13


6  

It should be noted that some files must not have a BOM even on Windows. Examples are SQL*Plus or VBScript files. If such a file contains a BOM, you get an error when you try to execute it.

#14


6  

UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it will possibly break older applications that would have otherwise interpreted the file as plain ASCII. These applications will definitely fail when they come across a non ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.

Edit: Just want to make it clear that I prefer to not have the BOM at all. Add it only if some old rubbish breaks without it and replacing that legacy application is not feasible.

Don't make anything expect a BOM for UTF8.

#15


5  

This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.

As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code, EF BB BF.

If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume that input starting with a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.

Why is a BOM not recommended?

  1. Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
  2. It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)

When should you encode with a BOM?

If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.

#16


4  

One practical difference is that if you write a shell script for Mac OS X and save it as plain UTF-8, you will get the response:

#!/bin/bash: No such file or directory

in response to the shebang line specifying which shell you wish to use:

#!/bin/bash

If you save as UTF-8, no BOM (say in BBEdit) all will be well.

#17


3  

As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.

Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer that destroyed the layout again. After fiddling with the linked CSS files for hours to no avail, I discovered that Internet Explorer didn't like the BOM-fed HTML file. Never again.

Also, I just found this in Wikipedia:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns

#18


2  

From http://en.wikipedia.org/wiki/Byte-order_mark:

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.

My real problem with the absence of BOM is the following. Suppose we've got a file which contains:

abc

Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:

abg-αβγ

Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.
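
The byte counts are easy to check in Python. The sketch below assumes "ANSI" here means the Greek Windows code page cp1253, which the answer does not state explicitly; the variable names are arbitrary.

```python
text = "αβγ"

legacy = text.encode("cp1253")  # assumed "ANSI" code page for Greek
utf8 = text.encode("utf-8")

print(len(legacy), len(utf8))   # 3 6

# The three legacy bytes are not valid UTF-8, so a later UTF-8 reader fails.
try:
    legacy.decode("utf-8")
    legacy_is_valid_utf8 = True
except UnicodeDecodeError:
    legacy_is_valid_utf8 = False

print(legacy_is_valid_utf8)     # False
```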

#19


1  

The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,

    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.

    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.

#20


-3  

UTF-8 with a BOM is better if you use UTF-8 in HTML files and mix Serbian Cyrillic, Serbian Latin, German, Hungarian or some other exotic language on the same page. That is my opinion (30 years in the computing and IT industry).

#1


558  

The UTF-8 BOM is a sequence of bytes (EF BB BF) that allows the reader to identify a file as being encoded in UTF-8.

UTF-8 BOM是一个字节序列(EF BB BF),它允许读者识别以UTF-8编码的文件。

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

通常,BOM被用来表示编码的endi性,但是由于endianness与UTF-8无关,所以BOM是不必要的。

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

根据Unicode标准,UTF-8文件的BOM不推荐:

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

…对于UTF-8来说,使用BOM既不需要也不推荐,但是在UTF-8数据转换为使用BOM或BOM用作UTF-8签名的其他编码形式时,可能会遇到这种情况。请参阅第16.8节中的“字节顺序标记”小节,以获取更多信息。

#2


182  

The other excellent answers already answered that:

其他优秀的答案已经回答了这个问题:

  • There is no official difference between UTF-8 and BOM-ed UTF-8
  • UTF-8和bomed UTF-8没有官方的区别。
  • A BOM-ed UTF-8 string will start with the three following bytes. EF BB BF
  • 一个bomed UTF-8字符串将从以下三个字节开始。EF BB男朋友
  • Those bytes, if present, must be ignored when extracting the string from the file/stream.
  • 当从文件/流中提取字符串时,必须忽略这些字节。

But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...

但是,作为额外的信息,UTF-8的BOM可能是一种很好的“气味”的方式,如果一个字符串被编码在UTF-8中…或者它可以是任何其他编码中的合法字符串…

For example, the data [EF BB BF 41 42 43] could either be:

例如,数据[EF BB BF 41 42 43]可以是:

  • The legitimate ISO-8859-1 string "ABC"
  • 合法的ISO-8859-1字符串“i”
  • The legitimate UTF-8 string "ABC"
  • 合法的UTF-8字符串"ABC"

So while it can be cool to recognize the encoding of a file content by looking at the first bytes, you should not rely on this, as show by the example above

因此,虽然通过查看第一个字节来识别文件内容的编码是很酷的,但是您不应该依赖于此,正如上面的示例所示。

Encodings should be known, not divined.

#3


99  

There are at least three problems with putting a BOM in UTF-8 encoded files.

  1. Files that hold no text are no longer empty because they always contain the BOM.
  2. Files that hold text within the ASCII subset of UTF-8 are no longer themselves ASCII because the BOM is not ASCII, which makes some existing tools break down, and it can be impossible for users to replace such legacy tools.
  3. It is not possible to concatenate several files together because each file now has a BOM at the beginning.

And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8:

  • It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
  • It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.
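
The three problems listed in this answer are easy to observe on raw bytes. The following is a rough Python sketch under the assumption that "concatenate" means plain byte-wise joining (as `cat` does); `is_ascii` is a hypothetical helper, not a library function.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"

def is_ascii(data: bytes) -> bool:
    """True when every byte is in the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in data)

# 1. A file with no text is three bytes long instead of zero.
empty_with_bom = BOM + b""

# 2. ASCII-only text stops being ASCII once the BOM is prepended.
ascii_only = b"hello\n"
ascii_with_bom = BOM + ascii_only

# 3. Byte-wise concatenation leaves a second BOM in the middle of the data.
joined = (BOM + b"first\n") + (BOM + b"second\n")
decoded = joined.decode("utf-8-sig")  # strips only the leading BOM

print(len(empty_with_bom), is_ascii(ascii_only), is_ascii(ascii_with_bom))
print(repr(decoded))
```

The stray U+FEFF left inside `decoded` is invisible in most editors, which is part of what makes this hard to debug.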

#4


50  

It's an old question with many good answers, but one thing should be added.

All answers are very general. What I'd like to add are examples of BOM usage that cause real problems and yet that many people don't know about.

BOM breaks scripts

Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter all start with a shebang line, which looks like one of these:

#!/bin/sh
#!/usr/bin/python
#!/usr/local/bin/perl
#!/usr/bin/env node

It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed out of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it had a different magic number and that can lead to problems.
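
To make the magic-number point concrete, here is a small Python sketch that mimics only the two-byte comparison. It is an illustration, not how any particular kernel is implemented, and `starts_with_shebang` is a hypothetical name.

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    """Check the first two bytes for the '#!' magic number (0x23 0x21)."""
    return data[:2] == b"#!"

script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script

print(starts_with_shebang(script))
print(starts_with_shebang(script_with_bom))
print(script_with_bom[:2].hex())  # the bytes a loader would actually compare
```

With the BOM in front, the first two bytes are EF BB, so a loader that compares raw bytes never sees "#!".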

See Wikipedia, article: Shebang, section: Magic number:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 and 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[14] for this reason and for wider interoperability and philosophical concerns. Additionally, a byte order mark is not necessary in UTF-8, as that encoding does not have endianness issues; it serves only to identify the encoding as UTF-8. [emphasis added]

BOM is illegal in JSON

See RFC 7159, Section 8.1:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text.

BOM is redundant in JSON

Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and the endianness used in any JSON stream (see this answer for details).

BOM breaks JSON parsers

Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627:

Determining the encoding and endianness of JSON, examining the first 4 bytes for the NUL byte:

00 00 00 xx - UTF-32BE
00 xx 00 xx - UTF-16BE
xx 00 00 00 - UTF-32LE
xx 00 xx 00 - UTF-16LE
xx xx xx xx - UTF-8

Now, if the file starts with BOM it will look like this:

00 00 FE FF - UTF-32BE
FE FF 00 xx - UTF-16BE
FF FE 00 00 - UTF-32LE
FF FE xx 00 - UTF-16LE
EF BB BF xx - UTF-8

Note that:

  1. UTF-32BE doesn't start with three NULs, so it won't be recognized
  2. UTF-32LE: the first byte is not followed by three NULs, so it won't be recognized
  3. UTF-16BE has only one NUL in the first four bytes, so it won't be recognized
  4. UTF-16LE has only one NUL in the first four bytes, so it won't be recognized

Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.

Additionally if the implementation tests for valid JSON as I recommend, it will reject even the input that is indeed encoded as UTF-8 because it doesn't start with an ASCII character < 128 as it should according to the RFC.
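
The NUL-pattern table can be turned into code to see the breakage. This is a sketch of the RFC 4627 heuristic as described above, not any real parser's implementation; `detect_json_encoding` is a hypothetical name, and inputs shorter than four bytes are simply assumed to be UTF-8.

```python
import codecs

def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding from the NUL pattern of the first four bytes."""
    head = data[:4]
    if len(head) < 4:
        return "utf-8"  # assumption: too short to apply the table
    nul = [byte == 0 for byte in head]
    if nul == [True, True, True, False]:
        return "utf-32-be"   # 00 00 00 xx
    if nul == [True, False, True, False]:
        return "utf-16-be"   # 00 xx 00 xx
    if nul == [False, True, True, True]:
        return "utf-32-le"   # xx 00 00 00
    if nul == [False, True, False, True]:
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"           # xx xx xx xx

doc = '{"a": 1}'
without_bom = detect_json_encoding(doc.encode("utf-16-le"))
with_bom = detect_json_encoding(codecs.BOM_UTF16_LE + doc.encode("utf-16-le"))
print(without_bom, with_bom)
```

In this sketch every BOM-prefixed UTF-16 or UTF-32 input falls through to the last branch and is reported as UTF-8, which matches the misdetection described in the list.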

Other data formats

A BOM in JSON is not needed, is illegal, and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course anyone is free to use things like BOMs or anything else if they need to; just don't call it JSON then.

For data formats other than JSON, take a look at what the data really looks like. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs, even as an optional feature, would only make it more complicated and error-prone.

Other uses of BOM

As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization because it is an example of BOM characters causing real problems.

#5


42  

What's different between UTF-8 and UTF-8 without BOM?

Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

Long answer:

Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.

UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.

Which is better?

Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.

A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
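
A validity check of this kind is short to write. The sketch below relies on Python's strict UTF-8 decoder; `looks_like_utf8` is a hypothetical helper name, and the sample strings are only illustrative.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict by default: raises on any invalid byte
    except UnicodeDecodeError:
        return False
    return True

samples = {
    "utf-8 Greek": "αβγ".encode("utf-8"),
    "iso-8859-7 Greek": "αβγ".encode("iso-8859-7"),
    "latin-1 French": "café".encode("latin-1"),
    "plain ASCII": b"ABC",
}
for label, raw in samples.items():
    print(label, looks_like_utf8(raw))
```

Note that pure ASCII passes as well, since ASCII is a subset of UTF-8; the check only becomes informative once non-ASCII bytes are present.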

#6


29  

UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.

If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8; or Notepad++ with UTF-8 with BOM), Excel opens it fine.
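
If the CSV is generated by a program rather than re-saved by hand, the BOM can be written directly. This is a minimal Python sketch; `write_csv_with_bom` and the sample rows are hypothetical, and whether a given Excel version then opens the file correctly is something to verify locally.

```python
import csv

def write_csv_with_bom(path, rows):
    """Write rows as UTF-8 CSV with a leading BOM (EF BB BF)."""
    # "utf-8-sig" makes Python emit the BOM once, at the start of the file.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle).writerows(rows)

if __name__ == "__main__":
    write_csv_with_bom("out.csv", [["name", "city"], ["Zoë", "Kraków"]])
```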

Prepending the BOM character to UTF-8 text files as a signature is covered in Section 6, "Byte order mark (BOM)", of RFC 3629: "UTF-8, a transformation format of ISO 10646", November 2003, at http://tools.ietf.org/html/rfc3629 (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html)

#7


15  

BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, HTML file, JSON response, RSS, etc.) and causes the kind of embarrassment seen in the recent encoding issue during Obama's talk on Twitter.

It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.

#8


11  

Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?

Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.

On the meaning of the BOM and UTF-8:

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.

Argument for NOT using a BOM:

The primary motivation for not using a BOM is backwards-compatibility with software that is not Unicode-aware... Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.

Argument FOR using a BOM:

The argument for using a BOM is that without it, heuristic analysis is required to determine what character encoding a file is using. Historically such analysis, to distinguish various 8-bit encodings, is complicated, error-prone, and sometimes slow. A number of libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode.

Programmers mistakenly assume that detection of UTF-8 is equally difficult (it is not because of the vast majority of byte sequences are invalid UTF-8, while the encodings these libraries are trying to distinguish allow all possible byte sequences). Therefore not all Unicode-aware programs perform such an analysis and instead rely on the BOM.

In particular, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.

On which is better, WITH or WITHOUT the BOM:

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”

My Conclusion:

Use the BOM only if compatibility with a software application is absolutely essential.

Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic, as it is for other applications.


† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.

#9


7  

I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.

I have been using multiple languages (even Cyrillic) on my pages for a long time, and when the files are saved without a BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters are corrupted.

Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.

I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.

#10


6  

Quoted at the bottom of the Wikipedia page on BOM: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"

#11


6  

UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.

The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.

Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.

#12


6  

When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.

But this is not the case when we have text, CSV and XML files, either on Windows or Linux.

For example, a text file in Windows or Linux, one of the simplest things imaginable, is (usually) not UTF-8.

Save it as XML and declare it as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

It will not display (it will not be read) correctly, even if it's declared as UTF-8.

I had a string of data containing French letters, that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in IDE and "Create New File") or adding the BOM at the beginning of the file

$file="\xEF\xBB\xBF".$string;

I was not able to save the French letters in an XML file.

#13


6  

It should be noted that for some files you must not have the BOM, even on Windows. Examples are SQL*Plus or VBScript files. If such files contain a BOM, you get an error when you try to execute them.

#14


6  

UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it will possibly break older applications that would have otherwise interpreted the file as plain ASCII. These applications will definitely fail when they come across a non ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.

Edit: Just want to make it clear that I prefer to not have the BOM at all. Add it in only if some old rubbish breaks without it and replacing that legacy application is not feasible.

Don't make anything expect a BOM for UTF8.

#15


5  

This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.

As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code, EF BB BF.

If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.

Why is a BOM not recommended?

  1. Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
  2. It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)
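
The guess-then-validate approach described in this answer could look roughly like the following Python sketch. `decode_text` is a hypothetical helper, and the `latin-1` fallback for input that is not valid UTF-8 is an assumption of this example; pick whatever legacy code page your data source actually uses.

```python
import codecs

def decode_text(data: bytes, fallback: str = "latin-1") -> str:
    """Decode bytes by checking for a BOM first, then validating as UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        # BOM found: assume UTF-8, but still decode strictly so that
        # garbled input raises UnicodeDecodeError instead of passing silently.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        # No BOM: the input is treated as UTF-8 if it validates as UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the assumed legacy encoding.
        return data.decode(fallback)

print(decode_text(b"\xef\xbb\xbfABC"))
print(decode_text("é".encode("utf-8")))
print(decode_text(b"caf\xe9"))
```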

When should you encode with a BOM?

If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.

#16


4  

One practical difference is that if you write a shell script for Mac OS X and save it as plain UTF-8 (that is, with a BOM), you will get the response:

#!/bin/bash: No such file or directory

in response to the shebang line specifying which shell you wish to use:

#!/bin/bash

If you save as UTF-8, no BOM (say in BBEdit) all will be well.
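
If the editor at hand has no "no BOM" option, the three bytes can also be removed after the fact. Here is a small Python sketch; `strip_utf8_bom` is a hypothetical helper that rewrites the file in place, so try it on a copy first.

```python
import codecs

def strip_utf8_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file at `path`.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as handle:
        handle.write(data[len(codecs.BOM_UTF8):])
    return True
```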

#17


3  

As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.

Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer that destroyed the layout again. After fiddling with the linked CSS files for hours to no avail, I discovered that Internet Explorer didn't like the BOMfed HTML file. Never again.

Also, I just found this in Wikipedia:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns

#18


2  

From http://en.wikipedia.org/wiki/Byte-order_mark:

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.

My real problem with the absence of BOM is the following. Suppose we've got a file which contains:

abc

Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:

abc-αβγ

Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.
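
The byte counts are easy to check. In this Python sketch, "ANSI" is assumed to mean the Greek Windows code page cp1253, which is only one possibility; `is_valid_utf8` is a hypothetical helper.

```python
text = "αβγ"

ansi_bytes = text.encode("cp1253")  # assumed Greek "ANSI" code page
utf8_bytes = text.encode("utf-8")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(len(ansi_bytes), len(utf8_bytes))
print(is_valid_utf8(b"abc-" + ansi_bytes))
print(is_valid_utf8(b"abc-" + utf8_bytes))
```

The file saved in the legacy code page is no longer valid UTF-8, which is the downstream problem described above.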

#19


1  

The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,

    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.

    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
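
For the "BOM as a signature" case in guideline 2, a reader has to compare the leading bytes against the known BOMs. The following Python sketch shows one way to do that; `sniff_bom` and `_BOMS` are hypothetical names, and the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Longer BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
print(encoding, sample[skip:].decode(encoding))
```

One caveat of this ordering: UTF-16LE text that begins with a BOM followed by U+0000 is indistinguishable from UTF-32LE, so the result is still a guess rather than a guarantee.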

#20


-3  

UTF-8 with BOM is better if you use UTF-8 in HTML files and mix Serbian Cyrillic, Serbian Latin, German, Hungarian or some other exotic language on the same page. That is my opinion (30 years in the computing and IT industry).