How default is the default encoding (UTF-8) in the XML declaration?

Time: 2021-05-21 22:22:52

I know that the default encoding of XML is UTF-8. All XML consumers MUST and so on and so forth. So this is not just a question whether or not XML has a default encoding.

I also know that the XML declaration <?xml version="1.0" ... ?> at the beginning of the document is itself optional, and that specifying the encoding in it is optional as well.

So I ask myself if the following two XML-Declarations are two expressions for the exact same thing:

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>

From my current understanding I would say they are equivalent, but I do not know for sure. Has the equivalence of these two declarations been specified somewhere?

(Consider each of these two example lines to be the first line of an XML document, preceded by no bytes at all, and UTF-8 encoded.)

4 Answers

#1


7  

The Short Answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you're interested in), there is no difference between the two declarations.
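
As a quick sanity check of that claim, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It shows how one particular parser behaves, not what the spec mandates, and the sample document and the `parse_text` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

BODY = "<root>café</root>"

# The same UTF-8 bytes, once without and once with an encoding declaration.
without_decl = ('<?xml version="1.0"?>' + BODY).encode("utf-8")
with_decl = ('<?xml version="1.0" encoding="UTF-8"?>' + BODY).encode("utf-8")


def parse_text(data: bytes) -> str:
    """Parse raw bytes (no external encoding information) and return the root's text."""
    return ET.fromstring(data).text


# Both variants should yield the same decoded text.
print(parse_text(without_decl) == parse_text(with_decl))
```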

The long answer is far more interesting though.

What The Spec Says

Appendix F.1 of the XML specification explains the process that should be followed to determine the encoding when there is no external encoding information.

If the document is encoded as one of the UTF variants, the parser should be able to detect the encoding within the first 4 bytes, either from the Byte Order Mark, or the start of the XML declaration.
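
The detection step can be sketched roughly as follows. This is a simplified Python illustration that covers only a subset of the byte patterns listed in Appendix F.1 (EBCDIC and the unusual UCS-4 byte orders are left out), and `sniff_xml_encoding` is a hypothetical helper name, not part of any library.

```python
import codecs

# BOM signatures. UTF-32 comes first because the UTF-32 LE BOM starts with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Without a BOM, the first bytes of "<?xm" reveal the code unit width and order.
_NO_BOM = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    # ASCII-compatible family: the encoding declaration still has to be read
    # to tell UTF-8 apart from e.g. ISO 8859-1.
    (b"<?xm", "utf-8"),
]


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family from at most the first four bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    head = data[:4]
    for prefix, name in _NO_BOM:
        if head.startswith(prefix):
            return name
    # Neither a BOM nor a recognizable declaration start: assume UTF-8.
    return "utf-8"


print(sniff_xml_encoding(b'<?xml version="1.0"?><root/>'))
```

Note that for the ASCII-compatible case the result is only a first guess; the sketch does not read the encoding declaration itself.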

However, according to the spec, it should still read the encoding declaration.

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity.

If they don't match, according to section 4.3.3:

...it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration
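
A strict processor could enforce that rule along the following lines. This is only a sketch: it assumes the caller already knows the actual encoding (for example from BOM sniffing), uses `ValueError` as a stand-in for the spec's "fatal error", and `check_declared_encoding` is a hypothetical name. Encoding names that Python's `codecs` module does not recognize would raise `LookupError` here.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the text.
_DECL = re.compile(
    r"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def _family(name: str) -> str:
    """Normalize an encoding name and drop a byte-order suffix (utf-16-le -> utf-16)."""
    return re.sub(r"[-_](le|be)$", "", codecs.lookup(name).name)


def check_declared_encoding(data: bytes, actual: str) -> None:
    """Raise ValueError if the declared encoding contradicts the actual one."""
    text = data.decode(actual).lstrip("\ufeff")  # ignore a leading BOM
    match = _DECL.match(text)
    if match is None:
        return  # no encoding declaration, nothing to check
    declared = match.group(1)
    if _family(declared) != _family(actual):
        raise ValueError(f"declared {declared!r}, but entity is encoded as {actual!r}")


doc = '<?xml version="1.0" encoding="UTF-8"?><root/>'
check_declared_encoding(doc.encode("utf-8"), "utf-8")  # consistent, passes silently
try:
    check_declared_encoding(doc.encode("utf-16-le"), "utf-16-le")
except ValueError as exc:
    print("fatal error:", exc)
```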

Encoded UTF-16, Declared UTF-8

Let's see what happens in reality when we create an XML document encoded as UTF-16 but with the encoding declaration set to UTF-8.

Opera, Firefox and Chrome all interpret the document as UTF-16, ignoring the encoding declaration. Internet Explorer (version 9 at least) displays a blank document, but no actual error.

So if you include a UTF-8 encoding declaration on your UTF-8 document and someone at a later stage converts it to UTF-16, it'll work in most browsers, but fail in IE (and, I assume, most Microsoft XML APIs). If you had left the encoding declaration off, you would have been fine.

Technically, I think IE is the most accurate. The fact that it doesn't display an error as such might be explained by the error occurring at the encoding level rather than the XML level. It is presumably doing its best to interpret the UTF-16 bytes as UTF-8, failing to find any characters that decode, and ending up passing an empty character sequence on to the XML parser.

Encoded UTF-8, Declared Otherwise

You might now think that Firefox, Chrome and Opera are just ignoring the encoding declaration altogether, but that's not always the case.

If you encode a document as UTF-8 (with a byte order mark, so it can't be mistaken for anything else), but set the encoding declaration to Latin1, all of the browsers will successfully decode the content as Latin1, ignoring the UTF-8 BOM.

Again this seems right to me. The fact that the BOM characters aren't valid in Latin1 just means they are silently dropped at the character decoding level.

This doesn't work for all declared encodings on a UTF-8 document though. If the declared encoding is UTF-16, we're back with Opera, Firefox and Chrome ignoring the declared encoding, while Internet Explorer returns a blank document.

Essentially, anything that makes IE return a blank document is going to make other browsers ignore the declared encoding.

Other Inconsistencies

It's also worth mentioning the importance of the Byte Order Mark. According to section 4.3.3 of the spec:

Entities encoded in UTF-16 MUST [...] begin with the Byte Order Mark

However, if you try to read a UTF-16 encoded XML document without a BOM, most browsers will nevertheless accept it as valid. Only Firefox reports it as an XML Parsing Error.
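
BOM-less UTF-16 is easy to produce by accident. As an illustration in Python (the sample document and the `has_utf16_bom` helper are made up for this sketch), the generic `utf-16` codec writes a BOM, while the codecs with an explicit byte order do not:

```python
import codecs

doc = '<?xml version="1.0"?><root/>'

with_bom = doc.encode("utf-16")        # generic codec: BOM followed by native byte order
without_bom = doc.encode("utf-16-le")  # explicit byte order: no BOM is written


def has_utf16_bom(data: bytes) -> bool:
    """Return True if the data starts with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


print(has_utf16_bom(with_bom), has_utf16_bom(without_bom))
```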

External Encoding Information

Up to now, we've been considering what happens when there is no external encoding information, but, as others have mentioned, if the document is received via HTTP or enclosed in a MIME envelope of some sort, the encoding information from those sources should take precedence over the document encoding.

Most of the details for the various XML MIME types are described in RFC 3023. However, the reality is somewhat different from what is specified.

First of all, text/xml with an omitted charset parameter should use a charset of US-ASCII, but that requirement has almost always been ignored. Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none.

Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or not included, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.

The only time the encoding from the Content-Type seems to take precedence is when there is no BOM and an explicit charset is specified in the Content-Type.

In any event, there are no cases (involving Content-Type) where including a UTF-8 XML encoding declaration on a UTF-8 document is any different from not having an encoding declaration at all.

#2


5  

It would not be unreasonable for the second declaration to be rejected if it arrived at the start of a document that had already been detected as having a non-UTF-8 compatible encoding (such as UTF-16). However, given your statement that the document is UTF-8 encoded, there is no difference between how they would be treated.

An externally-specified encoding would take precedence in both cases; both documents would still be treated identically.

#3


4  

In isolation, both are equivalent. You have already cited the relevant parts of the specifications which show that both declarations are equivalent.

However, XML can have an envelope, such as the HTTP Content-Type header. The W3C specifies that this envelope information has priority over any other declarations in the file. So for example, if you are retrieving XML via HTTP, you could potentially get this:

HTTP/1.1 200 OK
Content-Type: text/xml

<root/>

In this case, the XML should be read as US-ASCII, because the default charset for text/* MIME types is US-ASCII. This is why you should use the application/xml MIME type, which defaults to UTF-8. The "application" prefix means that the relevant application specification defines things like the default encoding (i.e. the XML spec takes over). With text/* MIME types, the default is US-ASCII, and the charset parameter must be included in the MIME type to change it.

Here's another case:

HTTP/1.1 200 OK
Content-Type: text/xml; charset=win-1252

<?xml version="1.0" encoding="utf-8"?>
<root/>

In this case, a conforming XML processor should read this file as win-1252, not utf-8.

Another case:

HTTP/1.1 200 OK
Content-Type: application/xml

<?xml version="1.0" encoding="win-1252"?>
<root/>

Here the encoding is win-1252.

HTTP/1.1 200 OK
Content-Type: application/xml; charset=ascii

<?xml version="1.0" encoding="win-1252"?>
<root/>

Here the encoding is ascii.
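
The precedence rules behind these four cases can be condensed into a small function. The Python sketch below follows the rules as this answer describes them, not what browsers actually do; `effective_encoding` is a hypothetical helper, it ignores BOMs, and it only looks for an ASCII-compatible encoding declaration.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an ASCII-compatible XML declaration.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def effective_encoding(content_type: str, body: bytes) -> str:
    """Pick the encoding from a Content-Type header value and the document body."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased charset parameter, or None
    if charset:
        return charset  # an explicit charset always wins
    if msg.get_content_maintype() == "text":
        return "us-ascii"  # default for text/* when charset is omitted
    # application/xml without charset: the document itself decides.
    match = _DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


print(effective_encoding("text/xml", b"<root/>"))
```

Applied to the four responses above, this yields us-ascii, win-1252, win-1252 and ascii respectively.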

#4


1  

The way I read the spec, UTF-8 is not the default encoding in an XML declaration. It is only the default encoding "for an entity which begins with neither a Byte Order Mark nor an encoding declaration". If a document is in UTF-16 and has a BOM, it may have an XML declaration without an encoding declaration or no XML declaration at all and still be valid XML.
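
To illustrate the UTF-16 case, here is a small Python sketch with the standard library's `xml.etree.ElementTree`. It shows only how that one parser treats a UTF-16 document with a BOM and no encoding declaration; the sample document is made up.

```python
import codecs
import xml.etree.ElementTree as ET

doc = '<?xml version="1.0"?><root>café</root>'

# UTF-16 little-endian bytes with a BOM, and no encoding declaration at all.
data = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")

root = ET.fromstring(data)
print(root.text)
```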

The two XML declarations you mentioned should be equivalent only for documents without a BOM.
