After creating XML with DOM, the header contains UTF-8?

Time: 2022-10-24 23:58:33

I need to create an XML file using DOM in Java (under Eclipse), and I am using the following code:

        // write the content into xml file
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(doc);
        StreamResult result = new StreamResult(new File("output.xml"));  
        transformer.transform(source, result);

The first line of my XML is:

<?xml version="1.0" encoding="UTF-8"?>

and not:

<?xml version="1.0"?>

My questions are:

  1. What is the difference between those two declarations?

  2. How can I generate the XML file with the header <?xml version="1.0"?>

Regards

1 solution

#1


1  

In the modern world, text files have an "encoding", which defines how characters are represented as bytes in the file. You won't notice this if your file contains ONLY plain ASCII characters (0x00 thru 0x7F), but if you need to represent anything else, such as symbols or accented characters, then a consumer of the file needs to know how those characters are encoded.

There are several different ways to encode extended characters, the most common ones being ISO-8859-x (where x depends on the language) and the Unicode encodings. Unicode assigns a unique number (a "code point") to every character. The ISO code pages use the range 0x80 thru 0xFF for extended characters, one byte per character. UTF-8 represents each Unicode code point as a variable-length sequence of one to four 8-bit bytes. The same extended character (for example e-circumflex) will have different representations in different encodings.
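To make the last point concrete, here is a minimal sketch that prints the bytes Java produces for e-circumflex (U+00EA) in two encodings. The class and method names (EncodingDemo, hex) are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encode the text with the given charset and render the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eCircumflex = "\u00EA"; // e-circumflex
        // One byte in ISO-8859-1, two bytes in UTF-8.
        System.out.println(hex(eCircumflex, StandardCharsets.ISO_8859_1)); // EA
        System.out.println(hex(eCircumflex, StandardCharsets.UTF_8));      // C3 AA
    }
}
```

A plain ASCII character such as "e" comes out as the single byte 65 in both encodings, which is why the difference goes unnoticed in ASCII-only files.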

The serializer you used is configured to output UTF-8. A consumer of that file must know that UTF-8 was used, or it risks mangling the data. You have probably seen web pages containing black-diamond characters, or text where apostrophes or other special characters are replaced with two weird characters. These are symptoms of a mismatch between the encoding used to write the text and the one used to read it.
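Both symptoms can be reproduced in a few lines. The sketch below writes e-circumflex in one encoding and reads it back in another, then prints the resulting code points; MojibakeDemo, misread and codePoints are hypothetical names used only here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text with one charset, then decode the bytes with a different one.
    static String misread(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    // Render a string as a list of Unicode code points, e.g. "U+00C3 U+00AA".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int cp : s.codePoints().toArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eCircumflex = "\u00EA";
        // UTF-8 bytes read as ISO-8859-1: two unrelated characters appear.
        System.out.println(codePoints(misread(eCircumflex,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1))); // U+00C3 U+00AA
        // ISO-8859-1 byte read as UTF-8: invalid input becomes the replacement character.
        System.out.println(codePoints(misread(eCircumflex,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8))); // U+FFFD
    }
}
```

U+FFFD is the replacement character that many fonts draw as a black diamond with a question mark; the first case is the "two weird characters" effect.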

There is probably a way to force the serializer to omit the encoding declaration, but if you do, the consumer of the file may not be able to decode it correctly, since it will have to guess the encoding.
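One possible approach, sketched here rather than offered as the definitive answer: the standard JAXP output property OutputKeys.OMIT_XML_DECLARATION suppresses the whole declaration (not just the encoding attribute), so the short form can be written by hand before the document is serialized. This sketch assumes the file is still written as UTF-8, which is what the XML specification tells a parser to assume when there is no encoding declaration and no byte order mark. The class name XmlHeaderDemo and its helper methods are invented for this example.

```java
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class XmlHeaderDemo {
    // Build a tiny DOM document with a single root element (sample data only).
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("root"));
        return doc;
    }

    // Write a hand-made declaration, then the document without the generated one.
    static void write(Document doc, Writer out) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Suppress the declaration the serializer would normally emit.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // Keep the serializer's character handling consistent with a UTF-8 file.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        out.write("<?xml version=\"1.0\"?>\n");
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        // The Writer fixes the byte encoding of the file to UTF-8 explicitly.
        try (Writer out = Files.newBufferedWriter(Paths.get("output.xml"), StandardCharsets.UTF_8)) {
            write(sampleDocument(), out);
        }
    }
}
```

If the file were written in any encoding other than UTF-8 (or UTF-16 with a byte order mark), dropping the encoding attribute would bring back exactly the guessing problem described above.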
