Which characters do I need to escape in XML documents?

Time: 2021-02-13 22:27:02

What characters must be escaped in XML documents, or where could I find such a list?

10 Answers

#1


1113  

If you use an appropriate class or library, they will do the escaping for you. Many XML issues are caused by string concatenation.
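
As one illustration of letting a library do the escaping, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element name, attribute name, and sample strings are made up for the example; the point is only that the special characters are passed in raw and come back unchanged after a round trip.

```python
import xml.etree.ElementTree as ET

# Build the document as a tree instead of concatenating strings.
# "note" and "title" are hypothetical names chosen for this sketch.
root = ET.Element("note", {"title": 'Tom & "Jerry" <cartoon>'})
root.text = "1 < 2 && 3 > 2"

# The serializer escapes the special characters where needed.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Parsing the serialized text gives back the original strings.
parsed = ET.fromstring(xml_text)
print(parsed.get("title"))
print(parsed.text)
```

Because the raw strings never touch the markup directly, there is no place where a forgotten escape could produce a malformed document.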

XML escape characters

There are only five:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;
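
If a string really has to be escaped by hand, the table above can be applied with Python's xml.sax.saxutils.escape, which handles &, < and > by default and accepts a mapping for anything extra. The helper name escape_all_five and the sample string are assumptions made for this sketch.

```python
from xml.sax.saxutils import escape, unescape

# escape() always handles &, < and >; the two quote characters are opt-in.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {entity: char for char, entity in QUOTES.items()}


def escape_all_five(text: str) -> str:
    """Escape all five characters from the table (hypothetical helper)."""
    return escape(text, QUOTES)


original = """<a href="x">Tom & Jerry's</a>"""
escaped = escape_all_five(original)
print(escaped)

# unescape() reverses the substitution when given the inverse mapping.
restored = unescape(escaped, UNQUOTES)
print(restored == original)
```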

Escaping characters depends on where the special character is used.

The examples can be validated with the W3C Markup Validation Service.

Text

The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text (the one exception is a > that would complete the sequence ]]>, which must be escaped):

<?xml version="1.0"?>
<valid>"'></valid>

Attributes

The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:

<?xml version="1.0"?>
<valid attribute=">"/>

The ' character needn't be escaped in an attribute value that is delimited by ":

<?xml version="1.0"?>
<valid attribute="'"/>

Likewise, the " character needn't be escaped in an attribute value that is delimited by ':

<?xml version="1.0"?>
<valid attribute='"'/>
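
The choice of delimiter described above can be left to a helper. As a small sketch, Python's xml.sax.saxutils.quoteattr picks the quote style that needs the least escaping and falls back to &quot; only when the value contains both kinds of quote. The sample values are made up.

```python
from xml.sax.saxutils import quoteattr

samples = [
    "it's",             # only a single quote: wrapped in double quotes
    'say "hi"',         # only double quotes: wrapped in single quotes
    "both ' and \"",    # both kinds: double quotes, with " escaped
    "a > b",            # quoteattr also escapes >, although XML does not require it
]

for value in samples:
    # quoteattr returns the value including its surrounding quotes.
    print("<valid attribute=%s/>" % quoteattr(value))
```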

Comments

None of the five special characters should be escaped in comments, because entity references are not recognized there:

<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>
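
To see why escaping inside a comment is pointless, the following sketch reads comment text back with Python's xml.dom.minidom. The second comment is an extra case added for this illustration: an &amp; written inside a comment stays as the five literal characters and is not turned back into &.

```python
from xml.dom import minidom

# Two comments: one with the raw special characters, one with an entity.
doc = minidom.parseString('<valid><!-- "\'<>& --><!-- &amp; --></valid>')
first, second = doc.documentElement.childNodes

# Comment nodes expose their text through the .data attribute.
print(repr(first.data))
print(repr(second.data))
```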

CDATA

None of the five special characters should be escaped in CDATA sections, where they are taken literally:

<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>
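
The only thing a CDATA section cannot contain is its own terminator, ]]>. A common workaround, sketched here with a hypothetical helper named to_cdata, is to end the section after ]] and open a new one before the >. The sample payload is made up.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any "]]>" across two sections."""
    # "]]" closes the first section, and ">" opens the next one.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


payload = "\"'<>& and a tricky ]]> terminator"
document = "<valid>" + to_cdata(payload) + "</valid>"
print(document)

# The parser joins the adjacent sections back into one string.
recovered = ET.fromstring(document).text
print(recovered == payload)
```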

Processing instructions

None of the five special characters should be escaped in XML processing instructions:

<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>

XML vs. HTML

HTML has its own set of escape codes which cover a lot more characters.

#2


87  

Perhaps this will help:

List of XML and HTML character entity references:

In SGML, HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series of characters called a character reference, of which there are two types: a numeric character reference and a character entity reference. This article lists the character entity references that are valid in HTML and XML documents.

That article lists the following five predefined XML entities:

quot  "
amp   &
apos  '
lt    <
gt    >

#3


64  

According to the specifications of the World Wide Web Consortium (W3C), there are 5 characters that must not appear in their literal form in an XML document, except when used as markup delimiters or within a comment, a processing instruction, or a CDATA section. In all the other cases, these characters must be replaced either using the corresponding entity or the numeric reference according to the following table:

Original character    XML entity replacement    XML numeric replacement
<                     &lt;                      &#60;
>                     &gt;                      &#62;
"                     &quot;                    &#34;
&                     &amp;                     &#38;
'                     &apos;                    &#39;
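
The numeric column is simply each character's Unicode code point in decimal. The short sketch below regenerates the table from the five entity names and checks that a parser treats both spellings the same; the dictionary name PREDEFINED is chosen for this example.

```python
import xml.etree.ElementTree as ET

# Entity name -> character for the five predefined XML entities.
PREDEFINED = {"lt": "<", "gt": ">", "quot": '"', "amp": "&", "apos": "'"}

rows = []
for name, char in PREDEFINED.items():
    # ord() gives the code point used in the numeric character reference.
    rows.append((char, f"&{name};", f"&#{ord(char)};"))
    print(*rows[-1])

# Both spellings parse to the same five characters.
by_name = ET.fromstring("<v>&lt;&gt;&quot;&amp;&apos;</v>").text
by_number = ET.fromstring("<v>&#60;&#62;&#34;&#38;&#39;</v>").text
print(by_name == by_number)
```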

Notice that the aforementioned entities can also be used in HTML, with the exception of &apos;, which was introduced with XHTML 1.0 and is not declared in HTML 4. For this reason, and to ensure backward compatibility, the XHTML specification recommends the use of &#39; instead.

#4


44  

Escaping characters is different for tags and attributes.

For tags:

 < &lt;
 > &gt; (only for compatibility, read below)
 & &amp;

For attributes:

" &quot;
' &apos;

http://www.w3.org/TR/2008/REC-xml-20081126/#syntax

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

#5


18  

In addition to the commonly known five characters [<, >, &, ", '], I would also escape the vertical tab character (0x0B). It is valid UTF-8, but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.
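
One caveat worth stating: in XML 1.0 a character reference must also refer to a legal character, so writing the vertical tab as &#x0B; does not make the document well-formed either. A practical option is to remove or replace such characters before serializing. The sketch below does that with a regular expression built from the Char production of the XML 1.0 specification; the function name strip_invalid_xml10 and the sample strings are assumptions made for the example.

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml10(text: str, replacement: str = "") -> str:
    """Remove (or replace) characters that XML 1.0 does not allow at all."""
    return _INVALID_XML10.sub(replacement, text)


dirty = "line one\x0bline two"
clean = strip_invalid_xml10(dirty)
print(repr(clean))

# Tab, line feed and carriage return are legal and are left alone.
print(repr(strip_invalid_xml10("a\tb\n")))
```

Whether to drop the character or substitute a placeholder depends on the data; the replacement parameter leaves that choice to the caller.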

#6


4  

New, simplified answer to an old, commonly asked question...

Simplified XML Escaping

  1. Always (90% important to remember)

    • Escape < as &lt; unless < is starting a <tag/>.
    • Escape & as &amp; unless & is starting an &entity;.
  2. Attribute Values (9% important to remember)

    • attr=" 'Single quotes' are ok within double quotes."
    • attr=' "Double quotes" are ok within single quotes.'
    • Escape " as &quot; and ' as &apos; otherwise.
  3. Comments, CDATA, and Processing Instructions (1% important to remember)

    • <!-- Within comments --> nothing has to be escaped, but no -- strings are allowed.
    • <![CDATA[ Within CDATA ]]> nothing has to be escaped, but no ]]> strings are allowed.
    • <?PITarget Within PIs ?> nothing has to be escaped, but no ?> strings are allowed.
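
The three rules above can be sketched as a minimal hand-written escaper. This is only an illustration of the rules, not a replacement for a real XML library: the function names escape_text, escape_attr and make_comment are hypothetical, and whitespace normalization in attribute values is ignored.

```python
import xml.etree.ElementTree as ET


def escape_text(text: str) -> str:
    """Rule 1: escape & and <; > only where it would complete "]]>"."""
    # & goes first so the ampersands added by later steps are left alone.
    text = text.replace("&", "&amp;").replace("<", "&lt;")
    return text.replace("]]>", "]]&gt;")


def escape_attr(value: str) -> str:
    """Rule 2: return the value with delimiters, escaping as little as possible."""
    value = value.replace("&", "&amp;").replace("<", "&lt;")
    if '"' not in value:
        return '"' + value + '"'
    if "'" not in value:
        return "'" + value + "'"
    return '"' + value.replace('"', "&quot;") + '"'


def make_comment(body: str) -> str:
    """Rule 3: nothing is escaped, but "--" (or a trailing "-") is rejected."""
    if "--" in body or body.endswith("-"):
        raise ValueError("comment text must not contain '--' or end with '-'")
    return "<!--" + body + "-->"


attr = 'He said "it\'s" <ok> & fine'
text = "a < b && c ]]> d > e"
document = "<e a=%s>%s%s</e>" % (escape_attr(attr), make_comment(' "\'<>& '), escape_text(text))
print(document)

# Round trip: the parser recovers the original attribute value.
element = ET.fromstring(document)
print(element.get("a") == attr)
```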

#7


3  

Abridged from: http://en.wikipedia.org/wiki/XML#Escaping

There are five predefined entities:

&lt; represents "<"
&gt; represents ">"
&amp; represents "&"
&apos; represents '
&quot; represents "

"All permitted Unicode characters may be represented with a numeric character reference." For example:

&#20013;
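
The number in the reference above is the decimal Unicode code point of the character, here 20013 (hexadecimal 4E2D), which is the character 中. A small sketch with a hypothetical helper named char_ref shows the conversion in both directions.

```python
import xml.etree.ElementTree as ET


def char_ref(char: str, hexadecimal: bool = False) -> str:
    """Build a numeric character reference for a single character."""
    code = ord(char)
    return f"&#x{code:X};" if hexadecimal else f"&#{code};"


decimal_ref = char_ref("中")
hex_ref = char_ref("中", hexadecimal=True)
print(decimal_ref, hex_ref)

# A parser turns either form back into the same character.
decoded = ET.fromstring("<c>" + decimal_ref + hex_ref + "</c>").text
print(decoded)
```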

Most of the control characters and some other Unicode ranges are specifically excluded, meaning (I think) they can't occur either escaped or direct:

http://en.wikipedia.org/wiki/Valid_characters_in_XML

#8


3  

It depends on the context. For element content, it is < and &, and also ]]> (a three-character string rather than a single character). For attribute values, it is < and &, plus whichever of " and ' delimits the value. For CDATA, it is ]]>.

#9


-2  

Only < and & are required to be escaped if they are to be treated as character data and not markup:

http://www.w3.org/TR/xml11/#syntax

#10


-2  

These need to be escaped:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;
