Why does an en-dash (–) trigger an illegal XML character error (C#/SSMS)?

Date: 2022-01-18 07:43:51

This is not a question about how to overcome the "XML parsing: ... illegal xml character" error, but about why it happens. I know that there are fixes (1, 2, 3), but I need to know where the problem arises before choosing the best solution (what causes the error under the hood?).

We are calling a Java-based webservice from C#. From the strongly-typed data returned, we create an XML file that will be passed to SQL Server. The webservice data is encoded using UTF-8, so in C# we create the file and specify UTF-8 where appropriate:

var encodingType = Encoding.UTF8;
// logic removed...
var xdoc = new XDocument();
xdoc.Declaration = new XDeclaration("1.0", encodingType.WebName, "yes");
// logic removed...
System.IO.File.WriteAllText(xmlFullPath, xdoc.Declaration.ToString() + xdoc.Document.ToString(), encodingType);

This creates an XML file on disk that contains the following (abbreviated) data:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>

Notice that the dash in the second record (–) is different from the hyphen in the first (-). I believe the second instance is an en-dash.

If I open that XML file in Firefox/IE/VS2015, it opens without error. The W3C XML validator also accepts it. But SSMS 2012 does not like it:

declare @xml XML = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 3, character 25, illegal xml character

So why does en-dash cause the error? From my research, it would appear that

...only a few entities that need escaping: <, >, ", ' and & in both HTML and XML. (Source)

...of which en-dash is not one. An encoded version (replacing with &#8211;) works fine.

UPDATE

Based on the input, people state that the en-dash isn't recognised as UTF-8, yet it is listed here: http://www.fileformat.info/info/unicode/char/2013/index.htm. So, as a perfectly legal character, why won't SSMS read it when passed as XML (using UTF-8 or UTF-16)?

4 Answers

#1


4  

Can you modify the XML encoding declaration? If so:

declare @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?><records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';

select @xml

(No column name)
<records><r RecordName="Option - Foo" /><r RecordName="Option – Bar" /></records>

Speculative Edit

Both of these fail with illegal xml character:

set @xml = '<?xml version="1.0" encoding="utf-8"?><x> – </x>'
set @xml = '<?xml version="1.0" encoding="utf-16"?><x> – </x>'

because they pass a non-Unicode varchar to the XML parser; the string contains Unicode so it must be treated as such, i.e. as an nvarchar (UTF-16). Otherwise the 3 bytes comprising the – are misinterpreted as multiple characters, and one or more of them is not in the acceptable range for XML.
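To make the byte counts concrete, here is a small C# sketch (the class and method names are made up for illustration). It only shows that the en-dash is a legal XML character and that its byte sequence depends on the encoding; it says nothing about what SQL Server's parser does internally.

```csharp
using System;
using System.Text;
using System.Xml;

public static class EnDashBytes
{
    // Formats a byte array as space-separated hex, e.g. "E2 80 93".
    public static string Hex(byte[] bytes) => BitConverter.ToString(bytes).Replace("-", " ");

    public static void Main()
    {
        const char enDash = '\u2013';

        // U+2013 lies inside the XML 1.0 Char range, so the character itself is legal.
        Console.WriteLine(XmlConvert.IsXmlChar(enDash));

        // The same character is three bytes in UTF-8 and two bytes in UTF-16 (little endian).
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(enDash.ToString())));    // E2 80 93
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(enDash.ToString()))); // 13 20
    }
}
```

So the character is never the problem; the mismatch is between the bytes that actually arrive and the encoding the declaration promises.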

This does pass an nvarchar string to the parser, but fails with "unable to switch the encoding":

set @xml = N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'

This is because an nvarchar (UTF-16) string is passed to the XML parser, but the XML document states that it is UTF-8, and the – is not represented the same way in the two encodings.

This works, as everything is UTF-16:

set @xml = N'<?xml version="1.0" encoding="utf-16"?><x> – </x>'

#2


5  

Please permit me to answer my own question, so that I understand it fully myself. I won't accept this as the answer; it is the combination of the other answers that led me here. If this answer helps you in the future, please upvote the other posts as well.

The basic underlying rule is that XML with Unicode characters should be passed to, and parsed as, Unicode by SQL Server. Therefore C# should generate the XML as UTF-16, which is the SSMS and .NET default.
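As a rough sketch of what generating the XML as UTF-16 could look like on the C# side: saving an XDocument to a StringWriter yields a .NET string (UTF-16 in memory) whose declaration says utf-16, even if the XDeclaration asked for utf-8. The helper name ToUtf16String is invented here, and the record content is taken from the question.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

public static class Utf16Xml
{
    // Serializes the document to a string whose declaration names utf-16.
    public static string ToUtf16String(XDocument doc)
    {
        // StringWriter reports UTF-16 as its encoding, and Save writes that
        // encoding into the XML declaration instead of the one in XDeclaration.
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        var doc = new XDocument(
            new XDeclaration("1.0", "utf-8", "yes"),
            new XElement("records",
                new XElement("r", new XAttribute("RecordName", "Option \u2013 Bar"))));

        Console.WriteLine(ToUtf16String(doc));
    }
}
```

The resulting string would then need to reach SQL Server as Unicode, for example through a parameter typed as SqlDbType.NVarChar or SqlDbType.Xml rather than a varchar literal. That part is an assumption about the calling code and is not shown in the question.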

Cause of original problem

This variable declares XML with UTF-8 encoding, but under that declaration the en-dash cannot be used as a literal character in a varchar string; it has to be written as a character reference. This fails:

DECLARE @badxml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 3, character 29, illegal xml character

Another approach that doesn't work is to switch UTF-8 to UTF-16 in the XML declaration only. The string here is not Unicode, so the implicit conversion fails:

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 1, character 56, unable to switch the encoding

Solutions

Alternatives that work are:

1) Leave as utf-8 but write the character as a hexadecimal character reference (reference):

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option &#x2013; Bar" />
</records>';

2) As above but with a decimal character reference (reference):

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option &#8211; Bar" />
</records>';

3) Include the original character, but remove the utf-8 encoding from the declaration (SSMS then applies its default, UTF-16):

DECLARE @xml xml = '<?xml version="1.0" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

4) Retain the UTF-16 declaration, but make the string Unicode (note the preceding N before the literal that is converted to XML):

DECLARE @xml xml = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';
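If the character-reference route (options 1 and 2) is preferred, the references do not have to be written by hand. A minimal sketch, assuming the XML is built with XDocument as in the question: an XmlWriter configured with an ASCII encoding replaces every character that ASCII cannot represent with a numeric character reference. The names AsciiXml and ToAsciiXml are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class AsciiXml
{
    // Writes the document as ASCII bytes; characters outside ASCII
    // come out as numeric character references such as &#x2013;.
    public static string ToAsciiXml(XDocument doc)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            // The stream holds pure ASCII, so decoding it as ASCII is lossless.
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        var doc = new XDocument(
            new XElement("records",
                new XElement("r", new XAttribute("RecordName", "Option \u2013 Bar"))));

        Console.WriteLine(ToAsciiXml(doc));
    }
}
```

Note that the declaration written this way names the ASCII encoding rather than utf-8. Whether SQL Server is happy with that label is not something this thread confirms, so treat it as something to test.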

#3


4  

SQL Server internally uses UTF-16. Either leave the encoding out or cast to Unicode.

The reason you are looking for: With UTF-8 specified, this character is not known.

--without your directive, SQL Server picks its default
declare @xml XML = 
'<records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';
select @xml;

--or UNICODE, but you must use UTF-16
declare @xml2 XML = 
CAST('<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>' AS NVARCHAR(MAX));

select @xml2

UPDATE

UTF-8 means that chunks of 8 bits are used to carry information. The base characters take just one chunk, easy going...

Other characters can be encoded as well. There are "c2" and "c3" codes (look here). c3-codes need three chunks to be encoded. But the internally used UTF-16 expects characters encoded in 2 bytes.

Hope this is clear now...

UPDATE 2

This code will show you that the hyphen has the ASCII code 45 and your en-dash the code 150:

DECLARE @x VARCHAR(100)=
'<r RecordName="Option - Foo" /><r RecordName="Option – Bar" />';

WITH RunningNumbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
    FROM sys.objects
)
SELECT SUBSTRING(@x,Nmbr,1), ASCII(SUBSTRING(@x,Nmbr,1)) AS ASCII_Code
FROM RunningNumbers
WHERE ASCII(SUBSTRING(@x,Nmbr,1)) IS NOT NULL;

Have a look here. All characters with 7 bits are "plain" and should encode without problems. "Extended ASCII" depends on the code page and can vary: 150 might be an en-dash or something else. UTF-8 uses some tricky encodings to allow strange characters to be "legal". Obviously (this was new to me too) the internally used UTF-16 cannot cope with c3-characters.
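One way to see why code 150 matters, sketched in C# under the assumption that the varchar uses Windows-1252 (where 150, hex 96, is the en-dash): a lone 0x96 byte is not a valid UTF-8 sequence, while the real UTF-8 form of the en-dash is the three bytes E2 80 93. The helper IsValidUtf8 is made up for this illustration, and it is only a guess that SQL Server's parser rejects the byte for the same reason.

```csharp
using System;
using System.Text;

public static class LoneByteCheck
{
    // Returns true when the bytes form valid UTF-8, false otherwise.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting the replacement character.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsValidUtf8(new byte[] { 0x2D }));             // hyphen, code 45
        Console.WriteLine(IsValidUtf8(new byte[] { 0x96 }));             // code 150, assumed Windows-1252 en-dash
        Console.WriteLine(IsValidUtf8(new byte[] { 0xE2, 0x80, 0x93 })); // en-dash encoded as UTF-8
    }
}
```

Only the middle case fails, which fits a varchar literal whose declaration claims utf-8 while its bytes come from the code page.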

#4


2  

The MSDN guidelines say:

SQLXML 4.0 relies upon the limited support for DTDs provided in SQL Server. SQL Server allows for an internal DTD in xml data type data, which can be used to supply default values and to replace entity references with their expanded contents. SQLXML passes the XML data "as is" (including the internal DTD) to the server. You can convert DTDs to XML Schema (XSD) documents using third-party tools, and load the data with inline XSD schemas into the database.
