Parsing XML that contains illegal characters

Time: 2022-03-23 06:18:30

A message I receive from a server contains tags, and the data I need is inside those tags.

I try to parse the payload as XML, but the parser throws illegal-character exceptions.

I also tried HttpUtility and Security Utility to escape the illegal characters. The only problem is that they also escape < and >, which are needed to parse the XML.

My question is: how do I parse XML when the data in it contains characters that are illegal in XML (for example, a bare & where &amp; is required)?

Thanks.

Example:

<item><code>1234</code><title>voi hoody & polo shirt + Mckenzie jumper</title><description>Good condition size small - medium, text me if interested</description></item>

3 solutions

#1


4  

If & is the only invalid character, you can use a regex to replace it with &amp;. The regex uses a negative lookahead so that entities already present, such as &amp;, &quot; and &#111;, are not escaped a second time.

The regex can be as follows:

&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)

Sample code:

using System.Text.RegularExpressions;
using System.Xml.Linq;
// Escape every '&' that does not already start an entity or character reference, then parse.
string content = @"<item><code>1234 &amp; test</code><title>voi hoody & polo shirt + Mckenzie jumper&other stuff</title><description>Good condition size small - medium, text me if interested</description></item>";
content = Regex.Replace(content, @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)", "&amp;", RegexOptions.IgnoreCase);
XElement xItem = XElement.Parse(content);
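
To make it easier to see which ampersands the lookahead touches, here is a minimal, self-contained sketch built around the same regex. The class name `AmpersandRepair` and the method `EscapeBareAmpersands` are hypothetical names chosen for illustration; they are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandRepair
{
    // Matches '&' only when it is NOT followed by a predefined entity name
    // or a numeric character reference ending in ';'.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: escapes bare ampersands and leaves existing entities alone.
    public static string EscapeBareAmpersands(string text)
    {
        return BareAmpersand.Replace(text, "&amp;");
    }

    public static void Main()
    {
        string raw = "<title>a & b &amp; c &#111; d &other</title>";
        string repaired = EscapeBareAmpersands(raw);

        // prints: <title>a &amp; b &amp; c &#111; d &amp;other</title>
        Console.WriteLine(repaired);

        // The repaired text now parses; prints: a & b & c o d &other
        XElement title = XElement.Parse(repaired);
        Console.WriteLine(title.Value);
    }
}
```

One caveat: XML entity names are case-sensitive, so with RegexOptions.IgnoreCase an input such as &AMP; is left untouched and would still fail to parse. If that matters for your data, one option is to drop IgnoreCase and write the hex class as [a-fA-F\d].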

#2


1  

Here is a more general solution than the regex. First declare an array that holds each invalid character you want to replace with its encoded version:

var invalidChars = new[] { '&' /* other characters go here */ };

Then read the whole XML file as a single string:

var xmlContent = File.ReadAllText("path");

Then replace the invalid chars using LINQ and HttpUtility.HtmlEncode:

var validContent = string.Concat(xmlContent
        .Select(x =>
        {
            if (invalidChars.Contains(x)) return HttpUtility.HtmlEncode(x);
            return x.ToString();
        }));

Then parse the result with XDocument.Parse; that's all.
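
The per-character approach above can be put together as the following runnable sketch. It is an illustration under assumptions, not the answer's exact code: it swaps in `System.Net.WebUtility.HtmlEncode` so the example does not depend on System.Web, and `PerCharEncodingDemo` and `EncodeChars` are hypothetical names. It also shows one behaviour worth checking against your data: because every listed character is encoded, an entity that is already present gets encoded a second time.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

public static class PerCharEncodingDemo
{
    // Hypothetical helper: HTML-encodes every character listed in invalidChars
    // and copies all other characters through unchanged.
    public static string EncodeChars(string text, char[] invalidChars)
    {
        return string.Concat(text.Select(c =>
            invalidChars.Contains(c)
                ? WebUtility.HtmlEncode(c.ToString())
                : c.ToString()));
    }

    public static void Main()
    {
        string raw = "<title>a &amp; b & c</title>";
        string encoded = EncodeChars(raw, new[] { '&' });

        // The bare '&' is fixed, but the existing "&amp;" becomes "&amp;amp;".
        // prints: <title>a &amp;amp; b &amp; c</title>
        Console.WriteLine(encoded);

        // prints: a &amp; b & c
        Console.WriteLine(XElement.Parse(encoded).Value);
    }
}
```

If the input can already contain valid entities, the lookahead regex from answer #1 avoids this double encoding. Adding < or > to the array would also escape the tags themselves, which is the problem described in the question.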

#3


1  

Don't call it "XML which contains illegal characters". It isn't XML. You can't use XML tools to process something that isn't XML.

When you get bad XML, the best thing is to find out where and when it was generated, and fix the problem at source.

If you can't do that, you need to find some way using non-XML tools (e.g. custom perl scripts) to repair the XML before you let it anywhere near an XML parser. The way you do this will depend on the nature of the errors you need to repair.
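
As one possible shape for such a repair step, the sketch below parses strictly first and falls back to a text-level repair only when that fails, reporting whether a repair was needed so the producer can still be chased. It assumes the only defect is a bare &, and it reuses the regex from answer #1; `LenientXml` and `ParseWithRepair` are hypothetical names, not an existing API.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

public static class LenientXml
{
    // Hypothetical helper: try a strict parse; on failure, escape bare '&'
    // as plain text (no XML tooling involved) and parse again.
    public static XElement ParseWithRepair(string payload, out bool wasRepaired)
    {
        try
        {
            wasRepaired = false;
            return XElement.Parse(payload);
        }
        catch (XmlException)
        {
            string repaired = Regex.Replace(
                payload,
                @"&(?!(?:lt|gt|amp|apos|quot|#\d+|#x[a-f\d]+);)",
                "&amp;",
                RegexOptions.IgnoreCase);
            wasRepaired = true;
            // Still throws XmlException if the payload has other defects.
            return XElement.Parse(repaired);
        }
    }

    public static void Main()
    {
        bool repairedFlag;
        XElement item = ParseWithRepair(
            "<item><title>hoody & polo shirt</title></item>", out repairedFlag);

        // prints: hoody & polo shirt (repaired: True)
        Console.WriteLine(item.Element("title").Value + " (repaired: " + repairedFlag + ")");
    }
}
```

Surfacing the `wasRepaired` flag (for example, by logging it) keeps the workaround visible, which fits the advice to fix the problem at the source where possible.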
