How do I get XmlDocument.Save() to write encoding="us-ascii" with numeric character references instead of question marks?

Date: 2023-01-06 13:30:23

My goal is to get a binary buffer (MemoryStream.ToArray() would yield byte[] in this case) of XML without losing the Unicode characters. I would expect the XML serializer to use numeric character references to represent anything that would be invalid in ASCII. So far, I have:

using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<x>“∞π”</x>");
        using (var buf = new MemoryStream())
        {
            using (var writer = new StreamWriter(buf, Encoding.ASCII))
                doc.Save(writer);
            Console.Write(Encoding.ASCII.GetString(buf.ToArray()));
        }
    }
}

The above program produces the following output:

$ ./ConsoleApplication2.exe
<?xml version="1.0" encoding="us-ascii"?>
<x>????</x>

I figured out how to tell XmlDocument.Save() to use encoding="us-ascii": hand it a TextWriter with TextWriter.Encoding set to Encoding.ASCII. The documentation says "The encoding on the TextWriter determines the encoding that is written out." But how can I tell it that I want numeric character references instead of its default lossy behavior? I have tested that doc.Save(Console.OpenStandardOutput()) writes the expected data (without an XML declaration) as UTF-8 with all of the correct characters, so I know that doc contains the information I wish to serialize. It's just a matter of figuring out the right way to tell the XML serializer that I want encoding="us-ascii" with character references…

I understand that it may be non-trivial to write XML documents that are both encoding="us-ascii" and supportive of constructs like <π/> (I think this one might only be doable with an external document type definition; yes, I have tried, just for fun). But I thought it was quite common to output character references for non-ASCII characters in an ASCII XML document, to preserve content and attribute-value character data in Unicode-unfriendly environments. I thought that using numeric character references to represent Unicode characters was analogous to using base64 to protect a blob, while keeping the content more readable. How do I do this with .NET?

1 Solution

#1


4  

You can use XmlWriter instead:

var doc = new XmlDocument();
doc.LoadXml("<x>“∞π”</x>");
using (var buf = new MemoryStream())
{
    // XmlWriter is told the target encoding, so it can escape characters that encoding cannot represent.
    using (var writer = XmlWriter.Create(buf,
        new XmlWriterSettings { Encoding = Encoding.ASCII }))
    {
        doc.Save(writer);
    }
    Console.Write(Encoding.ASCII.GetString(buf.ToArray()));
}

Outputs:

<?xml version="1.0" encoding="us-ascii"?><x>&#x201C;&#x221E;&#x3C0;&#x201D;</x> 
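For anyone who wants to try this as a complete program, here is a minimal sketch that wraps the approach above in a helper. The names AsciiXmlDemo and ToAsciiBytes are made up for illustration and are not part of .NET. It first shows why the StreamWriter route loses data (a plain ASCII encoder substitutes '?' for anything it cannot represent), then serializes through XmlWriter and parses the bytes back as a sanity check.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class AsciiXmlDemo
{
    // Hypothetical helper (not a .NET API): serializes a document to
    // US-ASCII bytes, letting XmlWriter escape the non-ASCII characters.
    static byte[] ToAsciiBytes(XmlDocument doc)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var buf = new MemoryStream())
        {
            // Dispose the writer before reading the buffer so all output is flushed.
            using (var writer = XmlWriter.Create(buf, settings))
            {
                doc.Save(writer);
            }
            return buf.ToArray();
        }
    }

    static void Main()
    {
        // The encoder on its own replaces unrepresentable characters with '?'.
        byte[] lossy = Encoding.ASCII.GetBytes("\u03C0");
        Console.WriteLine(lossy[0] == (byte)'?');   // True

        var doc = new XmlDocument();
        doc.LoadXml("<x>\u201C\u221E\u03C0\u201D</x>");

        byte[] bytes = ToAsciiBytes(doc);
        Console.WriteLine(Encoding.ASCII.GetString(bytes));

        // Round trip: parse the ASCII bytes and compare the text content.
        var reloaded = new XmlDocument();
        using (var input = new MemoryStream(bytes))
        {
            reloaded.Load(input);
        }
        Console.WriteLine(
            reloaded.DocumentElement.InnerText == doc.DocumentElement.InnerText);
    }
}
```

The second line printed should match the output shown above, and the final comparison should report that the reloaded text equals the original. If the declaration is unwanted, XmlWriterSettings also has an OmitXmlDeclaration property that can be set alongside Encoding.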
