How do I remove invalid XML characters from a string in Perl?

I'm looking for the standard, approved, and robust way to strip invalid characters from strings before writing them to an XML file. I'm talking here about blocks of text containing backspace (^H), formfeed, and similar characters.

There has to be a standard library/module function for doing this but I can't find it.

I'm using XML::LibXML to build a DOM tree that I then serialize to disk.

10 solutions

#1

The complete regex for removing characters that are invalid in XML 1.0 is:

# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;

For XML 1.1 it is:

# allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
# restricted: [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
$str =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//go;
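
The two patterns above can be wrapped in a small helper. This is only a sketch: the name strip_invalid_xml_chars and its version argument are made up for illustration, and it assumes the input is already a decoded character string rather than raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): removes characters that
# XML 1.0 or XML 1.1 does not allow to appear literally.
# Assumes $str is a decoded character string, not raw bytes.
sub strip_invalid_xml_chars {
    my ($str, $version) = @_;
    $version = '1.0' unless defined $version;
    if ($version eq '1.1') {
        # Drop everything outside the XML 1.1 Char production...
        $str =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
        # ...then drop the restricted characters as well.
        $str =~ s/[\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//g;
    }
    else {
        # Drop everything outside the XML 1.0 Char production.
        $str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    }
    return $str;
}

my $clean = strip_invalid_xml_chars("back\x08space and form\x0Cfeed\n");
print $clean;
```

Returning a cleaned copy instead of modifying the argument in place keeps the original string available, which helps if you want to log what was removed.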

#2

As almost everyone else has said, use a regular expression. It's honestly not complex enough to be worth adding to a library. Preprocess your text with a substitution.

Your comment about linefeeds above suggests that the formatting is of some importance to you so you will possibly have to decide exactly what you want to replace some characters with.
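
One way to make that decision explicit is a small lookup table. The sketch below assumes a policy you may or may not want (a formfeed becomes a newline, every other control character is dropped); the %replacement table and its contents are purely illustrative.

```perl
use strict;
use warnings;

# Illustrative policy only: map specific control characters to a
# replacement; any control character not listed is deleted.
my %replacement = (
    "\x0C" => "\n",   # formfeed -> newline, so the page break stays visible
);

my $text = "page one\x0Cpage two\x08!";

# Matches the ASCII control characters except tab, linefeed and carriage return.
$text =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F])/exists $replacement{$1} ? $replacement{$1} : ''/ge;

print $text, "\n";
```

Keeping the policy in a hash means the substitution itself never changes when you later decide to treat another character differently.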

The list of invalid characters is clearly defined in the XML spec (here - http://www.w3.org/TR/REC-xml/#charsets - for example). The disallowed characters are the ASCII control characters bar carriage return, linefeed and tab. So, you are looking at a 29 character regular expression character class. That's not too bad surely.

Something like:

$text =~ s/[\x00-\x08\x0B\x0C\x0E-\x19]//g;

should do it.

#3

I've found a solution, but it uses the iconv command instead of perl.

$ iconv -c -f UTF-8 -t UTF-8 invalid.utf8 > valid.utf8

The regex-based solutions given above do not work! Consider the following example:

$ perl -e 'print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\x{A0}\x{A0}</root>"' > invalid.xml
$ perl -e 'use XML::Simple; XMLin("invalid.xml")'
invalid.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA0 0xA0 0x3C 0x2F
$ perl -ne 's/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go; print' invalid.xml > valid.xml
$ perl -e 'use XML::Simple; XMLin("valid.xml")'
invalid.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA0 0xA0 0x3C 0x2F

In fact, the two files invalid.xml and valid.xml are identical.

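The problem is that the regex checks code points, not whether the input is well-formed UTF-8. Perl reads the file as bytes, so each 0xA0 byte is seen as the character U+00A0, which falls inside the allowed range "\x20-\x{D7FF}". The invalid byte sequence "\x{A0}\x{A0}" is therefore left untouched.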
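
If you would rather stay in Perl than shell out to iconv, a rough equivalent is to decode the bytes first and only then filter characters. The sketch below assumes the input is meant to be UTF-8 and uses the core Encode module; dropping every U+FFFD is a simplification that also removes any genuine replacement characters already in the data.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as read from the file: 0xA0 0xA0 is not well-formed UTF-8.
my $bytes = "<root>\xA0\xA0</root>";

# With no CHECK argument, decode() substitutes U+FFFD for malformed input.
my $chars = decode('UTF-8', $bytes);

# Drop the substitution characters (this also drops any genuine U+FFFD).
$chars =~ s/\x{FFFD}//g;

# Now the character-level XML 1.0 filter can be applied safely.
$chars =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;

print "$chars\n";
```

Remember to encode the result again, for example with Encode::encode('UTF-8', $chars), before writing it to a file opened in raw mode.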

#4

Transliteration (tr) is a lot faster than regex substitution, especially if all you want to do is delete characters. Using newt's set:

$string_to_clean =~ tr/\x00-\x08\x0B\x0C\x0E-\x19//d;

A test like this:

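use Benchmark qw(cmpthese);

# $text holds the sample string being cleaned (its contents are not shown here)
cmpthese 1_000_000
       , { translate => sub {
               my $copy = $text;
               $copy =~ tr/\x00-\x08\x0B\x0C\x0E-\x19//d;
           }
           , substitute => sub {
               my $copy = $text;
               $copy =~ s/[\x00-\x08\x0B\x0C\x0E-\x19]//g;
           }
         };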

yielded:

                Rate substitute  translate
substitute  287770/s         --       -86%
translate  2040816/s       609%         --

And the more characters I needed to delete, the faster tr became relative to the substitution.

#5

If you use an XML library to build your XML (as opposed to string concatenation, simple templates, etc), then it should take care of that for you. There is no point in reinventing the wheel.

XML::LibXML
XML::Twig
XML::Smart
XML::Simple
etc

#6

Okay, this seems to be already answered, but what the hey. If you want to author XML documents, you must use an XML library.

#!/usr/bin/perl
use strict;
use XML::LibXML;

my $doc = XML::LibXML::Document->createDocument('1.0');
$doc->setURI('http://example.com/myuri');
$doc->setDocumentElement($doc->createElement('root-node'));

$doc->documentElement->appendTextChild('text-node',<<EOT);
    This node contains &, ñ, á, <, >...
EOT

print $doc->toString;

This produces the following:

$ perl test.pl
<?xml version="1.0"?>
<root-node><text-node>    This node contains &amp;, &#xF1;, &#xE1;, &lt;, &gt;...
</text-node></root-node>

Edit: I now see that you are already using XML::LibXML. This should do the trick.

#7

You could use a regular expression to remove control characters. For example, \cH and \cL, or equivalently \x08 and \x0C, match backspace and formfeed respectively.
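
As a minimal sketch of that idea, the snippet below removes just those two characters; the sample string is made up for illustration.

```perl
use strict;
use warnings;

my $text = "back\cHspace and form\cLfeed";

# \cH is Ctrl-H (backspace, \x08); \cL is Ctrl-L (formfeed, \x0C).
(my $clean = $text) =~ s/[\cH\cL]//g;

print "$clean\n";
```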

#8

You can use a simple regex to find all control characters in your chunk of text and either replace them with a space or remove them altogether:

# Replace all control characters with a space
$text =~ s/[[:cntrl:]]/ /g;

# or remove them
$text =~ s/[[:cntrl:]]//g;
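
Note that [[:cntrl:]] also matches tab, linefeed and carriage return, which XML allows and which usually carry the formatting. A possible variant that keeps those three is sketched here; the sample string is made up, and the pattern is one way to write it, not the only one.

```perl
use strict;
use warnings;

my $text = "col1\tcol2\x08\nline2\x0C\n";

# Remove control characters but keep tab, linefeed and carriage return.
# The double negation reads: "not (a non-control character, or \t, \n, \r)".
(my $clean = $text) =~ s/[^\P{Cntrl}\t\n\r]//g;

print $clean;
```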

#9

I haven't done a lot of work with XML containing "invalid" characters before, but it seems to me you have two completely separate problems here.

First, there are characters in your data that you may not want. You should decide what those are and how you want to remove/replace them independent of any XML restrictions. For instance, you may have things like x^H_y^H_z^H_ where you decide you want to strip both the backspace and the following character. Or it's possible that you in fact don't want to adjust your data but feel forced to by the need to represent it in XML.
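
For the x^H_y^H_z^H_ case, a minimal sketch of stripping each backspace together with the character that follows it could look like this (the variable names are illustrative):

```perl
use strict;
use warnings;

# Overstrike-style input: each letter is followed by backspace + underscore.
my $text = "x\x08_y\x08_z\x08_";

# Strip each backspace together with the character that follows it.
# /s lets "." match a newline too, in case one follows a backspace.
(my $clean = $text) =~ s/\x08.//gs;

print "$clean\n";
```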

Update: I've preserved the following paragraphs for posterity, but they are based on a misunderstanding: I thought you could include any character in XML data so long as you encoded it properly, but it seems there are some characters that are outright verboten, even encoded? XML::LibXML strips these out (at least the current version does so), except for the nul character, which it treats as the end of the string, discarding it and anything that follows :(

Second, you may have characters in your data that you've kept that need encoding in XML. Ideally, whatever XML module you use would do this for you, but if it isn't, you should be able to do it manually, with something like:

use HTML::Entities "encode_entities_numeric";
$encoded_string = encode_entities_numeric( $string, "\x00-\x08\x0B\x0C\x0E-\x19");

But that's really just a stopgap measure. Use a proper XML module; see for instance this answer.

#10

Axeman's right about using tr, but he and newt made a little mistake inverting the XML spec's range of legal characters. http://www.w3.org/TR/REC-xml/#charsets gives

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

and since the hexadecimal number before \x20 is \x1F (not \x19!) you should use

$string_to_clean =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
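
To see the difference between the two upper bounds, here is a small sketch using the escape character \x1B, which lies between \x19 and \x20; the sample string is made up for illustration.

```perl
use strict;
use warnings;

# \x1B (escape) lies between \x19 and \x20, so the two ranges differ on it.
my $text = "a\x1Bb\x08c\n";

(my $short = $text) =~ tr/\x00-\x08\x0B\x0C\x0E-\x19//d;   # leaves \x1B in place
(my $full  = $text) =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;   # removes it

printf "short range kept escape: %s\n", ($short =~ /\x1B/ ? 'yes' : 'no');
printf "full range kept escape:  %s\n", ($full  =~ /\x1B/ ? 'yes' : 'no');
```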
