PHP: Removing invalid UTF-8 characters from XML using a filter

Time: 2022-10-24 23:49:13

I have a large file, so I have created a filter for removing invalid utf-8 characters from XML.

class ValidUTF8XMLFilter extends php_user_filter {

    protected static $pattern = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

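    public function filter($in, $out, &$consumed, $closing): int
    {
        // Process every bucket currently waiting in the input brigade.
        while ($bucket = stream_bucket_make_writeable($in)) {
            // Count the original length as consumed before the data shrinks.
            $consumed += $bucket->datalen;
            // Keep group 1 (valid, XML-safe sequences); any other byte matches "." and is dropped.
            $bucket->data = preg_replace(self::$pattern, '$1', $bucket->data);
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }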
}

This filter removes not only characters that are invalid in XML, but also byte sequences that are invalid in UTF-8. The regex is taken from "Multilingual form encoding". The class is adapted from this answer: "How to skip invalid characters in XML file using PHP". The pattern in that answer does not handle some invalid characters, e.g. 0x1D.

Will this filter work when a multi-byte sequence starts at the end of one buffer and ends at the beginning of the next filter call? Is this situation possible?

1 Solution

#1


No, I don't think it will work. It will strip valid sequences of code units that happen to be split between several buckets.
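
To make the failure concrete, here is a small sketch that applies the question's pattern to a string as a whole and then to the same string cut in the middle of a two-byte character. The helper `stripChunk` and the sample text are made up for illustration; only the pattern comes from the question.

```php
<?php
// Pattern copied from the question: group 1 matches well-formed, XML-safe
// UTF-8 sequences; the trailing "." matches any other single byte.
$pattern = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

// Hypothetical helper: what the question's filter does to one bucket.
function stripChunk(string $pattern, string $chunk): string
{
    return preg_replace($pattern, '$1', $chunk);
}

$text = "caf\xC3\xA9!"; // "café!", where "é" is the two bytes C3 A9

// Filtering the whole string keeps the valid two-byte sequence.
$whole = stripChunk($pattern, $text);

// Filtering two chunks that split "é" drops both of its bytes, because
// each half on its own is not a valid sequence.
$split = stripChunk($pattern, "caf\xC3") . stripChunk($pattern, "\xA9!");

var_dump($whole === $text);  // bool(true)
var_dump($split === "caf!"); // bool(true)
```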

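It should not consume a potentially incomplete sequence at the end of the data. Those bytes should be held back until more data arrives, and if nothing complete is left to pass on, the filter should pass nothing and return PSFS_FEED_ME.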
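
One way this could look is sketched below. It is not a tested drop-in replacement: the class name `BufferedValidUtf8XmlFilter`, the helper `incompleteTailLength`, the `$carry` property, and the filter name `demo.valid_utf8_xml` are all invented for this example, and it assumes PHP 7.4 or later. The idea is to keep up to three trailing bytes that might be the start of a multi-byte sequence and prepend them to the next call.

```php
<?php
class BufferedValidUtf8XmlFilter extends php_user_filter
{
    // Same pattern as in the question.
    private const PATTERN = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

    // Bytes held back from the previous call (assumption: at most 3).
    private string $carry = '';

    // Number of trailing bytes that may be a truncated multi-byte sequence.
    public static function incompleteTailLength(string $data): int
    {
        $len = strlen($data);
        // A sequence has at most 4 bytes, so a truncated one has at most 3.
        for ($i = 1; $i <= 3 && $i <= $len; $i++) {
            $byte = ord($data[$len - $i]);
            if (($byte & 0xC0) === 0x80) {
                continue; // continuation byte: keep looking for the lead byte
            }
            if ($byte >= 0xF0) {
                $need = 4;
            } elseif ($byte >= 0xE0) {
                $need = 3;
            } elseif ($byte >= 0xC0) {
                $need = 2;
            } else {
                $need = 1; // single-byte character, nothing to hold back
            }
            return $need > $i ? $i : 0;
        }
        return 0;
    }

    public function filter($in, $out, &$consumed, $closing): int
    {
        // Start from the held-back bytes, then add all incoming buckets.
        $data = $this->carry;
        $this->carry = '';
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $data .= $bucket->data;
        }

        // Unless the stream is closing, hold back a possibly incomplete tail.
        if (!$closing) {
            $tail = self::incompleteTailLength($data);
            if ($tail > 0) {
                $this->carry = substr($data, -$tail);
                $data = substr($data, 0, -$tail);
            }
        }

        $clean = preg_replace(self::PATTERN, '$1', $data);
        if ($clean === null) {
            return PSFS_ERR_FATAL; // the regex engine reported an error
        }
        if ($clean === '') {
            return PSFS_FEED_ME; // nothing complete to pass on yet
        }
        stream_bucket_append($out, stream_bucket_new($this->stream, $clean));
        return PSFS_PASS_ON;
    }
}

// Usage sketch: two writes that split "é" (C3 A9) across filter calls.
stream_filter_register('demo.valid_utf8_xml', BufferedValidUtf8XmlFilter::class);
$fp = fopen('php://memory', 'w+');
$filter = stream_filter_append($fp, 'demo.valid_utf8_xml', STREAM_FILTER_WRITE);
fwrite($fp, "caf\xC3");
fwrite($fp, "\xA9!\x1D");
stream_filter_remove($filter); // flushes the filter with $closing = true
rewind($fp);
var_dump(stream_get_contents($fp) === "caf\xC3\xA9!"); // bool(true)
```

On the closing call nothing is held back, so a truncated sequence at the very end of the stream is passed to the regex and stripped like any other invalid bytes.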
