Saving data from a mixed-encoding XML file to a MySQL database as UTF-8

Date: 2022-10-24 22:39:21

I have an XML file with mixed encoding: the file declares itself as ISO-8859-1, but it also contains characters from Windows-1252 (trademark symbol, en dash, etc.).
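
To make the mismatch concrete, here is a minimal sketch, assuming PHP with the iconv extension available. The byte 0x99 is only an illustrative pick: Windows-1252 assigns it to the trademark sign, while ISO-8859-1 treats it as a C1 control character, so the declared encoding decides what the parser sees.

```php
<?php
// One raw byte taken from the 0x80..0x9F range, where the two encodings disagree.
$byte = "\x99";

// Decoded as ISO-8859-1, the byte becomes the control character U+0099.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

// Decoded as Windows-1252, the same byte becomes U+2122 (trade mark sign).
$asCp1252 = iconv('CP1252', 'UTF-8', $byte);

echo bin2hex($asLatin1), "\n"; // c299
echo bin2hex($asCp1252), "\n"; // e284a2
```

The first result is the kind of two-byte sequence that tends to show up as a box character once it reaches the database.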

I'm using PHP and XMLReader to parse the XML file and save it to a database. A MySQL 5.0 server stores the mixed-encoding characters as box characters, but MySQL 5.1 raises an error.

So the question is: what is the easiest, foolproof way to save the data correctly as UTF-8?

This is my current code for converting it to UTF-8. Could it cause problems during the conversion?

    function cp1252_to_utf8($str)
    {
        // Keys are the UTF-8 encodings of the C1 control characters that
        // utf8_encode() produces for the bytes 0x80..0x9F; values are the
        // UTF-8 encodings of the characters Windows-1252 assigns to those bytes.
        $cp1252_map = array(
            "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
            "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
            "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
            "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
            "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
            "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
            "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
            "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
            "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
            "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
            "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
            "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
            "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
            "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
            "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
            "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
            "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
            "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
            "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
            "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
            "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
            "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
            "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
            "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION */
            "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
            "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
            "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
        );

        // Encode as if the input were ISO-8859-1, then swap the C1 controls
        // for their Windows-1252 equivalents.
        return strtr(utf8_encode($str), $cp1252_map);
    }


    $sql = 'SET NAMES "utf8" COLLATE "utf8_swedish_ci"';
    mysql_query($sql);

    $arr_book["booktitle"] = cp1252_to_utf8(iconv("UTF-8", "ISO-8859-1//TRANSLIT", $arr_book["booktitle"]));

1 answer

#1

If you have mixed encodings in the same column, you have only one reasonable option: store the data as binary rather than in a specific character set. If the file is actually in cp1252, though (which overlaps with ISO-8859-1 for the most part, so you can probably just declare cp1252 as the input encoding), simply call the iconv function on it before loading it as XML: $utf8string = iconv('cp1252', 'utf-8', $string);
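
The iconv suggestion above could look roughly like the following. This is a sketch under stated assumptions, not a definitive implementation: it assumes PHP 7 or later with the iconv and XMLReader extensions, the helper name `xml_cp1252_to_utf8` and the sample document are made up for illustration, and rewriting the XML declaration is an extra step not mentioned in the answer. Without that step, the parser would still honor the old ISO-8859-1 declaration and decode the UTF-8 bytes a second time.

```php
<?php
// Hypothetical helper: treat the raw bytes as Windows-1252, convert them to
// UTF-8, and update the XML declaration to match the new encoding.
function xml_cp1252_to_utf8(string $rawXml): string
{
    $utf8 = iconv('CP1252', 'UTF-8', $rawXml);
    if ($utf8 === false) {
        // Some iconv implementations reject bytes that CP1252 leaves undefined.
        throw new RuntimeException('Input is not valid Windows-1252');
    }

    // Replace the encoding pseudo-attribute in the declaration, if present.
    return preg_replace(
        '/^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2/i',
        '$1$2UTF-8$2',
        $utf8,
        1
    );
}

// Illustrative input: declared as ISO-8859-1 but containing the CP1252 bytes
// 0x99 (trade mark sign) and 0x96 (en dash).
$raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
     . "<book><booktitle>Widget\x99 \x96 2nd ed.</booktitle></book>";

$title = null;
$reader = new XMLReader();
$reader->XML(xml_cp1252_to_utf8($raw));
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'booktitle') {
        $title = $reader->readString(); // UTF-8 text, ready for a utf8 column
    }
}
```

With this approach the value read from XMLReader is already UTF-8, so the round trip through iconv(..., "ISO-8859-1//TRANSLIT", ...) and the hand-written mapping table from the question should no longer be needed.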
