PHP: How do I correctly split a Unicode Korean string?

Time: 2021-09-08 03:45:53

I have a problem: I can't seem to write certain Korean characters to a file. Let me try to explain. These are the steps I take.

  1. An MS Access DB file (US version) has a table with Korean text in it. I export this table as a text file with UTF-8 encoding. Let's call it "A.txt".

  2. When A.txt is read, stored in an array, then written to a new file (B.txt), all characters display properly. I'm using header("Content-Type: text/plain; charset=UTF-8"); at the very beginning of my PHP script. I simply use fwrite($fh, $someStr).

  3. When I read B.txt in another script and write to yet another new file (C.txt), one particular column causes the characters to show up something like this: ¸ì¹˜ ì–´ëœíŠ¸ 나ì¼ë¡. (Obviously, in the PHP code I'm not working with a table or matrix; I mean a column in terms of the text file format once the data is written back out.) This entire column has broken characters. If I have 5 columns in a text file, delimited by commas and enclosed in double quotes, this column will break the Korean characters in all of the other columns too. If I omit this column when writing the text file, all is well.
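
One way to narrow down which column is at fault is to check each field for valid UTF-8 before writing it. The sketch below assumes the mbstring extension is available and that the script file is saved as UTF-8; findInvalidUtf8Fields() and the sample row are hypothetical, not part of the original scripts. The third sample field is deliberately cut in the middle of a three-byte Hangul syllable with substr(), which counts bytes rather than characters.

```php
<?php
/**
 * Return the indexes of the fields that are not valid UTF-8.
 * (Hypothetical helper name, for illustration only.)
 */
function findInvalidUtf8Fields(array $fields): array
{
    $bad = [];
    foreach ($fields as $i => $field) {
        if (!mb_check_encoding($field, 'UTF-8')) {
            $bad[] = $i;
        }
    }
    return $bad;
}

// Hypothetical row: each Hangul syllable is 3 bytes in UTF-8, so a
// byte length of 4 cuts the second syllable of the last field in half.
$row = ['abc', '나일론', substr('어댑터', 0, 4)];

foreach (findInvalidUtf8Fields($row) as $index) {
    echo "Field $index is not valid UTF-8\n"; // prints: Field 2 is not valid UTF-8
}
```

A single invalid byte sequence may be enough for some editors to stop treating the whole file as UTF-8, which could explain why one bad column appears to break the others.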

Now, I noticed that certain PHP functions/operations break the Unicode characters. For example, if I use preg_replace() for any of the Korean strings and try to fwrite() that, it will break. However, I'm not performing anything that I'm not already doing on other fields/columns (speaking in terms of text file format), and other sections are not broken.

Does anyone have any idea how to rectify this? I've tried utf8_encode() and mb_convert_encoding() in different ways with no success. From what I've read, utf8_encode() wouldn't even be necessary if my file is UTF-8 to begin with. I've tried setting my computer's language to Korean as well.
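
The remark about utf8_encode() can be checked directly. utf8_encode() treats its input as ISO-8859-1, so running it on text that is already UTF-8 encodes every non-ASCII byte a second time. The sketch below uses mb_convert_encoding() from ISO-8859-1 as a stand-in that performs the same conversion (utf8_encode() itself is deprecated as of PHP 8.2). The sample string is hypothetical, and the script file is assumed to be saved as UTF-8 with the mbstring extension loaded.

```php
<?php
$original = '어'; // one Hangul syllable, 3 bytes in UTF-8 (EC 96 B4)

// Treat the UTF-8 bytes as ISO-8859-1 and convert them to UTF-8 again.
// Each of the three bytes becomes a two-byte sequence.
$doubled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo strlen($original), "\n";  // 3
echo strlen($doubled), "\n";   // 6
echo bin2hex($doubled), "\n";  // c3acc296c2b4

// Reversing the conversion recovers the original bytes.
$restored = mb_convert_encoding($doubled, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $original); // bool(true)
```

If the data is already UTF-8 at every step, no conversion call should be needed at all.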

I've spent 2 days on this already, and it's becoming a huge waste of time. Please help!

  • UPDATE: I think I may have found the culprit. In the script that creates B.txt, I split a long Korean string into two (using string ...<br /><br />... as indicator) and assign them to different columns. I think this splitting operation is ultimately causing the problem.

  • NEW QUESTION: How do I go about splitting this long string into two while preserving the Unicode characters? Previously, I used strpos() and substr(), but I am reading that the mb_*() functions might be what I need. Testing now.
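
A minimal sketch of two ways to do that split, assuming the delimiter is exactly the string <br /><br /> (whatever the "..." around it stands for is not known here) and that the mbstring extension is loaded. The sample text and the function names splitOnce() and splitOnceMb() are hypothetical. Because the delimiter is plain ASCII, explode() can only match it on a character boundary in valid UTF-8; the mb_*() variant works in character offsets throughout. The thing to avoid is mixing the two, for example passing an offset from mb_strpos() to substr().

```php
<?php
// Assumes this file is saved as UTF-8 and ext-mbstring is enabled.

/** Split at the first occurrence of $delimiter using explode(). */
function splitOnce(string $text, string $delimiter): array
{
    // Limit 2: everything after the first delimiter stays in one piece.
    $parts = explode($delimiter, $text, 2);
    return [$parts[0], $parts[1] ?? ''];
}

/** Same split, using character offsets from the mb_*() functions. */
function splitOnceMb(string $text, string $delimiter): array
{
    $pos = mb_strpos($text, $delimiter, 0, 'UTF-8');
    if ($pos === false) {
        return [$text, '']; // delimiter not found: nothing to split
    }
    $head = mb_substr($text, 0, $pos, 'UTF-8');
    $tail = mb_substr($text, $pos + mb_strlen($delimiter, 'UTF-8'), null, 'UTF-8');
    return [$head, $tail];
}

$long = '나일론 어댑터<br /><br />두 번째 부분'; // hypothetical sample value

[$first, $second] = splitOnce($long, '<br /><br />');
echo $first, "\n";  // 나일론 어댑터
echo $second, "\n"; // 두 번째 부분
```

Both functions return the same two pieces for valid UTF-8 input; the explode() version does not depend on mbstring.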

1 solution

#1


Try the Unicode modifier (u) for the preg_*() functions.

http://php.net/manual/en/reference.pcre.pattern.modifiers.php

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

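
To illustrate what the u modifier changes, the sketch below runs the same replacement with and without it. The sample strings are hypothetical, the file is assumed to be saved as UTF-8, and mb_check_encoding() requires the mbstring extension. Without u, the character class [어] is read as three separate bytes, and two of those bytes also occur in 언, so the replacement removes part of that character.

```php
<?php
$subject = '언'; // bytes EC 96 B8; 어 is EC 96 B4

// Byte mode: the class matches the bytes EC and 96 inside 언,
// leaving a lone B8 byte behind.
$withoutU = preg_replace('/[어]/', '', $subject);

// UTF-8 mode: the class matches only the whole character 어.
$withU = preg_replace('/[어]/u', '', $subject);

var_dump(mb_check_encoding($withoutU, 'UTF-8')); // bool(false)
var_dump($withU === $subject);                   // bool(true)

// With u, an empty pattern splits a string into whole characters.
$chars = preg_split('//u', '나일론', -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n"; // 3
```

With the u modifier, preg_replace() returns null if the subject is not valid UTF-8, and preg_last_error() then returns PREG_BAD_UTF8_ERROR, so it is worth checking the return value before writing it to a file.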
