Linux and C programming: how do I write UTF-8 encoded text to a file?

Time: 2021-04-16 15:05:57

I am interested in writing utf-8 encoded strings to a file.

I did this with the low-level functions open() and write(). First I set the locale to a UTF-8 aware one with setlocale(LC_ALL, "de_DE.utf8"). But the resulting file does not contain UTF-8 characters, only ISO-8859 encoded umlauts. What am I doing wrong?

Addendum: I don't know if my strings are really utf-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";

See this screenshot for the content of the text file: http://img19.imageshack.us/img19/9791/picture1jh9.png

3 answers

#1


Changing the locale won't change the actual data written to the file by write(): it writes exactly the bytes you give it. You have to actually produce UTF-8 encoded bytes in order to write them to a file. For that purpose you can use a library such as ICU.

Edit after your edit of the question: UTF-8 differs from ISO-8859 only in how it encodes the "special" symbols (ümlauts, áccénts, etc.); plain ASCII text is byte-for-byte identical in both. So for any text that doesn't contain these symbols, the two encodings are equivalent. However, if your program contains strings with those symbols, you have to make sure your text editor saves the source file as UTF-8. Sometimes you just have to tell it to.

To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.
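
To illustrate that point, here is a minimal sketch (not the asker's actual code) that writes a string with open() and write(). The names write_string and msg_utf8 are made up for this example. The "ü" is spelled out as its two UTF-8 bytes, 0xC3 0xBC, so the result does not depend on how the editor saved the source file:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write a NUL-terminated string to path, byte for byte.
   write() never re-encodes anything, whatever the locale is.
   Returns 0 on success, -1 on error. */
int write_string(const char *path, const char *s)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t left = strlen(s);
    while (left > 0) {
        ssize_t n = write(fd, s, left);   /* may write fewer bytes than asked */
        if (n < 0) {
            close(fd);
            return -1;
        }
        s += n;
        left -= (size_t)n;
    }
    return close(fd);
}

/* "Rote Grütze" with the u-umlaut given as explicit UTF-8 bytes.
   The literal is split so the hex escape cannot swallow following letters. */
static const char msg_utf8[] = "Rote Gr\xc3\xbc" "tze\n";
```

Calling write_string("out.txt", msg_utf8) should then produce a file whose umlaut is the byte pair c3 bc. If the same call with a plain "Rote Grütze" literal produces a single fc byte instead, the source file was saved as ISO-8859-1.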

Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv:

iconv -f latin1 -t utf8 file.c

This converts all your Latin-1 strings to UTF-8 (iconv writes the result to standard output), so when you print them they will definitely be in UTF-8. If iconv encounters a strange character, or the output strings contain strange characters (for example "Ã¼" where you expected "ü"), then your strings were already in UTF-8.
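
For the specific case of Latin-1 input, the mapping iconv applies is simple enough to sketch by hand in C. This is only an illustration of the conversion, not a replacement for iconv, and the function name latin1_to_utf8 is made up here. Bytes below 0x80 are copied unchanged; every other byte becomes a two-byte UTF-8 sequence:

```c
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 (Latin-1) string
   to UTF-8. The caller must provide room for 2 * strlen(in) + 1 bytes in out.
   Returns the number of bytes written, not counting the terminating NUL. */
size_t latin1_to_utf8(const char *in, char *out)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        if (*p < 0x80) {
            out[n++] = (char)*p;                    /* ASCII: same byte */
        } else {
            out[n++] = (char)(0xC0 | (*p >> 6));    /* lead byte: top 2 bits */
            out[n++] = (char)(0x80 | (*p & 0x3F));  /* continuation: low 6 bits */
        }
    }
    out[n] = '\0';
    return n;
}
```

With this mapping the Latin-1 byte fc ("ü") becomes c3 bc. Feeding it text that is already UTF-8 converts c3 and bc separately, which is where the "Ã¼" style garbage comes from.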

Regards,

#2


Yes, you can do it with glibc. They call it multibyte instead of UTF-8, because it can handle more than one encoding type. Check out this part of the manual.

Look for functions that start with the prefix mb, and also functions with the wc prefix, for converting between multibyte and wide characters. You'll have to set the locale to a UTF-8 one first with setlocale() so that glibc chooses the UTF-8 implementation of multibyte support.

If you are starting from Unicode (wide character) data, I believe the function you are looking for is wcstombs().

#3


Can you open the file in a hex editor and verify, with a simple input example, whether the written bytes really differ from the bytes of the Unicode characters that you passed to write()? Sometimes a text editor has no way to determine the character set, and yours may have assumed ISO8859-1.

Once you have done this, could you edit your original post to add the pertinent information?
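
If no hex editor is at hand, the check can be sketched in a few lines of C. The function name hex_dump is made up for this example; it prints each byte of a file as two hex digits, 16 bytes per line:

```c
#include <stdio.h>

/* Hypothetical helper: print every byte of the file at path to out as two
   hex digits, 16 bytes per line.
   Returns the number of bytes dumped, or -1 if the file cannot be opened. */
long hex_dump(const char *path, FILE *out)
{
    FILE *in = fopen(path, "rb");
    if (!in)
        return -1;

    long count = 0;
    int c;
    while ((c = fgetc(in)) != EOF) {
        /* a space after each byte, a newline after every 16th */
        fprintf(out, "%02x%c", c, (count % 16 == 15) ? '\n' : ' ');
        count++;
    }
    if (count % 16 != 0)
        fputc('\n', out);   /* finish a partial last line */
    fclose(in);
    return count;
}
```

For "Rote Grütze", a UTF-8 file shows c3 bc where the "ü" is, while an ISO-8859-1 file shows a single fc.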
