Does C++0x support converting std::wstring to a UTF-8 byte sequence?

I saw that C++0x will add support for UTF-8, UTF-16 and UTF-32 literals. But what about conversions between the three representations?

I plan to use std::wstring everywhere in my code, but I also need to manipulate UTF-8 encoded data when dealing with files and the network. Will C++0x also provide support for these operations?

2 answers

#1

In C++0x, char16_t and char32_t, not wchar_t, will be used to store UTF-16 and UTF-32.

From the draft n2798:

22.2.1.4 Class template codecvt

2 The class codecvt is for use when converting from one codeset to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.

3 The specializations required in Table 76 (22.1.1.1.1) convert the implementation-defined native character set. codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encodings schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encodings schemes. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for narrow and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementor.

Other encodings can be converted by specializing on a user-defined stateT type. The stateT object can contain any state that is useful to communicate to or from the specialized do_in or do_out members.
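As a rough illustration of the quoted paragraph, the sketch below drives the codecvt<char32_t, char, mbstate_t> specialization through its public out() member to turn a UTF-32 string into UTF-8. The helper name utf32_to_utf8 is hypothetical, and the code assumes a standard library that already ships this facet in the classic locale; newer standards deprecate it, so treat this as a sketch rather than a recommended long-term API.

```cpp
#include <cassert>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the draft): encode a UTF-32 string as UTF-8
// using the codecvt<char32_t, char, mbstate_t> facet described above.
std::string utf32_to_utf8(const std::u32string& in) {
    typedef std::codecvt<char32_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());

    if (in.empty()) return std::string();

    // One code point needs at most 4 bytes in UTF-8.
    std::string out(in.size() * 4, '\0');

    std::mbstate_t state = std::mbstate_t();
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result r = cvt.out(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("UTF-32 to UTF-8 conversion failed");

    // On success the whole input must have been consumed.
    assert(from_next == in.data() + in.size());

    out.resize(static_cast<std::string::size_type>(to_next - &out[0]));
    return out;
}
```

Calling utf32_to_utf8(U"\u00E9") should yield the two bytes 0xC3 0xA9. The reverse direction would use the facet's in() member in the same way.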

The thing about wchar_t is that it gives you no guarantees about the encoding used. It is simply a type that can hold a wide character. Period. If you are going to write software now, you have to live with this compromise. Fully C++0x-compliant compilers are still a long way off. You can always give the VC2010 CTP and g++ compilers a try, for what it is worth. Moreover, wchar_t has different sizes on different platforms, which is another thing to watch out for (2 bytes on VS/Windows, 4 bytes on GCC/Mac and so on). There are also options like -fshort-wchar for GCC that complicate the issue further.
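To make the size difference concrete, here is a small sketch that counts how many wchar_t units the compiler produces for U+1F600, a code point outside the Basic Multilingual Plane. Both function names are hypothetical, and the check assumes the wide execution character set is UTF-16 when wchar_t is 16 bits and UTF-32 when it is 32 bits, which is typical but not guaranteed by the standard.

```cpp
#include <cassert>
#include <climits>   // CHAR_BIT
#include <cstddef>
#include <string>

// Hypothetical helper: number of wchar_t units the compiler emits for U+1F600,
// a code point outside the Basic Multilingual Plane.
std::size_t wchar_units_for_u1f600() {
    const std::wstring s = L"\U0001F600";
    return s.size();
}

// Hypothetical startup check: a 32-bit wchar_t should need one unit,
// a 16-bit wchar_t (assumed UTF-16) should need a surrogate pair.
void check_wchar_layout() {
    const std::size_t bits = sizeof(wchar_t) * CHAR_BIT;
    const std::size_t expected = (bits >= 32) ? 1 : 2;
    assert(wchar_units_for_u1f600() == expected);
    (void)expected;  // avoids an unused-variable warning when NDEBUG is defined
}
```

Code that indexes a std::wstring by character position behaves differently in the two cases, which is one reason to convert to a fixed, known encoding at the boundaries.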

The best solution, therefore, is to use an existing library. Chasing Unicode bugs isn't the best use of your time and effort. I'd suggest you take a look at:

GNU libiconv
IBM's libicu

More on C++0x Unicode string literals here

#2

Thank you, dirkgently. I'm not yet registered, so I can't upvote or respond directly in a comment.

I've learned something about codecvt. I knew about the libraries you suggest; the following resource may also be useful: http://www.unicode.org/Public/PROGRAMS/CVTUTF/.

The project is a library that should be open source, so I would prefer to minimize dependencies on external libraries. I already depend on libgc and boost, though for the latter I only use threads. I would really prefer to stick to the C++ standard, and I'm a bit disappointed that GC support has somehow been dropped.

Apparently VC++ Express 2008, as well as icc, is said to support most of the C++0x standard. Since I currently develop with VC++ and it will still take some time before the library is released, I'd like to try using codecvt and char32_t strings.

Does anyone know how to do this? Should I post another question?
