C++ & Boost: encode/decode UTF-8

I'm trying to do a very simple task: take a Unicode-aware wstring and convert it to a string encoded as UTF-8 bytes, and then the other way around: take a string containing UTF-8 bytes and convert it to a Unicode-aware wstring.

The problem is, I need it to be cross-platform and I need it to work with Boost... and I just can't seem to figure out a way to make it work. I've been toying with

http://www.edobashira.com/2010/03/using-boost-code-facet-for-reading-utf8.html and
http://www.boost.org/doc/libs/1_46_0/libs/serialization/doc/codecvt.html

I've been trying to convert the code to use stringstream/wstringstream instead of files or whatever, but nothing seems to work.

For instance, in Python it would look like so:

>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'

What I'm ultimately after is this:

wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws); 
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}

I really don't want to add another dependency on the ICU or something in that spirit... but to my understanding, it should be possible with Boost.

Some sample code would greatly be appreciated! Thanks

4 solutions

#1

Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/ -- it's a header-only library that's very lightweight and easy to use. I'm sharing some demo code here, in case anyone finds it useful:

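#include <iterator>  // std::back_inserter
#include <string>
#include "utf8.h"    // UTF8-CPP (utfcpp)

// utf8to32/utf32to8 treat each wchar_t as a whole code point, so these helpers
// assume a 32-bit wchar_t (Linux, macOS). Where wchar_t is 16 bits (Windows),
// utf8::utf8to16 / utf8::utf16to8 are the matching calls.
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}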

Usage:

wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);
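
For anyone curious what such a conversion does underneath, here is a minimal dependency-free sketch. It is not utfcpp's implementation; the names utf32_to_utf8 and utf8_to_utf32 are hypothetical, and it works on std::u32string (C++11) rather than std::wstring so that it does not depend on the platform's wchar_t width.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
inline std::string utf32_to_utf8(const std::u32string& cps)
{
    std::string out;
    for (char32_t cp : cps) {
        // Surrogates and values above U+10FFFF are not valid code points.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: decode UTF-8 bytes; throws on malformed input.
inline std::u32string utf8_to_utf32(const std::string& bytes)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (bytes.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and out-of-range values.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 sequence");
        out += cp;
        i += len;
    }
    return out;
}
```

With these, utf32_to_utf8(U"\u05e9\u05dc\u05d5\u05dd") gives the same eight bytes as the Python example in the question, and utf8_to_utf32 turns them back into the four code points.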

#2

There's already a Boost link in the comments, but the almost-standard C++0x has std::wstring_convert, which does this:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
    wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
              << (ws2 == uchars ) << '\n';
}

Output when compiled with MS Visual Studio 2010 EE SP1 or with Clang++ 2.9:

true 
true
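
The same converter can be wrapped into the two functions the question asks for. This is only a sketch: the function names come from the question, and it assumes a 32-bit wchar_t. Where wchar_t is 16 bits, std::codecvt_utf8<wchar_t> only covers UCS-2, and std::codecvt_utf8_utf16<wchar_t> would be the facet to try. Also note that std::wstring_convert and <codecvt> were deprecated in C++17, so newer compilers may warn about them.

```cpp
#include <codecvt>  // std::codecvt_utf8 (deprecated since C++17)
#include <locale>   // std::wstring_convert
#include <string>

// Wide string -> UTF-8 bytes.
inline std::string encode_utf8(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

// UTF-8 bytes -> wide string.
// from_bytes throws std::range_error if the input cannot be converted.
inline std::wstring decode_utf8(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(bytes);
}
```

A fresh converter is created on each call to keep the functions stateless; reusing one converter object, as in the program above, works as well.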

#3

Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/16.

Here are some convenient examples from the docs:

string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);

Almost as easy as Python encoding/decoding :)

Note that Boost.Locale is not a header-only library.

#4

For a drop-in replacement for std::string/std::wstring that handles utf8, see tinyutf8.

In combination with <codecvt>, you can convert between UTF-8 and pretty much any other encoding, and then handle the UTF-8 data through the above library.
