UTF-16 encoding/decoding of std::string

Time: 2023-01-04 19:57:13

I have to handle a file format (both reading and writing) in which strings are encoded in UTF-16 (2 bytes per code unit). Since characters outside the ASCII range are rarely used in the application domain, all of the strings in my C++ model classes are stored in instances of std::string (UTF-8 encoded).

I'm looking for a library (I searched the STL and Boost with no luck) or a set of C/C++ functions to handle this std::string <-> UTF-16 conversion when loading from or saving to the file format (actually modeled as a bytestream), including the generation/recognition of surrogate pairs and the rest of the Unicode details (which I'm admittedly no expert in)...

Any suggestions? Thanks!

EDIT: forgot to mention it should be cross-platform (Win / Mac) and cannot use C++11.

2 answers

#1


12  

C++11 has this functionality:

std::string s = u8"Hello, World!";

// #include <codecvt>
std::wstring_convert<std::codecvt<char16_t,char,std::mbstate_t>,char16_t> convert;

std::u16string u16 = convert.from_bytes(s);
std::string u8 = convert.to_bytes(u16);

However, to my knowledge, the only implementation that has this so far is libc++. C++11 also has std::codecvt_utf8_utf16<char16_t>, which some other implementations provide. Specifically, codecvt_utf8_utf16 works in VS 2010 and above, and since Windows uses wchar_t to represent UTF-16, you can use it to convert between UTF-8 and Windows' native encoding.
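As a rough illustration of the codecvt_utf8_utf16 route mentioned above, here is a minimal sketch. The helper names utf8_to_utf16 and utf16_to_utf8 are made up for this example, and it assumes a C++11 standard library that ships <codecvt>. Note that std::wstring_convert and std::codecvt_utf8_utf16 were deprecated in C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units.
// wstring_convert throws std::range_error if the input cannot be converted.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: UTF-16 code units -> UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

Unlike the plain std::codecvt specialization, codecvt_utf8_utf16 has a public destructor, so it can be handed to wstring_convert directly.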


The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.

(C++11 standard, [locale.codecvt] 22.4.1.4/3)


Oh, and std::codecvt specializations have protected destructors, and wstring_convert requires access to the destructor so you really need an adapter:

template <class Facet>
class usable_facet : public Facet {
public:
    using Facet::Facet; // inherit constructors
    ~usable_facet() {}

    // workaround for compilers without inheriting constructors:
    // template <class ...Args> usable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
};

template<typename internT, typename externT, typename stateT> 
using codecvt = usable_facet<std::codecvt<internT, externT, stateT>>;

// The second template argument is needed: wstring_convert defaults Elem to wchar_t.
std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>,char16_t> convert;

#2


4  

Did you look at Boost.Locale? This page, in particular, describes how to do UTF to UTF conversions and how to integrate it with IOStreams.
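Since the question rules out C++11, it may also help to see what such a conversion does under the hood. The following is a dependency-free sketch in C++03 style, not Boost.Locale's implementation. The names utf16_unit, utf8_to_utf16_units and utf16_units_to_utf8 are invented for this example, unsigned short is assumed to be 16 bits wide, and overlong UTF-8 encodings are not rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

typedef unsigned short utf16_unit;  // hypothetical alias; assumes 16-bit unsigned short

// Decode UTF-8 and emit UTF-16 code units, generating surrogate pairs above U+FFFF.
std::vector<utf16_unit> utf8_to_utf16_units(const std::string& in) {
    std::vector<utf16_unit> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes that follow the lead byte
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (extra > in.size() - i - 1) throw std::runtime_error("truncated UTF-8");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<utf16_unit>(cp));
        } else {  // split into a high and a low surrogate
            cp -= 0x10000;
            out.push_back(static_cast<utf16_unit>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<utf16_unit>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Recognize surrogate pairs in UTF-16 and emit UTF-8 bytes.
std::string utf16_units_to_utf8(const std::vector<utf16_unit>& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: a low one must follow
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The code units still have to be written to the bytestream in whatever byte order the file format specifies; that step is left out here.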

