Converting a string from UTF-8 to ISO-8859-1

Time: 2022-11-13 14:33:51

I'm trying to convert a UTF-8 string to an ISO-8859-1 char* for use in legacy code. The only way I'm seeing to do this is with iconv.

I would definitely prefer a completely string-based C++ solution, so that I can just call .c_str() on the resulting string.

How do I do this? Code example if possible, please. I'm fine using iconv if it is the only solution you know.

3 solutions

#1


9  

I'm going to modify my code from another answer to implement the suggestion from Alf.

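#include <string>

std::string UTF8toISO8859_1(const char * in)
{
    std::string out;
    if (in == NULL)
        return out;

    // Start at 0 so that a stray continuation byte at the very beginning
    // of the input never reads an uninitialized value.
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;                               // ASCII
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);   // continuation byte
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                        // lead byte, 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                        // lead byte, 3-byte sequence
        else
            codepoint = ch & 0x07;                        // lead byte, 4-byte sequence
        ++in;
        // The code point is complete once the next byte is not a continuation byte.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}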

Invalid UTF-8 input results in dropped characters.
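
One way to fill in the out-of-bounds branch is to substitute a placeholder character. The following is only a minimal sketch under my own assumptions, not the answerer's code: the name utf8_to_latin1_lossy and the replacement parameter are hypothetical, and unlike the function above it also writes the placeholder for malformed sequences instead of dropping them.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 and emits `replacement` for every code
// point above 0xFF and for every malformed, truncated or overlong sequence.
std::string utf8_to_latin1_lossy(const std::string& in, char replacement = '?')
{
    // Smallest code point that legitimately needs 0, 1, 2 or 3 continuation bytes.
    static const unsigned int min_codepoint[] = { 0x0, 0x80, 0x800, 0x10000 };

    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        unsigned int codepoint = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead <= 0x7f)                      { codepoint = lead; }
        else if (lead >= 0xc0 && lead <= 0xdf) { codepoint = lead & 0x1f; extra = 1; }
        else if (lead >= 0xe0 && lead <= 0xef) { codepoint = lead & 0x0f; extra = 2; }
        else if (lead >= 0xf0 && lead <= 0xf7) { codepoint = lead & 0x07; extra = 3; }
        else
        {
            // Stray continuation byte or a byte that can never start a sequence.
            out.push_back(replacement);
            continue;
        }

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k)
        {
            // Stop without consuming the byte if it is missing or not a continuation byte.
            if (i >= in.size() || (static_cast<unsigned char>(in[i]) & 0xc0) != 0x80)
            {
                ok = false;
                break;
            }
            codepoint = (codepoint << 6) | (static_cast<unsigned char>(in[i]) & 0x3f);
            ++i;
        }
        if (codepoint < min_codepoint[extra])
            ok = false;  // overlong encoding

        out.push_back((ok && codepoint <= 0xff) ? static_cast<char>(codepoint) : replacement);
    }
    return out;
}
```

Calling utf8_to_latin1_lossy(s).c_str() then gives the char* for the legacy code, valid for as long as the returned string is kept alive.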

#2


6  

First convert UTF-8 to 32-bit Unicode.

Then keep the values that are in the range 0 through 255.

Those are the Latin-1 code points. For any other value, decide whether you want to treat it as an error or replace it with, say, code point 127 (my fav, the ASCII "DEL"), a question mark, or something else.

The C++ standard library defines a std::codecvt specialization that can be used for the first step:

template<>
codecvt<char32_t, char, mbstate_t>

C++11 §22.4.1.4/3: “the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes”
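
The two steps described above could be sketched with that specialization roughly as follows. This is an illustration under assumptions rather than a definitive implementation: Utf8ToUtf32Facet, utf8_to_utf32 and utf32_to_latin1 are hypothetical names, the choice to throw on invalid input is mine, and the specialization itself has been deprecated since C++20, so newer compilers may warn about it.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: std::codecvt has a protected destructor, so a derived
// type with a public one is needed to create the facet as a local object.
struct Utf8ToUtf32Facet : std::codecvt<char32_t, char, std::mbstate_t>
{
    ~Utf8ToUtf32Facet() {}
};

// Step 1: UTF-8 bytes to UTF-32 code points.
std::u32string utf8_to_utf32(const std::string& utf8)
{
    if (utf8.empty())
        return std::u32string();

    Utf8ToUtf32Facet facet;
    std::mbstate_t state = std::mbstate_t();
    // A UTF-8 string never decodes to more code points than it has bytes.
    std::u32string result(utf8.size(), U'\0');
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;
    const std::codecvt_base::result status = facet.in(
        state,
        utf8.data(), utf8.data() + utf8.size(), from_next,
        &result[0], &result[0] + result.size(), to_next);
    if (status != std::codecvt_base::ok)
        throw std::runtime_error("input is not valid UTF-8");
    result.resize(static_cast<std::size_t>(to_next - &result[0]));
    return result;
}

// Step 2: keep code points 0 through 255, replace everything else.
std::string utf32_to_latin1(const std::u32string& text, char replacement = '?')
{
    std::string out;
    out.reserve(text.size());
    for (char32_t c : text)
        out.push_back(c <= 0xff ? static_cast<char>(c) : replacement);
    return out;
}
```

Keeping the two steps separate makes it easy to swap the replacement policy (error, DEL, question mark) without touching the decoder.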

#3


1  

Alf's suggestion implemented in C++11:

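#include <string>
#include <locale>
#include <codecvt>
#include <algorithm>
#include <iterator>

int main()
{
    // u8 literals are const char* up to C++17; in C++20 they become const char8_t*.
    auto i = u8"H€llo Wørld";
    // std::wstring_convert lives in <locale>; it and <codecvt> are deprecated since C++17.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8;
    auto wide = utf8.from_bytes(i);  // throws std::range_error on invalid UTF-8
    std::string out;
    out.reserve(wide.length());
    // Keep the Latin-1 code points and replace everything else with '?'.
    std::transform(wide.cbegin(), wide.cend(), std::back_inserter(out),
               [](const wchar_t c) { return (c <= 255) ? static_cast<char>(c) : '?'; });
    // out now contains "H?llo W\xf8rld"
}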
