如何将UTF-8字符串转换为Unicode?

时间:2022-01-12 23:25:44

I have string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

我有一个显示UTF-8编码字符的字符串,我想把它转换回Unicode。

For now, my implementation is the following:

目前,我的执行情况如下:

public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}

I am playing with the word "déjà". I have converted it into UTF-8 through this online tool, and so I started to test my method with the string "déjÃ".

我在玩“deja”这个词。我已经通过这个在线工具转换为utf - 8,所以我开始测试方法的字符串“dA©jA”。

Unfortunately, with this implementation the string just remains the same.

不幸的是,有了这个实现,字符串仍然保持不变。

Where am I wrong?

我在哪儿错了?

4 个解决方案

#1


11  

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

因此问题是UTF-8代码单元值被存储为c#字符串中16位代码单元的序列。您只需验证每个代码单元都在字节范围内,将这些值复制为字节,然后将新的UTF-8字节序列转换为UTF-16。

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà

This is easy, however it would be best to find the root cause; the location where someone is copying UTF-8 code units into 16 bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding. E.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).

这很容易,但是最好找到根本原因;将UTF-8代码单元复制到16位代码单元的位置。可能的罪魁祸首是某人使用错误的编码将字节转换为c#字符串。例如Encoding.Default。GetString(utf8Bytes 0 utf8Bytes.Length)。


Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes:

或者,如果你确定你知道不正确的编码用于生产的字符串,而不正确的编码转换无损(通常情况下如果不正确的编码是一个字节编码),那么您可以简单地做反编码步骤原始utf - 8数据,然后你可以做正确的转换从utf - 8字节:

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}

UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding(1252), Encoding.UTF8);

#2


9  

If you have a UTF-8 string, where every byte is correct ('Ö' -> [195, 0] , [150, 0]), you can use the following:

如果您有一个UTF-8字符串,其中每个字节都是正确的(' O ' ->[195, 0],[150, 0]),您可以使用以下方法:

public static string Utf8ToUtf16(string utf8String)
{
    /***************************************************************
     * Every .NET string will store text with the UTF-16 encoding, *
     * known as Encoding.Unicode. Other encodings may exist as     *
     * Byte-Array or incorrectly stored with the UTF-16 encoding.  *
     *                                                             *
     * UTF-8 = 1 bytes per char                                    *
     *    ["100" for the ansi 'd']                                 *
     *    ["206" and "186" for the russian '?']                    *
     *                                                             *
     * UTF-16 = 2 bytes per char                                   *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["186, 3" for the russian '?']                           *
     *                                                             *
     * UTF-8 inside UTF-16                                         *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["206, 0" and "186, 0" for the russian '?']              *
     *                                                             *
     * First we need to get the UTF-8 Byte-Array and remove all    *
     * 0 byte (binary 0) while doing so.                           *
     *                                                             *
     * Binary 0 means end of string on UTF-8 encoding while on     *
     * UTF-16 one binary 0 does not end the string. Only if there  *
     * are 2 binary 0, than the UTF-16 encoding will end the       *
     * string. Because of .NET we don't have to handle this.       *
     *                                                             *
     * After removing binary 0 and receiving the Byte-Array, we    *
     * can use the UTF-8 encoding to string method now to get a    *
     * UTF-16 string.                                              *
     *                                                             *
     ***************************************************************/

    // Get UTF-8 bytes and remove binary 0 bytes (filler)
    List<byte> utf8Bytes = new List<byte>(utf8String.Length);
    foreach (byte utf8Byte in utf8String)
    {
        // Remove binary 0 bytes (filler)
        if (utf8Byte > 0) {
            utf8Bytes.Add(utf8Byte);
        }
    }

    // Convert UTF-8 bytes to UTF-16 string
    return Encoding.UTF8.GetString(utf8Bytes.ToArray());
}

In my case the DLL result is a UTF-8 string too, but unfortunately the UTF-8 string is interpreted with UTF-16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–' which is 150 was converted to the UTF-16 '–' which is 8211. If you have this case too, you can use the following instead:

在我的例子中,DLL结果也是一个UTF-8字符串,但是不幸的是,UTF-8字符串被解释为UTF-16编码(' O ' ->[195, 0],[19,32])。因此,ANSI ',即150被转换为UTF-16 ',即8211。如果你也有这种情况,你可以用以下方法代替:

public static string Utf8ToUtf16(string utf8String)
{
    // Get UTF-8 bytes by reading each byte with ANSI encoding
    byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);

    // Convert UTF-8 bytes to UTF-16 bytes
    byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

    // Return UTF-16 bytes as UTF-16 string
    return Encoding.Unicode.GetString(utf16Bytes);
}

Or the Native-Method:

或本机方法:

[DllImport("kernel32.dll")]
private static extern Int32 MultiByteToWideChar(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr, Int32 cbMultiByte, [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr, Int32 cchWideChar);

public static string Utf8ToUtf16(string utf8String)
{
    Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
    if (iNewDataLen > 1)
    {
        StringBuilder utf16String = new StringBuilder(iNewDataLen);
        MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

        return utf16String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf16ToUtf8. Hope I could be of help.

如果您需要它,请参见Utf16ToUtf8。希望我能帮上忙。

#3


8  

I have string that displays UTF-8 encoded characters

我有一个显示UTF-8编码字符的字符串。

There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes into a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode codepoint. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.

在。net中没有这样的东西。string类只能以UTF-16编码存储字符串。UTF-8编码的字符串只能以字节[]的形式存在。尝试将字节存储到字符串中不会有好的结果;UTF-8使用的字节值没有有效的Unicode码点。当字符串被规范化时,内容将被销毁。因此,当您的DecodeFromUtf8()开始运行时,恢复字符串已经太晚了。

Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.

只处理带有字节[]的UTF-8编码文本。并使用UTF8Encoding.GetString()来转换它。

#4


2  

What you have seems to be a string incorrectly decoded from another encoding, likely code page 1252, which is US Windows default. Here's how to reverse, assuming no other loss. One loss not immediately apparent is the non-breaking space (U+00A0) at the end of your string that is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.

您看到的似乎是从另一个编码(可能是代码页1252)中错误解码的字符串,即US Windows默认。这里是如何逆转,假设没有其他损失。一个不明显的损失是在字符串末尾没有显示的非破坏空间(U+00A0)。当然,最好一开始就正确地读取数据源,但是可能数据源一开始就被错误地存储了。

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string junk = "déjÃ\xa0";  // Bad Unicode string

        // Turn string back to bytes using the original, incorrect encoding.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

        // Use the correct encoding this time to convert back to a string.
        string good = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(good);
    }
}

Result:

结果:

déjà

#1


11  

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

因此问题是UTF-8代码单元值被存储为c#字符串中16位代码单元的序列。您只需验证每个代码单元都在字节范围内,将这些值复制为字节,然后将新的UTF-8字节序列转换为UTF-16。

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà

This is easy, however it would be best to find the root cause; the location where someone is copying UTF-8 code units into 16 bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding. E.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).

这很容易,但是最好找到根本原因;将UTF-8代码单元复制到16位代码单元的位置。可能的罪魁祸首是某人使用错误的编码将字节转换为c#字符串。例如Encoding.Default。GetString(utf8Bytes 0 utf8Bytes.Length)。


Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes:

或者,如果你确定你知道不正确的编码用于生产的字符串,而不正确的编码转换无损(通常情况下如果不正确的编码是一个字节编码),那么您可以简单地做反编码步骤原始utf - 8数据,然后你可以做正确的转换从utf - 8字节:

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}

UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding(1252), Encoding.UTF8);

#2


9  

If you have a UTF-8 string, where every byte is correct ('Ö' -> [195, 0] , [150, 0]), you can use the following:

如果您有一个UTF-8字符串,其中每个字节都是正确的(' O ' ->[195, 0],[150, 0]),您可以使用以下方法:

public static string Utf8ToUtf16(string utf8String)
{
    /***************************************************************
     * Every .NET string will store text with the UTF-16 encoding, *
     * known as Encoding.Unicode. Other encodings may exist as     *
     * Byte-Array or incorrectly stored with the UTF-16 encoding.  *
     *                                                             *
     * UTF-8 = 1 bytes per char                                    *
     *    ["100" for the ansi 'd']                                 *
     *    ["206" and "186" for the russian '?']                    *
     *                                                             *
     * UTF-16 = 2 bytes per char                                   *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["186, 3" for the russian '?']                           *
     *                                                             *
     * UTF-8 inside UTF-16                                         *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["206, 0" and "186, 0" for the russian '?']              *
     *                                                             *
     * First we need to get the UTF-8 Byte-Array and remove all    *
     * 0 byte (binary 0) while doing so.                           *
     *                                                             *
     * Binary 0 means end of string on UTF-8 encoding while on     *
     * UTF-16 one binary 0 does not end the string. Only if there  *
     * are 2 binary 0, than the UTF-16 encoding will end the       *
     * string. Because of .NET we don't have to handle this.       *
     *                                                             *
     * After removing binary 0 and receiving the Byte-Array, we    *
     * can use the UTF-8 encoding to string method now to get a    *
     * UTF-16 string.                                              *
     *                                                             *
     ***************************************************************/

    // Get UTF-8 bytes and remove binary 0 bytes (filler)
    List<byte> utf8Bytes = new List<byte>(utf8String.Length);
    foreach (byte utf8Byte in utf8String)
    {
        // Remove binary 0 bytes (filler)
        if (utf8Byte > 0) {
            utf8Bytes.Add(utf8Byte);
        }
    }

    // Convert UTF-8 bytes to UTF-16 string
    return Encoding.UTF8.GetString(utf8Bytes.ToArray());
}

In my case the DLL result is a UTF-8 string too, but unfortunately the UTF-8 string is interpreted with UTF-16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–' which is 150 was converted to the UTF-16 '–' which is 8211. If you have this case too, you can use the following instead:

在我的例子中,DLL结果也是一个UTF-8字符串,但是不幸的是,UTF-8字符串被解释为UTF-16编码(' O ' ->[195, 0],[19,32])。因此,ANSI ',即150被转换为UTF-16 ',即8211。如果你也有这种情况,你可以用以下方法代替:

public static string Utf8ToUtf16(string utf8String)
{
    // Get UTF-8 bytes by reading each byte with ANSI encoding
    byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);

    // Convert UTF-8 bytes to UTF-16 bytes
    byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

    // Return UTF-16 bytes as UTF-16 string
    return Encoding.Unicode.GetString(utf16Bytes);
}

Or the Native-Method:

或本机方法:

[DllImport("kernel32.dll")]
private static extern Int32 MultiByteToWideChar(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr, Int32 cbMultiByte, [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr, Int32 cchWideChar);

public static string Utf8ToUtf16(string utf8String)
{
    Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
    if (iNewDataLen > 1)
    {
        StringBuilder utf16String = new StringBuilder(iNewDataLen);
        MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

        return utf16String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf16ToUtf8. Hope I could be of help.

如果您需要它,请参见Utf16ToUtf8。希望我能帮上忙。

#3


8  

I have string that displays UTF-8 encoded characters

我有一个显示UTF-8编码字符的字符串。

There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes into a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode codepoint. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.

在。net中没有这样的东西。string类只能以UTF-16编码存储字符串。UTF-8编码的字符串只能以字节[]的形式存在。尝试将字节存储到字符串中不会有好的结果;UTF-8使用的字节值没有有效的Unicode码点。当字符串被规范化时,内容将被销毁。因此,当您的DecodeFromUtf8()开始运行时,恢复字符串已经太晚了。

Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.

只处理带有字节[]的UTF-8编码文本。并使用UTF8Encoding.GetString()来转换它。

#4


2  

What you have seems to be a string incorrectly decoded from another encoding, likely code page 1252, which is US Windows default. Here's how to reverse, assuming no other loss. One loss not immediately apparent is the non-breaking space (U+00A0) at the end of your string that is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.

您看到的似乎是从另一个编码(可能是代码页1252)中错误解码的字符串,即US Windows默认。这里是如何逆转,假设没有其他损失。一个不明显的损失是在字符串末尾没有显示的非破坏空间(U+00A0)。当然,最好一开始就正确地读取数据源,但是可能数据源一开始就被错误地存储了。

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string junk = "déjÃ\xa0";  // Bad Unicode string

        // Turn string back to bytes using the original, incorrect encoding.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

        // Use the correct encoding this time to convert back to a string.
        string good = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(good);
    }
}

Result:

结果:

déjà