How do I convert Unicode escape sequences to Unicode characters in a .NET string?

Say you've loaded a text file into a string, and you'd like to convert all Unicode escapes into actual Unicode characters inside of the string.

Example:

"The following is the top half of an integral character in Unicode '\u2320', and this is the lower half '\U2321'."

4 answers

#1

The answer is simple and works well with strings up to at least several thousand characters.

Example 1:

Regex  rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString() );

Example 2:

Regex  rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, delegate (Match match) { return ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString(); } );

The first example makes the replacement with a lambda expression (C# 3.0); the second uses an anonymous delegate, which should also work with C# 2.0.

To break down what's going on here, first we create a regular expression:

new Regex( @"\\[uU]([0-9A-F]{4})" );

Then we call Replace() with the string 'result' and an anonymous method that converts each match the regular expression finds in the string. The anonymous method is a lambda expression in the first example and a delegate in the second; it could also be a regular named method.

The Unicode escape is processed like this:

((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString()

Get the string representing the number part of the escape (skip the first two characters).

match.Value.Substring(2)

Parse that string using Int32.Parse(), which takes the string and the number format that Parse() should expect. In this case the format is a hex number:

NumberStyles.HexNumber

Then we cast the resulting number to a Unicode character:

(char)

And finally we call ToString() on the Unicode character. This gives us its string representation, which is the value passed back to Replace():

.ToString()
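
For anyone who wants to step through it, the breakdown above can be sketched as a small console program with one named local per step. This is only a sketch: the class name UnicodeEscapeDemo, the method name Unescape, and the local variable names are made up for illustration, and the pattern here also accepts lowercase hex digits, which is an assumption that goes beyond the original expression.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeEscapeDemo
{
    // Assumption: a-f is accepted as well as A-F; the original pattern only lists A-F.
    private static readonly Regex EscapePattern = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");

    // Hypothetical helper: replaces every \uXXXX or \UXXXX escape with the character it names.
    public static string Unescape(string input)
    {
        return EscapePattern.Replace(input, match =>
        {
            // Step 1: skip the backslash and the 'u', keeping the four hex digits.
            string hexDigits = match.Value.Substring(2);
            // Step 2: parse the digits as a hexadecimal number.
            int codeUnit = Int32.Parse(hexDigits, NumberStyles.HexNumber);
            // Step 3: cast the number to a UTF-16 character.
            char character = (char) codeUnit;
            // Step 4: hand the character back to Replace() as a string.
            return character.ToString();
        });
    }

    private static void Main()
    {
        // Verbatim string, so the backslashes are literal text, as if read from a file.
        string text = @"Top half: '\u2320', lower half: '\U2321'.";
        string expected = "Top half: '\u2320', lower half: '\u2321'.";
        Console.WriteLine(Unescape(text) == expected); // True
    }
}
```

The comparison in Main is against a normal C# string literal, where the compiler itself has already turned the escapes into characters, so it avoids depending on what the console font can display.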

Note: Instead of grabbing the text to be converted with a Substring call, you could use the match parameter's GroupCollection and a subexpression in the regular expression to capture just the number ('2320'), but that's more complicated and less readable.

#2

Refactored a little more:

Regex regex = new Regex (@"\\U([0-9A-F]{4})", RegexOptions.IgnoreCase);
string line = "...";
line = regex.Replace (line, match => ((char)int.Parse (match.Groups[1].Value,
  NumberStyles.HexNumber)).ToString ());

#3

This is the VB.NET equivalent:

Dim rx As New RegularExpressions.Regex("\\[uU]([0-9A-Fa-f]{4})")
result = rx.Replace(result, Function(match) CChar(ChrW(Int32.Parse(match.Value.Substring(2), Globalization.NumberStyles.HexNumber))).ToString())

#4

I think you'd better add the lowercase letters to your regular expression. It worked better for me.

Regex rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");
result = rx.Replace(result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString());
