How to convert \uXXXX unicode escapes to UTF-8 using console tools on *nix?

Time: 2022-06-08 02:06:20

I use curl to get some URL response, it's JSON response and it contains unicode-escaped national characters like \u0144 (ń) and \u00f3 (ó).

How can I convert them to UTF-8 or any other encoding to save into file?

7 solutions

#1


26  

I don't know which distribution you are using, but uni2ascii should be included.

$ sudo apt-get install uni2ascii

It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55.0 kB on Ubuntu)!

Then to use it:

$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó

#2


29  

Might be a bit ugly, but echo -e should do it:

echo -en "$(curl $URL)"

-e interprets escapes, -n suppresses the newline echo would normally add.

Note: The \u escape works in the bash builtin echo, but not /usr/bin/echo.

As pointed out in the comments, this requires bash 4.2+, and 4.2.x has a bug handling values in the 0x80-0xff range.

#3


28  

I found native2ascii from the JDK to be the best way to do it:

native2ascii -encoding UTF-8 -reverse src.txt dest.txt

Detailed description is here: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html

Update: No longer available since JDK9: https://bugs.openjdk.java.net/browse/JDK-8074431

#4


18  

Assuming the \u is always followed by exactly 4 hex digits:

#!/usr/bin/perl

use strict;
use warnings;

binmode(STDOUT, ':utf8');

while (<>) {
    s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg;
    print;
}

The binmode puts standard output into UTF-8 mode. The s/.../.../eg substitution replaces each occurrence of \u followed by 4 hex digits with the corresponding character. The /e modifier causes the replacement to be evaluated as an expression rather than treated as a string; /g says to replace all occurrences rather than just the first.

You can save the above to a file somewhere in your $PATH (don't forget the chmod +x). It filters standard input (or one or more files named on the command line) to standard output.
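
If Perl isn't handy, the same substitution can be sketched in Python 3; this is my translation of the Perl filter above, with the same 4-hex-digit assumption (and the same surrogate-pair limitation that a later answer points out):

```python
import re
import sys

def unescape(text):
    # Replace each \uXXXX (exactly 4 hex digits) with the corresponding
    # character, like the Perl s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg above.
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  text)

if __name__ == '__main__':
    # Filter standard input to standard output, line by line.
    for line in sys.stdin:
        sys.stdout.write(unescape(line))
```

Like the Perl version, save it somewhere in your $PATH, chmod +x it, and pipe the curl output through it.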

#5


9  

Use /usr/bin/printf "\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima" to get proper Unicode-to-UTF-8 conversion.

#6


8  

Don't rely on regexes: JSON has some strange corner cases with \u escapes and non-BMP code points (specifically, JSON encodes a single non-BMP code point as two \u escapes, a surrogate pair). If you assume one escape sequence translates to one code point, you're doomed on such text.
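
To see the surrogate-pair problem concretely, here is a minimal Python 3 sketch (the regex and lambda are illustrative, not from any answer above):

```python
import json
import re

s = r'"\ud83d\udca3"'  # JSON encoding of U+1F4A3, a single non-BMP code point

# A real JSON parser combines the surrogate pair into one code point:
decoded = json.loads(s)
print(len(decoded))  # 1

# A naive one-escape-per-code-point substitution yields two lone surrogates,
# which cannot be encoded as UTF-8 at all:
naive = re.sub(r'\\u([0-9a-fA-F]{4})',
               lambda m: chr(int(m.group(1), 16)),
               s.strip('"'))
print(len(naive))  # 2; naive.encode('utf-8') raises UnicodeEncodeError
```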

Using a full JSON parser from the language of your choice is considerably more robust:

$ echo '["foo bar \u0144\n"]' | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'

That's really just feeding the data to this short Python script:

import json
import sys

data = json.load(sys.stdin)
data = data[0] # change this to find your string in the JSON
sys.stdout.write(data.encode('utf-8'))

You can save that as foo.py and call it as curl ... | foo.py
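
Note that the one-liner and script above are Python 2 (str.encode() written to sys.stdout). A Python 3 sketch of the same idea, writing raw UTF-8 bytes so the output doesn't depend on the locale:

```python
import io
import json
import sys

def first_string(stream):
    # json handles all \uXXXX escapes itself, surrogate pairs included,
    # and returns a proper Python 3 str.
    return json.load(stream)[0]  # adjust the [0] lookup for your JSON's shape

# Example with an in-memory stream; with curl you'd pass sys.stdin instead:
text = first_string(io.StringIO('["foo bar \\u0144\\n"]'))
sys.stdout.buffer.write(text.encode('utf-8'))  # raw UTF-8 bytes
```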

An example that will break most of the other attempts in this question is "\ud83d\udca3":

% printf '"\\ud83d\\udca3"' | python2 -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'; echo
💣
# echo will result in corrupt output:
% echo -e $(printf '"\\ud83d\\udca3"') 
"������"
# native2ascii won't even try (this is correct for its intended use case, however, just not ours):
% printf '"\\ud83d\\udca3"' | native2ascii -encoding utf-8 -reverse
"\ud83d\udca3"

#7


-1  

Works on Windows, should work on *nix too. Uses python 2.

#!/usr/bin/env python
from __future__ import unicode_literals
import sys
import json
import codecs

def unescape_json(fname_in, fname_out):
    with file(fname_in, 'rb') as fin:
        js = json.load(fin)
    with codecs.open(fname_out, 'wb', 'utf-8') as fout:
        json.dump(js, fout, ensure_ascii=False)

def usage():
    print "Converts all \\uXXXX codes in json into utf-8"
    print "Usage: .py infile outfile"
    sys.exit(1)

def main():
    try:
        fname_in, fname_out = sys.argv[1:]
    except Exception:
        usage()

    unescape_json(fname_in, fname_out)
    print "Done."

if __name__ == '__main__':
    main()
