Encoding Unicode characters as Unicode escape sequences

Time: 2021-04-02 00:15:10

I have a CSV file containing sites along with their addresses. I need to process this file to produce a JSON file that I will use in Django to load initial data into my database. To do that, I need to convert all special characters in the CSV file to Unicode escape sequences.

Here is an example:

Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A

It should be converted to:

\u00D6rnsk\u00F6ldsvik;SE;Ornskoldsvik;\u00C5ngermanlandsgatan 28 A

The following site does exactly the conversion I expect: http://itpro.cz/juniconv/ but I'd like to find a way to do it from the command line (bash) or in Python. I've already tried iconv, uconv, and some Python scripts without real success.

What kind of script is running behind the juniconv website?

Thank you in advance for any suggestions.

3 Solutions

#1


1  

If you want Java-style Unicode escapes in Python, you could use the JSON format:

>>> import json
>>> import sys
>>> s = u'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A'
>>> json.dump(s, sys.stdout)
"\u00d6rnsk\u00f6ldsvik;SE;Ornskoldsvik;\u00c5ngermanlandsgatan 28 A"

There is also the unicode-escape codec, but you shouldn't use it: it produces Python-specific escaping (the way Python Unicode string literals look):

>>> print s.encode('unicode-escape')
\xd6rnsk\xf6ldsvik;SE;Ornskoldsvik;\xc5ngermanlandsgatan 28 A
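
The session above is Python 2. As a minimal sketch of the same comparison on Python 3 (assuming only the standard library; the variable names are illustrative), the difference between the two approaches can be seen side by side:

```python
import json

s = 'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A'

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True)
# and wraps the result in double quotes.
json_escaped = json.dumps(s)
print(json_escaped)

# The unicode_escape codec returns bytes and uses Python literal escapes:
# \xNN for code points up to U+00FF, \uNNNN up to U+FFFF, \UNNNNNNNN beyond.
codec_escaped = s.encode('unicode_escape').decode('ascii')
print(codec_escaped)

# The two diverge further outside Latin-1 and outside the BMP:
# JSON uses a UTF-16 surrogate pair, the codec uses an 8-digit \U escape.
print(json.dumps('€😀'))
print('€😀'.encode('unicode_escape').decode('ascii'))
```

Note that json.dumps also escapes double quotes and backslashes in the data and emits lowercase hex digits, so its output is a JSON string literal rather than a plain character-for-character substitution.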

#2


0  

Maybe something like this will help? I assume you have a UTF-8 string...

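import csv

# utf8_data is assumed to be an iterable of UTF-8 encoded lines,
# e.g. a file opened in binary mode (Python 2).
csv_reader = csv.reader(utf8_data, delimiter=';')
for row in csv_reader:
    # Decode each UTF-8 byte string into a unicode object
    encoded_row = [unicode(cell, 'utf-8') for cell in row]
    #print(encoded_row)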
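
On Python 3 the csv module works on text directly, so the decoding step disappears. The following is a rough sketch under a few assumptions: the input is semicolon-delimited as in the question's sample, `utf8_text` stands in for a real file opened with `encoding='utf-8'`, and `rows_to_json` is a hypothetical helper name. It reads the rows and lets json.dumps produce the escapes:

```python
import csv
import io
import json

# Stand-in for file contents; with a real file you would pass
# open(path, newline='', encoding='utf-8') instead of a StringIO.
utf8_text = 'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A\n'


def rows_to_json(text_stream):
    """Read semicolon-delimited rows and serialize them as ASCII-only JSON."""
    reader = csv.reader(text_stream, delimiter=';')
    rows = [row for row in reader]
    # ensure_ascii defaults to True, so non-ASCII characters become \uXXXX
    return json.dumps(rows)


result = rows_to_json(io.StringIO(utf8_text))
print(result)
```

Since the JSON serializer does the escaping itself, no separate conversion pass over the CSV should be needed before building the JSON file.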

#3


0  

You can do it with GNU libiconv's --unicode-subst option:

$ echo 'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A' | \
  iconv -t ASCII --unicode-subst='\u%04X'
\u00D6rnsk\u00F6ldsvik;SE;Ornskoldsvik;\u00C5ngermanlandsgatan 28 A

Incidentally, GNU libiconv also has a pseudo-encoding called JAVA that does this:

$ echo 'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A' | \
  iconv -t JAVA
\u00d6rnsk\u00f6ldsvik;SE;Ornskoldsvik;\u00c5ngermanlandsgatan 28 A
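
Where GNU libiconv is not available, the same substitution can be sketched in Python 3. This is an illustration, not what iconv does internally: `escape_non_ascii` is a hypothetical name, and splitting characters above U+FFFF into UTF-16 surrogate pairs is an assumption borrowed from the Java/JSON convention.

```python
import re


def escape_non_ascii(text):
    """Replace every non-ASCII character with an uppercase \\uXXXX escape."""
    def repl(match):
        cp = ord(match.group())
        if cp > 0xFFFF:
            # Assumption: represent astral characters as a surrogate pair,
            # as Java and JSON do.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            return '\\u%04X\\u%04X' % (high, low)
        return '\\u%04X' % cp

    # ASCII characters, including existing backslashes, are left untouched
    return re.sub(r'[^\x00-\x7f]', repl, text)


line = 'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A'
escaped = escape_non_ascii(line)
print(escaped)
```

Unlike json.dumps, this leaves quotes and backslashes alone and produces uppercase hex digits, which matches the format requested in the question.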

Note: GNU libiconv is not the iconv included with glibc. It's a separate package that is usually not installed on glibc systems, because glibc's iconv is just as good for 99% of purposes.
