I have a CSV file containing sites along with their addresses. I need to process this file to produce a JSON file that I will use in Django to load initial data into my database. To do that, I need to convert all special characters in the CSV file to Unicode escape sequences.
Here is an example:
Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A
It should be converted to:
\u00D6rnsk\u00F6ldsvik;SE;Ornskoldsvik;\u00C5ngermanlandsgatan 28 A
The following site does exactly the conversion I expect: http://itpro.cz/juniconv/ but I'd like to find a way to do it from the command line (bash) or in Python. I've already tried iconv, uconv and some Python scripts without real success.
What kind of script is running behind the juniconv website?
Thank you in advance for any suggestions.
3 solutions
#1
1
If you want Java-style Unicode escapes in Python, you could use the JSON format:
>>> import json
>>> import sys
>>> s = u'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A'
>>> json.dump(s, sys.stdout)
"\u00d6rnsk\u00f6ldsvik;SE;Ornskoldsvik;\u00c5ngermanlandsgatan 28 A"
There is also the unicode-escape codec, but you shouldn't use it: it produces Python-specific escaping (what Python Unicode string literals look like):
>>> print s.encode('unicode-escape')
\xd6rnsk\xf6ldsvik;SE;Ornskoldsvik;\xc5ngermanlandsgatan 28 A
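The session above is Python 2 (the u'' prefix and the print statement). For Python 3, a minimal sketch of the same JSON-based approach could look like the following. The helper name to_json_escapes is made up for illustration, and stripping the surrounding quotes is my assumption about what the asker wants for a raw line:

```python
import json


def to_json_escapes(text: str) -> str:
    """Return text with non-ASCII characters written as \\uXXXX escapes.

    Hypothetical helper, not part of any library.
    """
    # ensure_ascii=True is already the default; it is spelled out for clarity.
    quoted = json.dumps(text, ensure_ascii=True)
    # json.dumps wraps a string in double quotes; drop them to get the bare line.
    return quoted[1:-1]


line = "Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A"
print(to_json_escapes(line))
# \u00d6rnsk\u00f6ldsvik;SE;Ornskoldsvik;\u00c5ngermanlandsgatan 28 A
```

Keep in mind that json.dumps also escapes double quotes and backslashes inside the string, and writes characters above U+FFFF as surrogate pairs, so this is JSON escaping rather than a pure character-by-character substitution. The hex digits come out lowercase, which JSON parsers treat the same as uppercase.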
#2
0
Maybe something like this will help? I assume you have a UTF-8 encoded string...
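import csv

# Python 2: utf8_data is any iterable of UTF-8 encoded lines, e.g. an open file
csv_reader = csv.reader(utf8_data, delimiter=';')  # the sample data is semicolon-separated
for row in csv_reader:
    # Decode every UTF-8 byte-string cell into a unicode object
    encoded_row = [unicode(cell, 'utf-8') for cell in row]
    #print(encoded_row)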
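Since the end goal in the question is a JSON file for Django's initial data, here is a rough Python 3 sketch that goes from the CSV text straight to a fixture. The model label "sites.site" and the field names (name, country, city, address) are guesses based on the sample line, not something the asker stated, so adjust them to your own model:

```python
import csv
import json


def csv_to_fixture(csv_text, model="sites.site"):
    """Turn semicolon-separated lines into a Django-fixture-style JSON string.

    The model label and the field names are hypothetical placeholders.
    """
    # csv.reader accepts any iterable of text lines.
    reader = csv.reader(csv_text.splitlines(), delimiter=";")
    fixture = []
    for pk, row in enumerate(reader, start=1):
        name, country, city, address = row
        fixture.append({
            "model": model,
            "pk": pk,
            "fields": {
                "name": name,
                "country": country,
                "city": city,
                "address": address,
            },
        })
    # The default ensure_ascii=True writes non-ASCII characters as \uXXXX.
    return json.dumps(fixture, indent=2)


sample = "Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A\n"
print(csv_to_fixture(sample))
```

With this route you never have to escape the CSV by hand: the csv module splits the columns and the json module takes care of the escaping when it serializes.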
#3
0
You can do it with GNU libiconv's --unicode-subst option:
$ echo 'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A' | \
    iconv -t ASCII --unicode-subst='\u%04X'
\u00D6rnsk\u00F6ldsvik;SE;Ornskoldsvik;\u00C5ngermanlandsgatan 28 A
Incidentally, GNU libiconv also has a pseudo-encoding called JAVA that does this:
$ echo 'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A' | \
    iconv -t JAVA
\u00d6rnsk\u00f6ldsvik;SE;Ornskoldsvik;\u00c5ngermanlandsgatan 28 A
Note: GNU libiconv is not the iconv included with glibc. It's a separate package that is usually not installed on glibc systems, because glibc's iconv is just as good for 99% of purposes.
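If GNU libiconv isn't available on your system, the effect of the '\u%04X' substitution can be approximated in a few lines of Python 3. This is only a sketch of the same idea, not a reimplementation of iconv, and escape_non_ascii is a made-up name:

```python
def escape_non_ascii(text):
    """Replace each non-ASCII character with an uppercase \\uXXXX escape.

    Hypothetical helper; ASCII characters are passed through untouched.
    """
    return "".join(
        ch if ord(ch) < 128 else "\\u%04X" % ord(ch)
        for ch in text
    )


line = "Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A"
print(escape_non_ascii(line))
# \u00D6rnsk\u00F6ldsvik;SE;Ornskoldsvik;\u00C5ngermanlandsgatan 28 A
```

Unlike the json-based approach, this leaves quotes and backslashes alone and gives uppercase hex digits, which matches the output asked for in the question. One caveat: for characters above U+FFFF the %04X format yields more than four hex digits, which is not a valid JSON or Java escape, so those would need surrogate-pair handling.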