How to read Unicode input and compare Unicode strings in Python?

I work in Python and would like to read user input (from the command line) as Unicode, i.e., a Unicode equivalent of raw_input.

Also, I would like to test Unicode strings for equality, and it looks like the standard == does not work.

Thank you for your help !

4 solutions

#1

raw_input() returns strings as encoded by the OS or UI facilities. The difficulty is knowing which encoding that is. You might attempt the following:

import sys, locale
# Decode with the terminal's encoding, falling back to the locale's preferred one
text = raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))

which should work correctly in most cases.
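
The same idea can be wrapped in a small helper. This is only a sketch: the name decode_input and its optional encoding parameter are made up for illustration and are not part of any library. It applies to byte strings only; on Python 3, input() already returns Unicode text, so no decoding is needed there.

```python
import locale
import sys


def decode_input(raw, encoding=None):
    """Decode a byte string read from the terminal into Unicode text.

    decode_input is a hypothetical helper name, not a standard function.
    """
    if encoding is None:
        # Prefer the encoding the terminal reports; fall back to the locale.
        encoding = (getattr(sys.stdin, "encoding", None)
                    or locale.getpreferredencoding(True))
    return raw.decode(encoding)
```

Passing the encoding explicitly, for example decode_input(raw, "utf-8"), avoids guessing when you already know what the terminal sends.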

We need more details about the Unicode comparisons that fail in order to help you. However, it might be a matter of normalization. Consider the following:

>>> a1= u'\xeatre'
>>> a2= u'e\u0302tre'

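a1 and a2 are canonically equivalent but not equal: a1 uses the precomposed character U+00EA, while a2 uses a plain e followed by the combining circumflex U+0302: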

>>> print a1, a2
être être
>>> print a1 == a2
False

So you might want to use the unicodedata.normalize() function:

>>> import unicodedata as ud
>>> ud.normalize('NFC', a1)
u'\xeatre'
>>> ud.normalize('NFC', a2)
u'\xeatre'
>>> ud.normalize('NFC', a1) == ud.normalize('NFC', a2)
True
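
If you compare strings in many places, the normalization step could be folded into a helper. A minimal sketch, assuming both arguments are already Unicode strings; the name unicode_equal is hypothetical, and NFC is just one reasonable default among the normalization forms.

```python
import unicodedata


def unicode_equal(a, b, form="NFC"):
    """Compare two Unicode strings after normalizing both to the same form.

    unicode_equal is an illustrative name, not a standard library function.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

With the values above, unicode_equal(a1, a2) is true even though a1 == a2 is false.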

If you give us more information, we might be able to help you more, though.

#2

It should work. raw_input returns a byte string which you must decode using the correct encoding to get your unicode object. For example, the following works for me under Python 2.5 / Terminal.app / OSX:

>>> bytes = raw_input()
日本語 Ελληνικά
>>> bytes
'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e \xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac'

>>> uni = bytes.decode('utf-8') # substitute the encoding of your terminal if it's not utf-8
>>> uni
u'\u65e5\u672c\u8a9e \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac'

>>> print uni
日本語 Ελληνικά

As for comparing unicode strings: can you post an example where the comparison doesn't work?

#3

I'm not really sure which format you mean by "Unicode format"; there are several. UTF-8? UTF-16? In any case, you should be able to read a normal string with raw_input and then decode it using the string's decode method:

raw = raw_input("Please input some funny characters: ")
decoded = raw.decode("utf-8")

If you have a different input encoding, just use "utf-16" or whatever applies instead of "utf-8". Also see the codecs module docs for the different kinds of encodings.
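
Picking the right encoding matters, because the wrong one either fails or silently produces garbage. The sketch below illustrates this with a sample string; try_decode is a hypothetical helper name, and the sample text is only an assumption for demonstration.

```python
text = u"\u65e5\u672c\u8a9e"  # sample text, chosen only for illustration

# The same text becomes different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid for encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# A strict codec such as ASCII rejects the UTF-8 bytes outright ...
rejected = try_decode(utf8_bytes, "ascii")
# ... while Latin-1 accepts any byte and yields the wrong characters.
garbled = try_decode(utf8_bytes, "latin-1")
```

Here rejected is None, while garbled is a string that differs from text, so a decode that does not raise is not proof that the encoding was right.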

Comparing should then work just fine with ==. If you have string literals containing special characters, you should prefix them with "u" to mark them as unicode:

if decoded == u"äöü":
    print "Do you speak German?"

And if you want to output these strings again, you probably want to encode them again in the desired encoding:

print decoded.encode("utf-8")

#4

In the general case, it's probably not possible to compare Unicode strings reliably by code points alone. The problem is that there are several ways to compose the same characters. A simple example is accented Roman characters. Although there are code points for basically all of the commonly used accented characters, it is also correct to compose them from unaccented base letters and a non-spacing accent. This issue is more significant in many non-Roman alphabets.
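
To see the two spellings side by side, you could list the code point names of each form. A small sketch using the standard unicodedata module; the helper name describe is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


composed = u"\xea"  # one precomposed code point
decomposed = unicodedata.normalize("NFD", composed)  # base letter + accent

composed_names = describe(composed)
decomposed_names = describe(decomposed)
```

composed_names holds a single entry, while decomposed_names holds the base letter and the combining accent as separate entries. Both render identically, which is why normalizing both sides to one form before comparing is the usual workaround.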
