Is the Unicode encoding of the Mac OS X file system incorrect in Python?

I'm struggling a bit with Unicode file names on OS X in Python. I want to use filenames as input for a regular expression later in the code, but the encoding used in the filenames seems to differ from what sys.getfilesystemencoding() reports. Take the following code:

#!/usr/bin/env python
# coding=utf-8

import sys,os
print sys.getfilesystemencoding()

p = u'/temp/s/'
s = u'åäö'
print 's', [ord(c) for c in s], s
s2 = s.encode(sys.getfilesystemencoding())
print 's2', [ord(c) for c in s2], s2
os.mkdir(p+s)
for d in os.listdir(p):
  print 'dir', [ord(c) for c in d], d

It outputs the following:

utf-8
s [229, 228, 246] åäö
s2 [195, 165, 195, 164, 195, 182] åäö
dir [97, 778, 97, 776, 111, 776] åäö

So the file system encoding is utf-8, but when I encode my filename åäö with it, the result is not the same as what I get back after creating a directory with that string as its name. I expect that when I use my string åäö to create a directory and read its name back, I get the same codes as when I apply the encoding directly.

If we look at the code points 97, 778, 97, 776, 111, 776, they are basically ASCII characters each followed by a combining diacritic, e.g. o + ¨ = ö, which makes each letter two code points, not one. How can I avoid this discrepancy? Is there an encoding scheme in Python that matches this OS X behaviour, and why isn't getfilesystemencoding() giving me the right result?

Or have I messed up?

2 Answers

#1

Mac OS X uses a special kind of decomposed UTF-8 to store filenames. If you need to, for example, read in filenames and write them to a "normal" UTF-8 file, you must normalize them:

import unicodedata
filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')
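Since the question is about feeding filenames to a regular expression, here is a minimal sketch of why the normalization matters for matching. It assumes Python 3 (where `os.listdir` already returns `str`, so no explicit decode is needed), and the helper name `to_nfc` is hypothetical. A hard-coded list stands in for a real `os.listdir()` result.

```python
import re
import unicodedata


def to_nfc(names):
    """Normalize each name to NFC (hypothetical helper, not a library API)."""
    return [unicodedata.normalize('NFC', name) for name in names]


# Simulated os.listdir() result: the decomposed form of 'åäö' from the question.
listing = ['a\u030aa\u0308o\u0308']

# A pattern written with the precomposed 'ö' (U+00F6).
pattern = re.compile('\u00f6')

raw_hits = [n for n in listing if pattern.search(n)]
nfc_hits = [n for n in to_nfc(listing) if pattern.search(n)]

print(len(raw_hits), len(nfc_hits))  # prints: 0 1
```

In real use the list would come from something like `to_nfc(os.listdir(path))`. Normalizing the pattern text to the same form is just as important as normalizing the names.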

Source: https://web.archive.org/web/20120423075412/http://boodebr.org/main/python/all-about-python-and-unicode

#2

getfilesystemencoding() is giving you the correct response (the encoding), but it does not tell you the Unicode normalisation form.

In particular, the HFS+ filesystem uses UTF-8 encoding and a normalisation form close to "D" (which requires composed characters like ö to be decomposed into o¨). HFS+ is also tied to the normalisation form as it existed in Unicode version 3.2, as detailed in Apple's documentation for the HFS+ format.
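To see that encoding and normalisation are separate questions, the following sketch reproduces the code points from the question without touching the filesystem. It assumes Python 3 and uses plain NFD as a stand-in for the HFS+ form, which is only close to "D", not identical to it.

```python
import unicodedata

composed = '\u00e5\u00e4\u00f6'                      # 'åäö' in composed form (NFC)
decomposed = unicodedata.normalize('NFD', composed)  # roughly what HFS+ stores

print([ord(c) for c in composed])    # [229, 228, 246]
print([ord(c) for c in decomposed])  # [97, 778, 97, 776, 111, 776]

# Both strings encode to valid UTF-8, but the byte sequences differ.
print(composed.encode('utf-8') == decomposed.encode('utf-8'))  # False

# Normalizing both sides to the same form makes them compare equal.
print(unicodedata.normalize('NFC', decomposed) == composed)    # True
```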

Python's unicodedata.normalize function converts between forms, and if you call it on the ucd_3_2_0 object instead, you can constrain it to Unicode version 3.2:

filename = unicodedata.ucd_3_2_0.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')
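The line above is Python 2. As a rough Python 3 equivalent, assuming the name is already a `str` (as returned by `os.listdir`), the same object can be used without the decode and encode steps. The sample name here is the decomposed string from the question.

```python
import unicodedata

# UCD object pinned to the Unicode 3.2.0 database.
ucd32 = unicodedata.ucd_3_2_0
print(ucd32.unidata_version)  # 3.2.0

# Decomposed 'åäö', as read back from the directory listing in the question.
name = 'a\u030aa\u0308o\u0308'

# Compose using the Unicode 3.2 normalisation data.
composed = ucd32.normalize('NFC', name)
print([ord(c) for c in composed])  # [229, 228, 246]
```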
