How do I sort Unicode strings alphabetically in Python?

Date: 2021-01-10 07:12:30

Python sorts by byte value by default, which means é comes after z and other equally funny things. What is the best way to sort alphabetically in Python?

Is there a library for this? I couldn't find anything. Preferably the sorting should have language support, so that it understands that åäö should be sorted after z in Swedish but that ü should be sorted with u, and so on. Unicode support is therefore pretty much a requirement.

If there is no library for it, what is the best way to do this? Just make a mapping from each letter to an integer value and map the string to a list of integers with that?
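
For reference, the hand-rolled mapping idea could be sketched roughly as below. This is only an illustration: `SWEDISH_ALPHABET` and `swedish_key` are made-up names, and the approach ignores accent and case levels, multi-character letters, and everything else a real collation library handles.

```python
# Hypothetical hand-rolled ordering: a letter's rank is its index in this string.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(s):
    """Map a string to a list of integer ranks (case-insensitively)."""
    # Characters outside the alphabet are pushed after it, ordered by code point.
    return [RANK.get(ch, len(SWEDISH_ALPHABET) + ord(ch)) for ch in s.lower()]


words = ["zebra", "älg", "åka", "apa", "öl"]
print(sorted(words, key=swedish_key))  # ['apa', 'zebra', 'åka', 'älg', 'öl']
```

It works for a single, fixed alphabet, but every further language needs its own table, which is the main argument for using a library instead.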

11 answers

#1


65  

IBM's ICU library does that (and a lot more). It has Python bindings: PyICU.

Update: The core difference in sorting between ICU and locale.strcoll is that ICU uses the full Unicode Collation Algorithm while strcoll uses ISO 14651.

The differences between those two algorithms are briefly summarized here: http://unicode.org/faq/collation.html#13. These are rather exotic special cases, which should rarely matter in practice.

>>> import icu # pip install PyICU
>>> sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']
>>> collator = icu.Collator.createInstance(icu.Locale('de_DE.UTF-8'))
>>> sorted(['a','b','c','ä'], key=collator.getSortKey)
['a', 'ä', 'b', 'c']

#2


47  

I don't see this in the answers. My application sorts according to the locale using Python's standard library. It is pretty easy.

# python2.5 code below
# corpus is our unicode() strings collection as a list
corpus = [u"Art", u"Älg", u"Ved", u"Wasa"]

import locale
# this reads the environment and inits the right locale
locale.setlocale(locale.LC_ALL, "")
# alternatively, (but it's bad to hardcode)
# locale.setlocale(locale.LC_ALL, "sv_SE.UTF-8")

corpus.sort(cmp=locale.strcoll)

# in python2.x, locale.strxfrm is broken and does not work for unicode strings
# in python3.x however:
# corpus.sort(key=locale.strxfrm)
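
On Python 3 the same idea could look like the sketch below. The `sort_with_locale` helper is a made-up name, and `sv_SE.UTF-8` is only an example locale name that may not be installed on your system (names differ on Windows), so the sketch falls back to plain code point order when the locale cannot be set.

```python
import locale


def sort_with_locale(words, locale_name=""):
    """Sort words using the collation rules of locale_name ("" = environment)."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Locale not available: fall back to code point order.
        return sorted(words)
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        # setlocale is process-wide, so restore what was there before.
        locale.setlocale(locale.LC_COLLATE, previous)


print(sort_with_locale(["Ved", "Art", "Wasa", "Älg"], "sv_SE.UTF-8"))
```

Restoring the previous setting keeps the change from leaking into the rest of the process, but it does not make the call thread-safe.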

Question to Lennart and other answerers: Doesn't anyone know 'locale' or is it not up to this task?

#3


9  

Try James Tauber's Python Unicode Collation Algorithm. It may not do exactly as you want, but seems well worth a look. For a bit more information about the issues, see this post by Christopher Lenz.

#4


8  

You might also be interested in pyuca:

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

Though it is certainly not the most exact way, it is a very simple way to at least get it somewhat right. It also beats locale in a webapp, as locale is not thread-safe and sets the language settings process-wide. It is also easier to set up than PyICU, which relies on an external C library.

I uploaded the script to GitHub, as the original was down at the time of this writing and I had to resort to web caches to get it:

https://github.com/href/Python-Unicode-Collation-Algorithm

I successfully used this script to sanely sort German/French/Italian text in a Plone module.

#5


7  

A summary and extended answer:

locale.strcoll under Python 2, and locale.strxfrm, will in fact solve the problem and do a good job, assuming that you have the locale in question installed. I tested it under Windows too, where the locale names are confusingly different, but on the other hand it seems to have all supported locales installed by default.

ICU doesn't necessarily do this better in practice; it does, however, do way more. Most notably, it has support for splitters that can split texts in different languages into words. This is very useful for languages that don't have word separators. You'll need a corpus of words to use as a base for the splitting, though, because that's not included.

It also has long names for the locales, so you can get pretty display names for a locale, support for calendars other than Gregorian (although I'm not sure the Python interface supports that), and tons of other more or less obscure locale support.

So, all in all: if you want to sort alphabetically in a locale-dependent way, you can use the locale module, unless you have special requirements or also need more locale-dependent functionality, like a word splitter.

#6


6  

I see the answers have already done an excellent job; I just wanted to point out one coding inefficiency in Human Sort. To apply a selective char-by-char translation to a Unicode string s, it uses the code:

spec_dict = {'Å':'A', 'Ä':'A'}

def spec_order(s):
    return ''.join([spec_dict.get(ch, ch) for ch in s])

Python has a much better, faster and more concise way to perform this auxiliary task (on Unicode strings -- the analogous method for byte strings has a different and somewhat less helpful specification!-):

spec_dict = dict((ord(k), spec_dict[k]) for k in spec_dict)

def spec_order(s):
    return s.translate(spec_dict)

The dict you pass to the translate method has Unicode ordinals (not strings) as keys, which is why we need that rebuilding step from the original char-to-char spec_dict. (Values in the dict you pass to translate [as opposed to keys, which must be ordinals] can be Unicode ordinals, arbitrary Unicode strings, or None to remove the corresponding character as part of the translation, so it's easy to specify "ignore a certain character for sorting purposes", "map ä to ae for sorting purposes", and the like).

In Python 3, you can get the "rebuilding" step more simply, e.g.:

spec_dict = ''.maketrans(spec_dict)

See the docs for other ways you can use this maketrans static method in Python 3.
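
As a rough Python 3 sketch of using such a table as a sort key: the lowercase entries and the hyphen mapped to None are illustrative additions, not part of the original spec_dict, and the resulting order is a simple folding, not real Swedish collation.

```python
# Translation table: fold Å/Ä into A and drop hyphens (None deletes a character).
SORT_TABLE = str.maketrans({"Å": "A", "Ä": "A", "å": "a", "ä": "a", "-": None})


def spec_order(s):
    """Return the folded form of s, intended only as a sort key."""
    return s.translate(SORT_TABLE)


words = ["Äpple", "Apa", "Zebra", "Ål"]
print(sorted(words, key=spec_order))  # ['Ål', 'Apa', 'Äpple', 'Zebra']
```

Passing the function as key= leaves the original strings untouched; only the comparison uses the folded text.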

#7


3  

A Complete UCA Solution

The simplest, easiest, and most straightforward way to do this is to make a callout to the Perl library module Unicode::Collate::Locale, which is a subclass of the standard Unicode::Collate module. All you need to do is pass the constructor a locale value of "sv" for Sweden.

(You may not necessarily appreciate this for Swedish text, but because Perl uses abstract characters, you can use any Unicode code point you please, no matter the platform or build! Few languages offer such convenience. I mention it because I’ve been fighting a losing battle with Java over this maddening problem a lot lately.)

The problem is that I do not know how to access a Perl module from Python — apart, that is, from using a shell callout or two-sided pipe. To that end, I have therefore provided you with a complete working script called ucsort that you can call to do exactly what you have asked for with perfect ease.
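
A shell callout from Python could be sketched along these lines. The helper names `ucsort_command` and `ucsort` are made up, the script is assumed to sit in the current directory as `ucsort`, and the sketch assumes the script reads standard input when no file name is given, which this answer does not confirm. Only the --locale option from this answer is used.

```python
import subprocess


def ucsort_command(script="ucsort", locale=None):
    """Build the argument list for running the Perl script."""
    cmd = ["perl", script]
    if locale:
        cmd.append("--locale=" + locale)
    return cmd


def ucsort(lines, script="ucsort", locale=None):
    """Pipe lines through the script and return its output lines.

    Assumption: the script sorts standard input when given no file argument.
    """
    result = subprocess.run(
        ucsort_command(script, locale),
        input="\n".join(lines) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if perl exits with an error
    )
    return result.stdout.splitlines()
```

A call such as `ucsort(words, locale="sv")` would then start one Perl process per sort, which is fine for occasional use but slow inside a tight loop.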

This script is 100% compliant with the full Unicode Collation Algorithm, with all tailoring options supported!! And if you have an optional module installed or run Perl 5.13 or better, then you have full access to easy-to-use CLDR locales. See below.

Demonstration

Imagine an input set ordered this way:

b o i j n l m å y e v s k h d f g t ö r x p z a ä c u q

A default sort by code point yields:

a b c d e f g h i j k l m n o p q r s t u v x y z ä å ö
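
This code point order is also what Python's built-in sorted() gives for these letters, as the small check below shows. It assumes the letters are precomposed characters: å is U+00E5, ä is U+00E4 and ö is U+00F6, so all three land after z, with ä first.

```python
letters = "b o i j n l m å y e v s k h d f g t ö r x p z a ä c u q".split()

# sorted() compares str values by code point, so the non-ASCII letters go last.
by_code_point = " ".join(sorted(letters))
print(by_code_point)
# a b c d e f g h i j k l m n o p q r s t u v x y z ä å ö
```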

which is incorrect by everybody’s book. Using my script, which uses the Unicode Collation Algorithm, you get this order:

% perl ucsort /tmp/swedish_alphabet | fmt
a å ä b c d e f g h i j k l m n o ö p q r s t u v x y z

That is the default UCA sort. To get the Swedish locale, call ucsort this way:

% perl ucsort --locale=sv /tmp/swedish_alphabet | fmt
a b c d e f g h i j k l m n o p q r s t u v x y z å ä ö

Here is a better input demo. First, the input set:

% fmt /tmp/swedish_set
cTD cDD Cöd Cbd cAD cCD cYD Cud cZD Cod cBD Cnd cQD cFD Ced Cfd cOD
cLD cXD Cid Cpd cID Cgd cVD cMD cÅD cGD Cqd Cäd cJD Cdd Ckd cÖD cÄD
Ctd Czd Cxd cHD cND cKD Cvd Chd Cyd cUD Cld Cmd cED Crd Cad Cåd Ccd
cRD cSD Csd Cjd cPD

By code point, that sorts this way:

Cad Cbd Ccd Cdd Ced Cfd Cgd Chd Cid Cjd Ckd Cld Cmd Cnd Cod Cpd Cqd
Crd Csd Ctd Cud Cvd Cxd Cyd Czd Cäd Cåd Cöd cAD cBD cCD cDD cED cFD
cGD cHD cID cJD cKD cLD cMD cND cOD cPD cQD cRD cSD cTD cUD cVD cXD
cYD cZD cÄD cÅD cÖD

But using the default UCA makes it sort this way:

% ucsort /tmp/swedish_set | fmt
cAD Cad cÅD Cåd cÄD Cäd cBD Cbd cCD Ccd cDD Cdd cED Ced cFD Cfd cGD
Cgd cHD Chd cID Cid cJD Cjd cKD Ckd cLD Cld cMD Cmd cND Cnd cOD Cod
cÖD Cöd cPD Cpd cQD Cqd cRD Crd cSD Csd cTD Ctd cUD Cud cVD Cvd cXD
Cxd cYD Cyd cZD Czd

But in the Swedish locale, this way:

% ucsort --locale=sv /tmp/swedish_set | fmt
cAD Cad cBD Cbd cCD Ccd cDD Cdd cED Ced cFD Cfd cGD Cgd cHD Chd cID
Cid cJD Cjd cKD Ckd cLD Cld cMD Cmd cND Cnd cOD Cod cPD Cpd cQD Cqd
cRD Crd cSD Csd cTD Ctd cUD Cud cVD Cvd cXD Cxd cYD Cyd cZD Czd cÅD
Cåd cÄD Cäd cÖD Cöd

If you prefer uppercase to sort before lowercase, do this:

% ucsort --upper-before-lower --locale=sv /tmp/swedish_set | fmt
Cad cAD Cbd cBD Ccd cCD Cdd cDD Ced cED Cfd cFD Cgd cGD Chd cHD Cid
cID Cjd cJD Ckd cKD Cld cLD Cmd cMD Cnd cND Cod cOD Cpd cPD Cqd cQD
Crd cRD Csd cSD Ctd cTD Cud cUD Cvd cVD Cxd cXD Cyd cYD Czd cZD Cåd
cÅD Cäd cÄD Cöd cÖD

Customized Sorts

You can do many other things with ucsort. For example, here is how to sort titles in English:

% ucsort --preprocess='s/^(an?|the)\s+//i' /tmp/titles
Anathem
The Book of Skulls
A Civil Campaign
The Claw of the Conciliator
The Demolished Man
Dune
An Early Dawn
The Faded Sun: Kesrith
The Fall of Hyperion
A Feast for Crows
Flowers for Algernon
The Forbidden Tower
Foundation and Empire
Foundation’s Edge
The Goblin Reservation
The High Crusade
Jack of Shadows
The Man in the High Castle
The Ringworld Engineers
The Robots of Dawn
A Storm of Swords
Stranger in a Strange Land
There Will Be Time
The White Dragon
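
The same preprocessing idea can be sketched in Python with a sort key that strips a leading article. This only mirrors the regex above: it compares the remaining text by casefolded code points rather than with the UCA, and `title_key` is a made-up name.

```python
import re

# Same pattern as the --preprocess example: a leading "a", "an" or "the"
# followed by whitespace, matched case-insensitively.
LEADING_ARTICLE = re.compile(r"^(an?|the)\s+", re.IGNORECASE)


def title_key(title):
    """Sort key: the title without its leading article, casefolded."""
    return LEADING_ARTICLE.sub("", title).casefold()


titles = ["The Book of Skulls", "Dune", "A Civil Campaign", "Anathem", "An Early Dawn"]
print(sorted(titles, key=title_key))
# ['Anathem', 'The Book of Skulls', 'A Civil Campaign', 'Dune', 'An Early Dawn']
```

Because the pattern requires whitespace after the article, a title such as "Anathem" is left intact.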

You will need Perl 5.10.1 or better to run the script in general. For locale support, you must either install the optional CPAN module Unicode::Collate::Locale or install a development version of Perl, 5.13+, which includes that module as standard.

Calling Conventions

This is a rapid prototype, so ucsort is mostly un(der)documented. But this is its SYNOPSIS of what switches/options it accepts on the command line:

    # standard options
    --help|?
    --man|m
    --debug|d

    # collator constructor options
    --backwards-levels=i
    --collation-level|level|l=i
    --katakana-before-hiragana
    --normalization|n=s
    --override-CJK=s
    --override-Hangul=s
    --preprocess|P=s
    --upper-before-lower|u
    --variable=s

    # program specific options
    --case-insensitive|insensitive|i
    --input-encoding|e=s
    --locale|L=s
    --paragraph|p
    --reverse-fields|last
    --reverse-output|r
    --right-to-left|reverse-input

Yeah, ok: that’s really the argument list I use for the call to Getopt::Long, but you get the idea. :)

If you can figure out how to call Perl library modules from Python directly without calling a Perl script, by all means do so. I just don’t know how myself. I’d love to learn how.

In the meantime, I believe this script will do what you need done in all its particulars, and more! I now use this for all of my text sorting. It finally does what I’ve needed for a long, long time.

The only downside is that the --locale argument causes performance to go down the tubes, although it’s plenty fast enough for regular, non-locale but still 100% UCA-compliant sorting. Since it loads everything in memory, you probably don’t want to use this on gigabyte documents. I use it many times a day, and it sure is great having sane text sorting at last.

#8


1  

To implement it you will need to read about the "Unicode collation algorithm"; see http://en.wikipedia.org/wiki/Unicode_collation_algorithm and the Unicode technical report:

http://www.unicode.org/unicode/reports/tr10/

A sample implementation is here:

http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/

#9


1  

Lately I've been using zope.ucol (https://pypi.python.org/pypi/zope.ucol) for this task. For example, sorting the German ß:

>>> import zope.ucol
>>> collator = zope.ucol.Collator("de-de")
>>> mylist = [u"a", u'x', u'\u00DF']
>>> print mylist
[u'a', u'x', u'\xdf']
>>> print sorted(mylist, key=collator.key)
[u'a', u'\xdf', u'x']

zope.ucol also wraps ICU, so it would be an alternative to PyICU.

#10


0  

Jeff Atwood wrote a good post on Natural Sort Order, in which he linked to a script that does pretty much what you ask.

It's not a trivial script, by any means, but it does the trick.

#11


0  

It is far from a complete solution for your use case, but you could take a look at the unaccent.py script from effbot.org. What it basically does is remove all accents from a text. You can use that 'sanitized' text to sort alphabetically. (For a better description see this page.)
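
The effbot script itself is not reproduced here. A comparable effect can be approximated with the standard library's unicodedata module, as in this sketch (`strip_accents` is a made-up name). Letters that have no decomposition, such as ß or ø, pass through unchanged.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["zebra", "éclair", "apple", "Älg"]
print(sorted(words, key=lambda w: strip_accents(w).casefold()))
# ['Älg', 'apple', 'éclair', 'zebra']
```

This treats ä like a, which suits French or German text better than Swedish, where ä is a separate letter sorted after z.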
