How to split text without spaces into a list of words?

Time: 2022-03-07 01:48:12

Input: "tableapplechairtablecupboard..." many words

What would be an efficient algorithm to split such text to the list of words and get:

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

The first thing that comes to mind is to go through all possible words (starting with the first letter), find the longest word possible, and then continue from position=word_position+len(word).
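
A rough sketch of that greedy idea (illustrative only; it assumes the word list mentioned in the P.S. is available as a Python set). Note that, as some of the answers below point out, a greedy longest-match can dead-end or mis-split (e.g. "tableprechaun", where taking "table" first leaves no valid word).

def greedy_split(text, dictionary):
    """Greedy longest-match split; `dictionary` is a set of all possible words."""
    max_len = max(len(w) for w in dictionary)
    words = []
    position = 0
    while position < len(text):
        # Try the longest candidate starting at `position` first, then shrink.
        for length in range(min(max_len, len(text) - position), 0, -1):
            candidate = text[position:position + length]
            if candidate in dictionary:
                words.append(candidate)
                position += length  # continue from position = word_position + len(word)
                break
        else:
            raise ValueError('no dictionary word starts at position %d' % position)
    return words

print(greedy_split('tableapplechairtablecupboard',
                   {'table', 'apple', 'chair', 'cup', 'board', 'cupboard'}))
# ['table', 'apple', 'chair', 'table', 'cupboard']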

P.S.
We have a list of all possible words.
Word "cupboard" can be "cup" and "board", select longest.
Language: python, but main thing is the algorithm itself.

11 solutions

#1


113  

A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-world text.

(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)

The idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.

The code

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

The results

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.


Optimization

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.

If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.
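
A possible way to wire up that block-wise processing (a sketch, not part of the original answer; the boundary bookkeeping is deliberately simple): run infer_spaces on each block padded with the margin, then keep only the words whose span starts inside the block's core region.

def infer_spaces_chunked(s, block=10000, margin=1000):
    """Process `s` in blocks of `block` chars with `margin` chars of context on each side."""
    out = []
    for start in range(0, len(s), block):
        lo = max(0, start - margin)
        hi = min(len(s), start + block + margin)
        words = infer_spaces(s[lo:hi]).split()
        # Walk the segmented words and keep those starting in [start, start + block).
        pos = lo
        for w in words:
            if start <= pos < start + block:
                out.append(w)
            pos += len(w)
    return " ".join(out)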

#2


16  

Based on the excellent work in the top answer, I've created a pip package for easy use.

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']

To install, run pip install wordninja.

The only differences are minor. It returns a list rather than a str, it works in Python 3, it includes the word list, and it properly splits even if there are non-alpha chars (like underscores, dashes, etc.).

Thanks again to Generic Human!

https://github.com/keredson/wordninja

#3


13  

Here is a solution using recursive search:

def find_words(instring, prefix = '', words = None):
    if not instring:
        return []
    if words is None:
        words = set()
        with open('/usr/share/dict/words') as f:
            for line in f:
                words.add(line.strip())
    if (not prefix) and (instring in words):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if prefix in words:
        try:
            solutions.append([prefix] + find_words(suffix, '', words))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix, words))
    except ValueError:
        pass
    if solutions:
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))

yields

['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']

#4


9  

Using a trie data structure, which holds the list of possible words, it would not be too complicated to do the following:

  1. Advance the pointer (in the concatenated string).
  2. Look up and store the corresponding node in the trie.
  3. If the trie node has children (i.e. there are longer words), go to 1.
  4. If the node reached has no children, a longest word match happened; add the word (stored in the node or just concatenated during the trie traversal) to the result list, reset the pointer in the trie (or reset the reference), and start over (see the sketch after this list).
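
A minimal sketch of that walk (illustrative, not from the original answer), using nested dicts as the trie and remembering the last complete word seen, so that words which are prefixes of longer words (such as "cup" inside "cupboard") are still handled:

_END = '$'  # marker meaning "a complete word ends at this node"

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root

def trie_split(text, trie):
    result = []
    i = 0
    while i < len(text):
        node, j, last_end = trie, i, None
        # Advance the pointer and walk the trie as far as it allows,
        # remembering where the last complete word ended.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if _END in node:
                last_end = j
        if last_end is None:
            raise ValueError('no dictionary word starts at position %d' % i)
        result.append(text[i:last_end])
        i = last_end  # reset and start over from the end of the matched word
    return result

trie = build_trie(['table', 'apple', 'chair', 'cup', 'board', 'cupboard'])
print(trie_split('tableapplechairtablecupboard', trie))
# ['table', 'apple', 'chair', 'table', 'cupboard']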

#5


7  

Unutbu's solution was quite close but I find the code difficult to read, and it didn't yield the expected result. Generic Human's solution has the drawback that it needs word frequencies. Not appropriate for all use cases.

Here's a simple solution using a Divide and Conquer algorithm.

  1. It tries to minimize the number of words. E.g. find_words('cupboard') will return ['cupboard'] rather than ['cup', 'board'] (assuming that cupboard, cup and board are in the dictionary).
  2. The optimal solution is not unique; the implementation below returns a solution. find_words('charactersin') could return ['characters', 'in'] or maybe it will return ['character', 'sin'] (as seen below). You could quite easily modify the algorithm to return all optimal solutions.
  3. In this implementation solutions are memoized so that it runs in a reasonable time.

The code:

words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}
def find_words(instring):
    # First check if instring is in the dictionnary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try to split the string at all position to recursively search for results
    best_solution = None
    for i in range(1, len(instring) - 1):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution

This will take about 5 seconds on my 3GHz machine:

result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print ' '.join(result)

the reis masses of text information of peoples comments which is parsed from h t m l but there are no delimited character sin them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple e t c in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction t h x a lot

#6


4  

The answer by https://*.com/users/1515832/generic-human is great. But the best implementation of this I've ever seen was written by Peter Norvig himself in his book 'Beautiful Data'.

Before I paste his code, let me expand on why Norvig's method is more accurate (although a little slower and longer in terms of code).

1) The data is a bit better - both in terms of size and in terms of precision (he uses a word count rather than a simple ranking).

2) More importantly, it's the logic behind n-grams that really makes the approach so accurate.

The example he provides in his book is the problem of splitting a string 'sitdown'. Now a non-bigram method of string split would consider p('sit') * p('down'), and if this is less than p('sitdown') - which will be the case quite often - it will NOT split it, but we'd want it to (most of the time).

However when you have the bigram model you could value p('sit down') as a bigram vs p('sitdown'), and the former wins. Basically, if you don't use bigrams, it treats the probability of the words you're splitting as independent, which is not the case: some words are more likely to appear one after the other. Unfortunately those are also the words that are often stuck together in a lot of instances and confuse the splitter.

Here's the link to the data (it's data for 3 separate problems and segmentation is only one. Please read the chapter for details): http://norvig.com/ngrams/

and here's the link to the code: http://norvig.com/ngrams/ngrams.py

These links have been up a while, but I'll copy-paste the segmentation part of the code here anyway.

import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) 
            for i in range(min(len(text), L))]

def Pwords(words): 
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key): 
        if key in self: return self[key]/self.N  
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw  = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo 
def segment2(text, prev='<S>'): 
    "Return (log P(words), words), where words is the best segmentation." 
    if not text: return 0.0, [] 
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first)) 
                  for first,rem in splits(text)] 
    return max(candidates) 

def combine(Pfirst, first, (Prem, rem)): 
    "Combine first and rem results into one (probability, words) pair." 
    return Pfirst+Prem, [first]+rem 
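
A minimal usage sketch (not from the book): the code above is Python 2 and assumes count_1w.txt and count_2w.txt from the data link have been downloaded into the working directory.

print segment('tableapplechairtablecupboard')      # unigram model: returns a list of words
print segment2('tableapplechairtablecupboard')[1]  # bigram model: returns (log P, words), so take the word list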

#7


2  

If you precompile the wordlist into a DFA (which will be very slow), then the time it takes to match an input will be proportional to the length of the string (in fact, only a little slower than just iterating over the string).

This is effectively a more general version of the trie algorithm which was mentioned earlier. I only mention it for completeness -- as of yet, there's no DFA implementation you can just use. RE2 would work, but I don't know if the Python bindings let you tune how large you allow a DFA to be before it just throws away the compiled DFA data and does NFA search.

#8


0  

It seems like fairly mundane backtracking will do. Start at the beginning of the string. Scan right until you have a word. Then, call the function on the rest of the string. The function returns "false" if it scans all the way to the right without recognizing a word. Otherwise, it returns the word it found and the list of words returned by the recursive call.

Example: "tableapple". Finds "tab", then "leap", but no word in "ple". No other word in "leapple". Finds "table", then "app". "le" is not a word, so it tries "apple", recognizes it, and returns.
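
A sketch of that backtracking in Python (illustrative; `words` is assumed to be a set of valid words, and None plays the role of the "false" return described above):

def backtrack_split(text, words):
    if not text:
        return []
    # Scan right until a word is found, then recurse on the rest of the string.
    for end in range(1, len(text) + 1):
        candidate = text[:end]
        if candidate in words:
            rest = backtrack_split(text[end:], words)
            if rest is not None:
                return [candidate] + rest
    return None  # scanned all the way right without finding a usable word

print(backtrack_split('tableapple', {'tab', 'leap', 'table', 'app', 'apple'}))
# ['table', 'apple']  -- after backing out of the dead ends 'tab'/'leap' and 'table'/'app'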

To get the longest result possible, keep going, only emitting (rather than returning) correct solutions; then, choose the optimal one by any criterion you choose (maxmax, minmax, average, etc.).

#9


0  

Based on unutbu's solution I've implemented a Java version:

private static List<String> splitWordWithoutSpaces(String instring, String suffix) {
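    // isAWord(...) is assumed to be a dictionary-membership check; it is not shown in the original answer.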
    if(isAWord(instring)) {
        if(suffix.length() > 0) {
            List<String> rest = splitWordWithoutSpaces(suffix, "");
            if(rest.size() > 0) {
                List<String> solutions = new LinkedList<>();
                solutions.add(instring);
                solutions.addAll(rest);
                return solutions;
            }
        } else {
            List<String> solutions = new LinkedList<>();
            solutions.add(instring);
            return solutions;
        }

    }
    if(instring.length() > 1) {
        String newString = instring.substring(0, instring.length()-1);
        suffix = instring.charAt(instring.length()-1) + suffix;
        List<String> rest = splitWordWithoutSpaces(newString, suffix);
        return rest;
    }
    return Collections.EMPTY_LIST;
}

Input: "tableapplechairtablecupboard"

Output: [table, apple, chair, table, cupboard]

Input: "tableprechaun"

Output: [tab, leprechaun]

#10


0  

For the German language there is CharSplit, which uses machine learning and works pretty well for strings of a few words.

https://github.com/dtuggener/CharSplit

#11


-1  

You need to identify your vocabulary - perhaps any free word list will do.

Once done, use that vocabulary to build a suffix tree, and match your stream of input against that: http://en.wikipedia.org/wiki/Suffix_tree
