Algorithm for generating a "top list" from word frequencies

Time: 2022-09-13 09:44:30

I have a big collection of human generated content. I want to find the words or phrases that occur most often. What is an efficient way to do this?

6 solutions

#1


Don't reinvent the wheel. Use a full text search engine such as Lucene.

#2


The simple/naive way is to use a hashtable. Walk through the words and increment the count as you go.

At the end of the process sort the key/value pairs by count.

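A minimal sketch of that approach in Python, assuming the input is already split into words (the function and variable names are just for illustration):

def top_words(words, n=10):
    # Hashtable (dict) from word to its count.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # Sort the key/value pairs by count, highest first, and keep the top n.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_words("the cat sat on the mat by the door".split(), n=3))
# [('the', 3), ('cat', 1), ('sat', 1)]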

#3


The basic idea is simple -- in executable pseudocode:

from collections import defaultdict

def process(words):
    # Map each word to the number of times it occurs.
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d

Of course, the devil is in the details -- how do you turn the big collection into an iterator yielding words? Is it big enough that you can't process it on a single machine, but rather need a MapReduce approach, e.g. via Hadoop? Etc., etc. NLTK can help with the linguistic aspects (isolating words in languages that don't separate them cleanly).

On a single-machine execution (i.e., without MapReduce), one issue that can arise is that the simple idea gives you far too many singletons or near-singletons (words occurring once or just a few times), which fill memory. A probabilistic fix is to do two passes: one with random sampling (take only one word in ten, or one in a hundred) to build a set of words that are candidates for the top ranks, then a second pass that skips words not in the candidate set. Depending on how many words you sample and how many you want in the result, it's possible to compute an upper bound on the probability that you'll miss an important word this way (and for reasonable numbers, and any natural language, I assure you that you'll be just fine).

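A rough sketch of that two-pass idea; make_word_iter is a hypothetical callable that returns a fresh iterator over the words each time it is called, since the corpus has to be traversed twice:

import random
from collections import defaultdict

def candidate_counts(make_word_iter, sample_rate=0.01):
    # Pass 1: sample roughly one word in a hundred to build a candidate
    # set of words that might plausibly reach the top ranks.
    candidates = set()
    for w in make_word_iter():
        if random.random() < sample_rate:
            candidates.add(w)
    # Pass 2: count only the candidate words; words that were never
    # sampled are skipped, so rare singletons never take up memory.
    counts = defaultdict(int)
    for w in make_word_iter():
        if w in candidates:
            counts[w] += 1
    return counts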

Once you have your dictionary mapping words to numbers of occurrences, you just need to pick the top N words by occurrences -- a heap queue will help there if the dictionary is too large to sort by occurrences in its entirety (in my favorite executable pseudocode, heapq.nlargest will do it, for example).

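For example, applied to the dictionary returned by process above (top_n is just an illustrative name):

import heapq

def top_n(counts, n=10):
    # Pick the n entries with the largest counts without sorting
    # the whole dictionary.
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

print(top_n(process("a b b c c c".split()), n=2))
# [('c', 3), ('b', 2)]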

#4


Look into the Apriori algorithm. It can be used to find frequent items and/or frequent sets of items.

As the Wikipedia article states, there are more efficient algorithms that do the same thing, but this could be a good start to see whether it applies to your situation.

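A rough sketch of the first two Apriori levels (frequent single words, then frequent word pairs), treating each sentence or document as one transaction; the function name and the min_support threshold are illustrative choices, not part of any library:

from collections import defaultdict
from itertools import combinations

def frequent_words_and_pairs(transactions, min_support=2):
    # Level 1: count, for each word, how many transactions contain it.
    word_counts = defaultdict(int)
    for t in transactions:
        for w in set(t):
            word_counts[w] += 1
    frequent = {w for w, c in word_counts.items() if c >= min_support}

    # Level 2 (the Apriori pruning step): a pair can only be frequent
    # if both of its words are, so candidates use frequent words only.
    pair_counts = defaultdict(int)
    for t in transactions:
        kept = sorted(set(t) & frequent)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
    return {w: word_counts[w] for w in frequent}, frequent_pairs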

#5


Maybe you can try using a PATRICIA trie (Practical Algorithm To Retrieve Information Coded In Alphanumeric)?

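For illustration, here is a plain (uncompressed) character trie used as a word counter; a true PATRICIA/radix trie would additionally collapse chains of single-child nodes, which this sketch does not do:

class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.count = 0       # how many words end at this node

def trie_count(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    return root

def trie_items(node, prefix=""):
    # Yield (word, count) for every word stored in the trie.
    if node.count:
        yield prefix, node.count
    for ch, child in node.children.items():
        yield from trie_items(child, prefix + ch)

root = trie_count("ant ant antenna bee".split())
print(sorted(trie_items(root), key=lambda kv: kv[1], reverse=True))
# [('ant', 2), ('antenna', 1), ('bee', 1)]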

#6


Why not use a simple map with the word as the key and a counter as the value? Taking the highest-valued counters gives you the top used words. Building the map is just an O(N) operation.

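One Python analogue of that word-to-counter map is collections.Counter, which is itself a map from word to count; building it is a single O(N) pass, and most_common picks out the top entries:

from collections import Counter

words = "to be or not to be".split()
counts = Counter(words)         # one O(N) pass over the words
print(counts.most_common(2))    # [('to', 2), ('be', 2)]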
