How would you develop a frequency-sorted list of the ten thousand most-used words in the English language?

Time: 2022-04-15 16:06:50

I was once asked by a current employee how I would develop a frequency-sorted list of the ten thousand most-used words in the English language. Suggest a solution in the language of your choosing, though I prefer C#.

Please provide not only an implementation, but also an explanation.

Thanks

4 solutions

#1


IEnumerable<string> inputList; // input words.
var mostFrequentlyUsed = inputList.GroupBy(word => word)
  .Select(wordGroup => new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
  .OrderByDescending(word => word.Frequency);

Explanation: I don't really know if it requires further explanation, but I'll try. inputList is an array or any other collection that supplies the source words. GroupBy groups the input collection by a key, which in my code is the word itself, as specified by the lambda word => word. Each resulting group (one per distinct word) is then projected to an object with Word and Frequency properties, and the results are sorted by Frequency in descending order. You could append .Take(10000) to keep only the first 10000. The whole thing can be parallelized with .AsParallel(), provided by PLINQ. The query expression syntax might look clearer:

var mostFrequentlyUsed = 
     (from word in inputList
      group word by word into wordGroup
      select new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
     .OrderByDescending(word => word.Frequency).Take(10000);
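
The LINQ answer takes the word list as given. As a rough end-to-end sketch of the same group, count, sort, and take steps, here is a Python version that also splits raw text into words first. The function name most_frequent_words, the lowercasing, and the regular expression used for tokenizing are assumptions of this sketch, not part of the answer above.

```python
import re
from collections import Counter

def most_frequent_words(text, limit=10000):
    """Return (word, count) pairs, most frequent first."""
    # Assumption: lowercase the text and treat runs of letters and
    # apostrophes as words, so "The" and "the" count as the same word.
    words = re.findall(r"[a-z']+", text.lower())
    # Counter groups equal words and counts each group; most_common
    # sorts by count in descending order and keeps the first `limit`.
    return Counter(words).most_common(limit)

sample = "the cat and the dog and the bird"
print(most_frequent_words(sample, limit=2))  # [('the', 3), ('and', 2)]
```

How words are tokenized and normalized changes the result noticeably, so that choice probably deserves more thought than the counting itself.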

#2


As a first cut, absent further definition of the problem (just what do you mean by the most-used words in English?) -- I'd buy Google's n-gram data, intersect the 1-grams with an English dictionary, and pipe that to sort -rn -k 2 | head -10000.
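
A minimal sketch of that pipeline in Python, for comparison. It assumes the 1-gram counts have already been reduced to lines of the form word<TAB>count, which is a simplification for illustration and not the actual layout of the Google n-gram files; top_dictionary_words and the sample data are hypothetical.

```python
import heapq

def top_dictionary_words(count_lines, dictionary, limit=10000):
    """Keep only dictionary words and return the `limit` highest counts."""
    totals = {}
    for line in count_lines:
        # Assumed input format: "word<TAB>count", one entry per line.
        word, _, count = line.rstrip("\n").partition("\t")
        word = word.lower()
        if word in dictionary and count.isdigit():
            # Merge case variants such as "The" and "the".
            totals[word] = totals.get(word, 0) + int(count)
    # nlargest avoids sorting the full table when only the top is needed.
    return heapq.nlargest(limit, totals.items(), key=lambda item: item[1])

lines = ["the\t500", "The\t40", "xqzt\t900", "cat\t12"]
english = {"the", "cat", "dog"}
print(top_dictionary_words(lines, english, limit=2))  # [('the', 540), ('cat', 12)]
```

The dictionary intersection is what drops high-count tokens that are not English words, such as "xqzt" in the sample.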

#3


I would use map-reduce. This is a canonical example of a task well-suited for it. You can use Hadoop with C# via the streaming protocol. There are also other approaches. See "Is there a .NET equivalent to Apache Hadoop?" and https://*.com/questions/436686/-net-mapreduce-implementation.
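
To make the map-reduce shape concrete, here is a small in-process sketch in Python. It only imitates the map, shuffle, and reduce phases in a single process and does not use Hadoop or its streaming protocol; all function names are made up for this illustration.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word on the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_counts(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return word, sum(counts)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())

print(word_count(["the cat", "The dog and the bird"]))
```

In a real cluster the map calls would run on separate machines over separate chunks of the corpus, and the final frequency sort would be a second pass over the much smaller table of totals.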

#4


First thing to pop into my head (not syntax checked, and verbose (for Perl) for demonstrative purposes):

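#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the input is whitespace-separated words on standard input.
my @words = split ' ', join ' ', <STDIN>;

# Count how often each word occurs.
my %wordFreq;
foreach my $word (@words)
{
   $wordFreq{$word}++;
}

# Sort descending by count so the most frequent words come first.
my @mostPopularWords = sort { $wordFreq{$b} <=> $wordFreq{$a} } keys %wordFreq;
for (my $i = 0; $i < 10000 && $i < @mostPopularWords; ++$i)
{
   print "$i: $mostPopularWords[$i] ($wordFreq{$mostPopularWords[$i]} hits)\n";
}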
