Build an ASCII chart of the most commonly used words in a given text

Time: 2023-01-14 11:37:05

The challenge:

Build an ASCII chart of the most commonly used words in a given text.

The rules:

  • Only accept a-z and A-Z (alphabetic characters) as part of a word.
  • Ignore casing (She == she for our purpose).
  • Ignore the following words (quite arbitrary, I know): the, and, of, to, a, i, it, in, or, is
  • Clarification: considering don't: this would be taken as 2 different 'words' in the ranges a-z and A-Z: (don and t).
  • Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).
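
The word rules above can be sketched in a few lines of Python. This is only an illustration, not one of the submitted solutions; the function name `count_words` and the sample sentence are made up for the example.

```python
import re
from collections import Counter

# Ignore list copied from the rules above.
IGNORED = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

def count_words(text):
    """Count runs of a-z (after lowercasing), skipping the ignored words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in IGNORED)

# "don't" splits into the two 'words' "don" and "t", as the clarification says.
counts = count_words("She said: don't! SHE is in the garden.")
print(counts.most_common())
```

Lowercasing before matching keeps the regex short, which is the same trick several of the golfed answers use.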

Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:

  • Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).
  • The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
  • Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possible differing bar and word lengths: e.g. the second most common word could be a lot longer than the first while not differing so much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).
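
One way to read the fitting rule: a row is "|" + bar + "| " + word + " ", so each word allows at most 76 - len(word) underscores, and the word with the smallest (allowed width / count) ratio fixes the scale for every bar. The sketch below assumes that reading; `render_chart` and the three-word sample are hypothetical, and exact fractions are used only to sidestep floating-point edge cases.

```python
from fractions import Fraction

def render_chart(freqs, width=80):
    """Render (word, count) pairs, sorted by descending count, as chart rows."""
    # A row is "|" + bar + "| " + word + " ", so the bar may use at most
    # width - 4 - len(word) columns. The tightest row fixes the scale.
    scale = min(Fraction(width - 4 - len(word), count) for word, count in freqs)
    bars = [(int(count * scale), word) for word, count in freqs]
    rows = [" " + "_" * bars[0][0]]          # lid that closes the first bar
    rows += ["|" + "_" * n + "| " + word + " " for n, word in bars]
    return rows

sample = [("she", 553), ("superlongstringstring", 481), ("said", 462)]
for row in render_chart(sample):
    print(row)
```

Here the long second word, not the most frequent one, limits the scale, which is exactly the case the second example in the challenge is meant to catch. Truncating each bar is one choice among several; the reference charts may round differently.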

An example:

The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).

This specific text would yield the following chart:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|______________________________________________| was 
|__________________________________________| that 
|___________________________________| as 
|_______________________________| her 
|____________________________| with 
|____________________________| at 
|___________________________| s 
|___________________________| t 
|_________________________| on 
|_________________________| all 
|______________________| this 
|______________________| for 
|______________________| had 
|_____________________| but 
|____________________| be 
|____________________| not 
|___________________| they 
|__________________| so 

For your information: these are the frequencies the above chart is built upon:

[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358), ('that', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227), ('s', 219), ('t', 218), ('on', 204), ('all', 200), ('this', 181), ('for', 179), ('had', 178), ('but', 175), ('be', 167), ('not', 166), ('they', 155), ('so', 152)]

A second example (to check if you implemented the complete spec): Replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:

 ________________________________________________________________
|________________________________________________________________| she 
|_______________________________________________________| superlongstringstring 
|_____________________________________________________| said 
|______________________________________________| alice 
|________________________________________| was 
|_____________________________________| that 
|______________________________| as 
|___________________________| her 
|_________________________| with 
|_________________________| at 
|________________________| s 
|________________________| t 
|______________________| on 
|_____________________| all 
|___________________| this 
|___________________| for 
|___________________| had 
|__________________| but 
|_________________| be 
|_________________| not 
|________________| they 
|________________| so 

The winner:

Shortest solution (by character count, per language). Have fun!


Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):

Language          Relaxed  Strict
=========         =======  ======
GolfScript          130     143
Perl                        185
Windows PowerShell  148     199
Mathematica                 199
Ruby                185     205
Unix Toolchain      194     228
Python              183     243
Clojure                     282
Scala                       311
Haskell                     333
Awk                         336
R                   298
Javascript          304     354
Groovy              321
Matlab                      404
C#                          422
Smalltalk           386
PHP                 450
F#                          452
TSQL                483     507

The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency etc). "Relaxed" means some liberties were taken to shorten the solution.

Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).

59 solutions

#1


123  

LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

The program flows from left to right:

#2


42  

Ruby 1.9, 185 chars

(heavily based on the other Ruby solutions)

w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

Instead of using any command line switches like the other solutions, you can simply pass the filename as argument. (i.e. ruby1.9 wordfrequency.rb Alice.txt)

Since I'm using character-literals here, this solution only works in Ruby 1.9.

Edit: Replaced semicolons by line breaks for "readability". :P

Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.

Edit 3: Removed the trailing space again ;)

#3


39  

GolfScript, 177 175 173 167 164 163 144 131 130 chars

Slow - 3 minutes for the sample text (130)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"|"\~1*2/0*'| '@}/

Explanation:

{           #loop through all characters
 32|.       #convert to lowercase and duplicate
 123%97<    #determine if is a letter
 n@if       #return either the letter or a newline
}%          #return an array (of ints)
]''*        #convert array to a string with magic
n%          #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa"   #push this string
2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
-           #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$           #sort the array of words
(1@         #takes the first word in the array, pushes a 1, reorders stack
            #the 1 is the current number of occurrences of the first word
{           #loop through the array
 .3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/         #gather stack into an array and split into groups of 2
{~~\;}$     #sort by the latter element - the count of occurrences of each word
22<         #take the first 22 elements
.0=~:2;     #store the highest count
,76\-:1     #store the length of the first line
'_':0*' '\@ #make the first line
{           #loop through each word
"|"\~        #start drawing the bar
1*2/0       #divide by zero
*'| '@      #finish drawing the bar
}/
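
The letter test at the start of the explanation (`32|.123%97<n@if`) is compact enough to be puzzling, so here is a Python rendering of what it appears to do for ASCII input. The function names are invented for the illustration, and this is a reading of the GolfScript rather than a port checked against an interpreter.

```python
def letter_or_newline(ch):
    """Mirror of `32|.123%97<n@if` for a single ASCII character."""
    code = ord(ch) | 32      # set bit 5: 'A'..'Z' become 'a'..'z'
    if code % 123 < 97:      # everything outside 'a'..'z' lands below 97
        return "\n"
    return chr(code)

def split_words(text):
    """Replace non-letters by newlines, then split and drop the blanks."""
    return "".join(letter_or_newline(c) for c in text).split()

print(split_words("Don't STOP {now}"))
```

The modulo step is what handles the codes 123 to 127 ('{' through DEL): they wrap around to 0..4 and fail the `< 97` test like every other non-letter.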

"Correct" (hopefully). (143)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"|"\~1*^/0*'| '@}/

Less slow - half a minute. (162)

'"'/' ':S*n/S*'"#{%q'\+".downcase.tr('^a-z','')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"|"\~1*2/0*'| '@}/

Output visible in revision logs.

#4


35  

206

shell, grep, tr, grep, sort, uniq, sort, head, perl

~ % wc -c wfg
209 wfg
~ % cat wfg
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
~ % # usage:
~ % sh wfg < 11.txt

hm, just seen above: sort -nr -> sort -n and then head -> tail => 208 :)
update2: erm, of course the above is silly, as it will be reversed then. So, 209.
update3: optimized the exclusion regexp -> 206

egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'



for fun, here's a perl-only version (much faster):

~ % wc -c pgolf
204 pgolf
~ % cat pgolf
perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
~ % # usage:
~ % sh pgolf < 11.txt

#5


35  

Transact SQL set based solution (SQL Server 2005) 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions to reduce the character count.

NB: Line breaks added to avoid scrollbars; only the last line break is required.

DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING(@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D
FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C
INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH(''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it','in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+' |'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @

Readable Version

DECLARE @  VARCHAR(MAX),
        @F REAL
SELECT @=BulkColumn
FROM   OPENROWSET(BULK'A',SINGLE_BLOB)x; /*  Loads text file from path
                                             C:\WINDOWS\system32\A  */

/*Recursive common table expression to
generate a table of numbers from 1 to string length
(and associated characters)*/
WITH N AS
     (SELECT 1 i,
             LEFT(@,1)L
     UNION ALL
     SELECT i+1,
            SUBSTRING(@,i+1,1)
     FROM   N
     WHERE  i<LEN(@)
     )
  SELECT   i,
           L,
           i-RANK()OVER(ORDER BY i)R
           /*Will group characters
           from the same word together*/
  INTO     #D
  FROM     N
  WHERE    L LIKE'[A-Z]'OPTION(MAXRECURSION 0)
             /*Assuming case insensitive accent sensitive collation*/

SELECT   TOP 22 W,
         -COUNT(*)C
INTO     #
FROM     (SELECT DISTINCT R,
                          (SELECT ''+L
                          FROM    #D
                          WHERE   R=b.R FOR XML PATH('')
                          )W
                          /*Reconstitute the word from the characters*/
         FROM             #D b
         )
         T
WHERE    LEN(W)>1
AND      W NOT IN('the',
                  'and',
                  'of' ,
                  'to' ,
                  'it' ,
                  'in' ,
                  'or' ,
                  'is')
GROUP BY W
ORDER BY C

/*Just noticed this looks risky as it relies on the order of evaluation of the
  variables. I'm not sure that's guaranteed but it works on my machine :-) */
SELECT @F=MIN(($76-LEN(W))/-C),
       @ =' '      +REPLICATE('_',-MIN(C)*@F)+' '
FROM   #

SELECT @=@+' |'+REPLICATE('_',-C*@F)+'| '+W
             FROM     #
             ORDER BY C

PRINT @

Output

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| You
|____________________________________________________________| said
|_____________________________________________________| Alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| This
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| So
|___________________| very
|__________________| what

And with the long string

 _______________________________________________________________ 
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| Alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| at
|_________________________| with
|_______________________| on
|______________________| all
|____________________| This
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| So
|________________| very
|________________| what

#6


34  

Ruby 207 213 211 210 207 203 201 200 chars

An improvement on Anurag, incorporating suggestion from rfusca. Also removes argument to sort and a few other minor golfings.

w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}

Execute as:

ruby GolfedWordFrequencies.rb < Alice.txt

Edit: put 'puts' back in, needs to be there to avoid having quotes in output.
Edit2: Changed File->IO
Edit3: removed /i
Edit4: Removed parentheses around (f*1.0), recounted
Edit5: Use string addition for the first line; expand s in-place.
Edit6: Made m float, removed 1.0. EDIT: Doesn't work, changes lengths. EDIT: No worse than before
Edit7: Use STDIN.read.

#7


28  

Mathematica ( 297 284 248 244 242 199 chars) Pure Functional

and Zipf's Law Testing

Look Mamma ... no vars, no hands, .. no head

Edit 1> some shorthands defined (284 chars)

f[x_, y_] := Flatten[Take[x, All, y]];

BarChart[f[{##}, -1],
         BarOrigin -> Left,
         ChartLabels -> Placed[f[{##}, 1], After],
         Axes -> None] & @@
Take[
  SortBy[
     Tally[
       Select[
        StringSplit[ToLowerCase[Import[i]], RegularExpression["\\W+"]],
        !MemberQ[{"the", "and", "of", "to", "a", "i", "it", "in", "or","is"}, #]&]
     ],
   Last],
 -22]

Some explanations

Import[]
   # Get The File

ToLowerCase []
   # To Lower Case :)

StringSplit[ STRING , RegularExpression["\\W+"]]
   # Split By Words, getting a LIST

Select[ LIST, !MemberQ[{LIST_TO_AVOID}, #]&]
   #  Select from LIST except those words in LIST_TO_AVOID
   #  Note that !MemberQ[{LIST_TO_AVOID}, #]& is a FUNCTION for the test

Tally[LIST]
   # Get the LIST {word,word,..}
     and produce another  {{word,counter},{word,counter}...}

SortBy[ LIST ,Last]
   # Get the list produced by Tally and sort by counters
     Note that counters are the LAST element of {word,counter}

Take[ LIST ,-22]
   # Once sorted, get the biggest 22 counters

BarChart[f[{##}, -1], ChartLabels -> Placed[f[{##}, 1], After]] &@@ LIST
   # Get the list produced by Take as input and produce a bar chart

f[x_, y_] := Flatten[Take[x, All, y]]
   # Auxiliary to get the list of the first or second element of lists of lists x_
     depending upon y
   # So f[{##}, -1] is the list of counters
   # and f[{##}, 1] is the list of words (labels for the chart)

Output

alt text http://i49.tinypic.com/2n8mrer.jpg

Mathematica is not well suited for golfing, and that is just because of the long, descriptive function names. Functions like "RegularExpression[]" or "StringSplit[]" just make me sob :(.

Zipf's Law Testing

Zipf's law predicts that for a natural language text, the Log (Rank) vs Log (occurrences) plot follows a linear relationship.

The law is used in developing algorithms for cryptography and data compression. (But it's NOT the "Z" in the LZW algorithm).
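
Put differently, if the count of the word at rank r behaves like C / r^s, then log(count) = log C - s * log(rank), a straight line with slope -s. As a rough numeric check that does not need a plot, one can fit that slope by least squares. The helper below is a hypothetical sketch (the name `zipf_slope` is not from the answer) and expects at least two positive counts.

```python
import math

def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic counts that follow count = 1000 / rank exactly.
ideal = [1000.0 / rank for rank in range(1, 51)]
print(zipf_slope(ideal))
```

For real text the points only approximate a line, so the fitted slope is a summary, not a proof that the law holds.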

In our text, we can test it with the following

f[x_, y_] := Flatten[Take[x, All, y]];
ListLogLogPlot[
    Reverse[f[{##}, -1]],
    AxesLabel -> {"Log (Rank)", "Log Counter"},
    PlotLabel -> "Testing Zipf's Law"] & @@
Take[
  SortBy[
    Tally[
       StringSplit[ToLowerCase[b], RegularExpression["\\W+"]]
    ],
   Last],
 -1000]

The result is (pretty well linear)

alt text http://i46.tinypic.com/33fcmdk.jpg

Edit 6 > (242 Chars)

Refactoring the Regex (no Select function anymore)
Dropping 1 char words
More efficient definition for function "f"

f = Flatten[Take[#1, All, #2]]&;
BarChart[
    f[{##}, -1],
    BarOrigin -> Left,
    ChartLabels -> Placed[f[{##}, 1], After],
    Axes -> None] & @@
Take[
  SortBy[
    Tally[
      StringSplit[ToLowerCase[Import[i]],
        RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]]
    ],
   Last],
 -22]

Edit 7 → 199 characters

BarChart[#2, BarOrigin->Left, ChartLabels->Placed[#1, After], Axes->None]&@@   Transpose@Take[SortBy[Tally@StringSplit[ToLowerCase@Import@i,     RegularExpression@"(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"],Last], -22]
  • Replaced f with Transpose and Slot (#1/#2) arguments.
  • We don't need no stinkin' brackets (use f@x instead of f[x] where possible)

#8


27  

C# - 510 451 436 446 434 426 422 chars (minified)

Not that short, but now probably correct! Note, the previous version did not show the first line of the bars, did not scale the bars correctly, downloaded the file instead of getting it from stdin, and did not include all the required C# verbosity. You could easily shave many strokes if C# didn't need so much extra crap. Maybe Powershell could do better.

using C=System.Console;   // alias for Console
using System.Linq;  // for Split, GroupBy, Select, OrderBy, etc.

class Class // must define a class
{
    static void Main()  // must define a Main
    {
        // split into words
        var allwords = System.Text.RegularExpressions.Regex.Split(
                // convert stdin to lowercase
                C.In.ReadToEnd().ToLower(),
                // eliminate stopwords and non-letters
                @"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
            .GroupBy(x => x)    // group by words
            .OrderBy(x => -x.Count()) // sort descending by count
            .Take(22);   // take first 22 words

        // compute length of longest bar + word
        var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));

        // prepare text to print
        var toPrint = allwords.Select(x=>
            new {
                // remember bar pseudographics (will be used in two places)
                Bar = new string('_',(int)(x.Count()/lendivisor)),
                Word=x.Key
            })
            .ToList();  // convert to list so we can index into it

        // print top of first bar
        C.WriteLine(" " + toPrint[0].Bar);
        toPrint.ForEach(x =>  // for each word, print its bar and the word
            C.WriteLine("|" + x.Bar + "| " + x.Word));
    }
}

422 chars with lendivisor inlined (which makes it 22 times slower) in the below form (newlines used for select spaces):

using System.Linq;using C=System.Console;class M{static void Main(){var
a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var
b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}

#9


25  

Perl, 237 229 209 chars

(Updated again to beat the Ruby version with more dirty golf tricks, replacing split/[^a-z/,lc with lc=~/[a-z]+/g, and eliminating a check for empty string in another place. These were inspired by the Ruby version, so credit where credit is due.)

Update: now with Perl 5.10! Replace print with say, and use ~~ to avoid a map. This has to be invoked on the command line as perl -E '<one-liner>' alice.txt. Since the entire script is on one line, writing it as a one-liner shouldn't present any difficulty :).

 @s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];

Note that this version normalizes for case. This doesn't shorten the solution any, since removing ,lc (for lower-casing) requires you to add A-Z to the split regex, so it's a wash.

If you're on a system where a newline is one character and not two, you can shorten this by another two chars by using a literal newline in place of \n. However, I haven't written the above sample that way, since it's "clearer" (ha!) that way.


Here is a mostly correct, but not remotely short enough, perl solution:

use strict;
use warnings;

my %short = map { $_ => 1 } qw/the and of to a i it in or is/;
my %count = ();

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>);
my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
my $widest = 76 - (length $sorted[0]);

print " " . ("_" x $widest) . "\n";
foreach (@sorted)
{
    my $width = int(($count{$_} / $count{$sorted[0]}) * $widest);
    print "|" . ("_" x $width) . "| $_ \n";
}

The following is about as short as it can get while remaining relatively readable. (392 chars).

%short = map { $_ => 1 } qw/the and of to a i it in or is/;
%count;

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>);
@sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
$widest = 76 - (length $sorted[0]);

print " " . "_" x $widest . "\n";
print"|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted;

#10


20  

Windows PowerShell, 199 chars

$x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
filter f($w){' '+'_'*$w
$x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
f(76..1|?{!((f $_)-match'.'*80)})[0]

(The last line break isn't necessary, but included here for readability.)

(Current code and my test files available in my SVN repository. I hope my test cases catch most common errors (bar length, problems with regex matching and a few others))

Assumptions:

  • US ASCII as input. It probably gets weird with Unicode.
  • At least two non-stop words in the text

History

Relaxed version (137), since that's counted separately by now, apparently:

($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name}
  • doesn't close the first bar
  • doesn't account for word length of non-first word

Variations in bar length of one character compared to other solutions are due to PowerShell rounding rather than truncating when it converts floating-point numbers into integers. Since the task requires only proportional bar lengths, this should be fine, though.
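
To see where a one-character difference can come from, compare truncation with rounding on two counts from the question's frequency list. The sketch assumes a 73-column top bar (76 minus the three letters of "she"); Python's `int` and `round` stand in for the two conversion styles and are not PowerShell's own routines.

```python
top_count, top_width = 553, 73   # 'she' and its assumed bar width

for word, count in [("s", 219), ("not", 166)]:
    exact = count * top_width / top_count
    # int() truncates toward zero, round() goes to the nearest integer
    print(word, exact, int(exact), round(exact))
```

Both exact values sit just below the next integer, so truncating and rounding disagree by one column.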

Compared to other solutions I took a slightly different approach in determining the longest bar length by simply trying out and taking the highest such length where no line is longer than 80 characters.

An older version explained can be found here.

#11


19  

Ruby, 215, 216, 218, 221, 224, 236, 237 chars

update 1: Hurray! It's a tie with JS Bangs' solution. Can't think of a way to cut down any more :)

update 2: Played a dirty golf trick. Changed each to map to save 1 character :)

update 3: Changed File.read to IO.read +2. Array.group_by wasn't very fruitful, changed to reduce +6. Case insensitive check is not needed after lower casing with downcase in regex +1. Sorting in descending order is easily done by negating the value +6. Total savings +15

update 4: [0] rather than .first, +3. (@Shtééf)

update 5: Expand variable l in-place, +1. Expand variable s in-place, +2. (@Shtééf)

update 6: Use string addition rather than interpolation for the first line, +2. (@Shtééf)

w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}

update 7: I went through a whole lot of hoopla to detect the first iteration inside the loop, using instance variables. All I got is +1, though perhaps there is potential. Preserving the previous version, because I believe this one is black magic. (@Shtééf)

(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}

Readable version

string = File.read($_).downcase

words = string.scan(/[a-z]+/i)
allowed_words = words - %w{the and of to a i it in or is}
sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
highest_frequency = sorted_words.first
highest_frequency_count = highest_frequency[1]
highest_frequency_word = highest_frequency[0]

word_length = highest_frequency_word.size
widest = 76 - word_length

puts " #{'_' * widest}"
sorted_words.each do |word, freq|
  width = (freq * 1.0 / highest_frequency_count) * widest
  puts "|#{'_' * width}| #{word} "
end

To use:

echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so 

#12


19  

Python 2.x, latitudinarian approach = 227 183 chars

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w

Allowing for freedom in the implementation, I constructed a string concatenation that contains all the words requested for exclusion (the, and, of, to, a, i, it, in, or, is) - plus it also excludes the two infamous "words" s and t from the example - and I threw in for free the exclusion for an, for, he. I tried all concatenations of those words against a corpus of the words from Alice, King James' Bible and the Jargon file to see if there are any words that will be mis-excluded by the string. And that is how I ended up with two exclusion strings: itheandtoforinis and andithetoforinis.
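
The trick relies on Python's `in` operator doing a substring test when both operands are strings, so one 16-character literal stands in for the whole ignore list. A small Python 3 check of that claim follows; the helper name `keep` and the sample word lists are made up for the illustration.

```python
EXCLUSION = "andithetoforinis"

def keep(word):
    """A word survives only if it is not a substring of the exclusion string."""
    return word not in EXCLUSION

required = ["the", "and", "of", "to", "a", "i", "it", "in", "or", "is"]
extras = ["s", "t", "an", "for", "he", ""]   # "" comes from re.split at the edges

print([w for w in required + extras if keep(w)])
print([w for w in ["she", "you", "said", "alice", "that"] if keep(w)])
```

The price is that every other substring of the literal (for example "nd" or "tof") is dropped as well, which is why the author checked the candidates against real word lists.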

PS. borrowed from other solutions to shorten the code.

=========================================================================== she 
================================================================= you
============================================================== said
====================================================== alice
================================================ was
============================================ that
===================================== as
================================= her
============================== at
============================== with
=========================== on
=========================== all
======================== this
======================== had
======================= but
====================== be
====================== not
===================== they
==================== so
=================== very
=================== what
================= little

Rant

Regarding words to ignore, one would think those would be taken from a list of the most used words in English. That list depends on the text corpus used. According to some of the most popular lists (http://en.wikipedia.org/wiki/Most_common_words_in_English, http://www.english-for-students.com/Frequently-Used-Words.html, http://www.sporcle.com/games/common_english_words.php), the top 10 words are: the, be (am/are/is/was/were), to, of, and, a, in, that, have, I.

The top 10 words from the Alice in Wonderland text are: the, and, to, a, of, it, she, i, you, said.
The top 10 words from the Jargon File (v4.4.7) are: the, a, of, to, and, in, is, that, or, for.

So the question is why "or" was included in the problem's ignore list, when it is only around 30th in popularity, while the word "that" (8th most used) was not, and so on. Hence I believe the ignore list should be provided dynamically (or could be omitted).

An alternative idea would be simply to skip the top 10 words from the result, which would actually shorten the solution (elementary: it only has to show the 11th to 32nd entries).
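A minimal Python 3 sketch of that "skip the top 10" idea, assuming words are runs of a-z after lowercasing. The function name chart_rows and its parameters are made up for illustration and do not come from any of the posted solutions.

```python
import re
from collections import Counter

def chart_rows(text, skip=10, rows=22):
    """Rank words by frequency, drop the `skip` most common, keep the next `rows`."""
    words = re.findall(r"[a-z]+", text.lower())
    ranked = Counter(words).most_common(skip + rows)
    return ranked[skip:]  # entries 11 to 32 with the defaults
```

No ignore list is needed at all in this variant; the price is that the result depends on which words happen to dominate the particular text.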


Python 2.x, punctilious approach = 277 243 chars

The chart drawn by the code above is simplified (it uses only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it:

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
h=min(9*l/(77-len(w))for l,w in r)
print'',9*r[0][0]/h*'_'
for l,w in r:print'|'+9*l/h*'_'+'|',w

I take issue with the somewhat random choice of the 10 words to exclude (the, and, of, to, a, i, it, in, or, is), so those are to be passed as command line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is <"Alice's Adventures in Wonderland.txt"

This is 213 chars + 30 if we account for the "original" ignore list passed on the command line = 243.

PS. The second code also does an "adjustment" for the lengths of all top words, so none of them will overflow in the degenerate case.

 _______________________________________________________________|_______________________________________________________________| she|_______________________________________________________| superlongstringstring|_____________________________________________________| said|______________________________________________| alice|_________________________________________| was|______________________________________| that|_______________________________| as|____________________________| her|__________________________| at|__________________________| with|_________________________| s|_________________________| t|_______________________| on|_______________________| all|____________________| this|____________________| for|____________________| had|____________________| but|___________________| be|___________________| not|_________________| they|_________________| so

#13


12  

Haskell - 366 351 344 337 333 characters

(One line break in main added for readability, and no line break needed at end of last line.)


import Data.List
import Data.Char
l=length
t=filter
m=map
f c|isAlpha c=toLower c|0<1=' '
h w=(-l w,head w)
x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
q?(g,w)=q*(77-l w)`div`g
b x=m(x!)x
a(l:r)=(' ':t(=='_')l):l:r
main=interact$unlines.a.b.take 22.sort.m h.group.sort
  .t(`notElem`words"the and of to a i it in or is").words.m f

How it works is best seen by reading the argument to interact backwards:


  • map f lowercases alphabetics and replaces everything else with spaces.
  • words produces a list of words, dropping the separating whitespace.
  • filter (`notElem` words "the and of to a i it in or is") discards all entries with forbidden words.
  • group . sort sorts the words, and groups identical ones into lists.
  • map h maps each list of identical words to a tuple of the form (-frequency, word).
  • take 22 . sort sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
  • b maps tuples to bars (see below).
  • a prepends the first line of underscores, to complete the topmost bar.
  • unlines joins all these lines together with newlines.

The tricky bit is getting the bar length right. I assumed that only underscores counted towards the length of the bar, so || would be a bar of zero length. The function b maps c x over x, where x is the list of histogram entries (in the code above, c is spelled as the operator ! and u as the operator ?). The entire list is passed to c, so that each invocation of c can compute the scale factor for itself by calling u. In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.

Note the trick of using -frequency. This removes the need to reverse the sort, since sorting -frequency in ascending order places the words with the largest frequency first. Later, in the function u, two -frequency values are combined (one is divided by the other), which cancels the negation out.
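The integer-only scaling described above can be sketched in Python 3 as follows. This is an illustration of the idea rather than a translation of the Haskell: bar_widths and render are hypothetical names, the counts are kept positive for readability, and the default limit of 77 simply mirrors the constant in the Haskell code.

```python
def bar_widths(entries, limit=77):
    """entries: (count, word) pairs, most frequent first.

    The width for a row with count q is the minimum, over every row (g, w),
    of q * (limit - len(w)) // g, so no floating point is involved.
    """
    return [min(q * (limit - len(w)) // g for g, w in entries)
            for q, _ in entries]

def render(entries):
    """Build the chart lines: a top rule, then one '|bar| word' line per entry."""
    widths = bar_widths(entries)
    lines = [" " + "_" * widths[0]]
    lines += ["|" + "_" * n + "| " + w for n, (_, w) in zip(widths, entries)]
    return lines
```

Because every row takes the minimum over all rows, a long word further down the list shrinks all bars together instead of overflowing its own line.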

#14


11  

JavaScript 1.8 (SpiderMonkey) - 354

x={};p='|';e=' ';z=[];c=77
while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
z=z.sort(function(a,b)b.c-a.c).slice(0,22)
for each(v in z){v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c}
for(k in z){v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

Sadly, the for([k,v]in z) from the Rhino version doesn't seem to want to work in SpiderMonkey, and readFile() is a little easier to use than readline(), but moving up to 1.8 allows us to use function closures to cut a few more lines...

Adding whitespace for readability:


x={};p='|';e=' ';z=[];c=77
while(l=readline())
  l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
   function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
  )
z=z.sort(function(a,b) b.c - a.c).slice(0,22)
for each(v in z){
  v.r=v.c/z[0].c
  c=c>(l=(77-v.w.length)/v.r)?l:c
}
for(k in z){
  v=z[k]
  s=Array(v.r*c|0).join('_')
  if(!+k)print(e+s+e)
  print(p+s+p+e+v.w)
}

Usage: js golf.js < input.txt


Output:


 _________________________________________________________________________ |_________________________________________________________________________| she|_______________________________________________________________| you|____________________________________________________________| said|____________________________________________________| alice|______________________________________________| was|___________________________________________| that|___________________________________| as|________________________________| her|_____________________________| at|_____________________________| with|____________________________| s|____________________________| t|__________________________| on|_________________________| all|_______________________| this|______________________| for|______________________| had|______________________| but|_____________________| be|_____________________| not|___________________| they|___________________| so

(base version - doesn't handle bar widths correctly)


JavaScript (Rhino) - 405 395 387 377 368 343 304 chars

I think my sorting logic is off, but... I dunno. Brainfart fixed.

Minified (abusing \n's interpreted as a ; sometimes):


x={};p='|';e=' ';z=[]
readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}
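Both JavaScript versions lean on a single regular expression with a negative lookahead to tokenize and drop the ignored words in one pass. The sketch below replays that pattern in Python 3 to show what it keeps. WORD and keep_words are hypothetical names, and the group is made non-capturing so that re.findall returns whole matches. Note that \w also matches digits and underscores, so the pattern is slightly more permissive than the a-z rule in the challenge.

```python
import re

# Match a word only if it is not, as a whole word, one of the ignored ones.
WORD = re.compile(r"\b(?!(?:the|and|of|to|a|i[tns]?|or)\b)\w+")

def keep_words(text):
    """Return the lowercased words that survive the ignore list."""
    return WORD.findall(text.lower())

print(keep_words("The cat is in a thermos, or it isn't"))
```

The trailing \b inside the lookahead is what lets "thermos" through even though it starts with "the", and i[tns]? covers i, it, in and is with one alternative.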

#15


11  

PHP CLI version (450 chars)

This solution takes into account the last requirement, which most purists have conveniently chosen to ignore. That cost 170 characters!

Usage: php.exe <this.php> <file.txt>


Minified:


<?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>

Human readable:


<?php

// Read:
$s = strtolower(file_get_contents($argv[1]));

// Split:
$a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);

// Remove unwanted words:
$a = array_filter($a, function($x){
       return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
     });

// Count:
$a = array_count_values($a);

// Sort:
arsort($a);

// Pick top 22:
$a=array_slice($a,0,22);

// Recursive function to adjust bar widths
// according to the last requirement:
function R($a,$F,$B){
    $r = array();
    foreach($a as $x=>$f){
        $l = strlen($x);
        $r[$x] = $b = $f * $B / $F;
        if ( $l + $b > 76 )
            return R($a,$f,76-$l);
    }
    return $r;
}

// Apply the function:
$c = R($a,max($a),76-strlen(key($a)));

// Output:
foreach ($a as $x => $f)
    echo '|',str_repeat('-',$c[$x]),"| $x\n";

?>

Output:


|-------------------------------------------------------------------------| she|---------------------------------------------------------------| you|------------------------------------------------------------| said|-----------------------------------------------------| alice|-----------------------------------------------| was|-------------------------------------------| that|------------------------------------| as|--------------------------------| her|-----------------------------| at|-----------------------------| with|--------------------------| on|--------------------------| all|-----------------------| this|-----------------------| for|-----------------------| had|-----------------------| but|----------------------| be|---------------------| not|--------------------| they|--------------------| so|-------------------| very|------------------| what

When there is a long word, the bars are adjusted properly:


|--------------------------------------------------------| she|---------------------------------------------------| thisisareallylongwordhere|-------------------------------------------------| you|-----------------------------------------------| said|-----------------------------------------| alice|------------------------------------| was|---------------------------------| that|---------------------------| as|-------------------------| her|-----------------------| with|-----------------------| at|--------------------| on|--------------------| all|------------------| this|------------------| for|------------------| had|-----------------| but|-----------------| be|----------------| not|---------------| they|---------------| so|--------------| very

#16


11  

Python 3.1 - 245 229 characters

I guess using Counter is kind of cheating :) I just read about it about a week ago, so this was the perfect chance to see how it works.


import re,collections
o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))

Prints out:


|____________________________________________________________________________| she|__________________________________________________________________| you|_______________________________________________________________| said|_______________________________________________________| alice|_________________________________________________| was|_____________________________________________| that|_____________________________________| as|__________________________________| her|_______________________________| with|_______________________________| at|______________________________| s|_____________________________| t|____________________________| on|___________________________| all|________________________| this|________________________| for|________________________| had|________________________| but|______________________| be|______________________| not|_____________________| they|____________________| so

Some of the code was "borrowed" from AKX's solution.


#17


11  

perl, 205 191 189 characters/ 205 characters (fully implemented)

Some parts were inspired by the earlier perl/ruby submissions, a couple of similar ideas were arrived at independently, and the others are original. The shorter version also incorporates some things I saw/learned from other submissions.

Original:


$k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s",'_'x$l;printf"|%s| $_",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];

Latest version down to 191 characters:


/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s"}@e[0,0..21]

Latest version down to 189 characters:


/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s"}@_[0,0..21]

This version (205 chars) accounts for lines further down whose words are longer than the first word, so that they do not overflow.

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s";}@e[0,0..21]

#18


10  

Perl: 203 202 201 198 195 208 203 / 231 chars

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]

Alternate, full implementation including the indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to push the line over 80 chars (this implementation is 231 chars):

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}

The specification didn't state anywhere that this had to go to STDOUT, so I used perl's warn() instead of print - four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 - might sleep on it. At least Perl's now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" char count for now ;)


PS: Reddit says "Hi" ;)


Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind the frequency count will reflect this.


Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.


Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block. Swapped out both regexes to no longer use the /i option, as it is no longer needed. Swapped the explicit conditional x?y:z construct for the traditional perlgolf || implicit conditional construct: /^...$/i?1:$x{$_}++ became /^...$/||$x{$_}++. Saved three characters! => 198, broke the 200 barrier. Might sleep soon... perhaps.

Update 4: Sleep deprivation has made me insane. Well. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c - you hear me, GolfScript?? I'm coming for you!!! sob


Update 5: Sleep dep made me forget about the 22-row limit and subsequent-line limiting. Back up to 208 with those handled. Not too bad; 13 characters to handle it isn't the end of the world. Played around with perl's regex inline eval, but had trouble getting it to both work and save chars... lol. Updated the example to match current output.

Update 6: Removed unneeded braces protecting (...)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.


Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)


Examples:


 _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|____________________________________________________________| said|_____________________________________________________| alice|_______________________________________________| was|___________________________________________| that|____________________________________| as|________________________________| her|_____________________________| with|_____________________________| at|__________________________| on|__________________________| all|_______________________| this|_______________________| for|_______________________| had|_______________________| but|______________________| be|_____________________| not|____________________| they|____________________| so|___________________| very|__________________| what

Example of the alternative implementation in the pathological case:

 _______________________________________________________________|_______________________________________________________________| she|_______________________________________________________| superlongstringstring|____________________________________________________| said|______________________________________________| alice|________________________________________| was|_____________________________________| that|_______________________________| as|____________________________| her|_________________________| with|_________________________| at|_______________________| on|______________________| all|____________________| this|____________________| for|____________________| had|____________________| but|___________________| be|__________________| not|_________________| they|_________________| so|________________| very|________________| what

#19


9  

F#, 452 chars

Straightforward: get a sequence a of word-count pairs, find the best word-count-per-column multiplier k, then print the results.

let a=
 stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
 |>Seq.map(fun s->s.ToLower())|>Seq.countBy id
 |>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
 |>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
let u n=String.replicate(int(float(n)*k)-2)"_"
printfn" %s "(u(snd(Seq.nth 0 a)))
for(w,n)in a do printfn"|%s| %s "(u n)w

Example (I have different freq counts than you, unsure why):


% app.exe < Alice.txt _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|_____________________________________________________________| said|_____________________________________________________| alice|_______________________________________________| was|___________________________________________| that|___________________________________| as|________________________________| her|_____________________________| with|_____________________________| at|____________________________| t|____________________________| s|__________________________| on|_________________________| all|_______________________| this|______________________| had|______________________| for|_____________________| but|_____________________| be|____________________| not|___________________| they|__________________| so

#20


8  

Python 2.6, 347 chars

import re
W,x={},"a and i in is it of or the to".split()
[W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
W=sorted(W.items(),key=lambda p:p[1])[:22]
bm=(76.-len(W[0][0]))/W[0][1]
U=lambda n:"_"*int(n*bm)
print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))

Output:


 _________________________________________________________________________|_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |____________________________| s |____________________________| t |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so 

#21


7  

*sh (+curl), partial solution

This is incomplete, but for the hell of it, here's the word-frequency counting half of the problem in 192 bytes:


curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22

#22


7  

Gawk -- 336 (originally 507) characters

(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops, this one broke the formatting); tweaking some more; taking up Matt's challenge, I desperately tweaked some more; found another place to save a few, but gave two back to fix the bar length bug)

Heh heh! I am momentarily ahead of Matt's JavaScript solution (counter challenge! ;) ) and AKX's Python.

The problem seems to call out for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.

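The "scan the whole map, print the biggest, delete it" loop described above looks like this when sketched in Python 3. The function name top_by_repeated_scan is hypothetical. Unlike awk, Python dicts iterate in insertion order, so ties are resolved by whichever word was seen first.

```python
def top_by_repeated_scan(counts, rows=22):
    """Pick the `rows` largest entries by scanning the whole map once per row."""
    remaining = dict(counts)  # work on a copy so the caller's map survives
    picked = []
    for _ in range(min(rows, len(remaining))):
        best = None
        for word, n in remaining.items():
            if best is None or n > remaining[best]:
                best = word
        # Emit the current maximum and remove it before the next scan.
        picked.append((best, remaining.pop(best)))
    return picked
```

Each output row costs a full pass over the remaining words, which is why the author calls the approach inefficient; it trades a sort for 22 linear scans.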

It is all terribly inefficient, and with all the golfifications I've made it has gotten to be pretty awful as well.

Minified:


{gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++}END{split("the and of to a i it in or is",b," ");for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e}for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2;t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t;print"|"t"| "x;delete a[x]}}

Line breaks are for clarity only: they are not necessary and should not be counted.


Output:


$ gawk -f wordfreq.awk.min < 11.txt  _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|____________________________________________________________| said|____________________________________________________| alice|______________________________________________| was|__________________________________________| that|___________________________________| as|_______________________________| her|____________________________| with|____________________________| at|___________________________| s|___________________________| t|_________________________| on|_________________________| all|______________________| this|______________________| for|______________________| had|_____________________| but|____________________| be|____________________| not|___________________| they|__________________| so$ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min ______________________________________________________________________|______________________________________________________________________| she|_____________________________________________________________| superlongstring|__________________________________________________________| said|__________________________________________________| alice|____________________________________________| was|_________________________________________| that|_________________________________| as|______________________________| her|___________________________| with|___________________________| at|__________________________| s|__________________________| t|________________________| on|________________________| all|_____________________| this|_____________________| for|_____________________| had|____________________| but|___________________| be|___________________| not|__________________| they|_________________| so

Readable; 633 characters (originally 949):


{
    gsub("[^a-zA-Z]"," ");
    for(;NF;NF--)
        a[tolower($NF)]++
}
END{
    # remove "short" words
    split("the and of to a i it in or is",b," ");
    for (w in b)
        delete a[b[w]];
    # Find the bar ratio
    d=1;
    for (w in a) {
        e=a[w]/(78-length(w));
        if (e>d)
            d=e
    }
    # Print the entries highest count first
    for (i=22; i; --i){
        # find the highest count
        e=0;
        for (w in a)
            if (a[w]>e)
                e=a[x=w];
        # Print the bar
        l=a[x]/d-2;
        # make a string of "_" the right length
        t=sprintf(sprintf("%%%dc",l)," ");
        gsub(" ","_",t);
        if (i==22) print" "t;
        print"|"t"| "x;
        delete a[x]
    }
}

#23


7  

Common LISP, 670 characters

I'm a LISP newbie, and this is an attempt using a hash table for counting (so probably not the most compact method).

(flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c(make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda(k v)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test'equal))(push(cons k v)y)))c)(setf y(sort y #'> :key #'cdr))(setf y(subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(-76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* n f)))(write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline)(dolist(x y)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x))))))(cond((char<= #\a x #\z)(push x w))(t(incf(gethash(concatenate 'string(reverse w))c 0))(setf w nil)))))

It can be run with, for example, cat alice.txt | clisp -C golf.lisp.

In readable form:

(flet ((r () (let ((x (read-char t nil)))
               (and x (char-downcase x)))))
  (do ((c (make-hash-table :test 'equal))  ; the word count map
       w y                                 ; current word and final word list
       (x (r) (r)))  ; iteration over all chars
       ((not x)

        ; make a list with (word . count) pairs removing stopwords
        (maphash (lambda (k v)
                   (if (not (find k '("" "the" "and" "of" "to"
                                      "a" "i" "it" "in" "or" "is")
                                  :test 'equal))
                       (push (cons k v) y)))
                 c)

        ; sort and truncate the list
        (setf y (sort y #'> :key #'cdr))
        (setf y (subseq y 0 (min (length y) 22)))

        ; find the scaling factor
        (let ((f (apply #'min
                        (mapcar (lambda (x) (/ (- 76.0 (length (car x)))
                                               (cdr x)))
                                y))))
          ; output
          (flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
             (write-char #\Space)
             (outx (cdar y))
             (write-char #\Newline)
             (dolist (x y)
               (write-char #\|)
               (outx (cdr x))
               (format t "| ~a~%" (car x))))))

       ; add alphabetic to current word, and bump word counter
       ; on non-alphabetic
       (cond
        ((char<= #\a x #\z)
         (push x w))
        (t
         (incf (gethash (concatenate 'string (reverse w)) c 0))
         (setf w nil)))))
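The character-at-a-time counting loop in the LISP code can be sketched in Python 3 as below. count_words is a hypothetical name. As in the LISP, a run of separators bumps the counter for the empty word "" (which the stop list later discards), and a word that is not followed by a separator at end of input is never flushed.

```python
from collections import defaultdict

def count_words(chars):
    """Count lowercase a-z runs in an iterable of characters."""
    counts = defaultdict(int)
    current = []
    for ch in chars:
        ch = ch.lower()
        if "a" <= ch <= "z":
            current.append(ch)                 # extend the current word
        else:
            counts["".join(current)] += 1      # may bump the empty word ""
            current = []
    return counts
```

For example, feeding it "Don't stop." counts don, t and stop once each, which matches the challenge's rule that don't splits into two words.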

#24


6  

C (828)

It looks a lot like obfuscated code, and uses glib for strings, lists and hashes. The char count with wc -m says 828. It does not consider single-char words. To calculate the max length of the bar, it considers the longest possible word among all, not only the first 22. Is this a deviation from the spec?

It does not handle failures and it does not release used memory.


#include <glib.h>
#define S(X)g_string_##X
#define H(X)g_hash_table_##X
GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);}
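On the question of scaling against the longest word in the whole text: a rough Python sketch (not a translation of the C above) suggests that it is safe but conservative. The helper name `bar_widths`, the 76-character budget and the 20-letter word are assumptions for illustration only. Every bar still fits in 80 columns, but the bars come out shorter than they need to be, which seems to conflict with the "maximize bar width" part of the rules.

```python
def bar_widths(rows, word_len, budget=76):
    """Scale bars so the most frequent word gets budget - word_len characters.

    rows is a list of (word, count) pairs sorted by descending count.
    word_len is the word length the scaling is measured against.
    """
    top = rows[0][1]
    return [count * (budget - word_len) // top for _, count in rows]


rows = [("she", 10), ("said", 5)]
# Measured against the longest word that actually appears in the chart.
print(bar_widths(rows, word_len=4))
# Measured against a hypothetical 20-letter word that occurs once in the text
# and never makes it into the chart: same proportions, shorter bars.
print(bar_widths(rows, word_len=20))
```

Neither bound is the true maximum in general; for that, each charted row has to be checked against its own word length, as several other answers do.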

#25


6  

Perl, 185 char

200 (slightly broken) 199 197 195 193 187 185 characters. Last two newlines are significant. Complies with the spec.


map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/gfor<>;
$n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
die map{$U='_'x($X{$_}/$n);" $U
"x!$z++,"|$U| $_
"}@w

First line loads counts of valid words into %X.


The second line computes minimum scaling factor so that all output lines will be <= 80 characters.


The third line (contains two newline characters) produces the output.
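For readers who don't speak golfed Perl, the same three steps can be sketched in Python. This is an illustration under assumptions, not a port: the function name `chart` and its parameters are made up here, and the stop list is spelled out explicitly, whereas the Perl regex also drops every single-letter word.

```python
import re
from collections import Counter

STOP = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}


def chart(text, top=22, width=80):
    # Step 1: count words made only of a-z, ignoring case and stop words.
    counts = Counter(w for w in re.findall(r"[a-z]+", text.lower())
                     if w not in STOP)
    rows = sorted(counts.items(), key=lambda kv: -kv[1])[:top]
    if not rows:
        return ""
    # Step 2: the smallest per-row scale, so that
    # "|" + bar + "| " + word + " " never exceeds `width` characters.
    scale = min((width - 4 - len(w)) / c for w, c in rows)
    # Step 3: a top border followed by one bar per word.
    bars = [(int(c * scale), w) for w, c in rows]
    lines = [" " + "_" * bars[0][0]]
    lines += ["|" + "_" * n + "| " + w for n, w in bars]
    return "\n".join(lines)


print(chart("She said she saw what she said. You said it too."))
```

Taking the minimum scale over all charted rows is what keeps a long but slightly less frequent word from overflowing the line.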


#26


5  

Java - 886 865 756 744 742 744 752 742 714 680 chars

  • Updates before first 742: improved regex, removed superfluous parameterized types, removed superfluous whitespace.

  • Update 742 > 744 chars: fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code (\\s in regex replaced by a space and ArrayList replaced by Vector). I'm now looking for a short way to remove the Commons IO dependency and read from stdin.

  • Update 744 > 752 chars: I removed the Commons dependency. It now reads from stdin. Paste the text in stdin and hit Ctrl+Z to get the result.

  • Update 752 > 742 chars: I removed public and a space, made the classname 1 char instead of 2, and it's now ignoring one-letter words.

  • Update 742 > 714 chars: Updated as per comments of Carl: removed redundant assignment (742 > 730), replaced m.containsKey(k) by m.get(k)!=null (730 > 728), introduced substringing of line (728 > 714).

  • Update 714 > 680 chars: Updated as per comments of Rotsor: improved bar size calculation to remove unnecessary casting and improved split() to remove unnecessary replaceAll().


import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}}

More readable version:


import java.util.*;
class F{
 public static void main(String[]a)throws Exception{
  StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));
  final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);
  List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});
  int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);
  for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);
 }
}

Output:


 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

It really sucks that Java doesn't have String#join() and closures (yet).

Edit by Rotsor:


I have made several changes to your solution:


  • Replaced List with a String[]
  • Reused the 'args' argument instead of declaring my own String array. Also used it as an argument to .toArray()
  • Replaced StringBuffer with a String (yes, yes, terrible performance)
  • Replaced Java sorting with a selection sort with early halting (only the first 22 elements have to be found)
  • Aggregated some int declarations into a single statement
  • Implemented the non-cheating algorithm that finds the most limiting line of output. Implemented it without FP (floating point).
  • Fixed the problem of the program crashing when there were fewer than 22 distinct words in the text
  • Implemented a new algorithm for reading input, which is fast and only 9 characters longer than the slow one.
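Two of the changes listed above, the selection sort with early halting and the integer-only search for the most limiting line, can be sketched in Python roughly as follows. The function names and the 76-character budget are assumptions made for this illustration; the comparison by cross-multiplication mirrors the `y*i>x*j` test in the Java code.

```python
def top_k_by_count(counts, k=22):
    """Partial selection sort: stop once the k most frequent words are placed."""
    words = list(counts)
    k = min(k, len(words))  # guards against texts with fewer than k words
    for i in range(k):
        for j in range(i + 1, len(words)):
            if counts[words[i]] < counts[words[j]]:
                words[i], words[j] = words[j], words[i]
    return words[:k]


def limiting_ratio(words, counts, budget=76):
    """Return (room, count) for the row with the smallest room/count ratio.

    Only integers are used: x/y < room/cnt is tested as y*room > x*cnt.
    """
    room, cnt = budget - len(words[0]), counts[words[0]]
    for w in words[1:]:
        x, y = budget - len(w), counts[w]
        if y * room > x * cnt:
            room, cnt = x, y
    return room, cnt


def bar(count, room, cnt):
    """Bar for a word with the given count, scaled by room/cnt."""
    return "_" * (count * room // cnt)


counts = {"she": 10, "superlongstringstring": 9, "said": 5}
words = top_k_by_count(counts)
room, cnt = limiting_ratio(words, counts)
for w in words:
    print("|" + bar(counts[w], room, cnt) + "| " + w)
```

In this example the 21-letter word is the limiting row, so it receives exactly 55 underscores and every other bar is scaled by the same ratio.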

The condensed code is 688 711 684 characters long:


import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;(j=System.in.read())>0;w+=(char)j);for(String W:w.toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(W,m.get(W)!=null?m.get(W)+1:1);l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

The fast version ( 720 693 characters)


import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

More readable version:


import java.util.*;class F{public static void main(String[]l)throws Exception{
    Map<String,Integer>m=new HashMap();String w="";
    int i=0,k=0,j=8,x,y,g=22;
    for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{
        if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";
    }}
    l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;
    for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}
    for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}
    String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');
    System.out.println(" "+s);
    for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}
}}

The version without behaviour improvements is 615 characters:


import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);for(;i<g;++i)for(j=i;++j<l.length;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}i=76-l[0].length();String s=new String(new char[i]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/m.get(l[0]))+"| "+w);}}}

#27


4  

Scala 2.8, 311 314 320 330 332 336 341 375 characters

including long word adjustment. Ideas borrowed from the other solutions.


Now as a script (a.scala):


val t="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22
def b(p:Int)="_"*(p*(for((w,c)<-t)yield(76.0-w.size)/c).min).toInt
println(" "+b(t(0)._2))
for(p<-t)printf("|%s| %s \n",b(p._2),p._1)

Run with


scala -howtorun:script a.scala alice.txt

BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).


#28


4  

Clojure 282 strict

(let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[k v]s](p \| v \| k)))

Somewhat more legibly:


(let[[[_ m]:as s](->> (slurp *in*)
                   .toLowerCase
                   (re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")
                   frequencies
                   (sort-by val >)
                   (take 22))
     [b] (sort (map #(/ (- 76 (count (key %)))(val %)) s))
     p #(do
          (print %1)
          (dotimes[_(* b %2)] (print \_))
          (apply println %&))]
  (p " " m)
  (doseq[[k v] s] (p \| v \| k)))

#29


4  

Scala, 368 chars

First, a legible version in 592 characters:


object Alice {
  def main(args:Array[String]) {
    val s = io.Source.fromFile(args(0))
    val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
    val freqs = words.foldLeft(Map[String, Int]())((countmap, word)  => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
    val sortedFreqs = freqs.toList.sort((a, b)  => a._2 > b._2)
    val top22 = sortedFreqs.take(22)
    val highestWord = top22.head._1
    val highestCount = top22.head._2
    val widest = 76 - highestWord.length
    println(" " + "_" * widest)
    top22.foreach(t => {
      val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
      println("|" + "_" * width + "| " + t._1)
    })
  }
}

The console output looks like this:


$ scalac alice.scala
$ scala Alice aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

We can do some aggressive minifying and get it down to 415 characters:


object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}}

The console session looks like this:


$ scalac a.scala
$ scala A aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

I'm sure a Scala expert could do even better.


Update: In the comments Thomas gave an even shorter version, at 368 characters:


object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}

Legibly, at 375 characters:


object Alice {
  def main(a:Array[String]) {
    val t = (Map[String, Int]() /: (
      for (
        x <- io.Source.fromFile(a(0)).getLines
        y <- "(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(x)
      ) yield y.toLowerCase
    ).toList)((c, x) => c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22)
    val w = 76 - t.head._1.length
    print (" "+"_"*w)
    t.map(s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1).foreach(print)
  }
}

#30


3  

Java - 896 chars

931 chars

1233 chars made unreadable

1977 chars "uncompressed"


Update: I have aggressively reduced the character count. Omits single-letter words per updated spec.


I envy C# and LINQ so much.


import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}

"Readable":


import java.util.*;
import java.io.*;
import static java.util.regex.Pattern.*;
class g{
   public static void main(String[] a)throws Exception
      {
      PrintStream o = System.out;
      Map<String,Integer> w = new HashMap();
      Scanner s = new Scanner(new File(a[0]))
         .useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));
      while(s.hasNext())
      {
         String z = s.next().trim().toLowerCase();
         if(z.equals(""))
            continue;
         w.put(z,(w.get(z) == null?0:w.get(z))+1);
      }
      List<Integer> v = new Vector(w.values());
      Collections.sort(v);
      List<String> q = new Vector();
      int i,m;
      i = m = v.size()-1;
      while(q.size()<22)
      {
         for(String t:w.keySet())
            if(!q.contains(t)&&w.get(t).equals(v.get(i)))
               q.add(t);
         i--;
      }
      int r = 80-q.get(0).length()-4;
      String l = String.format("%1$0"+r+"d",0).replace("0","_");
      o.println(" "+l);
      o.println("|"+l+"| "+q.get(0)+" ");
      for(i = m-1; i > m-22; i--)
      {
         o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
      }
   }
}

Output of Alice:


 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

Output of Don Quixote (also from Gutenberg):


 ________________________________________________________________________
|________________________________________________________________________| that
|________________________________________________________| he
|______________________________________________| for
|__________________________________________| his
|________________________________________| as
|__________________________________| with
|_________________________________| not
|_________________________________| was
|________________________________| him
|______________________________| be
|___________________________| don
|_________________________| my
|_________________________| this
|_________________________| all
|_________________________| they
|________________________| said
|_______________________| have
|_______________________| me
|______________________| on
|______________________| so
|_____________________| you
|_____________________| quixote

#1


123  

LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.


[image]

The program flows from left to right:


[image]

#2


42  

Ruby 1.9, 185 chars

(heavily based on the other Ruby solutions)


w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

Instead of using any command line switches like the other solutions, you can simply pass the filename as argument. (i.e. ruby1.9 wordfrequency.rb Alice.txt)


Since I'm using character-literals here, this solution only works in Ruby 1.9.


Edit: Replaced semicolons by line breaks for "readability". :P


Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.


Edit 3: Removed the trailing space again ;)


#3


39  

GolfScript, 177 175 173 167 164 163 144 131 130 chars

Slow - 3 minutes for the sample text (130)


{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"|"\~1*2/0*'| '@}/

Explanation:


{           #loop through all characters
 32|.       #convert to lowercase and duplicate
 123%97<    #determine if is a letter
 n@if       #return either the letter or a newline
}%          #return an array (of ints)
]''*        #convert array to a string with magic
n%          #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa"   #push this string
2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
-           #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$           #sort the array of words
(1@         #takes the first word in the array, pushes a 1, reorders stack
            #the 1 is the current number of occurrences of the first word
{           #loop through the array
 .3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/         #gather stack into an array and split into groups of 2
{~~\;}$     #sort by the latter element - the count of occurrences of each word
22<         #take the first 22 elements
.0=~:2;     #store the highest count
,76\-:1     #store the length of the first line
'_':0*' '\@ #make the first line
{           #loop through each word
"|"\~        #start drawing the bar
1*2/0       #divide by zero
*'| '@      #finish drawing the bar
}/
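The counting step in the explanation above (sort the words, then walk the sorted list and either bump the current count or start a new word at 1) can be sketched in Python as follows. The function name `run_length_counts` is made up for this illustration, and the sketch only covers the counting and ranking, not the drawing.

```python
def run_length_counts(words):
    """Count words by sorting them and measuring runs of equal neighbours."""
    pairs = []
    for w in sorted(words):
        if pairs and pairs[-1][0] == w:
            pairs[-1][1] += 1        # same word as before: increment its count
        else:
            pairs.append([w, 1])     # new word: start a fresh count at 1
    pairs.sort(key=lambda p: -p[1])  # most frequent first
    return [(w, c) for w, c in pairs]


print(run_length_counts(["b", "a", "b", "c", "b", "a"]))
```

Sorting first means no hash table is needed, which suits a stack language; the cost is the sort itself.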

"Correct" (hopefully). (143)


{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"|"\~1*^/0*'| '@}/

Less slow - half a minute. (162)


'"'/' ':S*n/S*'"#{%q'\+".downcase.tr('^a-z','')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"|"\~1*2/0*'| '@}/

Output visible in revision logs.


#4


35  

206

shell, grep, tr, grep, sort, uniq, sort, head, perl

~ % wc -c wfg
209 wfg
~ % cat wfg
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
~ % # usage:
~ % sh wfg < 11.txt

hm, just seen above: sort -nr -> sort -n and then head -> tail => 208 :)
update2: erm, of course the above is silly, as it will be reversed then. So, 209.
update3: optimized the exclusion regexp -> 206


egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'



for fun, here's a perl-only version (much faster):


~ % wc -c pgolf
204 pgolf
~ % cat pgolf
perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
~ % # usage:
~ % sh pgolf < 11.txt

#5


35  

Transact SQL set based solution (SQL Server 2005) 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions to reduce the character count.


NB: Line breaks were added to avoid scrollbars; only the last line break is required.

DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING(@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D
FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C
INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH(''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it','in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+'
|'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @

Readable Version


DECLARE @  VARCHAR(MAX),
        @F REAL
SELECT @=BulkColumn
FROM   OPENROWSET(BULK'A',SINGLE_BLOB)x; /*  Loads text file from path
                                             C:\WINDOWS\system32\A  */

/*Recursive common table expression to
generate a table of numbers from 1 to string length
(and associated characters)*/
WITH N AS
     (SELECT 1 i,
             LEFT(@,1)L
     UNION ALL
     SELECT i+1,
            SUBSTRING(@,i+1,1)
     FROM   N
     WHERE  i<LEN(@)
     )
  SELECT   i,
           L,
           i-RANK()OVER(ORDER BY i)R
           /*Will group characters
           from the same word together*/
  INTO     #D
  FROM     N
  WHERE    L LIKE'[A-Z]'OPTION(MAXRECURSION 0)
             /*Assuming case insensitive accent sensitive collation*/

SELECT   TOP 22 W,
         -COUNT(*)C
INTO     #
FROM     (SELECT DISTINCT R,
                          (SELECT ''+L
                          FROM    #D
                          WHERE   R=b.R FOR XML PATH('')
                          )W
                          /*Reconstitute the word from the characters*/
         FROM             #D b
         )
         T
WHERE    LEN(W)>1
AND      W NOT IN('the',
                  'and',
                  'of' ,
                  'to' ,
                  'it' ,
                  'in' ,
                  'or' ,
                  'is')
GROUP BY W
ORDER BY C

/*Just noticed this looks risky as it relies on the order of evaluation of the
  variables. I'm not sure that's guaranteed but it works on my machine :-) */
SELECT @F=MIN(($76-LEN(W))/-C),
       @ =' '      +REPLICATE('_',-MIN(C)*@F)+' '
FROM   #

SELECT @=@+'
|'+REPLICATE('_',-C*@F)+'| '+W
             FROM     #
             ORDER BY C

PRINT @
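The `i-RANK()OVER(ORDER BY i)` expression is the part that is easiest to miss: once non-letters are filtered out, consecutive letters have positions and ranks that rise together, so their difference stays constant within a word and changes at every gap. A minimal Python sketch of the same grouping idea follows; the function name `words_by_islands` is an assumption for this illustration and the code is not a translation of the T-SQL.

```python
def words_by_islands(text):
    """Split text into words by grouping letters on (position - rank)."""
    # Keep (position, char) for ASCII letters only, like the LIKE '[A-Z]' filter.
    letters = [(i, ch) for i, ch in enumerate(text, start=1)
               if ch.isascii() and ch.isalpha()]
    groups = {}
    for rank, (i, ch) in enumerate(letters, start=1):
        # Consecutive letters share the same i - rank, so the key names a word.
        groups.setdefault(i - rank, []).append(ch)
    return ["".join(chars) for chars in groups.values()]


print(words_by_islands("don't stop"))
```

As in the rules of the challenge, "don't" comes out as the two words "don" and "t".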

Output


 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| You
|____________________________________________________________| said
|_____________________________________________________| Alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| This
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| So
|___________________| very
|__________________| what

And with the long string


 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| Alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| at
|_________________________| with
|_______________________| on
|______________________| all
|____________________| This
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| So
|________________| very
|________________| what

#6


34  

Ruby 207 213 211 210 207 203 201 200 chars

An improvement on Anurag, incorporating suggestion from rfusca. Also removes argument to sort and a few other minor golfings.


w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}

Execute as:


ruby GolfedWordFrequencies.rb < Alice.txt

Edit: put 'puts' back in, needs to be there to avoid having quotes in output.
Edit2: Changed File->IO
Edit3: removed /i
Edit4: Removed parentheses around (f*1.0), recounted
Edit5: Use string addition for the first line; expand s in-place.
Edit6: Made m float, removed 1.0. EDIT: Doesn't work, changes lengths. EDIT: No worse than before
Edit7: Use STDIN.read.


#7


28  

Mathematica ( 297 284 248 244 242 199 chars) Pure Functional

and Zipf's Law Testing

Look Mamma ... no vars, no hands, .. no head


Edit 1> some shorthands defined (284 chars)


f[x_, y_] := Flatten[Take[x, All, y]];

BarChart[f[{##}, -1],
         BarOrigin -> Left,
         ChartLabels -> Placed[f[{##}, 1], After],
         Axes -> None] & @@
Take[
  SortBy[
     Tally[
       Select[
        StringSplit[ToLowerCase[Import[i]], RegularExpression["\\W+"]],
        !MemberQ[{"the", "and", "of", "to", "a", "i", "it", "in", "or","is"}, #]&]
     ],
  Last],
-22]

Some explanations


Import[]
   # Get the file

ToLowerCase []
   # To lower case :)

StringSplit[ STRING , RegularExpression["\\W+"]]
   # Split by words, getting a LIST

Select[ LIST, !MemberQ[{LIST_TO_AVOID}, #]&]
   #  Select from LIST except those words in LIST_TO_AVOID
   #  Note that !MemberQ[{LIST_TO_AVOID}, #]& is a FUNCTION for the test

Tally[LIST]
   # Get the LIST {word,word,..}
     and produce another  {{word,counter},{word,counter}...}

SortBy[ LIST ,Last]
   # Get the list produced by Tally and sort by counters
     Note that counters are the LAST element of {word,counter}

Take[ LIST ,-22]
   # Once sorted, get the biggest 22 counters

BarChart[f[{##}, -1], ChartLabels -> Placed[f[{##}, 1], After]] &@@ LIST
   # Get the list produced by Take as input and produce a bar chart

f[x_, y_] := Flatten[Take[x, All, y]]
   # Auxiliary to get the list of the first or second element of lists of lists x_
     depending upon y
   # So f[{##}, -1] is the list of counters
   # and f[{##}, 1] is the list of words (labels for the chart)

Output


alt text http://i49.tinypic.com/2n8mrer.jpg


Mathematica is not well suited for golfing, and that is just because of the long, descriptive function names. Functions like "RegularExpression[]" or "StringSplit[]" just make me sob :(.


Zipf's Law Testing

Zipf's law predicts that, for a natural-language text, the plot of Log(Rank) against Log(occurrences) follows a linear relationship.

The law is used in developing algorithms for cryptography and data compression. (But it's NOT the "Z" in the LZW algorithm.)

In our text, we can test it with the following:

f[x_, y_] := Flatten[Take[x, All, y]];
ListLogLogPlot[
    Reverse[f[{##}, -1]],
    AxesLabel -> {"Log (Rank)", "Log Counter"},
    PlotLabel -> "Testing Zipf's Law"] & @@
 Take[
  SortBy[
   Tally[
    StringSplit[ToLowerCase[b], RegularExpression["\\W+"]]
   ],
   Last], -1000]

The result is pretty close to linear:

[image: http://i46.tinypic.com/33fcmdk.jpg]
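The same check can be roughed out numerically instead of by eye. The Python below is only an illustrative sketch and not part of the answer: the helper name `zipf_slope`, the `[a-z]+` tokenizer, and the least-squares fit are all assumptions added here. It fits a straight line to log(count) against log(rank); a slope near -1 is what Zipf's law would suggest.

```python
import math
import re
from collections import Counter

def zipf_slope(text, top=1000):
    """Least-squares slope of log(count) against log(rank) (hypothetical helper)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = [c for _, c in Counter(words).most_common(top)]
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic text whose word counts are exactly 120/rank, so the fit is a
# perfect line; real text will only be approximately linear.
sample = " ".join(("w" + "x" * r + " ") * (120 // r) for r in (1, 2, 3, 4, 5, 6))
slope = zipf_slope(sample)
```

On real input you would pass the contents of the text file instead of `sample`.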

Edit 6 > (242 Chars)

Refactoring the Regex (no Select function anymore)
Dropping 1 char words
More efficient definition for function "f"


f = Flatten[Take[#1, All, #2]]&;
BarChart[
    f[{##}, -1],
    BarOrigin -> Left,
    ChartLabels -> Placed[f[{##}, 1], After],
    Axes -> None] & @@
 Take[
  SortBy[
   Tally[
    StringSplit[ToLowerCase[Import[i]],
     RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]]
   ],
   Last], -22]
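The regex refactoring folds the stop words and one-character words into the split delimiter, so no separate filtering pass is needed. A rough Python equivalent, as a sketch only: the names `DELIM` and `content_words` are made up here, and unlike Mathematica's StringSplit, Python's `re.split` keeps empty strings at the ends, so they are dropped explicitly.

```python
import re

# Delimiter: runs of non-word characters, stop words, or single characters.
# Non-capturing groups keep re.split from returning the delimiters themselves.
DELIM = re.compile(r"(?:\W|\b(?:.|the|and|of|to|i[tns]|or)\b)+")

def content_words(text):
    # Drop the empty strings re.split leaves at the start and end.
    return [w for w in DELIM.split(text.lower()) if w]

words = content_words("The cat and a dog sat in it")
```

The `\b` anchors are what stop the pattern from eating the "to" inside a longer word such as "tom".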

Edit 7 → 199 characters

BarChart[#2, BarOrigin->Left, ChartLabels->Placed[#1, After], Axes->None]&@@
  Transpose@Take[SortBy[Tally@StringSplit[ToLowerCase@Import@i,
    RegularExpression@"(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"],Last], -22]
  • Replaced f with Transpose and Slot (#1/#2) arguments.
  • We don't need no stinkin' brackets (use f@x instead of f[x] where possible)

#8


27  

C# - 510 451 436 446 434 426 422 chars (minified)

Not that short, but now probably correct! Note that the previous version did not show the first line of the bars, did not scale the bars correctly, downloaded the file instead of getting it from stdin, and did not include all the required C# verbosity. You could easily shave many strokes if C# didn't need so much extra crap. Maybe PowerShell could do better.

using C=System.Console;   // alias for Console
using System.Linq;  // for Split, GroupBy, Select, OrderBy, etc.

class Class // must define a class
{
    static void Main()  // must define a Main
    {
        // split into words
        var allwords = System.Text.RegularExpressions.Regex.Split(
                // convert stdin to lowercase
                C.In.ReadToEnd().ToLower(),
                // eliminate stopwords and non-letters
                @"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
            .GroupBy(x => x)    // group by words
            .OrderBy(x => -x.Count()) // sort descending by count
            .Take(22);   // take first 22 words

        // compute length of longest bar + word
        var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));

        // prepare text to print
        var toPrint = allwords.Select(x=>
            new {
                // remember bar pseudographics (will be used in two places)
                Bar = new string('_',(int)(x.Count()/lendivisor)),
                Word=x.Key
            })
            .ToList();  // convert to list so we can index into it

        // print top of first bar
        C.WriteLine(" " + toPrint[0].Bar);
        toPrint.ForEach(x =>  // for each word, print its bar and the word
            C.WriteLine("|" + x.Bar + "| " + x.Word));
    }
}

422 chars with lendivisor inlined (which makes it 22 times slower) in the below form (newlines used for select spaces):


using System.Linq;using C=System.Console;class M{static void Main(){var
a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var
b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}
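The lendivisor step is what handles a long word further down the list. It is easier to see in isolation, so here is a small Python sketch of the same idea; the function name `bar_widths` and the sample counts are invented for illustration, not taken from the answer.

```python
def bar_widths(freqs, limit=76.0):
    """freqs: list of (word, count) pairs, most frequent first (hypothetical helper)."""
    # Largest count-per-available-column ratio. Dividing every count by it
    # lets the tightest line use up its budget and keeps all others shorter.
    divisor = max(count / (limit - len(word)) for word, count in freqs)
    return [(word, int(count / divisor)) for word, count in freqs]

# Made-up counts: the second word is long enough to limit the scale.
widths = bar_widths([("she", 100), ("superlongstringstring", 90), ("said", 50)])
```

Because the divisor is the maximum over all 22 words, no bar plus its word can exceed the 76-column budget, whichever line turns out to be the tightest.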

#9


25  

Perl, 237 229 209 chars

(Updated again to beat the Ruby version with more dirty golf tricks, replacing split/[^a-z/,lc with lc=~/[a-z]+/g, and eliminating a check for empty string in another place. These were inspired by the Ruby version, so credit where credit is due.)


Update: now with Perl 5.10! Replace print with say, and use ~~ to avoid a map. This has to be invoked on the command line as perl -E '<one-liner>' alice.txt. Since the entire script is on one line, writing it as a one-liner shouldn't present any difficulty :).


 @s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];

Note that this version normalizes for case. This doesn't shorten the solution any, since removing ,lc (for lower-casing) requires you to add A-Z to the split regex, so it's a wash.
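The split-versus-match trade-off mentioned in the update note is not specific to Perl. A short Python sketch, with an invented sample line, shows why matching letter runs removes the need for the empty-string check:

```python
import re

line = "Don't panic!"

# Splitting on non-letters leaves an empty string wherever separators are
# adjacent or sit at the end of the line, so they have to be filtered out.
raw_split = re.split(r"[^a-z]", line.lower())
by_split = [w for w in raw_split if w]

# Matching runs of letters directly never yields an empty string.
by_match = re.findall(r"[a-z]+", line.lower())
```

Both approaches split "don't" into "don" and "t", as the challenge's clarification requires.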


If you're on a system where a newline is one character and not two, you can shorten this by another two chars by using a literal newline in place of \n. However, I haven't written the above sample that way, since it's "clearer" (ha!) that way.



Here is a mostly correct, but not remotely short enough, Perl solution:

use strict;
use warnings;

my %short = map { $_ => 1 } qw/the and of to a i it in or is/;
my %count = ();

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>);
my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
my $widest = 76 - (length $sorted[0]);

print " " . ("_" x $widest) . "\n";
foreach (@sorted)
{
    my $width = int(($count{$_} / $count{$sorted[0]}) * $widest);
    print "|" . ("_" x $width) . "| $_ \n";
}

The following is about as short as it can get while remaining relatively readable (392 chars).

%short = map { $_ => 1 } qw/the and of to a i it in or is/;
%count;
$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>);
@sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
$widest = 76 - (length $sorted[0]);
print " " . "_" x $widest . "\n";
print"|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted;

#10


20  

Windows PowerShell, 199 chars

$x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
filter f($w){' '+'_'*$w
$x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
f(76..1|?{!((f $_)-match'.'*80)})[0]

(The last line break isn't necessary, but included here for readability.)


(The current code and my test files are available in my SVN repository. I hope my test cases catch the most common errors: bar length, problems with regex matching, and a few others.)

Assumptions:


  • US ASCII as input. It probably gets weird with Unicode.
  • At least two non-stop words in the text

History


Relaxed version (137), since that's counted separately by now, apparently:


($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name}
  • doesn't close the first bar
  • doesn't account for word length of non-first word

Bar lengths that differ by one character from other solutions are due to PowerShell rounding rather than truncating when it converts floating-point numbers to integers. Since the task only requires proportional bar lengths, this should be fine.
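To make the one-character difference concrete, here is the arithmetic in Python with made-up numbers (a word with 63 occurrences against a maximum of 100, scaled to 76 columns). Python's `int` and `round` stand in here for truncating and rounding conversions in general; this is an illustration, not PowerShell's exact conversion rule.

```python
count, top, width = 63, 100, 76

exact = width * count / top   # 47.88 columns before conversion
truncated = int(exact)        # drops the fractional part
rounded = round(exact)        # goes to the nearest integer
```

The two conversions land one column apart, which is the discrepancy described above.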

Compared to other solutions I took a slightly different approach in determining the longest bar length by simply trying out and taking the highest such length where no line is longer than 80 characters.
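The trial-and-error approach can be sketched in Python as follows. This is a hedged illustration rather than a port: the helper names `render` and `widest_fit` are invented, the bars use integer floor division instead of PowerShell's rounding, and the limit is taken as 80 characters including the trailing space.

```python
def render(freqs, width):
    """Chart lines for a given top-bar width (hypothetical helper)."""
    top = freqs[0][1]
    lines = [" " + "_" * width]
    for word, count in freqs:
        lines.append("|" + "_" * (width * count // top) + "| " + word + " ")
    return lines

def widest_fit(freqs, limit=80):
    # Try the largest candidate first and stop at the first chart that fits.
    for width in range(76, 0, -1):
        lines = render(freqs, width)
        if all(len(line) <= limit for line in lines):
            return width, lines
    raise ValueError("no width fits")

# Made-up counts with a long second word that forces a narrower chart.
width, chart = widest_fit([("she", 100), ("superlongstringstring", 90), ("said", 50)])
```

The search costs up to 76 full renders, which is irrelevant for 22 words and avoids having to derive the scale factor analytically.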


An older version explained can be found here.


#11


19  

Ruby, 215, 216, 218, 221, 224, 236, 237 chars

update 1: Hurray! It's a tie with JS Bangs' solution. Can't think of a way to cut down any more :)


update 2: Played a dirty golf trick. Changed each to map to save 1 character :)


update 3: Changed File.read to IO.read +2. Array.group_by wasn't very fruitful, changed to reduce +6. Case insensitive check is not needed after lower casing with downcase in regex +1. Sorting in descending order is easily done by negating the value +6. Total savings +15


update 4: [0] rather than .first, +3. (@Shtééf)


update 5: Expand variable l in-place, +1. Expand variable s in-place, +2. (@Shtééf)


update 6: Use string addition rather than interpolation for the first line, +2. (@Shtééf)


w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}

update 7: I went through a whole lot of hoopla to detect the first iteration inside the loop, using instance variables. All I got is +1, though perhaps there is potential. Preserving the previous version, because I believe this one is black magic. (@Shtééf)


(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}

Readable version


string = File.read($_).downcase

words = string.scan(/[a-z]+/i)
allowed_words = words - %w{the and of to a i it in or is}
sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
highest_frequency = sorted_words.first
highest_frequency_count = highest_frequency[1]
highest_frequency_word = highest_frequency[0]

word_length = highest_frequency_word.size
widest = 76 - word_length

puts " #{'_' * widest}"
sorted_words.each do |word, freq|
  width = (freq * 1.0 / highest_frequency_count) * widest
  puts "|#{'_' * width}| #{word} "
end

To use:


echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb

Output:


 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

#12


19  

Python 2.x, latitudinarian approach = 227 183 chars

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w

Allowing for freedom in the implementation, I constructed a single concatenated string that contains all the words requested for exclusion (the, and, of, to, a, i, it, in, or, is). It also excludes the two infamous "words" s and t from the example, and I threw in the exclusion of an, for, and he for free. I tried all concatenations of those words against a corpus of the words from Alice, the King James Bible, and the Jargon File to see whether any words would be mis-excluded by the string. That is how I ended up with two exclusion strings: itheandtoforinis and andithetoforinis.
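The trick relies on Python's `in` operator doing a substring test when both operands are strings. A small sketch makes the behaviour explicit; the names `STOP` and `excluded` are invented here, while the exclusion string itself is the one from the answer.

```python
STOP = "andithetoforinis"

def excluded(word):
    # Substring test: true for any word that appears anywhere inside STOP.
    return word in STOP

required = "the and of to a i it in or is".split()
extras = ["s", "t", "an", "for", "he"]
kept = [w for w in ["she", "alice", "said", "that"] if not excluded(w)]
```

The empty string is a substring of everything, so any empty tokens produced by `re.split` are dropped by the same test.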

PS. borrowed from other solutions to shorten the code.


=========================================================================== she
================================================================= you
============================================================== said
====================================================== alice
================================================ was
============================================ that
===================================== as
================================= her
============================== at
============================== with
=========================== on
=========================== all
======================== this
======================== had
======================= but
====================== be
====================== not
===================== they
==================== so
=================== very
=================== what
================= little

Rant

Regarding words to ignore, one would think those would be taken from list of the most used words in English. That list depends on the text corpus used. Per one of the most popular lists (http://en.wikipedia.org/wiki/Most_common_words_in_English, http://www.english-for-students.com/Frequently-Used-Words.html, http://www.sporcle.com/games/common_english_words.php), top 10 words are: the be(am/are/is/was/were) to of and a in that have I


The top 10 words from the Alice in Wonderland text are the and to a of it she i you said
The top 10 words from the Jargon File (v4.4.7) are the a of to and in is that or for


So the question is why "or" was included in the problem's ignore list, when it is only about 30th in popularity, while the word "that" (8th most used) was not, and so on. Hence I believe the ignore list should be provided dynamically (or could be omitted).

An alternative idea would be simply to skip the top 10 words from the result, which would actually shorten the solution (elementary: show only the 11th to 32nd entries).
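A sketch of that alternative in Python 3, assuming `collections.Counter` for the counting; the function name `chart_words` and the synthetic sample text are illustrative only.

```python
import re
from collections import Counter

def chart_words(text, skip=10, keep=22):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # most_common sorts by descending count; slice off the first `skip` entries.
    return counts.most_common(skip + keep)[skip:]

# Synthetic text: word number i ("w" followed by i x's) occurs 40 - i times.
sample = " ".join(("w" + "x" * i + " ") * (40 - i) for i in range(40))
result = chart_words(sample)
```

No ignore list appears anywhere in the code, which is the point of the suggestion.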


Python 2.x, punctilious approach = 277 243 chars

The chart drawn in the above code is simplified (using only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it:


import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
h=min(9*l/(77-len(w))for l,w in r)
print'',9*r[0][0]/h*'_'
for l,w in r:print'|'+9*l/h*'_'+'|',w

I take issue with the somewhat random choice of the 10 words to exclude (the, and, of, to, a, i, it, in, or, is), so they are to be passed as command-line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is <"Alice's Adventures in Wonderland.txt"

This is 213 chars + 30 if we account for the "original" ignore list passed on command line = 243


PS. The second code also does "adjustment" for the lengths of all top words, so none of them will overflow in degenerate case.


 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|_________________________________________| was
|______________________________________| that
|_______________________________| as
|____________________________| her
|__________________________| at
|__________________________| with
|_________________________| s
|_________________________| t
|_______________________| on
|_______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|___________________| not
|_________________| they
|_________________| so

#13


12  

Haskell - 366 351 344 337 333 characters

(One line break in main added for readability, and no line break needed at end of last line.)


import Data.List
import Data.Char
l=length
t=filter
m=map
f c|isAlpha c=toLower c|0<1=' '
h w=(-l w,head w)
x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
q?(g,w)=q*(77-l w)`div`g
b x=m(x!)x
a(l:r)=(' ':t(=='_')l):l:r
main=interact$unlines.a.b.take 22.sort.m h.group.sort
  .t(`notElem`words"the and of to a i it in or is").words.m f

How it works is best seen by reading the argument to interact backwards:


  • map f lowercases alphabetics, replaces everything else with spaces.
  • words produces a list of words, dropping the separating whitespace.
  • filter (`notElem` words "the and of to a i it in or is") discards all entries with forbidden words.
  • group . sort sorts the words, and groups identical ones into lists.
  • map h maps each list of identical words to a tuple of the form (-frequency, word).
  • take 22 . sort sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
  • b maps tuples to bars (see below).
  • a prepends the first line of underscores, to complete the topmost bar.
  • unlines joins all these lines together with newlines.

The tricky bit is getting the bar length right. I assumed that only underscores count towards the length of the bar, so || would be a bar of zero length. The function b maps c x over x, where x is the list of histograms. The entire list is passed to c, so that each invocation of c can compute the scale factor for itself by calling u. (In the code as listed above, c and u appear as the operators ! and ?.) In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.
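The integer-only scaling can be sketched in Python, using positive counts for clarity. The helper names `bar_lengths` and `scaled` and the sample data are assumptions made for illustration, and the budget of 77 mirrors the constant in the Haskell code.

```python
def bar_lengths(freqs, budget=77):
    """freqs: list of (count, word) pairs (hypothetical helper)."""
    def scaled(count, other_count, other_word):
        # Length this bar would get if the other line were the one that
        # exactly fills the budget. Integer arithmetic only.
        return count * (budget - len(other_word)) // other_count
    # Each bar takes the smallest length any line would allow it.
    return [(min(scaled(c, oc, ow) for oc, ow in freqs), w) for c, w in freqs]

# Made-up counts with a long second word.
bars = bar_lengths([(100, "she"), (90, "superlongstringstring"), (50, "said")])
```

Taking the minimum over every line is quadratic in the number of words, but with only 22 entries that cost is negligible.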

Note the trick of using -frequency. This removes the need to reverse the sort, since sorting -frequency in ascending order places the words with the largest frequency first. Later, in the function u, two -frequency values are multiplied, which cancels the negation out.
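The negated-frequency trick carries over directly to other languages. A Python sketch with an invented word list, assuming `itertools.groupby` as the stand-in for Haskell's group:

```python
from itertools import groupby

words = "said she you she said she".split()

# Group equal words, then key each group by its negated size so that a plain
# ascending sort lists the most frequent word first.
pairs = sorted((-len(list(g)), w) for w, g in groupby(sorted(words)))

# Combining two negated counts in a ratio cancels the sign again.
second_bar = pairs[1][0] * 76 // pairs[0][0]
```

Ties are broken alphabetically for free, because the word is the second element of each tuple.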

#14


11  

JavaScript 1.8 (SpiderMonkey) - 354

x={};p='|';e=' ';z=[];c=77
while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
z=z.sort(function(a,b)b.c-a.c).slice(0,22)
for each(v in z){v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c}
for(k in z){v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

Sadly, the for([k,v]in z) from the Rhino version doesn't seem to want to work in SpiderMonkey, and readFile() is a little easier than using readline() but moving up to 1.8 allows us to use function closures to cut a few more lines....


Adding whitespace for readability:


x={};p='|';e=' ';z=[];c=77
while(l=readline())
  l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
   function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
  )
z=z.sort(function(a,b) b.c - a.c).slice(0,22)
for each(v in z){
  v.r=v.c/z[0].c
  c=c>(l=(77-v.w.length)/v.r)?l:c
}
for(k in z){
  v=z[k]
  s=Array(v.r*c|0).join('_')
  if(!+k)print(e+s+e)
  print(p+s+p+e+v.w)
}

Usage: js golf.js < input.txt


Output:


 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| for
|______________________| had
|______________________| but
|_____________________| be
|_____________________| not
|___________________| they
|___________________| so

(base version - doesn't handle bar widths correctly)


JavaScript (Rhino) - 405 395 387 377 368 343 304 chars

I think my sorting logic is off, but.. I duno. Brainfart fixed.


Minified (abusing \n's interpreted as a ; sometimes):


x={};p='|';e=' ';z=[]
readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

#15


11  

PHP CLI version (450 chars)

This solution takes into account the last requirement, which most purists have conveniently chosen to ignore. That cost 170 characters!

Usage: php.exe <this.php> <file.txt>


Minified:


<?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>

Human readable:


<?php

// Read:
$s = strtolower(file_get_contents($argv[1]));

// Split:
$a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);

// Remove unwanted words:
$a = array_filter($a, function($x){
       return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
     });

// Count:
$a = array_count_values($a);

// Sort:
arsort($a);

// Pick top 22:
$a=array_slice($a,0,22);

// Recursive function to adjust bar widths
// according to the last requirement:
function R($a,$F,$B){
    $r = array();
    foreach($a as $x=>$f){
        $l = strlen($x);
        $r[$x] = $b = $f * $B / $F;
        if ( $l + $b > 76 )
            return R($a,$f,76-$l);
    }
    return $r;
}

// Apply the function:
$c = R($a,max($a),76-strlen(key($a)));

// Output:
foreach ($a as $x => $f)
    echo '|',str_repeat('-',$c[$x]),"| $x\n";

?>

Output:


|-------------------------------------------------------------------------| she
|---------------------------------------------------------------| you
|------------------------------------------------------------| said
|-----------------------------------------------------| alice
|-----------------------------------------------| was
|-------------------------------------------| that
|------------------------------------| as
|--------------------------------| her
|-----------------------------| at
|-----------------------------| with
|--------------------------| on
|--------------------------| all
|-----------------------| this
|-----------------------| for
|-----------------------| had
|-----------------------| but
|----------------------| be
|---------------------| not
|--------------------| they
|--------------------| so
|-------------------| very
|------------------| what

When there is a long word, the bars are adjusted properly:


|--------------------------------------------------------| she
|---------------------------------------------------| thisisareallylongwordhere
|-------------------------------------------------| you
|-----------------------------------------------| said
|-----------------------------------------| alice
|------------------------------------| was
|---------------------------------| that
|---------------------------| as
|-------------------------| her
|-----------------------| with
|-----------------------| at
|--------------------| on
|--------------------| all
|------------------| this
|------------------| for
|------------------| had
|-----------------| but
|-----------------| be
|----------------| not
|---------------| they
|---------------| so
|--------------| very

#16


11  

Python 3.1 - 245 229 characters

I guess using Counter is kind of cheating :) I just read about it about a week ago, so this was the perfect chance to see how it works.


import re,collections
o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))

Prints out:


|____________________________________________________________________________| she
|__________________________________________________________________| you
|_______________________________________________________________| said
|_______________________________________________________| alice
|_________________________________________________| was
|_____________________________________________| that
|_____________________________________| as
|__________________________________| her
|_______________________________| with
|_______________________________| at
|______________________________| s
|_____________________________| t
|____________________________| on
|___________________________| all
|________________________| this
|________________________| for
|________________________| had
|________________________| but
|______________________| be
|______________________| not
|_____________________| they
|____________________| so

Some of the code was "borrowed" from AKX's solution.


#17


11  

perl, 205 191 189 characters/ 205 characters (fully implemented)

Some parts were inspired by the earlier perl/ruby submissions, a couple similar ideas were arrived at independently, the others are original. Shorter version also incorporates some things I saw/learned from other submissions.


Original:


$k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s",'_'x$l;printf"|%s| $_",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];

Latest version down to 191 characters:


/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s"}@e[0,0..21]

Latest version down to 189 characters:


/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s"}@_[0,0..21]

This version (205 chars) also accounts for lines whose words are long enough to need a smaller scale than the first line allows.

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s";}@e[0,0..21]

#18


10  

Perl: 203 202 201 198 195 208 203 / 231 chars

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]

Alternate, full implementation including indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to combine to over 80 chars (this implementation is 231 chars):


$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}

The specification didn't state anywhere that this had to go to STDOUT, so I used perl's warn() instead of print - four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 - might sleep on it. At least Perl's now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" char count for now ;)


PS: Reddit says "Hi" ;)


Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind the frequency count will reflect this.


Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.


Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block. Swapped out both regexes to no longer use the /i option, as it is no longer needed. Swapped the explicit conditional x?y:z construct for the traditional perlgolf || implicit conditional construct: /^...$/i?1:$x{$_}++ for /^...$/||$x{$_}++. Saved three characters! => 198, broke the 200 barrier. Might sleep soon... perhaps.

Update 4: Sleep deprivation has made me insane. Well. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c - you hear me, GolfScript?? I'm coming for you!!! sob


Update 5: Sleep deprivation made me forget about the 22-row limit and the subsequent-line limiting. Back up to 208 with those handled. Not too bad; 13 characters to handle it isn't the end of the world. Played around with Perl's regex inline eval, but I'm having trouble getting it to both work and save chars... lol. Updated the example to match current output.

Update 6: Removed unneeded braces protecting (...)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.


Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)


Examples:


 _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|____________________________________________________________| said|_____________________________________________________| alice|_______________________________________________| was|___________________________________________| that|____________________________________| as|________________________________| her|_____________________________| with|_____________________________| at|__________________________| on|__________________________| all|_______________________| this|_______________________| for|_______________________| had|_______________________| but|______________________| be|_____________________| not|____________________| they|____________________| so|___________________| very|__________________| what

Alternative implementation in pathological case example:


 _______________________________________________________________|_______________________________________________________________| she|_______________________________________________________| superlongstringstring|____________________________________________________| said|______________________________________________| alice|________________________________________| was|_____________________________________| that|_______________________________| as|____________________________| her|_________________________| with|_________________________| at|_______________________| on|______________________| all|____________________| this|____________________| for|____________________| had|____________________| but|___________________| be|__________________| not|_________________| they|_________________| so|________________| very|________________| what

#19


9  

F#, 452 chars

Straightforward: get a sequence a of word-count pairs, find the best scaling multiplier k (columns per occurrence), then print the results.

let a=
 stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
 |>Seq.map(fun s->s.ToLower())|>Seq.countBy id
 |>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
 |>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
let u n=String.replicate(int(float(n)*k)-2)"_"
printfn" %s "(u(snd(Seq.nth 0 a)))
for(w,n)in a do printfn"|%s| %s "(u n)w

Example (I have different freq counts than you, unsure why):


% app.exe < Alice.txt _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|_____________________________________________________________| said|_____________________________________________________| alice|_______________________________________________| was|___________________________________________| that|___________________________________| as|________________________________| her|_____________________________| with|_____________________________| at|____________________________| t|____________________________| s|__________________________| on|_________________________| all|_______________________| this|______________________| had|______________________| for|_____________________| but|_____________________| be|____________________| not|___________________| they|__________________| so
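The "find the best multiplier k" step can be sketched in Python roughly as follows. This is only an illustration, not the author's code: the names scale_factor and bar_width and the sample counts are made up, while the constants 78 and 2 mirror the F# snippet above (a 78-column budget, minus 2 columns for the two '|' characters).

```python
def scale_factor(pairs, budget=78):
    """Largest columns-per-occurrence multiplier that keeps every line in budget."""
    return min((budget - len(word)) / count for word, count in pairs)


def bar_width(count, k):
    # Two columns are reserved for the '|' on either side of the bar.
    return int(count * k) - 2


# Hypothetical counts: the second word is long, so it limits the scale.
pairs = [("she", 100), ("superlongstringstring", 90), ("said", 50)]
k = scale_factor(pairs)
for word, count in pairs:
    line = "|" + "_" * bar_width(count, k) + "| " + word + " "
    assert len(line) <= 80  # bar + space + word + space always fits
    print(line)
```

Taking the minimum over all displayed words, rather than looking only at the most frequent one, is what keeps a long second-place word from overflowing the 80 columns.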

#20


8  

Python 2.6, 347 chars

import re
W,x={},"a and i in is it of or the to".split()
[W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
W=sorted(W.items(),key=lambda p:p[1])[:22]
bm=(76.-len(W[0][0]))/W[0][1]
U=lambda n:"_"*int(n*bm)
print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))

Output:


 _________________________________________________________________________|_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |____________________________| s |____________________________| t |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so 

#21


7  

*sh (+curl), partial solution

This is incomplete, but for the hell of it, here's the word-frequency counting half of the problem in 192 bytes:


curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22
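For comparison, the same counting half can be sketched in Python. This is a rough equivalent under my own assumptions, not a translation of the pipeline above: the function name top_words and the sample sentence are hypothetical, and the text is passed in as a string instead of being fetched with curl.

```python
import re
from collections import Counter

# The ignore list from the challenge rules.
STOP_WORDS = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}


def top_words(text, limit=22):
    """Return the `limit` most frequent (word, count) pairs, skipping stop words."""
    # Only runs of a-z count as words, so "don't" becomes "don" and "t".
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(limit)


sample = "She said the cat is hers. Don't you see? She said so."
print(top_words(sample, limit=3))
```

Like the shell version, this stops at the counts; drawing the bars is a separate step.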

#22


7  

Gawk -- 336 (originally 507) characters

(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops, this one broke the formatting); tweaking some more; taking up Matt's challenge, I desperately tweaked some more; found another place to save a few, but gave two back to fix the bar length bug)

Heh heh! I am momentarily ahead of Matt's JavaScript solution (counter challenge!) ;) and AKX's Python.

The problem seems to call out for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.


It is all terribly inefficient, and with all the golfifications I've made it has gotten to be pretty awful as well.
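The scan-print-delete loop described above looks roughly like this in Python. It is a sketch of the idea only, not a port of the awk program: the function name take_top_by_scanning is made up, and it returns the pairs instead of printing bars.

```python
def take_top_by_scanning(counts, limit=22):
    """Return up to `limit` (word, count) pairs, highest count first."""
    remaining = dict(counts)  # copy, so the caller's map is left intact
    ordered = []
    while remaining and len(ordered) < limit:
        # One full pass over the map to find the current maximum.
        best = max(remaining, key=remaining.get)
        # Emit it, then delete it so the next pass finds the runner-up.
        ordered.append((best, remaining.pop(best)))
    return ordered


print(take_top_by_scanning({"she": 5, "said": 3, "alice": 4}, limit=2))
```

Each of the 22 rounds walks the whole map, which is the inefficiency the answer complains about; for only 22 rows it is still cheap compared with reading the text.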

Minified:


{gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++}END{split("the and of to a i it in or is",b," ");for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e}for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2;t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t;print"|"t"| "x;delete a[x]}}

line breaks for clarity only: they are not necessary and should not be counted.



Output:


$ gawk -f wordfreq.awk.min < 11.txt  _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|____________________________________________________________| said|____________________________________________________| alice|______________________________________________| was|__________________________________________| that|___________________________________| as|_______________________________| her|____________________________| with|____________________________| at|___________________________| s|___________________________| t|_________________________| on|_________________________| all|______________________| this|______________________| for|______________________| had|_____________________| but|____________________| be|____________________| not|___________________| they|__________________| so$ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min ______________________________________________________________________|______________________________________________________________________| she|_____________________________________________________________| superlongstring|__________________________________________________________| said|__________________________________________________| alice|____________________________________________| was|_________________________________________| that|_________________________________| as|______________________________| her|___________________________| with|___________________________| at|__________________________| s|__________________________| t|________________________| on|________________________| all|_____________________| this|_____________________| for|_____________________| had|____________________| but|___________________| be|___________________| not|__________________| they|_________________| so

Readable; 633 characters (originally 949):


{
    gsub("[^a-zA-Z]"," ");
    for(;NF;NF--)
        a[tolower($NF)]++
}
END{
    # remove "short" words
    split("the and of to a i it in or is",b," ");
    for (w in b)
        delete a[b[w]];
    # Find the bar ratio
    d=1;
    for (w in a) {
        e=a[w]/(78-length(w));
        if (e>d)
            d=e
    }
    # Print the entries highest count first
    for (i=22; i; --i){
        # find the highest count
        e=0;
        for (w in a)
            if (a[w]>e)
                e=a[x=w];
        # Print the bar
        l=a[x]/d-2;
        # make a string of "_" the right length
        t=sprintf(sprintf("%%%dc",l)," ");
        gsub(" ","_",t);
        if (i==22) print" "t;
        print"|"t"| "x;
        delete a[x]
    }
}

#23


7  

Common LISP, 670 characters

I'm a LISP newbie, and this is an attempt using a hash table for counting (so probably not the most compact method).

(flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c(make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda(k v)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test'equal))(push(cons k v)y)))c)(setf y(sort y #'> :key #'cdr))(setf y(subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(-76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* n f)))(write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline)(dolist(x y)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x))))))(cond((char<= #\a x #\z)(push x w))(t(incf(gethash(concatenate 'string(reverse w))c 0))(setf w nil)))))

It can be run, for example, with cat alice.txt | clisp -C golf.lisp.

In readable form:

(flet ((r () (let ((x (read-char t nil)))
               (and x (char-downcase x)))))
  (do ((c (make-hash-table :test 'equal))  ; the word count map
       w y                                 ; current word and final word list
       (x (r) (r)))  ; iteration over all chars
       ((not x)
        ; make a list with (word . count) pairs removing stopwords
        (maphash (lambda (k v)
                   (if (not (find k '("" "the" "and" "of" "to"
                                      "a" "i" "it" "in" "or" "is")
                                  :test 'equal))
                       (push (cons k v) y)))
                 c)
        ; sort and truncate the list
        (setf y (sort y #'> :key #'cdr))
        (setf y (subseq y 0 (min (length y) 22)))
        ; find the scaling factor
        (let ((f (apply #'min
                        (mapcar (lambda (x) (/ (- 76.0 (length (car x)))
                                               (cdr x)))
                                y))))
          ; output
          (flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
             (write-char #\Space)
             (outx (cdar y))
             (write-char #\Newline)
             (dolist (x y)
               (write-char #\|)
               (outx (cdr x))
               (format t "| ~a~%" (car x))))))
       ; add alphabetic to current word, and bump word counter
       ; on non-alphabetic
       (cond
        ((char<= #\a x #\z)
         (push x w))
        (t
         (incf (gethash (concatenate 'string (reverse w)) c 0))
         (setf w nil)))))

#24


6  

C (828)

It looks a lot like obfuscated code, and uses glib for strings, lists and hashes. The char count with wc -m is 828. It does not consider single-char words. To calculate the max length of the bar, it considers the longest possible word among all words, not only the first 22. Is this a deviation from the spec?

It does not handle failures and it does not release used memory.

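To see what that choice does to the chart, here is a small Python comparison. It is a sketch under my own assumptions, not a port of the C program: the function names and sample counts are hypothetical, 77 mirrors the (77-z) term in the code below, and 76 is the budget several other answers use.

```python
from fractions import Fraction


def widths_longest_overall(counts, top, line=77):
    """Bar widths when the longest word in the whole text sets the budget."""
    z = max(len(word) for word in counts)
    m = counts[top[0]]  # count of the most frequent word
    return [counts[w] * (line - z) // m for w in top]


def widths_top_only(counts, top, budget=76):
    """Bar widths when only the displayed words constrain the scale."""
    k = min(Fraction(budget - len(w), counts[w]) for w in top)
    return [int(counts[w] * k) for w in top]


# Hypothetical text: one rare long word that never makes the chart.
counts = {"she": 100, "said": 50, "uncomfortable": 1}
top = ["she", "said"]
print(widths_longest_overall(counts, top))
print(widths_top_only(counts, top))
```

Both variants keep every line within 80 columns; the difference is that a long word outside the chart shortens all the bars in the first one, so whether it counts as a deviation depends on how strictly "maximize bar width" is read.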

#include <glib.h>
#define S(X)g_string_##X
#define H(X)g_hash_table_##X
GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);}

#25


6  

Perl, 185 char

200 (slightly broken) 199 197 195 193 187 185 characters. Last two newlines are significant. Complies with the spec.


map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/gfor<>;
$n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
die map{$U='_'x($X{$_}/$n);" $U
"x!$z++,"|$U| $_
"}@w

First line loads counts of valid words into %X.


The second line computes minimum scaling factor so that all output lines will be <= 80 characters.


The third line (contains two newline characters) produces the output.


#26


5  

Java - 886 865 756 744 742 744 752 742 714 680 chars

  • Updates before first 742: improved regex, removed superfluous parameterized types, removed superfluous whitespace.

  • Update 742 > 744 chars: fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code (\\s in regex replaced by a shorter pattern, and ArrayList replaced by Vector). I'm now looking for a short way to remove the Commons IO dependency and read from stdin.

  • Update 744 > 752 chars: I removed the Commons dependency. It now reads from stdin. Paste the text into stdin and hit Ctrl+Z to get the result.

  • Update 752 > 742 chars: I removed public and a space, made the class name 1 char instead of 2, and it's now ignoring one-letter words.

  • Update 742 > 714 chars: Updated as per Carl's comments: removed redundant assignment (742 > 730), replaced m.containsKey(k) by m.get(k)!=null (730 > 728), introduced substringing of line (728 > 714).

  • Update 714 > 680 chars: Updated as per Rotsor's comments: improved bar size calculation to remove unnecessary casting and improved split() to remove unnecessary replaceAll().


import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}}

More readable version:


import java.util.*;
class F{
 public static void main(String[]a)throws Exception{
  StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));
  final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);
  List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});
  int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);
  for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);
 }
}

Output:


 _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|____________________________________________________________| said|_____________________________________________________| alice|_______________________________________________| was|___________________________________________| that|____________________________________| as|________________________________| her|_____________________________| with|_____________________________| at|__________________________| on|__________________________| all|_______________________| this|_______________________| for|_______________________| had|_______________________| but|______________________| be|_____________________| not|____________________| they|____________________| so|___________________| very|__________________| what

It pretty much sucks that Java doesn't have String#join() and closures (yet).

Edit by Rotsor:


I have made several changes to your solution:


  • Replaced List with a String[]
  • Reused the 'args' argument instead of declaring my own String array. Also used it as an argument to .toArray()
  • Replaced StringBuffer with a String (yes, yes, terrible performance)
  • Replaced Java sorting with a selection sort with early halting (only the first 22 elements have to be found)
  • Aggregated some int declarations into a single statement
  • Implemented the non-cheating algorithm that finds the most limiting line of output. Implemented it without FP (floating point).
  • Fixed the problem of the program crashing when there were fewer than 22 distinct words in the text
  • Implemented a new algorithm for reading input, which is fast and only 9 characters longer than the slow one.
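The "most limiting line without FP" point can be illustrated with a short Python sketch. It follows the same cross-multiplication idea as the y*i>x*j test in the code below, but it is not a port of it: the names limiting_ratio and bar_width and the sample counts are made up, and it assumes every word is shorter than the 76-column budget.

```python
def limiting_ratio(pairs, budget=76):
    """Return (num, den) with num/den the smallest (budget - len(word)) / count."""
    num = den = None
    for word, count in pairs:
        x, y = budget - len(word), count
        # x/y < num/den  <=>  x*den < num*y, since all values are positive,
        # so two fractions can be compared using integers only.
        if num is None or x * den < num * y:
            num, den = x, y
    return num, den


def bar_width(count, num, den):
    # Integer division, like Java's int arithmetic.
    return count * num // den


# Hypothetical counts: the long second word is the limiting line.
pairs = [("she", 100), ("superlongstringstring", 90)]
num, den = limiting_ratio(pairs)
for word, count in pairs:
    print("|" + "_" * bar_width(count, num, den) + "| " + word)
```

Keeping the scale as a numerator/denominator pair avoids both the casts and the rounding surprises of a floating-point factor.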

The condensed code is 688 711 684 characters long:


import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;(j=System.in.read())>0;w+=(char)j);for(String W:w.toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(W,m.get(W)!=null?m.get(W)+1:1);l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

The fast version ( 720 693 characters)


import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

More readable version:


import java.util.*;
class F{public static void main(String[]l)throws Exception{
    Map<String,Integer>m=new HashMap();String w="";
    int i=0,k=0,j=8,x,y,g=22;
    for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{
        if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";
    }}
    l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;
    for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}
    for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}
    String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');
    System.out.println(" "+s);
    for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}
}}

The version without behaviour improvements is 615 characters:


import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);for(;i<g;++i)for(j=i;++j<l.length;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}i=76-l[0].length();String s=new String(new char[i]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/m.get(l[0]))+"| "+w);}}}

#27


4  

Scala 2.8, 311 314 320 330 332 336 341 375 characters

Including long word adjustment. Ideas borrowed from the other solutions.

Now as a script (a.scala):


val t="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22
def b(p:Int)="_"*(p*(for((w,c)<-t)yield(76.0-w.size)/c).min).toInt
println(" "+b(t(0)._2))
for(p<-t)printf("|%s| %s \n",b(p._2),p._1)

Run with


scala -howtorun:script a.scala alice.txt

BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).


#28


4  

Clojure 282 strict

(let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[k v]s](p \| v \| k)))

Somewhat more legibly:


(let[[[_ m]:as s](->> (slurp *in*)
                   .toLowerCase
                   (re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")
                   frequencies
                   (sort-by val >)
                   (take 22))
     [b] (sort (map #(/ (- 76 (count (key %)))(val %)) s))
     p #(do
          (print %1)
          (dotimes[_(* b %2)] (print \_))
          (apply println %&))]
  (p " " m)
  (doseq[[k v] s] (p \| v \| k)))

#29


4  

Scala, 368 chars

First, a legible version in 592 characters:


object Alice {
  def main(args:Array[String]) {
    val s = io.Source.fromFile(args(0))
    val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
    val freqs = words.foldLeft(Map[String, Int]())((countmap, word)  => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
    val sortedFreqs = freqs.toList.sort((a, b)  => a._2 > b._2)
    val top22 = sortedFreqs.take(22)
    val highestWord = top22.head._1
    val highestCount = top22.head._2
    val widest = 76 - highestWord.length
    println(" " + "_" * widest)
    top22.foreach(t => {
      val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
      println("|" + "_" * width + "| " + t._1)
    })
  }
}

The console output looks like this:


$ scalac alice.scala $ scala Alice aliceinwonderland.txt _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|_____________________________________________________________| said|_____________________________________________________| alice|_______________________________________________| was|____________________________________________| that|____________________________________| as|_________________________________| her|______________________________| at|______________________________| with|_____________________________| s|_____________________________| t|___________________________| on|__________________________| all|_______________________| had|_______________________| but|______________________| be|______________________| not|____________________| they|____________________| so|___________________| very|___________________| what

We can do some aggressive minifying and get it down to 415 characters:


object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}}

The console session looks like this:


$ scalac a.scala $ scala A aliceinwonderland.txt _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|_____________________________________________________________| said|_____________________________________________________| alice|_______________________________________________| was|____________________________________________| that|____________________________________| as|_________________________________| her|______________________________| at|______________________________| with|_____________________________| s|_____________________________| t|___________________________| on|__________________________| all|_______________________| had|_______________________| but|______________________| be|______________________| not|____________________| they|____________________| so|___________________| very|___________________| what

I'm sure a Scala expert could do even better.


Update: In the comments Thomas gave an even shorter version, at 368 characters:


object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}

Legibly, at 375 characters:


object Alice {
  def main(a:Array[String]) {
    val t = (Map[String, Int]() /: (
      for (
        x <- io.Source.fromFile(a(0)).getLines
        y <- "(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(x)
      ) yield y.toLowerCase
    ).toList)((c, x) => c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22)
    val w = 76 - t.head._1.length
    print (" "+"_"*w)
    t.map(s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1).foreach(print)
  }
}

#30


3  

Java - 896 chars

931 chars

1233 chars made unreadable

1977 chars "uncompressed"


Update: I have aggressively reduced the character count. Omits single-letter words per updated spec.


I envy C# and LINQ so much.


import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}

"Readable":


import java.util.*;
import java.io.*;
import static java.util.regex.Pattern.*;
class g
{
   public static void main(String[] a)throws Exception
      {
      PrintStream o = System.out;
      Map<String,Integer> w = new HashMap();
      Scanner s = new Scanner(new File(a[0]))
         .useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));
      while(s.hasNext())
      {
         String z = s.next().trim().toLowerCase();
         if(z.equals(""))
            continue;
         w.put(z,(w.get(z) == null?0:w.get(z))+1);
      }
      List<Integer> v = new Vector(w.values());
      Collections.sort(v);
      List<String> q = new Vector();
      int i,m;
      i = m = v.size()-1;
      while(q.size()<22)
      {
         for(String t:w.keySet())
            if(!q.contains(t)&&w.get(t).equals(v.get(i)))
               q.add(t);
         i--;
      }
      int r = 80-q.get(0).length()-4;
      String l = String.format("%1$0"+r+"d",0).replace("0","_");
      o.println(" "+l);
      o.println("|"+l+"| "+q.get(0)+" ");
      for(i = m-1; i > m-22; i--)
      {
         o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
      }
   }
}

Output of Alice:


 _________________________________________________________________________|_________________________________________________________________________| she|_______________________________________________________________| you|_____________________________________________________________| said|_____________________________________________________| alice|_______________________________________________| was|____________________________________________| that|____________________________________| as|_________________________________| her|______________________________| with|______________________________| at|___________________________| on|__________________________| all|________________________| this|________________________| for|_______________________| had|_______________________| but|______________________| be|______________________| not|____________________| they|____________________| so|___________________| very|___________________| what

Output of Don Quixote (also from Gutenberg):


 ________________________________________________________________________|________________________________________________________________________| that|________________________________________________________| he|______________________________________________| for|__________________________________________| his|________________________________________| as|__________________________________| with|_________________________________| not|_________________________________| was|________________________________| him|______________________________| be|___________________________| don|_________________________| my|_________________________| this|_________________________| all|_________________________| they|________________________| said|_______________________| have|_______________________| me|______________________| on|______________________| so|_____________________| you|_____________________| quixote