Python Natural Language Processing (Part 2)

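These notes follow the NLTK book and assume its sample texts (text1, text4, text5, ...) are already loaded. A minimal setup sketch, assuming NLTK is installed and the book data has been downloaded once:

>>> import nltk
>>> nltk.download('book')               # one-time download of the book corpora
>>> from nltk.book import *             # loads text1 ... text9 (text1 is Moby Dick, text4 the inaugural corpus, text5 a chat corpus)
>>> from nltk import FreqDist, bigrams  # explicit imports, in case the wildcard import does not expose them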

1. Use FreqDist to find the 50 most common words in a text

>>> fdist1 = FreqDist(text1)

>>> fdist1

FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})

>>> vocab = fdist1.most_common(50)

>>> vocab

[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]

2. Plot a cumulative frequency graph of the 50 most common words

>>> fdist1.plot(50, cumulative=True)
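plot() requires matplotlib to be installed. If you only want the numbers, tabulate() prints a text table instead; a small sketch:

>>> fdist1.tabulate(10)                   # 10 most common samples with raw counts
>>> fdist1.tabulate(10, cumulative=True)  # the same samples with cumulative counts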

3. Find words that occur only once (hapaxes)

>>> fdist1.hapaxes()
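hapaxes() returns a plain list, so it can be counted and sliced like any other; for example:

>>> len(fdist1.hapaxes())  # how many words occur exactly once
>>> fdist1.hapaxes()[:10]  # peek at the first few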

4. Find words longer than 15 characters

>>> V = set(text1)

>>> long_words = [w for w in V if len(w) > 15]

>>> sorted(long_words)

['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

5. Find all words in a text that are longer than 7 characters and occur more than 7 times

>>> fdist5 = FreqDist(text5)

>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)

['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']

6. Extract bigrams from a text

>>> list(bigrams(['more', 'is', 'said', 'done', 'than']))

[('more', 'is'), ('is', 'said'), ('said', 'done'), ('done', 'than')]

Note: the bigrams function is only available after from nltk import * (or an explicit import); since it returns a generator, wrap it in list() to see the pairs.
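Since FreqDist accepts any iterable, it can also consume the bigram generator directly, which yields the most common adjacent word pairs. A minimal sketch over text1:

>>> bigram_fd = FreqDist(bigrams(text1))  # count every adjacent word pair in Moby Dick
>>> bigram_fd.most_common(5)              # the five most frequent pairs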

7. Find frequently occurring bigrams (collocations)

>>> text4.collocations()

United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
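collocations() prints its results rather than returning them. For programmatic access, the nltk.collocations module provides finder classes; the sketch below is roughly what collocations() does internally (it additionally filters out stopwords and very short words):

>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> finder = BigramCollocationFinder.from_words(text4)
>>> finder.apply_freq_filter(3)  # ignore pairs seen fewer than 3 times
>>> finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)  # top 10 pairs by likelihood ratio, returned as a list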

8. Build a FreqDist of word lengths

>>> fdist = FreqDist([len(w) for w in text1])

>>> fdist

FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528, ...})

>>> fdist.keys()

dict_keys([1, 4, 2, 6, 8, 9, 11, 5, 7, 3, 10, 12, 13, 14, 16, 15, 17, 18, 20])

Here 3: 50223 means that words of length 3 occur 50,223 times.

fdist.keys() gives the set of all word lengths that occur in the text.

9. Some operations on word lengths

List each word length together with its number of occurrences:

>>> fdist.items()

dict_items([(1, 47933), (4, 42345), (2, 38513), (6, 17111), (8, 9966), (9, 6428), (11, 1873), (5, 26597), (7, 14399), (3, 50223), (10, 3528), (12, 1053), (13, 567), (14, 177), (16, 22), (15, 70), (17, 12), (18, 1), (20, 1)])

Find the most frequent word length:

>>> fdist.max()

3

Look up how many times word length 3 occurs:

>>> fdist[3]

50223

The relative frequency of word length 3:

>>> fdist.freq(3)

0.19255882431878046
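freq() is simply the sample's count divided by the total number of samples, so the value above can be reproduced by hand:

>>> fdist[3] / fdist.N()  # 50223 / 260819 ≈ 0.1926, the same value freq(3) returns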

10. Functions defined in NLTK's frequency distribution class

fdist = FreqDist(samples)       create a frequency distribution containing the given samples
fdist.inc(sample)               increment the count for this sample (NLTK 2 only; see the sketch below)
fdist['monstrous']              count of the number of times a given sample occurred
fdist.freq('monstrous')         frequency of a given sample
fdist.N()                       total number of samples
fdist.keys()                    the samples sorted in order of decreasing frequency
for sample in fdist:            iterate over the samples in order of decreasing frequency
fdist.max()                     sample with the greatest count
fdist.tabulate()                tabulate the frequency distribution
fdist.plot()                    graphical plot of the frequency distribution
fdist.plot(cumulative=True)     cumulative plot of the frequency distribution
fdist1 < fdist2                 test whether the samples in fdist1 occur less frequently than in fdist2
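The table above follows the NLTK 2 edition of the book. In NLTK 3, FreqDist is a subclass of collections.Counter, and a few entries changed; a sketch of the modern equivalents:

>>> fdist = FreqDist(['a', 'b', 'a'])
>>> fdist['c'] += 1      # fdist.inc('c') was removed; use Counter-style increments instead
>>> fdist.most_common()  # [('a', 2), ('b', 1), ('c', 1)] -- sorted by decreasing frequency
>>> list(fdist)          # insertion order, not frequency order, in NLTK 3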