[python3] Scraping Jianshu comments to generate a word cloud

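1. Motivation:

Yesterday I came across an article on Jianshu titled 《中国的父母，大都有毛病》 ("Most Chinese Parents Have Problems"). After reading it, I largely agreed with the author's view.

Scrolling through the comments, though, I found them highly contentious and split into two opposing camps. Curious about what the comments looked like as a whole, I wrote a crawler and built a word cloud from them.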

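2. Approach:

① Inspect the page, find the request that loads the comments, check the format of the comment data, and write the crawler

② Use the jieba module to segment the scraped comments into words

③ Use the wordcloud module to generate the word cloud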
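
Step ① comes down to two observations about the comment request: each response is JSON carrying a `total_pages` count, and each entry in `comments` holds its HTML in `compiled_content`. The sketch below illustrates how `total_pages` can drive the paging loop. It makes no network calls: `FAKE_PAGES`, `fetch_page` and `collect_all` are hypothetical names, and the payload values are made up to mirror only the fields the crawler reads.

```python
import json

# Hypothetical stand-in for the comment endpoint. Only the field names
# (total_pages, comments, compiled_content) mirror the real response;
# the values are invented for illustration.
FAKE_PAGES = {
    1: {"total_pages": 2, "comments": [{"compiled_content": "<p>first</p>"}]},
    2: {"total_pages": 2, "comments": [{"compiled_content": "<p>second</p>"}]},
}


def fetch_page(page):
    """Return the JSON text for one page, as an HTTP call would."""
    return json.dumps(FAKE_PAGES[page])


def collect_all(fetch):
    """Walk pages 1..total_pages, learning total_pages from each response."""
    collected = []
    page, total_pages = 1, 1
    while page <= total_pages:
        data = json.loads(fetch(page))
        total_pages = data["total_pages"]  # the server reports how many pages exist
        collected.extend(c["compiled_content"] for c in data["comments"])
        page += 1
    return collected


comments = collect_all(fetch_page)
print(comments)
```

Starting with `total_pages = 1` guarantees one request, after which the server's own count decides when to stop.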

3. The code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import json
import time

import jieba
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Append one scraped comment to the output file
def write(path, text):
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text)
        f.write('\n')

# Scrape one page of comments and return the total number of pages
def getcomments(num, path):
    url = 'https://www.jianshu.com/notes/23437010/comments?comment_id=&author_only=false&since_id=0&max_id=1586510606000&order_by=likes_count&page=' + str(num)
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
    data = json.loads(response.text)
    for i in data['comments']:
        # compiled_content is HTML, so keep only its text
        comment = BeautifulSoup(i['compiled_content'], 'lxml').text
        write(path, comment)
    return data['total_pages']

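# Segment the saved comments with jieba
def read(path):
    words = []
    with open(path, encoding='utf-8') as s:
        for line in s:
            line = line.strip()  # strip() returns a new string, so keep the result
            if line:
                words.extend(jieba.cut(line))
    # WordCloud splits on whitespace, so join the tokens with spaces
    return ' '.join(words)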

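# Generate the word cloud with WordCloud
def wordcloud(imagepath, text):
    background_image = plt.imread(imagepath)
    wc = WordCloud(background_color='white',  # background color
                   mask=background_image,  # image used as the shape mask
                   max_words=2000,  # maximum number of words to display
                   stopwords=STOPWORDS,  # stop words to exclude
                   font_path='C:/Users/Windows/fonts/msyh.ttf',  # a font with Chinese glyphs; without one, Chinese text does not render
                   max_font_size=120,  # largest font size
                   random_state=30,  # seed for the random generator, so layout and colors are reproducible
                   )
    wc.generate(text)
    # Recolor the words using the colors of the background image
    image_colors = ImageColorGenerator(background_image)
    wc.recolor(color_func=image_colors)
    plt.imshow(wc)
    plt.axis('off')
    plt.show()

if __name__ == '__main__':
    path = 'comments.txt'  # file the comments are saved to
    imagepath = 'heart.jpg'  # background image for the word cloud
    print('Scraping comments')
    i, num = 1, 1  # num is replaced by the real page count after the first request
    while i <= num:
        num = getcomments(i, path)  # scrape one page of comments
        time.sleep(2)  # pause between requests
        i += 1
    print('Segmenting text')
    text = read(path)  # jieba segmentation
    print('Generating word cloud')
    wordcloud(imagepath, text)  # render the word cloud
    print('Word cloud generated')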
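
One thing worth checking before rendering: the `STOPWORDS` set that ships with wordcloud is a list of English words, so common Chinese function words such as 的 or 了 are likely to survive into the cloud. A minimal sketch of counting words after filtering a Chinese stop word set follows. `CHINESE_STOPWORDS` and `top_words` are hypothetical names, the stop word list is a tiny illustrative sample, and the input is assumed to be the space-separated output of the segmentation step.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real run would need a fuller one.
CHINESE_STOPWORDS = {"的", "了", "是", "我", "也"}


def top_words(segmented_text, stopwords, n=3):
    """Count space-separated tokens, skipping stop words, and return the n most common."""
    words = [w for w in segmented_text.split() if w not in stopwords]
    return Counter(words).most_common(n)


# Made-up sample standing in for segmented comments.
sample = "父母 的 观点 我 也 认同 父母 是 孩子 的 老师 父母"
print(top_words(sample, CHINESE_STOPWORDS, n=1))
```

The same set could also be handed to the renderer, for example `stopwords=STOPWORDS | CHINESE_STOPWORDS`, so that the filtering happens inside `generate`.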

Result:

[Image: word cloud generated from the Jianshu comments]
