A Weibo Hot Search Crawler in a Few Lines of Code

1. Fetching the data

First, we need the URL of the Weibo hot search page: https://s.weibo.com/top/summary

import requests

def get_html_data(self):
    # Request the page and return the response body as text
    res = requests.get(self.url, headers=self.headers).text
    return res

With the requests package we can get the page's HTML. The next step is to parse it.
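As a self-contained illustration of the fetch step, here is a minimal sketch. The function name `fetch_html`, the `session` parameter, the User-Agent value and the 10-second timeout are assumptions made for this example and do not come from the original code. The live site may also expect cookies or other headers, which this sketch does not handle.

```python
import requests

URL = "https://s.weibo.com/top/summary"
# Assumed header value: a browser-like User-Agent is a common choice for scrapers.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url=URL, headers=None, session=None, timeout=10):
    """Return the page body as text, raising on HTTP errors."""
    # A requests.Session and the requests module both expose .get(url, headers=..., timeout=...)
    http = session or requests
    resp = http.get(url, headers=headers or HEADERS, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of silently parsing an error page
    resp.raise_for_status()
    return resp.text
```

Calling `fetch_html()` with no arguments requests the hot search page. Passing a `requests.Session()` as `session` reuses the connection across repeated polls.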

2. Processing the data

To make the HTML easier to analyze, I copied it into an editor and inspected the text there.

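The analysis shows that the data we want sits in the table rows under the element with id pl_top_realtimehot, where the cells with classes td-01, td-02 and td-03 hold the rank, the keyword and heat, and the label.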

A simple implementation:

from datetime import datetime
from bs4 import BeautifulSoup

def deal_html_data(self, res):
    res = BeautifulSoup(res, "lxml")
    # Iterate over the hot search rows.
    # #pl_top_realtimehot matches by id; > table > tbody > tr walks down level by level
    for item in res.select("#pl_top_realtimehot > table > tbody > tr"):
        # Extract the rank by class name .td-01
        _rank = item.select_one('.td-01').text
        if not _rank:
            # Skip rows that have no rank
            continue
        # Extract the keyword from the link inside .td-02
        keyword = item.select_one(".td-02 > a").text

        # Extract the heat value
        heat = item.select_one(".td-02 > span").text

        # Extract the label
        icon = item.select_one(".td-03").text

        self.hot_list.append({"rank": _rank, "keyword": keyword, "heat": heat, "icon": icon, "time":
                              datetime.now().strftime("%Y-%m-%d %H:%M:%S")})

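Here we use BeautifulSoup's select and select_one to parse the HTML.

A brief note on select and select_one: both take a CSS selector; select returns a list of all matches, while select_one returns only the first match.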

# Find by tag name
soup.select_one('a')
# Find by class name
soup.select_one('.td-02')
# Find by id
soup.select_one('#pl_top_realtimehot')
# Combined lookup: by id plus the tag hierarchy
soup.select("#pl_top_realtimehot > table > tbody > tr")
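To see these selectors in action without touching the network, the sketch below runs them against a small hand-written snippet. The markup in `SAMPLE_HTML` is hypothetical and only mimics the structure described above; the real page may differ. It also uses Python's built-in "html.parser" instead of "lxml" so that it needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the selectors expect.
SAMPLE_HTML = """
<div id="pl_top_realtimehot">
  <table>
    <tbody>
      <tr>
        <td class="td-01">1</td>
        <td class="td-02"><a href="#">example topic</a><span>12345</span></td>
        <td class="td-03">hot</td>
      </tr>
      <tr>
        <td class="td-01">2</td>
        <td class="td-02"><a href="#">another topic</a><span>678</span></td>
        <td class="td-03"></td>
      </tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of every match; select_one() returns the first match or None.
rows = soup.select("#pl_top_realtimehot > table > tbody > tr")

parsed = []
for row in rows:
    parsed.append({
        "rank": row.select_one(".td-01").get_text(strip=True),
        "keyword": row.select_one(".td-02 > a").get_text(strip=True),
        "heat": row.select_one(".td-02 > span").get_text(strip=True),
        "icon": row.select_one(".td-03").get_text(strip=True),
    })

print(parsed)
```

Because select_one returns None when nothing matches, code that runs against the live page may want to check the result before reading its text.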

3. Storing the data
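The storage code itself is in the linked original article. As one possible approach, the sketch below writes a list of dicts shaped like the entries in hot_list to a CSV file using only the standard library. The function name `save_hot_list`, the default file name and the choice of CSV are assumptions for illustration, not the original author's method.

```python
import csv

# Column order follows the keys used when building hot_list.
FIELDS = ["rank", "keyword", "heat", "icon", "time"]


def save_hot_list(hot_list, path="weibo_hot.csv"):
    """Write a list of hot search dicts to a CSV file, one row per entry."""
    # utf-8-sig adds a BOM so spreadsheet tools detect the encoding of Chinese text.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(hot_list)
    return path
```

Opening the file in "w" mode overwrites any earlier snapshot. To keep a history across runs, one option is to open it in "a" mode and write the header only when the file is new.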

For more information, see the original article:

https://mp.weixin.qq.com/s?__biz=Mzg3OTExODI3OA==&mid=2247484291&idx=1&sn=992419916130cf4b414b77b20c38db82&chksm=cf08112af87f983ce954aefd22a0179bbd7110540750b2db566cfc36e66a9000c953d675eecf&token=2077647426&lang=zh_CN#rd
