Scraping Sina's Domestic Rolling News

Date: 2022-06-20 03:28:49

1. Requirements

  Scrape the content, title, timestamp, source, comment count, and editor of every article in Sina's domestic rolling news.

2. Approach

  Sina's news feed scrolls and is paginated. First, find the link to each individual article and scrape its content.

  Next, find the pagination link and collect the links to all the articles on each page.

  Finally, loop over the pages and scrape the articles on every page.

3. Method

  Organize the code as functions; roughly three are needed.
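Matching the three steps of the approach above, the code splits into three functions. A skeleton of that structure (bodies elided; the full implementations appear in section 6):

```python
# Skeleton of the three functions; full bodies appear in section 6.

def getCommentCount(newsurl):
    """Fetch one article's comment count from Sina's comment API."""
    ...

def getNewsdetail(newsurl):
    """Scrape one article: title, source, time, body, editor, comment count."""
    ...

def ParseListLinks(url):
    """Parse one page of the rolling list and scrape every article on it."""
    ...
```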

4. Editor

  Jupyter Notebook; the modules used are requests, re, BeautifulSoup, datetime, and json.

5. Problems encountered

  Neither the pagination link nor the comment link is easy to find; both are loaded dynamically, so they have to be located in the browser's network panel.
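Both of those endpoints return JSONP rather than plain JSON, which is why the code below strips a `newsloadercallback(...)` wrapper before calling `json.loads`. A slightly more robust way to remove such a wrapper (a sketch; the regex is my own assumption about the wrapper's general shape, not part of the original code):

```python
import json
import re

def strip_jsonp(text):
    # JSONP responses look like 'callbackname({...});' -- pull out the
    # parenthesised JSON payload and parse it; fall back to parsing the
    # text as-is when there is no wrapper.
    m = re.search(r'^\s*\w+\s*\((.*)\)\s*;?\s*$', text, re.S)
    payload = m.group(1) if m else text
    return json.loads(payload)

data = strip_jsonp('  newsloadercallback({"result": {"data": []}});')
# data['result']['data'] is now an ordinary Python list
```

Unlike `lstrip`/`rstrip`, which strip *character sets* rather than prefixes, this does not silently eat extra leading or trailing characters.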

6. Code

import requests
import json
from bs4 import BeautifulSoup
import re
from datetime import datetime

def getCommentCount(newsurl):
    # Sina's comment API for the 'gn' (domestic news) channel; the article
    # id is interpolated into the 'newsid=comos-{}' parameter.
    commentURL = 'http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-{}&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1'
    # The article id sits between 'doc-i' and '.shtml' in the article URL.
    m = re.search(r'doc-i(.+)\.shtml',newsurl)
    newsid = m.group(1)
    #newsid = newsurl.split('/')[-1].lstrip('doc-i').rstrip('.shtml')
    comments = requests.get(commentURL.format(newsid))
    jd = json.loads(comments.text)
    return jd['result']['count']['total']
    
def getNewsdetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text,'html.parser')
    result['title'] = soup.select('.main-title')[0].text
    result['newssource'] = soup.select('.date-source a')[0].text
    # Timestamps look like '2022年06月20日 03:28'.
    timesource = soup.select('.date-source span')[0].text
    result['dt'] = datetime.strptime(timesource,'%Y年%m月%d日 %H:%M')
    # Join the body paragraphs; the last three <p> tags are the editor
    # line and other trailing boilerplate, so drop them.
    result['article'] = '\n'.join(p.text.strip() for p in soup.select('.article p')[:-3])
    # The editor line reads '责任编辑:XXX'; keep only the name.
    result['editor'] = soup.select('.show_author')[0].text.replace('责任编辑:','').strip()
    result['comments'] = getCommentCount(newsurl)
    return result


def ParseListLinks(url):
    # One page of the rolling-news API returns JSONP: the JSON payload is
    # wrapped in 'newsloadercallback(...);', so strip the wrapper first.
    newsdetails = []
    res = requests.get(url)
    res.encoding = 'utf-8'
    jd = json.loads(res.text.lstrip("  newsloadercallback(").rstrip(");"))
    for ent in jd['result']['data']:
        newsdetails.append(getNewsdetail(ent['url']))
    return newsdetails

url = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}'
news_total = []
# Fetch the first five pages of the rolling list (page 2 is skipped here).
for i in range(5):
    if i != 2:
        newsurl = url.format(i)
        newsary = ParseListLinks(newsurl)
        news_total.extend(newsary)

# Collect everything into a DataFrame for a quick look.
import pandas
df = pandas.DataFrame(news_total)
df.head()
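As a possible next step (not in the original), the DataFrame can be written to disk. The sample rows and filename below are made up for illustration; the keys mirror what `getNewsdetail()` returns:

```python
import pandas

# Sample rows with the same keys getNewsdetail() produces (made-up values).
news_total = [
    {'title': 'Example headline', 'newssource': 'Example source',
     'dt': '2022-06-20 03:28', 'article': 'Body text...',
     'editor': 'Example editor', 'comments': 12},
]
df = pandas.DataFrame(news_total)
# utf-8-sig keeps Chinese text readable when the CSV is opened in Excel.
df.to_csv('news.csv', index=False, encoding='utf-8-sig')
```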