Web Scraping: Crawling Weibo

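Yesterday I used a crawler to scrape my girlfriend's Weibo posts: the time of each post, its content, its like count, and its comment count. I also found the page that holds the comments, but I still don't know why scraping it fails.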

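1. Of the two addresses https://m.weibo.cn/u/XXXXXid?sudaref=germey.gitbooks.io&retcode=6102 and https://weibo.com/mayun?refer_flag=1001030101_&is_hot=1, the first exposes the information I want through its Ajax requests, while the second does not, and I don't know why. (Searching with Google turns up the URL I need; searching with Baidu does not.)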

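2. I found the URL for the comments, but the call to requests fails. I should bring in the error-handling package and look at what exactly went wrong in requests.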

3. I had forgotten that a function containing yield is a generator, and that its purpose is to act as an iterable object.

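4. Opening Baidu search in Chrome and opening it in QQ Browser gives not only different URLs but also different results in the F12 developer tools. (Yes, Chrome is the more convenient one.)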
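For point 2, a minimal sketch of how the failure could be narrowed down. The helper names fetch_json and parse_json_body are hypothetical, and the idea that the comments URL might answer with an HTML page instead of JSON is only one possible cause, not something confirmed here. The sketch separates network or HTTP errors from a body that cannot be parsed as JSON, so the printed reason says which of the two happened.

```python
import requests


def parse_json_body(r):
    """Return (data, error) for a response that has already been received."""
    try:
        return r.json(), None
    except ValueError:
        # r.json() raises a ValueError subclass when the body is not JSON,
        # for example when the server answered with an HTML page.
        ctype = r.headers.get('Content-Type', 'unknown')
        return None, 'body is not JSON (Content-Type: {})'.format(ctype)


def fetch_json(url, headers=None, timeout=10):
    """Return (data, error); error is a short reason string on failure."""
    try:
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.RequestException as exc:
        # Covers connection errors, timeouts and bad HTTP status codes.
        return None, '{}: {}'.format(type(exc).__name__, exc)
    return parse_json_body(r)
```

Calling data, error = fetch_json(url, headers=header) and printing error when data is None would show the concrete reason instead of a bare 'fail'.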

Source code:

import requests
from urllib.parse import urlencode
from pyquery import PyQuery as pq
base_url = 'https://m.weibo.cn/api/container/getIndex?'  
header ={
            'Host':'m.weibo.cn',
            'Referer':'https://m.weibo.cn/u/3705846522?sudaref=germey.gitbooks.io&retcode=6102',
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.2604.400 QQBrowser/9.6.10875.400',
            'X-Requested-With':'XMLHttpRequest',  # all of these fields appear in the browser's request headers
}
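def get_page(page, value):
    # Query parameters of the Ajax endpoint; value is the user's uid
    params = {
        'sudaref': 'germey.gitbooks.io',
        'retcode': '6102',
        'type': 'uid',
        'value': value,
        'containerid': '1076033705846522',
        'page': page,
    }
    url = base_url + urlencode(params)
    try:
        r = requests.get(url, headers=header)
        if r.status_code == 200:
            return r.json()
    except (requests.RequestException, ValueError) as e:
        # Print the actual error (network problem or non-JSON body) instead of hiding it
        print('fail', e)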
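def parse_json(json):
    if json:
        url_list = []
        # Values can be read directly with get(), but only one nesting level at a time
        items = json.get('cards') or []
        for item in items:
            url_list.append(item.get('scheme'))
            item = item.get('mblog')
            if not item:
                continue  # skip cards that carry no post
            weibo = {}  # a fresh dict per post, so collected results do not share one object
            weibo['time'] = item.get('created_at')
            weibo['comments'] = item.get('comments_count')
            weibo['zan'] = item.get('attitudes_count')
            weibo['text'] = pq(item.get('text')).text()
            # At first I wrote `yield weibo, url_list` here and printed the result in the main block.
            # Because the function hands back control every time it reaches yield, the printed
            # url_list held only 1 item the first time, 2 items the second time, and so on.
            yield weibo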


# Failed attempt at fetching the comments:
# def get_comment():
#     try:
#         r = requests.get("https://m.weibo.cn/status/F3HR6wMS8?mblogid=F3HR6wMS8&luicode=10000011&lfid=1076033705846522", headers=header)
#         if r.status_code == 200:
#             return r.json()
#     except:
#         print('fail')

if __name__ == '__main__':
    print('{:10}{:10}{:10}{:}'.format('Time', 'Comments', 'Likes', 'Text'))
    for num in range(1, 5):
        # get_page needs the uid as its second argument; it is the one in the Referer header above
        json = get_page(num, '3705846522')
        for weibo in parse_json(json):
            print('{:10}{:<10}{:<10}{:}'.format(weibo['time'], weibo['comments'], weibo['zan'], weibo['text']))
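
The generator behaviour from point 3, and the growing url_list described in the comment inside parse_json, can be reproduced in isolation. A minimal sketch follows; running_list and seen are hypothetical names used only for illustration and are not part of the script above. Each yield pauses the function and hands back the list as it stands at that moment, which is why the caller sees it grow by one item per iteration.

```python
def running_list(values):
    """Yield each value together with the list of everything seen so far."""
    seen = []
    for v in values:
        seen.append(v)
        # Execution pauses here and resumes when the caller asks for the next item.
        yield v, seen


gen = running_list(['a', 'b', 'c'])

# A generator is an iterator: next() runs it up to the first yield.
value, seen = next(gen)
first_len = len(seen)                   # the list holds one item so far

# Iterating over the rest resumes the function after each yield.
later_lens = [len(s) for _, s in gen]   # the list keeps growing

# seen is the same list object the generator appended to, now complete.
print(first_len, later_lens, seen)
```

If the complete list of URLs is needed, one option is to consume the generator fully first (for example with list(...)) and read the list afterwards.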
