Python Crawler Basics (Part 1)

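I spent about six hours reading up on Python and then went straight to writing a Python crawler. It is a very basic one, of course, but it shows that getting started with Python crawlers is not hard.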

Python crawler fundamentals

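Basic Python syntax (no need to study the language deliberately, which wastes time; just follow a tutorial and start building, since programming languages have a lot in common)
Python regular expressions (mainly used to process the data)
urllib and urllib2 for network requests
HTTP (requests, responses, and so on) and basic HTML (use the browser's Inspect Element to find what you want to scrape)
Garbled Chinese text in Python (analyze it when you run into it)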

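Note: almost all of this can be looked up online, including hands-on projects. Later on you can come up with your own projects to practice on.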
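The garbled-Chinese item in the list above usually comes down to decoding bytes with the wrong codec. Below is a minimal sketch, assuming the page is either UTF-8 or GBK. The helper name `decode_page` and the candidate encoding list are illustrative assumptions, not something from the original script.

```python
# -*- coding: utf-8 -*-

def decode_page(raw_bytes, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in order and return the first clean decode.

    The candidate list is an assumption; adjust it to the site being scraped.
    """
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first codec and mark undecodable bytes
    return raw_bytes.decode(encodings[0], "replace")


text = u"中文"
utf8_bytes = text.encode("utf-8")  # what a UTF-8 page would send
gbk_bytes = text.encode("gbk")     # what a GBK page would send

print(decode_page(utf8_bytes) == text)
print(decode_page(gbk_bytes) == text)
```

UTF-8 is tried first because it is strict and rejects most text in other encodings, whereas GBK accepts many byte sequences and could silently produce wrong characters.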

Qiushibaike crawler

Reference:
http://blog.****.net/pleasecallmewhy/article/details/8932310

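Because the content of the Qiushibaike site has changed, you need to write your own regular expressions to get the content: inspect the elements yourself and decide what to scrape, as shown in the figure below.

[Figure: inspecting the page elements in the browser]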
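To make the "inspect, then write your own regex" step concrete, here is a small sketch that runs against an inline HTML sample. The markup in `SAMPLE_HTML` is hypothetical and only loosely modeled on an author/content layout; the real page structure is different and changes over time. `ENTRY_RE` and `extract_entries` are likewise illustrative names.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical markup for illustration only, not the real site's HTML
SAMPLE_HTML = u"""
<div class="author"><img src="a.png" alt="alice"></div>
<div class="content"><span>
first line<br/>second line
</span></div>
<div class="author"><img src="b.png" alt="bob"></div>
<div class="content"><span>
another joke
</span></div>
"""

# re.S lets "." match newlines; the non-greedy "?" keeps each gap as short
# as possible so one match does not swallow the following entry.
ENTRY_RE = re.compile(
    r'<div class="author">.*?alt="(.+?)".*?'
    r'<div class="content">\s*<span>\s*(.*?)\s*</span>',
    re.S,
)


def extract_entries(html):
    """Return a list of (author, content) tuples found in html."""
    return [(name, text.replace(u"<br/>", u"\n"))
            for name, text in ENTRY_RE.findall(html)]


entries = extract_entries(SAMPLE_HTML)
for author, content in entries:
    print(author)
    print(content)
```

Matching the author and the content in a single pattern keeps each pair together, which avoids having to cut the page into blocks first.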

Code (disclaimer: the code has flaws; see the comments in the code)

# -*- coding:utf-8 -*-
# Python 2 script (uses urllib2 and the print statement)

import urllib2
import re

# Qiushibaike may require a click-through verification before it serves data
# (that part is not written here, so the request can fail)
myUrl = "https://www.qiushibaike.com/hot/page/1"

# myResponse = urllib2.urlopen(myUrl)
# Requesting the URL directly does not work; the request has to look like a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(url=myUrl, headers=headers)
myResponse = urllib2.urlopen(req)
myPage = myResponse.read()

unicodePage = myPage.decode("utf-8")

# Find every joke (entries by anonymous users or with empty content are dropped).
# Known flaw: the non-greedy pattern stops at the first </div> after
# "article block", so nested divs inside an entry can be cut off.
myItems = re.findall(r'<div.*?article block.*?</div>', unicodePage, re.S)
items = []
for item in myItems:
    # Author
    name = re.findall(r'.*<div.*class="author.*".*>.*alt="(.+?)".*</div>.*', item, re.S)
    # name = re.findall(r'.*alt="(.+?)".*', item, re.S)
    # Content
    content = re.findall(r'.*<div.*?class="content".*?>.*<span>\r*\n*(.*?)\r*\n*</span>.*</div>.*', item, re.S)

    # u'匿名用户' is the site's label for an anonymous user
    if len(name) > 0 and name[0] != u'匿名用户' and len(content) > 0 and content[0] != "":
        contmp = content[0].replace("<br/>", "\r\n")
        items.append([name[0], contmp])

for it in items:
    print "==================================="
    print u"Author:", it[0]
    print u"Joke:"
    print it[1]
    print "==================================="
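The "pretend to be a browser" step above can also be isolated into a small helper. This is a sketch under assumptions rather than part of the original script: the names `build_request` and `fetch` are made up for illustration, and the import fallback is only there so that the same snippet runs on both Python 2 (`urllib2`) and Python 3 (`urllib.request`).

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
              "Gecko/20091201 Firefox/3.5.6")


def build_request(url, user_agent=USER_AGENT):
    """Create a request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download url and return the body decoded as UTF-8 (assumed encoding)."""
    response = urlopen(build_request(url), timeout=timeout)
    try:
        return response.read().decode("utf-8")
    finally:
        response.close()


# Building the request does not touch the network; only fetch() does.
req = build_request("https://www.qiushibaike.com/hot/page/1")
print(req.get_full_url())
```

Separating request construction from the download makes the header logic easy to check without network access, and the timeout keeps a stalled connection from hanging the script.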

Screenshot of the run

[Figure: screenshot of the crawler's output]
