[Python Study Notes 3] A Simple Python Crawler

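For writing crawlers, these notes install the requests library. Note that the practice script further down only relies on the standard-library urllib.request module.

1. Python 3.x installations generally ship with the package installer (pip); it lives in the Scripts folder under the Python installation directory.

2. Add the Scripts directory to the PATH environment variable. On my machine, for example, it is: C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\Scripts

Then run: pip install requests

Once the installation finishes, run import requests in the Python interpreter; if it completes without an error, the installation is OK.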
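The same check can also be scripted instead of typed into the interpreter. The following is only a minimal sketch: the helper name is_installed is made up for this note, and it relies on importlib.util.find_spec from the standard library to look a top-level module up without actually importing it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located (hypothetical helper)."""
    # find_spec returns None when no module with this top-level name is found
    return importlib.util.find_spec(module_name) is not None


if is_installed("requests"):
    print("requests is available")
else:
    print("requests is missing; run: pip install requests")
```

If the second message is printed even though pip reported success, a likely cause is that pip belongs to a different Python installation than the interpreter being run.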

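A simple crawler exercise

This is a beginner example that circulates widely online: it crawls the images on a page and saves them to a specified directory. The script sets the User-Agent header in two different ways.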
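The two ways of attaching a header can be looked at in isolation before reading the full script. This is a small offline sketch that sends no request; the address http://example.com/ and the constant name USER_AGENT are placeholders chosen for illustration.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0"

# Way 1: attach the header to one Request object; it applies to that request only.
req = urllib.request.Request(url="http://example.com/", headers={"User-Agent": USER_AGENT})
# Request stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))

# Way 2: give an opener default headers; they apply to every request it sends.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", USER_AGENT)]
print(opener.addheaders)
```

The second way is the one that fits urlretrieve, because urlretrieve takes a URL string rather than a Request object and therefore picks up its headers from the opener registered with install_opener.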

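import os
import re
import urllib.request

# Directory where the downloaded images are stored
SAVE_DIR = 'picture'

def getHtml(url):
    # Set headers to imitate a browser, as a guard against anti-crawler checks
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    # Request the given URL and read the response body
    page = urllib.request.urlopen(req)
    html = page.read()
    return html.decode('UTF-8', 'ignore')

def getImg(html):
    # Use a regular expression to collect every .jpg address in the HTML into imglist
    reg = r'src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    imglist = imgre.findall(html)
    # Set headers to imitate a browser; urlretrieve goes through the installed opener,
    # so this only needs to be done once, before the loop
    opener = urllib.request.build_opener()
    opener.addheaders=[('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
    urllib.request.install_opener(opener)
    # Create the target directory if it does not exist yet
    os.makedirs(SAVE_DIR, exist_ok=True)
    for x, imgurl in enumerate(imglist):
        # Save the image to the target directory under a numbered name for easy browsing
        # (os.path.join avoids the invalid '\p' escape in the original 'picture\pic%s.jpg')
        urllib.request.urlretrieve(imgurl, os.path.join(SAVE_DIR, 'pic%s.jpg' % x))

url = "xxx"  # set your own URL
html = getHtml(url)
getImg(html)
print("END")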
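One thing to watch with the script above: urlretrieve needs a full address, while the src values captured by the regular expression may be relative paths. The sketch below shows one possible way to handle that with urllib.parse.urljoin. The function name extract_image_urls, the sample HTML and the example.com addresses are all invented for illustration, and whether a given site uses relative paths is an assumption.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the script above: capture src values that end in .jpg
IMG_PATTERN = re.compile(r'src="(.+?\.jpg)"')


def extract_image_urls(html, base_url):
    """Return absolute .jpg addresses found in html (hypothetical helper)."""
    # urljoin leaves absolute addresses unchanged and resolves relative ones
    return [urljoin(base_url, src) for src in IMG_PATTERN.findall(html)]


# Made-up page content: one relative path, one absolute address, one non-jpg image
sample_html = (
    '<img src="/static/a.jpg">'
    '<img src="http://img.example.com/b.jpg">'
    '<img src="c.png">'
)
urls = extract_image_urls(sample_html, "http://example.com/gallery/index.html")
print(urls)
# prints ['http://example.com/static/a.jpg', 'http://img.example.com/b.jpg']
```

The pattern itself is kept as in the original; it is a loose match, so for pages with more complicated markup an HTML parser would probably be more dependable than a regular expression.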
