Python Web Scraping for Beginners 1: Scraping Job Postings from the Tencent Recruitment Site

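As someone who loves learning, I plan to keep posting about Python web scraping over the coming weeks.
This blog is where I will carefully record what I learn along the way.
First, my setup: Win10 + Anaconda + PyCharm. I assume you already know some Python basics.
I hope this blog helps you - ̗̀(๑ᵔ⌔ᵔ๑)
Now, on to the main topic: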

Site to scrape: https://hr.tencent.com/social.php

The process has three main parts:
1. Fetch the full page data
2. Extract the data we want
3. Store the data

Before starting, you need to know how to install third-party libraries. If you use PyCharm, you can run pip install followed by the library name in the Terminal.

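Fetching the full page data
1. Write the initializer and send requests with request headers
Put your own browser's User-Agent inside the braces.

How to find your User-Agent:
Open any web page, press F12 (or Fn+F12), refresh, and select the Network tab. Pick any request in the list on the left (most of them include it; if one does not, try a couple of others). Among the request headers on the right you will find user-agent. That is the string your browser uses to identify itself~~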

Here are the request headers I defined:

def __init__(self):
	self.headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36" }
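
To see why the custom header matters, the sketch below compares the User-Agent that requests sends by default with the one defined above. It only builds a request and never sends it, so no network access is needed. The variable names are my own and are there for illustration only.

```python
import requests

# The browser-like User-Agent from the headers above.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36"
}

# Without a custom header, requests identifies itself as "python-requests/<version>".
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Prepare (but do not send) a request to inspect the headers that would go out.
prepared = requests.Request(
    "GET", "https://hr.tencent.com/position.php", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```

A site that filters on User-Agent can spot the default value easily, which is why the browser string is passed explicitly.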

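2. Pick a job category and location you like, then page through the results and watch how the URL changes~
Page 1: https://hr.tencent.com/position.php?keywords=&tid=87&lid=2218&start=0#a
Page 2: https://hr.tencent.com/position.php?keywords=&tid=87&lid=2218&start=10#a
Page 10: https://hr.tencent.com/position.php?keywords=&tid=87&lid=2218&start=90#a
You can see that the start parameter begins at 0 and grows in steps of 10.
So we can use a for loop to walk through every page.
I chose to scrape 100 pages.
The get method of the requests library returns the server's response, and we pass our own request headers to it through the headers argument.
This gets around a basic anti-scraping check by making the site believe a browser is visiting the page! Without it, the scrape may fail.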

import requests

def __init__(self):
    self.headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36" }

def start_Request(self):
    for i in range(0,1000,10):
        # Fetch the full page data
        url = "https://hr.tencent.com/position.php?&start="+str(i)
        print("Fetching URL:",url)
        response = requests.get(url,headers = self.headers)
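
The paging pattern can also be checked on its own. The following is a minimal sketch that only builds the page URLs and sends nothing. page_urls is a hypothetical helper name; the base URL and the start parameter are the ones observed above.

```python
from urllib.parse import urlencode

BASE_URL = "https://hr.tencent.com/position.php"

def page_urls(pages, page_size=10):
    """Yield one listing URL per page; `start` is the offset of the first row."""
    for page in range(pages):
        yield BASE_URL + "?" + urlencode({"start": page * page_size})

urls = list(page_urls(100))
print(urls[0])    # first page, start=0
print(urls[-1])   # 100th page, start=990
print(len(urls))
```

Building the query string with urlencode instead of string concatenation also keeps things safe if you later add parameters such as keywords.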

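3. Parse the data with etree
This requires the lxml library.
response.content holds the response body as bytes; when printed, it starts with a b, which marks a bytes literal.
To turn it into a string, call decode(), which uses UTF-8 by default.
Wrap everything in a class: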

import requests
from lxml import etree

class Spider(object):
    def __init__(self):
        self.headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36" }

    def start_Request(self):
        for i in range(0,1000,10):
            # 1. Fetch the full page data
            url = "https://hr.tencent.com/position.php?&start="+str(i)
            print("Fetching URL:",url)
            response = requests.get(url,headers = self.headers)

            # 2. Extract the data we want with lxml
            html = etree.HTML(response.content.decode())
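
The difference between bytes and str is easy to see without any network request. In this small sketch the sample bytes are made up for illustration; they stand in for what response.content would hold.

```python
# Stand-in for response.content: UTF-8 encoded bytes (sample text is illustrative).
raw = "<td>腾讯招聘</td>".encode("utf-8")

print(type(raw))      # a bytes object; its repr starts with b'...'
text = raw.decode()   # decode() defaults to UTF-8
print(type(text))
print(text)
```

If a page were served in another encoding, you would pass that encoding's name to decode() instead of relying on the default.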

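4. Extract the data we want from the result
Our task is to pull out each job title and the link to its posting.
Copy a job title, right-click the page and view the source, then press Ctrl+F and search for the title.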

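The job title sits inside a <td class="l square"> block.
Use XPath to parse the HTML page.
// selects matching nodes anywhere in the document, regardless of their position.
/ selects from the root node; between two steps it selects a direct child.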

tit_list = html.xpath('//td[@class="l square"]/a/text()')

This extracts the text of the a tag directly under that td.
The href attribute holds the link to the posting, which we get the same way:

src_list = html.xpath('//td[@class="l square"]/a/@href')
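
Both expressions can be tried offline on a small HTML fragment. The fragment below is a made-up sample that only mimics the structure described above (the ids and titles are not real data). The use of urljoin assumes the href values are relative paths, which you should confirm against the live page.

```python
from urllib.parse import urljoin
from lxml import etree

# Made-up sample mimicking the listing table; not real data from the site.
sample = """
<table>
  <tr><td class="l square"><a href="position_detail.php?id=1">Backend Engineer</a></td></tr>
  <tr><td class="l square"><a href="position_detail.php?id=2">Data Analyst</a></td></tr>
</table>
"""

html = etree.HTML(sample)
tit_list = html.xpath('//td[@class="l square"]/a/text()')
src_list = html.xpath('//td[@class="l square"]/a/@href')

# Resolve each href against the page it came from (assumes relative hrefs).
page_url = "https://hr.tencent.com/position.php"
links = [urljoin(page_url, src) for src in src_list]

for tit, link in zip(tit_list, links):
    print(tit, link)
```

Note that the class value in the XPath must match the attribute exactly, including the lowercase letter l before square.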

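5. Use the zip function to pair each job title with its link as a tuple, which makes the data easy to combine
Store the data with json and with open (you need to import the json library).
Note: on Windows, pass encoding='utf-8' to open(), because the default encoding there is usually not UTF-8. On a Mac the default is typically UTF-8, so it can be omitted.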

    def xpath_data(self,html):
        tit_list = html.xpath('//td[@class="l square"]/a/text()')
        src_list = html.xpath('//td[@class="l square"]/a/@href')
        for tit,src in zip(tit_list,src_list):
            # Store the data (keys: 招聘名称 = job title, 招聘链接 = job link)
            content = json.dumps({"招聘名称":tit,"招聘链接":"http:"+src},ensure_ascii=False)+ ",\n"
            print(content)
            with open("tencent2.json",'a',encoding='utf-8')as f:
                f.write(content)

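Note: the part inside {} is a dictionary, so each key such as "招聘名称" (job title) must be followed by a colon. Many records are exported, so ',\n' is appended after each title and link to put every record on its own line. Do not leave it out, or all the records run together on a single line!
By default json.dumps escapes non-ASCII characters such as Chinese into \uXXXX sequences. To write the actual Chinese characters, pass ensure_ascii=False.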
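
The effect of ensure_ascii is easy to check with a single made-up record (the title below is illustrative, not scraped data):

```python
import json

record = {"招聘名称": "后台开发工程师"}  # illustrative record, not scraped data

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps the Chinese characters

print(escaped)   # non-ASCII characters appear as \uXXXX escapes
print(readable)

# Both strings decode back to the same dictionary.
print(json.loads(escaped) == json.loads(readable))
```

Either form is valid JSON; ensure_ascii=False simply makes the saved file readable when you open it in an editor.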

6. Chain the function calls together and wrap them in a class. The complete code:

import requests
from lxml import etree
import json

class Spider(object):
    def __init__(self):
        self.headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36" }

    def start_Request(self):
        for i in range(0,1000,10):
            # 1. Fetch the full page data
            url = "https://hr.tencent.com/position.php?&start="+str(i)
            print("Fetching URL:",url)
            response = requests.get(url,headers = self.headers)

            # 2. Extract the data we want with lxml
            html = etree.HTML(response.content.decode())
            self.xpath_data(html)

    def xpath_data(self,html):
        tit_list = html.xpath('//td[@class="l square"]/a/text()')
        src_list = html.xpath('//td[@class="l square"]/a/@href')
        for tit,src in zip(tit_list,src_list):

            # 3. Store the data with json and with open
            content = json.dumps({"招聘名称":tit,"招聘链接":"http:"+src},ensure_ascii=False)+ ",\n"
            print(content)
            with open("tencent2.json",'a',encoding='utf-8')as f:
                f.write(content)

Spider().start_Request()

After it runs, the data is saved to the file named in with open (tencent2.json here), in the directory you ran the script from.
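
Because each line the spider writes ends with a comma, the file as a whole is not a single valid JSON document, but it can still be read back line by line. Here is a rough sketch, with load_records as a hypothetical helper name and a temporary file holding made-up records standing in for tencent2.json:

```python
import json
import os
import tempfile

def load_records(path):
    """Read a file of '{...},' lines, as written by the spider, into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")  # drop the newline and trailing comma
            if line:
                records.append(json.loads(line))
    return records

# Write two made-up records in the same format, then read them back.
sample = [
    {"招聘名称": "示例岗位A", "招聘链接": "http://example.com/a"},
    {"招聘名称": "示例岗位B", "招聘链接": "http://example.com/b"},
]
path = os.path.join(tempfile.mkdtemp(), "tencent2.json")
with open(path, "a", encoding="utf-8") as f:
    for item in sample:
        f.write(json.dumps(item, ensure_ascii=False) + ",\n")

records = load_records(path)
print(len(records))
print(records[0]["招聘名称"])
```

Keep in mind that the spider opens the file in append mode, so running it twice adds the same records again unless you delete the file first.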

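And that's it, it worked!
That wraps up my first scraping post =.= I hope you were able to follow along~
Learn a lot, practice a lot.
People who study hard never have bad luck!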
