A Quick Look at Web Crawlers

Date: 2021-12-08 05:04:42

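Today I went over the basics of web crawlers. Web crawling really comes down to analyzing a website's structure and then extracting the information you need; getting good at it is mostly a matter of practice.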
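To make the "analyze the page, then extract what you need" idea concrete, here is a minimal offline sketch. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages or network access. SAMPLE_HTML is a made-up snippet, and the names JobRowParser and parse_jobs are hypothetical; only the class names t1 to t4 are taken from the scraper code in this post.

```python
from html.parser import HTMLParser

# Made-up sample markup (an assumption, not a copy of the real page).
SAMPLE_HTML = """
<div class="dw_table" id="resultList">
  <div class="el title"><span class="t1">Position</span></div>
  <div class="el">
    <p class="t1"><span><a title="ML Engineer" href="#">ML Engineer</a></span></p>
    <span class="t2"><a title="Example Co." href="#">Example Co.</a></span>
    <span class="t3">Chengdu</span>
    <span class="t4">10-15k/month</span>
  </div>
</div>
"""

# Maps the CSS class of a cell to the field it holds.
FIELDS = {"t1": "position", "t2": "company", "t3": "place", "t4": "salary"}


class JobRowParser(HTMLParser):
    """Collects one dict per <div class="el"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # dict for the row currently being read
        self._field = None  # field name of the cell currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "el" in classes:
            # A new row starts (this also matches the header row).
            self._row = {}
            self._field = None
            self.rows.append(self._row)
        elif self._row is not None and tag in ("p", "span") and classes and classes[0] in FIELDS:
            self._field = FIELDS[classes[0]]
        elif (self._row is not None and tag == "a" and "title" in attrs
              and self._field in ("position", "company")):
            # Position and company are read from the link's title attribute.
            self._row[self._field] = attrs["title"]

    def handle_data(self, data):
        # Place and salary are read from the cell's text.
        text = data.strip()
        if self._row is not None and text and self._field in ("place", "salary"):
            self._row[self._field] = text


def parse_jobs(html):
    parser = JobRowParser()
    parser.feed(html)
    # The header row has no position link, so it is filtered out here.
    return [row for row in parser.rows if "position" in row]


jobs = parse_jobs(SAMPLE_HTML)
for job in jobs:
    print(job)
```

The header row is dropped by checking for a missing position rather than by skipping index 0, which is one possible way to handle it.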

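Goal: mostly curiosity, and I may need this later, so a first look is enough for now. Tomorrow it is back to my main topic, computer vision. My thesis proposal is coming up, so I can't keep goofing off. Sigh!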

Today: scraped some of the machine learning job listings on 51job (前程无忧).

code:

from bs4 import BeautifulSoup
from urllib.request import urlopen

def seek_job(URL):
    # Example URL: "https://search.51job.com/list/090200,000000,0000,00,9,99,%25E6%259C%25BA%25E5%2599%25A8%25E5%25AD%25A6%25E4%25B9%25A0,2,1.html?"
    # The page is GBK-encoded; the 'html5lib' parser must be installed separately.
    html = urlopen(URL).read().decode('gbk')
    soup = BeautifulSoup(html, 'html5lib')
    d = soup.find("div", {"class": "dw_table", "id": "resultList"})
    div = d.find_all('div', {"class": 'el'})
    print("Number of 'el' entries on this page:", len(div))

    # Start at index 1: the first entry has a problem, so it is skipped.
    for l1 in range(1, len(div)):
        l = div[l1]
        p = l.find("p", {"class": 't1'})
        position = p.find("span").find("a")['title']
        company2 = l.find("span", {"class": 't2'})
        company = company2.find("a")["title"]
        place = l.find("span", {"class": 't3'}).text
        salary = l.find("span", {"class": 't4'}).text
        print('%-3s' % "Position:", '%-20s' % position, '%-3s' % "Company:", '%-20s' % company, '%-15s' % "Salary:", salary, '%-12s' % "Location:", place)


# Scrape 7 pages in total
for i in range(1, 8):
    URL = "https://search.51job.com/list/090200,000000,0000,00,9,99,%25E6%259C%25BA%25E5%2599%25A8%25E5%25AD%25A6%25E4%25B9%25A0,2," + str(i) + ".html?"
    seek_job(URL)
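The long escaped segment in the URL above is just the keyword 机器学习 percent-encoded twice (UTF-8 bytes first, then the % signs themselves). A small sketch of building the page URLs from a keyword follows; the helper name build_search_url and the BASE constant are hypothetical, and the URL layout is simply copied from the one used in this post, not taken from any 51job documentation.

```python
from urllib.parse import quote

# Fixed prefix copied from the URL used in this post (assumed unchanged).
BASE = "https://search.51job.com/list/090200,000000,0000,00,9,99,"


def build_search_url(keyword, page):
    # Encode twice: the first pass turns the keyword into %E6%9C%BA...,
    # the second pass turns every '%' into '%25'.
    encoded = quote(quote(keyword))
    return f"{BASE}{encoded},2,{page}.html?"


# Same 7 pages as the loop above.
urls = [build_search_url("机器学习", page) for page in range(1, 8)]
for url in urls:
    print(url)
```

With a helper like this, trying another search keyword only means changing one argument instead of editing the escaped string by hand.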