A First Look at Web Crawlers

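Today I went over the basics of web crawling. A web crawler essentially analyzes how a website's pages are structured and then extracts the information you need; getting good at it is mostly a matter of practice.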
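To make the "analyze, then extract" idea concrete, here is a minimal sketch that runs offline. It assumes a made-up HTML snippet (`SAMPLE_HTML` is hypothetical, not real site output) and uses the standard library's `html.parser` instead of BeautifulSoup, purely so it runs without extra packages. The names `TitleCollector` and `extract_titles` are illustrative only.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, invented for this sketch (not real site output).
SAMPLE_HTML = """
<div class="el"><p class="t1"><a title="ML Engineer" href="#">ML Engineer</a></p></div>
<div class="el"><p class="t1"><a title="Data Scientist" href="#">Data Scientist</a></p></div>
"""


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)


def extract_titles(html):
    """Parse an HTML string and return the link titles in document order."""
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


print(extract_titles(SAMPLE_HTML))
```

The "analysis" step is deciding which tag and attribute hold the data (here, the `title` of each `<a>`); the "extraction" step is just walking the document and collecting those values.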

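Motivation: curiosity, plus I may need this later, so a quick look is enough for now. Tomorrow it's back to the main topic, computer vision. My thesis proposal is coming up, so no more goofing off. Sigh!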

Today: scraped some of the machine learning job listings from 51job (前程无忧).

Code:

from bs4 import BeautifulSoup
from urllib.request import urlopen

def seek_job(URL):
    # Example: URL = "https://search.51job.com/list/090200,000000,0000,00,9,99,%25E6%259C%25BA%25E5%2599%25A8%25E5%25AD%25A6%25E4%25B9%25A0,2,1.html?"
    # Download the page and decode the response bytes as GBK
    html = urlopen(URL).read().decode('gbk')
    soup = BeautifulSoup(html, 'html5lib')
    # The result list is the div with class "dw_table" and id "resultList"
    d = soup.find("div", {"class": "dw_table", "id": "resultList"})
    div = d.find_all('div', {"class": 'el'})
    # The first "el" entry causes problems when parsed like the others
    # (probably the table header), so skip it
    rows = div[1:]
    print("Job listings on this page:", len(rows))

    for l in rows:
        p = l.find("p", {"class": 't1'})
        position = p.find("span").find("a")['title']
        company2 = l.find("span", {"class": 't2'})
        company = company2.find("a")["title"]
        place = l.find("span", {"class": 't3'}).text
        salary = l.find("span", {"class": 't4'}).text
        # Left-align each field in a fixed-width column
        print("Position:", '%-20s' % position, "Company:", '%-20s' % company,
              "Salary:", '%-15s' % salary, "Location:", place)


# Crawl 7 pages in total
for i in range(1,8):
    URL = "https://search.51job.com/list/090200,000000,0000,00,9,99,%25E6%259C%25BA%25E5%2599%25A8%25E5%25AD%25A6%25E4%25B9%25A0,2,"+str(i)+".html?"
    #print(URL)
    seek_job(URL)
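The search keyword in the URL above is "机器学习" (machine learning), percent-encoded twice: once as UTF-8 bytes, then once more so that each `%` becomes `%25`. The sketch below rebuilds the same URLs from the plain keyword using `urllib.parse.quote`. The helper name `build_page_url` and the constant `BASE` are assumptions made for illustration; the other path segments are copied from the URL in the script as-is, without interpreting what they mean.

```python
from urllib.parse import quote

# Fixed prefix copied from the URL used in the script above.
BASE = "https://search.51job.com/list/090200,000000,0000,00,9,99,"


def build_page_url(keyword, page):
    """Return the search URL for a keyword and a 1-based page number."""
    # First quote(): UTF-8 percent-encoding, e.g. "机" -> "%E6%9C%BA".
    # Second quote(): encodes each "%" again as "%25".
    encoded = quote(quote(keyword))
    return f"{BASE}{encoded},2,{page}.html?"


# Same 7 pages as the loop above, built from the readable keyword.
urls = [build_page_url("机器学习", page) for page in range(1, 8)]
print(urls[0])
```

Building the URL from the readable keyword makes it easy to search for something else, for example `build_page_url("深度学习", 1)`, without hand-editing the encoded string.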
