Python Development Notes: Web Data Scraping

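Getting data from the web (crawling) has two parts:

1. Fetching (downloading the web page)

· The built-in urllib package, especially urllib.request

· The third-party Requests library (for small and medium-sized crawlers)

· The Scrapy framework (for large crawlers)

2. Parsing (extracting content from the page)

· The BeautifulSoup library

· The re module (regular expressions)

Alternatively, a third-party API can handle both fetching and parsing.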
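The two steps above can be sketched with the standard library alone. This is a minimal sketch rather than a full crawler: fetch_html and extract_title are hypothetical helper names, the UTF-8 fallback is an assumption, and the regular expression only copes with a simple <title> element.

```python
import re
import urllib.request


def fetch_html(url, timeout=10):
    """Step 1 (fetch): download a page and decode it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Assumption: fall back to UTF-8 when the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Step 2 (parse): return the text of the first <title> element, or None."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Offline demonstration of the parsing step on an inline document.
sample = "<html><head><title> The Little Prince </title></head></html>"
print(extract_title(sample))
```

In real use the two helpers would be chained, for example extract_title(fetch_html(some_url)). Regular expressions are fine for a single well-known element like this; for anything nested, a real HTML parser such as BeautifulSoup is usually the safer choice.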

The Requests library (http://www.python-requests.org/en/master/)

Basic methods:

requests.get(): requests the resource at the given URL; it corresponds to the GET method of the HTTP protocol.
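To make the correspondence with HTTP GET concrete, the sketch below builds a request without sending it, so the method and the final URL can be inspected offline. The address https://example.com/search and its query parameters are placeholders chosen for illustration, and fetch_text is a hypothetical helper, not part of the original notes.

```python
import requests

# Build (but do not send) a GET request. example.com and the parameters
# are placeholders for illustration only.
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "prince", "page": 2},
)
prepared = request.prepare()

print(prepared.method)  # the HTTP verb that requests.get() uses
print(prepared.url)     # query parameters are encoded into the URL


def fetch_text(url, timeout=10):
    """Hypothetical helper: GET a page and fail loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

Passing a timeout is a deliberate choice in this sketch: Requests does not apply one by default, so a stalled server could otherwise block the crawler indefinitely.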

import requests  

r=requests.get('https://book.douban.com/subject/1084336/')  

r.status_code

Out[3]: 200  

r.text

import requests  

# Named r rather than re, so the variable does not shadow the re module.

r = requests.get('http://finance.sina.com.cn/realstock/company/sh000001/nc.shtml')

print(r.text)

The BeautifulSoup library (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

from bs4 import BeautifulSoup  

markup = '<p class="title"><b>The Little Prince</b></p>'  

soup = BeautifulSoup(markup, "lxml")  

soup.b

Out[5]: <b>The Little Prince</b>  

type(soup.b)

Out[6]: bs4.element.Tag  

tag=soup.p  

tag.name

Out[8]: 'p'  

tag.attrs

Out[9]: {'class': ['title']}  

tag['class']

Out[10]: ['title']  

tag.string

Out[11]: 'The Little Prince'  

type(tag.string)

Out[12]: bs4.element.NavigableString  

soup.find_all('b')

Out[13]: [<b>The Little Prince</b>]

import requests

from bs4 import BeautifulSoup

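# Fetch the book page, then parse it with the lxml parser.

r = requests.get('https://book.douban.com/subject/1084336/')

soup = BeautifulSoup(r.text, 'lxml')

# Find every <p> tag whose CSS class is comment-content.

comments = soup.find_all('p', 'comment-content')

for item in comments:

    print(item.string)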
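One caveat with the loop above: .string returns None when a tag contains more than one child node, so a comment with nested markup would print None. Below is a minimal sketch of a more tolerant extraction using get_text(). It runs against an inline HTML fragment that is invented for illustration (the real page structure may differ) and uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real page structure may differ.
html = """
<div>
  <p class="comment-content">A classic.</p>
  <p class="comment-content">Worth <b>rereading</b> as an adult.</p>
  <p class="other">Not a comment.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ is the keyword form of the positional class filter used above.
comments = [
    p.get_text(separator=" ", strip=True)
    for p in soup.find_all("p", class_="comment-content")
]

for text in comments:
    print(text)
```

get_text() joins all text inside the tag, including text in nested tags, which is why the second comment survives even though its .string is None.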
