Parse a URL with Beautiful Soup to get data from another URL

Date: 2023-01-06 17:03:24

I need to parse a URL to get a list of URLs that link to a detail page. Then, from each of those pages, I need to get all the details. I need to do it this way because the detail page URL is not regularly incremented and changes, but the event list page stays the same.

Basically:

example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>

example.com/events/1
    ...some detail stuff I need

example.com/events/2
    ...some detail stuff I need

3 Solutions

#1 (57 votes)

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and parse it with BeautifulSoup 3
page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
soup.prettify()
# Print the href of every anchor that actually has one
for anchor in soup.findAll('a', href=True):
    print anchor['href']

It will give you the list of URLs. Now you can iterate over those URLs and parse the data.

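Applied to the original events pages, a rough sketch of the two-step crawl could look like this (same Python 2 / BeautifulSoup 3 style as the snippet above; the event-list URL and the h1 detail selector are just placeholders for whatever the real pages use):

import urllib2
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

# Step 1: fetch the stable event-list page and collect the detail-page URLs
list_url = 'http://example.com/events/'
list_soup = BeautifulSoup(urllib2.urlopen(list_url).read())
event_urls = [urljoin(list_url, a['href'])
              for a in list_soup.findAll('a', href=True)]

# Step 2: fetch each detail page and pull out whatever detail you need
for url in event_urls:
    detail_soup = BeautifulSoup(urllib2.urlopen(url).read())
    title = detail_soup.find('h1')  # placeholder selector for the detail you want
    print url, title.string if title else ''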

  • inner_div = soup.findAll("div", {"id": "y-shade"}) is just one example; you can go through the BeautifulSoup tutorials.

#2 (4 votes)

For the next group of people that come across this: as of this post, BeautifulSoup has been upgraded to v4, and v3 is no longer being updated.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

To use in Python...

import bs4 as BeautifulSoup
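
With that import the bs4 module is aliased, so the class is reached through the module; a minimal usage sketch (the HTML string here is just placeholder markup) might be:

import bs4 as BeautifulSoup

html = '<a href="http://example.com/events/1">Event 1</a>'  # placeholder markup
soup = BeautifulSoup.BeautifulSoup(html, 'html.parser')     # class accessed via the module alias
for anchor in soup.find_all('a', href=True):                # v4 spells it find_all
    print(anchor['href'])

The more common v4 idiom is from bs4 import BeautifulSoup, after which the class is used directly.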

#3 (3 votes)

Use urllib2 to get the page, then use Beautiful Soup to get the list of links; also try scraperwiki.com.

Edit:

Recent discovery: Using BeautifulSoup through lxml with

from lxml.html.soupparser import fromstring

is miles better than just BeautifulSoup. It lets you do dom.cssselect('your selector'), which is a life saver. Just make sure you have a good version of BeautifulSoup installed; 3.2.1 works a treat.

dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
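
Tying this back to the events example, a rough sketch of the whole flow with the soup-based lxml parser (assuming lxml with cssselect support and a compatible BeautifulSoup are installed; the list URL and the selectors are illustrative) might be:

import urllib2
from lxml.html.soupparser import fromstring

# Parse the stable event-list page and collect every link on it;
# a tighter CSS selector can narrow this down to just the event anchors
list_dom = fromstring(urllib2.urlopen('http://example.com/events/').read())
event_urls = [a.get('href') for a in list_dom.cssselect('a[href]')]

# Visit each detail page and pull something out with another CSS selector
for url in event_urls:
    detail_dom = fromstring(urllib2.urlopen(url).read())
    titles = detail_dom.cssselect('title')
    print url, titles[0].text_content() if titles else ''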
