How can I efficiently web-scrape a largely unconnected line?

Sorry if that was a vague title. I'm trying to scrape the number of the latest XKCD web-comic on a regular basis. I saw that http://xkcd.com/ always has the newest comic on the front page, along with a line further down the page saying:

Permanent link to this comic: http://xkcd.com/1520/

Here 1520 is the number of the newest comic on display. I want to scrape this number, but I can't find a good way to do so. Currently all my attempts look really hackish, like:

soup = BeautifulSoup(urllib.urlopen('http://xkcd.com/').read())
test = soup.find_all('div')[7].get_text().split()[20][-5:-1]

I mean, that technically works, but if anything on the website moves even slightly it could break horribly. I know there has to be a better way to search for http://xkcd.com/####/ within a section of the front page and return just the ####, but I can't seem to find it. The "Permanent link to this comic: http://xkcd.com/1520/" line seems to be floating around without any kind of tag, class, or ID. Can anyone offer any assistance?

1 solution

#1

Usually I insist on using HTML parsers. Here, since we are looking for a specific text in HTML (not checking any tags), it is pretty much okay to apply a regular expression search on:

Permanent link to this comic: http://xkcd.com/(\d+)/

saving digits in a group.

Demo:

>>> import re
>>> import requests
>>>
>>> # .text returns a decoded str, which a str pattern needs on Python 3
>>> data = requests.get("http://xkcd.com/").text
>>> pattern = re.compile(r'Permanent link to this comic: http://xkcd\.com/(\d+)/')
>>> print(pattern.search(data).group(1))
1520
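
If the page layout changes, `pattern.search()` returns `None` and the demo above would raise an `AttributeError`. A minimal sketch of a more defensive version follows. The helper name `latest_comic_number`, the `https?` tolerance, and the sample string are assumptions for illustration, not part of the original answer:

```python
import re

# Dots are escaped so they match only a literal '.'.
# "https?" is an assumption, added to tolerate either URL scheme.
PERMALINK_RE = re.compile(
    r"Permanent link to this comic: https?://xkcd\.com/(\d+)/"
)


def latest_comic_number(html):
    """Return the comic number from the permalink line, or None if absent."""
    match = PERMALINK_RE.search(html)
    return int(match.group(1)) if match else None


# Hypothetical sample standing in for the downloaded front page.
sample = "<br />Permanent link to this comic: http://xkcd.com/1520/<br />"
print(latest_comic_number(sample))
```

With real data, you would pass in `requests.get("http://xkcd.com/").text` instead of `sample` and treat a `None` result as a signal that the page wording has changed.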
