How can I load all of a website's resources in Python, including AJAX requests?

Date: 2022-10-05 18:35:07

I know how to request a web site and read its text with Python. In the past, I've tried using a library like BeautifulSoup to make requests to all of the links on a site, but that doesn't catch things that don't look like full URLs, such as AJAX requests and most requests back to the original domain (since the "http://example.com" part will be missing, and, more importantly, they aren't in an <a href='url'>Link</a> format, so BeautifulSoup misses them).

How can I load all of a site's resources in Python? Will it require interacting with something like Selenium, or is there a way that isn't too difficult to implement without it? I haven't used Selenium much, so I'm not sure how hard that would be.

Thanks

3 Answers

#1


2  

It all depends on what you want and how you want it. The closest thing that may work for you is:

from ghost import Ghost  # Ghost.py: a headless WebKit browser for Python

ghost = Ghost()
# open() loads the page, executes its JavaScript, and returns the page along
# with every extra resource (scripts, stylesheets, AJAX responses) it fetched.
page, extra_resources = ghost.open("http://jeanphi.fr")
assert page.http_status == 200 and 'jeanphix' in ghost.content

You can learn more at: http://jeanphix.me/Ghost.py/
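
For example, to see everything the page pulled in, you could iterate over the returned resources. This is a hypothetical follow-up sketch; it assumes the resource objects expose url and http_status attributes, as Ghost.py's documentation describes:

# Hypothetical: assumes each resource has .url and .http_status (per Ghost.py docs).
for resource in extra_resources:
    print(resource.url, resource.http_status)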

#2


0  

Mmm, that's a pretty interesting question. For resources whose URLs aren't fully identifiable because they're generated at runtime or something like that (such as those used in scripts, not only AJAX), you'd need to actually run the website, so that its scripts get executed and the dynamic URLs get created.

One option is to use something like what this answer describes: a third-party library, like Qt, to actually run the website. To collect all the URLs, you need some way of monitoring every request the website makes, which could be done like this (although that example is in C++, the code is essentially the same).

Finally, once you have the URLs, you can use something like Requests to download the external resources.
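
To make that concrete, here is a minimal sketch of the same idea in Python, using Selenium (which the question mentions) instead of Qt. It assumes Chrome, chromedriver, and the selenium and requests packages are installed; Chrome's DevTools performance log is one way to observe every request the page issues:

import json

import requests
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
# Ask Chrome to record DevTools network events in its performance log.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(options=options)
driver.get("http://example.com")  # scripts run here, so AJAX requests fire

# Every request the page made appears as a Network.requestWillBeSent event.
urls = set()
for entry in driver.get_log("performance"):
    event = json.loads(entry["message"])["message"]
    if event["method"] == "Network.requestWillBeSent":
        urls.add(event["params"]["request"]["url"])
driver.quit()

# Download each discovered resource with Requests, as suggested above.
for url in urls:
    response = requests.get(url)
    print(url, response.status_code, len(response.content))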

#3


0  

I would love to hear other ways of doing this, especially if they're more concise (easier to remember), but I think this accomplishes my goal. It doesn't fully answer my original question, though -- it just retrieves more than requests.get(url) does, which was enough for me in this case:

import urllib2  # Python 2; in Python 3 the equivalent lives in urllib.request

url = 'http://example.com'
# Some sites refuse requests that lack a browser-like User-Agent header.
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib2.Request(url, None, headers)
sock = urllib2.urlopen(request)
ch = sock.read()  # the raw response body, scripts and all
sock.close()
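
For reference, a rough Python 3 equivalent of the snippet above (an untested sketch; urllib2 was split into urllib.request and urllib.error in Python 3) would be:

import urllib.request  # Python 3's replacement for urllib2

url = 'http://example.com'
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib.request.Request(url, None, headers)
# The context manager closes the connection automatically.
with urllib.request.urlopen(request) as sock:
    ch = sock.read()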
