How do I back up an entire web page (including images, etc.) from a Ruby script?

Time: 2022-06-01 20:58:55

If I have the URL of a webpage, how can I download it locally, including all the images, stylesheets, etc.? Would I have to manually parse the HTML and figure out all the external resources? Or is there a cleaner way?

Thanks!

3 solutions

#1


5  

This is one of those times I'd look elsewhere. Not that it can't be done in Ruby, but there are existing tools made for this that do it very well. Why reinvent the wheel?

Look at wget. It is a standard tool for retrieving web resources, including mirroring sites, and is available on all major platforms. From the docs:

Retrieve only one html page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded. Also make sure the downloaded page references the downloaded links.

wget -p --convert-links http://www.server.com/dir/page.html

The html page will be saved to www.server.com/dir/page.html, and the images, stylesheets, etc., somewhere under www.server.com/, depending on where they were on the remote server.

You could easily call wget from within a Ruby script using backticks or %x:

`/path/to/wget -p --convert-links http://www.server.com/dir/page.html`

or

%x{/path/to/wget -p --convert-links http://www.server.com/dir/page.html}

There are a lot of other mechanisms to do the same thing in Ruby, which give you more control.
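
One such mechanism, sketched here as an illustration rather than a recommendation from the original answer, is Open3 from the standard library. Passing the command as separate arguments avoids shell interpretation of the URL, and the exit status can be checked. The helper names `wget_command` and `mirror_page` are made up for this example, and it assumes wget is on the PATH.

```ruby
require 'open3'

# Build the argument list for wget. Passing separate arguments (rather than
# one interpolated string) keeps the shell from interpreting the URL.
# `wget_command` and `mirror_page` are hypothetical helper names.
def wget_command(url, wget: 'wget')
  [wget, '-p', '--convert-links', url]
end

# Run wget and return true if it exited successfully.
# stderr is captured so it can be reported on failure.
# Raises Errno::ENOENT if the wget executable cannot be found.
def mirror_page(url, wget: 'wget')
  _stdout, stderr, status = Open3.capture3(*wget_command(url, wget: wget))
  warn stderr unless status.success?
  status.success?
end

# Usage (needs wget installed and network access):
#   mirror_page('http://www.server.com/dir/page.html')
```

Compared with backticks, this makes failures visible to the calling script instead of silently returning wget's output.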

#2


3  

You can do this fairly easily (albeit not as easily as just learning to use 'wget') with Net::HTTP and Nokogiri:

require 'nokogiri'
require 'net/http'
require 'pathname'
require 'uri'

# Set to the URL of the HTML file
base = URI( 'https://rubygems.org/' )

# Fetch the page and parse it
source = Net::HTTP.get( base )
page   = Nokogiri::HTML( source )

# Download images
page.xpath( '//img[@src]' ).each do |imgtag|
    uri = base.merge( imgtag[:src] )        # resolve relative URLs against the page
    next unless uri.is_a?( URI::HTTP )      # skip data: URIs and the like
    localpath = Pathname( uri.path.delete_prefix( '/' ) )
    localpath.dirname.mkpath                # create the parent directory, not the file path itself
    localpath.open( 'wb' ) do |fh|
        fh.write( Net::HTTP.get( uri ) )
    end
end

# Download stylesheets
page.xpath( '//link[@rel="stylesheet"][@href]' ).each do |linktag|
    uri = base.merge( linktag[:href] )
    next unless uri.is_a?( URI::HTTP )
    localpath = Pathname( uri.path.delete_prefix( '/' ) )
    localpath.dirname.mkpath
    localpath.open( 'wb' ) do |fh|
        fh.write( Net::HTTP.get( uri ) )
    end
end

You'd obviously need better error-checking, and the resource-fetching code needs to be pulled up into a method, but if you really want to do this from Ruby, it's certainly possible.
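
As a rough sketch of what that refactoring might look like (not the answerer's own code), the path mapping and the download can be split into two small methods. The names `local_path_for` and `save_resource` are invented for this example, and the host/path directory layout is an assumption chosen to resemble the wget output described above. Redirects are not followed and network errors are not rescued.

```ruby
require 'net/http'
require 'pathname'
require 'uri'

# Resolve a reference found in the page (absolute or relative) against the
# page URL and map it to a relative local path of the form host/path.
# Returns nil for anything that is not an http(s) URL naming a file
# (for example data: URIs or directory URLs).
# `local_path_for` is a hypothetical helper name.
def local_path_for(base, ref)
  uri = URI(base).merge(ref)
  return nil unless uri.is_a?(URI::HTTP)
  return nil if uri.path.empty? || uri.path.end_with?('/')
  Pathname(uri.host) + uri.path.delete_prefix('/')
rescue URI::InvalidURIError
  nil
end

# Download one resource below `root` and return the path written,
# or nil if the reference was skipped or the server did not answer 2xx.
# `save_resource` is a hypothetical helper name.
def save_resource(base, ref, root: Pathname('.'))
  relative = local_path_for(base, ref)
  return nil if relative.nil?
  response = Net::HTTP.get_response(URI(base).merge(ref))
  return nil unless response.is_a?(Net::HTTPSuccess)
  target = root + relative
  target.dirname.mkpath
  target.binwrite(response.body)
  target
end

# Usage (needs network access):
#   save_resource('https://rubygems.org/', '/favicon.ico')
```

Keeping the path mapping free of I/O makes it easy to check on its own before any download is attempted.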

#3


-2  

Well, if you're only doing this for a few pages, I don't think you'll need a script. You can simply save the web page using any web browser, and it will download the necessary images, style sheets, etc. Or in Chrome, you can browse all the resources used by a single webpage.
