如何在Ruby上获取包含所有对象的HTML页面

时间:2022-12-08 11:20:09

I need to fetch HTML page with all objects on it (stylesheets, javascripts, images) and store data in the database. It is possible to implement this by simple fetching files listed in src attributes, but maybe someone can suggest any helper gem for this.

我需要获取包含所有对象的HTML页面(样式表,javascripts,图像)并将数据存储在数据库中。可以通过简单地获取src属性中列出的文件来实现这一点,但也许有人可以建议任何帮助宝石。

Also, is there way to package all this files to one (like web archieve), which can be opened by most browsers?

是否有办法将所有这些文件打包成一个(如web archieve),大多数浏览器都可以打开它们?

Thanks

2 个解决方案

#1


You could use mechanize to do this job:

您可以使用mechanize来完成这项工作:

require "rubygems"
require "mechanize"

url = "http://*.com/"
agent = WWW::Mechanize.new
page = agent.get(url)


page.search('img[@src]').each do |image|
  src = image["src"]
  image_file = agent.get(src) if src
  # Store image_file data it in database ...  
end

page.search('link[rel="stylesheet"]').each do |css|
  src = css["src"]
  css_file = agent.get(src) if src
  # Store css_file data it in database ...  
end

page.search('script[type="text/javascript"]').each do |script|
  src = script["src"]
  script_file = agent.get(src) if src
  # Store script_file data it in database ...    
end

You still have to handle exceptions and fix resources with relative src attributes. But this should do the job. This solution will however not fetch images that are referenced in the stylesheets.

您仍然必须处理异常并使用相对src属性修复资源。但这应该做的工作。但是,此解决方案不会获取样式表中引用的图像。

#2


Check out Mechanize

查看Mechanize

#1


You could use mechanize to do this job:

您可以使用mechanize来完成这项工作:

require "rubygems"
require "mechanize"

url = "http://*.com/"
agent = WWW::Mechanize.new
page = agent.get(url)


page.search('img[@src]').each do |image|
  src = image["src"]
  image_file = agent.get(src) if src
  # Store image_file data it in database ...  
end

page.search('link[rel="stylesheet"]').each do |css|
  src = css["src"]
  css_file = agent.get(src) if src
  # Store css_file data it in database ...  
end

page.search('script[type="text/javascript"]').each do |script|
  src = script["src"]
  script_file = agent.get(src) if src
  # Store script_file data it in database ...    
end

You still have to handle exceptions and fix resources with relative src attributes. But this should do the job. This solution will however not fetch images that are referenced in the stylesheets.

您仍然必须处理异常并使用相对src属性修复资源。但这应该做的工作。但是,此解决方案不会获取样式表中引用的图像。

#2


Check out Mechanize

查看Mechanize