Checking whether a URL exists in Ruby

Date: 2023-01-06 10:28:57

How would I go about checking if a URL exists using Ruby?

For example, for the URL

https://google.com

the result should be truthy, but for the URLs

https://no.such.domain

or

https://*.com/no/such/path

the result should be falsey

5 solutions

#1


58  

Use the Net::HTTP library.

require "net/http"
url = URI.parse("http://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)

At this point res is a Net::HTTPResponse object containing the result of the request. You can then check the response code:

do_something_with_it(url) if res.code == "200"
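One detail worth making explicit: `res.code` is a String, so comparing it with the Integer 200 never matches. A minimal sketch, assuming you also want other 2xx codes to count as success, is to test the response class instead. The responses below are built by hand purely for illustration (no request is sent), and `success?` is a hypothetical helper name.

```ruby
require "net/http"

# Hypothetical helper: true for any 2xx response, not only 200.
def success?(res)
  res.is_a?(Net::HTTPSuccess)
end

# Hand-built responses (no network traffic), for illustration only.
ok         = Net::HTTPOK.new("1.1", "200", "OK")
no_content = Net::HTTPNoContent.new("1.1", "204", "No Content")
not_found  = Net::HTTPNotFound.new("1.1", "404", "Not Found")

puts ok.code == "200"      # String comparison
puts ok.code == 200        # Integer comparison against a String
puts success?(no_content)
puts success?(not_found)
```

Checking the class rather than the literal "200" is a design choice: it treats 201, 204 and the other 2xx codes as success as well.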

Note: to check an https URL, set the use_ssl attribute to true:

require "net/http"
url = URI.parse("https://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
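Rather than hard-coding `use_ssl`, the flag can be derived from the URL scheme. The sketch below only configures the `Net::HTTP` object and does not open a connection; `build_http` and the 5-second timeouts are illustrative assumptions, not part of the answer above.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: returns a configured, not yet connected Net::HTTP object.
def build_http(url_string, timeout: 5)
  uri = URI.parse(url_string)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https") # TLS only for https URLs
  http.open_timeout = timeout            # seconds to wait for the connection
  http.read_timeout = timeout            # seconds to wait for the response
  http
end

secure = build_http("https://www.google.com/")
plain  = build_http("http://www.google.com/")
puts secure.use_ssl?
puts secure.port
puts plain.use_ssl?
```

Setting timeouts is optional, but without them a check against an unresponsive host can block for a long time.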

#2


45  

Sorry for the late reply, but I think this deserves a better answer.

There are three ways to look at this question:

  1. Strictly check whether the URL exists
  2. Check whether you are requesting the URL correctly
  3. Check whether you can request it correctly and the server can answer it correctly

1. Strictly check whether the URL exists

While 200 means that the server answers for that URL (thus, the URL exists), answering with another status code doesn't mean that the URL does not exist. For example, 302 (redirect) means that the URL exists and redirects to another one. While browsing, a 302 often behaves the same as a 200 for the end user. Another status code that can be returned when a URL exists is 500 (internal server error). After all, if the URL did not exist, why would the application server have processed your request instead of simply returning 404 (not found)?

So there are actually only two cases in which a URL does not exist: when the server does not exist, or when the server exists but can't find the given URL path. Thus, the only way to check whether the URL exists is to check that the server answers and that the return code is not 404. The following code does just that.

require "net/http"

def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  path = url.path.empty? ? '/' : url.path # plain Ruby; String#present? is Rails-only
  res = req.request_head(path)
  res.code != "404" # false if the server returns 404 - not found
rescue SocketError, Errno::ECONNREFUSED
  false # false if the host can't be resolved or refuses the connection
end
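Which exception signals a missing server depends on how the request fails. As far as I know, a host name that does not resolve raises `SocketError`, a refused connection raises `Errno::ECONNREFUSED`, and slow servers raise `Net::OpenTimeout` or `Net::ReadTimeout`. Here is a sketch of a wrapper that turns those into `false`; the name `reachable?` and the exact list are assumptions and may need extending (for example with `OpenSSL::SSL::SSLError`):

```ruby
require "net/http"

# Assumed list of errors that mean "could not talk to the server".
NETWORK_ERRORS = [
  SocketError,            # DNS lookup failed
  Errno::ECONNREFUSED,    # nothing listening on that port
  Errno::EHOSTUNREACH,    # no route to the host
  Net::OpenTimeout,       # connection attempt timed out
  Net::ReadTimeout        # server accepted but never answered
].freeze

# Hypothetical helper: runs the block, returning false on a network failure.
def reachable?
  yield
rescue *NETWORK_ERRORS
  false
end

# Simulated failure and success; no request is sent here.
failed = reachable? { raise(SocketError, "getaddrinfo: Name or service not known") }
worked = reachable? { true }
puts failed
puts worked
```

In real use the block would contain the `request_head` call and the status check.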

2. Check whether you are requesting the URL correctly

However, most of the time we are not interested in whether a URL exists, but in whether we can access it. Fortunately, among the HTTP status code families there is the 4xx family, which stands for client errors (that is, an error on your side: you are not requesting the page correctly, don't have permission, and so on). This is a good family of errors to check to see whether you can access the page. From wiki:

The 4xx class of status code is intended for cases in which the client seems to have erred. Except when responding to a HEAD request, the server should include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. These status codes are applicable to any request method. User agents should display any included entity to the user.

So the following code makes sure the URL exists and you can access it:

require "net/http"

def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  path = url.path.empty? ? '/' : url.path # plain Ruby; String#present? is Rails-only
  res = req.request_head(path)
  if res.kind_of?(Net::HTTPRedirection)
    url_exist?(res['location']) # Follow the redirect and make sure you can access the target URL
  else
    res.code[0] != "4" # false if the HTTP code starts with 4 - error on your side
  end
rescue SocketError, Errno::ECONNREFUSED
  false # false if the host can't be resolved or refuses the connection
end
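A `Location` header may hold a relative reference such as `/login`, which `URI.parse` accepts but which has no host to connect to. Here is a minimal sketch of resolving it against the URL that produced the redirect, using `URI.join` from the standard library; `redirect_target` is a hypothetical helper and the example.com URLs are placeholders:

```ruby
require "uri"

# Hypothetical helper: absolute URL for a redirect's Location header.
def redirect_target(current_url, location)
  URI.join(current_url, location).to_s
end

puts redirect_target("https://example.com/a/b", "/login")
# → https://example.com/login
puts redirect_target("https://example.com/a/b", "c")
# → https://example.com/a/c
puts redirect_target("https://example.com/a/b", "https://other.example/x")
# → https://other.example/x
```

An absolute `Location` passes through unchanged, so the helper is safe to apply to every redirect.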

3. Check whether you can request it correctly and the server can answer it correctly

Just as the 4xx family tells you whether you can access the URL, the 5xx family tells you whether the server had a problem answering your request. Errors in this family are most often due to problems on the server itself, and hopefully someone is working on fixing them. If you need to be able to access the page and get a correct answer now, you should make sure the answer is not from the 4xx or 5xx family and, if you were redirected, that the redirected page answers correctly. So, much like (2), you can simply use the following code:

require "net/http"

def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  path = url.path.empty? ? '/' : url.path # plain Ruby; String#present? is Rails-only
  res = req.request_head(path)
  if res.kind_of?(Net::HTTPRedirection)
    url_exist?(res['location']) # Follow the redirect and make sure you can access the target URL
  else
    !%w[4 5].include?(res.code[0]) # Not from the 4xx or 5xx families
  end
rescue SocketError, Errno::ECONNREFUSED
  false # false if the host can't be resolved or refuses the connection
end
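The three interpretations differ only in which status codes they reject, so the decision can be isolated from the network call. This is a sketch under that assumption; `acceptable_status?` and the level names `:exists`, `:accessible` and `:working` are made up for illustration, and redirects are assumed to have been followed already:

```ruby
# Hypothetical helper: code is the status String from Net::HTTPResponse#code.
def acceptable_status?(code, level)
  case level
  when :exists     then code != "404"                # interpretation 1
  when :accessible then !code.start_with?("4")       # interpretation 2
  when :working    then !code.start_with?("4", "5")  # interpretation 3
  else raise ArgumentError, "unknown level: #{level.inspect}"
  end
end

# Print how each sample status fares under the three levels.
%w[200 403 404 500].each do |code|
  results = %i[exists accessible working].map { |lvl| acceptable_status?(code, lvl) }
  puts "#{code}: #{results.inspect}"
end
```

A 403, for instance, passes the strict existence check but fails the other two, while a 500 fails only the last.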

#3


23  

Net::HTTP works, but if you can work outside the stdlib, Faraday is better.

Faraday.head(the_url).status == 200

(200 is a success code, assuming that's what you meant by "exists".)

#4


4  

You should read this article:

Validating URL/URI in Ruby on Rails

#5


3  

Simone's answer was very helpful to me.

Here is a version that returns the final URL (truthy) or false depending on whether the URL works, and which handles redirects:

require 'net/http'
require 'set'

def working_url?(url, max_redirects=6)
  response = nil
  seen = Set.new
  loop do
    url = URI.parse(url)
    break if seen.include? url.to_s    # redirect loop
    break if seen.size > max_redirects # too many redirects
    seen.add(url.to_s)
    http = Net::HTTP.new(url.host, url.port)
    http.use_ssl = (url.scheme == 'https') # needed for https URLs
    response = http.request_head(url.path.empty? ? '/' : url.path)
    if response.kind_of?(Net::HTTPRedirection)
      url = response['location']
    else
      break
    end
  end
  response.kind_of?(Net::HTTPSuccess) && url.to_s
end
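The redirect handling can be exercised without a network by passing the fetch step in as a block. In this sketch the block stands in for the HEAD request and returns a `[status, location]` pair; `follow_redirects` and the canned `pages` table are illustrative assumptions, not part of the answer above.

```ruby
require "set"

# Hypothetical helper: follows redirects until a non-redirect answer.
# Returns the final URL, or nil on a redirect loop or when the limit is exceeded.
def follow_redirects(url, max_redirects: 6)
  seen = Set.new
  loop do
    return nil if seen.include?(url)          # redirect loop
    return nil if seen.size > max_redirects   # too many hops
    seen.add(url)
    status, location = yield(url)
    return url unless status.start_with?("3") # final answer reached
    url = location
  end
end

# Canned responses standing in for real servers.
pages = {
  "http://a.example/" => ["301", "http://b.example/"],
  "http://b.example/" => ["200", nil],
  "http://x.example/" => ["302", "http://y.example/"],
  "http://y.example/" => ["302", "http://x.example/"]
}

final  = follow_redirects("http://a.example/") { |u| pages.fetch(u) }
looped = follow_redirects("http://x.example/") { |u| pages.fetch(u) }
puts final
p looped
```

Keeping the fetch step separate also makes it easy to swap in a different HTTP client later.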
