PHP / cURL: HEAD request takes a very long time on some sites.

Date: 2022-06-01 18:59:32

I have some simple code that makes a HEAD request for a URL and then prints the response headers. I've noticed that on some sites, this can take a long time to complete.

For example, requesting http://www.arstechnica.com takes about two minutes. I've tried the same request using another web site that performs the same basic task, and it comes back immediately. So there must be something I have set incorrectly that's causing this delay.

Here's the code I have:

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

// Only calling the head
curl_setopt($ch, CURLOPT_HEADER, true); // header will be in the output
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD'); // HTTP request is 'HEAD'

$content = curl_exec($ch);
curl_close($ch);

Here's a link to a web site that performs the same function: http://www.seoconsultants.com/tools/headers.asp

The code above, at least on my server, takes two minutes to retrieve www.arstechnica.com, but the service at the link above returns it right away.

What am I missing?

5 Answers

#1 (41 votes)

Try simplifying it a little bit:

print htmlentities(file_get_contents("http://www.arstechnica.com"));

The above outputs instantly on my web server. If it doesn't on yours, there's a good chance your web host has some kind of setting in place to throttle this kind of request.

EDIT:

Since the above happens instantly for you, try adding this cURL option to your original code:

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

Using the tool you posted, I noticed that http://www.arstechnica.com sends a 301 redirect for any request made to it. It's possible that cURL is receiving this and not following the new Location specified, which is causing your script to hang.

SECOND EDIT:

Curiously enough, running the same code you have above made my web server hang too. I replaced this line:

curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD'); // HTTP request is 'HEAD'

With this:

curl_setopt($ch, CURLOPT_NOBODY, true);

which is the way the manual recommends doing a HEAD request. That made it work instantly.
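Putting both edits together, a minimal sketch of the fixed request could look like this. The helper names head_request_options() and head_request() are my own, not from the original code; this assumes the standard ext-curl extension:

```php
<?php
// Sketch of the fix above: CURLOPT_NOBODY makes cURL send a true HEAD
// request, and CURLOPT_FOLLOWLOCATION follows the 301 that
// www.arstechnica.com returns.
function head_request_options(): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,  // return the response instead of printing it
        CURLOPT_CONNECTTIMEOUT => 20,    // give up connecting after 20 seconds
        CURLOPT_HEADER         => true,  // include response headers in the output
        CURLOPT_NOBODY         => true,  // send HEAD; do not wait for a body
        CURLOPT_FOLLOWLOCATION => true,  // follow 301/302 redirects
    ];
}

function head_request(string $url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, head_request_options());
    $headers = curl_exec($ch);
    curl_close($ch);
    return $headers; // raw header text, or false on failure
}

// Live network call, so commented out here:
// echo head_request('http://www.arstechnica.com');
```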

#2 (6 votes)

You have to remember that HEAD is only a suggestion to the web server. For HEAD to do the right thing, it often takes some explicit effort on the part of the admins. If you HEAD a static file, Apache (or whatever your web server is) will often step in and do the right thing. If you HEAD a dynamic page, the default for most setups is to execute the GET path, collect all the results, and send back just the headers without the content. If that application is in a three-tier (or more) setup, that call could be very expensive and needless for a HEAD context. For instance, in a Java servlet, doHead() just calls doGet() by default. To do something smarter for the application, the developer would have to explicitly implement doHead() (and more often than not, they will not).

I encountered an app from a Fortune 100 company that is used for downloading several hundred megabytes of pricing information. We'd check for updates to that data by executing HEAD requests fairly regularly until the modified date changed. It turned out that this request would actually trigger back-end calls to regenerate that list every time we made it, which involved gigabytes of data on their back end and transferring it between several internal servers. They weren't terribly happy with us, but once we explained the use case they quickly came up with an alternate solution. If they had implemented HEAD, rather than relying on their web server to fake it, it would not have been an issue.
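The server-side point above can be sketched in PHP (rather than a Java servlet). Both should_skip_body() and generate_expensive_report() are hypothetical names of my own, used only to illustrate the idea:

```php
<?php
// Sketch: a PHP script can honor HEAD explicitly by checking the request
// method and skipping the expensive body generation entirely.
function should_skip_body(string $method): bool
{
    // HEAD must return the same headers as GET, but no body,
    // so the costly content step can be skipped.
    return strtoupper($method) === 'HEAD';
}

// In a real script it might be used like this:
// header('Content-Type: text/html');
// header('Last-Modified: ' . gmdate('D, d M Y H:i:s') . ' GMT');
// if (should_skip_body($_SERVER['REQUEST_METHOD'])) {
//     exit; // headers only; skip the expensive back-end work
// }
// echo generate_expensive_report(); // hypothetical costly step
```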

#3 (3 votes)

I used the function below to find the redirect target URL.

$head = get_headers($url, 1);

The second argument makes it return an array with keys. For example, the following gives the Location value:

$head["Location"]

http://php.net/manual/en/function.get-headers.php
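One wrinkle worth noting: with get_headers($url, 1), a chain of redirects makes "Location" an array of every hop rather than a string. A small sketch for pulling out the final target (final_location is my own helper name):

```php
<?php
// Sketch: with get_headers($url, 1), "Location" is a string for a single
// redirect but an array for a redirect chain. final_location() returns
// the last hop, or null if there was no redirect at all.
function final_location(array $head): ?string
{
    if (!isset($head['Location'])) {
        return null; // no redirect in the response
    }
    $loc = $head['Location'];
    return is_array($loc) ? end($loc) : $loc;
}

// Usage against a live URL (network call, commented out):
// $head = get_headers('http://www.arstechnica.com', 1);
// var_dump(final_location($head));
```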

#4 (2 votes)

If my memory doesn't fail me, doing a HEAD request in cURL changes the HTTP protocol version to 1.0 (which is slower and is probably the culprit here). Try changing that to:

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

// Only calling the head
curl_setopt($ch, CURLOPT_HEADER, true); // header will be in the output
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD'); // HTTP request is 'HEAD'
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1); // ADD THIS

$content = curl_exec($ch);
curl_close($ch);

#5 (0 votes)

This:

curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);

I wasn't trying to get headers.
I was just trying to make a page load of some data not take two minutes, similar to what's described above.
That magical little option dropped it down to 2 seconds.
