What is the optimal duration for a web crawler to wait between repeated requests to the same web server?

Time: 2023-01-22 01:11:21

Is there a standard time duration that a crawler should wait between repeated hits to the same server, so as not to overburden the server?

If not, are there any suggestions on what would be a good waiting period for the crawler to be considered polite?

Does this value also vary from server to server... and if so, how can one determine it?

4 Solutions

#1


This IBM article goes into some detail on how their Web crawler uses the robots exclusion protocol and the recrawl interval settings in the Web crawler.

To quote the article:

The first time that a page is crawled, the crawler uses the date and time that the page is crawled and an average of the specified minimum and maximum recrawl intervals to set a recrawl date. The page will not be recrawled before that date. The time that the page will be recrawled after that date depends on the crawler load and the balance of new and old URLs in the crawl space.

Each time that the page is recrawled, the crawler checks to see if the content has changed. If the content has changed, the next recrawl interval will be shorter than the previous one, but never shorter than the specified minimum recrawl interval. If the content has not changed, the next recrawl interval will be longer than the previous one, but never longer than the specified maximum recrawl interval.

This is about their web crawler, but it is very useful reading while building your own tool.

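To make the quoted behaviour concrete, here is a minimal sketch in Python of that interval-adjustment logic. The bounds, the shrink/growth factors, and the function names are my own illustrative choices, not values taken from the IBM crawler.

```python
from datetime import datetime, timedelta

# Illustrative bounds; the IBM crawler lets you configure these.
MIN_RECRAWL = timedelta(hours=6)
MAX_RECRAWL = timedelta(days=30)

def initial_recrawl_date(crawled_at: datetime) -> datetime:
    """First crawl: schedule the recrawl at the average of the min and max intervals."""
    return crawled_at + (MIN_RECRAWL + MAX_RECRAWL) / 2

def next_recrawl_interval(previous: timedelta, content_changed: bool) -> timedelta:
    """Shorten the interval when the content changed, lengthen it when it did not,
    always clamping to the configured minimum and maximum."""
    if content_changed:
        candidate = previous / 2   # arbitrary shrink factor
        return max(candidate, MIN_RECRAWL)
    candidate = previous * 2       # arbitrary growth factor
    return min(candidate, MAX_RECRAWL)
```

On each recrawl you would feed the result of your own change detection back into next_recrawl_interval and schedule the page accordingly.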

#2


I know this might be a little late, but the existing answers weren't helping me with this question. I too am concerned about how often a crawler should hit a server, especially after reading the wikipedia.org robots.txt, where it disallows bots that "Hits many times per second, not acceptable".

I found this interesting MS Research article entitled Web Crawler Architecture - http://research.microsoft.com/pubs/102936/EDS-WebCrawlerArchitecture.pdf. The following excerpt from the paper talks about politeness.

There are many possible politeness policies; one that is particularly easy to implement is to disallow concurrent requests to the same web server; a slightly more sophisticated policy would be to wait for time proportional to the last download time before contacting a given web server again.

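As a rough sketch of those two policies combined (no concurrent requests to the same host, plus a wait proportional to the last download time), assuming a single-process crawler and an arbitrary politeness factor of 10:

```python
import threading
import time
import urllib.request
from urllib.parse import urlparse

POLITENESS_FACTOR = 10          # wait 10x the last download time (an assumption, tune to taste)
_host_locks: dict[str, threading.Lock] = {}
_next_allowed: dict[str, float] = {}

def polite_fetch(url: str) -> bytes:
    host = urlparse(url).netloc
    lock = _host_locks.setdefault(host, threading.Lock())
    with lock:                                  # no concurrent requests to the same host
        wait = _next_allowed.get(host, 0.0) - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        start = time.monotonic()
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        elapsed = time.monotonic() - start
        # Next contact no sooner than POLITENESS_FACTOR x the last download time.
        _next_allowed[host] = time.monotonic() + POLITENESS_FACTOR * elapsed
        return body
```

The proportional wait effectively caps your crawler at a small fraction of the server's capacity: the slower the server responds, the longer you stay away.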

#3


That will depend on how often the content changes. For example, it makes sense to crawl a news site more often than a site with static articles.

As to exactly how to determine the optimum - it will depend on how you judge the cost of fetching, indexing, etc. against the value of having up-to-date data. That's entirely up to you, but you will probably have to use some heuristics, based on observation, to work out how much the site is changing over time. If a site hasn't changed for three fetches in a row, you might want to wait a little bit longer before fetching next time. Conversely, if a site changes every time you fetch it, you might want to be a little bit more aggressive to avoid missing updates.

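A hedged sketch of such a heuristic, using the "three fetches in a row" idea from above; the base wait and the halving/doubling factors are arbitrary starting points you would tune from your own observations, and the class and method names are purely illustrative:

```python
import hashlib
from collections import deque

class ChangeTracker:
    """Track whether recent fetches of a URL changed, and suggest a wait time."""

    def __init__(self, base_wait: float = 3600.0):
        self.base_wait = base_wait       # seconds; an assumed starting guess
        self.history = deque(maxlen=3)   # remember whether the last 3 fetches changed
        self.last_hash = None

    def record_fetch(self, body: bytes) -> None:
        digest = hashlib.sha256(body).hexdigest()
        self.history.append(digest != self.last_hash)
        self.last_hash = digest

    def suggested_wait(self) -> float:
        if len(self.history) == self.history.maxlen and not any(self.history):
            return self.base_wait * 2    # unchanged three times in a row: back off
        if self.history and all(self.history):
            return self.base_wait / 2    # changed on every recent fetch: be more aggressive
        return self.base_wait
```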

#4


I don't think there is a fixed minimum interval for how often you can hit a site, as it is highly dependent on the server's current load and capability.

You can try to measure response times and time-out rates: if a site is responding slowly or returning time-out errors, you should increase your re-hit interval, even though it might not be your crawler causing the slowness or time-outs.

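For example, a simple feedback loop along those lines might look like the following; the thresholds and scaling factors here are illustrative guesses rather than standards:

```python
import socket
import time
import urllib.error
import urllib.request

def adjust_interval(url: str, interval: float,
                    slow_threshold: float = 5.0, timeout: float = 10.0) -> float:
    """Fetch the URL once and return an adjusted re-hit interval (in seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except (urllib.error.URLError, socket.timeout):
        return interval * 2              # time-out or error: back off sharply
    elapsed = time.monotonic() - start
    if elapsed > slow_threshold:
        return interval * 1.5            # slow response: back off a little
    return max(interval * 0.9, 1.0)      # healthy response: relax slightly, keep at least 1s
```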
