How do I stop bots from incrementing my file download counter in PHP?

Time: 2023-01-27 10:53:20

When a user clicks a link to download a file on my website, they go to this PHP file which increments a download counter for that file and then header()-redirects them to the actual file. I suspect that bots are following the download link, however, so the number of downloads is inaccurate.

  • How do I let bots know that they shouldn't follow the link?
  • Is there a way to detect most bots?
  • Is there a better way to count the number of downloads a file gets?

4 Solutions

#1


16  

robots.txt: http://www.robotstxt.org/robotstxt.html

Not all bots respect it, but most do. If you really want to prevent access via bots, make the link to it a POST instead of a GET. Bots will not follow POST URLs. (I.e., use a small form that posts back to the site and takes you to the URL in question.)

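For example, a robots.txt in the site root could disallow the counter script. The path below is hypothetical; substitute whatever your counter script is actually called:

    User-agent: *
    Disallow: /download.php

And a minimal sketch of the POST variant, again with a hypothetical script name and file parameter:

    <!-- Well-behaved crawlers won't submit this form, so the counter isn't hit. -->
    <form method="post" action="/download.php">
        <input type="hidden" name="file" value="example.zip">
        <button type="submit">Download example.zip</button>
    </form>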

#2


4  

I would think Godeke's robots.txt answer would be sufficient. If you absolutely cannot have the bots running up your counter, then I would recommend using the robots file in conjunction with not incrementing the clicks for requests from some common robot user agents.

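A rough sketch of that user-agent check, assuming a hypothetical increment_download_count() that wraps your existing counter logic and a hypothetical ?file= parameter:

    <?php
    // Very rough user-agent sniffing; it only catches crawlers that identify themselves.
    function is_probable_bot($userAgent)
    {
        $signatures = array('bot', 'crawler', 'spider', 'slurp', 'archiver');
        $userAgent = strtolower($userAgent);
        foreach ($signatures as $signature) {
            if (strpos($userAgent, $signature) !== false) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical parameter: download.php?file=example.zip
    $file = isset($_GET['file']) ? basename($_GET['file']) : '';

    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (!is_probable_bot($userAgent)) {
        increment_download_count($file); // your existing counter logic goes here
    }

    // Redirect to the real file either way.
    header('Location: /files/' . $file);
    exit;

Keep in mind this only filters bots that send an honest User-Agent header, which is why the robots.txt rule is still worth having.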

Neither way is perfect, but the mixture of the two is probably a little more strict. If it were me, I would probably just stick to the robots file though, since it is easy and probably the most effective solution.

#3


3  

Godeke is right: robots.txt is the first thing to do to keep bots from downloading.

Regarding the counting, this is really a web analytics problem. Are you not keeping your www access logs and running them through an analytics program like Webalizer or AWStats (or fancy alternatives like Webtrends or Urchin)? To me that's the way to go for collecting this sort of info, because it's easy and there's no PHP, redirect or other performance hit when the user's downloading the file. You're just using the Apache logs that you're keeping anyway. (And grep -c will give you the quick 'n' dirty count on a particular file or wildcard pattern.)

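For instance, against a standard Apache access log (the log path and file name here are hypothetical), a quick count of requests for one file might look like:

    grep -c "GET /files/example.zip" /var/log/apache2/access.log

Note that this counts every request for the file, bots included, so it's only a starting point before you filter by user agent in your analytics tool.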

You can configure your stats software to ignore hits by bots, or specific user agents and other criteria (and if you change your criteria later on, you just reprocess the old log data). Of course, this does require you have all your old logs, so if you've been tossing them with something like logrotate you'll have to start out without any historical data.

#4


0  

You can also detect malicious bots, which won't respect robots.txt, using http://www.bad-behavior.ioerror.us/.
