Forbidden by robots.txt: scrapy

Time: 2022-02-07 04:04:08

While crawling a website like https://www.netflix.com, I am getting: Forbidden by robots.txt: https://www.netflix.com/>

ERROR: No response downloaded for: https://www.netflix.com/

2 Answers

#1


80  

In the new version (Scrapy 1.1), released 2016-05-11, the crawler downloads robots.txt before crawling. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:

ROBOTSTXT_OBEY = False

Here are the release notes

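If you only want to bypass robots.txt for one spider rather than the whole project, a minimal sketch using Scrapy's per-spider custom_settings (the spider name and URL below are just placeholders):

import scrapy

class NetflixSpider(scrapy.Spider):
    # Placeholder name and start URL; only the custom_settings override matters here.
    name = "netflix"
    start_urls = ["https://www.netflix.com/"]

    # Overrides the project-wide setting for this spider only.
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def parse(self, response):
        self.logger.info("Fetched %s with status %s", response.url, response.status)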

#2


0  

The first thing you need to ensure is that you change the user agent in your requests; otherwise the default user agent will be blocked for sure.

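A minimal sketch of changing the user agent, either project-wide in settings.py or on an individual request (the UA string below is just an example browser string, not a required value):

# settings.py -- project-wide override of Scrapy's default user agent
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# Or per request, inside a spider's start_requests():
#   yield scrapy.Request(url, headers={"User-Agent": USER_AGENT})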
