How to access Scrapy settings from an item pipeline

Date: 2022-09-10 14:28:21

How do I access the Scrapy settings defined in settings.py from an item pipeline? The documentation mentions they can be accessed through the crawler in extensions, but I don't see how to access the crawler in a pipeline.


4 solutions

#1 (24 votes)

OK, so the documentation at http://doc.scrapy.org/en/latest/topics/extensions.html says:


The main entry point for a Scrapy extension (this also includes middlewares and pipelines) is the from_crawler class method, which receives a Crawler instance, the main object controlling the Scrapy crawler. Through that object you can access settings, signals, and stats, and also control the crawler behaviour, if your extension needs such a thing.


So your pipeline can define a from_crawler class method to fetch the settings:


@classmethod
def from_crawler(cls, crawler):
    settings = crawler.settings
    my_setting = settings.get("MY_SETTING")
    return cls(my_setting)

The crawler engine then calls the pipeline's __init__ method with my_setting, like so:


def __init__(self, my_setting):
    self.my_setting = my_setting

And other functions can access it with self.my_setting, as expected.


Alternatively, in the from_crawler() function you can pass the crawler.settings object to __init__(), and then access settings from the pipeline as needed instead of pulling them all out in the constructor.

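Putting the two snippets together, a minimal self-contained sketch of such a pipeline looks like this. MY_SETTING is the hypothetical key from the snippets above, and the stub crawler in the usage example merely stands in for the real Crawler object Scrapy passes in:

```python
class MySettingPipeline:
    def __init__(self, my_setting):
        self.my_setting = my_setting

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings behaves like a read-only mapping of your settings.py values
        return cls(crawler.settings.get("MY_SETTING"))

    def process_item(self, item, spider):
        # The setting is available anywhere on the instance
        item["my_setting"] = self.my_setting
        return item
```

In a real project Scrapy calls from_crawler for you when the crawl starts; you never instantiate the pipeline yourself.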

#2 (23 votes)

The way to access your Scrapy settings (as defined in settings.py) from within your_spider.py is simple. All the other answers are way too complicated. The reason for this is the very poor maintenance of the Scrapy documentation, combined with many recent updates and changes. Neither the "Settings" documentation's "How to access settings" section nor the "Settings API" page bothers to give a workable example. Here's an example of how to get your current USER_AGENT string.


Just add the following lines to your_spider.py:


# To get your settings from settings.py:
import scrapy
from scrapy.utils.project import get_project_settings
...
class YourSpider(scrapy.Spider):
    ...
    def parse(self, response):
        ...
        settings = get_project_settings()
        print("Your USER_AGENT is:\n%s" % settings.get('USER_AGENT'))
        ...

As you can see, there's no need to use @classmethod or redefine the from_crawler() or __init__() methods. Hope this helps. (One caveat: get_project_settings() reads the project settings directly, so per-spider custom_settings or command-line overrides won't be reflected; inside a running spider, self.settings does include them.)


PS. I'm still not sure why from scrapy.settings import Settings doesn't work the same way, since it would seem the more obvious import.


#3 (16 votes)

The correct answer is: it depends where in the pipeline you wish to access the settings.


avaleske answered as if you wanted access to the settings outside of your pipeline's process_item method, but it's very likely that's exactly where you'll want the setting, and in that case there's a much easier way: the Spider instance itself gets passed in as an argument.


class PipelineX(object):

    def process_item(self, item, spider):
        wanted_setting = spider.settings.get('WANTED_SETTING')
        ...
        return item
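A runnable sketch of the same idea, using a stand-in spider since a real one only exists inside a running crawl (WANTED_SETTING and the SimpleNamespace stub are illustrative, not part of Scrapy):

```python
from types import SimpleNamespace

class PipelineX:
    def process_item(self, item, spider):
        # Scrapy binds the merged settings to the spider, so the pipeline
        # can read them here without any from_crawler plumbing.
        item["wanted"] = spider.settings.get("WANTED_SETTING")
        return item

# Stand-in for a running spider; a real Spider gets .settings from its crawler
spider = SimpleNamespace(settings={"WANTED_SETTING": "value-from-settings"})
```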

#4 (2 votes)

The project structure is quite flat, so why not just import the settings module directly:


# pipeline.py
from myproject import settings
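One caveat with the direct import: it reads settings.py as a plain Python module, so any overrides applied at runtime (command-line -s flags, a spider's custom_settings) are invisible to it. The sketch below fakes myproject.settings with a stand-in module, since the real project isn't importable here, just to illustrate what the pipeline ends up seeing:

```python
import types

# Stand-in for `from myproject import settings`
settings = types.ModuleType("settings")
settings.MY_SETTING = "value from settings.py"

def process_item(item):
    # The pipeline sees only the module-level constant, never any runtime override
    item["source"] = settings.MY_SETTING
    return item
```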
