Is there a way to export each scraped item to a separate JSON file using Scrapy?

Date: 2022-05-08 22:36:58

Currently I am using "yield item" after every item I scrape, but that gives me all the items in one single JSON file.

1 answer

#1


You can use a Scrapy item pipeline, and from there write each item to a separate file.

I set a counter in my spider that increments each time an item is yielded, and added that value to the item. The pipeline then uses the counter value to build the file name.

Test_spider.py

from scrapy import Spider

from test1.items import TestItem  # assumption: the item class lives in test1/items.py


class TestSpider(Spider):
    # spider name and all
    file_counter = 0

    def parse(self, response):
        # your code here
        pass

    def parse_item(self, response):
        # your code here
        self.file_counter += 1
        item = TestItem(
            # other fields,
            counter=self.file_counter)
        yield item

Enable the pipeline in settings.py:

ITEM_PIPELINES = {'test1.pipelines.TestPipeline': 100}

pipelines.py

import json


class TestPipeline:

    def process_item(self, item, spider):
        # Remove the counter from the item; it is only needed for the file name
        counter = item.pop('counter')
        with open('test_data_%s.json' % counter, 'w') as wr:
            # dict(item) works for plain dicts and scrapy.Item objects alike
            json.dump(dict(item), wr)
        return item
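
As a variation on the same idea, the counter could probably live in the pipeline itself, so the spider does not have to carry an extra field on every item. The following is only a minimal sketch under that assumption: the class name PerItemJsonPipeline, the out_dir folder and the item_N.json naming are hypothetical and not part of the answer above. It relies on the open_spider and process_item hooks that Scrapy calls on item pipelines.

```python
import json
import os


class PerItemJsonPipeline:
    """Hypothetical pipeline: writes each item to its own numbered JSON file."""

    def __init__(self, out_dir="items"):
        # out_dir is an assumed output folder, not something the answer specifies
        self.out_dir = out_dir
        self.counter = 0

    def open_spider(self, spider):
        # Called once when the spider starts: make sure the folder exists
        os.makedirs(self.out_dir, exist_ok=True)

    def process_item(self, item, spider):
        # Number the files in the order the items reach the pipeline
        self.counter += 1
        path = os.path.join(self.out_dir, "item_%d.json" % self.counter)
        with open(path, "w", encoding="utf-8") as f:
            # dict(item) accepts plain dicts as well as scrapy.Item objects
            json.dump(dict(item), f, ensure_ascii=False)
        return item
```

If you try this, it would be registered in ITEM_PIPELINES the same way as TestPipeline, under whatever module path your project actually uses. The spider then just yields its items without a counter field.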
