How to use multiple requests and pass items between them in Scrapy (Python)

Date: 2023-01-24 11:05:06

I have an item object and I need to pass it across many pages so the data ends up stored in a single item.

My item looks like this:

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    description1 = Field()
    description2 = Field()
    description3 = Field()

Those three descriptions live on three separate pages, and I want to collect all of them into one item.

The following works fine for parseDescription1 on its own:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []
    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    return request

def parseDescription1(self,response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

But I want something like this:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []
    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3)
    request.meta['item'] = item

    return request

def parseDescription1(self,response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self,response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self,response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item

3 Answers

#1


26  

No problem. Instead of this:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []
    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3)
    request.meta['item'] = item

    return request

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item

do this:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    yield request

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2, meta={'item': item})
    yield request

    yield Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3, meta={'item': item})

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item
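
Note that yielding three independent requests like this means all three callbacks share (and mutate) the same item object, so the pipeline receives that one item three times, in varying states of completeness. A framework-free sketch (plain Python, no Scrapy; the function names and the plain-dict "meta" are illustrative) of that effect:

```python
# Framework-free sketch: three "callbacks" that all receive the same item
# dict via a meta mapping, mutate it, and return it -- mirroring the three
# parallel requests above. The "pipeline" ends up seeing one shared dict
# three times, in varying states of completeness.

def parse_description1(meta):
    meta['item']['desc1'] = 'test'
    return meta['item']

def parse_description2(meta):
    meta['item']['desc2'] = 'test2'
    return meta['item']

def parse_description3(meta):
    meta['item']['desc3'] = 'test3'
    return meta['item']

item = {}
# Simulate the scheduler invoking each callback with its request meta:
pipeline_output = [cb({'item': item}) for cb in
                   (parse_description1, parse_description2, parse_description3)]

# Every "returned item" is the very same dict object:
assert all(result is item for result in pipeline_output)
print(item)   # {'desc1': 'test', 'desc2': 'test2', 'desc3': 'test3'}
```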

#2


21  

In order to guarantee the ordering of the requests/callbacks, and that only one item is ultimately returned, you need to chain your requests using a form like:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = Item()
    return [request]

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return [Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})]

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return [Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})]

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return [item]

Each callback function returns an iterable of items or requests; requests are scheduled, and items are run through your item pipeline.

If you return an item from each of the callbacks, you'll end up with four items in various states of completeness in your pipeline, but if you return the next request instead, then you can guarantee the order of the requests and that you will have exactly one item at the end of execution.
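
The chaining can be illustrated without Scrapy at all. In this plain-Python sketch (the function names and the tuple-based "request" representation are illustrative, not Scrapy API), each callback hands back either the next request or the finished item, so exactly one item falls out at the end:

```python
# Plain-Python sketch of the chained pattern: a "request" is modelled as a
# (url, callback, meta) tuple, and each callback returns either the next
# request in the chain or the finished item.

def parse_description1(meta):
    meta['item']['desc1'] = 'test'
    return ('http://www.example.com/lin2.cpp', parse_description2, meta)

def parse_description2(meta):
    meta['item']['desc2'] = 'test2'
    return ('http://www.example.com/lin3.cpp', parse_description3, meta)

def parse_description3(meta):
    meta['item']['desc3'] = 'test3'
    return meta['item']          # end of the chain: hand the item over

# Drive the chain the way Scrapy's scheduler would (minus the HTTP):
result = ('http://www.example.com/lin1.cpp', parse_description1, {'item': {}})
while isinstance(result, tuple):
    url, callback, meta = result
    result = callback(meta)

print(result)   # exactly one complete item reaches the "pipeline"
```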

#3


13  

The accepted answer returns a total of three items [with desc(i) set for i=1,2,3].

If you want to return a single item, Dave McLain's answer does work; however, it requires parseDescription1, parseDescription2, and parseDescription3 to all succeed and run without errors in order to return the item.

For my use case, some of the subrequests MAY return HTTP 403/404 errors at random, so I lost some of the items, even though I could have scraped them partially.


Workaround

Thus, I currently employ the following workaround: instead of only passing the item around in the request.meta dict, pass around a call stack that knows which request to call next. It calls the next target on the stack (as long as the stack isn't empty), and returns the item once the stack is empty.

The errback request parameter is used to return to the dispatcher method upon errors and simply continue with the next stack item.

def callnext(self, response):
    ''' Call next target for the item loader, or yields it if completed. '''

    # Get the meta object from the request, as the response
    # does not contain it.
    meta = response.request.meta

    # Items remaining in the stack? Execute them
    if len(meta['callstack']) > 0:
        target = meta['callstack'].pop(0)
        yield Request(target['url'], meta=meta, callback=target['callback'], errback=self.callnext)
    else:
        yield meta['loader'].load_item()

def parseDescription1(self, response):

    # Recover item(loader)
    l = response.meta['loader']

    # Use just as before
    l.add_css(...)

    # Build the call stack
    callstack = [
        {'url': "http://www.example.com/lin2.cpp",
         'callback': self.parseDescription2},
        {'url': "http://www.example.com/lin3.cpp",
         'callback': self.parseDescription3}
    ]

    # Store the stack in the meta dict so callnext() can pop targets off it
    response.meta['callstack'] = callstack

    return self.callnext(response)

def parseDescription2(self, response):

    # Recover item(loader)
    l = response.meta['loader']

    # Use just as before
    l.add_css(...)

    return self.callnext(response)


def parseDescription3(self, response):

    # ...

    return self.callnext(response)
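
The dispatcher logic above can be exercised without Scrapy. In this plain-Python sketch (the function names and the simulated failure are illustrative), a failing step is skipped via the errback path and a partial item still comes out:

```python
# Plain-Python sketch of the call-stack dispatcher: pop the next target off
# the stack and run it; on an error, take the "errback" path and simply
# continue with the next stack entry. When the stack is empty, return the
# item in whatever (possibly partial) state it reached.

def callnext(meta):
    while meta['callstack']:
        target = meta['callstack'].pop(0)
        try:
            target(meta)                      # simulated sub-request callback
        except Exception:
            continue                          # errback path: skip this target
    return meta['item']                       # stack empty: "load" the item

def parse_description2(meta):
    raise RuntimeError('simulated HTTP 403')  # this sub-request fails

def parse_description3(meta):
    meta['item']['desc3'] = 'test3'

meta = {
    'item': {'desc1': 'test'},                # already filled by the first page
    'callstack': [parse_description2, parse_description3],
}
partial = callnext(meta)
print(partial)   # partial item: desc2 is missing, desc3 is still collected
```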

Warning

This solution is still synchronous, and it will still fail if any of the callbacks raises an exception.

For more information, check the blog post I wrote about this solution.
