Pillow + Scrapy = sometimes cannot identify image file

Time: 2021-01-31 19:36:15

I have a small bug with Scrapy and Pillow. I know there are many similar questions, but I have tried everything I could find and nothing works.

I use Scrapy to parse many websites, more than 100,000 web pages in total. I've created a pipeline that checks whether a page contains an image and, if so, downloads the picture and creates a thumbnail at the same path. I do it this way so that, if creating the thumbnail fails, I still have the "big" version of the image.

Here is some code:

import os
import urllib.request

from PIL import Image
from scrapy.exceptions import DropItem
from slugify import slugify


class DownloadImageOnDisk:
    def __init__(self, img_path):
        self.img_path = img_path

    @classmethod
    def from_crawler(cls, crawler):
        # IMG_PATH is the project setting that holds the image root directory
        return cls(crawler.settings['IMG_PATH'])

    def process_item(self, item, spider):
        # Nothing to do when the page has no image
        if not item.get('image'):
            return item

        img = item['image']
        # Extension of the image: text after the last dot, without any query string
        ext = img.split('.')[-1].split('?')[0]
        key = self.remove_accents(item['imagetitle'])
        directory = os.path.join(self.img_path, item['website'])
        path = os.path.join(directory, key + '.' + ext)

        # Create the per-website directory
        os.makedirs(directory, exist_ok=True)

        # Download only if the image is not already on disk
        if not os.path.isfile(path):
            try:
                # Download the full-size image
                urllib.request.urlretrieve(img, path)
            except OSError as exc:
                raise DropItem(str(exc))
            if os.path.isfile(path):
                # Create the thumbnail (overwrites the file at the same path)
                self.optimize_image(path)

        item['image'] = item['website'] + '/' + key + '.' + ext
        return item

    # Slugify the title so it is safe to use as a file name
    def remove_accents(self, input_str):
        try:
            return slugify(input_str)
        except Exception as exc:
            raise DropItem(str(exc))

    # Shrink the image in place so it fits within 200x200 pixels
    def optimize_image(self, path):
        try:
            image = Image.open(path)
            # Image.LANCZOS is the current name of the filter formerly called ANTIALIAS
            image.thumbnail((200, 200), Image.LANCZOS)
            image.save(path, optimize=True, quality=85)
        except Exception as exc:
            raise DropItem(str(exc))

But sometimes, not regularly (about one item in 100, I think), I get this error:

cannot identify image file '/PATH/NAME.jpg'

It is raised in the optimize_image function. When I check on disk, the image does exist.
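
Pillow raises this error when it cannot recognise the file's format, so a file existing on disk does not by itself mean it holds image data. A minimal sketch for inspecting what was actually downloaded, assuming the file might be empty or contain something other than an image (for example an HTML error page). `sniff_image_type` and `describe_file` are hypothetical helper names, not part of Scrapy or Pillow, and only three formats are checked:

```python
import os
from typing import Optional

# Leading "magic" bytes of a few common image formats (not an exhaustive list).
MAGIC_PREFIXES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return a format name if data starts like a known image, else None."""
    for prefix, name in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return name
    return None


def describe_file(path: str) -> str:
    """Summarise a downloaded file: its size and what its first bytes look like."""
    size = os.path.getsize(path)
    if size == 0:
        return "empty file"
    with open(path, "rb") as handle:
        head = handle.read(16)
    kind = sniff_image_type(head)
    if kind is None:
        return "not a known image format (%d bytes, starts with %r)" % (size, head)
    return "%s, %d bytes" % (kind, size)
```

Calling `describe_file` on the path from the error message shows whether the file is empty or starts with unexpected bytes. A truncated download can still begin with valid magic bytes, so a passing check does not prove the file is complete.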

I really don't understand.

If you have any suggestions, please share them.

Thanks in advance.

1 solution

#1


I'm not sure, but it seems to be resolved with:

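import io
import requests
...
# Download the image into memory instead of writing it to disk first
response = requests.get(img)
image = Image.open(io.BytesIO(response.content))
# Image.LANCZOS is the current name of the filter formerly called ANTIALIAS
image.thumbnail((200, 200), Image.LANCZOS)
image.save(path, optimize=True, quality=85)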

I'm continuing to test.
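
The in-memory approach above also makes it straightforward to keep the full-size image when thumbnailing fails, which was the goal stated in the question. A rough sketch under assumptions: `store_image` and `make_thumbnail` are hypothetical names, the image bytes have already been downloaded (for example `response.content`), and the thumbnail is saved in whatever format Pillow detected when opening the data:

```python
import io
from typing import Callable


def make_thumbnail(data: bytes, size=(200, 200)) -> bytes:
    """Return thumbnail bytes; raises if Pillow cannot read the data."""
    # Imported here so the fallback logic below can be tried without Pillow installed.
    from PIL import Image

    image = Image.open(io.BytesIO(data))
    image.thumbnail(size, Image.LANCZOS)
    out = io.BytesIO()
    # Saving to a buffer needs an explicit format; reuse the detected one.
    image.save(out, format=image.format, optimize=True, quality=85)
    return out.getvalue()


def store_image(
    data: bytes,
    path: str,
    thumbnailer: Callable[[bytes], bytes] = make_thumbnail,
) -> bool:
    """Write a thumbnail to path, or the original bytes if thumbnailing fails.

    Returns True when a thumbnail was written, False when falling back.
    """
    try:
        payload, is_thumbnail = thumbnailer(data), True
    except Exception:
        # Keep the "big" version rather than losing the image entirely.
        payload, is_thumbnail = data, False
    with open(path, "wb") as handle:
        handle.write(payload)
    return is_thumbnail
```

In the pipeline this could be called as `store_image(response.content, path)`. The boolean return value lets the caller log which items kept the original image.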
