"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80" when loading a pickle file with pydrive in Google Colaboratory

I am new to Google Colaboratory (colab) and to using pydrive with it. I am trying to load the data in 'CAS_num_strings', which I wrote from colab to a pickle file in a specific directory on my Google Drive as follows:

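import pickle

# Serialize the data to a local pickle file (binary mode), closing it afterwards.
with open('CAS_num_strings.p', 'wb') as f:
    pickle.dump(CAS_num_strings, f)

# Upload the local file to Drive as CAS.pkl inside the given parent folder.
dump_meta = {'title': 'CAS.pkl', 'parents': [{'id': '1UEqIADV_tHic1Le0zlT25iYB7T6dBpBj'}]}
pkl_dump = drive.CreateFile(dump_meta)
pkl_dump.SetContentFile('CAS_num_strings.p')
pkl_dump.Upload()
print(pkl_dump.get('id'))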

Here 'id':'1UEqIADV_tHic1Le0zlT25iYB7T6dBpBj' places the file in the parent folder that has this id. The last print command gives me the output:

'1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'

So I am able to create and upload the pickle file, whose id is '1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'. Now I want to load this pickle file in another colab script for a different purpose. To load it, I use these commands:

cas_strings = drive.CreateFile({'id':'1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'})
print('title: %s, mimeType: %s' % (cas_strings['title'], cas_strings['mimeType']))
print('Downloaded content "{}"'.format(cas_strings.GetContentString()))

This gives me the output:

title: CAS.pkl, mimeType: text/x-pascal

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-9-a80d9de0fecf> in <module>()
     30 cas_strings = drive.CreateFile({'id':'1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'})
     31 print('title: %s, mimeType: %s' % (cas_strings['title'], cas_strings['mimeType']))
---> 32 print('Downloaded content "{}"'.format(cas_strings.GetContentString()))
     33 
     34 

/usr/local/lib/python3.6/dist-packages/pydrive/files.py in GetContentString(self, mimetype, encoding, remove_bom)
    192                     self.has_bom == remove_bom:
    193       self.FetchContent(mimetype, remove_bom)
--> 194     return self.content.getvalue().decode(encoding)
    195 
    196   def GetContentFile(self, filename, mimetype=None, remove_bom=False):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

As you can see, it finds the file CAS.pkl but cannot decode the data. I want to resolve this error. I understand that ordinary pickle dumping and loading with the 'wb' and 'rb' options works without any decoding problems. In the present case, however, I can't load the data back from the pickle file that the previous step created on Google Drive. The problem seems to be that I cannot specify how to decode the data at "return self.content.getvalue().decode(encoding)". I can't work out from here (https://developers.google.com/drive/v2/reference/files#resource-representations) which keywords/metadata tags to modify. Any help is appreciated. Thanks

2 answers

#1

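The problem is that GetContentString only works if the contents are a valid UTF-8 string (docs), and your pickle is not: a pickle is binary data, and one written with protocol 2 or later starts with the byte 0x80, which is not a valid UTF-8 start byte. That is exactly the byte the error message complains about at position 0.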
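To see why decoding fails without touching Drive at all, here is a minimal local sketch. The sample list is a hypothetical stand-in for CAS_num_strings, and the protocol is fixed at 2 only to make the first byte predictable; it is an illustration of the failure, not of pydrive's internals.

```python
import pickle

# Hypothetical stand-in for the CAS_num_strings data from the question.
sample = ["50-00-0", "64-17-5"]

# Pickle protocol 2 and later start the stream with the PROTO opcode, byte 0x80.
payload = pickle.dumps(sample, protocol=2)
first_byte = payload[0]

# Decoding the raw pickle bytes as UTF-8 is what GetContentString attempts.
try:
    payload.decode("utf-8")
    decoded_ok = True
    error_text = ""
except UnicodeDecodeError as err:
    decoded_ok = False
    error_text = str(err)

print(hex(first_byte))
print(decoded_ok)
print(error_text)
```

The bytes are still a perfectly good pickle; they just have to be read as bytes rather than decoded as text.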

Unfortunately, you'll have to do a little extra work, since there's no GetContentBytes -- you have to save the contents to a file and read them back out. Here's a working example: https://colab.research.google.com/drive/1gmh21OrJL0Dv49z28soYq_YcqKEnaQ1X
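In case the linked notebook is unavailable, the save-then-read-back idea can be sketched roughly as follows. The helper name load_pickle_from_drive is hypothetical (it is not part of pydrive), and it assumes only that the object passed in has the GetContentFile(filename) method that appears in the traceback above.

```python
import os
import pickle
import tempfile


def load_pickle_from_drive(drive_file):
    """Download a Drive file to a temporary path and unpickle it.

    `drive_file` is assumed to be a pydrive file object, i.e. anything
    exposing GetContentFile(filename). This helper is a sketch, not pydrive API.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)  # only the path is needed; the download writes to it
    try:
        drive_file.GetContentFile(tmp_path)
        with open(tmp_path, "rb") as fh:  # binary mode, no text decoding
            return pickle.load(fh)
    finally:
        os.remove(tmp_path)  # clean up the temporary copy
```

With the `drive` object from the question, usage would look something like cas_nums = load_pickle_from_drive(drive.CreateFile({'id': '1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'})). As always with pickle, only load files you trust.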

#2

Actually, I found an elegant answer with a little help from my friends. Instead of GetContentString, I use GetContentFile, which is the counterpart of SetContentFile. This downloads the file into the current workspace, from which I can read it like any other pickle file. The data then loads into cas_nums without problems.

import pickle

cas_strings = drive.CreateFile({'id':'1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'})
print('title: %s, mimeType: %s' % (cas_strings['title'], cas_strings['mimeType']))
# Download the file into the current working directory, then unpickle it.
cas_strings.GetContentFile(cas_strings['title'])
with open(cas_strings['title'], 'rb') as f:
    cas_nums = pickle.load(f)

More details can be found in the "Download file content" section of the pydrive documentation: http://pythonhosted.org/PyDrive/filemanagement.html#download-file-content
