UnicodeDecodeError: 'utf-8' codec can't decode byte

Date: 2023-01-04 20:53:45

I'm trying to get a response from urllib and decode it into a readable format. The text is in Hebrew and also contains characters such as { and /.

The coding declaration at the top of my source file is:

# -*- coding: utf-8 -*-

The raw bytes are:

b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'

Now I'm trying to decode it using:

 data = data.decode()

and I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

2 answers

#1


15  

Your problem is that the data is not UTF-8. It is UTF-16 encoded (the leading b'\xff\xfe' is the UTF-16 little-endian byte order mark), so decode it as such:

>>> data = b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'
>>> data.decode('utf16')
'{ \r\n"id" : "1404830064696",\r\n"title" : "פיקוד העורף התרעה במרחב ",\r\n"data" : []\r\n}\r\n\r\n'
>>> import json
>>> json.loads(data.decode('utf16'))
{'title': 'פיקוד העורף התרעה במרחב ', 'id': '1404830064696', 'data': []}
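
When the encoding is not known in advance, the byte order mark can be checked before decoding. The following is a minimal sketch of that idea, not part of the original answer; the helper name sniff_bom and the sample string are made up for illustration, and data without a BOM simply falls back to the given default.

```python
import codecs

# UTF-32 is checked first because the UTF-32 LE BOM (ff fe 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw, default="utf-8"):
    """Return a codec name based on a leading BOM, or `default` if none is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


# Hypothetical sample: encoding with 'utf-16' prepends a BOM.
sample = '{"title": "פיקוד העורף"}'.encode("utf-16")
codec = sniff_bom(sample)
text = sample.decode(codec)  # the 'utf-16' codec strips the BOM while decoding
```

The codec names returned here ('utf-16', 'utf-32', 'utf-8-sig') are the BOM-aware variants, so the mark itself does not end up in the decoded string.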

If you loaded this from a website with urllib.request, the Content-Type header should contain a charset parameter telling you this; if response is the returned urllib.request response object, then use:

codec = response.info().get_content_charset('utf-8')

This defaults to UTF-8 when no charset parameter has been set, which is the appropriate default for JSON data.
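
As a rough sketch of how that codec lookup might be combined with the decode step: the helper name decode_body is hypothetical, and an email.message.Message object stands in for response.info() so that the example runs without a network connection (the headers object returned by urllib.request is a subclass of that class).

```python
from email.message import Message


def decode_body(raw, headers, default="utf-8"):
    """Decode raw bytes with the charset named in the Content-Type header."""
    codec = headers.get_content_charset(default)
    return raw.decode(codec)


# Stand-in for response.info(); a real response supplies its own headers.
headers = Message()
headers["Content-Type"] = "application/json; charset=utf-16"

raw = '{"id": "1404830064696"}'.encode("utf-16")
text = decode_body(raw, headers)
```

With a real response object the call would look like decode_body(response.read(), response.info()). If the server sends no charset parameter, the helper falls back to UTF-8, which will still fail on data like the above; in that case the charset has to be determined some other way.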

Alternatively, use the requests library to load the JSON response; it handles decoding automatically (including UTF-codec autodetection specific to JSON responses).
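
The standard library offers a comparable shortcut. As a minimal sketch, assuming Python 3.6 or later, json.loads accepts bytes directly and works out whether they are UTF-8, UTF-16 or UTF-32 on its own. This illustrates the same idea with the json module; it is not the code that requests uses, and the sample payload is made up.

```python
import json

# Hypothetical payload, encoded as UTF-16 with a BOM like the data in the question.
raw = '{"id": "1404830064696", "data": []}'.encode("utf-16")

# No explicit decode step: json.loads detects the UTF variant from the bytes.
parsed = json.loads(raw)
```

This only covers JSON payloads in one of the UTF encodings; for anything else the bytes still need an explicit decode with the right codec.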

One further note: the PEP 263 source code codec comment is used only to interpret your source code, including string literals. It has nothing to do with encodings of external sources (files, network data, etc.).

#2


0  

I got this error in Django with Python 3.4. I was trying to get this to work with django-rest-framework.

This is the code that fixed the UnicodeDecodeError: 'utf-8' codec can't decode byte error for me.

This is the passing test:

import os
from os.path import join, dirname
import uuid
from rest_framework.test import APITestCase

class AttachmentTests(APITestCase):

    def setUp(self):
        self.base_dir = dirname(dirname(dirname(__file__)))

        self.image = join(self.base_dir, "source/test_in/aaron.jpeg")
        self.image_filename = os.path.split(self.image)[1]

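    def test_create_image(self):
        # Random UUID used as the attachment's id field.
        attachment_id = str(uuid.uuid4())
        # Open the image in binary mode and pass the file object itself in the POST data.
        with open(self.image, 'rb') as data:
            # data = data.read()
            post_data = {
                'id': attachment_id,
                'filename': self.image_filename,
                'file': data
            }

            response = self.client.post("/api/admin/attachments/", post_data)

            # 201 Created means the upload was accepted.
            self.assertEqual(response.status_code, 201)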
