Python 3 CSV file gives "UnicodeDecodeError: 'utf-8' codec can't decode byte" error when I print

Date: 2023-01-04 20:29:48

I have the following code in Python 3, which is meant to print out each line in a csv file.

import csv

with open('my_file.csv', 'r', newline='') as csvfile:
    lines = csv.reader(csvfile, delimiter=',', quotechar='|')
    for line in lines:
        print(' '.join(line))

But when I run it, it gives me this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte

I looked through the csv file, and it turns out that if I take out a single ñ (little n with a tilde on top), every line prints out fine.

My problem is that I've looked through a bunch of different solutions to similar problems, but I still have no idea how to fix this, what to decode/encode, etc. Simply taking out the ñ character in the data is NOT an option.

3 Answers

#1 (29 votes)

We know the file contains the byte b'\x96' since it is mentioned in the error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte

Now we can write a little script to find out if there are any encodings where b'\x96' decodes to ñ:

import pkgutil
import encodings
import os

def all_encodings():
    # Collect every encoding module shipped with Python, plus all registered aliases.
    modnames = {modname for importer, modname, ispkg in pkgutil.walk_packages(
        path=[os.path.dirname(encodings.__file__)], prefix='')}
    aliases = set(encodings.aliases.aliases.values())
    return modnames | aliases

text = b'\x96'
for enc in all_encodings():
    try:
        msg = text.decode(enc)
    except Exception:
        continue  # some entries are not decoders, or reject 0x96; skip them
    if msg == 'ñ':
        print('Decoding {t} with {enc} is {m}'.format(t=text, enc=enc, m=msg))

which yields

Decoding b'\x96' with mac_roman is ñ
Decoding b'\x96' with mac_farsi is ñ
Decoding b'\x96' with mac_croatian is ñ
Decoding b'\x96' with mac_arabic is ñ
Decoding b'\x96' with mac_romanian is ñ
Decoding b'\x96' with mac_iceland is ñ
Decoding b'\x96' with mac_turkish is ñ

Therefore, try changing

with open('my_file.csv', 'r', newline='') as csvfile:

to one of those encodings, such as:

with open('my_file.csv', 'r', encoding='mac_roman', newline='') as csvfile:
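As a quick sanity check before re-reading the whole CSV (a sketch on stand-in bytes, not your actual file), you can confirm that the candidate encoding really maps 0x96 to ñ:

```python
# Stand-in bytes containing the problem byte 0x96; in mac_roman it decodes to ñ.
raw = b'pi\x96ata'
print(raw.decode('mac_roman'))  # piñata
```

If the rest of the file then prints garbage with that encoding, try the next candidate from the list above.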

#2 (1 vote)

For others who hit the same error shown in the subject: watch out for the file encoding of your csv file. It's possible it is not utf-8. I just noticed that LibreOffice created a utf-16 encoded file for me today without prompting me, although I could not reproduce this.

If you try to open a utf-16 encoded document using open(..., encoding='utf-8'), you will get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

To fix this, either specify 'utf-16' as the encoding or change the encoding of the csv file itself.

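One way to catch this case up front (a minimal sketch; the path argument is whatever file you are inspecting) is to check the first raw bytes of the file for a byte-order mark before choosing an encoding:

```python
import codecs

def sniff_bom(path):
    # Read the first few raw bytes and compare against known byte-order marks.
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    return None  # no BOM; could still be utf-8, mac_roman, cp1252, ...
```

A returned value can be passed straight to open(path, encoding=...); a None result means the encoding has to be determined some other way, as in answer #1.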
#3 (0 votes)

with open('my_file.csv', 'r', newline='', encoding='utf-8') as csvfile:

Try opening the file as shown above, with the encoding specified explicitly.

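While diagnosing which encoding a file actually uses, the errors='backslashreplace' handler is handy: instead of raising, it renders each undecodable byte as an escape sequence so you can see exactly what is in the file (a sketch on stand-in bytes):

```python
# backslashreplace shows invalid bytes as \xNN escapes instead of raising.
raw = b'pi\x96ata'
print(raw.decode('utf-8', errors='backslashreplace'))  # pi\x96ata
```

The same errors= argument can be passed to open(); since the asker ruled out dropping the character, this is for inspection only, not a fix.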
