UnicodeDecodeError: ('utf-8' codec) when reading a CSV file [duplicate]

Time: 2021-12-28 20:38:38

This question already has an answer here:

I am trying to read a CSV into a DataFrame, change a column, write the changed values back to the same CSV with to_csv, and then read that CSV again into another DataFrame. On that last read I get an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte

My code is:

 import pandas as pd
 df = pd.read_csv("D:\ss.csv")
 df.columns  # output: Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
 df['True'] = df['True'] + 2  # change one column of type float
 df.to_csv("D:\ss.csv")  # write the updated data back to the same .csv
 df1 = pd.read_csv("D:\ss.csv")  # try to read that .csv again

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte

Please suggest how I can avoid the error and read that CSV back into a DataFrame.

I know that somewhere I am missing something like "encode = some codec type" or "decode = some type" when reading and writing the CSV, but I don't know exactly what should be changed, so I need help.

6 Solutions

#1


28  

Known encoding

If you know the encoding of the file you want to read, pass it to read_csv:

pd.read_csv('filename.txt', encoding='encoding')

These are the possible encodings: https://docs.python.org/3/library/codecs.html#standard-encodings

Unknown encoding

If you do not know the encoding, you can try chardet. It is not guaranteed to work, because its result is only a guess.

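import chardet
import pandas as pd

# Read the raw bytes and let chardet guess the encoding
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large

# result['encoding'] can be None when chardet cannot make a guess
df = pd.read_csv('filename.csv', encoding=result['encoding'])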
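
If chardet is not installed, a cruder fallback is to try a short list of candidate encodings in order and keep the first one that decodes without error. The sketch below is not part of chardet or pandas; the function name first_decodable_encoding and its candidate list are hypothetical, chosen only for illustration. latin-1 has to come last because it accepts every byte sequence and so never fails.

```python
def first_decodable_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw_bytes without error.

    Hypothetical helper: a successful decode does not prove the guess is
    right, it only rules out encodings that are definitely wrong.
    """
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot be the encoding, try the next one
        return name
    return None


# Bytes as they would appear in a file saved as Windows-1252
sample = "façade".encode("cp1252")
guess = first_decodable_encoding(sample)
print(guess)
```

The returned name can then be passed as the encoding argument of read_csv. Check the decoded text by eye afterwards, since several single-byte encodings will accept the same bytes.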

#2


7  

One simple solution is to open the CSV file in an editor such as Sublime Text and save it with 'utf-8' encoding. After that, pandas can read the file without trouble.

#3


5  

Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.

Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
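
To see why that byte trips the UTF-8 decoder, here is a small sketch that needs no file at all. The word "façade" is made-up sample data, not taken from the asker's CSV.

```python
# 0xE7 is a complete single-byte character in both Latin-1 and Windows-1252.
as_cp1252 = b"\xe7".decode("cp1252")
as_latin1 = b"\xe7".decode("latin-1")
print(as_cp1252, as_latin1)

# In UTF-8, 0xE7 announces a three-byte sequence, so the ordinary ASCII
# letter that follows it is rejected as an invalid continuation byte.
try:
    b"fa\xe7ade".decode("utf-8")
    message = ""
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)
```

The message printed by the second part has the same shape as the one in the question; only the byte position differs.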

Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:

df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")

Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
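
The difference between the two spellings is easy to check in the interpreter. This is only an illustration; the path is the hypothetical new-ss.csv name from the paragraph above.

```python
plain = "D:\new-ss.csv"   # \n is interpreted as a newline character
raw = r"D:\new-ss.csv"    # raw string: the backslash and the n are kept as typed

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))  # the plain string is one character shorter
```

Forward slashes ("D:/new-ss.csv") avoid the problem as well, since Windows accepts them in paths.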

Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
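
Putting the pieces of this answer together, the whole round trip from the question might look like the sketch below. It assumes the original file really is cp1252, which is still only a guess, and it writes the file back as UTF-8 so that the second read needs no special handling. The function name roundtrip, the temporary demo file and its single data row are made up for illustration; index=False is an addition that stops to_csv from writing the row index as an extra column.

```python
import os
import tempfile

import pandas as pd


def roundtrip(path, source_encoding="cp1252"):
    """Read a CSV in source_encoding, update a column, rewrite it as UTF-8, re-read it."""
    df = pd.read_csv(path, encoding=source_encoding)
    df["True"] = df["True"] + 2
    # Write UTF-8 explicitly and leave out the index column
    df.to_csv(path, index=False, encoding="utf-8")
    return pd.read_csv(path, encoding="utf-8")


# Demo on a throwaway file with made-up data, saved as cp1252 on purpose
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "ss.csv")
with open(demo_path, "w", encoding="cp1252", newline="") as f:
    f.write("CUSTOMER_MAILID,False,True\nfrançois@example.com,1.0,2.5\n")

result = roundtrip(demo_path)
print(result)
```

After the first run the file on disk is UTF-8, so any later read should use encoding="utf-8" (the pandas default) rather than cp1252.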

#4


4  

Yes, you'll get this error. I worked around it by opening the CSV file in Notepad++ and changing the encoding through the Encoding menu -> Convert to UTF-8, then saving the file and running the Python program on it again.

Another solution is to use Python's codecs module to encode and decode files. I haven't used that.
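
The editor conversion can also be scripted. The sketch below uses the built-in open() with its encoding argument rather than the codecs module mentioned above, and it assumes the source file is cp1252; the function name reencode_file and the demo file names are hypothetical.

```python
import os
import tempfile


def reencode_file(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Read src_path as src_encoding and write the same text to dst_path as dst_encoding."""
    # newline="" leaves the original line endings untouched on both sides
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


# Demo on a throwaway file with made-up content
tmp_dir = tempfile.mkdtemp()
legacy_path = os.path.join(tmp_dir, "legacy.csv")
utf8_path = os.path.join(tmp_dir, "utf8.csv")
with open(legacy_path, "wb") as f:
    f.write("name\nfaçade\n".encode("cp1252"))

reencode_file(legacy_path, utf8_path)
with open(utf8_path, "rb") as f:
    converted = f.read()
print(converted)
```

Writing to a separate destination file keeps the original intact in case the assumed source encoding turns out to be wrong.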

#5


3  

The method above, which imports chardet and uses it to detect the file's encoding, works:

import pandas as pd
import chardet

# Guess the encoding from the raw bytes, then pass the guess to pandas
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large

df = pd.read_csv('filename.csv', encoding=result['encoding'])

#6


1  

I am new to Python. I ran into this exact issue when I manually changed the extension of my Excel file to .csv and tried to read it with read_csv. However, when I opened the Excel file and saved it as a CSV file instead, it seemed to work.
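
Renaming a workbook does not change its contents: an .xlsx file is a ZIP archive and a legacy .xls file is an OLE2 container, so neither is text that read_csv can decode. A quick way to spot this case is to look at the first bytes of the file. The helper name looks_like_excel_workbook and the demo files below are hypothetical, and the check is only a heuristic because any ZIP archive starts with the same signature.

```python
import os
import tempfile

XLSX_MAGIC = b"PK\x03\x04"                       # ZIP container, used by .xlsx
XLS_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 container, used by legacy .xls


def looks_like_excel_workbook(path):
    """Return True if the file starts with a ZIP or OLE2 signature."""
    with open(path, "rb") as f:
        header = f.read(8)
    return header.startswith(XLSX_MAGIC) or header.startswith(XLS_MAGIC)


# Demo: a fake "renamed workbook" next to a real text CSV
tmp_dir = tempfile.mkdtemp()
renamed_path = os.path.join(tmp_dir, "renamed.csv")
text_path = os.path.join(tmp_dir, "real.csv")
with open(renamed_path, "wb") as f:
    f.write(XLSX_MAGIC + b"\x14\x00\x06\x00")  # first bytes only, not a valid workbook
with open(text_path, "w", encoding="utf-8") as f:
    f.write("a,b\n1,2\n")

print(looks_like_excel_workbook(renamed_path), looks_like_excel_workbook(text_path))
```

If the check returns True, read the file with pandas.read_excel under its original extension, or export it to CSV from Excel as described above.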
