How to read a UTF-8 encoded text file with Python

Date: 2023-01-05 21:17:13

I need to analyse a text file in Tamil (UTF-8 encoded). I'm using the nltk package for Python in IDLE. When I try to read the text file in the shell, I get the error below. How do I avoid it?

corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
  File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>

1 answer

#1


7  

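Since you are using Python 3, just add the encoding parameter to open(). Without it, open() falls back to the platform's default encoding, which is cp1252 here (as the traceback shows), and that codec cannot decode the UTF-8 bytes of the Tamil text: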

corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt',
              encoding='utf-8').read()
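
For what it's worth, here is a slightly fuller sketch of the same idea that uses a with-block so the file is closed after reading. The helper name read_utf8, the sample string, and the temporary file are made up for the demo and are not part of the question; in your case you would pass the path to ettuthokai.txt instead.

```python
import os
import tempfile


def read_utf8(path):
    """Return the full contents of a UTF-8 encoded text file as a str."""
    # The with-block closes the file even if read() raises.
    with open(path, encoding='utf-8') as f:
        return f.read()


# Demo only: write a short Tamil string to a throwaway file, then read it back.
sample = 'தமிழ்'
fd, tmp_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(tmp_path, 'w', encoding='utf-8') as f:
    f.write(sample)

corpus = read_utf8(tmp_path)
os.remove(tmp_path)

# Compare instead of printing the text, since some consoles cannot display Tamil.
print(corpus == sample)
```

If the file was saved with a byte order mark (some Windows editors add one), encoding='utf-8-sig' will probably be the better choice, since it strips the BOM on read.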
