UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

Date: 2023-01-04 20:48:57

I am trying to read Twitter data from a JSON file using Python 2.7.12.

The code I used is:

    import json
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    def get_tweets_from_file(file_name):
        tweets = []
        with open(file_name, 'rw') as twitter_file:
            for line in twitter_file:
                if line != '\r\n':
                    line = line.encode('ascii', 'ignore')
                    tweet = json.loads(line)
                    if u'info' not in tweet.keys():
                        tweets.append(tweet)
        return tweets

The result I got:

    Traceback (most recent call last):
      File "twitter_project.py", line 100, in <module>
        main()                  
      File "twitter_project.py", line 95, in main
        tweets = get_tweets_from_dir(src_dir, dest_dir)
      File "twitter_project.py", line 59, in get_tweets_from_dir
        new_tweets = get_tweets_from_file(file_name)
      File "twitter_project.py", line 71, in get_tweets_from_file
        line = line.encode('ascii', 'ignore')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I went through all the answers to similar questions and came up with this code, and it worked last time. I have no clue why it isn't working now. I would appreciate any help!

3 Answers

#1


11  

It doesn't help that you have sys.setdefaultencoding('utf-8'), which confuses things further. It's a nasty hack and you should remove it from your code. See https://*.com/a/34378962/1554386 for more information.

The error is happening because line is a byte string (str) and you're calling encode() on it. encode() only makes sense on a Unicode string, so Python first tries to convert the byte string to Unicode using the default encoding, which in your case is UTF-8 but should be ASCII. Either way, 0x80 is neither a valid ASCII byte nor a valid UTF-8 start byte, so the conversion fails.
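To see the mechanism in isolation, here is a small sketch that spells out the hidden step. In Python 2, calling encode() on a byte string behaves roughly like decoding it with the default encoding first and then encoding the result. The sample bytes and the helper name `implicit_encode` are made up for illustration; the explicit form also runs on Python 3, where bytes no longer have an encode() method.

```python
# Hypothetical sample line: JSON text containing a stray 0x80 byte.
raw_line = b'{"text": "price: \x80 5"}'


def implicit_encode(byte_string, default_encoding):
    """Spell out what Python 2 does for str.encode('ascii', 'ignore'):
    decode with the default encoding first, then encode."""
    return byte_string.decode(default_encoding).encode('ascii', 'ignore')


failed_with = []
for default_encoding in ('ascii', 'utf-8'):
    try:
        implicit_encode(raw_line, default_encoding)
    except UnicodeDecodeError as exc:
        # The failure comes from the hidden decode step, not from encode().
        failed_with.append(default_encoding)
        print('%s: %s' % (default_encoding, exc))
```

Both default encodings fail on the same byte, which is why the 'ignore' argument never gets a chance to help: it only applies to the encode step.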

0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
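A quick way to check how a suspicious byte behaves under different encodings is to try decoding it with each one. This is only a sketch; the candidate list is an example, not a claim about what your file actually uses.

```python
# The single byte from the error message.
problem_byte = b'\x80'

# A few encodings worth trying; this list is only an example.
candidates = ['ascii', 'utf-8', 'windows-1252']

decoded = {}
for name in candidates:
    try:
        decoded[name] = problem_byte.decode(name)
    except UnicodeDecodeError:
        decoded[name] = None  # this encoding cannot decode the byte

for name in candidates:
    print('%-13s -> %r' % (name, decoded[name]))
```

A successful decode does not prove the guess is right, it only shows the byte is legal in that encoding, so it is still worth finding out how the data was produced.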

The trick here is to know the encoding of your data at every point in your code. At the moment, you're leaving too much to chance. Unicode strings are a handy Python feature: you decode the encoded byte strings once, then forget about the encoding until you need to write or transmit the data.
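As a rough illustration of that "decode early, encode late" idea, the sketch below uses made-up input bytes and assumes they are windows-1252; the output encoding (UTF-8) is likewise just an example.

```python
# Hypothetical input bytes as they might arrive from a windows-1252 file.
incoming = b'Price: \x80 5'

# 1. Decode once, at the boundary where the data enters the program.
text = incoming.decode('windows-1252')

# 2. Work with Unicode text only; no encoding concerns in this part.
shouted = text.upper()

# 3. Encode once, at the boundary where the data leaves the program.
outgoing = shouted.encode('utf-8')

print(repr(text))
print(repr(outgoing))
```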

Use the io module to open the file in text mode and decode the file as it is read - no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252.

    import io
    import json

    with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
        for line in twitter_file:
            # line is now a <type 'unicode'>
            tweet = json.loads(line)

The io module also provides universal newlines. This means \r\n is detected as a newline, so you don't have to watch for it.
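Putting the pieces together, the function from the question could look something like the sketch below. The windows-1252 default follows the suggestion above and may not match your files. Skipping blank lines with strip() is an addition of mine (with universal newlines an empty line arrives as '\n', which json.loads cannot parse), and the sample file contents are made up.

```python
import io
import json
import os
import tempfile


def get_tweets_from_file(file_name, encoding='windows-1252'):
    """Read one JSON object per line, skipping blank lines and 'info' records."""
    tweets = []
    with io.open(file_name, 'r', encoding=encoding) as twitter_file:
        for line in twitter_file:
            line = line.strip()
            if not line:
                continue  # blank line, e.g. an empty '\r\n' separator
            tweet = json.loads(line)
            if u'info' not in tweet:
                tweets.append(tweet)
    return tweets


# Demo with a made-up windows-1252 file containing a 0x80 byte.
sample = b'{"text": "price: \x80 5"}\r\n\r\n{"info": "limit notice"}\r\n'
handle, path = tempfile.mkstemp(suffix='.json')
try:
    with os.fdopen(handle, 'wb') as out:
        out.write(sample)
    tweets = get_tweets_from_file(path)
finally:
    os.remove(path)

print(tweets)
```

Note that there is no sys.setdefaultencoding() call and no encode() on the line: the file object hands back Unicode text directly.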

#2


15  

In my case (macOS), there was a .DS_Store file in my data folder. It is a hidden, auto-generated file, and it caused the issue. I was able to fix the problem after removing it.
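Instead of deleting the file by hand every time, the directory loop can simply skip anything that is not a data file. The sketch below is one way to do that; the helper name `list_data_files` and the `.json` extension are assumptions, since the question does not show how get_tweets_from_dir picks its files.

```python
import os
import shutil
import tempfile


def list_data_files(src_dir, extension='.json'):
    """Return paths of regular, non-hidden files in src_dir with the given extension."""
    paths = []
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if name.startswith('.'):
            continue  # hidden files such as .DS_Store
        if not name.endswith(extension) or not os.path.isfile(path):
            continue
        paths.append(path)
    return paths


# Demo: a made-up folder with one data file and one .DS_Store file.
demo_dir = tempfile.mkdtemp()
try:
    for name in ('tweets.json', '.DS_Store'):
        with open(os.path.join(demo_dir, name), 'wb') as f:
            f.write(b'\x00\x80')
    found = [os.path.basename(p) for p in list_data_files(demo_dir)]
finally:
    shutil.rmtree(demo_dir)

print(found)
```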

#3


-2  

The error occurs when you try to read a tweet containing a sentence like

"@Mike http:\www.google.com \A8&^)((&() how are&^%()( you ".

This cannot be read as a regular string; you are supposed to read it as a raw string. But converting it to a raw string still gives an error, so I suggest you read the JSON file like this:

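    import codecs
    import json

    keys = []      # tweet ids
    fulldata = []  # tweet texts

    # codecs.open decodes each line from UTF-8 as it is read
    with codecs.open('tweetfile', 'rU', 'utf-8') as f:
        for line in f:
            data = json.loads(line)
            print(data["tweet"])
            keys.append(data["id"])
            fulldata.append(data["tweet"])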

This loads the data from the JSON file.

You can also write it to a CSV file using Pandas.

    import csv
    import pandas as pd

    output = pd.DataFrame(data={"tweet": fulldata, "id": keys})
    output.to_csv("tweets.csv", index=False, quoting=csv.QUOTE_ALL)

Then read from the CSV file to avoid the encoding and decoding problem.

I hope this helps you solve your problem.

Midhun
