Reading an Excel XML .xls file with pandas

I'm aware of a number of previously asked questions, but none of the solutions given work on the reproducible example that I provide below.

I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production -- specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file that's freely available via the dropdown. Pandas cannot read it.

In contrast, the file for the most recent year available (2013), coalpublic2013.xls, loads without a problem:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

But the .xls files for the preceding years (2004-2012) do not load. I have opened these files in Excel, and they are not corrupted.

The error that I get from pandas is:

---------------------------------------------------------------------------
XLRDError                                 Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
    161 
    162     if not isinstance(io, ExcelFile):
--> 163         io = ExcelFile(io, engine=engine)
    164 
    165     return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
    204                 self.book = xlrd.open_workbook(file_contents=data)
    205             else:
--> 206                 self.book = xlrd.open_workbook(io)
    207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
    208             self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    433         formatting_info=formatting_info,
    434         on_demand=on_demand,
--> 435         ragged_rows=ragged_rows,
    436         )
    437     return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
   1228             bof_error('Expected BOF record; met end of file')
   1229         if opcode not in bofcodes:
-> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1231         length = self.get2bytes()
   1232         if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
   1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1223         def bof_error(msg):
-> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1225         savpos = self._position
   1226         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

And I have tried various other things:

df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")

to no avail. My pandas version: 0.17.0

I've also submitted this as a bug to the pandas github issues list.

4 Answers

#1

You can convert this Excel XML file programmatically. Requirement: only python and pandas.

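import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    """Collect cell text from an Excel XML file, grouped by row and table."""
    def __init__(self):
        self.chars = []   # text fragments of the cell being read
        self.cells = []   # cells of the row being read
        self.rows = []    # rows of the table being read
        self.tables = []  # all completed tables
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        # Start a fresh buffer whenever a new cell, row, or table opens.
        if name == "Cell":
            self.chars = []
        elif name == "Row":
            self.cells = []
        elif name == "Table":
            self.rows = []
    def endElement(self, name):
        # Store the finished unit in its parent container.
        if name == "Cell":
            self.cells.append(''.join(self.chars))
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
# In this file, row 3 of the first table holds the column names; data starts at row 4.
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])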
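
For comparison, here is a minimal sketch of the same idea using xml.etree.ElementTree from the standard library. It assumes the file follows the usual XML Spreadsheet 2003 layout (Workbook > Worksheet > Table > Row > Cell > Data in the urn:schemas-microsoft-com:office:spreadsheet namespace). The function name read_spreadsheetml_rows and the SAMPLE document are made up for illustration; they do not come from pandas or from the EIA file. Unlike the SAX handler, it pads skipped cells when a Cell carries an ss:Index attribute, but it ignores ss:Index on Row elements and merged cells.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by XML Spreadsheet 2003 documents.
SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

# Tiny illustrative document (hypothetical data, not taken from the EIA file).
SAMPLE = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">Year</Data></Cell>
    <Cell><Data ss:Type="String">State</Data></Cell>
    <Cell><Data ss:Type="String">Tons</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">2012</Data></Cell>
    <Cell ss:Index="3"><Data ss:Type="Number">100</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>"""


def read_spreadsheetml_rows(source, sheet=0):
    """Return one worksheet as a list of rows; each row is a list of str or None.

    `source` is a file name or a binary file object.
    """
    root = ET.parse(source).getroot()
    worksheet = root.findall("ss:Worksheet", NS)[sheet]
    rows = []
    for row in worksheet.iterfind("ss:Table/ss:Row", NS):
        cells = []
        for cell in row.iterfind("ss:Cell", NS):
            # ss:Index is the 1-based column of this cell; empty cells
            # before it are omitted from the file, so pad them with None.
            index = cell.get("{%s}Index" % SS)
            if index is not None:
                cells.extend([None] * (int(index) - 1 - len(cells)))
            data = cell.find("ss:Data", NS)
            cells.append("".join(data.itertext()) if data is not None else None)
        rows.append(cells)
    return rows


if __name__ == "__main__":
    for parsed_row in read_spreadsheetml_rows(io.BytesIO(SAMPLE)):
        print(parsed_row)
```

The resulting list can be passed to pandas in the same way as above, for example pd.DataFrame(rows[4:], columns=rows[3]) if the header sits in row 3 as it does in the answer's code. All values come back as strings, so numeric columns would still need converting.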

#2

The problem is that while the 2013 file is an actual binary Excel file, the 2012 file is an XML document (note the '<?xml ve' in the error message), a format that xlrd, the library pandas uses here to read .xls files, does not seem to support. I would say your best bet is to open it in Excel and save a copy as either a proper Excel file or a CSV.
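
One way to check which kind of file you actually have, regardless of its extension, is to look at its first few bytes. The sketch below is only an illustration: sniff_spreadsheet_format and sniff_file are hypothetical helper names, not part of pandas or xlrd. It relies on the well-known signatures of the OLE2 container used by binary .xls files and of ZIP archives used by .xlsx files, and treats a leading "<?xml" as an XML document.

```python
# Hypothetical helpers (not part of pandas or xlrd): guess a spreadsheet
# file's real format from its leading bytes instead of its extension.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # binary .xls (OLE2 container)
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx (ZIP archive)
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(head):
    """Classify leading bytes as 'xls', 'xlsx', 'xml', or 'unknown'."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Drop an optional UTF-8 byte-order mark and leading whitespace.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"


def sniff_file(path):
    """Read the first bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_spreadsheet_format(f.read(64))
```

Given the error message in the question, sniff_file("coalpublic2012.xls") would be expected to report "xml", which tells you to use an XML parser or to convert the file rather than pass it to pd.read_excel.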

#3

You can convert this Excel XML file programmatically. Requirement: Windows, Office installed.

1. Create the script ExcelToCsv.vbs in Notepad:

if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit

2. Convert the Excel XML file to CSV:

$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1

3. Open the CSV file with pandas:

>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)

Reference: Faster way to read Excel files to pandas dataframe

#4

@JBWhitmore I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file without raising any error. However, it returns the data exactly as it appears in the file, so you may need to put in extra effort to process it after reading.
