Parsing XLS and XLSX (MS Excel) files with Ruby?

Are there any gems able to parse XLS and XLSX files? I've found Spreadsheet and ParseExcel, but neither understands the XLSX format :( Any ideas?

Thank you.

10 Solutions

#1

Just found roo, that might do the job - works for my requirements, reading a basic spreadsheet.

#2

I recently needed to parse some Excel files with Ruby. The abundance of libraries and options turned out to be confusing, so I wrote a blog post about it.

Here is a table of different Ruby libraries and what they support:

If you care about performance, here is how the xlsx libraries compare:

I have sample code to read xlsx files with each supported library here

Here are some examples for reading xlsx files with some different libraries:

rubyXL

require 'rubyXL'

workbook = RubyXL::Parser.parse './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.sheet_name}"
  num_rows = 0
  worksheet.each do |row|
    # empty rows and empty cells may come back as nil, so guard both
    row_cells = row ? row.cells.map { |cell| cell && cell.value } : []
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

roo

require 'roo'

workbook = Roo::Spreadsheet.open './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet}"
  num_rows = 0
  workbook.sheet(worksheet).each_row_streaming do |row|
    row_cells = row.map { |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows" 
end

creek

require 'creek'

workbook = Creek::Book.new './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.values
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

simple_xlsx_reader

require 'simple_xlsx_reader'

workbook = SimpleXlsxReader.open './sample_excel_files/xlsx_500000_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

Here is an example of reading a legacy xls file using the spreadsheet library:

spreadsheet

require 'spreadsheet'

# Note: spreadsheet only supports .xls files (not .xlsx)
workbook = Spreadsheet.open './sample_excel_files/xls_500_rows.xls'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.to_a.map { |v| v.respond_to?(:value) ? v.value : v }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

#3

The roo gem works great for Excel (.xls and .xlsx) and it's being actively developed.

I agree the syntax is neither great nor Ruby-like, but a friendlier interface is easy to build with a small wrapper, something like:

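require 'roo'

class Spreadsheet
  def initialize(file_path)
    @xls = Roo::Spreadsheet.open(file_path)
  end

  # Yields each sheet name after making that sheet the default one
  def each_sheet
    @xls.sheets.each do |sheet|
      @xls.default_sheet = sheet
      yield sheet
    end
  end

  # roo rows and columns are 1-based, so start at first_row / first_column
  def each_row
    @xls.first_row.upto(@xls.last_row) do |index|
      yield @xls.row(index)
    end
  end

  def each_column
    @xls.first_column.upto(@xls.last_column) do |index|
      yield @xls.column(index)
    end
  end
end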

#4

I'm using creek which uses nokogiri. It is fast. Used 8.3 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is a hash of row and column name to cell content: {"A1"=>"a cell", "B1"=>"another cell"} The hash makes no guarantee that the keys will be in the original column order. https://github.com/pythonicrubyist/creek
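
If column order matters, the cell references in the keys can be used to sort a row back into order. This is only a sketch: `column_index` and `ordered_values` are helper names made up for illustration (not part of creek), and it assumes the keys look like the "A1"/"B1" references shown above.

```ruby
# Convert a column label such as "A", "Z" or "AB" to a 1-based index.
def column_index(letters)
  letters.each_char.reduce(0) { |sum, ch| sum * 26 + (ch.ord - 'A'.ord + 1) }
end

# Return the values of one row hash (keys like "A1") in column order.
def ordered_values(row)
  row.sort_by { |ref, _value| column_index(ref[/\A[A-Z]+/]) }
     .map { |_ref, value| value }
end

# Hypothetical row, deliberately out of order
row = { "B1" => "another cell", "AA1" => "far right", "A1" => "a cell" }
p ordered_values(row)
```

Note that this only reorders the cells that are present; columns missing from the hash are not padded with nil.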

dullard is another great one that uses nokogiri. It is super fast. Used 6.7 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 2.0.0+. The output format for each row is an array: ["a cell", "another cell"] https://github.com/thirtyseven/dullard

simple_xlsx_reader which has been mentioned is great, a bit slow. Used 91 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is an array: ["a cell", "another cell"] https://github.com/woahdae/simple_xlsx_reader

Another interesting one is oxcelix. It uses ox's SAX parser, which is supposedly faster than both nokogiri's DOM and SAX parsers. It supposedly outputs a Matrix. I could not get it to work. Also, there were some dependency issues with rubyzip. Would not recommend it.

In conclusion, if using a ruby version lower than 2.0.0, use creek. If using ruby 2.0.0+, use dullard because it's faster and retains the table column order.
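
If you want to check timings like these on your own files, Ruby's standard `Benchmark` module is enough. A rough sketch: `timed` is a helper name invented here, and the block below just builds an in-memory 21x11250 table as a stand-in; swap in the actual parsing call for whichever gem you are testing.

```ruby
require 'benchmark'

# Run the block once and return [result, elapsed_seconds].
def timed
  result = nil
  seconds = Benchmark.realtime { result = yield }
  [result, seconds]
end

# Stand-in workload: replace the block body with your parsing code.
rows, seconds = timed { (1..11_250).map { |i| Array.new(21) { |j| i * j } } }
puts format('Built %d rows in %.3f s', rows.size, seconds)
```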

#5

If you're looking for more modern libraries, take a look at Spreadsheet: http://spreadsheet.rubyforge.org/GUIDE_txt.html. I can't tell if it supports XLSX files, but considering that it is actively developed, I'm guessing it does (I'm not on Windows, or with Office, so I can't test).

At this point, it looks like roo is a good option again. It supports XLSX, allows (some) iteration by just using times with cell access. I admit, it's not pretty though.
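
To make the "iteration by cell access" idea concrete, here is a minimal sketch. It only relies on `first_row`, `last_row`, `first_column`, `last_column` and `cell(row, col)`, which roo sheets provide; `each_row_by_cell` is a name made up for this example, the file path in the usage comment is a placeholder, and treating a nil `first_row` as "empty sheet" is an assumption.

```ruby
# Yield each row of a sheet as an array of cell values, using only
# first_row/last_row/first_column/last_column and cell(row, col).
def each_row_by_cell(sheet)
  return enum_for(:each_row_by_cell, sheet) unless block_given?
  return if sheet.first_row.nil? # assumed: nil when the sheet is empty

  sheet.first_row.upto(sheet.last_row) do |r|
    values = sheet.first_column.upto(sheet.last_column).map { |c| sheet.cell(r, c) }
    yield values
  end
end

# Usage with roo (needs the gem and a real file; the path is a placeholder):
#   sheet = Roo::Spreadsheet.open('./sample.xlsx')
#   each_row_by_cell(sheet) { |row| p row }
```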

Also, RubyXL can now give you a sort of iteration using their extract_data method, which gives you a 2d array of data, which can be easily iterated over.
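
For example, once you have that 2D array you can treat the first row as headers and turn the rest into hashes. A small sketch in plain Ruby: the `data` literal stands in for what `worksheet.extract_data` would return, `rows_to_hashes` is a name invented here, and dropping nil entries assumes blank rows can show up as nil.

```ruby
# Turn a 2D array (first row = headers) into an array of hashes.
def rows_to_hashes(data)
  rows = data.compact # drop nil entries (assumed to be blank rows)
  headers, *records = rows
  return [] if headers.nil?

  records.map { |cells| headers.zip(cells).to_h }
end

# Hypothetical data standing in for worksheet.extract_data
data = [
  ["name", "qty"],
  ["apple", 3],
  nil,
  ["pear", 5]
]
rows_to_hashes(data).each { |record| puts record.inspect }
```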

Alternatively, if you're trying to work with XLSX files on Windows, you can use Ruby's Win32OLE library that allows you to interface with OLE objects, like the ones provided by Word and Excel. However, as @PanagiotisKanavos mentioned in the comments, this has a few major drawbacks:

Excel must be installed
A new Excel instance is started for each document
Memory and other resource consumption is far more than what is necessary for simple XLSX document manipulation.

But if you choose to use it, you can choose not to display Excel, load your XLSX file, and access it through it. I'm not sure if it supports iteration, however, I don't think it would be too hard to build around the supplied methods, as it is the full Microsoft OLE API for Excel. Here's the documentation: http://support.microsoft.com/kb/222101 Here's the gem: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/win32ole/rdoc/WIN32OLE.html

Again, the options don't look much better, but there isn't much else out there, I'm afraid. It's hard to parse a file format that is a black box, and the few who managed to crack it didn't do so very visibly. Google Docs is closed source, and LibreOffice is thousands of lines of hairy C++.

#6

The rubyXL gem parses XLSX files beautifully.

#7

I've been working heavily with both Spreadsheet and rubyXL these past couple of weeks and I must say that both are great tools. However, one area where both suffer is the lack of examples that actually implement anything useful. Currently I'm building a crawler, using rubyXL to parse xlsx files and Spreadsheet for anything xls. I hope the code below can serve as a helpful example and show just how effective these tools can be.

require 'find'
require 'rubyXL'

count = 0

Find.find('/Users/Anconia/crawler/') do |file|             # begin iteration of each file of a specified directory
  next unless file =~ /\.xlsx\z/                           # skip files that do not end in .xlsx
  worksheets = RubyXL::Parser.parse(file).worksheets       # array of all worksheets of an excel workbook
  # extract_data returns nested arrays - must be converted to a string in order to match a regex
  if worksheets.any? { |worksheet| worksheet.extract_data.to_s =~ /regex/ }
    puts file                                              # report and count each matching file only once
    count += 1
  end
end

puts "#{count} files were found"

require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'

count = 0

Find.find('/Users/Anconia/crawler/') do |file|             # begin iteration of each file of a specified directory
  next unless file =~ /\.xls\z/                            # skip files that do not end in .xls
  worksheets = Spreadsheet.open(file).worksheets           # array of all worksheets of an excel workbook
  # rows must be converted to strings in order to match the regex
  if worksheets.any? { |worksheet| worksheet.any? { |row| row.to_s =~ /regex/ } }
    puts file                                              # report and count each matching file only once
    count += 1
  end
end

puts "#{count} files were found"
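
The two scripts differ mainly in the extension check and the library used, so a crawler that handles both formats could pick the parser from the file extension. A minimal sketch using only the standard library: the symbols `:rubyXL` and `:spreadsheet` are just labels chosen here, and `parser_for` is a made-up helper name.

```ruby
# Map file extensions to the library used for them in this answer.
PARSER_FOR_EXTENSION = {
  '.xlsx' => :rubyXL,
  '.xls'  => :spreadsheet
}.freeze

# Return the parser label for a path, or nil if it is not an Excel file.
def parser_for(path)
  PARSER_FOR_EXTENSION[File.extname(path).downcase]
end

# Hypothetical file names
%w[report.xlsx legacy.XLS notes.txt].each do |path|
  puts "#{path}: #{parser_for(path).inspect}"
end
```

Using `File.extname` instead of a regex also handles upper-case extensions once the result is downcased.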

#8

I couldn't find a satisfactory xlsx parser. RubyXL doesn't do date typecasting, Roo tried to typecast a number as a date, and both are a mess in terms of API and code.

So, I wrote simple_xlsx_reader. You'd have to use something else for xls, though, so maybe it's not the full answer you're looking for.

#9

Most of the online examples including the author's website for the Spreadsheet gem demonstrate reading the entire contents of an Excel file into RAM. That's fine if your spreadsheet is small.

xls = Spreadsheet.open(file_path)

For anyone working with very large files, a better way is to stream-read the contents of the file. The Spreadsheet gem supports this--albeit not well documented at this time (circa 3/2015).

Spreadsheet.open(file_path).worksheets.first.each do |row|
  # do something with the row (an array of cell values)
end

CITE: https://github.com/zdavatz/spreadsheet

#10

The RemoteTable library uses roo internally. It makes it easy to read spreadsheets of different formats (XLS, XLSX, CSV, etc. possibly remote, possibly stored inside a zip, gz, etc.):

require 'remote_table'
r = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/02data.zip', :filename => 'guide_jan28.xls'
r.each do |row|
  puts row.inspect
end

Output:

{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.0", "cyl"=>"6.0", "trans"=>"Auto(S4)", "drv"=>"R", "bidx"=>"60.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"20.0", "ucty"=>"19.1342", "uhwy"=>"30.2", "ucmb"=>"22.9121", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1238.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"2MODE", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.2", "cyl"=>"6.0", "trans"=>"Manual(M6)", "drv"=>"R", "bidx"=>"65.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"19.0", "ucty"=>"18.7", "uhwy"=>"30.4", "ucmb"=>"22.6171", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1302.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ASTON MARTIN", "carline name"=>"ASTON MARTIN VANQUISH", "displ"=>"5.9", "cyl"=>"12.0", "trans"=>"Auto(S6)", "drv"=>"R", "bidx"=>"1.0", "cty"=>"12.0", "hwy"=>"19.0", "cmb"=>"14.0", "ucty"=>"13.55", "uhwy"=>"24.7", "ucmb"=>"17.015", "fl"=>"P", "G"=>"G", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1651.0", "eng dscr"=>"GUZZLER", "trans dscr"=>"CLKUP", "vpc"=>"4.0", "cls"=>"1.0"}
