How to construct a DataFrame from an Excel (xls, xlsx) file in Scala Spark?

Time: 2022-10-08 18:17:38

I have a large Excel (xlsx and xls) file with multiple sheets, and I need to convert it to an RDD or DataFrame so that it can be joined to other DataFrames later. I was thinking of using Apache POI to save it as a CSV and then read the CSV into a DataFrame. But if there is any library or API that can help with this process, that would make it easy. Any help is highly appreciated.

3 Solutions

#1


10  

The solution to your problem is to use Spark Excel dependency in your project.

Spark Excel has flexible options to play with.

I have tested the following code to read from Excel and convert it to a DataFrame, and it works perfectly:

import org.apache.spark.sql.DataFrame

// Assumes a SQLContext named sqlContext is already in scope (e.g. in spark-shell).
def readExcel(file: String): DataFrame = sqlContext.read
    .format("com.crealytics.spark.excel")
    .option("location", file)                  // path to the Excel file
    .option("useHeader", "true")               // first row holds the column names
    .option("treatEmptyValuesAsNulls", "true") // empty cells become nulls
    .option("inferSchema", "true")             // infer column types from the data
    .option("addColorColumns", "false")        // do not add cell-color columns
    .load()

val data = readExcel("path to your excel file")

data.show(false)

If your Excel workbook has multiple sheets, you can pass the sheet name as an option:

.option("sheetName", "Sheet2")

I hope it's helpful.

#2


4  

Here are examples of reading from and writing to Excel with the full set of options...

Source: spark-excel from crealytics

Scala API Spark 2.0+:

Create a DataFrame from an Excel file

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
    .format("com.crealytics.spark.excel")
    .option("sheetName", "Daily") // Required
    .option("useHeader", "true") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "true") // Optional, default: false
    .option("startColumn", 0) // Optional, default: 0
    .option("endColumn", 99) // Optional, default: Int.MaxValue
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
    .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    .schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
    .load("Worktime.xlsx")

Write a DataFrame to an Excel file

df.write
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Daily")
  .option("useHeader", "true")
  .option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
  .option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
  .mode("overwrite")
  .save("Worktime2.xlsx")

Note: Instead of sheet1 or sheet2, you can use the sheets' actual names as well; in the example given above, Daily is the sheet name.

  • If you want to use it from the Spark shell...

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:0.9.8
  • Dependencies need to be added (in the case of Maven etc.); an sbt equivalent is sketched after the coordinates:
groupId: com.crealytics
artifactId: spark-excel_2.11
version: 0.9.8
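
If you build with sbt instead of Maven, the same dependency can be declared as follows (a sketch; adjust the version to the latest release):

// %% appends the Scala binary version, resolving to spark-excel_2.11 here.
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.9.8"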

Tip: This is a very useful approach, particularly for writing Maven test cases: you can place Excel sheets with sample data in the src/main/resources folder and access them in your unit test cases (Scala/Java), which create DataFrame[s] out of the Excel sheets; a minimal test sketch follows.
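
A sketch of such a test, assuming ScalaTest 3.1+ and Spark 2.x; sample.xlsx and the sheet name Daily are hypothetical:

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class ExcelReadSuite extends AnyFunSuite {
  test("builds a DataFrame from a sample Excel sheet") {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // sample.xlsx sits in src/test/resources, so it is on the test classpath.
    val path = getClass.getResource("/sample.xlsx").getPath
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("sheetName", "Daily")
      .option("useHeader", "true")
      .load(path)
    assert(df.count() > 0)
  }
}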

  • Another option you could consider is spark-hadoopoffice-ds

A Spark datasource for the HadoopOffice library. This Spark datasource assumes at least Spark 2.0.1. However, the HadoopOffice library can also be used directly from Spark 1.x. Currently this datasource supports the following formats of the HadoopOffice library:

Excel Datasource format: org.zuinnote.spark.office.excel. It loads and saves old Excel (.xls) and new Excel (.xlsx) files. This datasource is available on Spark-packages.org and on Maven Central.
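
A minimal read sketch with this datasource; read.locale.bcp47 sets the locale used to interpret cell values (verify the option names against the HadoopOffice documentation for your version):

val df = sqlContext.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us")
    .load("Worktime.xlsx")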

#3


2  

Alternatively, you can use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki), which also supports encrypted Excel documents and linked workbooks, amongst other features. Of course, Spark is also supported.
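
For example, reading an encrypted workbook might look like the sketch below. The password option key is an assumption based on the HadoopOffice configuration scheme (the hadoopoffice. prefix is dropped in the Spark datasource options), so verify it against the wiki:

val df = sqlContext.read
    .format("org.zuinnote.spark.office.excel")
    .option("read.locale.bcp47", "us")
    // Assumed option key for the decryption password (check the wiki).
    .option("read.security.crypt.password", "secret")
    .load("encrypted.xlsx")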
