Error when reading a .csv file with sqlContext.read in Spark

Time: 2021-04-19 21:53:40

I am trying to read a csv file into a dataframe in Spark as follows:

  1. I run spark-shell like this:

    spark-shell --jars .\spark-csv_2.11-1.4.0.jar;.\commons-csv-1.2.jar

    (I cannot directly download those dependencies, which is why I am using --jars.)

  2. Use the following command to read a csv file:


val df_1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("2008.csv")


But, here is the error message that I get:


scala> val df_1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("2008.csv")
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
        at com.databricks.spark.csv.package$.<init>(package.scala:27)
        at com.databricks.spark.csv.package$.<clinit>(package.scala)
        at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:235)
        at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:73)
        at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:162)
        at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:44)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
        at $iwC$$iwC$$iwC.<init>(<console>:47)
        at $iwC$$iwC.<init>(<console>:49)
        at $iwC.<init>(<console>:51)
        at <init>(<console>:53)
        at .<init>(<console>:57)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.csv.CSVFormat
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 57 more

After trying the first proposed solution:

PS C:\Users\319413696\Desktop\graphX> spark-shell --packages com.databricks:spark-csv_2.11:1.4.0
Ivy Default Cache set to: C:\Users\319413696\.ivy2\cache
The jars for the packages stored in: C:\Users\319413696\.ivy2\jars
:: loading settings :: url = jar:file:/C:/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.databricks#spark-csv_2.11;1.4.0 in local-m2-cache
        found org.apache.commons#commons-csv;1.1 in local-m2-cache
        found com.univocity#univocity-parsers;1.5.1 in local-m2-cache
downloading file:/C:/Users/319413696/.m2/repository/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar ...
        [SUCCESSFUL ] com.databricks#spark-csv_2.11;1.4.0!spark-csv_2.11.jar (0ms)
downloading file:/C:/Users/319413696/.m2/repository/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar ...
        [SUCCESSFUL ] org.apache.commons#commons-csv;1.1!commons-csv.jar (0ms)
downloading file:/C:/Users/319413696/.m2/repository/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1.jar ...
        [SUCCESSFUL ] com.univocity#univocity-parsers;1.5.1!univocity-parsers.jar (15ms)
:: resolution report :: resolve 671ms :: artifacts dl 31ms
        :: modules in use:
        com.databricks#spark-csv_2.11;1.4.0 from local-m2-cache in [default]
        com.univocity#univocity-parsers;1.5.1 from local-m2-cache in [default]
        org.apache.commons#commons-csv;1.1 from local-m2-cache in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: ERRORS
        Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0-sources.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0-sources.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0-src.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0-src.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0-javadoc.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0-javadoc.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/org/apache/apache/15/apache-15.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/org/apache/apache/15/apache-15.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/org/apache/commons/commons-parent/35/commons-parent-35.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/35/commons-parent-35.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1-sources.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-csv/1.1/commons-csv-1.1-sources.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1-src.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-csv/1.1/commons-csv-1.1-src.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1-javadoc.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-csv/1.1/commons-csv-1.1-javadoc.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1-sources.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1-sources.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1-src.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1-src.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url https://repo1.maven.org/maven2/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1-javadoc.jar (java.net.SocketException: Permission denied: connect)

        Server access error at url http://dl.bintray.com/spark-packages/maven/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1-javadoc.jar (java.net.SocketException: Permission denied: connect)

4 Answers

#1



  1. Give the full paths of the jars and separate them with , instead of ; (a quick classpath check is sketched after this list):

    spark-shell --jars fullpath\spark-csv_2.11-1.4.0.jar,fullpath\commons-csv-1.2.jar

  2. Be sure that you have write permissions on the folders (DFS) where temporary files will be written.
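
Once the shell is up, a quick way to confirm that commons-csv actually made it onto the driver classpath is to resolve the missing class by hand (a diagnostic sketch added here, not part of the original answer; it assumes spark-shell in local/client mode, where --jars extends the driver classpath):

    scala> Class.forName("org.apache.commons.csv.CSVFormat")  // throws ClassNotFoundException if the jar is still missing

If this line throws, the subsequent sqlContext.read call will fail with the same NoClassDefFoundError shown in the question.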

#2



Download spark-csv to your .m2 directory, then use spark-shell --packages com.databricks:spark-csv_2.11:1.4.0

If you can't download spark-csv directly, download it on another system and then copy the whole .m2 directory to your computer.
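
For reference, the resolution log above shows exactly where Ivy picks the artifacts up from in the local Maven cache, so the copied jars must land at the standard repository paths (with <you> standing in for your Windows username):

    C:/Users/<you>/.m2/repository/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar
    C:/Users/<you>/.m2/repository/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar
    C:/Users/<you>/.m2/repository/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1.jar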

#3



Instead of using sqlContext.read, I used the following code to turn my .csv file into a dataframe. Suppose the .csv file has 5 columns, as follows:

case class Flight(arrDelay: Int, depDelay: Int, origin: String, dest: String, distance: Int)

Then:


val flights = sc.textFile("2008.csv").map(_.split(",")).map(p => Flight(p(0).trim.toInt, p(1).trim.toInt, p(2), p(3), p(4).trim.toInt)).toDF()
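
One caveat with this approach (an added note, assuming 2008.csv starts with a header row, as the airline on-time dataset usually does): sc.textFile returns raw lines, so .trim.toInt will throw a NumberFormatException on the header. A minimal sketch that drops the header first:

    val raw = sc.textFile("2008.csv")
    val header = raw.first()              // first line of the file, assumed to be the header
    val flights = raw.filter(_ != header) // keep only data rows
      .map(_.split(","))
      .map(p => Flight(p(0).trim.toInt, p(1).trim.toInt, p(2), p(3), p(4).trim.toInt))
      .toDF()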

#4



Milad Khajavi saved the day for me. After days of battling to get spark-csv to work on a cluster with no internet access, I finally took his idea and downloaded the package on a VM. Then I copied the .ivy2 directory from the VM to the other cluster. Now it's working without any issues.

Download spark-csv to your .m2 directory, then use spark-shell --packages com.databricks:spark-csv_2.11:1.4.0

If you can't download spark-csv directly, download it on another system and then copy the whole .m2 directory to your computer.
