pyspark ImportError: cannot import name accumulators

Date: 2022-06-22 02:31:39

Goal: I am trying to get apache-spark's pyspark to be properly interpreted within my PyCharm IDE.

Problem: I currently receive the following error:

ImportError: cannot import name accumulators

I was following this blog to guide me through the process: http://renien.github.io/blog/accessing-pyspark-pycharm/

Because my code was taking the except path, I removed the try/except just to see what the exact error was.

Prior to this I received the following error:

ImportError: No module named py4j.java_gateway

This was fixed simply by running 'sudo pip install py4j' in bash.

My code currently looks like the following chunk:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="[MY_HOME_DIR]/spark-1.2.0"

# Append pyspark to Python Path
sys.path.append("[MY_HOME_DIR]/spark-1.2.0/python/")

try:
    from pyspark import SparkContext
    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

My Questions:
1. What is the source of this error? What is the cause?
2. How do I remedy the issue so that I can run pyspark in my PyCharm editor?

NOTE: The current interpreter I use in PyCharm is Python 2.7.8 (~/anaconda/bin/python)
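
For reference, a quick sanity check of what the PyCharm interpreter actually sees (just a diagnostic sketch, nothing Spark-specific assumed):

import sys
import os

# Confirm which interpreter PyCharm is running and whether it can see SPARK_HOME
print (sys.executable)                   # should be ~/anaconda/bin/python
print (sys.version)
print (os.environ.get("SPARK_HOME"))     # None here would explain a lot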

Thanks ahead of time!

Don

11 solutions

#1


1  

First, set your environment variables:

export SPARK_HOME=/home/.../Spark/spark-2.0.1-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
PATH="$PATH:$JAVA_HOME/bin:$SPARK_HOME/bin:$PYTHONPATH"

Make sure that you use your own version numbers.

Then restart! It is important to validate your settings.
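
To validate from inside Python itself, a minimal check (the paths above are examples; use your own) is:

import os

# Sanity check that the exported variables are visible to the interpreter
print (os.environ.get("SPARK_HOME"))    # e.g. /home/.../Spark/spark-2.0.1-bin-hadoop2.7
print (os.environ.get("PYTHONPATH"))    # should contain .../python and the py4j-*-src.zip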

#2


7  

The issue revolves around the PYTHONPATH variable, which specifies the Python module search path.

Since the pyspark shell itself mostly runs fine, you can refer to the pyspark shell script and see that its PYTHONPATH setting looks like the following.

PYTHONPATH=/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark/python
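
If you would rather set this inside the script than in the environment, the equivalent is roughly the following sketch (the paths are the Cloudera ones above; adjust them to your install):

import sys

# Mirror the pyspark script's PYTHONPATH entries inside the script itself
sys.path.insert(0, "/usr/lib/spark/python")
sys.path.insert(0, "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip")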

My environment is the Cloudera Quickstart VM 5.3.

Hope this helps.

#3


4  

This looks to me like a circular-dependency bug.

In [MY_HOME_DIR]/spark-1.2.0/python/pyspark/context.py, remove or comment out the line

from pyspark import accumulators

It's about 6 lines of code from the top.
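
After the edit, that spot in context.py looks roughly like this (a sketch; the neighbouring import is the one quoted in a later answer):

# from pyspark import accumulators              # commented out to break the circular import
from pyspark.accumulators import Accumulator    # this import still provides Accumulator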

I filed an issue with the Spark project here:

https://issues.apache.org/jira/browse/SPARK-4974

#4


2  

I came across the same error. I just installed py4j.

sudo pip install py4j

There was no need to modify .bashrc.
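
A quick way to confirm the install took effect for the interpreter PyCharm uses (just a sanity check):

# Run this with the same interpreter that is configured in PyCharm
import py4j.java_gateway
print (py4j.java_gateway.__file__)   # shows where py4j was picked up from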

#5


1  

I ran into the same issue using CDH 5.3.

In the end this actually turned out to be pretty easy to resolve. I noticed that the script /usr/lib/spark/bin/pyspark has variables defined for IPython.

I installed Anaconda to /opt/anaconda, so I set:

export PATH=/opt/anaconda/bin:$PATH
#note that the default port 8888 is already in use so I used a different port
export IPYTHON_OPTS="notebook --notebook-dir=/home/cloudera/ipython-notebook --pylab inline --ip=* --port=9999"

Then finally I executed

/usr/bin/pyspark

which now functions as expected.

#6


1  

I ran into this issue as well. To solve it, I commented out line 28 in ~/spark/spark/python/pyspark/context.py, the file which was causing the error:

# from pyspark import accumulators
from pyspark.accumulators import Accumulator

As the accumulator import seems to be covered by the following line (29), there doesn't seem to be an issue. Spark is now running fine (after pip install py4j).

#7


1  

In PyCharm, before running the above script, ensure that you have unzipped the py4j*.zip file and added a reference to it in the script: sys.path.append("path to spark*/python/lib")
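
Concretely, those sys.path.append calls might look like the sketch below (the py4j version in the file name is only an example and varies with the Spark release; appending the zip itself also works, since Python can import from zip archives):

import os
import sys

spark_home = os.environ["SPARK_HOME"]   # assumes SPARK_HOME is already set
sys.path.append(os.path.join(spark_home, "python"))
sys.path.append(os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))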

It worked for me.

#8


1  

To get rid of **ImportError: No module named py4j.java_gateway** you need to add the following lines:

import os
import sys

# Point SPARK_HOME at the Spark installation (raw strings avoid backslash-escape surprises on Windows)
os.environ['SPARK_HOME'] = r"D:\python\spark-1.4.1-bin-hadoop2.4"

# Put the pyspark package and the bundled py4j zip on the module search path
sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python")
sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf

    print ("success")

except ImportError as e:
    print ("error importing spark modules", e)
    sys.exit(1)

#9


0  

I was able to find a fix for this on Windows, but I am not really sure of its root cause.

If you open accumulators.py, you will see that there is first a header comment, followed by help text, and then the import statements. Move one or more of the import statements to just after the comment block and before the help text. This worked on my system, and I was able to import pyspark without any issues.

#10


0  

If you have just upgraded to a new Spark version, make sure the new version of py4j is on your PYTHONPATH, since each new Spark version ships with a new py4j version.

In my case it is "$SPARK_HOME/python/lib/py4j-0.10.3-src.zip" for Spark 2.0.1 instead of the old "$SPARK_HOME/python/lib/py4j-0.10.1-src.zip" for Spark 2.0.0.
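
One way to avoid chasing the version number after every upgrade is to glob for whatever py4j zip the current Spark ships with; a minimal sketch, assuming SPARK_HOME is set:

import glob
import os
import sys

spark_home = os.environ["SPARK_HOME"]
# pick up whichever py4j-*-src.zip this Spark release bundles
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
    sys.path.append(zip_path)
sys.path.append(os.path.join(spark_home, "python"))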

#11


0  

The only thing that worked for me was to go to the base folder of Spark and then open accumulators.py.

At the beginning of the file there was a malformed multi-line statement. Remove all of it.

you're good to go!
