How to read a Parquet file into a Pandas DataFrame?

Time: 2022-10-29 13:37:57

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.


I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.


2 Solutions

#1 (32 votes)

pandas 0.21 introduces new functions for Parquet:


pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:


These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).

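To make this concrete, here is a minimal, self-contained sketch (the file names and sample data are made up for illustration) that writes a small DataFrame to Parquet and reads it back, entirely in memory on a laptop:

import pandas as pd

# Illustrative sample data, just to have a Parquet file to read back
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write with the pyarrow engine (pip install pyarrow), then read it back
df.to_parquet("example_pa.parquet", engine="pyarrow")
df_pa = pd.read_parquet("example_pa.parquet", engine="pyarrow")

# The same round trip works with the fastparquet engine (pip install fastparquet)
df.to_parquet("example_fp.parquet", engine="fastparquet")
df_fp = pd.read_parquet("example_fp.parquet", engine="fastparquet")

Since the question also mentions S3: pd.read_parquet accepts an s3:// URL as well (e.g. pd.read_parquet('s3://bucket/key.parquet')) provided the s3fs package is installed, so no Hadoop, Hive or Spark is needed for that case either.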

#2 (12 votes)

Update: since I answered this, there has been a lot of work in this area; look at Apache Arrow for better reading and writing of Parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/


There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python


It creates Python objects, which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.

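As a sketch of the Apache Arrow route mentioned above (the file name is illustrative, and this assumes pyarrow is installed), you can also bypass pandas' wrapper and use pyarrow directly:

import pyarrow.parquet as pq

# Read the Parquet file into an Arrow Table
table = pq.read_table("example_pa.parquet")

# Convert the Arrow Table into an in-memory Pandas DataFrame
df = table.to_pandas()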
