*Efficiently* using RPy (or other means) to move a dataframe from Pandas to R

Time: 2021-04-11 22:55:54

I have a dataframe in Pandas, and I want to do some statistics on it using R functions. No problem! RPy makes it easy to send a dataframe from Pandas into R:

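import pandas as pd
from rpy2 import robjects as ro

# 100,000 rows x 100 columns, all values missing
df = pd.DataFrame(index=range(100000), columns=range(100))
# Assumes pandas-to-R conversion is enabled (e.g. via rpy2.robjects.pandas2ri)
ro.globalenv['df'] = df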

And if we're in IPython:

# In older IPython versions this was `%load_ext rmagic`
%load_ext rpy2.ipython
%R -i df

For some reason the ro.globalenv route is slightly slower than the rmagic route, but no matter. What matters is this: The dataframe I will ultimately be using is ~100GB. This presents a few problems:

  1. Even with just 1GB of data, the transfer is rather slow.

  2. If I understand correctly, this creates two copies of the dataframe in memory: one in Python, and one in R. That means I'll have just doubled my memory requirements, and I haven't even gotten to running statistical tests!
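To put rough numbers on the second problem, here is a minimal sketch of the arithmetic. The helper name `estimate_transfer_memory` is hypothetical, and the doubling rests on the assumption that the R-side copy is about as large as the pandas original, which may not hold for every column type.

```python
import numpy as np
import pandas as pd


def estimate_transfer_memory(df: pd.DataFrame) -> dict:
    """Estimate bytes held by df alone and by df plus one full copy.

    Hypothetical helper, for illustration only.
    """
    # deep=True also counts the Python objects behind object-dtype columns.
    python_bytes = int(df.memory_usage(index=True, deep=True).sum())
    return {
        "python_bytes": python_bytes,
        # Assumption: the copy in R is roughly the same size as the original.
        "python_plus_r_bytes": 2 * python_bytes,
    }


# Small stand-in for the real data: 1,000 rows x 100 float64 columns.
df = pd.DataFrame(np.zeros((1000, 100)))
estimate = estimate_transfer_memory(df)
print(estimate)
```

Running this on a representative slice and scaling by the row count should give a ballpark figure for the full dataframe before attempting the transfer.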

Is there any way to:

  1. Transfer a large dataframe between Python and R more quickly?

  2. Access the same object in memory? I suspect this is asking for the moon.

2 solutions

#1


5  

rpy2 uses a conversion mechanism that tries to avoid copying objects when moving between Python and R. However, this currently works only in the direction R -> Python.

Python has an interface called the "buffer interface" that rpy2 uses to minimize the number of copies for data types that are compatible at the C level between R and Python (see http://rpy.sourceforge.net/rpy2/doc-2.5/html/numpy.html#from-rpy2-to-numpy - the doc seems outdated, as the __array_struct__ interface is no longer the primary choice).

There is no equivalent to the buffer interface in R, and the current concern holding me back from providing an equivalent functionality in rpy2 is the handling of borrowed references during garbage collection (and the lack of time to think sufficiently carefully about it).

So in summary, there is a way to share data between Python and R without copying, but it requires the data to be created in R.

#2


3  

Currently, feather seems to be the most efficient option for data interchange between R data frames and pandas DataFrames.
