If it is undefined whether an indexing operation returns a view or a copy, what is the point of views in pandas?

Time: 2021-07-23 15:53:11

I have switched from R to pandas. I routinely get a SettingWithCopyWarning when I do something like this:

import pandas as pd

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Filtering step, which may or may not return a view
df_b = df_a[df_a['col1'] > 1]

# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']

# SettingWithCopyWarning!!

I think I understand the problem, though I'll gladly learn what I got wrong. In the given example, it is undefined whether df_b is a view on df_a or not. Thus, the effect of assigning to df_b is unclear: does it affect df_a? The problem can be solved by explicitly making a copy when filtering:

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Filtering step, definitely a copy now
df_b = df_a[df_a['col1'] > 1].copy()

# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']

# No Warning now
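
To make the guarantee concrete, here is a minimal sketch that checks the explicit copy really is independent of the original. It reuses the names from the question; the extra write of the value 100 is only an illustration and is not part of the original code.

```python
import pandas as pd

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Explicit copy: df_b owns its data, so writes cannot reach df_a.
df_b = df_a[df_a['col1'] > 1].copy()
df_b['new_col'] = 2 * df_b['col1']

# Illustrative write to an existing cell of the copy (row label 1).
df_b.loc[1, 'col1'] = 100

print(df_a)  # still has the single column 'col1' with the original values
print(df_b)
```

Because df_b is a copy by construction, the intent of each assignment is unambiguous, which is the reason the warning disappears.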

I think there is something that I am missing: if we can never really be sure whether we create a view or not, what are views good for? The pandas documentation (http://pandas-docs.github.io/pandas-docs-travis/indexing.html?highlight=view#indexing-view-versus-copy) says:

Outside of simple cases, it’s very hard to predict whether it [getitem] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)

Similar warnings can be found for different indexing methods.

I find it very cumbersome and error-prone to sprinkle .copy() calls throughout my code. Am I using the wrong style for manipulating my DataFrames? Or is the performance gain so high that it justifies the apparent awkwardness?

2 Answers

#1


10  

Great question!

The short answer is: this is a flaw in pandas that's being remedied.

You can find a longer discussion of the nature of the problem here, but the main take-away is that we're now moving to a "copy-on-write" behavior in which any time you slice, you get a new copy, and you never have to think about views. The fix will soon come through this refactoring project. I actually tried to fix it directly (see here), but it just wasn't feasible in the current architecture.

In truth, we'll keep views in the background -- they make pandas SUPER memory efficient and fast when they can be provided -- but we'll end up hiding them from users so, from the user perspective, if you slice, index, or cut a DataFrame, what you get back will effectively be a new copy.

(This is accomplished by creating views when the user is only reading data, but whenever an assignment operation is used, the view will be converted to a copy before the assignment takes place.)
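
The mechanism described above can be sketched with a toy class. This is only an illustration of the copy-on-write idea, not how pandas implements it; the class name CowColumn and its attributes are hypothetical.

```python
class CowColumn:
    """Toy copy-on-write column (illustration only, not pandas internals)."""

    def __init__(self, values):
        self._buf = list(values)
        # One-element list used as a mutable counter shared by all users of _buf.
        self._sharers = [1]

    def view(self):
        """Return a new column that borrows this buffer without copying."""
        other = CowColumn.__new__(CowColumn)
        other._buf = self._buf
        other._sharers = self._sharers
        self._sharers[0] += 1
        return other

    def __getitem__(self, i):
        return self._buf[i]

    def __setitem__(self, i, value):
        if self._sharers[0] > 1:
            # Another object still reads this buffer: detach before writing.
            self._sharers[0] -= 1
            self._buf = list(self._buf)
            self._sharers = [1]
        self._buf[i] = value


parent = CowColumn([1, 2, 3, 4])
child = parent.view()
shared_before = child._buf is parent._buf  # reads share one buffer
child[0] = 99                              # first write triggers the copy
shared_after = child._buf is parent._buf   # child now owns its own buffer
print(shared_before, shared_after, parent[0], child[0])  # True False 1 99
```

Reading through the view costs no extra memory, and the copy is paid for only by the object that actually writes, so the parent never observes the change.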

My best guess is that the fix will land within a year. In the meantime, I'm afraid some .copy() calls may be necessary, sorry!

#2


2  

I agree this is a bit funny. My current practice is to look for a "functional" method for whatever I want to do (in my experience these almost always exist with the exception of renaming columns and series). Sometimes it makes the code more elegant, sometimes it makes it worse (I don't like assign with lambda), but at least I don't have to worry about mutability.

So for indexing, instead of using the slice notation, you can use query, which will return a copy by default:

In [5]: df_a.query('col1 > 1')
Out[5]:
   col1
1     2
2     3
3     4

I expand on it a little in this blog post.

Edit: As raised in the comments, it looks like I was wrong about query returning a copy by default. However, if you use the assign style, assign makes a copy before returning your result, so you're all good:

df_b = (df_a.query('col1 > 1')
            .assign(newcol = 2*df_a['col1']))
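
As a hedged variant of the chain above, the new column can also be computed from the filtered frame itself by passing a callable to assign, so nothing refers back to df_a. The column name new_col follows the question rather than this answer, and the final prints are only there to show that df_a is left untouched.

```python
import pandas as pd

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# assign works on a copy of its input; the callable receives the filtered
# frame, so the new column is derived from the rows that survived the query.
df_b = (df_a.query('col1 > 1')
            .assign(new_col=lambda d: 2 * d['col1']))

print(df_b)
print(df_a.columns.tolist())  # df_a keeps only its original column
```

The version without a lambda also works because pandas aligns 2*df_a['col1'] to the filtered index; the callable form just makes that dependency explicit.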
