What is the point of indexing in pandas?

Time: 2022-01-11 19:58:33

Can someone point me to a link or explain the benefits of indexing in pandas? I routinely work with tables and join them on columns, and this joining/merging process seems to re-index things anyway, so setting up an index feels cumbersome when I don't think I need one.

Any thoughts on best practices around indexing?

1 Answer

#1


43  

Like a dict, a DataFrame's index is backed by a hash table. Looking up rows by index value is like looking up dict values by key.

In contrast, the values in a column are like the values in a list.

Looking up rows by index value is faster than looking up rows by column value.
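
To make the dict/list analogy concrete, here is a small sketch in plain Python. It does not touch pandas internals, and the names `values` and `by_key` are only illustrative:

```python
# Illustrative names only; this is plain Python, not pandas internals.
values = list(range(10000))                      # like a column: found by scanning
by_key = {v: i for i, v in enumerate(values)}    # like an index: found by hashing

# Linear scan: compares elements one by one until it reaches 999.
pos_scan = values.index(999)

# Hash lookup: goes straight to the entry stored under key 999.
pos_hash = by_key[999]

print(pos_scan, pos_hash)
```

Both calls find the same position; the difference is only in how much work it takes to get there as the data grows.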

For example, consider

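import numpy as np
import pandas as pd

# 10000 rows: one random value per row, plus a column named 'index'
df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
# Same data, with the 'index' column moved into the DataFrame's index
df_with_index = df.set_index(['index'])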

Here is how you could look up any row where the df['index'] column equals 999. Pandas has to scan every value in the column to find the ones equal to 999.

df[df['index'] == 999]

#           foo  index
# 999  0.375489    999

Here is how you could look up any row where the index equals 999. With an index, Pandas uses the hash value to find the rows:

df_with_index.loc[999]
# foo    0.375489
# Name: 999, dtype: float64
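
As a rough illustration of the label lookup behind `.loc`, the index can be asked directly for the position of a label with `Index.get_loc`. This sketch rebuilds a frame like the one above; the variable names `pos`, `row_by_label` and `row_by_pos` are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])

# Ask the index for the integer position of the label 999.
pos = df_with_index.index.get_loc(999)

# The label-based and position-based lookups return the same row.
row_by_label = df_with_index.loc[999]
row_by_pos = df_with_index.iloc[pos]
print(pos, row_by_label.equals(row_by_pos))
```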

Looking up rows by index is much faster than looking up rows by column value:

In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop

In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop

Note, however, that it takes time to build the index:

In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop

Building the index (330 µs above) costs about as much as a single column scan (368 µs), so having the index is only advantageous when you have repeated lookups of this type to perform.
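
If you want to check the trade-off on your own machine, a sketch along these lines may help. It uses the standard-library `timeit` module instead of the IPython `%timeit` magic, and it treats the `set_index` call as the build cost, as the timings above do. The helper `per_call` and the variables `t_scan`, `t_lookup`, `t_build` and `saved` are made up for this example, and the numbers will vary with hardware and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])


def per_call(func, number=200):
    """Average seconds per call over `number` runs (hypothetical helper)."""
    return timeit.timeit(func, number=number) / number


t_scan = per_call(lambda: df[df['index'] == 999])      # column scan
t_lookup = per_call(lambda: df_with_index.loc[999])    # index lookup
t_build = per_call(lambda: df.set_index(['index']))    # cost of set_index

# Time saved per lookup once the indexed frame exists.
saved = t_scan - t_lookup
if saved > 0:
    print(f"index pays for itself after about {t_build / saved:.1f} lookups")
else:
    print("no per-lookup saving measured on this machine")
```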

Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as set_index, stack, unstack, pivot, pivot_table, melt, lreshape, and crosstab, use or manipulate the index. Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge or groupby operations. (As you note, joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge and groupby take advantage of fast index lookups when possible.
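
A minimal sketch of the two joining styles and of an index-based reshape, using two tiny made-up frames (`left`, `right`, and the column names `key`, `x`, `y` are all hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'c'], 'y': [10, 20, 30]})

# Column-based join: match rows on the values of the 'key' column.
merged = left.merge(right, on='key')

# Index-based join: move 'key' into the index, then align on it.
joined = left.set_index('key').join(right.set_index('key'))

# Reshaping with the index: stack moves the column labels into an
# inner index level, and unstack moves them back out to columns.
stacked = joined.stack()
roundtrip = stacked.unstack()

print(merged)
print(joined)
print(roundtrip)
```

Both joins produce the same pairing of `x` and `y`; the difference is whether `key` ends up as an ordinary column or as the index of the result.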

Time series have resample, asfreq and interpolate methods whose underlying implementations take advantage of fast index lookups too.
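
For instance, with a DatetimeIndex these methods can be called directly on a Series. The data below is invented for illustration, and the variable names are hypothetical:

```python
import numpy as np
import pandas as pd

# Six daily observations; the DatetimeIndex is what these methods work on.
idx = pd.date_range('2022-01-01', periods=6, freq='D')
ts = pd.Series(np.arange(6.0), index=idx)

two_daily = ts.resample('2D').sum()    # aggregate into 2-day bins
half_daily = ts.asfreq('12h')          # upsample; the new slots are NaN
filled = half_daily.interpolate()      # linear fill between known points

print(two_daily)
print(filled.head())
```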

So in the end, I think the index is so useful, and shows up in so many functions, because it can perform fast hash lookups.
