Can someone point me to a link or provide an explanation of the benefits of indexing in pandas? I routinely deal with tables and join them based on columns, and this joining/merging process seems to re-index things anyway, so it's a bit cumbersome to apply index criteria considering I don't think I need to.
Any thoughts on best practices around indexing?
1 solution
Like a dict, a DataFrame's index is backed by a hash table. Looking up rows based on index values is like looking up dict values based on a key.
In contrast, the values in a column are like values in a list.
Looking up rows based on index values is faster than looking up rows based on column values.
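To make the dict-versus-list analogy concrete, here is a minimal plain-Python sketch. The names column_like and index_like are made up for illustration and are not pandas objects; the point is only that a list has to be scanned, while a dict jumps straight to the key.

```python
# Hypothetical stand-ins: a list plays the role of a column,
# a dict plays the role of an index.
column_like = list(range(10_000))
index_like = {value: position for position, value in enumerate(column_like)}

# Scanning the list compares elements one by one until it finds 999.
position_by_scan = column_like.index(999)

# The dict hashes the key and goes (almost) directly to the entry.
position_by_hash = index_like[999]

print(position_by_scan, position_by_hash)
```

Both lookups return the same position; they differ only in how much work is done to find it.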
For example, consider
import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])
Here is how you could look up any row where the df['index'] column equals 999. Pandas has to loop through every value in the column to find the ones equal to 999.
df[df['index'] == 999]
# foo index
# 999 0.375489 999
Here is how you could look up any row where the index equals 999. With an index, Pandas uses the hash value to find the rows:
df_with_index.loc[999]
# foo 0.375489
# Name: 999, dtype: float64
Looking up rows by index is much faster than looking up rows by column value:
In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop
In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop
Note, however, that it takes time to build the index:
In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop
So having the index is only advantageous when you have many lookups of this type to perform.
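The trade-off can be checked with a rough sketch like the one below. It assumes the same kind of frame as above; the keys list and the two helper functions are invented for this illustration, and the timings will vary by machine and pandas version, so no particular numbers are claimed.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": np.random.random(10_000), "index": range(10_000)})
keys = [5, 999, 4321]  # hypothetical values to look up


def lookups_by_column():
    # Each lookup scans the whole 'index' column with a boolean mask.
    return [df.loc[df["index"] == k, "foo"].iloc[0] for k in keys]


def lookups_by_index():
    # Pay the one-time cost of building the index, then look up by label.
    indexed = df.set_index("index")
    return [indexed.at[k, "foo"] for k in keys]


# Both approaches return the same values.
assert lookups_by_column() == lookups_by_index()

t_column = timeit.timeit(lookups_by_column, number=20)
t_index = timeit.timeit(lookups_by_index, number=20)
print(f"column scans: {t_column:.4f}s, set_index + lookups: {t_index:.4f}s")
```

Growing the keys list shifts the balance further toward the indexed version, since set_index is paid once while each column scan is paid per lookup.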
Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as set_index, stack, unstack, pivot, pivot_table, melt, lreshape, and crosstab, all use or manipulate the index. Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge or groupby operations. (As you note, joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge and groupby take advantage of fast index lookups when possible.
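As a small illustration of the two joining styles mentioned here, the sketch below merges two toy frames on a column and then joins the same data on the index. The frames left and right and their column names are made up for the example; it shows that the results agree, not how much faster either one is.

```python
import pandas as pd

# Hypothetical toy tables sharing a 'key' column.
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

# Column-based: merge matches rows on the values of the 'key' column.
by_column = left.merge(right, on="key", how="inner")

# Index-based: move 'key' into the index first, then join on the index.
by_index = left.set_index("key").join(right.set_index("key"), how="inner")

print(by_column)
print(by_index)
```

If the same tables are joined repeatedly, setting the index once and reusing the indexed frames avoids redoing that step for every join.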
Time series have resample, asfreq and interpolate methods whose underlying implementations take advantage of fast index lookups too.
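For readers who have not used these methods, here is a brief sketch of how they depend on a datetime index. The dates and values are invented for illustration.

```python
import pandas as pd

# A small series with a gap: 2024-01-03 is missing.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"])
ts = pd.Series([1.0, 2.0, 4.0], index=idx)

# asfreq conforms the index to a daily grid, inserting NaN for the gap.
daily = ts.asfreq("D")

# interpolate fills the NaN linearly from its neighbours.
filled = daily.interpolate()

# resample groups rows into 2-day bins based on the index, then aggregates.
two_day_mean = filled.resample("2D").mean()

print(filled)
print(two_day_mean)
```

None of these calls would work on a plain column of dates; they rely on the dates being the index.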
So in the end, I think the index is so useful, and shows up in so many functions, because it enables fast hash lookups.