NumPy ufuncs (element-wise array functions) also work on pandas objects.

>>> frame

   one  two  three  four

a    0    1      2     3

b    4    5      6     7

c    8    9     10    11

d   12   13     14    15

>>> np.square(frame)  # square every element

   one  two  three  four

a    0    1      4     9

b   16   25     36    49

c   64   81    100   121

d  144  169    196   225
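
The same behaviour can be reproduced in a standalone script. The sketch below assumes the frame was built from np.arange(16), which the source does not show, and checks that a ufunc returns a DataFrame with the original labels.

```python
import numpy as np
import pandas as pd

# Assumed construction: a 4x4 frame holding 0..15, matching the output above.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

squared = np.square(frame)  # element-wise square; row and column labels are kept
roots = np.sqrt(squared)    # element-wise square root; the result is float

print(type(squared).__name__)
print(roots)
```

Because the ufunc works element by element, the index and column labels pass through unchanged.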

>>>

The DataFrame apply method applies a function to the one-dimensional arrays formed by each column (the default) or each row (axis=1).

>>> frame

   one  two  three  four

a    0    1      2     3

b    4    5      6     7

c    8    9     10    11

d   12   13     14    15

>>> func = lambda x : x.max()-x.min()

>>> frame.apply(func)

one      12

two      12

three    12

four     12

dtype: int64

>>> frame.apply(func,axis = 1)

a    3

b    3

c    3

d    3

dtype: int64
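
The function passed to apply does not have to return a scalar. As a minimal sketch (the helper name min_max is made up for illustration, and the frame construction is assumed), returning a Series produces a DataFrame instead:

```python
import numpy as np
import pandas as pd

# Assumed construction of the 4x4 frame used above.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

def min_max(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series.
    return pd.Series([x.min(), x.max()], index=["min", "max"])

by_column = frame.apply(min_max)       # rows 'min'/'max', one column per original column
by_row = frame.apply(min_max, axis=1)  # one row per original row, columns 'min'/'max'

print(by_column)
print(by_row)
```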

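The DataFrame applymap method applies a function to every individual element. (pandas 2.1 and later deprecate applymap in favour of the equivalent DataFrame.map.)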

>>> f = lambda x : x+1

>>> frame

   one  two  three  four

a    0    1      2     3

b    4    5      6     7

c    8    9     10    11

d   12   13     14    15

>>> frame.applymap(f)

   one  two  three  four

a    1    2      3     4

b    5    6      7     8

c    9   10     11    12

d   13   14     15    16
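
Since pandas 2.1 renamed the element-wise DataFrame method to map, a script that has to run on several pandas versions may want a small wrapper. The helper name elementwise below is hypothetical, and the frame construction is assumed.

```python
import numpy as np
import pandas as pd

# Assumed construction of the 4x4 frame used above.
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

def elementwise(df, func):
    # Hypothetical helper: prefer DataFrame.map (pandas >= 2.1),
    # fall back to applymap on older versions.
    if hasattr(pd.DataFrame, "map"):
        return df.map(func)
    return df.applymap(func)

plus_one = elementwise(frame, lambda x: x + 1)
print(plus_one)
```

For simple arithmetic such as this, frame + 1 gives the same result and is faster; element-wise application is mainly useful for functions that are not vectorised, such as string formatting.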

Series has its own method for element-wise function application: map.

>>> frame['one']  # selecting a DataFrame column returns a Series

a     0

b     4

c     8

d    12

Name: one, dtype: int32

>>> frame['one'].map(f)

a     1

b     5

c     9

d    13

Name: one, dtype: int64
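
Series.map also accepts a dict in place of a function. In this sketch the lookup table labels is invented for illustration; values with no matching key become NaN.

```python
import pandas as pd

# The 'one' column from above, rebuilt by hand.
one = pd.Series([0, 4, 8, 12], index=["a", "b", "c", "d"], name="one")

# Hypothetical lookup table; 12 is deliberately left out.
labels = {0: "zero", 4: "four", 8: "eight"}

mapped = one.map(labels)  # values missing from the dict map to NaN
print(mapped)
```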

>>>

Sorting and ranking

Use sort_index to sort by row or column labels. It returns a new, sorted object and leaves the original unchanged.

>>> obj = Series(range(4),index=['d','b','a','c'])

>>> new_obj = obj.sort_index()

>>> new_obj

a    2

b    1

c    3

d    0

dtype: int64

>>> obj

d    0

b    1

a    2

c    3

dtype: int64

>>>

>>> new_obj = obj.sort_index(ascending=False)  # ascending by default; pass ascending=False for descending order

>>> new_obj

d    0

c    3

b    1

a    2

dtype: int64

A DataFrame can be sorted by its labels along either axis.

>>> frame = DataFrame(np.random.randn(4,4),columns = ['c','a','d','b'],index=[3,1,4,2])

>>> frame

          c         a         d         b

3  0.004950 -1.272352  1.050491  0.823530

1  1.198348  0.647114  0.154131 -0.636497

4 -0.358309  0.525307 -1.868459  0.867197

2 -0.021764  0.140501  1.459700 -0.090884

>>> frame.sort_index()

          c         a         d         b

1  1.198348  0.647114  0.154131 -0.636497

2 -0.021764  0.140501  1.459700 -0.090884

3  0.004950 -1.272352  1.050491  0.823530

4 -0.358309  0.525307 -1.868459  0.867197

>>> frame.sort_index(axis =1)

          a         b         c         d

3 -1.272352  0.823530  0.004950  1.050491

1  0.647114 -0.636497  1.198348  0.154131

4  0.525307  0.867197 -0.358309 -1.868459

2  0.140501 -0.090884 -0.021764  1.459700
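
The axis and ascending arguments can be combined. A short sketch, using fixed integers instead of random numbers so the result is reproducible:

```python
import numpy as np
import pandas as pd

# Same labels as above, but deterministic values (an assumption for illustration).
frame = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    columns=["c", "a", "d", "b"],
    index=[3, 1, 4, 2],
)

rows_desc = frame.sort_index(ascending=False)          # row labels: 4, 3, 2, 1
cols_desc = frame.sort_index(axis=1, ascending=False)  # column labels: d, c, b, a

print(rows_desc)
print(cols_desc)
```

Both calls return new objects; frame itself keeps its original order.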

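Besides sorting by index, you can also sort by value.

To sort a Series by its values, use the sort_values method. In older pandas versions this method was called order.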

>>> obj = Series([3,4,1,6])

>>> obj

0    3

1    4

2    1

3    6

dtype: int64

>>> obj.sort_values()

2    1

0    3

1    4

3    6

dtype: int64

When sorting, missing values (NaN) are placed at the end by default.
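
A small sketch of how missing values behave, using made-up data. The na_position argument of sort_values moves them to the front instead:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, np.nan, 7, np.nan, -3, 2])

nan_last = s.sort_values()                      # default: NaN at the end
nan_first = s.sort_values(na_position="first")  # NaN at the beginning

print(nan_last)
print(nan_first)
```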

With a DataFrame, you may want to sort by the values in one or more columns.

>>> frame = DataFrame({'a':[4,7,-3,2],'b':[1,0,0,1]})

>>> frame

   a  b

0  4  1

1  7  0

2 -3  0

3  2  1

>>> frame.sort_index(by='a')  # the by argument is deprecated here and removed in later pandas releases; use sort_values instead

__main__:1: FutureWarning: by argument to sort_index is deprecated, please use .sort_values(by=...)

   a  b

2 -3  0

3  2  1

0  4  1

1  7  0

>>> frame.sort_values(by='a')

   a  b

2 -3  0

3  2  1

0  4  1

1  7  0

>>>

To sort by multiple columns, pass a list of column names:

>>> frame.sort_values(by=['b','a'])

   a  b

2 -3  0

1  7  0

3  2  1

0  4  1
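
When sorting by several columns, ascending may also be a list with one flag per column. This sketch sorts b in ascending order and breaks ties by a in descending order:

```python
import pandas as pd

frame = pd.DataFrame({"a": [4, 7, -3, 2], "b": [1, 0, 0, 1]})

# One ascending flag per sort key: b ascending, then a descending within each b.
mixed = frame.sort_values(by=["b", "a"], ascending=[True, False])
print(mixed)
```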

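Ranking is closely related to sorting: the values are sorted, and each one is then assigned a rank from 1 up to the number of valid values. By default, tied values each receive the mean of the ranks they would otherwise occupy.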

>>> obj = Series([7,-5,7,4,2,0,4])

>>> obj

0    7

1   -5

2    7

3    4

4    2

5    0

6    4

dtype: int64

>>> obj.rank()

0    6.5

1    1.0

2    6.5

3    4.5

4    3.0

5    2.0

6    4.5

dtype: float64

>>>

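Ties can also be broken by the order in which the values appear in the original data: among equal values, the one that appears first gets the lower rank. This is done with the method option, here method='first'.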

>>> obj.rank(method='first')

0    6.0

1    1.0

2    7.0

3    4.0

4    3.0

5    2.0

6    5.0

dtype: float64

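Ranking in descending order is also supported: pass ascending=False.

For a DataFrame, rank by default (axis=0) ranks the values within each column, going down the rows. With axis=1 it ranks the values within each row, going across the columns.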
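
The two axes can be compared on a small example. The data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "b": [4.3, 7.0, -3.0, 2.0],
    "a": [0, 1, 0, 1],
    "c": [-2.0, 5.0, 8.0, -2.5],
})

within_columns = df.rank()     # axis=0: each column is ranked on its own
within_rows = df.rank(axis=1)  # axis=1: each row is ranked on its own

print(within_columns)
print(within_rows)
```

In the first row, for example, c holds the smallest value and b the largest, so the axis=1 ranks for that row are b=3, a=2, c=1.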

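The method option accepts the following values:

method	Description
average	Default: assign each value in a tied group the average rank
max	Use the largest rank of the whole group
min	Use the smallest rank of the whole group
first	Assign ranks in the order the values appear in the original data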
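
The tie-breaking methods can be compared side by side on the Series used earlier:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, plus a descending ranking for comparison.
comparison = pd.DataFrame({
    "average": obj.rank(),
    "min": obj.rank(method="min"),
    "max": obj.rank(method="max"),
    "first": obj.rank(method="first"),
    "average_desc": obj.rank(ascending=False),
})
print(comparison)
```

The two 7s occupy ranks 6 and 7, so they get 6.5 under average, 6 under min, 7 under max, and 6 and 7 in order of appearance under first.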

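Axis indexes with duplicate values

Many pandas functions require unique labels, but uniqueness is not enforced.

The index's is_unique attribute tells you whether its labels are unique.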

>>> obj = Series(range(5), index=['a','a','b','b','c'])

>>> obj

a    0

a    1

b    2

b    3

c    4

dtype: int64

>>> obj.index.is_unique
False

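When selecting from an index with duplicate labels, a label that matches several entries returns a Series, while a label that matches a single entry returns a scalar.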

>>> obj['a']

a    0

a    1

dtype: int64

>>> obj['c']

4
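
When duplicate labels are unwanted, they can be located with Index.duplicated and then either dropped or aggregated. This is one possible approach, not something the original text prescribes.

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

dup_mask = obj.index.duplicated()    # True for every repeat after the first occurrence
first_only = obj[~dup_mask]          # keep the first entry for each label
totals = obj.groupby(level=0).sum()  # or combine the entries that share a label

print(first_only)
print(totals)
```

After either step the index is unique, so selecting a single label always returns a scalar.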

The same applies to a DataFrame.

If the label matches several rows, the result is still a DataFrame; otherwise it is a Series.

>>> df = DataFrame(np.random.randn(5,3),index=['a','a','b','b','c'])

>>> df.loc['a']

          0         1         2

a -0.757846  0.713964 -0.674956

a  0.198044  1.093223 -0.342281

>>> df.loc['c']

0   -2.647372

1   -0.526367

2   -0.296859

Name: c, dtype: float64

>>> type(df.loc['a'])

<class 'pandas.core.frame.DataFrame'>

>>> type(df.loc['c'])

<class 'pandas.core.series.Series'>
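
Code that needs the same return type regardless of how many rows match can pass the label inside a list, which always yields a DataFrame. A sketch with deterministic values in place of the random ones:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame above (an assumption for illustration).
df = pd.DataFrame(np.arange(15).reshape(5, 3), index=["a", "a", "b", "b", "c"])

single = df.loc["c"]      # one matching row, scalar label: Series
as_frame = df.loc[["c"]]  # list of labels: always a DataFrame, here with one row
repeated = df.loc[["a"]]  # two matching rows: DataFrame with two rows

print(type(single).__name__, type(as_frame).__name__, type(repeated).__name__)
```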
