Python的工具包[1] -> pandas数据预处理 -> pandas 库及使用总结

pandas数据预处理 / pandas data pre-processing

关于 pandas
pandas 库
pandas 基本操作
pandas 计算
pandas 的 Series
pandas 常用函数
补充内容

1 关于pandas / About pandas

Pandas起源

Python Data Analysis Library或pandas是基于NumPy的一种工具，该工具是为了解决数据分析任务而创建的。Pandas 纳入了大量库和一些标准的数据模型，提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。

Pandas是python的一个数据分析包，最初由AQR Capital Management于2008年4月开发，并于2009年底开源出来，目前由专注于Python数据包开发的PyData开发team继续开发和维护，属于PyData项目的一部分。Pandas最初被作为金融数据分析工具而开发出来，因此，pandas为时间序列分析提供了很好的支持。 Pandas的名称来自于面板数据（panel data）和python数据分析（data analysis）。panel data是经济学中关于多维数据集的一个术语，在Pandas中也提供了panel的数据类型。

Pandas中的数据结构

Series:

一维数组，与Numpy中的一维Array类似。二者与Python基本的数据结构List也很相近，其区别是，List中的元素可以是不同的数据类型，而Array和Series中则只允许存储相同的数据类型，这样可以更有效的使用内存，提高运算效率。

Time- Series:

以时间为索引的Series。

DataFrame:

二维的表格型数据结构。很多功能与R中的data.frame类似。可以将DataFrame理解为Series的容器。以下的内容主要以DataFrame为主。

Panel:

三维的数组，可以理解为DataFrame的容器。

Pandas中一般的数据结构构成为DataFrame -> Series -> ndarray

2 pandas库 / pandas Library

环境安装:

pip install pandas

2.1 常量 / Constants

pass

2.2 函数 / Function

2.2.1 read_csv()函数

函数调用: info = pd.read_csv(filename)

函数功能:读取指定的csv文件，生成一个包含csv数据的DataFrame

传入参数: filename

filename: str类型，需要读取的文件名

返回参数: info

info: DataFrame类型，读取文件生成的DataFrame

类似方法还有: read_excel / read_json / read_sql / read_html等

2.2.2 isnull()函数

函数调用: bool = pd.isnull(obj)

函数功能:返回一个包含数据是否是null的信息数据

传入参数: obj

obj: DataFrame/Series类型，待判断的数据

返回参数: bool

bool: DataFrame/Series类型，返回的判断结果，True表示null，False则不是

2.2.3 to_datetime()函数

函数调用: date = pd.to_datetime(arg)

函数功能:将传入的数据转换成日期数据格式返回

传入参数: arg

arg: int/float/srting/datetime/list/tuple/1-d array/Series类型，argument，可传入一维数组或Series，0.18.1版本中加入DataFrame和dict-like结构

返回参数: date

date: 返回的数据类型由传入的参数确定

Note: pandas中通过to_datetime函数转换的而成的数据其dtype为datetime64[ns]，该数据存在的Series可以通过.dt.month/year/day获取所需要的日期信息

2.3 类 / Class

2.3.1 DataFrame类

类实例化：df = pd.DataFrame(data, index=) / pd.read_xxx(file_name)

类的功能:用于生成DataFrame

传入参数: data, index / file_name

data: ndarray类型，包含需要构建成DataFrame的数据(二维)

index: Series类型，决定作为索引的列参数

file_name: str类型，需要读取的文件名

返回参数: df

df: DataFrame类型，生成的DataFrame

2.3.1.1 dtypes属性

属性调用: fmt = df.dtypes

属性功能: 返回数据结构中每列的数据类型(由于是多个，使用dtypes，numpy中单个，使用dtype)

属性参数: fmt

fmt: Series类型，包含每个数据值的数据类型，index为列名，value为类型，其中，object类型相当于Python中的string

2.3.1.2 columns属性

属性调用: index_name = df.columns

属性功能: 返回数据结构中每列的列名

属性参数: index_name

Index_name: Index类型，<class 'pandas.core.indexes.base.Index'>，包含每列的列名

2.3.1.3 shape属性方法

属性调用: shp = df.shape

属性功能: 返回数据结构的行列参数

属性参数: shp

shp: tuple类型，(row, column)，返回行列数

2.3.1.4 loc属性

属性调用: index = df.loc

属性功能: 返回一个index的类

属性参数: index

index: obj类型，<class 'pandas.core.indexing._LocIndexer'>，可用于切片获取数据信息的DataFrame，如index[0]获取第一行，index[3:7]获取3-7行的数据

2.3.1.5 head()方法

函数调用: hdf = df.head(num=5)

函数功能: 返回csv列表中的前num行数据

传入参数: num

num: int类型，需要获取的行数

返回参数: hdf

hdf: DataFrame类型，原数据的前num行数据

2.3.1.6 tail()方法

函数调用: tdf = df.tail(num=5)

函数功能: 返回csv列表中的后num行数据

传入参数: num

num: int类型，需要获取的行数

返回参数: tdf

tdf: DataFrame类型，原数据的后num行数据

2.3.1.7 describe()方法

函数调用: ddf = df.describe()

函数功能: 返回csv列表中每个列的一些统计描述参数

返回参数: 无

返回参数: ddf

ddf: DataFrame类型，包括的信息有，每一列的数量count，均值mean，标准差std，最小值min，1/4位数25%，中位数50%，3/4位数75%，最大值max

2.3.1.8 sort_values()方法

函数调用: sdf = df.sort_values(by, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’)

函数功能: 返回按参数排序的DataFrame

传入参数: by, axis, ascending, inplace, kind, na_position

by: str类型，DataFrame的行/列名

axis: int类型，0按列(第一轴)sort，1按行(最后轴)sort

ascending: bool类型，True为升序排列， False为降序排列

inplace: bool类型，True则修改原DataFrame，False则返回新的DataFrame

kind: str类型，确定sort的排序算法，包括{‘quicksort’, ‘mergesort’, ‘heapsort’}

na_position: str类型，确定na数据存在的位置，‘first’/‘last’

返回参数: sdf

sdf: DataFrame类型，重排后的DataFrame

2.3.1.9 mean ()方法

函数调用: mdf = df.mean(axis=0)

函数功能: 返回存储所有非NaN的值的平均值DataFrame

传入参数: axis

axis: int类型，0按列(第一轴)sort，1按行(最后轴)sort

返回参数: mdf

mdf: DataFrame类型，存储均值的数据类型为float

2.3.1.10 pivot_table ()方法

函数调用: cdf = df.pivot_table(index=, values=, aggfunc=)

函数功能: 根据index将数据分组，对于values列的值（相同类型）执行aggfunc函数

传入参数: index, values, aggfunc

index: str类型，进行分组的列的列名

values: str/list类型，需要计算的列的列名，多个则使用list

aggfunc: method类型，需要调用的方法

返回参数: cdf

cdf: DataFrame类型，通过自定义函数运算后得到的DataFrame

2.3.1.11 dropna ()方法

函数调用: ddf = df.dropna(axis=0, how=’any’, thresh=None, subset=None, inplace=False)

函数功能: 根据要求删除带有NaN值的行列

传入参数: axis, how, thresh, subset, inplace

axis: int/str类型，搜索方向，0/‘index’为行搜索，1/‘columns’为列搜索

how: str类型，‘any’只要出现NA值就删除该行/列数据，‘all’所有值都是NA才删除

thresh: int/None类型，表示对有效数据数量的最小要求（为2则要求该行/列至少2个有效非NA数据存在）

subset: str/list类型，表示在特定子集中寻找NA

inplace: bool类型，表示是否在原数据操作，True修改原数据，False返回新数据

返回参数: cdf

cdf: DataFrame类型，通过删除NA函数运算后得到的DataFrame

2.3.1.12 reset_index ()方法

函数调用: rdf = df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill=’’)

函数功能: 重置（一般为经过排序后的）DataFrame的序号

传入参数: level, drop, inplace, col_level, col_fill

level: int/str/tuple/list类型，Only remove the given levels from the index. Removes all levels by default

drop: bool类型，是否删除原始的index列，True删除，False保留

inplace: bool类型，是否在原数据上操作

col_level: int/str类型，If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level

col_fill: obj类型，If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.

返回参数: rdf

rdf: DataFrame类型，通过重排index后的DataFrame

2.3.1.13 set_index ()方法

函数调用: sdf = df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

函数功能: 根据现有的columns参数重新设置index索引

传入参数: keys, drop, append, inplace, verify_integrity

keys: str类型，需要作为索引的列名

drop: bool类型，是否删除作为索引的列，True删除，False保留

append: bool类型，是否添加默认的index(序号索引)

inplace: bool类型，是否在原数据上操作

verify_integrity: bool类型，Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method

返回参数: sdf

sdf: DataFrame类型，通过重设index后的DataFrame

2.3.1.14 apply ()方法

函数调用: re = df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

函数功能: 对DataFrame相应的方向使用自定义函数

传入参数: func, axis, broadcast, raw, reduce, args, **kwds

func: method类型，用于各行/列的函数

axis: int/str类型，0/‘index’对每列使用函数，1/‘column’对每行使用函数

broadcast: bool类型，For aggregation functions, return object of same size with values propagated

raw: bool类型，If False, convert each row or column into a Series. If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy eduction function this will achieve much better performance

reduce: bool/None类型，Try to apply reduction procedures. If the DataFrame is empty, apply will use reduce to determine whether the result should be a Series or a DataFrame. If reduce is None (the default), apply's return value will be guessed by calling func an empty Series (note: while guessing, exceptions raised by func will be ignored). If reduce is True a Series will always be returned, and if False a DataFrame will always be returned

args: tuple类型，Positional arguments to pass to function in addition to the array/series

**kwds: 其余关键字参数将会被当做参数传给调用函数

返回参数: rdf

rdf: DataFrame类型，通过重排index后的DataFrame

2.3.1.15 ix属性

属性调用: ix_obj = df.ix

属性功能: 返回一个index类的数据

属性参数: ix_obj

ix_obj: obj类型，<class 'pandas.core.indexing._IXIndexer'>

Note: 后续可通过ix_obj[rows, cols]获取DataFrame或Series，rows/cols可以是需要取的行索引/列名

2.3.2 Series类

类实例化：sr = pd.Series(data, index=) / df[colomn_name]

类的功能:用于生成Series

传入参数: data, index / column_name

data: ndarray类型，包含需要构建成Series的数据(一维)

index: Series类型，决定作为索引的列参数

column_name: str类型，需要获取Series的列名

返回参数: sr

sr: Series类型，生成的Series

2.3.2.1 values属性

属性调用: values = sr.values

属性功能: 返回Series的所有value值

属性参数: values

values: ndarray类型，Series的所有值形成的一维ndarray

2.3.2.2 tolist()方法

函数调用: list =sr.tolist()

函数功能:将Series或Index类的数据变成list形式返回

传入参数: 无

返回参数: list

list: list类型，返回的数据列表

2.3.2.3 max/min()方法

函数调用: value =sr.max/min()

函数功能:获取Series中的最大/最小值

传入参数: 无

返回参数: value

value: int/str等类型，返回的最值

2.3.2.4 sort_values()方法

函数调用: ssr = sr.sort_values(axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’)

函数功能: 返回按参数排序的Series

传入参数: axis, ascending, inplace, kind, na_position

axis: int类型，0按列(第一轴)sort，1按行(最后轴)sort

ascending: bool类型，True为升序排列， False为降序排列

inplace: bool类型，True则修改原DataFrame，False则返回新的DataFrame

kind: str类型，确定sort的排序算法，包括{‘quicksort’, ‘mergesort’, ‘heapsort’}

na_position: str类型，确定na数据存在的位置，‘first’/‘last’

返回参数: ssr

ssr: Series类型，重排后的Series

2.3.2.5 mean ()方法

函数调用: msr = sr.mean()

函数功能: 返回存储所有非NaN的值的平均值Series

传入参数: 无

返回参数: msr

msr: Series类型，存储均值的数据类型为float

2.3.2.6 reset_index ()方法

函数调用: rsr = sr.reset_index(level=None, drop=False, name=None, inplace=False)

函数功能: 重置（一般为经过排序后的）Series的序号

传入参数: level, drop, name, inplace

level: int/str/tuple/list类型，Only remove the given levels from the index. Removes all levels by default

drop: bool类型，是否删除原始的index列，True删除，False保留

name: obj类型，The name of the column corresponding to the Series values

inplace: bool类型，是否在原数据上操作

返回参数: rsr

rsr: Series类型，通过重排index后的Series

2.3.2.7 value_counts ()方法

函数调用: csr = sr.value_counts(dropna=True)

函数功能: 计算Series中各个values值的数量

传入参数: dropna

dropna: bool类型，是否计算NA的数量，True不计算，False计算

返回参数: csr

csr: Series类型，各数据值为索引，数量为value的Series

3 pandas基本操作

首先对csv文件进行读取操作，利用read_csv函数，值得注意的是，存储的csv文件必须利用Excel另存为的方式获得，而不能以修改后缀名的方法得到。

 import pandas as pd

 # info = pd.read_csv('info.csv', encoding='latin1')

 # info = pd.read_csv('info.csv', encoding='ISO-8859-1')

 # info = pd.read_csv('info.csv', encoding='cp1252')

 info = pd.read_csv('info.csv')

 # Get the info of whole csv list, and the info of row and column

 print(info)

输出结果为

     No.       Type  Info      Number Rank Mark.

0   1001   BUTTER_1   NaN    4.000000    A    cc

1   1002   BUTTER_2   NaN         NaN    C    dd

2   1003   BUTTER_3   NaN         NaN  NaN    ff

3   1004   BUTTER_4   NaN         NaN  NaN   NaN

4   1005   BUTTER_5    df  543.000000    F    cx

5   1006   BUTTER_6    fa  345.000000    A    cc

6   1007   BUTTER_7   jhf   67.000000    S    dd

7   1008   BUTTER_8    ad  567.000000    S    ff

8   1009   BUTTER_9  gdfs   34.000000    C    aa

9   1010  BUTTER_10  vczx   34.000000    C    cx

10  1011  BUTTER_11    as   89.000000    E    cc

11  1012  BUTTER_12    cd   90.000000    D    dd

12  1013  BUTTER_13   qwe   14.000000    S    ff

13  1014    WATER_1   asd  186.635198    A    aa

14  1015    WATER_2    as  222.000000    B    cc

15  1016    WATER_3    fa  193.026806    A    cc

16  1017    WATER_4   jhf  196.222611    C    dd

17  1018    WATER_5    ad  199.418415    B    ff

18  1019    WATER_6  gdfs  202.614219    D    aa

19  1020    WATER_7  vczx  205.810023    F    cx

20  1021    WATER_8    as  209.005827    A    cc

21  1022    WATER_9    cd  212.201632    S    dd

22  1023   WATER_10   qwe  215.397436    S    ff

23  1024   WATER_11   asd  218.593240    C    aa

24  1025   WATER_12    df  221.789044    C    cx

25  1026   WATER_13    fa  224.984848    E    cc

26  1027   WATER_14   jhf  228.180653    D    dd

27  1028   WATER_15    ad  231.376457    S    ff

28  1029   WATER_16  gdfs  234.572261    A    aa

29  1030   WATER_17  vczx  237.768065    B    cx

..   ...        ...   ...         ...  ...   ...

70  1071  CHEESE_11    as  368.796037    E    cc

71  1072  CHEESE_12    cd  371.991842    D    dd

72  1073  CHEESE_13   qwe  375.187646    S    ff

73  1074  CHEESE_14   asd  378.383450    A    aa

74  1075  CHEESE_15    df  381.579254    B    cx

75  1076  CHEESE_16    fa  384.775058    A    cc

76  1077  CHEESE_17   jhf  387.970863    C    dd

77  1078  CHEESE_18    ad  391.166667    B    ff

78  1079  CHEESE_19  gdfs  394.362471    D    aa

79  1080  CHEESE_20  vczx  397.558275    F    cx

80  1081  CHEESE_21    as  400.754079    A    cc

81  1082  CHEESE_22    cd  403.949883    S    dd

82  1083  CHEESE_23   qwe  407.145688    S    ff

83  1084  CHEESE_24   asd  410.341492    C    aa

84  1085  CHEESE_25    df  413.537296    C    cx

85  1086     MILK_1    fa  416.733100    E    cc

86  1087     MILK_2   jhf  419.928904    D    dd

87  1088     MILK_3    ad  423.124709    S    ff

88  1089     MILK_4  gdfs  426.320513    A    aa

89  1090     MILK_5  vczx  429.516317    B    cx

90  1091     MILK_6    as  432.712121    A    cc

91  1092     MILK_7    cd  435.907925    C    dd

92  1093     MILK_8   qwe  439.103730    B    ff

93  1094     MILK_9   asd  442.299534    D    aa

94  1095    MILK_10    df  445.495338    F    cx

95  1096    MILK_11    fa  448.691142    A    cc

96  1097    MILK_12   jhf  451.886946    S    dd

97  1098    MILK_13    ad  455.082751    S    ff

98  1099    MILK_14  gdfs  458.278555    C    aa

99  1100    MILK_15  vczx  461.474359    C    cx

[100 rows x 6 columns]

可以看到，pandas已经将csv文件中的数据成功导入

接着可以查看导入的数据类型

 # Get the type of info

 print(type(info))       # <class 'pandas.core.frame.DataFrame'>

 print('-----------')

 # Get the type of each column(The object dtype equal to the string type in python)

 print(info.dtypes)      ''' No.         int64

                             Type       object

                             Info       object

                             Number    float64

                             Rank       object

                             Mark.      object

                             dtype: object '''

最后还可以利用基本函数获取前/后 n 行，列名信息以及基本描述等

 # Get the first x row of csv list, default is 5

 print(info.head(7))

 print('-----------')

 # Get the last x row of csv list, default is 5

 print(info.tail(7))

 print('-----------')

 # Get the name of each column

 print(info.columns)

 print('-----------')

 # Get the shape of csv list

 print(info.shape)

 print('-----------')

 # Get the statistics parameter of cvs list(for digit data)

 # Such as count, mean, standard deviation, min, 25%, 50%, 75%, max

 print(info.describe())

输出结果

    No.      Type Info  Number Rank Mark.

0  1001  BUTTER_1  NaN     4.0    A    cc

1  1002  BUTTER_2  NaN     NaN    C    dd

2  1003  BUTTER_3  NaN     NaN  NaN    ff

3  1004  BUTTER_4  NaN     NaN  NaN   NaN

4  1005  BUTTER_5   df   543.0    F    cx

5  1006  BUTTER_6   fa   345.0    A    cc

6  1007  BUTTER_7  jhf    67.0    S    dd

-----------

     No.     Type  Info      Number Rank Mark.

93  1094   MILK_9   asd  442.299534    D    aa

94  1095  MILK_10    df  445.495338    F    cx

95  1096  MILK_11    fa  448.691142    A    cc

96  1097  MILK_12   jhf  451.886946    S    dd

97  1098  MILK_13    ad  455.082751    S    ff

98  1099  MILK_14  gdfs  458.278555    C    aa

99  1100  MILK_15  vczx  461.474359    C    cx

-----------

Index(['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.'], dtype='object')

-----------

(100, 6)

-----------

               No.      Number

count   100.000000   97.000000

mean   1050.500000  309.401389

std      29.011492  110.975188

min    1001.000000    4.000000

25%    1025.750000  240.963869

50%    1050.500000  317.663170

75%    1075.250000  391.166667

max    1100.000000  567.000000

4 pandas计算

对于pandas，由于其基本结构是基于numpy的ndarray，因此numpy的基本计算操作对于pandas的DataFrame及Series也都适用。

下面是pandas的一些基本计算方法的示例，

完整代码

 import pandas as pd

 info = pd.read_csv('info.csv')

 # Get the certain row of csv list

 print(info.loc[0])

 print(info.loc[3:7])

 print('----------')

 # Get certain column(columns) by column name(name list)

 print(info['Type'])

 print(info[['Type', 'No.']])

 # Get the column name and save it as a list

 col_names = info.columns.tolist()

 print(col_names)

 # Filter off the column name that end with '.'

 dotList = []

 for n in col_names:

     if n.endswith('.'):

         dotList.append(n)

 newList = info[dotList]

 print(newList)

 # Operation for column will act to each element as numpy does

 print(info['Number'] * 10)

 # Operation for two csv with same shape will act each corresponding element

 x = info['Number']

 y = info['No.']

 print(x+y)

 # Act for string

 x = info['Rank']

 y = info['Mark.']

 print(x+y)

 # Add a column after the tail column(the dimension of new one should be same as origin)

 print(info.shape)

 info['New'] = x+y

 print(info.shape)

 print('----------')

 # Get the max/min value of a column

 print(info['Number'].max())

 print(info['Number'].min())

 num = info['Number']

 num_null_true = pd.isnull(num)

 # If these is a null value in DataFrame, the calculated result will be NaN

 print(sum(info['Number'])/len(info['Number'])) # return nan

 # Use the DataFrame == False to reverse the DataFrame

 good_value = info['Number'][num_null_true == False]

 print(sum(good_value)/len(good_value))

 print(good_value.mean())

 # mean method can filter off the missing data automatically

 print(info['Number'].mean())

 print('---------')

分段解释

首先导入pandas及数据文件，利用loc获取pandas的某行数据，可以使用类似list的切片操作

 import pandas as pd

 info = pd.read_csv('info.csv')

 # Get the certain row of csv list

 print(info.loc[0])

 print(info.loc[3:7])

 print('----------')

 # Get certain column(columns) by column name(name list)

 print(info['Type'])

 print(info[['Type', 'No.']])

结果如下，内容较长

No.           1001

Type      BUTTER_1

Info           NaN

Number           4

Rank             A

Mark.           cc

Name: 0, dtype: object

    No.      Type Info  Number Rank Mark.

3  1004  BUTTER_4  NaN     NaN  NaN   NaN

4  1005  BUTTER_5   df   543.0    F    cx

5  1006  BUTTER_6   fa   345.0    A    cc

6  1007  BUTTER_7  jhf    67.0    S    dd

7  1008  BUTTER_8   ad   567.0    S    ff

----------

0      BUTTER_1

1      BUTTER_2

2      BUTTER_3

3      BUTTER_4

4      BUTTER_5

5      BUTTER_6

6      BUTTER_7

7      BUTTER_8

8      BUTTER_9

9     BUTTER_10

10    BUTTER_11

11    BUTTER_12

12    BUTTER_13

13      WATER_1

14      WATER_2

15      WATER_3

16      WATER_4

17      WATER_5

18      WATER_6

19      WATER_7

20      WATER_8

21      WATER_9

22     WATER_10

23     WATER_11

24     WATER_12

25     WATER_13

26     WATER_14

27     WATER_15

28     WATER_16

29     WATER_17

        ...

70    CHEESE_11

71    CHEESE_12

72    CHEESE_13

73    CHEESE_14

74    CHEESE_15

75    CHEESE_16

76    CHEESE_17

77    CHEESE_18

78    CHEESE_19

79    CHEESE_20

80    CHEESE_21

81    CHEESE_22

82    CHEESE_23

83    CHEESE_24

84    CHEESE_25

85       MILK_1

86       MILK_2

87       MILK_3

88       MILK_4

89       MILK_5

90       MILK_6

91       MILK_7

92       MILK_8

93       MILK_9

94      MILK_10

95      MILK_11

96      MILK_12

97      MILK_13

98      MILK_14

99      MILK_15

Name: Type, Length: 100, dtype: object

         Type   No.

0    BUTTER_1  1001

1    BUTTER_2  1002

2    BUTTER_3  1003

3    BUTTER_4  1004

4    BUTTER_5  1005

5    BUTTER_6  1006

6    BUTTER_7  1007

7    BUTTER_8  1008

8    BUTTER_9  1009

9   BUTTER_10  1010

10  BUTTER_11  1011

11  BUTTER_12  1012

12  BUTTER_13  1013

13    WATER_1  1014

14    WATER_2  1015

15    WATER_3  1016

16    WATER_4  1017

17    WATER_5  1018

18    WATER_6  1019

19    WATER_7  1020

20    WATER_8  1021

21    WATER_9  1022

22   WATER_10  1023

23   WATER_11  1024

24   WATER_12  1025

25   WATER_13  1026

26   WATER_14  1027

27   WATER_15  1028

28   WATER_16  1029

29   WATER_17  1030

..        ...   ...

70  CHEESE_11  1071

71  CHEESE_12  1072

72  CHEESE_13  1073

73  CHEESE_14  1074

74  CHEESE_15  1075

75  CHEESE_16  1076

76  CHEESE_17  1077

77  CHEESE_18  1078

78  CHEESE_19  1079

79  CHEESE_20  1080

80  CHEESE_21  1081

81  CHEESE_22  1082

82  CHEESE_23  1083

83  CHEESE_24  1084

84  CHEESE_25  1085

85     MILK_1  1086

86     MILK_2  1087

87     MILK_3  1088

88     MILK_4  1089

89     MILK_5  1090

90     MILK_6  1091

91     MILK_7  1092

92     MILK_8  1093

93     MILK_9  1094

94    MILK_10  1095

95    MILK_11  1096

96    MILK_12  1097

97    MILK_13  1098

98    MILK_14  1099

99    MILK_15  1100

[100 rows x 2 columns]

获取pandas的列名

 # Get the column name and save it as a list

 col_names = info.columns.tolist()

 print(col_names)

结果如下

['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.']

过滤出所有以‘.’结尾的列

 # Filter off the column name that end with '.'

 dotList = []

 for n in col_names:

     if n.endswith('.'):

         dotList.append(n)

 newList = info[dotList]

 print(newList)

基本计算操作会作用于pandas的Series每个值

 # Operation for column will act to each element as numpy does

 print(info['Number'] * 10)

对两个结构形状相同的Series，其运算会作用到每个values上

 # Operation for two csv with same shape will act each corresponding element

 x = info['Number']

 y = info['No.']

 print(x+y)

 # Act for string

 x = info['Rank']

 y = info['Mark.']

 print(x+y)

创建出一个列名为‘New’的新列，值为两个列的值之和

 # Add a column after the tail column(the dimension of new one should be same as origin)

 print(info.shape)

 info['New'] = x+y

 print(info.shape)

 print('----------')

获取Series中的最值

 # Get the max/min value of a column

 print(info['Number'].max())

 print(info['Number'].min())

均值计算的两种方式，

直接求和平均，当计算中有NaN值时，计算的结果将会为NaN
利用mean函数进行计算，mean函数将会过自动滤掉NaN缺失数据

 num = info['Number']

 num_null_true = pd.isnull(num)

 # If these is a null value in DataFrame, the calculated result will be NaN

 print(sum(info['Number'])/len(info['Number'])) # return nan

 # Use the DataFrame == False to reverse the DataFrame

 good_value = info['Number'][num_null_true == False]

 print(sum(good_value)/len(good_value))

 print(good_value.mean())

 # mean method can filter off the missing data automatically

 print(info['Number'].mean())

 print('---------')

5 pandas的Series

下面介绍 pandas 中的数据类型 Series 的一些基本使用方法，

完整代码

 import pandas as pd

 info = pd.read_csv('info.csv')

 # Fetch a series from DataFrame

 rank_series = info['Rank']

 print(type(info)) # <class 'pandas.core.frame.DataFrame'>

 print(type(rank_series)) # <class 'pandas.core.series.Series'>

 print(rank_series[0:5])

 # New a series

 from pandas import Series

 # Build a rank series

 rank = rank_series.values

 print(rank)

 # DataFrame --> Series --> ndarray

 print(type(rank)) # <class 'numpy.ndarray'>

 # Build a type series

 type_series = info['Type']

 types = type_series.values

 # Build a new series based on former two(type and rank)

 # Series(values, index=)

 series_custom = Series(rank, index=types)

 print(series_custom)

 # Fetch Series by key name list

 print(series_custom[['MILK_14', 'MILK_15']])

 # Fetch Series by index

 print(series_custom[0:2])

 # Sorted to Series will return a list by sorted value

 print(sorted(series_custom, key=lambda x: 0 if isinstance(x, str) else x))

 # Re-sort by index for a Series

 original_index = series_custom.index.tolist()

 sorted_index = sorted(original_index)

 sorted_by_index = series_custom.reindex(sorted_index)

 print(sorted_by_index)

 # Series sort function

 print(series_custom.sort_index())

 print(series_custom.sort_values())

 import numpy as np

 # Add operation for Series will add the values for each row(if the dimensions of two series are same)

 print(np.add(series_custom, series_custom))

 # Apply sin funcion to each value

 print(np.sin(info['Number']))

 # Return the max value(return a single value not a Series)

 # If more than one max value exist, only return one

 print(np.max(filter(lambda x: isinstance(x, float), series_custom)))

 # Filter values in range

 criteria_one = series_custom > 'C'

 criteria_two = series_custom < 'S'

 print(series_custom[criteria_one & criteria_two])

分段解释

利用列名从DataFrame中获取一个Series

 import pandas as pd

 info = pd.read_csv('info.csv')

 # Fetch a series from DataFrame

 rank_series = info['Rank']

 print(type(info)) # <class 'pandas.core.frame.DataFrame'>

 print(type(rank_series)) # <class 'pandas.core.series.Series'>

 print(rank_series[0:5])

新建一个Series的方法，先获取一个作为index的列，在获取一个作为values的列，利用Series函数生成新的Series

 # New a series

 from pandas import Series

 # Build a rank series

 rank = rank_series.values

 print(rank)

 # DataFrame --> Series --> ndarray

 print(type(rank)) # <class 'numpy.ndarray'>

 # Build a type series

 type_series = info['Type']

 types = type_series.values

 # Build a new series based on former two(type and rank)

 # Series(values, index=)

 series_custom = Series(rank, index=types)

 print(series_custom)

利用列名列表或索引从DataFrame中获取多个Series

 # Fetch Series by key name list

 print(series_custom[['MILK_14', 'MILK_15']])

 # Fetch Series by index

 print(series_custom[0:2])

利用sorted函数根据values大小重排Series，返回值为一个list

 # Sorted to Series will return a list by sorted value

 print(sorted(series_custom, key=lambda x: 0 if isinstance(x, str) else x))

两种sort方法对Series进行排列

　　1. 获取index索引值，对索引值进行排列，再使用reindex函数获取新的Series

 # Re-sort by index for a Series

 original_index = series_custom.index.tolist()

 sorted_index = sorted(original_index)

 sorted_by_index = series_custom.reindex(sorted_index)

 print(sorted_by_index)

　　2.使用sort_index或sort_values函数

 # Series sort function

 print(series_custom.sort_index())

 print(series_custom.sort_values())

Series的相加/正余弦/max，利用numpy函数，将Series的对应values值进行处理

 import numpy as np

 # Add operation for Series will add the values for each row(if the dimensions of two series are same)

 print(np.add(series_custom, series_custom))

 # Apply sin funcion to each value

 print(np.sin(info['Number']))

 # Return the max value(return a single value not a Series)

 # If more than one max value exist, only return one

 print(np.max(filter(lambda x: isinstance(x, float), series_custom)))

利用True/False列表获取在范围内满足条件的Series

 # Filter values in range

 criteria_one = series_custom > 'C'

 criteria_two = series_custom < 'S'

 print(series_custom[criteria_one & criteria_two])

6 pandas常用函数

下面是一些pandas常用的函数示例

完整代码

 import pandas as pd

 import numpy as np

 info = pd.read_csv('info.csv')

 # Sort value by column

 # inplace is True will sort value base on origin, False will return a new DataFrame

 new = info.sort_values('Mark.', inplace=False, na_position='last')

 print(new)

 # Sorted by ascending order in default(ascending=True)

 # No matter ascending or descending sort, the NaN(NA, missing value) value will be placed at tail

 info.sort_values('Mark.', inplace=True, ascending=False)

 print(info)

 print('---------')

 # Filter off the null row

 num = info['Number']

 # isnull will return a list contains the status of null or not, True for null, False for not

 num_null_true = pd.isnull(num)

 print(num_null_true)

 num_null = num[num_null_true]

 print(num_null) # 12 NaN

 print('---------')

 # pivot_table function can calulate certain para that with same attribute group by using certain function

 # index tells the method which column to group by

 # value is the column that we want to apply the calculation to

 # aggfunc specifies the calculation we want to perform, default function is mean

 avg_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc=np.sum)

 print(avg_by_rank)

 print('---------')

 # Operate to multi column

 sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc=np.sum)

 print(sum_by_rank)

 print('---------')

 # dropna function can drop any row/columns that have null values

 info = pd.read_csv('info.csv')

 # Drop the columns that contain NaN (axis=0 for row)

 drop_na_column = info.dropna(axis=1)

 print(drop_na_column)

 print('---------')

 # Drop the row that subset certains has NaN

 # thresh to decide how many valid value required

 drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.'])

 print(drop_na_row)

 print('---------')

 # Locate to a certain value by its row number(plus 1 for No.) and column name

 print(info)

 row_77_Rank = info.loc[77, 'Rank']

 print(row_77_Rank)

 row_88_Info = info.loc[88, 'Info']

 print(row_88_Info)

 print('---------')

 # reset_index can reset the index for sorted DataFrame

 new_info = info.sort_values('Rank', ascending=False)

 print(new_info[0:10])

 print('---------')

 # drop=True will drop the index column, otherwise will keep former index colunn (default False)

 reset_new_info = new_info.reset_index(drop=True)

 print(reset_new_info[0:10])

 print('---------')

 # Define your own function for pandas

 # Use apply function to implement your own function

 def hundredth_row(col):

     hundredth_item = col.loc[99]

     return hundredth_item

 hundred_row = info.apply(hundredth_row, axis=0)

 print(hundred_row)

 print('---------')

 # Null count

 # The apply function will act to each column

 def null_count(column):

     column_null = pd.isnull(column)

     null = column[column_null]

     return len(null)

 # Passing in axis para 0 to iterate over rows instead of column

 # Note: 0 for act by row but passing by column, 1 for act by column but passing by row

 # Passing by column can act for each column then get row

 # Passing by row can act for each row than get column

 column_null_count = info.apply(null_count, axis=0)

 print(column_null_count)

 print('---------')

 # Example: classify the data by Rank, and calculate the sum for each

 def rank_sort(row):

     rank = row['Rank']

     if rank == 'S':

         return 'Excellent'

     elif rank == 'A':

         return 'Great'

     elif rank == 'B':

         return 'Good'

     elif rank == 'C':

         return 'Pass'

     else:

         return 'Failed'

 # Format a classified column

 rank_info = info.apply(rank_sort, axis=1)

 print(rank_info)

 print('---------')

 # Add the column to DataFrame

 info['Rank_Classfied'] = rank_info

 # Calculate the sum of 'Number' according to 'Rank_Classfied'

 new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc=np.sum)

 print(new_rank_number)

 # set_index will return a new DataFrame that is indexed by values in the specified column

 # And will drop that column(default is True)

 # The column set to be index will not be dropped if drop=False

 index_type = info.set_index('Type', drop=False, append=True)

 print(index_type)

 print('---------')

 # Use string index to slice the DataFrame

 # Note: the index(key) should be unique

 print(index_type['MILK_1':'MILK_7'])

 print('---------')

 print(index_type.loc['MILK_1':'MILK_7'])

 # Value index is available too

 print('---------')

 print(index_type[-15:-8])

 print('---------')

 # Calculate the standard deviation for each element from two different index

 cal_list = info[['Number', 'No.']]

 # np.std([x, y]) --> std value

 # The lambda x is a Series

 # cal_list.apply(lambda x: print(type(x)), axis=1)

 print(cal_list.apply(lambda x: np.std(x), axis=1))

分段解释

首先导入模块，然后利用sort_values函数对DataFrame或Series进行排序操作

 mport pandas as pd

 import numpy as np

 info = pd.read_csv('info.csv')

 # Sort value by column

 # inplace is True will sort value base on origin, False will return a new DataFrame

 new = info.sort_values('Mark.', inplace=False, na_position='last')

 print(new)

 # Sorted by ascending order in default(ascending=True)

 # No matter ascending or descending sort, the NaN(NA, missing value) value will be placed at tail

 info.sort_values('Mark.', inplace=True, ascending=False)

 print(info)

 print('---------')

利用isnull函数对null值的数据进行过滤，可利用Series==False对isnull得到的序列进行反转

 # Filter off the null row

 num = info['Number']

 # isnull will return a list contains the status of null or not, True for null, False for not

 num_null_true = pd.isnull(num)

 print(num_null_true)

 num_null = num[num_null_true]

 print(num_null) # 12 NaN

 print('---------')

利用pivot_table函数对相同属性分组的数据进行指定函数的计算

 # pivot_table function can calulate certain para that with same attribute group by using certain function

 # index tells the method which column to group by

 # value is the column that we want to apply the calculation to

 # aggfunc specifies the calculation we want to perform, default function is mean

 avg_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc=np.sum)

 print(avg_by_rank)

 print('---------')

 # Operate to multi column

 sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc=np.sum)

 print(sum_by_rank)

 print('---------')

利用dropna函数删除空值数据

 # dropna function can drop any row/columns that have null values

 info = pd.read_csv('info.csv')

 # Drop the columns that contain NaN (axis=0 for row)

 drop_na_column = info.dropna(axis=1)

 print(drop_na_column)

 print('---------')

 # Drop the row that subset certains has NaN

 # thresh to decide how many valid value required

 drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.'])

 print(drop_na_row)

 print('---------')

利用loc对数据进行定位

 # Locate to a certain value by its row number(plus 1 for No.) and column name

 print(info)

 row_77_Rank = info.loc[77, 'Rank']

 print(row_77_Rank)

 row_88_Info = info.loc[88, 'Info']

 print(row_88_Info)

 print('---------')

利用reset_index函数对索引进行重排

 # reset_index can reset the index for sorted DataFrame

 new_info = info.sort_values('Rank', ascending=False)

 print(new_info[0:10])

 print('---------')

 # drop=True will drop the index column, otherwise will keep former index colunn (default False)

 reset_new_info = new_info.reset_index(drop=True)

 print(reset_new_info[0:10])

 print('---------')

利用apply函数运行自定义函数

 # Define your own function for pandas

 # Use apply function to implement your own function

 def hundredth_row(col):

     hundredth_item = col.loc[99]

     return hundredth_item

 hundred_row = info.apply(hundredth_row, axis=0)

 print(hundred_row)

 print('---------')

 # Null count

 # The apply function will act to each column

 def null_count(column):

     column_null = pd.isnull(column)

     null = column[column_null]

     return len(null)

 # Passing in axis para 0 to iterate over rows instead of column

 # Note: 0 for act by row but passing by column, 1 for act by column but passing by row

 # Passing by column can act for each column then get row

 # Passing by row can act for each row than get column

 column_null_count = info.apply(null_count, axis=0)

 print(column_null_count)

 print('---------')

 # Example: classify the data by Rank, and calculate the sum for each

 def rank_sort(row):

     rank = row['Rank']

     if rank == 'S':

         return 'Excellent'

     elif rank == 'A':

         return 'Great'

     elif rank == 'B':

         return 'Good'

     elif rank == 'C':

         return 'Pass'

     else:

         return 'Failed'

 # Format a classified column

 rank_info = info.apply(rank_sort, axis=1)

 print(rank_info)

 print('---------')

添加一个column到DataFrame并进行计算处理

 # Add the column to DataFrame

 info['Rank_Classfied'] = rank_info

 # Calculate the sum of 'Number' according to 'Rank_Classfied'

 new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc=np.sum)

 print(new_rank_number)

利用set_index函数设置新的索引，利用索引进行切片操作，切片如果是列名字符串，将返回两个列名索引之间所有的数据

 # set_index will return a new DataFrame that is indexed by values in the specified column

 # And will drop that column(default is True)

 # The column set to be index will not be dropped if drop=False

 index_type = info.set_index('Type', drop=False, append=True)

 print(index_type)

 print('---------')

 # Use string index to slice the DataFrame

 # Note: the index(key) should be unique

 print(index_type['MILK_1':'MILK_7'])

 print('---------')

 print(index_type.loc['MILK_1':'MILK_7'])

 # Value index is available too

 print('---------')

 print(index_type[-15:-8])

 print('---------')

对两个不同索引内的元素分别进行标准差计算

 # Calculate the standard deviation for each element from two different index

 cal_list = info[['Number', 'No.']]

 # np.std([x, y]) --> std value

 # The lambda x is a Series

 # cal_list.apply(lambda x: print(type(x)), axis=1)

 print(cal_list.apply(lambda x: np.std(x), axis=1))

7 补充内容 / Complement

1. pandas许多函数底层是基于numpy进行的，pandas一个函数可能调用了numpy的多个函数进行实现;

2. object dtype 和 Python中的string相同;

3. pandas中如果不指定列名则默认文件中第一行为列名;

4. 基本结构包括DataFrame和Series，DataFrame可以分解为Series，DataFrame是由一系列的Series组成的，DataFrame相当于矩阵，Series相当于行或者列。

相关阅读

1. numpy 的使用

秒客网

Python的工具包[1] -> pandas数据预处理 -> pandas 库及使用总结

相关文章