Python Data Analysis: Pandas (Series and DataFrame)

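Introduction to pandas:

    pandas is a powerful Python toolkit for data analysis, built on top of NumPy.

Main features of pandas:
    1) Data structures with built-in index alignment: DataFrame and Series
    2) Integrated time-series functionality
    3) A rich set of mathematical operations
    4) Flexible handling of missing data

Installing and importing in Python:
    Install: pip install pandas
    Import: import pandas as pd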

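Creating a Series:

Creating an empty Series

import pandas as pd

# Pass dtype explicitly: a bare pd.Series() no longer defaults to float64 in recent pandas versions
s = pd.Series(dtype='float64')

print(s)  # Series([], dtype: float64)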

Passing a list

data = ['a', 'b', 'c', 'd']

res = pd.Series(data)

print(res)

'''Result
0    a
1    b
2    c
3    d
dtype: object

No index was passed, so by default pandas assigns the index 0 to len(data)-1, i.e. 0 to 3.
'''

Passing a dictionary

data = {'a': 0, 'b': 1, 'c': 2}

s = pd.Series(data)

print(s)

'''Result
a    0
b    1
c    2
dtype: int64

Note: the dictionary keys are used to build the index.
'''
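When a dictionary and an explicit index are passed together, pandas looks up each index label among the dictionary keys. The following is a minimal sketch of that behaviour; the label 'x' is a made-up key used only to show what happens when a label has no matching entry.

```python
import pandas as pd

data = {'a': 0, 'b': 1, 'c': 2}

# The index selects and reorders entries by key.
# 'x' (hypothetical label) is not a key of data, so its value becomes NaN
# and the Series is upcast from int64 to float64.
s = pd.Series(data, index=['c', 'x', 'a'])
print(s)
```

Because NaN is a floating-point value, the presence of a single missing label is enough to change the dtype of the whole Series.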

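Creating a Series from a scalar:

If the data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

res = pd.Series(0, index=['a', 'b', 'c', 'd'])

print(res)

'''Result
a    0
b    0
c    0
d    0
dtype: int64
'''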

Specifying custom index labels:

res = pd.Series(['a', 'b', 'c', 'd'], index=['a_index', 'b_index', 'c_index', 'd_index'])

print(res)

'''Result
a_index    a
b_index    b
c_index    c
d_index    d
dtype: object
'''

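Accessing data in a Series by position:

Key point: positions are counted from zero, so the first element is stored at position 0.

Viewing the index and the values:

# View the index of the Series
print(res.index)

# View the values of the Series
print(res.values)

# Get a value by position (counting from zero).
# iloc is used because res[0] on a string-labelled index is deprecated in recent pandas.
print(res.iloc[0])  # a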
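To make the difference between positional and label-based access explicit, here is a small sketch that reads the same element both ways. It assumes the same Series as above; the variable names are illustrative only.

```python
import pandas as pd

res = pd.Series(['a', 'b', 'c', 'd'],
                index=['a_index', 'b_index', 'c_index', 'd_index'])

# iloc always interprets its argument as a position (0-based).
first_by_position = res.iloc[0]

# loc always interprets its argument as an index label.
first_by_label = res.loc['a_index']

print(first_by_position, first_by_label)
```

Using iloc and loc instead of plain square brackets removes any ambiguity about whether a key is meant as a position or as a label.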

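Getting the first three values (the end position of the slice is excluded)

res = pd.Series(['a', 'b', 'c', 'd'], index=['a_index', 'b_index', 'c_index', 'd_index'])

# Take the first three values (position 3 is excluded)
print(res[:3])  # the result is itself a Series, so res[:3].values also works

'''Result
a_index    a
b_index    b
c_index    c
dtype: object
'''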

Getting the last three values:

print(res[-3:])

'''Result
b_index    b
c_index    c
d_index    d
dtype: object
'''

Retrieving and setting data by index label:

Modifying a value

res = pd.Series(['a', 'b', 'c', 'd'], index=['a_index', 'b_index', 'c_index', 'd_index'])

print(res)

res['a_index'] = 'new_a'

print(res)

'''Result after the assignment
a_index    new_a
b_index        b
c_index        c
d_index        d
dtype: object
'''

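Copying data with copy() and modifying the copy

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr2 = pd.Series([14, 15, 16], index=['d', 'c', 'a'])

# Use copy() to duplicate part of the Series, then modify the copy
sr3 = sr1[1:].copy()
print(sr3)

# Assign by position with iloc; sr3[0] on a string-labelled index is deprecated in recent pandas
sr3.iloc[0] = 1888
print(sr3)

'''Result
a    13
d    14
dtype: int64
a    1888
d      14
dtype: int64
'''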
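The point of calling copy() is that later changes to the copy do not touch the original Series. A minimal sketch to check this, reusing sr1 from the example above:

```python
import pandas as pd

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])

# Take the last two elements by position and make an independent copy.
sr3 = sr1.iloc[1:].copy()

# Modify the copy only.
sr3.iloc[0] = 1888

print(sr1['a'])  # value in the original Series
print(sr3['a'])  # value in the copy
```

After the assignment the original still holds 13 under label 'a', while the copy holds 1888.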

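Arithmetic:

Start by building two Series. Values are matched by index label, not by position.

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr2 = pd.Series([14, 15, 16], index=['d', 'c', 'a'])

print(sr1 + sr2)

'''Result
a    29
c    27
d    28
dtype: int64
'''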

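Addition with automatic alignment

pandas aligns on index labels automatically: values with the same label are added together, and a label that is missing from either Series yields NaN.

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])

print(sr3)

# Compute the sum sr1 + sr3
print(sr1 + sr3)

'''Result of sr1 + sr3
a    23.0
b     NaN  # sr1 has no label 'b', so the result is NaN
c    32.0
d    25.0
dtype: float64
'''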
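If a missing label should be treated as zero rather than produce NaN, one option is the Series.add method with its fill_value parameter. The sketch below reuses sr1 and sr3 from the example above; treating the missing entry as 0 is an assumption made here for illustration.

```python
import pandas as pd

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])

# fill_value=0 substitutes 0 for a label that exists in only one operand,
# so 'b' becomes 0 + 14 instead of NaN.
total = sr1.add(sr3, fill_value=0)
print(total)
```

The + operator and add() align labels in the same way; add() only adds control over what to do with labels present on one side.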

For Series data, pandas handles NaN values as follows:

# First build some data with a missing value
sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr2 = pd.Series([14, 15, 16], index=['d', 'c', 'a'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])

# Adding the two produces a Series with a missing value
sr4 = sr1 + sr3

print(sr4)

'''Result
a    23.0
b     NaN
c    32.0
d    25.0
dtype: float64
'''

Step 1: pd.isnull(series) and pd.notnull(series) are used to find and filter NaN values.

# isnull returns a boolean Series in which missing values are True
res = pd.isnull(sr4)

print(res)

'''Result
a    False
b     True
c    False
d    False
dtype: bool
'''

# notnull returns a boolean Series in which missing values are False
res = pd.notnull(sr4)

print(res)

'''Result
a     True
b    False
c     True
d     True
dtype: bool
'''
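The boolean Series returned by isnull and notnull can be used directly as a mask. As a small sketch, using the same sr4 as above (rebuilt here so the block runs on its own):

```python
import pandas as pd

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])
sr4 = sr1 + sr3

# Keep only the entries whose value is not NaN.
present = sr4[sr4.notnull()]

# True counts as 1, so summing the isnull mask counts the missing values.
missing_count = int(sr4.isnull().sum())

print(present)
print(missing_count)
```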

Step 2: series.dropna() removes the entries that are NaN. dropna is a method of the Series object, so call it as sr4.dropna(); pd.Series.dropna(sr4) is equivalent but less idiomatic, and there is no top-level pd.dropna() function.

# dropna removes the rows that hold NaN (a Series only has rows)
res = sr4.dropna()

print(res)

'''Result
a    23.0
c    32.0
d    25.0
dtype: float64
'''

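Step 3: series.fillna(value) fills in the missing data with the given value.

# fillna replaces NaN with the given value.
# Filling a float Series with a string changes its dtype to object.
res = sr4.fillna('filled in for NaN')

print(res)

'''Result
a                 23.0
b    filled in for NaN
c                 32.0
d                 25.0
dtype: object
'''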
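In numeric work the fill value is usually a number, so that the Series keeps its float dtype. A minimal sketch, assuming the mean of the existing values is an acceptable replacement (that choice is an illustration, not a recommendation from the original text):

```python
import pandas as pd

sr1 = pd.Series([12, 13, 14], index=['c', 'a', 'd'])
sr3 = pd.Series([11, 20, 10, 14], index=['d', 'c', 'a', 'b'])
sr4 = sr1 + sr3

# mean() skips NaN by default, so this is the mean of 23, 32 and 25.
filled = sr4.fillna(sr4.mean())
print(filled)
```

fillna returns a new Series here; sr4 itself still contains the NaN.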

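Creating a DataFrame

A DataFrame is a two-dimensional, tabular data structure, very similar to a spreadsheet or a MySQL database table, and it contains an ordered set of columns.
A DataFrame can be viewed as a dictionary of Series that share a single index.

Building the table

The simple way (a dictionary of lists)

data = {'name': ['google', 'baidu', 'yahoo'], 'marks': [100, 200, 300], 'price': [1, 2, 3]}

res = pd.DataFrame(data)

print(res)

'''Result (the default index starts at 0)
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3
'''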

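Supplement: building a DataFrame from Series

# Combining Series into a DataFrame; rows are aligned on the index labels
res = pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
                    'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd'])})

print(res)

'''Result
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4
'''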
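Another common input shape is a list of dictionaries, one per row. The sketch below is an illustration only; the rows are made up, and the first one deliberately omits the 'price' key to show how missing keys are handled.

```python
import pandas as pd

# Hypothetical records: each dictionary becomes one row.
rows = [
    {'name': 'google', 'marks': 100},              # no 'price' key
    {'name': 'baidu', 'marks': 200, 'price': 2},
]

df = pd.DataFrame(rows)
print(df)
```

As with the Series example above, a key that is missing from a row shows up as NaN in that row.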

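DataFrame attributes and methods

    1) index: get the row index
    2) T: transpose
    3) columns: get the column index
    4) values: get the values as an array
    5) describe(): get quick summary statistics
    6) sort_index(axis, ..., ascending): sort by row or column index
    7) sort_values(by, axis, ascending): sort by values

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}

res = pd.DataFrame(data)

print(res)

'''The attributes and methods below are demonstrated on this DataFrame
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3
'''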
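The T attribute from the list above is not demonstrated elsewhere in this article, so here is a brief sketch of it on the same data. Transposing swaps the row index and the column index.

```python
import pandas as pd

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)

# After transposing, the old column names become the row index
# and the old row labels 0, 1, 2 become the column names.
transposed = res.T
print(transposed)
```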

index: get the row index

# index: view the row index
print(res.index)    # RangeIndex(start=0, stop=3, step=1)

columns: get the column index

# columns: view the column index
print(res.columns)   # Index(['name', 'marks', 'price'], dtype='object')

values: get the values as an array

# values: view the underlying array of values
print(res.values)

'''Result
[['google' 100 1]
 ['baidu' 200 2]
 ['yahoo' 300 3]]
'''

describe(): get quick summary statistics

# describe() summarises the numeric columns only
print(res.describe())

'''Result
       marks  price
count    3.0    3.0
mean   200.0    2.0
std    100.0    1.0
min    100.0    1.0
25%    150.0    1.5
50%    200.0    2.0
75%    250.0    2.5
max    300.0    3.0
'''
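The result of describe() is itself a DataFrame, so individual statistics can be read with loc. The sketch below also checks that the reported std is the sample standard deviation (ddof=1), which is why 100, 200, 300 gives 100.0.

```python
import pandas as pd

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)

stats = res.describe()

# Read one statistic: row label 'std', column label 'marks'.
marks_std = stats.loc['std', 'marks']

# Same quantity computed directly; ddof=1 is the sample standard deviation.
manual_std = res['marks'].std(ddof=1)

print(marks_std, manual_std)
```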

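sort_index(): sort by row or column index

Parameters: axis=0 sorts by the row index and axis=1 sorts by the column index; ascending=True (the default) sorts in ascending order and ascending=False in descending order.

# axis=0: sort by the row index
res = res.sort_index(axis=0)

print(res)

'''Result of sorting by row index
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3
'''

# axis=1: sort by the column index (column names in alphabetical order)
res = res.sort_index(axis=1, ascending=True)

print(res)

'''Result of sorting by column index
   marks    name  price
0    100  google      1
1    200   baidu      2
2    300   yahoo      3
'''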

sort_values(by, axis, ascending): sort by values

# sort_values(by, axis, ascending): sort by values
data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}

res = pd.DataFrame(data)

# With by set to a column name, axis must be 0: whole rows are reordered
# according to the values in the 'name' column.
res = res.sort_values(by=['name'], axis=0)

print(res)

'''Result of sorting by value
     name  marks  price
1   baidu    200      2
0  google    100      1
2   yahoo    300      3
'''
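sort_values also accepts several keys and a separate direction for each. The following sketch adds a made-up fourth row ('bing') and duplicate marks values so that the second key has a visible effect; none of this data comes from the original example.

```python
import pandas as pd

# Hypothetical data with ties in 'marks' so the second sort key matters.
df = pd.DataFrame({
    'name': ['google', 'baidu', 'yahoo', 'bing'],
    'marks': [100, 200, 100, 200],
    'price': [1, 2, 3, 4],
})

# Sort by marks descending, then by price ascending within equal marks.
ordered = df.sort_values(by=['marks', 'price'], ascending=[False, True])
print(ordered)
```

The original row labels travel with their rows, which is why the index of the result is no longer in 0, 1, 2, 3 order.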

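Specifying the index manually

# Specify the row index manually (the labels mean "first", "second", "third")
res = pd.DataFrame(data, columns=['name', 'marks', 'price'], index=['第一', '第二', '第三'])

print(res)

'''Result
      name  marks  price
第一  google    100      1
第二   baidu    200      2
第三   yahoo    300      3
'''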

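Selecting values (by row index and column index)

Getting a single column, for example the column labelled name

# Getting a single column
# 1. Get the name column
res = pd.DataFrame(data, columns=['name', 'marks', 'price'], index=['第一', '第二', '第三'])

print(res['name'])

'''Result
第一    google
第二     baidu
第三     yahoo
Name: name, dtype: object
'''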

Getting the column labelled price

# 2. Get the price column (res is the DataFrame with the custom index built above)
print(res['price'])

'''Result
第一    1
第二    2
第三    3
Name: price, dtype: int64
'''

Getting two columns

To select two labels at once, wrap them in a list, which gives double square brackets.

# Select two columns at once (note the double square brackets)
print(res[['name', 'price']])

'''Result
      name  price
第一  google      1
第二   baidu      2
第三   yahoo      3
'''

Getting a single value

# Take the name column first, then take its first value by position
print(res['name'].iloc[0])  # google

Getting the first two rows

# Take the first two rows (a slice inside [] selects rows by position)
print(res[0:2])

'''Result
      name  marks  price
第一  google    100      1
第二   baidu    200      2
'''

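Getting the first two rows and then selecting columns from them

# Take the first two rows, then select the given columns from the result
print(res[0:2][['name', 'price']])

'''Result (note the double square brackets when selecting several labels)
      name  price
第一  google      1
第二   baidu      2
'''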
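The two-step selection above can also be written as one loc call that names the rows and the columns together. A minimal sketch, assuming the same DataFrame with the custom index:

```python
import pandas as pd

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data, columns=['name', 'marks', 'price'], index=['第一', '第二', '第三'])

# Rows from label '第一' to '第二' (both included), columns 'name' and 'price'.
subset = res.loc['第一':'第二', ['name', 'price']]

# A single cell: row label first, column label second.
single = res.loc['第一', 'name']

print(subset)
print(single)
```

A single loc call is generally preferable to chaining two selections, particularly when the result is going to be assigned to.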

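ix used to cover both of the forms handled by loc and iloc below: it selected by row and column label or by row and column position (rows before the comma, columns after it). It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc for labels and iloc for positions instead. The following DataFrame is used for the comparison:

import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=["a", "b", "c"])

data

'''Result
   A  B  C
a  1  4  7
b  2  5  8
c  3  6  9
'''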


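For example, to get the value 5

Method 1: select the single cell (ix has been removed from pandas, so iloc and loc are used)

data.iloc[1, 1]      # by position: row 1, column 1
data.loc['b', 'B']   # by label: row 'b', column 'B'

Method 2: select the block that contains it

data.iloc[1:3, 1:3]          # by position: the end position 3 is excluded
data.loc['b':'c', 'B':'C']   # by label: both end labels are included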
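Older code sometimes relied on ix to mix a row label with a column position. One possible way to express that with the current API is to translate one side first, as sketched below; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                    index=['a', 'b', 'c'])

# Option 1: turn the column position into a label, then use loc.
value_via_loc = data.loc['b', data.columns[1]]

# Option 2: turn the row label into a position, then use iloc.
value_via_iloc = data.iloc[data.index.get_loc('b'), 1]

print(value_via_loc, value_via_iloc)
```

Both expressions address the cell in row 'b' and the second column, which holds 5.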

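loc: selecting by label

Selecting specific labels

# Select specific column labels
data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data, columns=['name', 'marks', 'price'], index=['第一', '第二', '第三'])

print(res.loc[:, ['name', 'marks']])

'''Result
      name  marks
第一  google    100
第二   baidu    200
第三   yahoo    300
'''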

Selecting a range of labels

# Select a range of column labels on the full DataFrame (both end labels are included)
print(res.loc[:, 'name':'price'])

'''Result
      name  marks  price
第一  google    100      1
第二   baidu    200      2
第三   yahoo    300      3
'''

Selecting by row index plus column label

# Row index plus column label
data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}

res = pd.DataFrame(data)

print(res)

'''Initial DataFrame
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3
'''

# Row label 0 combined with a list of column labels
print(res.loc[0, ['name']])

'''Result
name    google
Name: 0, dtype: object
'''

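Selecting by a range of row labels together with column labels (note that 0:1 includes 1)

# A range of row labels plus a list of column labels, applied to the full DataFrame.
# With loc the slice 0:1 includes the end label 1.
print(res.loc[0:1, ['marks', 'price']])

'''Result
   marks  price
0    100      1
1    200      2
'''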
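Because loc slices include the end label while iloc slices exclude the end position, the same-looking slice 0:1 returns a different number of rows. A short sketch on the default integer index:

```python
import pandas as pd

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}
res = pd.DataFrame(data)

# Label-based: rows labelled 0 and 1.
by_label = res.loc[0:1]

# Position-based: only the row at position 0.
by_position = res.iloc[0:1]

print(len(by_label), len(by_position))
```

With the default RangeIndex, labels and positions happen to be the same numbers, which makes this difference easy to overlook.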

iloc: selecting rows by position

Getting a single row

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}

res = pd.DataFrame(data)

print(res)

'''Initial DataFrame
     name  marks  price
0  google    100      1
1   baidu    200      2
2   yahoo    300      3
'''

# Get the first row
print(res.iloc[0])

'''Result
name     google
marks       100
price         1
Name: 0, dtype: object
'''

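Getting a row and then a single value from it

# Select a row by position, then a single cell by row and column position
data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}

res = pd.DataFrame(data)

print(res.iloc[1])     # the row at position 1 (the second row): baidu, 200, 2

print(res.iloc[1, 2])  # row position 1, column position 2 (price): 2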

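Getting rows and columns by range (this assumes the default index is still in use)

# Select rows and columns by position ranges:
# 1. 0:2 on the rows takes the rows at positions 0 and 1
# 2. 0:2 on the columns then takes the columns at positions 0 and 1
print(res.iloc[0:2, 0:2])

'''Result
     name  marks
0  google    100
1   baidu    200
'''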

Getting the second and third rows with all of their columns

# Rows at positions 1 and 2, all columns
print(res.iloc[1:3, :])

'''Result
    name  marks  price
1  baidu    200      2
2  yahoo    300      3
'''

Getting the first and second rows and showing only their first and third columns

# Rows at positions 0 and 1, columns at positions 0 and 2
print(res.iloc[[0, 1], [0, 2]])

'''Result
     name  price
0  google      1
1   baidu      2
'''
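Positions and labels only coincide while the rows are in their original order. The sketch below sorts the DataFrame first and then reads "the first row" both ways, to show that iloc follows the current row order while loc follows the labels.

```python
import pandas as pd

data = {"name": ['google', 'baidu', 'yahoo'], "marks": [100, 200, 300], "price": [1, 2, 3]}

# After sorting by name the row order is baidu, google, yahoo,
# but each row keeps its original label (1, 0, 2).
res = pd.DataFrame(data).sort_values(by='name')

first_by_position = res.iloc[0, 0]    # first row in the current order
row_labelled_zero = res.loc[0, 'name']  # the row whose label is 0

print(first_by_position, row_labelled_zero)
```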
