Pandas透视表和交叉表

时间:2023-03-09 23:32:48
Pandas透视表和交叉表

透视表

参数名 说明
values 待聚合的列的名称。默认聚合所有数值列
index 用于分组的列名或其他分组键,出现在结果透视表的行
columns 用于分组的列表或其他分组键,出现在结果透视表的列
aggfunc 聚合函数或函数列表,默认为'mean',可以是任何对groupby有效的函数
fill_value 用于替换结果表中的缺失值
margins 添加行/列小计和总计,默认为False
# pivot_table默认聚合分组平均数
tips = pd.read_csv('C:/Users/1/Desktop/tips.csv')
tips.pivot_table(index=['sex','smoker']) size tip total_bill
sex smoker
Female No 2.592593 2.773519 18.105185
Yes 2.242424 2.931515 17.977879
Male No 2.711340 3.113402 19.791237
Yes 2.500000 3.051167 22.284500 # 聚合小费的比例和人数的多少,根据天数进行分组
tips['tip_pct'] = tips.tip / tips.total_bill
tips.pivot_table(['tip_pct','size'],index=['sex','day'],columns='smoker') size tip_pct
smoker No Yes No Yes
sex day
Female Fri 2.500000 2.000000 0.165296 0.209129
Sat 2.307692 2.200000 0.147993 0.163817
Sun 3.071429 2.500000 0.165710 0.237075
Thur 2.480000 2.428571 0.155971 0.163073
Male Fri 2.000000 2.125000 0.138005 0.144730
Sat 2.656250 2.629630 0.162132 0.139067
Sun 2.883721 2.600000 0.158291 0.173964
Thur 2.500000 2.300000 0.165706 0.164417 # margins=True,会开启一列ALL的行和列,其值对应于单个等级中所有数据的分组统计。
tips.pivot_table(['tip_pct','size'],index=['sex','day'],columns='smoker',margins=True) size tip_pct
smoker No Yes All No Yes All
sex day
Female Fri 2.500000 2.000000 2.111111 0.165296 0.209129 0.199388
Sat 2.307692 2.200000 2.250000 0.147993 0.163817 0.156470
Sun 3.071429 2.500000 2.944444 0.165710 0.237075 0.181569
Thur2.480000 2.428571 2.468750 0.155971 0.163073 0.157525
Male Fri 2.000000 2.125000 2.100000 0.138005 0.144730 0.143385
Sat 2.656250 2.629630 2.644068 0.162132 0.139067 0.151577
Sun 2.883721 2.600000 2.810345 0.158291 0.173964 0.162344
Thur2.500000 2.300000 2.433333 0.165706 0.164417 0.165276
All 2.668874 2.408602 2.569672 0.159328 0.163196 0.160803 # aggfunc可以直接接受聚合函数,len可以统计分组大小的交叉表
# 根据性别,是否吸烟对每天的人数进行统计
tips.pivot_table('tip_pct',index=['sex','smoker'],columns='day',aggfunc=len,margins=True) day Fri Sat Sun Thur All
sex smoker
Female No 2.0 13.0 14.0 25.0 54.0
Yes 7.0 15.0 4.0 7.0 33.0
Male No 2.0 32.0 43.0 20.0 97.0
Yes 8.0 27.0 15.0 10.0 60.0
All 19.0 87.0 76.0 62.0 244.0 # 如果存在空的组合,可以设置空值fill_value
tips.pivot_table('size',index=['time','sex','smoker'],columns='day',aggfunc='sum',fill_value=0) day Fri Sat Sun Thur
time sex smoker
Dinner Female No 2 30 43 2
Yes 8 33 10 0
Male No 4 85 124 0
Yes 12 71 39 0
Lunch Female No 3 0 0 60
Yes 6 0 0 17
Male No 0 0 0 50
Yes 5 0 0 23

交叉表: crosstab

是一种用于计算分组频率的特殊透视表
data = [[1,'Female','Right-handed'],
[2,'Male','Left-handed'],
[3,'Female','Right-handed'],
[4,'Male','Right-handed'],
[5,'Male','Left-handed'],
[6,'Male','Right-handed'],
[7,'Female','Right-handed'],
[8,'Female','Left-handed'],
[9,'Male','Right-handed'],
[10,'Female','Right-handed']]
data = pd.DataFrame(data,columns=['Sample','Gender','Handedness']) # 前两个参数可以是数组、Series或数组列表
# 根据性别和用手习惯对这段数据进行统计汇总
pd.crosstab(data.Gender,data.Handedness,margins=True) Handedness Left-handed Right-handed All
Gender
Female 1 4 5
Male 2 3 5
All 3 7 10 # 这是传统的透视表,比较复杂点比交叉表
pd.pivot_table(data,index=['Gender'],columns=['Handedness'],aggfunc=len,margins=True) Sample
Handedness Left-handed Right-handed All
Gender
Female 1 4 5
Male 2 3 5
All 3 7 10 # 根据时间,星期几进行了对每天的是否吸烟人数分组统计
pd.crosstab([tips.time,tips.day],tips.smoker,margins=True) smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244