A Guide to Grouping with the pandas Library (Python)

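Grouping Patterns and the GroupBy Object

The General Pattern of Grouping

Grouping operations are extremely common in everyday data analysis, for example:

Group by gender and compute the average life expectancy of a country's population
Group by season and standardize the temperatures within each season
Group by class and keep only the classes whose average math score exceeds 80
Any grouping operation therefore requires three elements to be specified: the grouping key, the data source, and the operation together with the result it returns.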

df.groupby(grouping_key)[data_source].operation

For example, to compute the median height by gender:

import pandas as pd

df = pd.read_csv('data/learn_pandas.csv')
print(df.groupby('gender')['height'].median())

#gender
#female    159.6
#male      173.4
#Name: height, dtype: float64
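The three elements can also be seen on data small enough to check by hand. The sketch below uses a hypothetical five-row table (`toy` is an assumption standing in for data/learn_pandas.csv, which is not reproduced here).

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "height": [158.0, 175.0, 162.0, 171.0, 160.0],
})

# grouping key: 'gender'; data source: 'height'; operation: median
median_height = toy.groupby("gender")["height"].median()
print(median_height)
```

The result is a Series indexed by the group labels, with one value per group.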

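The Nature of the Grouping Key

To group along several dimensions, simply pass a list of the corresponding column names to groupby. For example, to group by school and gender and compute the mean height: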

print(df.groupby(['school','gender'])['height'].mean())

#school                         gender
#fudan university               female    158.776923
#                               male      174.212500
#peking university              female    158.666667
#                               male      172.030000
#shanghai jiao tong university  female    159.122500
#                               male      176.760000
#tsinghua university            female    159.753333
#                               male      171.638889
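#Name: height, dtype: float64

So far, every grouping key could be read directly from a column by name. Grouping by more complex logic works the same way. For example, to group students by whether their weight exceeds the overall mean and again compute the mean height:

condition = df.weight > df.weight.mean()  # weight above the overall mean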
print(df.groupby(condition)['height'].mean())

#weight
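#False    159.034646
#True     172.705357
#Name: height, dtype: float64

The index of the result shows that the groups are formed from the values of the condition (here True and False). Passing column names is therefore just shorthand for passing the columns themselves: the groups come from the unique combinations of values in the grouping data, and drop_duplicates shows exactly which groups exist.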

print(df[['school','gender']].drop_duplicates())

#                           school  gender
#0   shanghai jiao tong university  female
#1               peking university    male
#2   shanghai jiao tong university    male
#3                fudan university  female
#4                fudan university    male
#5             tsinghua university  female
#9               peking university  female
#16            tsinghua university    male

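Passing the columns themselves instead of their names gives the same result as the earlier grouping by ['school','gender']:

print(df.groupby([df['school'], df['gender']])['height'].mean())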

#school                         gender
#fudan university               female    158.776923
#                               male      174.212500
#peking university              female    158.666667
#                               male      172.030000
#shanghai jiao tong university  female    159.122500
#                               male      176.760000
#tsinghua university            female    159.753333
#                               male      171.638889
#Name: height, dtype: float64
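A minimal sketch of the same idea on a hypothetical six-row table (`toy` and its values are assumptions, not the article's data): grouping by column names, by the columns themselves, and by a boolean condition all go through the same mechanism.

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "school": ["A", "A", "B", "B", "B", "A"],
    "gender": ["female", "male", "female", "female", "male", "female"],
    "height": [158.0, 175.0, 162.0, 160.0, 171.0, 164.0],
})

# Column names and the columns themselves are interchangeable as keys
by_name = toy.groupby(["school", "gender"])["height"].mean()
by_column = toy.groupby([toy["school"], toy["gender"]])["height"].mean()
same_result = by_name.equals(by_column)

# The groups are the unique (school, gender) combinations
unique_keys = toy[["school", "gender"]].drop_duplicates()

# A boolean Series works as a key too: groups are labelled False and True
tall = toy["height"] > toy["height"].mean()
by_condition = toy.groupby(tall)["height"].count()

print(same_result, len(unique_keys), by_condition.tolist())
```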

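The GroupBy Object

Every method called during a grouping operation comes from the pandas GroupBy object, which defines many methods.
The ngroups attribute gives the number of groups: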

gb = df.groupby(['school', 'grade'])
print(gb.ngroups)   #16

The groups attribute returns a dictionary that maps each group name to the list of index labels in that group:

gb = df.groupby(['school', 'grade'])
res = gb.groups
print(res.keys())

#dict_keys([('fudan university', 'freshman'), ('fudan university', 'junior'), ('fudan university', 'senior'), ('fudan university', 'sophomore'), ('peking university', 'freshman'), ('peking university', 'junior'), ('peking university', 'senior'), ('peking university', 'sophomore'), ('shanghai jiao tong university', 'freshman'), ('shanghai jiao tong university', 'junior'), ('shanghai jiao tong university', 'senior'), ('shanghai jiao tong university', 'sophomore'), ('tsinghua university', 'freshman'), ('tsinghua university', 'junior'), ('tsinghua university', 'senior'), ('tsinghua university', 'sophomore')])

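As a DataFrame attribute, size is the number of rows multiplied by the number of columns. On a GroupBy object, size() instead counts the number of rows in each group.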

print(gb.size())

#school                         grade    
#fudan university               freshman      9
#                               junior       12
#                               senior       11
#                               sophomore     8
#peking university              freshman     13
#                               junior        8
#                               senior        8
#                               sophomore     5
#shanghai jiao tong university  freshman     13
#                               junior       17
#                               senior       22
#                               sophomore     5
#tsinghua university            freshman     17
#                               junior       22
#                               senior       14
#                               sophomore    16
#dtype: int64

The get_group method returns the rows belonging to a given group directly; the exact name of the group must be known:

print(gb.get_group(('fudan university', 'freshman')))

#               school     grade  ...   test_date time_record
#15   fudan university  freshman  ...    2020/1/1     0:05:25
#28   fudan university  freshman  ...    2020/1/7     0:05:24
#63   fudan university  freshman  ...  2019/10/31     0:04:00
#70   fudan university  freshman  ...  2019/11/19     0:04:07
#73   fudan university  freshman  ...   2019/9/26     0:03:31
#105  fudan university  freshman  ...  2019/12/11     0:04:23
#108  fudan university  freshman  ...   2019/12/8     0:05:03
#157  fudan university  freshman  ...   2019/9/11     0:04:17
#186  fudan university  freshman  ...   2019/10/9     0:04:21
#
#[9 rows x 10 columns]
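The attributes and methods above can be tried together on a small table. This is a sketch on hypothetical data (`toy` and `gb_toy` are assumed names), chosen so that every value is easy to verify by eye.

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "school": ["A", "A", "B", "B", "B"],
    "grade": ["freshman", "senior", "freshman", "freshman", "senior"],
    "height": [158.0, 175.0, 162.0, 160.0, 171.0],
})
gb_toy = toy.groupby(["school", "grade"])

n_groups = gb_toy.ngroups                          # number of distinct (school, grade) pairs
group_keys = list(gb_toy.groups.keys())            # group names, one tuple per group
group_sizes = gb_toy.size()                        # rows per group (a method on GroupBy)
table_size = toy.size                              # rows * columns (an attribute on DataFrame)
b_freshman = gb_toy.get_group(("B", "freshman"))   # all rows of one group

print(n_groups, table_size)
print(group_sizes)
print(b_freshman)
```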

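Aggregation Functions

Built-in Aggregation Functions

The aggregation functions defined directly on the GroupBy object all return one scalar per group. They include: max/min/mean/median/count/all/any/idxmax/idxmin/mad/nunique/skew/quantile/sum/std/var/sem/size/prod. Note that mad was deprecated in pandas 1.5 and removed in pandas 2.0, so it is only available in older releases.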

gb = df.groupby('gender')['height']
print(gb.idxmin())
#gender
#female    143
#male      199
#Name: height, dtype: int64
print(gb.quantile(0.95))
#gender
#female    166.8
#male      185.9
#Name: height, dtype: float64

When the data source contains several columns, these aggregation functions are computed column by column:

gb = df.groupby('gender')[['height', 'weight']]
print(gb.max())
#        height  weight
#gender                
#female   170.2    63.0
#male     193.9    89.0
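For readers without the CSV at hand, here is a small sketch of a few built-in aggregations on hypothetical data (`toy` is an assumption). It illustrates that idxmin returns index labels rather than values, and that a multi-column source is aggregated column by column.

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "height": [158.0, 175.0, 162.0, 171.0, 160.0],
    "weight": [46.0, 70.0, 52.0, 68.0, 50.0],
})

gb_height = toy.groupby("gender")["height"]
shortest_row = gb_height.idxmin()       # index label of the minimum in each group
median_height = gb_height.quantile(0.5)

# With two source columns, max is computed for each column separately
col_max = toy.groupby("gender")[["height", "weight"]].max()

print(shortest_row)
print(median_height)
print(col_max)
```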

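The agg Method

Although the GroupBy object defines many convenient functions, they still have the following limitations:

Several functions cannot be applied at the same time
A specific aggregation cannot be applied to a specific column
Custom aggregation functions cannot be used
The resulting columns cannot be given custom names as part of the aggregation

The agg method addresses all four.

Using several functions
To apply several aggregation functions at once, pass the strings naming the built-in aggregations as a list. All of the strings mentioned earlier are valid.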

gb = df.groupby('gender')[['height', 'weight']]
print(gb.agg(['sum','idxmax','skew']))

#         height                   weight                 
#            sum idxmax      skew     sum idxmax      skew
#gender                                                   
#female  21014.0     28 -0.219253  6469.0     28 -0.268482
#male     8854.9    193  0.437535  3929.0      2 -0.332393

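The result has a two-level column index: the first level is the data source and the second is the aggregation method. Each aggregation is applied to each column in turn, which gives 6 columns.

Applying specific aggregations to specific columns
To pair particular methods with particular columns, pass a dictionary to agg, with column names as keys and an aggregation string or a list of such strings as values.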

print(gb.agg({'height':['mean','max'], 'weight':'count'}))

#           height        weight
#             mean    max  count
#gender                         
#female  159.19697  170.2    135
#male    173.62549  193.9     54
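The shape of the column index in both cases can be checked on a small table. The sketch below assumes a hypothetical `toy` DataFrame and is not tied to the article's data.

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "height": [158.0, 175.0, 162.0, 171.0, 160.0],
    "weight": [46.0, 70.0, 52.0, 68.0, 50.0],
})
gb_toy = toy.groupby("gender")[["height", "weight"]]

# A list applies every function to every column: 2 columns x 2 functions
multi = gb_toy.agg(["sum", "max"])

# A dict pairs each column with its own function(s)
per_col = gb_toy.agg({"height": ["mean", "max"], "weight": "count"})

print(multi.columns.tolist())
print(per_col.columns.tolist())
```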

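Using custom functions
agg also accepts custom functions. Note that the argument passed to the function is a column of the data source, and the computation proceeds column by column.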

print(gb.agg(lambda x: x.mean()-x.min()))

#          height     weight
#gender                     
#female  13.79697  13.918519
#male    17.92549  21.759259

Because the function receives a Series, every Series method and attribute can be used inside it, as long as the return value is a scalar.

def my_func(s):
    res = 'high'
    if s.mean() <= df[s.name].mean():
        res = 'low'
    return res
print(gb.agg(my_func))

#       height weight
#gender              
#female    low    low
#male     high   high
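One practical use of a custom aggregation is to stand in for a built-in that a given pandas version lacks. As a sketch, the mean absolute deviation (what `mad` computed before its removal in pandas 2.0) can be written as an ordinary function; `mean_abs_dev` and `toy` are hypothetical names introduced here.

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "height": [158.0, 175.0, 162.0, 171.0, 160.0],
    "weight": [46.0, 70.0, 52.0, 68.0, 50.0],
})

def mean_abs_dev(s):
    # s is one column of one group (a Series); the return value is a scalar
    return (s - s.mean()).abs().mean()

mad_by_gender = toy.groupby("gender")[["height", "weight"]].agg(mean_abs_dev)
print(mad_by_gender)
```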

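Renaming aggregation results
To rename the columns of the aggregation result, replace each function in the calls above with a tuple whose first element is the new name and whose second element is the original function, which may be either an aggregation string or a custom function.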

print(gb.agg([('range', lambda x: x.max()-x.min()), ('my_sum', 'sum')]))

#       height          weight        
#        range   my_sum  range  my_sum
#gender                               
#female   24.8  21014.0   29.0  6469.0
#male     38.2   8854.9   38.0  3929.0

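Note that when a single aggregation is applied to one or more columns, the renaming tuple must still be wrapped in square brackets. Otherwise there is no way to tell whether the tuple supplies a new name or contains a mistyped built-in function string.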

print(gb.agg({'height': [('my_func', my_func), 'sum'], 'weight': [('max', lambda x:x.max())]}))

#        height          weight
#       my_func      sum    max
#gender                        
#female     low  21014.0   63.0
#male      high   8854.9   89.0
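pandas also offers keyword-based "named aggregation", which is not used in this article but produces flat, renamed columns in one step. A minimal sketch on hypothetical data (`toy`, `height_range` and `weight_total` are assumed names):

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "height": [158.0, 175.0, 162.0, 171.0, 160.0],
    "weight": [46.0, 70.0, 52.0, 68.0, 50.0],
})

# Each keyword is the new column name; the value is (source column, function)
summary = toy.groupby("gender").agg(
    height_range=("height", lambda s: s.max() - s.min()),
    weight_total=("weight", "sum"),
)
print(summary)
```

Unlike the tuple-in-list form, the result has a single-level column index, which can be easier to work with downstream.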

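Transformation and Filtering

Transformation Functions and the transform Method

A transformation function returns a sequence of the same length as its input. The most commonly used built-in transformations are the cumulative functions cumcount/cumsum/cumprod/cummax/cummin. They are used in the same way as aggregation functions, except that they perform a cumulative operation within each group.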

print(gb.cummax().head())

#   height  weight
#0   158.9    46.0
#1   166.5    70.0
#2   188.9    89.0
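#3     NaN    46.0
#4   188.9    89.0

Custom transformations require the transform method. The custom function receives each column of the data source as a Series, the same input type as with agg, and the final result is a DataFrame whose row and column indexes match those of the data source.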

print(gb.transform(lambda x: (x-x.mean())/x.std()).head())

#     height    weight
#0 -0.058760 -0.354888
#1 -1.010925 -0.355000
#2  2.167063  2.089498
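#3       NaN -1.279789
#4  0.053133  0.159631

Strictly speaking, transform is not limited to same-length sequences: the function may also return a scalar, in which case the value is broadcast to every row of its group. This scalar-broadcasting technique is very common in feature engineering.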

print(gb.transform('mean').head())

#      height     weight
#0  159.19697  47.918519
#1  173.62549  72.759259
#2  173.62549  72.759259
#3  159.19697  47.918519
#4  173.62549  72.759259
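The three behaviours (cumulative function, custom same-length transformation, scalar broadcast) can be compared side by side. The following is a sketch on hypothetical data (`toy` is an assumption) with values small enough to verify by hand.

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "height": [158.0, 175.0, 162.0, 171.0, 160.0],
})
gb_height = toy.groupby("gender")["height"]

running_max = gb_height.cummax()                                   # cumulative max within each group
z_score = gb_height.transform(lambda s: (s - s.mean()) / s.std())  # same-length result per group
group_mean = gb_height.transform("mean")                           # scalar broadcast to every row

# All three keep the original row index, so they can be attached as new columns
result = toy.assign(running_max=running_max, z_score=z_score, group_mean=group_mean)
print(result)
```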

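Group Indexing and Filtering

Group filtering generalizes row filtering. A statistic is computed over all rows of a group: if it returns True the group is kept, and if it returns False the whole group is dropped. The rows of all remaining groups are then concatenated and returned as a DataFrame.
The GroupBy object defines the filter method for selecting groups. The custom function receives the DataFrame formed by the data source for each group; for the GroupBy object defined in the earlier examples, that is the group's rows of df[['height', 'weight']]. All DataFrame methods and attributes can therefore be used inside the custom function, which only needs to return a boolean.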

print(gb.filter(lambda x:x.shape[0]>100).head())

#   height  weight
#0   158.9    46.0
#3     NaN    41.0
#5   158.0    51.0
#6   162.5    52.0
#7   161.9    50.0
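A small sketch of group filtering, assuming a hypothetical `toy` table in which one group has three rows and the other has two:

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "height": [158.0, 175.0, 162.0, 171.0, 160.0],
    "weight": [46.0, 70.0, 52.0, 68.0, 50.0],
})

# The function sees each group's sub-DataFrame and must return a single bool
kept = toy.groupby("gender")[["height", "weight"]].filter(lambda g: g.shape[0] > 2)
print(kept)
```

Only the rows of groups that pass the test survive, and they keep their original index labels.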

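Cross-Column Grouping

Using apply
By design, the custom function passed to apply receives exactly the same argument as the one passed to filter. The only difference is that filter allows only a boolean return value.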

def bmi(x):
    height = x['height']/100        # centimetres to metres
    weight = x['weight']
    bmi_value = weight/height**2    # BMI = weight (kg) / height (m) squared
    return bmi_value.mean()
print(gb.apply(bmi))

#Returns a float64 Series indexed by gender, holding the mean BMI of each group
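The point of apply here is that the function can combine several columns of a group, which agg cannot do because it works one column at a time. A sketch on hypothetical data follows (`toy` and `mean_bmi` are assumed names; heights are taken to be in centimetres and weights in kilograms).

```python
import pandas as pd

# Hypothetical toy data; heights in cm, weights in kg
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "height": [160.0, 180.0, 160.0, 180.0],
    "weight": [51.2, 81.0, 64.0, 64.8],
})

def mean_bmi(group):
    # group is a DataFrame holding both columns for one gender
    height_m = group["height"] / 100
    return (group["weight"] / height_m ** 2).mean()

bmi_by_gender = toy.groupby("gender")[["height", "weight"]].apply(mean_bmi)
print(bmi_by_gender)
```

With these values each group's individual BMIs are about 20 and 25, so both group means come out at approximately 22.5.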

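Besides a scalar, the function passed to apply may also return a one-dimensional Series or a two-dimensional DataFrame. How do the dimensions of the result and the number of index levels change in each case?
Scalar case: the result is a Series whose index matches that of an agg result.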

gb = df.groupby(['gender','test_number'])[['height','weight']]
print(gb.apply(lambda x: 0))

#gender  test_number
#female  1              0
#        2              0
#        3              0
#male    1              0
#        2              0
#        3              0
#dtype: int64

print(gb.apply(lambda x: [0, 0]))

#gender  test_number
#female  1              [0, 0]
#        2              [0, 0]
#        3              [0, 0]
#male    1              [0, 0]
#        2              [0, 0]
#        3              [0, 0]
#dtype: object

Series case: the result is a DataFrame whose row index matches the scalar case and whose column index is the index of the returned Series.

print(gb.apply(lambda x: pd.Series([0,0],index=['a','b'])))

#                    a  b
#gender test_number      
#female 1            0  0
#       2            0  0
#       3            0  0
#male   1            0  0
#       2            0  0
#       3            0  0

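DataFrame case: the result is a DataFrame. Its row index is the index an agg result would have for each group, with the row index of the returned DataFrame added as an extra innermost level, and its column index matches the column index of the returned DataFrame.

import numpy as np

print(gb.apply(lambda x: pd.DataFrame(np.ones((2,2)), index = ['a','b'], columns=pd.Index([('w','x'),('y','z')]))))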

#                        w    y
#                        x    z
#gender test_number            
#female 1           a  1.0  1.0
#                   b  1.0  1.0
#       2           a  1.0  1.0
#                   b  1.0  1.0
#       3           a  1.0  1.0
#                   b  1.0  1.0
#male   1           a  1.0  1.0
#                   b  1.0  1.0
#       2           a  1.0  1.0
#                   b  1.0  1.0
#       3           a  1.0  1.0
#                   b  1.0  1.0
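The three return types can be compared programmatically by inspecting the type, shape and number of index levels of each result. This sketch uses a hypothetical `toy` table with four (gender, test_number) groups.

```python
import pandas as pd

# Hypothetical toy data; not the article's CSV file
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "test_number": [1, 1, 2, 2, 1],
    "height": [158.0, 175.0, 162.0, 171.0, 160.0],
    "weight": [46.0, 70.0, 52.0, 68.0, 50.0],
})
gb_toy = toy.groupby(["gender", "test_number"])[["height", "weight"]]

# Scalar per group -> Series indexed by the group keys
scalar_res = gb_toy.apply(lambda g: 0)

# Series per group -> DataFrame; the Series index becomes the columns
series_res = gb_toy.apply(lambda g: pd.Series([0, 0], index=["a", "b"]))

# DataFrame per group -> DataFrame; its row index becomes an extra inner level
frame_res = gb_toy.apply(
    lambda g: pd.DataFrame([[1, 1], [1, 1]], index=["a", "b"], columns=["x", "y"])
)

for res in (scalar_res, series_res, frame_res):
    print(type(res).__name__, res.shape, res.index.nlevels)
```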
