Data Analysis: How to Select Subsets of Data in pandas
How do you select a single column, a single row, or a sub-region of a DataFrame?
Selecting a single column
For example, if you are only interested in the passengers' ages in the titanic table, you can do this:
In [4]: ages = titanic["age"]

In [5]: ages.head()
Out[5]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

In [6]: type(titanic["age"])
Out[6]: pandas.core.series.Series

In [7]: titanic["age"].shape
Out[7]: (891,)
Selecting multiple columns
For example, to study several attributes of the titanic table together, say the passengers' sex as well as their age, you can do this:
In [8]: age_sex = titanic[["age", "sex"]]

In [9]: age_sex.head()
Out[9]:
    age     sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male

In [10]: type(titanic[["age", "sex"]])
Out[10]: pandas.core.frame.DataFrame

In [11]: titanic[["age", "sex"]].shape
Out[11]: (891, 2)
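The difference between single and double brackets is easy to miss, so here is a minimal sketch of it. The small `df` table and its values are made up for illustration and only stand in for the titanic data; the point is that a single label returns a Series, while a list of labels returns a DataFrame even when the list has one element.

```python
import pandas as pd

# Hypothetical miniature stand-in for the titanic table (values are made up).
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "sex": ["male", "female", "female", "male"],
})

one_column = df["age"]        # single label -> one-dimensional Series
one_column_frame = df[["age"]]  # list with one label -> two-dimensional DataFrame
two_columns = df[["age", "sex"]]

print(type(one_column).__name__, one_column.shape)
print(type(one_column_frame).__name__, one_column_frame.shape)
print(type(two_columns).__name__, two_columns.shape)
```

If later code expects a table (for example, something that reads `.columns`), the double-bracket form is usually the safer choice.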
Filtering rows by value
For example, suppose you are interested in the passengers in the titanic table who are older than 35:
In [12]: above_35 = titanic[titanic["age"] > 35]

In [13]: above_35.head()
Out[13]:
    passengerid  survived  pclass  ...     fare cabin  embarked
1             2         1       1  ...  71.2833   c85         c
6             7         0       1  ...  51.8625   e46         s
11           12         1       1  ...  26.5500  c103         s
13           14         0       3  ...  31.2750   NaN         s
15           16         1       2  ...  16.0000   NaN         s

[5 rows x 12 columns]

In [15]: above_35.shape
Out[15]: (217, 12)
In fact, the condition inside the brackets is itself a boolean Series, with one True or False value per row:
In [14]: titanic["age"] > 35
Out[14]:
0      False
1       True
2      False
3      False
4      False
       ...
886    False
887    False
888    False
889    False
890    False
Name: age, Length: 891, dtype: bool
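Because the condition is just a boolean Series, several conditions can be combined into one mask. The sketch below uses a small made-up `df` rather than the real titanic data. It relies on two pandas behaviours: comparisons against NaN evaluate to False, and conditions are combined with `&` (and) or `|` (or), each wrapped in its own parentheses.

```python
import pandas as pd

# Hypothetical miniature stand-in for the titanic table (values are made up).
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 54.0, 40.0],
    "sex": ["male", "female", "female", "male", "female"],
})

# One True/False per row; the missing age compares as False.
mask = df["age"] > 35
print(mask)

# Each condition needs its own parentheses because & binds tighter than > and ==.
older_women = df[(df["age"] > 35) & (df["sex"] == "female")]
print(older_women)
```

Python's `and` and `or` keywords do not work here, since they try to reduce a whole Series to a single truth value.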
If you are also interested in the passenger class (pclass) and want only the passengers in classes 2 and 3, you can do this:
In [16]: class_23 = titanic[titanic["pclass"].isin([2, 3])]

In [17]: class_23.head()
Out[17]:
   passengerid  survived  pclass  ...     fare cabin  embarked
0            1         0       3  ...   7.2500   NaN         s
2            3         1       3  ...   7.9250   NaN         s
4            5         0       3  ...   8.0500   NaN         s
5            6         0       3  ...   8.4583   NaN         q
7            8         0       3  ...  21.0750   NaN         s

[5 rows x 12 columns]

# Equivalent to:
In [18]: class_23 = titanic[(titanic["pclass"] == 2) | (titanic["pclass"] == 3)]

In [19]: class_23.head()
Out[19]:
   passengerid  survived  pclass  ...     fare cabin  embarked
0            1         0       3  ...   7.2500   NaN         s
2            3         1       3  ...   7.9250   NaN         s
4            5         0       3  ...   8.0500   NaN         s
5            6         0       3  ...   8.4583   NaN         q
7            8         0       3  ...  21.0750   NaN         s

[5 rows x 12 columns]
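The equivalence of the two forms can be checked directly. The following is a small sketch on a made-up `pclass` column (not the real data) that builds both selections and compares them with `DataFrame.equals`.

```python
import pandas as pd

# Hypothetical passenger classes (values are made up).
df = pd.DataFrame({"pclass": [3, 1, 3, 2, 1, 2]})

via_isin = df[df["pclass"].isin([2, 3])]
via_or = df[(df["pclass"] == 2) | (df["pclass"] == 3)]

# True when both selections hold the same rows, values, and dtypes.
same = via_isin.equals(via_or)
print(same)
```

`isin` tends to read better as the list of accepted values grows, since the chained `|` form needs one comparison per value.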
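Data cleaning often requires separating the rows that contain NA values from the rows that do not, so that each group can be handled on its own. You can do this: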
In [20]: age_no_na = titanic[titanic["age"].notna()]

In [21]: age_no_na.head()
Out[21]:
   passengerid  survived  pclass  ...     fare cabin  embarked
0            1         0       3  ...   7.2500   NaN         s
1            2         1       1  ...  71.2833   c85         c
2            3         1       3  ...   7.9250   NaN         s
3            4         1       1  ...  53.1000  c123         s
4            5         0       3  ...   8.0500   NaN         s

[5 rows x 12 columns]

In [22]: age_no_na.shape
Out[22]: (714, 12)
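The opposite selection, the rows where the value is missing, uses `isna()` in the same way. Here is a minimal sketch on a made-up table; the names `has_age` and `missing_age` are illustrative only. Together the two masks split the table into two non-overlapping parts.

```python
import pandas as pd

# Hypothetical data with two missing ages (values are made up).
df = pd.DataFrame({
    "age": [22.0, None, 26.0, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

has_age = df[df["age"].notna()]     # rows where age is present
missing_age = df[df["age"].isna()]  # rows where age is missing

print(has_age.shape, missing_age.shape)
# Every row lands in exactly one of the two groups.
print(len(has_age) + len(missing_age) == len(df))
```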
Selecting specific rows and columns
For example, suppose you want the names of the passengers in the titanic table who are older than 35:
In [23]: adult_names = titanic.loc[titanic["age"] > 35, "name"]

In [24]: adult_names.head()
Out[24]:
1     cumings, mrs. john bradley (florence briggs th...
6                               mccarthy, mr. timothy j
11                             bonnell, miss. elizabeth
13                          andersson, mr. anders johan
15                      hewlett, mrs. (mary d kingcome)
Name: name, dtype: object
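The part of `loc` after the comma accepts a list of column labels as well as a single label. The sketch below uses a made-up table with placeholder names. As with plain brackets, a single label gives a Series and a list gives a DataFrame.

```python
import pandas as pd

# Hypothetical miniature stand-in for the titanic table (values are made up).
df = pd.DataFrame({
    "name": ["passenger a", "passenger b", "passenger c", "passenger d"],
    "age": [22.0, 38.0, None, 54.0],
    "pclass": [3, 1, 3, 1],
})

older = df["age"] > 35  # row condition, placed before the comma

names = df.loc[older, "name"]                  # one column -> Series
names_and_class = df.loc[older, ["name", "pclass"]]  # several columns -> DataFrame

print(names)
print(names_and_class)
```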
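If you are interested in rows 10 to 25 and columns 3 to 5, you can do this (iloc uses zero-based positions and excludes the end of each slice):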
In [25]: titanic.iloc[9:25, 2:5]
Out[25]:
    pclass                                 name     sex
9        2  nasser, mrs. nicholas (adele achem)  female
10       3      sandstrom, miss. marguerite rut  female
11       1             bonnell, miss. elizabeth  female
12       3       saundercock, mr. william henry    male
13       3          andersson, mr. anders johan    male
..     ...                                  ...     ...
20       2                 fynney, mr. joseph j    male
21       2                beesley, mr. lawrence    male
22       3          mcgowan, miss. anna "annie"  female
23       1         sloper, mr. william thompson    male
24       3        palsson, miss. torborg danira  female

[16 rows x 3 columns]
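The end of an `iloc` slice is excluded, which is why `9:25` yields 16 rows. Label-based slicing with `loc` includes both endpoints instead. The following sketch shows the contrast on a made-up 30-row table whose column names `a` to `e` are purely illustrative; it assumes the default integer index, where labels and positions happen to coincide.

```python
import pandas as pd

# Hypothetical 30-row table with five columns and the default 0..29 index.
df = pd.DataFrame({col: range(30) for col in ["a", "b", "c", "d", "e"]})

# Positions 9..24 and column positions 2..4: the end of each slice is excluded.
by_position = df.iloc[9:25, 2:5]

# Labels 9..25 and columns "c".."e": both endpoints are included.
by_label = df.loc[9:25, "c":"e"]

print(by_position.shape)
print(by_label.shape)
```

On a table with a non-default index the two calls can select entirely different rows, so it helps to decide up front whether you mean positions or labels.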
The code above is only a simple example; adjust the expressions and ranges to suit your own problem.