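Today we'll look at data filtering in pandas.
Below, I'll walk through roughly 20 examples that illustrate the different ways to filter data. First, let's build the dataset we'll be using: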
import pandas as pd
df = pd.DataFrame({"name": ["John","Jane","Emily","Lisa","Matt"],
                   "note": [92,94,87,82,90],
                   "profession":["Electrical engineer","Mechanical engineer",
                                 "Data scientist","Accountant","Athlete"],
                   "date_of_birth":["1998-11-01","2002-08-14","1996-01-12",
                                    "2002-10-24","2004-04-05"],
                   "group":["A","B","B","A","C"]
                   })
output
    name  note           profession date_of_birth group
0   John    92  Electrical engineer    1998-11-01     A
1   Jane    94  Mechanical engineer    2002-08-14     B
2  Emily    87       Data scientist    1996-01-12     B
3   Lisa    82           Accountant    2002-10-24     A
4   Matt    90              Athlete    2004-04-05     C
Selecting several columns
The code is as follows:
df[["name","note"]]
output
    name  note
0   John    92
1   Jane    94
2  Emily    87
3   Lisa    82
4   Matt    90
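Selecting several rows as well
Building on the previous result, we also select a subset of rows. Note that loc slices by label and includes the end label, so :3 returns the rows labeled 0 through 3: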
df.loc[:3, ["name","note"]]
output
    name  note
0   John    92
1   Jane    94
2  Emily    87
3   Lisa    82
Filtering data by index position
Here we use the iloc method, which selects by integer position and excludes the end of a slice:
df.iloc[:3, 2]
output
0    Electrical engineer
1    Mechanical engineer
2         Data scientist
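The two examples above use the same slice `:3` but return a different number of rows. A minimal sketch of why, assuming the default integer index of the sample data (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
})

# loc slices by label and INCLUDES the end label: rows 0, 1, 2, 3
by_label = df.loc[:3, ["name", "note"]]

# iloc slices by position and EXCLUDES the end position: rows 0, 1, 2
by_position = df.iloc[:3, [0, 1]]

print(len(by_label))     # 4
print(len(by_position))  # 3
```

With a non-default index (for example, string labels) the two would diverge even further, so it is worth choosing the accessor deliberately.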
Filtering with comparison operators
df[df.note > 90]
output
   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
1  Jane    94  Mechanical engineer    2002-08-14     B
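It may help to see what the comparison produces on its own. The sketch below, on a trimmed copy of the sample data, stores the boolean mask in a variable (`mask` is an illustrative name), then reuses it to filter, to invert the filter, and to pick a single column with `loc`.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
})

# The comparison yields a boolean Series aligned on df's index
mask = df["note"] > 90

high = df[mask]               # rows where the mask is True
not_high = df[~mask]          # ~ inverts the mask
names = df.loc[mask, "name"]  # filter rows and pick one column in one step

print(mask.tolist())   # [True, True, False, False, False]
print(names.tolist())  # ['John', 'Jane']
```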
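The dt accessor
The dt accessor is used to work with datetime data. We first need to convert the strings (or values of another type) to a datetime type, and then we can filter on them: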
df.date_of_birth = df.date_of_birth.astype("datetime64[ns]")
df[df.date_of_birth.dt.month==11]
output
   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
Or we can filter by year:
df[df.date_of_birth.dt.year > 2000]
output
   name  note           profession date_of_birth group
1  Jane    94  Mechanical engineer    2002-08-14     B
3  Lisa    82           Accountant    2002-10-24     A
4  Matt    90              Athlete    2004-04-05     C
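As an alternative sketch of the same idea, the conversion can also be done with `pd.to_datetime`, a commonly used way to parse date strings. This is not the article's method, just a variation; the range filter with `between` and the variable names are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "date_of_birth": ["1998-11-01", "2002-08-14", "1996-01-12",
                      "2002-10-24", "2004-04-05"],
})

# Parse the strings into datetime64 values so the dt accessor is available
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])

born_in_november = df[df["date_of_birth"].dt.month == 11]
born_after_2000 = df[df["date_of_birth"].dt.year > 2000]

# between() is inclusive on both ends by default
born_in_2002 = df[df["date_of_birth"].between(pd.Timestamp("2002-01-01"),
                                              pd.Timestamp("2002-12-31"))]

print(born_in_2002["name"].tolist())  # ['Jane', 'Lisa']
```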
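Filtering on the intersection of multiple conditions
When rows must satisfy several conditions at once (their intersection), wrap each condition in parentheses and combine them with &: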
df[(df.date_of_birth.dt.year > 2000) &
   (df.profession.str.contains("engineer"))]
output
   name  note           profession date_of_birth group
1  Jane    94  Mechanical engineer    2002-08-14     B
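Filtering on the union of multiple conditions
When rows only need to satisfy at least one of several conditions (their union), combine them with |: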
df[(df.note > 90) | (df.profession=="Data scientist")]
output
    name  note           profession date_of_birth group
0   John    92  Electrical engineer    1998-11-01     A
1   Jane    94  Mechanical engineer    2002-08-14     B
2  Emily    87       Data scientist    1996-01-12     B
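The intersection and union filters above can be sketched with named masks, which tends to keep longer conditions readable. The use of `isin` and `~` here goes slightly beyond the article's examples, and the mask names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
    "profession": ["Electrical engineer", "Mechanical engineer",
                   "Data scientist", "Accountant", "Athlete"],
    "group": ["A", "B", "B", "A", "C"],
})

is_engineer = df["profession"].str.contains("engineer")
in_group_a_or_b = df["group"].isin(["A", "B"])
high_note = df["note"] > 90

# Intersection: both conditions must hold
both = df[is_engineer & in_group_a_or_b]

# Union: at least one condition must hold
either = df[high_note | (df["profession"] == "Data scientist")]

# Negation of a union: rows matching neither condition
neither = df[~(is_engineer | high_note)]

print(either["name"].tolist())   # ['John', 'Jane', 'Emily']
print(neither["name"].tolist())  # ['Emily', 'Lisa', 'Matt']
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>` and `==` in Python.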
Filtering with the query method
The query method in pandas can also filter data. We pass the filter condition as a string:
df.query("note > 90")
output
   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
1  Jane    94  Mechanical engineer    2002-08-14     B
Or, with two conditions:
df.query("group == 'A' and note > 89")
output
   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
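When the thresholds live in Python variables rather than in the query string, `query` lets you reference them with an `@` prefix. A minimal sketch, where `min_note` and `wanted_group` are hypothetical variable names introduced for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
    "group": ["A", "B", "B", "A", "C"],
})

min_note = 89
wanted_group = "A"

# @name refers to a variable from the surrounding Python scope
result = df.query("group == @wanted_group and note > @min_note")

# The in operator tests membership in a list
a_or_c = df.query("group in ['A', 'C']")

print(result["name"].tolist())  # ['John']
print(a_or_c["name"].tolist())  # ['John', 'Lisa', 'Matt']
```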
Filtering with nsmallest and nlargest
The nsmallest and nlargest methods in pandas return the rows with the smallest and largest values in a given column:
df.nsmallest(2, "note")
output
    name  note      profession date_of_birth group
3   Lisa    82      Accountant    2002-10-24     A
2  Emily    87  Data scientist    1996-01-12     B
df.nlargest(2, "note")
output
   name  note           profession date_of_birth group
1  Jane    94  Mechanical engineer    2002-08-14     B
0  John    92  Electrical engineer    1998-11-01     A
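One way to read these two methods is as shorthand for sorting and taking the first few rows. The sketch below compares both routes on a trimmed copy of the sample data; the equivalence shown holds here because there are no tied values in `note`.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
})

lowest_two = df.nsmallest(2, "note")
highest_two = df.nlargest(2, "note")

# The same rows via an explicit sort followed by head()
lowest_via_sort = df.sort_values("note").head(2)
highest_via_sort = df.sort_values("note", ascending=False).head(2)

print(lowest_two["name"].tolist())   # ['Lisa', 'Emily']
print(highest_two["name"].tolist())  # ['Jane', 'John']
```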
The isna() method
The isna() method picks out the rows that contain missing values. First, we set one of the values in the table to NaN:
import numpy as np
df.loc[0, "profession"] = np.nan
df[df.profession.isna()]
output
   name  note profession date_of_birth group
0  John    92        NaN    1998-11-01     A
The notna() method
The notna() method does the opposite of isna(): it picks out the rows whose values are not missing:
df[df.profession.notna()]
output
    name  note           profession date_of_birth group
1   Jane    94  Mechanical engineer    2002-08-14     B
2  Emily    87       Data scientist    1996-01-12     B
3   Lisa    82           Accountant    2002-10-24     A
4   Matt    90              Athlete    2004-04-05     C
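Beyond filtering with isna() and notna(), the same missing values can be counted, dropped, or replaced. The sketch below is an extension of the article's example rather than part of it, and "Unknown" is an arbitrary placeholder chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "profession": ["Electrical engineer", "Mechanical engineer",
                   "Data scientist", "Accountant", "Athlete"],
})
df.loc[0, "profession"] = np.nan

# Count missing values in one column
missing_count = df["profession"].isna().sum()

# Drop rows where profession is missing (same rows as the notna() filter)
kept = df.dropna(subset=["profession"])

# Replace missing values instead of dropping the rows
filled = df.fillna({"profession": "Unknown"})

print(missing_count)                # 1
print(kept["name"].tolist())        # ['Jane', 'Emily', 'Lisa', 'Matt']
print(filled.loc[0, "profession"])  # Unknown
```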
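The assign method
The assign method in pandas adds a column to the dataset directly. It returns a new DataFrame and leaves the original unchanged. The scores here are random, so your values will differ from the sample output: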
df_1 = df.assign(score=np.random.randint(0,100,size=5))
df_1
output
    name  note           profession date_of_birth group  score
0   John    92  Electrical engineer    1998-11-01     A     19
1   Jane    94  Mechanical engineer    2002-08-14     B     84
2  Emily    87       Data scientist    1996-01-12     B     68
3   Lisa    82           Accountant    2002-10-24     A     70
4   Matt    90              Athlete    2004-04-05     C     39
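Two details of assign are easy to miss: it does not modify the original DataFrame, and it accepts a callable that computes the new column from existing ones. A minimal sketch, assuming a seeded NumPy generator for reproducibility; the `passed` column and the threshold of 90 are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
})

# A seeded generator makes the random scores repeatable between runs
rng = np.random.default_rng(seed=0)
df_1 = df.assign(score=rng.integers(0, 100, size=len(df)))

# A callable receives the DataFrame and returns the new column
df_2 = df.assign(passed=lambda d: d["note"] >= 90)

print("score" in df.columns)    # False: the original is untouched
print(df_2["passed"].tolist())  # [True, True, False, False, True]
```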
The explode method
As its name suggests, explode() breaks list-like values apart. We often run into datasets like this one:
  Name  Hobby
0  吕布  [打篮球, 玩游戏, 喝奶茶]
1  貂蝉  [敲代码, 看电影]
2  赵云  [听音乐, 健身]
Each row of the Hobby column holds its values bundled together in a list, and explode() splits those bundled values into separate rows:
df.explode('Hobby')
output
  Name  Hobby
0  吕布  打篮球
0  吕布  玩游戏
0  吕布  喝奶茶
1  貂蝉  敲代码
1  貂蝉  看电影
2  赵云  听音乐
2  赵云  健身
Of course, the expanded data may contain duplicate rows, which we can drop before resetting the index:
df.explode('Hobby').drop_duplicates().reset_index(drop=True)
output
  Name  Hobby
0  吕布  打篮球
1  吕布  玩游戏
2  吕布  喝奶茶
3  貂蝉  敲代码
4  貂蝉  看电影
5  赵云  听音乐
6  赵云  健身
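The article does not show how the hobby dataset was built, so the sketch below reconstructs it from the printed table as an assumption (the variable name `hobbies` is illustrative) and then runs both explode steps end to end.

```python
import pandas as pd

# Reconstructed from the table shown above; the original construction
# code is not given in the article
hobbies = pd.DataFrame({
    "Name": ["吕布", "貂蝉", "赵云"],
    "Hobby": [["打篮球", "玩游戏", "喝奶茶"],
              ["敲代码", "看电影"],
              ["听音乐", "健身"]],
})

# Each list element becomes its own row; the original index is repeated
exploded = hobbies.explode("Hobby")

# Drop duplicate rows, then renumber the index from 0
tidy = exploded.drop_duplicates().reset_index(drop=True)

print(exploded.index.tolist())  # [0, 0, 0, 1, 1, 2, 2]
print(tidy.index.tolist())      # [0, 1, 2, 3, 4, 5, 6]
```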