Today we'll look at data filtering in pandas.
Below are roughly 20 examples that show concretely how to filter data. First, let's build the dataset we'll use:
import numpy as np  # used later for NaN values and random integers
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
    "profession": ["Electrical engineer", "Mechanical engineer",
                   "Data scientist", "Accountant", "Athlete"],
    "date_of_birth": ["1998-11-01", "2002-08-14", "1996-01-12",
                      "2002-10-24", "2004-04-05"],
    "group": ["A", "B", "B", "A", "C"],
})
output
    name  note           profession date_of_birth group
0   John    92  Electrical engineer    1998-11-01     A
1   Jane    94  Mechanical engineer    2002-08-14     B
2  Emily    87       Data scientist    1996-01-12     B
3   Lisa    82           Accountant    2002-10-24     A
4   Matt    90              Athlete    2004-04-05     C
Selecting several columns
The code is as follows:
df[["name","note"]]
output
    name  note
0   John    92
1   Jane    94
2  Emily    87
3   Lisa    82
4   Matt    90
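A small sketch of how the bracket style changes the result type, assuming a cut-down version of the dataset above. The variable names (as_series, as_frame, subset) are only illustrative.

```python
import pandas as pd

# Cut-down copy of the article's dataset (assumption: three rows are enough here).
df = pd.DataFrame({
    "name": ["John", "Jane", "Emily"],
    "note": [92, 94, 87],
})

as_series = df["note"]          # single brackets: one column as a Series
as_frame = df[["note"]]         # double brackets: a one-column DataFrame
subset = df[["name", "note"]]   # list of labels: several columns at once

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(subset.shape)              # (3, 2)
```

Passing a list of labels always gives back a DataFrame, even when the list holds a single column.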
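Selecting several rows as well
Building on the result above, we also select several rows. Note that loc slices by label and includes the end label, so :3 returns four rows: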
df.loc[:3, ["name","note"]]
output
    name  note
0   John    92
1   Jane    94
2  Emily    87
3   Lisa    82
Filtering by integer position
Here we use the iloc indexer, which selects by integer position:
df.iloc[:3, 2]
output
0    Electrical engineer
1    Mechanical engineer
2         Data scientist
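The two indexers treat the end of a slice differently, which is easy to miss. Here is a minimal comparison, assuming a two-column copy of the dataset; by_label and by_position are illustrative names.

```python
import pandas as pd

# Two-column copy of the article's dataset with the default 0..4 index.
df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
})

by_label = df.loc[:3, ["name", "note"]]   # labels 0..3, end label included
by_position = df.iloc[:3, [0, 1]]         # positions 0..2, end position excluded

print(len(by_label))     # 4
print(len(by_position))  # 3
```

With the default integer index the labels and positions look alike, so the off-by-one difference is the main thing to watch for.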
Filtering with comparison operators
df[df.note > 90]
output
   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
1  Jane    94  Mechanical engineer    2002-08-14     B
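It may help to see the intermediate step: the comparison produces a boolean Series (a mask), and indexing with it keeps the rows where the mask is True. A minimal sketch, assuming a two-column copy of the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
})

mask = df["note"] > 90   # element-wise comparison gives a boolean Series
filtered = df[mask]      # keep only the rows where the mask is True

print(mask.tolist())              # [True, True, False, False, False]
print(filtered["name"].tolist())  # ['John', 'Jane']
```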
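The dt accessor
The dt accessor is used for working with datetime data. Of course, we first need to convert strings (or data of other types) into a datetime type, and only then process it: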
df.date_of_birth = df.date_of_birth.astype("datetime64[ns]")
df[df.date_of_birth.dt.month == 11]
output
   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
Alternatively, we can filter by year:
df[df.date_of_birth.dt.year > 2000]
output
   name  note           profession date_of_birth group
1  Jane    94  Mechanical engineer    2002-08-14     B
3  Lisa    82           Accountant    2002-10-24     A
4  Matt    90              Athlete    2004-04-05     C
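As an alternative to astype, the conversion can also be done with pd.to_datetime, which is a swapped-in choice here rather than the article's approach. A sketch assuming a two-column copy of the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "date_of_birth": ["1998-11-01", "2002-08-14", "1996-01-12",
                      "2002-10-24", "2004-04-05"],
})

# Parse the ISO-formatted strings into datetime64 values.
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])

born_in_november = df[df["date_of_birth"].dt.month == 11]
born_after_2000 = df[df["date_of_birth"].dt.year > 2000]

print(born_in_november["name"].tolist())  # ['John']
print(born_after_2000["name"].tolist())   # ['Jane', 'Lisa', 'Matt']
```

The dt accessor only works once the column holds datetime values; calling it on the raw strings raises an AttributeError.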
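Filtering on the intersection of several conditions
When there are several conditions and a row must satisfy all of them, the code should be written like this: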
df[(df.date_of_birth.dt.year > 2000) & (df.profession.str.contains("engineer"))]
output
   name  note           profession date_of_birth group
1  Jane    94  Mechanical engineer    2002-08-14     B
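Filtering on the union of several conditions
When a row only needs to satisfy at least one of several conditions, the code is as follows: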
df[(df.note > 90) | (df.profession=="Data scientist")]
output
    name  note           profession date_of_birth group
0   John    92  Electrical engineer    1998-11-01     A
1   Jane    94  Mechanical engineer    2002-08-14     B
2  Emily    87       Data scientist    1996-01-12     B
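The intersection and union filters can be put side by side by naming each mask first. This is a sketch on a three-column copy of the dataset; the conditions (is_engineer, in_group_b) are chosen only for illustration. Each comparison needs its own parentheses when written inline, because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "profession": ["Electrical engineer", "Mechanical engineer",
                   "Data scientist", "Accountant", "Athlete"],
    "group": ["A", "B", "B", "A", "C"],
})

is_engineer = df["profession"].str.contains("engineer")
in_group_b = df["group"] == "B"

both = df[is_engineer & in_group_b]        # intersection: all conditions hold
either = df[is_engineer | in_group_b]      # union: at least one condition holds
neither = df[~(is_engineer | in_group_b)]  # negation of the union

print(both["name"].tolist())     # ['Jane']
print(either["name"].tolist())   # ['John', 'Jane', 'Emily']
print(neither["name"].tolist())  # ['Lisa', 'Matt']
```

Python's own and / or keywords do not work element-wise on a Series, which is why the bitwise operators are used instead.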
Filtering with the query method
The query method in pandas can also filter data; we pass the filter condition in as a string:
df.query("note > 90")
output
   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
1  Jane    94  Mechanical engineer    2002-08-14     B
Or:
df.query("group=='A' and note > 89")
output
   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
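When the threshold or group lives in a Python variable, query can reference it with an @ prefix instead of building the string by hand. A sketch, assuming a three-column copy of the dataset; threshold and wanted_group are hypothetical variable names.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
    "group": ["A", "B", "B", "A", "C"],
})

threshold = 89       # hypothetical cut-off, same value as in the example above
wanted_group = "A"   # hypothetical group of interest

# @name looks up a Python variable from the surrounding scope.
result = df.query("group == @wanted_group and note > @threshold")

print(result["name"].tolist())  # ['John']
```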
Filtering with nsmallest and nlargest
The nsmallest and nlargest methods in pandas find the rows with the smallest and largest values in a given column:
df.nsmallest(2, "note")
output
    name  note      profession date_of_birth group
3   Lisa    82      Accountant    2002-10-24     A
2  Emily    87  Data scientist    1996-01-12     B
df.nlargest(2, "note")
output
   name  note           profession date_of_birth group
1  Jane    94  Mechanical engineer    2002-08-14     B
0  John    92  Electrical engineer    1998-11-01     A
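One way to read these two methods is as a shortcut for sorting and then taking the first rows. The sketch below checks that on a two-column copy of the dataset, which has no tied values in note; with ties the two approaches may order rows differently.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
})

lowest_two = df.nsmallest(2, "note")
highest_two = df.nlargest(2, "note")

# Same rows obtained by sorting the whole frame and taking the head.
lowest_via_sort = df.sort_values("note").head(2)
highest_via_sort = df.sort_values("note", ascending=False).head(2)

print(lowest_two["name"].tolist())   # ['Lisa', 'Emily']
print(highest_two["name"].tolist())  # ['Jane', 'John']
```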
The isna() method
The isna() method picks out the rows whose value is missing. First, we set one of the values in the table to be missing:
df.loc[0, "profession"] = np.nan
df[df.profession.isna()]
output
   name  note profession date_of_birth group
0  John    92        NaN    1998-11-01     A
The notna() method
The notna() method does the opposite of isna() above: it picks out the rows whose value is not missing. The code is as follows:
df[df.profession.notna()]
output
    name  note           profession date_of_birth group
1   Jane    94  Mechanical engineer    2002-08-14     B
2  Emily    87       Data scientist    1996-01-12     B
3   Lisa    82           Accountant    2002-10-24     A
4   Matt    90              Athlete    2004-04-05     C
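The two methods split the table into complementary parts, and the notna() filter matches what dropna gives when restricted to that column. A self-contained sketch, assuming a two-column copy of the dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "profession": ["Electrical engineer", "Mechanical engineer",
                   "Data scientist", "Accountant", "Athlete"],
})
df.loc[0, "profession"] = np.nan   # make one value missing, as in the article

missing = df[df["profession"].isna()]
present = df[df["profession"].notna()]
dropped = df.dropna(subset=["profession"])   # same rows as `present`

print(missing["name"].tolist())  # ['John']
print(present["name"].tolist())  # ['Jane', 'Emily', 'Lisa', 'Matt']
```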
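The assign method
The assign method in pandas adds a column to the dataset. It returns a new DataFrame and leaves the original unchanged. The scores below are random, so your values will differ: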
df_1 = df.assign(score=np.random.randint(0, 100, size=5))
df_1
output
    name  note           profession date_of_birth group  score
0   John    92  Electrical engineer    1998-11-01     A     19
1   Jane    94  Mechanical engineer    2002-08-14     B     84
2  Emily    87       Data scientist    1996-01-12     B     68
3   Lisa    82           Accountant    2002-10-24     A     70
4   Matt    90              Athlete    2004-04-05     C     39
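Two variations that may be useful, sketched on a two-column copy of the dataset: a seeded generator so the random column is reproducible, and a callable so the new column is derived from existing ones. The column name passed and the 90-point cut-off are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
})

# Seeded generator: the same scores are produced on every run.
rng = np.random.default_rng(seed=0)
df_scored = df.assign(score=rng.integers(0, 100, size=len(df)))

# A callable receives the DataFrame and returns the new column.
df_flagged = df.assign(passed=lambda d: d["note"] >= 90)

print(list(df.columns))               # ['name', 'note'] (original untouched)
print(df_flagged["passed"].tolist())  # [True, True, False, False, True]
```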
The explode method
Translated literally, explode() means "to blow up". We often run into datasets like this one:
  Name                           Hobby
0   吕布  [basketball, gaming, milk tea]
1   貂蝉                [coding, movies]
2   赵云                [music, fitness]
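Each row of the Hobby column holds its values bundled together in a list, and the explode() method splits these bundles apart so that every element gets its own row. Calling df.explode('Hobby') gives: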
  Name       Hobby
0   吕布  basketball
0   吕布      gaming
0   吕布    milk tea
1   貂蝉      coding
1   貂蝉      movies
2   赵云       music
2   赵云     fitness
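Of course, once the data is expanded the index labels repeat and duplicate rows can appear, so we drop the duplicates and reset the index: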
df.explode('Hobby').drop_duplicates().reset_index(drop=True)
output
  Name       Hobby
0   吕布  basketball
1   吕布      gaming
2   吕布    milk tea
3   貂蝉      coding
4   貂蝉      movies
5   赵云       music
6   赵云     fitness
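The article does not show how the Hobby table was built, so the following is a reconstruction under that assumption, with the hobby strings written in English. It also shows the repeated index labels that explode() leaves behind.

```python
import pandas as pd

# Reconstructed sample data (assumption: the original construction code is not shown).
df = pd.DataFrame({
    "Name": ["吕布", "貂蝉", "赵云"],
    "Hobby": [["basketball", "gaming", "milk tea"],
              ["coding", "movies"],
              ["music", "fitness"]],
})

exploded = df.explode("Hobby")           # one row per list element
tidy = exploded.reset_index(drop=True)   # replace repeated labels with 0..n-1

print(exploded.index.tolist())  # [0, 0, 0, 1, 1, 2, 2]
print(tidy.index.tolist())      # [0, 1, 2, 3, 4, 5, 6]
```

In pandas 1.1 and later, df.explode("Hobby", ignore_index=True) resets the index in the same call.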
That's all for today's post.