About Machine Learning: Machine Learning Prerequisites (4): Master Pandas in One Article

Pandas provides fast, flexible, and expressive data structures, and is a powerful Python library for data analysis.

This article is part of the Machine Learning Prerequisites tutorial series.

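Pandas is built on top of NumPy. For more on NumPy, see my earlier article, Machine Learning Prerequisites (3): Master Common NumPy Usage in 30 Minutes.
Pandas is particularly well suited to tabular data such as SQL tables and Excel spreadsheets, to ordered or unordered time series, and to arbitrary matrix data with row and column labels.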

Open Jupyter Notebook and import numpy and pandas to start the tutorial:

import numpy as np
import pandas as pd

A Series is a one-dimensional ndarray with an index. Index values do not have to be unique, but they must be hashable.

pd.Series([1, 3, 5, np.nan, 6, 8])

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

As we can see, the default index is the integers 0, 1, 2, 3, 4, 5. Pass the index argument to set it to 'c', 'a', 'i', 'yong', 'j', 'i'.

pd.Series([1, 3, 5, np.nan, 6, 8], index=['c','a','i','yong','j','i'])

The output is shown below. Note that index values may repeat.

c       1.0
a       3.0
i       5.0
yong    NaN
j       6.0
i       8.0
dtype: float64
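A repeated index label affects what a lookup returns. The following is a minimal sketch using the same Series as above; the variable names `unique_hit` and `dup_hit` are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['c', 'a', 'i', 'yong', 'j', 'i'])

unique_hit = s['c']   # the label appears once: a single float comes back
dup_hit = s['i']      # the label appears twice: a Series with both rows comes back

print(unique_hit)
print(dup_hit.tolist())
print(s.index.is_unique)
```

Code that expects a scalar from a label lookup should therefore check `index.is_unique` first.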

A DataFrame is a table with rows and columns. It can be thought of as a dictionary of Series objects.

pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=['i','ii','iii'], columns=['A', 'B', 'C'])

The resulting table is shown below; index holds the row labels and columns holds the column labels.

	A	B	C
i	1	2	3
ii	4	5	6
iii	7	8	9
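To make the "dictionary of Series" view concrete, here is a small sketch that builds the same table from three Series sharing one index. The names `labels`, `columns`, and `frame` are illustrative.

```python
import pandas as pd

labels = ['i', 'ii', 'iii']
columns = {
    'A': pd.Series([1, 4, 7], index=labels),
    'B': pd.Series([2, 5, 8], index=labels),
    'C': pd.Series([3, 6, 9], index=labels),
}

# Each dictionary key becomes a column; the shared index becomes the row labels
frame = pd.DataFrame(columns)

# Selecting one column gives back a Series
col_b = frame['B']
print(frame)
print(type(col_b).__name__)
```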

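Prepare the data: randomly generate a two-dimensional array with 6 rows and 4 columns, labeling the rows with the dates 20210101 through 20210106 and the columns A, B, C, D.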

import numpy as np
import pandas as pd
np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('20210101', periods=6), columns=list('ABCD'))
df

The table looks like this:

	A	B	C	D
2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-03	0.325415	-0.602236	-0.134508	1.28121
2021-01-04	-0.33032	-1.40384	-0.93809	1.48804
2021-01-05	0.348708	1.27175	0.626011	-0.253845
2021-01-06	-0.816064	1.30197	0.656281	-1.2718

View the first few rows of the table:

df.head(2)

The table looks like this:

	A	B	C	D
2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-02	0.696541	0.136352	-1.64592	-0.69841

View the last few rows of the table:

df.tail(3)

The table looks like this:

	A	B	C	D
2021-01-04	-0.33032	-1.40384	-0.93809	1.48804
2021-01-05	0.348708	1.27175	0.626011	-0.253845
2021-01-06	-0.816064	1.30197	0.656281	-1.2718

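The describe method generates descriptive statistics for a DataFrame, which makes it easy to inspect the distribution of a dataset. Note that NaN values are excluded from these statistics.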

df.describe()

The result is:

	A	B	C	D
count	6	6	6	6
mean	0.0825402	0.0497552	-0.181309	0.22896
std	0.551412	1.07834	0.933155	1.13114
min	-0.816064	-1.40384	-1.64592	-1.2718
25%	-0.18	-0.553043	-0.737194	-0.587269
50%	0.298188	-0.134555	0.106933	0.287363
75%	0.342885	0.987901	0.556601	1.16805
max	0.696541	1.30197	0.656281	1.48804

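Let's first review the formulas we already know. Note that pandas reports the sample variance and sample standard deviation by default, so the sum of squares is divided by n - 1 rather than n (ddof=1).

Mean:

$$\bar x = \frac{\sum_{i=1}^{n}{x_i}}{n}$$

Variance:

$$s^2 = \frac{\sum_{i=1}^{n}{(x_i -\bar x)^2}}{n-1}$$

Standard deviation (std):

$$s = \sqrt{\frac{\sum_{i=1}^{n}{(x_i -\bar x)^2}}{n-1}}$$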

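Here is what each statistic reported by describe means, using column A as the example.

count is the number of values. Column A has 6 non-null values.
mean is the average. The mean of all non-null values in column A is 0.0825402.
std is the standard deviation. The standard deviation of column A is 0.551412.
min is the minimum. The minimum of column A is -0.816064, so no value is smaller than -0.816064.
25% is the first quartile. The first quartile of column A is -0.18, so 25% of the data lies below -0.18.
50% is the median (second quartile). The median of column A is 0.298188, so 50% of the data lies below 0.298188.
75% is the third quartile. The third quartile of column A is 0.342885, so 75% of the data lies below 0.342885.
max is the maximum. The maximum of column A is 0.696541, so every value is less than or equal to 0.696541.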
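The std and quartile figures can be checked by hand. The sketch below copies the six values of column A from the table above (rounded as displayed, so results agree only to about five decimals) and assumes the pandas defaults: `ddof=1` for `std` and linear interpolation for `quantile`.

```python
import numpy as np
import pandas as pd

# Column A as displayed in the table above (rounded values)
a = pd.Series([0.270961, 0.696541, 0.325415, -0.330320, 0.348708, -0.816064])

sample_std = a.std()               # pandas default: divide by n - 1
population_std = a.std(ddof=0)     # divide by n instead
manual_std = np.sqrt(((a - a.mean()) ** 2).sum() / (len(a) - 1))

q25 = a.quantile(0.25)             # linear interpolation between sorted values

print(round(sample_std, 6), round(population_std, 6), round(q25, 6))
```

The value reported by describe matches `sample_std`, not `population_std`.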

T stands for Transpose: it swaps rows and columns.

df.T

The table looks like this:

	2021-01-01	2021-01-02	2021-01-03	2021-01-04	2021-01-05	2021-01-06
A	0.270961	0.696541	0.325415	-0.33032	0.348708	-0.816064
B	-0.405463	0.136352	-0.602236	-1.40384	1.27175	1.30197
C	0.348373	-1.64592	-0.134508	-0.93809	0.626011	0.656281
D	0.828572	-0.69841	1.28121	1.48804	-0.253845	-1.2718

Sort by a given column. The code below sorts by column C in ascending order.

df.sort_values(by='C')

The table looks like this:

	A	B	C	D
2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-04	-0.33032	-1.40384	-0.93809	1.48804
2021-01-03	0.325415	-0.602236	-0.134508	1.28121
2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-05	0.348708	1.27175	0.626011	-0.253845
2021-01-06	-0.816064	1.30197	0.656281	-1.2718

Select the n rows with the largest values in a column. For example, df.nlargest(2,'A') returns the 2 rows with the largest values in column A.

df.nlargest(2,'A')

The table looks like this:

	A	B	C	D
2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-05	0.348708	1.27175	0.626011	-0.253845

The sample method returns a random sample of rows.

df.sample(5) returns 5 random rows.

df.sample(5)

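The frac parameter stands for fraction. frac=0.01 returns a random 1% of the rows as a sample. On a table as small as this 6-row one, 1% rounds to zero rows, so the option is mainly useful for large datasets.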

df.sample(frac=0.01)
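Random samples differ between runs unless a seed is fixed. A minimal sketch, assuming a small stand-in table (`df` here is built from `np.arange`, not the random data above) and the `random_state` parameter of `sample`:

```python
import numpy as np
import pandas as pd

# Stand-in table with 6 rows, used only for this illustration
df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('ABCD'))

three_rows = df.sample(3, random_state=0)   # fixed seed: reproducible sample
again = df.sample(3, random_state=0)        # same seed, same rows
half = df.sample(frac=0.5, random_state=0)  # 50% of 6 rows = 3 rows

print(three_rows)
print(len(half))
```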

Use df['A'] to select column A.

df['A']

This outputs the data of column A, which is itself a Series object:

2021-01-01    0.270961
2021-01-02    0.696541
2021-01-03    0.325415
2021-01-04   -0.330320
2021-01-05    0.348708
2021-01-06   -0.816064
Name: A, dtype: float64

df[0:3] is equivalent to df.head(3). It uses the same slice syntax as NumPy arrays, which shows how well Pandas interoperates with NumPy.

df[0:3]

The table looks like this:

	A	B	C	D
2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-03	0.325415	-0.602236	-0.134508	1.28121

Use loc to select by row and column labels. A label slice includes both of its endpoints.

df.loc['2021-01-01':'2021-01-02', ['A', 'B']]

The table looks like this:

	A	B
2021-01-01	0.270961	-0.405463
2021-01-02	0.696541	0.136352

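iloc differs from loc: loc selects by label, while iloc selects by integer position. df.iloc[3:5, 0:3] selects the rows at positions 3 and 4 and the columns at positions 0, 1, and 2, that is, the 4th and 5th rows and the 1st, 2nd, and 3rd columns.
Note that positions start at 0. The colon denotes a range, with the start on the left and the end on the right. 3:5 is the half-open interval [3,5), closed on the left and open on the right, so 5 itself is not included.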

df.iloc[3:5, 0:3]

	A	B	C
2021-01-04	-0.33032	-1.40384	-0.93809
2021-01-05	0.348708	1.27175	0.626011

df.iloc[:, 1:3]

	B	C
2021-01-01	-0.405463	0.348373
2021-01-02	0.136352	-1.64592
2021-01-03	-0.602236	-0.134508
2021-01-04	-1.40384	-0.93809
2021-01-05	1.27175	0.626011
2021-01-06	1.30197	0.656281
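The endpoint rules of loc and iloc are easy to mix up. A short sketch on a stand-in table (integer data instead of the random values above) that selects the same block both ways:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=pd.date_range('20210101', periods=6),
                  columns=list('ABCD'))

# Label slice: both '2021-01-01' and '2021-01-03' are included (3 rows)
by_label = df.loc['2021-01-01':'2021-01-03', ['A', 'B']]

# Position slice: 0:3 covers positions 0, 1, 2 and 0:2 covers positions 0, 1
by_position = df.iloc[0:3, 0:2]

print(by_label.equals(by_position))
```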

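A DataFrame can be filtered by a condition: rows where the condition is True are kept, and rows where it is False are dropped.

Let's define a filter that tests whether column A is greater than 0.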

filter = df['A'] > 0
filter

The result is shown below; the rows for 2021-01-04 and 2021-01-06 are False.

2021-01-01     True
2021-01-02     True
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Name: A, dtype: bool

Now apply the filter to the dataset.

df[filter]
# df[df['A'] > 0]

In the table, the rows for 2021-01-04 and 2021-01-06 have been filtered out.

	A	B	C	D
2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-03	0.325415	-0.602236	-0.134508	1.28121
2021-01-05	0.348708	1.27175	0.626011	-0.253845
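Several conditions can be combined in one filter. A minimal sketch with made-up two-column data; note that `&` and `|` are used instead of `and` and `or`, and each condition needs its own parentheses.

```python
import pandas as pd

# Small illustrative table (values chosen by hand for this example)
df = pd.DataFrame({'A': [0.27, 0.70, 0.33, -0.33, 0.35, -0.82],
                   'B': [-0.41, 0.14, -0.60, -1.40, 1.27, 1.30]})

both_positive = df[(df['A'] > 0) & (df['B'] > 0)]    # rows where both hold
either_positive = df[(df['A'] > 0) | (df['B'] > 0)]  # rows where at least one holds

print(both_positive.index.tolist())
print(either_positive.index.tolist())
```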

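Prepare the data.

df2 = df.copy()
# Set E for the first three rows by label; an integer slice such as :3 is not a reliable label slice on a DatetimeIndex
df2.loc[df2.index[:3], 'E'] = 1.0
f_series = {'2021-01-02': 1.0,'2021-01-03': 2.0,'2021-01-04': 3.0,'2021-01-05': 4.0,'2021-01-06': 5.0}
# Convert the string keys to timestamps so that the Series aligns with the DatetimeIndex of df2
f = pd.Series(f_series)
f.index = pd.to_datetime(f.index)
df2['F'] = f
df2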

The table looks like this:

	A	B	C	D	E	F
2021-01-01	0.270961	-0.405463	0.348373	0.828572	1	nan
2021-01-02	0.696541	0.136352	-1.64592	-0.69841	1	1
2021-01-03	0.325415	-0.602236	-0.134508	1.28121	1	2
2021-01-04	-0.33032	-1.40384	-0.93809	1.48804	nan	3
2021-01-05	0.348708	1.27175	0.626011	-0.253845	nan	4
2021-01-06	-0.816064	1.30197	0.656281	-1.2718	nan	5

Use the dropna method to remove NaN values. Note: dropna returns a new DataFrame and does not modify the original.

df2.dropna(how='any')

The code above drops every row that contains any missing value.

	A	B	C	D	E	F
2021-01-02	0.696541	0.136352	-1.64592	-0.69841	1	1
2021-01-03	0.325415	-0.602236	-0.134508	1.28121	1	2

Use fillna to fill in NaN values.

df2.fillna(df2.mean())

The code above fills each gap with the mean of its column. Like dropna, fillna does not update the original DataFrame; to update it, assign the result back with df2 = df2.fillna(df2.mean()).

The table looks like this:

	A	B	C	D	E	F
2021-01-01	0.270961	-0.405463	0.348373	0.828572	1	3
2021-01-02	0.696541	0.136352	-1.64592	-0.69841	1	1
2021-01-03	0.325415	-0.602236	-0.134508	1.28121	1	2
2021-01-04	-0.33032	-1.40384	-0.93809	1.48804	1	3
2021-01-05	0.348708	1.27175	0.626011	-0.253845	1	4
2021-01-06	-0.816064	1.30197	0.656281	-1.2718	1	5
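The difference between how='any' and how='all', and the fact that neither method changes the original, can be seen on a tiny table. A sketch with hand-made data; `gaps` and the other names are illustrative.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'E': [1.0, 1.0, np.nan, np.nan],
                     'F': [np.nan, 1.0, 2.0, np.nan]})

any_dropped = gaps.dropna(how='any')   # drop rows with at least one NaN
all_dropped = gaps.dropna(how='all')   # drop only rows where every value is NaN
filled = gaps.fillna(gaps.mean())      # column means: E = 1.0, F = 1.5

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(filled)
print(int(gaps.isna().sum().sum()))    # the original still holds its NaN values
```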

agg is short for Aggregate.

Common aggregation methods are:

mean(): Compute mean of groups
sum(): Compute sum of group values
size(): Compute group sizes
count(): Compute count of group
std(): Standard deviation of groups
var(): Compute variance of groups
sem(): Standard error of the mean of groups
describe(): Generates descriptive statistics
first(): Compute first of group values
last(): Compute last of group values
nth() : Take nth value, or a subset if n is a list
min(): Compute min of group values
max(): Compute max of group values

df.mean()

This returns the mean of each column:

A    0.082540
B    0.049755
C   -0.181309
D    0.228960
dtype: float64

Pass the axis parameter to get the mean of each row instead.

df.mean(axis=1)

Output:

2021-01-01    0.260611
2021-01-02   -0.377860
2021-01-03    0.217470
2021-01-04   -0.296053
2021-01-05    0.498156
2021-01-06   -0.032404
dtype: float64

What if we want several aggregate statistics for one column?
In that case we can call the agg method:

df.agg(['std','mean'])['A']

The result shows the standard deviation std and the mean:


std     0.551412
mean    0.082540
Name: A, dtype: float64

Apply different aggregation functions to different columns:

df.agg({'A':['max','mean'],'B':['mean','std','var']})

The result is shown below. A cell is nan where that statistic was not requested for the column.

	A	B
max	0.696541	nan
mean	0.0825402	0.0497552
std	nan	1.07834
var	nan	1.16281

apply() applies a function along an axis of the DataFrame.
For example, df.apply(np.sum) calls np.sum on each column and returns the sum of each column.

df.apply(np.sum)

The output is:

A    0.495241
B    0.298531
C   -1.087857
D    1.373762
dtype: float64

apply also accepts lambda expressions.

df.apply(lambda n: n*2)

	A	B	C	D
2021-01-01	0.541923	-0.810925	0.696747	1.65714
2021-01-02	1.39308	0.272704	-3.29185	-1.39682
2021-01-03	0.65083	-1.20447	-0.269016	2.56242
2021-01-04	-0.66064	-2.80768	-1.87618	2.97607
2021-01-05	0.697417	2.5435	1.25202	-0.50769
2021-01-06	-1.63213	2.60393	1.31256	-2.5436
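By default apply hands each column to the function; with axis=1 it hands over each row. A minimal sketch on a hand-made table, with illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Default (axis=0): the lambda receives one column at a time
column_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives one row at a time
row_total = df.apply(lambda row: row['A'] + row['B'], axis=1)

print(column_range)
print(row_total.tolist())
```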

The value_counts method counts how often each value occurs.
Let's generate some integer data so that some values repeat.

np.random.seed(101)
df3 = pd.DataFrame(np.random.randint(0,9,size = (6,4)),columns=list('ABCD'))
df3

	A	B	C	D
0	1	6	7	8
1	4	8	5	0
2	5	8	1	3
3	8	3	3	2
4	8	3	7	0
5	7	8	4	3

Call the value_counts() method.

df3['A'].value_counts()

From the output we can see that the number 8 appears twice in column A and every other number appears once.

8    2
7    1
5    1
4    1
1    1
Name: A, dtype: int64
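Counts can also be expressed as shares of the total. A short sketch using the values of column A from the table above and the `normalize` parameter of `value_counts`:

```python
import pandas as pd

a = pd.Series([1, 4, 5, 8, 8, 7])

counts = a.value_counts()                 # absolute counts, most frequent first
shares = a.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts.loc[8])
print(round(shares.loc[8], 4))
```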

Pandas has built-in string handling methods.

names = pd.Series(['andrew','bobo','claire','david','4'])
names.str.upper()

The code above converts every string in the Series to upper case.

0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         4
dtype: object

Capitalize the first letter:

names.str.capitalize()

Output:

0    Andrew
1      Bobo
2    Claire
3     David
4         4
dtype: object

Check whether each string consists only of digits:

names.str.isdigit()

Output:

0    False
1    False
2    False
3    False
4     True
dtype: bool

Split strings:

tech_finance = ['GOOG,APPL,AMZN','JPM,BAC,GS']
tickers = pd.Series(tech_finance)
tickers.str.split(',').str[0:2]

Splitting on commas and keeping the first two items of each list gives:

0    [GOOG, APPL]
1      [JPM, BAC]
dtype: object
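When the pieces should become separate columns rather than lists, `str.split` accepts `expand=True`. A minimal sketch with the same ticker strings:

```python
import pandas as pd

tickers = pd.Series(['GOOG,APPL,AMZN', 'JPM,BAC,GS'])

as_lists = tickers.str.split(',')                # one list per row
as_columns = tickers.str.split(',', expand=True) # one column per piece
first_item = as_lists.str[0]                     # first element of each list

print(as_columns)
print(first_item.tolist())
```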

concat concatenates datasets. First prepare the data.

data_one = {'Col1': ['A0', 'A1', 'A2', 'A3'],'Col2': ['B0', 'B1', 'B2', 'B3']}
data_two = {'Col1': ['C0', 'C1', 'C2', 'C3'], 'Col2': ['D0', 'D1', 'D2', 'D3']}
one = pd.DataFrame(data_one)
two = pd.DataFrame(data_two)

Use concat to concatenate the two datasets.

pd.concat([one, two])

The resulting table:

	Col1	Col2
0	A0	B0
1	A1	B1
2	A2	B2
3	A3	B3
0	C0	D0
1	C1	D1
2	C2	D2
3	C3	D3
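The result above keeps the original row labels, so 0 to 3 appear twice. A short sketch of two common variations, assuming the same `one` and `two` tables: `ignore_index=True` renumbers the rows, and `axis=1` places the tables side by side.

```python
import pandas as pd

one = pd.DataFrame({'Col1': ['A0', 'A1', 'A2', 'A3'], 'Col2': ['B0', 'B1', 'B2', 'B3']})
two = pd.DataFrame({'Col1': ['C0', 'C1', 'C2', 'C3'], 'Col2': ['D0', 'D1', 'D2', 'D3']})

renumbered = pd.concat([one, two], ignore_index=True)  # rows labeled 0..7
side_by_side = pd.concat([one, two], axis=1)           # 4 rows, 4 columns

print(renumbered.index.tolist())
print(side_by_side.shape)
```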

merge is the counterpart of a SQL join: it combines two datasets through some relationship.

registrations = pd.DataFrame({'reg_id':[1,2,3,4],'name':['Andrew','Bobo','Claire','David']})
logins = pd.DataFrame({'log_id':[1,2,3,4],'name':['Xavier','Andrew','Yolanda','Bobo']})

We join the two tables on name, using an outer join.

pd.merge(left=registrations, right=logins, how='outer',on='name')

The result is:

	reg_id	name	log_id
0	1	Andrew	2
1	2	Bobo	4
2	3	Claire	nan
3	4	David	nan
4	nan	Xavier	1
5	nan	Yolanda	3

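Note that how accepts one of four join types: {'left', 'right', 'outer', 'inner'}. It controls which unmatched rows are kept. For example, left keeps every row of the left table; where a row has no match in the right table, the right-hand columns are filled with nan.
Put simply, treat the left table and the right table as two sets.

left takes the entire left table plus the intersection of the two tables
right takes the entire right table plus the intersection of the two tables
outer takes the union of the two tables
inner takes the intersection of the two tables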
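The four join types can be compared by counting the rows each one returns. A sketch using the same `registrations` and `logins` tables; the `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

# Number of rows produced by each join type
row_counts = {how: len(pd.merge(registrations, logins, how=how, on='name'))
              for how in ['left', 'right', 'outer', 'inner']}

# _merge is 'both', 'left_only', or 'right_only' for every row
flagged = pd.merge(registrations, logins, how='outer', on='name', indicator=True)
matched = int((flagged['_merge'] == 'both').sum())

print(row_counts)
print(matched)
```

Only Andrew and Bobo appear in both tables, which is why the inner join has two rows and the outer join has six.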

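Grouping in Pandas is very similar to the SQL statement SELECT Column1, Column2, mean(Column3), sum(Column4) FROM SomeTable GROUP BY Column1, Column2. If you have never used SQL, don't worry: grouping simply splits the table by the values of a column, computes statistics for each piece, and combines the results.

Prepare the data.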

np.random.seed(20201212)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

As we can see, columns A and B contain many repeated values, so we can group by foo/bar or by one/two.

	A	B	C	D
0	foo	one	0.270961	0.325415
1	bar	one	-0.405463	-0.602236
2	foo	two	0.348373	-0.134508
3	bar	three	0.828572	1.28121
4	foo	two	0.696541	-0.33032
5	bar	two	0.136352	-1.40384
6	foo	one	-1.64592	-0.93809
7	foo	three	-0.69841	1.48804

We use the groupby method to group the data in the table above.

df.groupby('A')

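Running the code above shows that groupby returns an object of type DataFrameGroupBy. We cannot inspect it directly; we need to apply an aggregation function (see section 4.1 of this article).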

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014C6742E248>

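Let's try the aggregation function sum.

# numeric_only=True keeps the text column B out of the sum
df.groupby('A').sum(numeric_only=True)

The table looks like this: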

A	C	D
bar	0.559461	-0.724868
foo	-1.02846	0.410533
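The split, apply, combine idea can be spelled out by iterating over the groups by hand. A minimal sketch on a small hand-made table; `manual` reproduces what the built-in `sum` does for column C.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# Split: groupby yields (key, sub-table) pairs
# Apply: sum column C of each sub-table
# Combine: collect the results in a dictionary
manual = {key: group['C'].sum() for key, group in df.groupby('A')}

built_in = df.groupby('A')['C'].sum()

print(manual)
print(built_in)
```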

groupby also accepts several columns at once.

df.groupby(['A', 'B']).sum()

The grouped result is:

A	B	C	D
bar	one	-0.405463	-0.602236
	three	0.828572	1.28121
	two	0.136352	-1.40384
foo	one	-1.37496	-0.612675
	three	-0.69841	1.48804
	two	1.04491	-0.464828

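We use agg() and pass it a list of aggregation methods. The code below groups by A and aggregates only column C.

df.groupby('A')['C'].agg(['sum', 'mean', 'std'])

The results of each aggregation function for the bar and foo groups are: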

A	sum	mean	std
bar	0.559461	0.186487	0.618543
foo	-1.02846	-0.205692	0.957242

The code below applies a different aggregation to each of columns C and D: a sum for C and a standard deviation for D.

df.groupby('A').agg({'C': 'sum', 'D': lambda x: np.std(x, ddof=1)})

Output:

A	C	D
bar	0.559461	1.37837
foo	-1.02846	0.907422
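The same per-column aggregation can also be written with named aggregation, which lets the result columns carry descriptive names. A sketch on a small hand-made table; the output names `c_sum` and `d_std` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [1.0, 1.0, 3.0, 5.0]})

# Each keyword is an output column: (input column, aggregation)
summary = df.groupby('A').agg(c_sum=('C', 'sum'), d_std=('D', 'std'))

print(summary)
```

Here 'std' is the sample standard deviation (ddof=1), the same convention as the lambda above.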

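For more on the Pandas groupby method, see the official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

reshape means reshaping a table. For complex tables, we need to convert them into a form that is easier to understand, for example by grouping on certain attributes and computing statistics separately.

The stack method splits a table into an index part and a data part. The index columns are kept, and the data is stacked.

Prepare the data.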

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

The code above creates a compound index (MultiIndex).

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

Create a DataFrame that uses the compound index.

np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

Output:

first	second	A	B
bar	one	0.270961	-0.405463
	two	0.348373	0.828572
baz	one	0.696541	0.136352
	two	-1.64592	-0.69841
foo	one	0.325415	-0.602236
	two	-0.134508	1.28121
qux	one	-0.33032	-1.40384
	two	-0.93809	1.48804

Run the stack method.

stacked = df.stack()
stacked

The stacked (compressed) table is shown below. Note: the output you get in Jupyter Notebook/Lab may look slightly different; the layout below has been adjusted so that it displays well in Markdown.

first  second   
bar    one     A    0.270961
bar    one     B   -0.405463
bar    two     A    0.348373
bar    two     B    0.828572
baz    one     A    0.696541
baz    one     B    0.136352
baz    two     A   -1.64592
baz    two     B   -0.69841
foo    one     A    0.325415
foo    one     B   -0.602236
foo    two     A   -0.134508
foo    two     B    1.28121
qux    one     A   -0.33032
qux    one     B   -1.40384
qux    two     A   -0.93809
qux    two     B    1.48804
dtype: float64

Run unstack to expand the data again.

stacked.unstack()

This returns the original table.

first	second	A	B
bar	one	0.270961	-0.405463
	two	0.348373	0.828572
baz	one	0.696541	0.136352
	two	-1.64592	-0.69841
foo	one	0.325415	-0.602236
	two	-0.134508	1.28121
qux	one	-0.33032	-1.40384
	two	-0.93809	1.48804

Now add the level parameter.

stacked.unstack(level=0)
#stacked.unstack(level=1)

With level=0 we get the output below. Try level=1 yourself to see what it produces.

second	first	bar	baz	foo	qux
one	A	0.270961	0.696541	0.325415	-0.33032
one	B	-0.405463	0.136352	-0.602236	-1.40384
two	A	0.348373	-1.64592	-0.134508	-0.93809
two	B	0.828572	-0.69841	1.28121	1.48804
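That stack and unstack undo each other can be checked on a small table. A minimal sketch with integer data and a two-level index built with `MultiIndex.from_product`, used here only to keep the example short:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

stacked = df.stack()                    # columns A/B become a third index level
restored = stacked.unstack()            # the innermost level moves back to columns
by_first = stacked.unstack(level=0)     # 'first' becomes the columns instead

print(stacked.shape)
print(restored.equals(df))
print(by_first.columns.tolist())
```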

pivot_table creates a pivot table, a table format that rearranges data dynamically and summarizes it by category.

Generate a DataFrame without index columns.

np.random.seed(99)
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                    'B': ['A', 'B', 'C'] * 4,
                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D': np.random.randn(12),
                    'E': np.random.randn(12)})
df

The table looks like this:

	A	B	C	D	E
0	one	A	foo	-0.142359	0.0235001
1	one	B	foo	2.05722	0.456201
2	two	C	foo	0.283262	0.270493
3	three	A	bar	1.32981	-1.43501
4	one	B	bar	-0.154622	0.882817
5	one	C	bar	-0.0690309	-0.580082
6	two	A	foo	0.75518	-0.501565
7	three	B	foo	0.825647	0.590953
8	one	C	foo	-0.113069	-0.731616
9	one	A	bar	-2.36784	0.261755
10	two	B	bar	-0.167049	-0.855796
11	three	C	bar	0.685398	-0.187526

Looking at the data, columns A, B, and C clearly act as attributes. Now run pivot_table.

pd.pivot_table(df, values=['D','E'], index=['A', 'B'], columns=['C'])

The code above uses columns D and E as the value columns, A and B as a compound row index, and the values of C as the column index. Rows that share the same combination are aggregated with the mean, which is the default aggfunc; combinations that do not occur show nan.

	('D', 'bar')	('D', 'foo')	('E', 'bar')	('E', 'foo')
('one', 'A')	-2.36784	-0.142359	0.261755	0.0235001
('one', 'B')	-0.154622	2.05722	0.882817	0.456201
('one', 'C')	-0.0690309	-0.113069	-0.580082	-0.731616
('three', 'A')	1.32981	nan	-1.43501	nan
('three', 'B')	nan	0.825647	nan	0.590953
('three', 'C')	0.685398	nan	-0.187526	nan
('two', 'A')	nan	0.75518	nan	-0.501565
('two', 'B')	-0.167049	nan	-0.855796	nan
('two', 'C')	nan	0.283262	nan	0.270493
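A pivot table is closely related to grouping followed by unstack. The sketch below uses a tiny hand-made table in which one combination occurs twice, so the effect of the aggregation is visible; it assumes `aggfunc='mean'`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['foo', 'bar', 'foo', 'foo'],
                   'D': [1.0, 2.0, 3.0, 5.0]})

# ('two', 'foo') occurs twice, so its cell holds the mean of 3.0 and 5.0
table = pd.pivot_table(df, values='D', index='A', columns='C', aggfunc='mean')

# The same numbers via groupby: group on both keys, then move C to the columns
same = df.groupby(['A', 'C'])['D'].mean().unstack()

print(table)
```

The combination ('two', 'bar') never occurs, so that cell is NaN in both results.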

date_range is the Pandas function for generating evenly spaced dates. Run the code below:

rng = pd.date_range('1/1/2021', periods=100, freq='S')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

date_range starts at 00:00:00 on January 1, 2021 and generates 100 timestamps at 1-second intervals. The output is:

2021-01-01 00:00:00    475
2021-01-01 00:00:01    145
2021-01-01 00:00:02     13
2021-01-01 00:00:03    240
2021-01-01 00:00:04    183
                      ... 
2021-01-01 00:01:35    413
2021-01-01 00:01:36    330
2021-01-01 00:01:37    272
2021-01-01 00:01:38    304
2021-01-01 00:01:39    151
Freq: S, Length: 100, dtype: int32

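Now change freq from S (second) to M (month), which produces month-end dates. Newer pandas releases (2.2 and later) prefer the aliases 's' and 'ME' and emit a deprecation warning for 'S' and 'M'.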

rng = pd.date_range('1/1/2021', periods=100, freq='M')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

Output:

2021-01-31    311
2021-02-28    256
2021-03-31    327
2021-04-30    151
2021-05-31    484
             ... 
2028-12-31    170
2029-01-31    492
2029-02-28    205
2029-03-31     90
2029-04-30    446
Freq: M, Length: 100, dtype: int32

We can also generate dates with a quarterly frequency.

prng = pd.period_range('2018Q1', '2020Q4', freq='Q-NOV')
pd.Series(np.random.randn(len(prng)), prng)

This outputs every quarter from the first quarter of 2018 through the fourth quarter of 2020.

2018Q1    0.833025
2018Q2   -0.509514
2018Q3   -0.735542
2018Q4   -0.224403
2019Q1   -0.119709
2019Q2   -1.379413
2019Q3    0.871741
2019Q4    0.877493
2020Q1    0.577611
2020Q2   -0.365737
2020Q3   -0.473404
2020Q4    0.529800
Freq: Q-NOV, dtype: float64
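A range can be specified either by a number of periods or by an end date. A short sketch using the daily alias 'D' and a multiple of it ('7D'), both of which are stable across pandas versions:

```python
import pandas as pd

by_count = pd.date_range('2021-01-01', periods=6, freq='D')      # start + count
by_end = pd.date_range('2021-01-01', '2021-01-06', freq='D')     # start + end
every_7_days = pd.date_range('2021-01-01', periods=3, freq='7D') # step of 7 days

print(by_count.equals(by_end))
print([str(ts.date()) for ts in every_7_days])
```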

Pandas has a special data type called "categorical", written dtype="category". Setting a column to this type lets us classify its values.

Prepare the data.

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df

	id	raw_grade
0	1	a
1	2	b
2	3	b
3	4	a
4	5	a
5	6	e

Add a new column grade and set its data type to category.

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

We can see that the grade column has only 3 distinct values: a, b, e.

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

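Rename a, b, e, in that order, to very good, good, very bad.

# Assigning to .cat.categories is not supported in recent pandas; rename_categories returns the renamed column
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

The table now looks like this: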

	id	raw_grade	grade
0	1	a	very good
1	2	b	good
2	3	b	good
3	4	a	very good
4	5	a	very good
5	6	e	very bad

Sort the table. Sorting follows the order of the categories, not alphabetical order:

df.sort_values(by="grade", ascending=False)

	id	raw_grade	grade
5	6	e	very bad
1	2	b	good
2	3	b	good
0	1	a	very good
3	4	a	very good
4	5	a	very good

Count the rows in each category:

df.groupby("grade").size()

The code above outputs:

grade
very good    3
good         2
very bad     1
dtype: int64
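The steps of this section can be condensed into one self-contained sketch. It uses `rename_categories` for the renaming step and checks that sorting follows the category order rather than the alphabet.

```python
import pandas as pd

raw = pd.Series(['a', 'b', 'b', 'a', 'a', 'e'])

# Categories are inferred as a, b, e and then renamed in that order
grade = raw.astype('category').cat.rename_categories(['very good', 'good', 'very bad'])

category_order = grade.cat.categories.tolist()
sorted_desc = grade.sort_values(ascending=False).tolist()

print(category_order)
print(sorted_desc)
```

Alphabetically, "very good" would sort after "very bad"; with a categorical column it comes first because it is the first category.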

Pandas can read and write data directly from files in formats such as CSV, JSON, and Excel. The supported formats are listed below.

Format Type	Data Description	Reader	Writer
text	CSV	read_csv	to_csv
text	Fixed-Width Text File	read_fwf
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	Local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	OpenDocument	read_excel
binary	HDF5 Format	read_hdf	to_hdf
binary	Feather Format	read_feather	to_feather
binary	Parquet Format	read_parquet	to_parquet
binary	ORC Format	read_orc
binary	Msgpack (removed in pandas 1.0)	read_msgpack	to_msgpack
binary	Stata	read_stata	to_stata
binary	SAS	read_sas
binary	SPSS	read_spss
binary	Python Pickle Format	read_pickle	to_pickle
SQL	SQL	read_sql	to_sql
SQL	Google BigQuery	read_gbq	to_gbq

We use only CSV as the example here; for the other formats, see the table above.

Import data from a CSV file. The domain name in the URL below is not important.

df = pd.read_csv("http://blog.caiyongji.com/assets/housing.csv")

View the first 5 rows:

df.head(5)

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41	880	129	322	126	8.3252	452600	NEAR BAY
1	-122.22	37.86	21	7099	1106	2401	1138	8.3014	358500	NEAR BAY
2	-122.24	37.85	52	1467	190	496	177	7.2574	352100	NEAR BAY
3	-122.25	37.85	52	1274	235	558	219	5.6431	341300	NEAR BAY
4	-122.25	37.85	52	1627	280	565	259	3.8462	342200	NEAR BAY
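Writing works the same way in the other direction. The sketch below round-trips a tiny table through CSV using an in-memory buffer, so it runs without a network connection or a file on disk; the two-row table is a hand-made stand-in for the housing data.

```python
import io

import pandas as pd

df = pd.DataFrame({'longitude': [-122.23, -122.22],
                   'ocean_proximity': ['NEAR BAY', 'NEAR BAY']})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False leaves the row labels out of the file
buffer.seek(0)                   # rewind before reading

loaded = pd.read_csv(buffer)

print(loaded)
```

With a real file, a path such as "housing.csv" (a hypothetical local file name) would take the place of the buffer in both calls.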

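Pandas supports matplotlib, a powerful Python visualization tool. This section only briefly introduces the plotting methods Pandas provides; the next article will cover matplotlib in detail. To avoid missing updates, feel free to follow me.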

np.random.seed(999)
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

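Call the plot method directly to display the data.
There are two things to note here:

This plot method is called through Pandas rather than directly through matplotlib.
Python statements do not need a terminating semicolon. Here, the trailing semicolon suppresses the text representation of the object returned by plot, so the notebook shows only the rendered image.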

df.plot();

df.plot.bar();

df.plot.bar(stacked=True);

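The next article will cover matplotlib. You are welcome to follow the Machine Learning Prerequisites tutorial series, or my personal blog http://blog.caiyongji.com/, which is updated in parallel.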

Posted in: Machine Learning

2020-12-13

Contents

I. Series and DataFrame
1. pandas.Series
2. pandas.DataFrame
II. Common Pandas Usage
1. Accessing data
1.1 head() and tail()
1.2 describe()
1.3 T
1.4 sort_values()
1.5 nlargest()
1.6 sample()
2. Selecting data
2.1 Selecting by label
2.2 Selecting by position
2.3 Boolean indexing
3. Handling missing values
3.1 dropna()
3.2 fillna()
4. Operations
4.1 agg()
4.2 apply()
4.3 value_counts()
4.4 str
5. Merging
5.1 concat()
5.2 merge()
6. Grouping (GroupBy)
6.1 Grouping by a single column
6.2 Grouping by multiple columns
6.3 Applying multiple aggregations
6.4 Different aggregations for different columns
6.5 More
III. Advanced Pandas Usage
1. reshape
1.1 stack() and unstack()
1.2 pivot_table()
2. Time series
3. Categoricals
4. IO
5. Plotting
IV. More