Introduction
Data analysis relies on a wide range of statistical methods. This article walks through the statistical methods Pandas provides.
Percent change
Both Series and DataFrame have a pct_change() method for computing the percent change of the data over a given number of periods; missing values can be forward-filled before the change is computed.
```python
ser = pd.Series(np.random.randn(8))

ser.pct_change()
Out[45]:
0         NaN
1   -1.264716
2    4.125006
3   -1.159092
4   -0.091292
5    4.837752
6   -1.182146
7   -8.721482
dtype: float64

ser
Out[46]:
0   -0.950515
1    0.251617
2    1.289537
3   -0.205155
4   -0.186426
5   -1.088310
6    0.198231
7   -1.530635
dtype: float64
```
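To make the NaN handling concrete, here is a minimal sketch with invented values. Older pandas forward-filled gaps implicitly via fill_method='pad'; recent versions deprecate that argument, so an explicit ffill() is the portable spelling:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, 4.0])

# Forward-fill the gap, then compute the change
# (older pandas did this implicitly with fill_method="pad").
s.ffill().pct_change()
# 0    NaN
# 1    0.0   <- the NaN became 1.0, so no change
# 2    1.0   <- 2.0 vs 1.0
# 3    1.0   <- 4.0 vs 2.0
```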
pct_change also takes a periods parameter, which sets how many elements apart the percent change is computed:
```python
In [3]: df = pd.DataFrame(np.random.randn(10, 4))

In [4]: df.pct_change(periods=3)
Out[4]:
          0         1         2         3
0       NaN       NaN       NaN       NaN
1       NaN       NaN       NaN       NaN
2       NaN       NaN       NaN       NaN
3 -0.218320 -1.054001  1.987147 -0.510183
4 -0.439121 -1.816454  0.649715 -4.822809
5 -0.127833 -3.042065 -5.866604 -1.776977
6 -2.596833 -1.959538 -2.111697 -3.798900
7 -0.117826 -2.169058  0.036094 -0.067696
8  2.492606 -1.357320 -1.205802 -1.558697
9 -1.012977  2.324558 -1.003744 -0.371806
```
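For intuition, pct_change(periods=3) should behave like dividing by a 3-row shift; a sanity-check sketch, assuming the frame has no missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4))

# (x_t - x_{t-3}) / x_{t-3} == x_t / x_{t-3} - 1
manual = df / df.shift(3) - 1
print(manual.round(10).equals(df.pct_change(periods=3).round(10)))  # True
```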
Covariance
Series.cov() computes the covariance between two Series, ignoring NaN values.
```python
In [5]: s1 = pd.Series(np.random.randn(1000))

In [6]: s2 = pd.Series(np.random.randn(1000))

In [7]: s1.cov(s2)
Out[7]: 0.0006801088174310875
```
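As a sanity check (a sketch, not pandas internals), s1.cov(s2) matches the unbiased sample covariance formula with denominator n - 1:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.random.randn(1000))
s2 = pd.Series(np.random.randn(1000))

# sum((x - mean(x)) * (y - mean(y))) / (n - 1)
manual = ((s1 - s1.mean()) * (s2 - s2.mean())).sum() / (len(s1) - 1)
print(np.isclose(manual, s1.cov(s2)))  # True
```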
Likewise, DataFrame.cov() computes the pairwise covariances between the DataFrame's columns, also ignoring NaN values.
```python
In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [9]: frame.cov()
Out[9]:
          a         b         c         d         e
a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
b -0.003177  1.024721  0.000191  0.009212  0.000857
c -0.002698  0.000191  0.950735 -0.031743 -0.005087
d -0.006889  0.009212 -0.031743  1.002983 -0.047952
e  0.031912  0.000857 -0.005087 -0.047952  1.042487
```
DataFrame.cov takes a min_periods parameter specifying the minimum number of observations required for each column pair; pairs with fewer observations yield NaN, which guards against unreliable estimates computed from too few values.
```python
In [10]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [11]: frame.loc[frame.index[:5], "a"] = np.nan

In [12]: frame.loc[frame.index[5:10], "b"] = np.nan

In [13]: frame.cov()
Out[13]:
          a         b         c
a  1.123670 -0.412851  0.018169
b -0.412851  1.154141  0.305260
c  0.018169  0.305260  1.301149

In [14]: frame.cov(min_periods=12)
Out[14]:
          a         b         c
a  1.123670       NaN  0.018169
b       NaN  1.154141  0.305260
c  0.018169  0.305260  1.301149
```
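Why the (a, b) entry turns into NaN: columns a and b only share 10 non-NaN rows, which falls short of min_periods=12. A quick sketch to count the overlap:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
frame.loc[frame.index[:5], "a"] = np.nan
frame.loc[frame.index[5:10], "b"] = np.nan

# "a" is missing in rows 0-4 and "b" in rows 5-9, so the pair
# only overlaps on the last 10 rows -- fewer than min_periods=12.
print(len(frame[["a", "b"]].dropna()))  # 10
```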
Correlation
The corr() method computes correlation coefficients. Three methods are available:
Method | Description |
---|---|
pearson (default) | Standard correlation coefficient |
kendall | Kendall Tau correlation coefficient |
spearman | Spearman rank correlation coefficient |
```python
In [15]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [16]: frame.iloc[::2] = np.nan

# Series with Series
In [17]: frame["a"].corr(frame["b"])
Out[17]: 0.013479040400098775

In [18]: frame["a"].corr(frame["b"], method="spearman")
Out[18]: -0.007289885159540637

# Pairwise correlation of DataFrame columns
In [19]: frame.corr()
Out[19]:
          a         b         c         d         e
a  1.000000  0.013479 -0.049269 -0.042239 -0.028525
b  0.013479  1.000000 -0.020433 -0.011139  0.005654
c -0.049269 -0.020433  1.000000  0.018587 -0.054269
d -0.042239 -0.011139  0.018587  1.000000 -0.017060
e -0.028525  0.005654 -0.054269 -0.017060  1.000000
```
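The difference between the methods is easiest to see on a monotonic but non-linear relationship (a small illustration with made-up data): Pearson measures linear association, while Spearman only looks at ranks.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1.0, 11.0))
y = x ** 3  # strictly increasing, but not linear in x

print(x.corr(y))                     # Pearson: noticeably below 1
print(x.corr(y, method="spearman"))  # Spearman: exactly 1.0
```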
corr also supports min_periods:
```python
In [20]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [21]: frame.loc[frame.index[:5], "a"] = np.nan

In [22]: frame.loc[frame.index[5:10], "b"] = np.nan

In [23]: frame.corr()
Out[23]:
          a         b         c
a  1.000000 -0.121111  0.069544
b -0.121111  1.000000  0.051742
c  0.069544  0.051742  1.000000

In [24]: frame.corr(min_periods=12)
Out[24]:
          a         b         c
a  1.000000       NaN  0.069544
b       NaN  1.000000  0.051742
c  0.069544  0.051742  1.000000
```
corrwith computes the correlation between like-labeled Series drawn from two different DataFrames; labels present in only one of them yield NaN (note index "e" below).
```python
In [27]: index = ["a", "b", "c", "d", "e"]

In [28]: columns = ["one", "two", "three", "four"]

In [29]: df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)

In [30]: df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)

In [31]: df1.corrwith(df2)
Out[31]:
one     -0.125501
two     -0.493244
three    0.344056
four     0.004183
dtype: float64

In [32]: df2.corrwith(df1, axis=1)
Out[32]:
a   -0.675817
b    0.458296
c    0.190809
d   -0.186275
e         NaN
dtype: float64
```
Rank
The rank method ranks the values in a Series. What is a rank? Let's look at an example:
```python
s = pd.Series(np.random.randn(5), index=list("abcde"))

s
Out[51]:
a    0.336259
b    1.073116
c   -0.402291
d    0.624186
e   -0.422478
dtype: float64

s["d"] = s["b"]  # so there's a tie

s
Out[53]:
a    0.336259
b    1.073116
c   -0.402291
d    1.073116
e   -0.422478
dtype: float64

s.rank()
Out[54]:
a    3.0
b    4.5
c    2.0
d    4.5
e    1.0
dtype: float64
```
The Series we created above contains these values, sorted from smallest to largest:

-0.422478 < -0.402291 < 0.336259 < 1.073116 < 1.073116

so the corresponding ranks are 1, 2, 3, 4, 5. Because two of the values are equal, by default each of them receives the average of their ranks, i.e., 4.5.
Besides the default ranking (default_rank below), you can specify method='max' (max_rank below), in which case every value in a tied group is assigned the group's highest rank; in the Series above, both tied values would get 5.
You can also specify na_option='bottom' (NA_bottom below), which ranks NaN values as well, placing them at the bottom, i.e., giving them the largest ranks.
Finally, pct=True (pct_rank below) expresses each rank as a fraction of the number of valid values; for instance, cat's rank of 2.5 out of 4 valid values gives 0.625.
```python
>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN

>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN
```
rank is also a DataFrame method and can rank either down the rows (axis=0) or across the columns (axis=1):
```python
In [36]: df = pd.DataFrame(np.random.randn(10, 6))

In [37]: df[4] = df[2][:5]  # some ties

In [38]: df
Out[38]:
          0         1         2         3         4         5
0 -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650
1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
2  0.401965  1.460840  1.256057  1.308127  1.256057  0.876004
3  0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296
4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196
5 -1.092970 -0.689246  0.908114  0.204848       NaN  0.463347
6  0.376892  0.959292  0.095572 -0.593740       NaN -0.069180
7 -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422
8 -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544
9 -0.250277 -0.237428 -1.056443  0.419477       NaN  1.375064

In [39]: df.rank(1)
Out[39]:
     0    1    2    3    4    5
0  4.0  3.0  1.5  5.0  1.5  6.0
1  2.0  6.0  4.5  1.0  4.5  3.0
2  1.0  6.0  3.5  5.0  3.5  2.0
3  4.0  5.0  1.5  3.0  1.5  6.0
4  5.0  3.0  1.5  4.0  1.5  6.0
5  1.0  2.0  5.0  3.0  NaN  4.0
6  4.0  5.0  3.0  1.0  NaN  2.0
7  2.0  5.0  3.0  4.0  NaN  1.0
8  2.0  5.0  3.0  4.0  NaN  1.0
9  2.0  3.0  1.0  4.0  NaN  5.0
```
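For contrast with df.rank(1) above, a tiny sketch (made-up frame) showing both axes side by side:

```python
import pandas as pd

small = pd.DataFrame({"x": [3.0, 1.0, 2.0], "y": [30.0, 10.0, 20.0]})

print(small.rank())        # axis=0 (default): rank down each column
print(small.rank(axis=1))  # axis=1: rank across each row
```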
This article has been published at http://www.flydean.com/10-python-pandas-statistical/
The most accessible explanations, the most insightful takeaways, the most concise tutorials, and plenty of tricks you never knew about are waiting for you to discover!