Introduction

Data analysis often relies on a range of statistical methods. This article walks through the statistical methods that Pandas provides.

Percent change

Both Series and DataFrame have a pct_change() method that computes the percent change of the data over a given number of periods. NA values can be filled before the change is computed (via the fill_method parameter).

import pandas as pd
import numpy as np

ser = pd.Series(np.random.randn(8))

ser.pct_change()
Out[45]:
0         NaN
1   -1.264716
2    4.125006
3   -1.159092
4   -0.091292
5    4.837752
6   -1.182146
7   -8.721482
dtype: float64

ser
Out[46]:
0   -0.950515
1    0.251617
2    1.289537
3   -0.205155
4   -0.186426
5   -1.088310
6    0.198231
7   -1.530635
dtype: float64
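The transcript above has no missing data, so here is a minimal sketch of the NaN-filling behavior (my own example, not from the original transcript). Recent pandas versions are deprecating the fill_method argument, so the sketch fills the gap explicitly with ffill() before computing the change.

import pandas as pd
import numpy as np

# A small Series with a gap in the middle (hypothetical data)
ser = pd.Series([100.0, np.nan, 110.0, 121.0])

# Forward-fill the gap ourselves, then compute the percent change.
# Older pandas versions do the same thing internally via fill_method="pad".
ser.ffill().pct_change()
# 0    NaN
# 1    0.0   <- the NaN was filled with 100.0, so no change
# 2    0.1
# 3    0.1
# dtype: float64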

pct_change also takes a periods parameter, which specifies the number of periods, i.e. how many elements apart the percent change is computed:

In [3]: df = pd.DataFrame(np.random.randn(10, 4))

In [4]: df.pct_change(periods=3)
Out[4]:
          0         1         2         3
0       NaN       NaN       NaN       NaN
1       NaN       NaN       NaN       NaN
2       NaN       NaN       NaN       NaN
3 -0.218320 -1.054001  1.987147 -0.510183
4 -0.439121 -1.816454  0.649715 -4.822809
5 -0.127833 -3.042065 -5.866604 -1.776977
6 -2.596833 -1.959538 -2.111697 -3.798900
7 -0.117826 -2.169058  0.036094 -0.067696
8  2.492606 -1.357320 -1.205802 -1.558697
9 -1.012977  2.324558 -1.003744 -0.371806

Covariance

Series.cov() computes the covariance between two Series, excluding NaN values.

In [5]: s1 = pd.Series(np.random.randn(1000))

In [6]: s2 = pd.Series(np.random.randn(1000))

In [7]: s1.cov(s2)
Out[7]: 0.0006801088174310875
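To see the NaN handling concretely, here is a small sketch (my own data, not from the transcript): any index where either Series is missing is simply dropped before the covariance is computed.

import pandas as pd
import numpy as np

a = pd.Series([1.0, 2.0, np.nan, 3.0])
b = pd.Series([1.0, 2.0, 10.0, 3.0])

# Index 2 is dropped because a is NaN there, so the 10.0 in b never
# enters the calculation; only the pairs (1,1), (2,2), (3,3) are used.
a.cov(b)
# 1.0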

Similarly, DataFrame.cov() computes the pairwise covariances between the columns of the DataFrame, again excluding NaN values.

In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [9]: frame.cov()
Out[9]:
          a         b         c         d         e
a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
b -0.003177  1.024721  0.000191  0.009212  0.000857
c -0.002698  0.000191  0.950735 -0.031743 -0.005087
d -0.006889  0.009212 -0.031743  1.002983 -0.047952
e  0.031912  0.000857 -0.005087 -0.047952  1.042487

DataFrame.cov also supports a min_periods parameter, which specifies the minimum number of observations required for each column pair; pairs with too few data points return NaN instead of an unreliable estimate.

In [10]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [11]: frame.loc[frame.index[:5], "a"] = np.nan

In [12]: frame.loc[frame.index[5:10], "b"] = np.nan

In [13]: frame.cov()
Out[13]:
          a         b         c
a  1.123670 -0.412851  0.018169
b -0.412851  1.154141  0.305260
c  0.018169  0.305260  1.301149

In [14]: frame.cov(min_periods=12)
Out[14]:
          a         b         c
a  1.123670       NaN  0.018169
b       NaN  1.154141  0.305260
c  0.018169  0.305260  1.301149

Correlation

The corr() method computes correlation coefficients. Three correlation methods are supported:

Method name          Description
pearson (default)    standard correlation coefficient
kendall              Kendall Tau correlation coefficient
spearman             Spearman rank correlation coefficient
In [15]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [16]: frame.iloc[::2] = np.nan

# Series with Series
In [17]: frame["a"].corr(frame["b"])
Out[17]: 0.013479040400098775

In [18]: frame["a"].corr(frame["b"], method="spearman")
Out[18]: -0.007289885159540637

# Pairwise correlation of DataFrame columns
In [19]: frame.corr()
Out[19]:
          a         b         c         d         e
a  1.000000  0.013479 -0.049269 -0.042239 -0.028525
b  0.013479  1.000000 -0.020433 -0.011139  0.005654
c -0.049269 -0.020433  1.000000  0.018587 -0.054269
d -0.042239 -0.011139  0.018587  1.000000 -0.017060
e -0.028525  0.005654 -0.054269 -0.017060  1.000000
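The transcript shows pearson (the default) and spearman; for completeness, here is a minimal sketch of the kendall method with hypothetical data (under the hood pandas uses scipy for this one, so scipy needs to be installed):

import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([1, 3, 2, 4, 5])

# Kendall's tau compares concordant vs. discordant pairs:
# 9 concordant and 1 discordant out of 10 pairs -> (9 - 1) / 10 = 0.8
x.corr(y, method="kendall")
# 0.8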

corr likewise supports min_periods:

In [20]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])

In [21]: frame.loc[frame.index[:5], "a"] = np.nan

In [22]: frame.loc[frame.index[5:10], "b"] = np.nan

In [23]: frame.corr()
Out[23]:
          a         b         c
a  1.000000 -0.121111  0.069544
b -0.121111  1.000000  0.051742
c  0.069544  0.051742  1.000000

In [24]: frame.corr(min_periods=12)
Out[24]:
          a         b         c
a  1.000000       NaN  0.069544
b       NaN  1.000000  0.051742
c  0.069544  0.051742  1.000000

corrwith computes the correlations between the matching columns (or rows, with axis=1) of two different DataFrames.

In [27]: index = ["a", "b", "c", "d", "e"]

In [28]: columns = ["one", "two", "three", "four"]

In [29]: df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)

In [30]: df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)

In [31]: df1.corrwith(df2)
Out[31]:
one     -0.125501
two     -0.493244
three    0.344056
four     0.004183
dtype: float64

In [32]: df2.corrwith(df1, axis=1)
Out[32]:
a   -0.675817
b    0.458296
c    0.190809
d   -0.186275
e         NaN
dtype: float64

Rank

The rank method assigns a rank to each value in a Series. What exactly is a rank? Let's look at an example:

s = pd.Series(np.random.randn(5), index=list("abcde"))

s
Out[51]:
a    0.336259
b    1.073116
c   -0.402291
d    0.624186
e   -0.422478
dtype: float64

s["d"] = s["b"]  # so there's a tie

s
Out[53]:
a    0.336259
b    1.073116
c   -0.402291
d    1.073116
e   -0.422478
dtype: float64

s.rank()
Out[54]:
a    3.0
b    4.5
c    2.0
d    4.5
e    1.0
dtype: float64

Above we created a Series; sorted from smallest to largest, its values are:

-0.422478 < -0.402291 < 0.336259 < 1.073116 = 1.073116

So the corresponding ranks would be 1, 2, 3, 4, 5.

Because two of the values are equal, by default they receive the average of their ranks, i.e. 4.5.

Besides the default ranking (the default_rank column in the example below), you can pass method='max' (max_rank), so that tied values all take the largest rank of their group; for the Series s above, both tied values would get 5, as shown in the sketch below.

You can also pass na_option='bottom' (NA_bottom), which includes NaN values in the ranking and places them at the bottom, i.e. they receive the largest rank.

Finally, pct=True (pct_rank) expresses each rank as a fraction of the number of values, i.e. a value between 0 and 1.
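Applied to the Series s from the earlier example, a quick sketch of what method='max' and pct=True would produce (the ranks are worked out by hand from the values shown above, not taken from the original transcript):

s.rank(method="max")
# a    3.0
# b    5.0   <- both tied values take the largest rank of the group
# c    2.0
# d    5.0
# e    1.0
# dtype: float64

s.rank(pct=True)
# a    0.6
# b    0.9
# c    0.4
# d    0.9
# e    0.2
# dtype: float64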

>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN

>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN

rank is also available on DataFrames, where it can rank over the rows (axis=0) or over the columns (axis=1).

In [36]: df = pd.DataFrame(np.random.randn(10, 6))

In [37]: df[4] = df[2][:5]  # some ties

In [38]: df
Out[38]:
          0         1         2         3         4         5
0 -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650
1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
2  0.401965  1.460840  1.256057  1.308127  1.256057  0.876004
3  0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296
4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196
5 -1.092970 -0.689246  0.908114  0.204848       NaN  0.463347
6  0.376892  0.959292  0.095572 -0.593740       NaN -0.069180
7 -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422
8 -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544
9 -0.250277 -0.237428 -1.056443  0.419477       NaN  1.375064

In [39]: df.rank(1)
Out[39]:
     0    1    2    3    4    5
0  4.0  3.0  1.5  5.0  1.5  6.0
1  2.0  6.0  4.5  1.0  4.5  3.0
2  1.0  6.0  3.5  5.0  3.5  2.0
3  4.0  5.0  1.5  3.0  1.5  6.0
4  5.0  3.0  1.5  4.0  1.5  6.0
5  1.0  2.0  5.0  3.0  NaN  4.0
6  4.0  5.0  3.0  1.0  NaN  2.0
7  2.0  5.0  3.0  4.0  NaN  1.0
8  2.0  5.0  3.0  4.0  NaN  1.0
9  2.0  3.0  1.0  4.0  NaN  5.0
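To make the axis argument concrete, here is a minimal sketch with a small hypothetical frame (my own data, not from the article): axis=0 ranks each column down its rows, while axis=1 ranks each row across its columns.

import pandas as pd

small = pd.DataFrame({"x": [3, 1, 2], "y": [9, 7, 8]})

small.rank(axis=0)   # rank within each column
#      x    y
# 0  3.0  3.0
# 1  1.0  1.0
# 2  2.0  2.0

small.rank(axis=1)   # rank within each row
#      x    y
# 0  1.0  2.0
# 1  1.0  2.0
# 2  1.0  2.0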

This article is also published at http://www.flydean.com/10-python-pandas-statistical/

The most accessible explanations, the most in-depth takeaways, the most concise tutorials, and plenty of tricks you never knew about are waiting for you to discover!