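Introduction

When processing data, pandas uses NaN to represent values that are missing or could not be parsed. Every entry then has some representation, but NaN cannot take part in ordinary arithmetic.

This article explains how pandas handles NaN data.

A NaN example

As noted above, missing data is represented as NaN. Let's look at a concrete example.

First we build a DataFrame (the sessions below assume import numpy as np and import pandas as pd):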

In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

The DataFrame above has only the index labels a, c, e, f and h. Let's reindex it with additional labels:

In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True

The labels that have no data are filled with NaN.

To check whether values are NaN, use the isna() or notna() methods.

In [7]: df2['one']
Out[7]:
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2['one'])
Out[8]:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2['four'].notna()
Out[9]:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool
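The same checks can be sketched on a small hand-made DataFrame. The values and the names missing_mask, missing_per_column and rows_with_any_missing are illustrative and do not come from the session above. The sketch shows how isna() combines with sum() and any() to count and locate missing values:

```python
import pandas as pd

# Illustrative data, not the article's random values.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "c", "e"],
)

# Reindexing with labels that do not exist inserts all-NaN rows.
df2 = df.reindex(["a", "b", "c", "d", "e"])

missing_mask = df2["one"].isna()                      # True where missing
missing_per_column = df2.isna().sum()                 # count per column
rows_with_any_missing = df2[df2.isna().any(axis=1)]   # rows with a gap

print(missing_mask.tolist())
print(missing_per_column.to_dict())
print(rows_with_any_missing.index.tolist())
```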

Note that in Python, None compares equal to itself:

In [11]: None == None  # noqa: E711
Out[11]: True

np.nan, however, does not compare equal to itself:

In [12]: np.nan == np.nan
Out[12]: False
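Because np.nan never compares equal to anything, an == test cannot find missing values. A short sketch on a made-up Series (the variable names are illustrative) contrasts the equality test with isna():

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Element-wise equality never matches NaN, because NaN != NaN.
eq_mask = s == np.nan

# isna() is the reliable way to test for missing values.
na_mask = s.isna()

# NaN is the only float that is not equal to itself.
self_neq = s != s

print(eq_mask.tolist())
print(na_mask.tolist())
print(self_neq.tolist())
```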

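Missing values in integer data

NaN is a float, so integer data that contains NaN is stored as float by default. To keep the values as integers, request the nullable integer dtype explicitly; the missing value is then shown as <NA>: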

In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
0       1
1       2
2    <NA>
3       4
dtype: Int64
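To make the difference visible, the sketch below builds the same values with and without the nullable dtype. The string alias "Int64" is used here in place of pd.Int64Dtype(); the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Without an explicit dtype, the NaN forces the whole Series to float64.
as_float = pd.Series([1, 2, np.nan, 4])

# With the nullable Int64 dtype the values stay integers and the gap is pd.NA.
as_int = pd.Series([1, 2, np.nan, 4], dtype="Int64")

print(as_float.dtype)
print(as_int.dtype)
print(as_int.isna().tolist())
print(as_int.sum())  # the missing entry is skipped
```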

Missing values in datetime data

Missing values in datetime data are represented by NaT:

In [15]: df2 = df.copy()

In [16]: df2['timestamp'] = pd.Timestamp('20120101')

In [17]: df2
Out[17]:
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan

In [19]: df2
Out[19]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]:
float64           3
datetime64[ns]    1
bool              1
object            1
dtype: int64
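A smaller sketch of the same idea, assuming a Series built with pd.to_datetime from a list that contains None (the data and the name stamps are illustrative). It shows that the gap becomes NaT, that isna() detects it, and that NaT, like NaN, is not equal to itself:

```python
import pandas as pd

# None in the input becomes NaT in the datetime Series.
stamps = pd.Series(pd.to_datetime(["2012-01-01", None, "2012-01-03"]))

print(stamps.isna().tolist())
print(stamps.iloc[1] is pd.NaT)
print(pd.NaT == pd.NaT)
```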

Converting between None and np.nan

For numeric data, assigning None stores the corresponding NaN:

In [21]: s = pd.Series([1, 2, 3])

In [22]: s.loc[0] = None

In [23]: s
Out[23]:
0    NaN
1    2.0
2    3.0
dtype: float64

For object data, an assigned None is kept as is:

In [24]: s = pd.Series(["a", "b", "c"])

In [25]: s.loc[0] = None

In [26]: s.loc[1] = np.nan

In [27]: s
Out[27]:
0    None
1     NaN
2       c
dtype: object
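Whichever marker ends up stored, isna() treats None and NaN alike. The sketch below is a variation on the sessions above with two assumptions: it starts from a float Series so that no dtype change is needed, and it sets dtype=object explicitly for the string data. The names nums and objs are illustrative:

```python
import numpy as np
import pandas as pd

# Float Series: an assigned None is stored as NaN.
nums = pd.Series([1.0, 2.0, 3.0])
nums.loc[0] = None

# Object Series (dtype set explicitly for this sketch).
objs = pd.Series(["a", "b", "c"], dtype=object)
objs.loc[0] = None
objs.loc[1] = np.nan

# isna() reports both None and NaN as missing.
print(nums.isna().tolist())
print(objs.isna().tolist())
```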

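Calculations with missing values

Arithmetic involving a missing value yields a missing value. In the sum below, column three is entirely NaN because it exists only in b: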

In [28]: a
Out[28]:
        one       two
a       NaN -0.282863
c       NaN  1.212112
e  0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771

In [29]: b
Out[29]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [30]: a + b
Out[30]:
        one  three       two
a       NaN    NaN -0.565727
c       NaN    NaN  2.424224
e  0.238417    NaN -2.088472
f -4.209138    NaN -0.989859
h       NaN    NaN -1.413542

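Descriptive statistics, however, skip NaN by default: in a sum NaN effectively counts as 0, and mean averages only the values that are present.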

In [31]: df
Out[31]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [32]: df['one'].sum()
Out[32]: -1.9853605075978744

In [33]: df.mean(1)
Out[33]:
a   -0.895961
c    0.519449
e   -0.595625
f   -0.509232
h   -0.873173
dtype: float64
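A small sketch with easy numbers (illustrative, not the article's data) makes the behaviour checkable by hand. It also shows skipna=False, which lets NaN propagate into the result:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                      # NaN is skipped: 1 + 3
mean = s.mean()                      # averages the two present values
count = s.count()                    # number of non-missing values
total_strict = s.sum(skipna=False)   # NaN propagates

# A sum over only missing values is 0 by default.
all_missing_total = pd.Series([np.nan, np.nan]).sum()

print(total, mean, count, total_strict, all_missing_total)
```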

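Cumulative methods such as cumsum and cumprod also skip NaN by default, while keeping NaN in the result at the positions where the input was missing. To make NaN propagate to all later values instead, pass skipna=False: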

In [34]: df.cumsum()
Out[34]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e  0.119209 -0.114987 -2.544122
f -1.985361 -0.609917 -1.472318
h       NaN -1.316688 -2.511893

In [35]: df.cumsum(skipna=False)
Out[35]:
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  0.929249 -1.682273
e  NaN -0.114987 -2.544122
f  NaN -0.609917 -1.472318
h  NaN -1.316688 -2.511893
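The two modes are easier to compare on a short Series with a single gap. This sketch uses illustrative values and names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, 3.0])

# Default: the gap stays NaN, but the running total continues after it.
skipping = s.cumsum()

# skipna=False: everything from the first NaN onward becomes NaN.
propagating = s.cumsum(skipna=False)

print(skipping.tolist())
print(propagating.tolist())
```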

Filling NaN data with fillna

In data analysis, NaN values usually have to be dealt with. One option is to fill them with fillna.

The following fills with a constant:

In [42]: df2
Out[42]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [43]: df2.fillna(0)
Out[43]:
        one       two     three four   five            timestamp
a  0.000000 -0.282863 -1.509059  bar   True                    0
c  0.000000  1.212112 -0.173215  bar  False                    0
e  0.119209 -1.044236 -0.861849  bar   True  2012-01-01 00:00:00
f -2.104569 -0.494929  1.071804  bar  False  2012-01-01 00:00:00
h  0.000000 -0.706771 -1.039575  bar   True                    0

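You can also specify a fill method, such as pad, which carries the last valid value forward (recent pandas releases deprecate the method argument of fillna in favor of the ffill() and bfill() methods):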

In [45]: df
Out[45]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [46]: df.fillna(method='pad')
Out[46]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h -2.104569 -0.706771 -1.039575

The limit argument caps how many consecutive NaN values are filled:

In [48]: df.fillna(method='pad', limit=1)

Summary of fill methods:

Method name        Description
pad / ffill        Fill values forward
bfill / backfill   Fill values backward
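As a sketch of forward fill, backward fill and limit together, the example below uses the ffill() and bfill() methods on a made-up Series rather than fillna(method=...). The variable names are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: carry the last valid value into the gap.
forward = s.ffill()

# limit=1 fills at most one consecutive NaN per gap.
forward_limited = s.ffill(limit=1)

# Backward fill works from the next valid value instead.
backward_limited = s.bfill(limit=1)

print(forward.tolist())
print(forward_limited.tolist())
print(backward_limited.tolist())
```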

You can also fill with a pandas object, for example the Series of per-column means:

In [53]: dff
Out[53]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN       NaN -1.157892
5 -1.344312       NaN       NaN
6 -0.109050  1.643563       NaN
7  0.357021 -0.674600       NaN
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [54]: dff.fillna(dff.mean())
Out[54]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3 -0.140857  0.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

Filling every column with its mean, as in In [54], is equivalent to:

In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
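The equivalence can be checked on a tiny frame whose column means are easy to compute by hand. The data and the names filled, via_where and only_b are illustrative:

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, 4.0]})

# Column means ignore NaN: A -> 2.0, B -> 3.0.
filled = dff.fillna(dff.mean())

# Same result: keep values where present, otherwise take the column mean.
via_where = dff.where(dff.notna(), dff.mean(), axis="columns")

# Passing only part of the means fills only those columns.
only_b = dff.fillna(dff.mean()[["B"]])

print(filled)
print(filled.equals(via_where))
print(only_b)
```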

Dropping data that contains NA with dropna

Besides filling values with fillna, you can use dropna to drop the rows or columns that contain NA.

In [57]: df
Out[57]:
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  1.212112 -0.173215
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -0.706771 -1.039575

In [58]: df.dropna(axis=0)
Out[58]:
Empty DataFrame
Columns: [one, two, three]
Index: []

In [59]: df.dropna(axis=1)
Out[59]:
        two     three
a -0.282863 -1.509059
c  1.212112 -0.173215
e  0.000000  0.000000
f  0.000000  0.000000
h -0.706771 -1.039575

In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)
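Dropping every row with any NaN can be too aggressive, as the empty result above shows. The sketch below, on illustrative data, uses the how, subset and thresh arguments of dropna to narrow the condition:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [np.nan, np.nan, np.nan],
        "two": [1.0, np.nan, 3.0],
        "three": [4.0, np.nan, 6.0],
    },
    index=["a", "c", "e"],
)

any_missing = df.dropna()                    # drop rows with any NaN
all_missing = df.dropna(how="all")           # drop rows that are entirely NaN
empty_cols = df.dropna(axis=1, how="all")    # drop columns that are entirely NaN
by_subset = df.dropna(subset=["two"])        # only look at column "two"
by_thresh = df.dropna(thresh=2)              # keep rows with >= 2 present values

print(any_missing.empty)
print(all_missing.index.tolist())
print(empty_cols.columns.tolist())
```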

Interpolation

To keep a data series smooth during analysis, we sometimes need to interpolate missing values. The interpolate() method is simple to use:

In [61]: ts
Out[61]:
2000-01-31    0.469112
2000-02-29         NaN
2000-03-31         NaN
2000-04-28         NaN
2000-05-31         NaN
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

In [64]: ts.interpolate()
Out[64]:
2000-01-31    0.469112
2000-02-29    0.434469
2000-03-31    0.399826
2000-04-28    0.365184
2000-05-31    0.330541
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

interpolate() also accepts a method argument that selects how to interpolate, for example by time:

In [67]: ts2
Out[67]:
2000-01-31    0.469112
2000-02-29         NaN
2002-07-31   -5.785037
2005-01-31         NaN
2008-04-30   -9.011531
dtype: float64

In [68]: ts2.interpolate()
Out[68]:
2000-01-31    0.469112
2000-02-29   -2.657962
2002-07-31   -5.785037
2005-01-31   -7.398284
2008-04-30   -9.011531
dtype: float64

In [69]: ts2.interpolate(method='time')
Out[69]:
2000-01-31    0.469112
2000-02-29    0.270241
2002-07-31   -5.785037
2005-01-31   -7.190866
2008-04-30   -9.011531
dtype: float64

Interpolating according to the float values of the index:

In [70]: ser
Out[70]:
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [71]: ser.interpolate()
Out[71]:
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [72]: ser.interpolate(method='values')
Out[72]:
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64
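The difference between the methods comes down to what counts as distance between points. The sketch below rebuilds the float-index Series and adds a datetime-indexed one with the same uneven spacing. It uses method="index" for the float case, and the dates are made up for illustration:

```python
import numpy as np
import pandas as pd

# Float index with uneven spacing: 0 -> 1 -> 10.
ser = pd.Series([0.0, np.nan, 10.0], index=[0.0, 1.0, 10.0])
by_position = ser.interpolate()               # points treated as equally spaced
by_index = ser.interpolate(method="index")    # weighted by the index values

# Datetime index with the same uneven spacing, measured in days.
dates = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-11"])
ts = pd.Series([0.0, np.nan, 10.0], index=dates)
by_time = ts.interpolate(method="time")       # weighted by elapsed time

print(by_position.iloc[1], by_index.iloc[1], by_time.iloc[1])
```

The default linear method puts the gap halfway between its neighbours, while the index-aware and time-aware methods place it one tenth of the way along.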

Besides a Series, you can also interpolate a DataFrame:

In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   ....:                    'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})

In [74]: df
Out[74]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [75]: df.interpolate()
Out[75]:
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

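interpolate also accepts a limit argument, which caps the number of consecutive NaN values that are filled. Here ser is a different Series from the one above: judging by the output, it holds 5.0 at position 2, 13.0 at position 6 and NaN everywhere else.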

In [95]: ser.interpolate(limit=1)
Out[95]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64
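A sketch of limit on a Series reconstructed from the output above (the construction itself is an assumption, since the original definition of ser is not shown). It also tries limit_direction="both", which fills one value on each side of a gap:

```python
import numpy as np
import pandas as pd

# Assumed input: 5 at position 2, 13 at position 6, NaN elsewhere.
ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

# Fill at most one NaN after each valid value.
forward = ser.interpolate(limit=1)

# Fill at most one NaN on each side of the valid values.
both = ser.interpolate(limit=1, limit_direction="both")

# Inside the gap between 5 and 13 the linear steps are 7, 9, 11.
print(forward.iloc[3:6].tolist())
print(both.iloc[3:6].tolist())
```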

Replacing values with replace

replace can substitute a single value or a list of values:

In [102]: ser = pd.Series([0., 1., 2., 3., 4.])

In [103]: ser.replace(0, 5)
Out[103]:
0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[104]:
0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64

In a DataFrame you can replace specific values per column:

In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

In [107]: df.replace({'a': 0, 'b': 5}, 100)
Out[107]:
     a    b
0  100  100
1    1    6
2    2    7
3    3    8
4    4    9

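Instead of supplying replacement values, you can treat the listed values as missing and fill them with a fill method such as pad: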

In [108]: ser.replace([1, 2, 3], method='pad')
Out[108]:
0    0.0
1    0.0
2    0.0
3    0.0
4    4.0
dtype: float64
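Recent pandas releases deprecate the method argument of replace, so the sketch below reaches the same padded result in two explicit steps: replace the values with NaN, then forward fill. It also shows a dict mapping as an alternative to two parallel lists. The variable names are illustrative:

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# Replace a single value.
single = ser.replace(0, 5)

# Replace several values with a dict of old -> new.
mapped = ser.replace({1: 10, 2: 20})

# Treat 1, 2 and 3 as missing, then carry the last valid value forward.
padded = ser.replace([1, 2, 3], np.nan).ffill()

print(single.tolist())
print(mapped.tolist())
print(padded.tolist())
```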

This article is also available at http://www.flydean.com/07-python-pandas-missingdata/
