关于pandas:Pandas高级教程之时间处理

工夫应该是在数据处理中常常会用到的一种数据类型，除了 Numpy 中 datetime64 和 timedelta64 这两种数据类型之外，pandas 还整合了其余 python 库比方 scikits.timeseries 中的性能。

pandas 中有四种工夫类型：

Date times : 日期和工夫，能够带时区。和规范库中的 datetime.datetime 相似。
Time deltas：相对持续时间，和规范库中的 datetime.timedelta 相似。
Time spans：由工夫点及其关联的频率定义的时间跨度。
Date offsets：基于日历计算的工夫和 dateutil.relativedelta.relativedelta 相似。

咱们用一张表来示意：

类型	标量 class	数组 class	pandas 数据类型	次要创立办法
Date times	`Timestamp`	`DatetimeIndex`	`datetime64[ns]` or `datetime64[ns, tz]`	`to_datetime` or `date_range`
Time deltas	`Timedelta`	`TimedeltaIndex`	`timedelta64[ns]`	`to_timedelta` or `timedelta_range`
Time spans	`Period`	`PeriodIndex`	`period[freq]`	`Period` or `period_range`
Date offsets	`DateOffset`	`None`	`None`	`DateOffset`

看一个应用的例子：

In [19]: pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))
Out[19]: 
2000-01-01    0
2000-01-02    1
2000-01-03    2
Freq: D, dtype: int64

看一下下面数据类型的空值：

In [24]: pd.Timestamp(pd.NaT)
Out[24]: NaT

In [25]: pd.Timedelta(pd.NaT)
Out[25]: NaT

In [26]: pd.Period(pd.NaT)
Out[26]: NaT

# Equality acts as np.nan would
In [27]: pd.NaT == pd.NaT
Out[27]: False

Timestamp 是最根底的工夫类型，咱们能够这样创立：

In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))
Out[28]: Timestamp('2012-05-01 00:00:00')

In [29]: pd.Timestamp("2012-05-01")
Out[29]: Timestamp('2012-05-01 00:00:00')

In [30]: pd.Timestamp(2012, 5, 1)
Out[30]: Timestamp('2012-05-01 00:00:00')

Timestamp 作为 index 会主动被转换为 DatetimeIndex：

In [33]: dates = [....:     pd.Timestamp("2012-05-01"),
   ....:     pd.Timestamp("2012-05-02"),
   ....:     pd.Timestamp("2012-05-03"),
   ....: ]
   ....: 

In [34]: ts = pd.Series(np.random.randn(3), dates)

In [35]: type(ts.index)
Out[35]: pandas.core.indexes.datetimes.DatetimeIndex

In [36]: ts.index
Out[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)

In [37]: ts
Out[37]: 
2012-05-01    0.469112
2012-05-02   -0.282863
2012-05-03   -1.509059
dtype: float64

还能够应用 date_range 来创立 DatetimeIndex：

In [74]: start = datetime.datetime(2011, 1, 1)

In [75]: end = datetime.datetime(2012, 1, 1)

In [76]: index = pd.date_range(start, end)

In [77]: index
Out[77]: 
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10',
               ...
               '2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
               '2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
               '2011-12-31', '2012-01-01'],
              dtype='datetime64[ns]', length=366, freq='D')

date_range 是日历范畴，bdate_range 是工作日范畴：

In [78]: index = pd.bdate_range(start, end)

In [79]: index
Out[79]: 
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14',
               ...
               '2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
               '2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
               '2011-12-29', '2011-12-30'],
              dtype='datetime64[ns]', length=260, freq='B')

两个办法都能够带上 start, end, 和 periods 参数。

In [84]: pd.bdate_range(end=end, periods=20)
In [83]: pd.date_range(start, end, freq="W")
In [86]: pd.date_range("2018-01-01", "2018-01-05", periods=5)

`origin`

应用 origin 参数，能够批改 DatetimeIndex 的终点：

In [67]: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))
Out[67]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)

默认状况下 origin='unix', 也就是终点是 1970-01-01 00:00:00.

In [68]: pd.to_datetime([1, 2, 3], unit="D")
Out[68]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)

应用 format 参数能够对工夫进行格式化：

In [51]: pd.to_datetime("2010/11/12", format="%Y/%m/%d")
Out[51]: Timestamp('2010-11-12 00:00:00')

In [52]: pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")
Out[52]: Timestamp('2010-11-12 00:00:00')

Period 示意的是一个时间跨度, 通常和 freq 一起应用：

In [31]: pd.Period("2011-01")
Out[31]: Period('2011-01', 'M')

In [32]: pd.Period("2012-05", freq="D")
Out[32]: Period('2012-05-01', 'D')

Period 能够间接进行运算：

In [345]: p = pd.Period("2012", freq="A-DEC")

In [346]: p + 1
Out[346]: Period('2013', 'A-DEC')

In [347]: p - 3
Out[347]: Period('2009', 'A-DEC')

In [348]: p = pd.Period("2012-01", freq="2M")

In [349]: p + 2
Out[349]: Period('2012-05', '2M')

In [350]: p - 1
Out[350]: Period('2011-11', '2M')

留神，Period 只有具备雷同的 freq 能力进行算数运算。包含 offsets 和 timedelta

In [352]: p = pd.Period("2014-07-01 09:00", freq="H")

In [353]: p + pd.offsets.Hour(2)
Out[353]: Period('2014-07-01 11:00', 'H')

In [354]: p + datetime.timedelta(minutes=120)
Out[354]: Period('2014-07-01 11:00', 'H')

In [355]: p + np.timedelta64(7200, "s")
Out[355]: Period('2014-07-01 11:00', 'H')

Period 作为 index 能够主动被转换为 PeriodIndex：

In [38]: periods = [pd.Period("2012-01"), pd.Period("2012-02"), pd.Period("2012-03")]

In [39]: ts = pd.Series(np.random.randn(3), periods)

In [40]: type(ts.index)
Out[40]: pandas.core.indexes.period.PeriodIndex

In [41]: ts.index
Out[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')

In [42]: ts
Out[42]: 
2012-01   -1.135632
2012-02    1.212112
2012-03   -0.173215
Freq: M, dtype: float64

能够通过 pd.period_range 办法来创立 PeriodIndex：

In [359]: prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")

In [360]: prng
Out[360]: 
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
             '2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
             '2012-01'],
            dtype='period[M]', freq='M')

还能够通过 PeriodIndex 间接创立：

In [361]: pd.PeriodIndex(["2011-1", "2011-2", "2011-3"], freq="M")
Out[361]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')

DateOffset 示意的是频率对象。它和 Timedelta 很相似，示意的是一个持续时间，然而有非凡的日历规定。比方 Timedelta 一天必定是 24 小时，而在 DateOffset 中依据夏令时的不同，一天可能会有 23，24 或者 25 小时。

# This particular day contains a day light savings time transition
In [144]: ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")

# Respects absolute time
In [145]: ts + pd.Timedelta(days=1)
Out[145]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

# Respects calendar time
In [146]: ts + pd.DateOffset(days=1)
Out[146]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')

In [147]: friday = pd.Timestamp("2018-01-05")

In [148]: friday.day_name()
Out[148]: 'Friday'

# Add 2 business days (Friday --> Tuesday)
In [149]: two_business_days = 2 * pd.offsets.BDay()

In [150]: two_business_days.apply(friday)
Out[150]: Timestamp('2018-01-09 00:00:00')

In [151]: friday + two_business_days
Out[151]: Timestamp('2018-01-09 00:00:00')

In [152]: (friday + two_business_days).day_name()
Out[152]: 'Tuesday'

DateOffsets 和 Frequency 运算是先关的，看一下可用的 Date Offset 和它相关联的 Frequency：

Date Offset	Frequency String	形容
`DateOffset`	None	通用的 offset 类
`BDay` or `BusinessDay`	`'B'`	工作日
`CDay` or `CustomBusinessDay`	`'C'`	自定义的工作日
`Week`	`'W'`	一周
`WeekOfMonth`	`'WOM'`	每个月的第几周的第几天
`LastWeekOfMonth`	`'LWOM'`	每个月最初一周的第几天
`MonthEnd`	`'M'`	日历月末
MonthBegin	`'MS'`	日历月初
`BMonthEnd` or `BusinessMonthEnd`	`'BM'`	营业月底
`BMonthBegin` or `BusinessMonthBegin`	`'BMS'`	营业月初
`CBMonthEnd` or `CustomBusinessMonthEnd`	`'CBM'`	自定义营业月底
`CBMonthBegin` or `CustomBusinessMonthBegin`	`'CBMS'`	自定义营业月初
`SemiMonthEnd`	`'SM'`	日历月末的第 15 天
`SemiMonthBegin`	`'SMS'`	日历月初的第 15 天
`QuarterEnd`	`'Q'`	日历季末
`QuarterBegin`	`'QS'`	日历季初
`BQuarterEnd`	`'BQ`	工作季末
`BQuarterBegin`	`'BQS'`	工作季初
`FY5253Quarter`	`'REQ'`	批发季（52-53 week)
`YearEnd`	`'A'`	日历年末
`YearBegin`	`'AS'` or `'BYS'`	日历年初
`BYearEnd`	`'BA'`	营业年末
`BYearBegin`	`'BAS'`	营业年初
`FY5253`	`'RE'`	批发年 (aka 52-53 week)
`Easter`	None	复活节假期
`BusinessHour`	`'BH'`	business hour
`CustomBusinessHour`	`'CBH'`	custom business hour
`Day`	`'D'`	一天的相对工夫
`Hour`	`'H'`	一小时
`Minute`	`'T'` or `'min'`	一分钟
`Second`	`'S'`	一秒钟
`Milli`	`'L'` or `'ms'`	一奥妙
`Micro`	`'U'` or `'us'`	一毫秒
`Nano`	`'N'`	一纳秒

DateOffset 还有两个办法 rollforward() 和 rollback() 能够将工夫进行挪动：

In [153]: ts = pd.Timestamp("2018-01-06 00:00:00")

In [154]: ts.day_name()
Out[154]: 'Saturday'

# BusinessHour's valid offset dates are Monday through Friday
In [155]: offset = pd.offsets.BusinessHour(start="09:00")

# Bring the date to the closest offset date (Monday)
In [156]: offset.rollforward(ts)
Out[156]: Timestamp('2018-01-08 09:00:00')

# Date is brought to the closest offset date first and then the hour is added
In [157]: ts + offset
Out[157]: Timestamp('2018-01-08 10:00:00')

下面的操作会主动保留小时，分钟等信息，如果想要设置为 00:00:00，能够调用 normalize() 办法：

In [158]: ts = pd.Timestamp("2014-01-01 09:00")

In [159]: day = pd.offsets.Day()

In [160]: day.apply(ts)
Out[160]: Timestamp('2014-01-02 09:00:00')

In [161]: day.apply(ts).normalize()
Out[161]: Timestamp('2014-01-02 00:00:00')

In [162]: ts = pd.Timestamp("2014-01-01 22:00")

In [163]: hour = pd.offsets.Hour()

In [164]: hour.apply(ts)
Out[164]: Timestamp('2014-01-01 23:00:00')

In [165]: hour.apply(ts).normalize()
Out[165]: Timestamp('2014-01-01 00:00:00')

In [166]: hour.apply(pd.Timestamp("2014-01-01 23:30")).normalize()
Out[166]: Timestamp('2014-01-02 00:00:00')

工夫能够作为 index，并且作为 index 的时候会有一些很不便的个性。

能够间接应用工夫来获取相应的数据：

In [99]: ts["1/31/2011"]
Out[99]: 0.11920871129693428

In [100]: ts[datetime.datetime(2011, 12, 25):]
Out[100]: 
2011-12-30    0.56702
Freq: BM, dtype: float64

In [101]: ts["10/31/2011":"12/31/2011"]
Out[101]: 
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

获取全年的数据：

In [102]: ts["2011"]
Out[102]: 
2011-01-31    0.119209
2011-02-28   -1.044236
2011-03-31   -0.861849
2011-04-29   -2.104569
2011-05-31   -0.494929
2011-06-30    1.071804
2011-07-29    0.721555
2011-08-31   -0.706771
2011-09-30   -1.039575
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

获取某个月的数据：

In [103]: ts["2011-6"]
Out[103]: 
2011-06-30    1.071804
Freq: BM, dtype: float64

DF 能够承受工夫作为 loc 的参数：

In [105]: dft
Out[105]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

In [106]: dft.loc["2013"]
Out[106]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

工夫切片：

In [107]: dft["2013-1":"2013-2"]
Out[107]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

思考上面的一个精度为分的 Series 对象：

In [120]: series_minute = pd.Series(.....:     [1, 2, 3],
   .....:     pd.DatetimeIndex(.....:         ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
   .....:     ),
   .....: )
   .....: 

In [121]: series_minute.index.resolution
Out[121]: 'minute'

工夫精度小于分的话，返回的是一个 Series 对象：

In [122]: series_minute["2011-12-31 23"]
Out[122]: 
2011-12-31 23:59:00    1
dtype: int64

工夫精度大于分的话，返回的是一个常量：

In [123]: series_minute["2011-12-31 23:59"]
Out[123]: 1

In [124]: series_minute["2011-12-31 23:59:00"]
Out[124]: 1

同样的，如果精度为秒的话，小于秒会返回一个对象，等于秒会返回常量值。

应用 shift 办法能够让 time series 进行相应的挪动：

In [275]: ts = pd.Series(range(len(rng)), index=rng)

In [276]: ts = ts[:5]

In [277]: ts.shift(1)
Out[277]: 
2012-01-01    NaN
2012-01-02    0.0
2012-01-03    1.0
Freq: D, dtype: float64

通过指定 freq，能够设置 shift 的形式：

In [278]: ts.shift(5, freq="D")
Out[278]: 
2012-01-06    0
2012-01-07    1
2012-01-08    2
Freq: D, dtype: int64

In [279]: ts.shift(5, freq=pd.offsets.BDay())
Out[279]: 
2012-01-06    0
2012-01-09    1
2012-01-10    2
dtype: int64

In [280]: ts.shift(5, freq="BM")
Out[280]: 
2012-05-31    0
2012-05-31    1
2012-05-31    2
dtype: int64

工夫序列能够通过调用 asfreq 的办法转换其频率：

In [281]: dr = pd.date_range("1/1/2010", periods=3, freq=3 * pd.offsets.BDay())

In [282]: ts = pd.Series(np.random.randn(3), index=dr)

In [283]: ts
Out[283]: 
2010-01-01    1.494522
2010-01-06   -0.778425
2010-01-11   -0.253355
Freq: 3B, dtype: float64

In [284]: ts.asfreq(pd.offsets.BDay())
Out[284]: 
2010-01-01    1.494522
2010-01-04         NaN
2010-01-05         NaN
2010-01-06   -0.778425
2010-01-07         NaN
2010-01-08         NaN
2010-01-11   -0.253355
Freq: B, dtype: float64

asfreq 还能够指定批改频率过后的填充办法：

In [285]: ts.asfreq(pd.offsets.BDay(), method="pad")
Out[285]: 
2010-01-01    1.494522
2010-01-04    1.494522
2010-01-05    1.494522
2010-01-06   -0.778425
2010-01-07   -0.778425
2010-01-08   -0.778425
2010-01-11   -0.253355
Freq: B, dtype: float64

给定的工夫序列能够通过调用 resample 办法来从新取样：

In [286]: rng = pd.date_range("1/1/2012", periods=100, freq="S")

In [287]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [288]: ts.resample("5Min").sum()
Out[288]: 
2012-01-01    25103
Freq: 5T, dtype: int64

resample 能够承受各类统计办法，比方：sum, mean, std, sem, max, min, median, first, last, ohlc。

In [289]: ts.resample("5Min").mean()
Out[289]: 
2012-01-01    251.03
Freq: 5T, dtype: float64

In [290]: ts.resample("5Min").ohlc()
Out[290]: 
            open  high  low  close
2012-01-01   308   460    9    205

In [291]: ts.resample("5Min").max()
Out[291]: 
2012-01-01    460
Freq: 5T, dtype: int64

本文已收录于 http://www.flydean.com/15-python-pandas-time/

最艰深的解读，最粗浅的干货，最简洁的教程，泛滥你不晓得的小技巧等你来发现！

欢送关注我的公众号:「程序那些事」, 懂技术，更懂你！

关于pandas:Pandas高级教程之时间处理

简介

工夫分类

Timestamp

DatetimeIndex

date_range 和 bdate_range

`origin`

格式化

Period

DateOffset

作为 index

切片和齐全匹配

工夫序列的操作

Shifting

频率转换

Resampling 从新取样

Just My Socks（注册教程内含优惠码）

关于pandas:Pandas高级教程之时间处理

简介

工夫分类

Timestamp

DatetimeIndex

date_range 和 bdate_range

origin

格式化

Period

DateOffset

作为 index

切片和齐全匹配

工夫序列的操作

Shifting

频率转换

Resampling 从新取样

Just My Socks（注册教程 内含优惠码）

`origin`

Just My Socks（注册教程内含优惠码）