简介

工夫应该是在数据处理中常常会用到的一种数据类型,除了Numpy中datetime64 和 timedelta64 这两种数据类型之外,pandas 还整合了其余python库比方 scikits.timeseries 中的性能。

工夫分类

pandas中有四种工夫类型:

  1. Date times : 日期和工夫,能够带时区。和规范库中的 datetime.datetime 相似。
  2. Time deltas: 相对持续时间,和 规范库中的 datetime.timedelta 相似。
  3. Time spans: 由工夫点及其关联的频率定义的时间跨度。
  4. Date offsets:基于日历计算的工夫 和 dateutil.relativedelta.relativedelta 相似。

咱们用一张表来示意:

类型标量class数组classpandas数据类型次要创立办法
Date timesTimestampDatetimeIndexdatetime64[ns] or datetime64[ns, tz]to_datetime or date_range
Time deltasTimedeltaTimedeltaIndextimedelta64[ns]to_timedelta or timedelta_range
Time spansPeriodPeriodIndexperiod[freq]Period or period_range
Date offsetsDateOffsetNoneNoneDateOffset

看一个应用的例子:

In [19]: pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))Out[19]: 2000-01-01    02000-01-02    12000-01-03    2Freq: D, dtype: int64

看一下下面数据类型的空值:

In [24]: pd.Timestamp(pd.NaT)Out[24]: NaTIn [25]: pd.Timedelta(pd.NaT)Out[25]: NaTIn [26]: pd.Period(pd.NaT)Out[26]: NaT# Equality acts as np.nan wouldIn [27]: pd.NaT == pd.NaTOut[27]: False

Timestamp

Timestamp 是最根底的工夫类型,咱们能够这样创立:

In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))Out[28]: Timestamp('2012-05-01 00:00:00')In [29]: pd.Timestamp("2012-05-01")Out[29]: Timestamp('2012-05-01 00:00:00')In [30]: pd.Timestamp(2012, 5, 1)Out[30]: Timestamp('2012-05-01 00:00:00')

DatetimeIndex

Timestamp 作为index会主动被转换为DatetimeIndex:

In [33]: dates = [   ....:     pd.Timestamp("2012-05-01"),   ....:     pd.Timestamp("2012-05-02"),   ....:     pd.Timestamp("2012-05-03"),   ....: ]   ....: In [34]: ts = pd.Series(np.random.randn(3), dates)In [35]: type(ts.index)Out[35]: pandas.core.indexes.datetimes.DatetimeIndexIn [36]: ts.indexOut[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)In [37]: tsOut[37]: 2012-05-01    0.4691122012-05-02   -0.2828632012-05-03   -1.509059dtype: float64

date_range 和 bdate_range

还能够应用 date_range 来创立DatetimeIndex:

In [74]: start = datetime.datetime(2011, 1, 1)In [75]: end = datetime.datetime(2012, 1, 1)In [76]: index = pd.date_range(start, end)In [77]: indexOut[77]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',               '2011-01-09', '2011-01-10',               ...               '2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',               '2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',               '2011-12-31', '2012-01-01'],              dtype='datetime64[ns]', length=366, freq='D')

date_range 是日历范畴,bdate_range 是工作日范畴:

In [78]: index = pd.bdate_range(start, end)In [79]: indexOut[79]: DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',               '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',               '2011-01-13', '2011-01-14',               ...               '2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',               '2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',               '2011-12-29', '2011-12-30'],              dtype='datetime64[ns]', length=260, freq='B')

两个办法都能够带上 start, end, 和 periods 参数。

In [84]: pd.bdate_range(end=end, periods=20)In [83]: pd.date_range(start, end, freq="W")In [86]: pd.date_range("2018-01-01", "2018-01-05", periods=5)

origin

应用 origin参数,能够批改 DatetimeIndex 的终点:

In [67]: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))Out[67]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)

默认状况下 origin='unix', 也就是终点是 1970-01-01 00:00:00.

In [68]: pd.to_datetime([1, 2, 3], unit="D")Out[68]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)

格式化

应用format参数能够对工夫进行格式化:

In [51]: pd.to_datetime("2010/11/12", format="%Y/%m/%d")Out[51]: Timestamp('2010-11-12 00:00:00')In [52]: pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")Out[52]: Timestamp('2010-11-12 00:00:00')

Period

Period 示意的是一个时间跨度,通常和freq一起应用:

In [31]: pd.Period("2011-01")Out[31]: Period('2011-01', 'M')In [32]: pd.Period("2012-05", freq="D")Out[32]: Period('2012-05-01', 'D')

Period能够间接进行运算:

In [345]: p = pd.Period("2012", freq="A-DEC")In [346]: p + 1Out[346]: Period('2013', 'A-DEC')In [347]: p - 3Out[347]: Period('2009', 'A-DEC')In [348]: p = pd.Period("2012-01", freq="2M")In [349]: p + 2Out[349]: Period('2012-05', '2M')In [350]: p - 1Out[350]: Period('2011-11', '2M')
留神,Period只有具备雷同的freq能力进行算数运算。包含 offsets 和 timedelta
In [352]: p = pd.Period("2014-07-01 09:00", freq="H")In [353]: p + pd.offsets.Hour(2)Out[353]: Period('2014-07-01 11:00', 'H')In [354]: p + datetime.timedelta(minutes=120)Out[354]: Period('2014-07-01 11:00', 'H')In [355]: p + np.timedelta64(7200, "s")Out[355]: Period('2014-07-01 11:00', 'H')

Period作为index能够主动被转换为PeriodIndex:

In [38]: periods = [pd.Period("2012-01"), pd.Period("2012-02"), pd.Period("2012-03")]In [39]: ts = pd.Series(np.random.randn(3), periods)In [40]: type(ts.index)Out[40]: pandas.core.indexes.period.PeriodIndexIn [41]: ts.indexOut[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')In [42]: tsOut[42]: 2012-01   -1.1356322012-02    1.2121122012-03   -0.173215Freq: M, dtype: float64

能够通过 pd.period_range 办法来创立 PeriodIndex:

In [359]: prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")In [360]: prngOut[360]: PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',             '2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',             '2012-01'],            dtype='period[M]', freq='M')

还能够通过PeriodIndex间接创立:

In [361]: pd.PeriodIndex(["2011-1", "2011-2", "2011-3"], freq="M")Out[361]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')

DateOffset

DateOffset示意的是频率对象。它和Timedelta很相似,示意的是一个持续时间,然而有非凡的日历规定。比方Timedelta一天必定是24小时,而在 DateOffset中依据夏令时的不同,一天可能会有23,24或者25小时。

# This particular day contains a day light savings time transitionIn [144]: ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")# Respects absolute timeIn [145]: ts + pd.Timedelta(days=1)Out[145]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')# Respects calendar timeIn [146]: ts + pd.DateOffset(days=1)Out[146]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')In [147]: friday = pd.Timestamp("2018-01-05")In [148]: friday.day_name()Out[148]: 'Friday'# Add 2 business days (Friday --> Tuesday)In [149]: two_business_days = 2 * pd.offsets.BDay()In [150]: two_business_days.apply(friday)Out[150]: Timestamp('2018-01-09 00:00:00')In [151]: friday + two_business_daysOut[151]: Timestamp('2018-01-09 00:00:00')In [152]: (friday + two_business_days).day_name()Out[152]: 'Tuesday'

DateOffsets 和Frequency 运算是先关的,看一下可用的Date Offset 和它相关联的 Frequency

Date OffsetFrequency String形容
DateOffsetNone通用的offset 类
BDay or BusinessDay'B'工作日
CDay or CustomBusinessDay'C'自定义的工作日
Week'W'一周
WeekOfMonth'WOM'每个月的第几周的第几天
LastWeekOfMonth'LWOM'每个月最初一周的第几天
MonthEnd'M'日历月末
MonthBegin'MS'日历月初
BMonthEnd or BusinessMonthEnd'BM'营业月底
BMonthBegin or BusinessMonthBegin'BMS'营业月初
CBMonthEnd or CustomBusinessMonthEnd'CBM'自定义营业月底
CBMonthBegin or CustomBusinessMonthBegin'CBMS'自定义营业月初
SemiMonthEnd'SM'日历月末的第15天
SemiMonthBegin'SMS'日历月初的第15天
QuarterEnd'Q'日历季末
QuarterBegin'QS'日历季初
BQuarterEnd'BQ工作季末
BQuarterBegin'BQS'工作季初
FY5253Quarter'REQ'批发季( 52-53 week)
YearEnd'A'日历年末
YearBegin'AS' or 'BYS'日历年初
BYearEnd'BA'营业年末
BYearBegin'BAS'营业年初
FY5253'RE'批发年 (aka 52-53 week)
EasterNone复活节假期
BusinessHour'BH'business hour
CustomBusinessHour'CBH'custom business hour
Day'D'一天的相对工夫
Hour'H'一小时
Minute'T' or 'min'一分钟
Second'S'一秒钟
Milli'L' or 'ms'一奥妙
Micro'U' or 'us'一毫秒
Nano'N'一纳秒

DateOffset还有两个办法 rollforward()rollback() 能够将工夫进行挪动:

In [153]: ts = pd.Timestamp("2018-01-06 00:00:00")In [154]: ts.day_name()Out[154]: 'Saturday'# BusinessHour's valid offset dates are Monday through FridayIn [155]: offset = pd.offsets.BusinessHour(start="09:00")# Bring the date to the closest offset date (Monday)In [156]: offset.rollforward(ts)Out[156]: Timestamp('2018-01-08 09:00:00')# Date is brought to the closest offset date first and then the hour is addedIn [157]: ts + offsetOut[157]: Timestamp('2018-01-08 10:00:00')

下面的操作会主动保留小时,分钟等信息,如果想要设置为 00:00:00 , 能够调用normalize() 办法:

In [158]: ts = pd.Timestamp("2014-01-01 09:00")In [159]: day = pd.offsets.Day()In [160]: day.apply(ts)Out[160]: Timestamp('2014-01-02 09:00:00')In [161]: day.apply(ts).normalize()Out[161]: Timestamp('2014-01-02 00:00:00')In [162]: ts = pd.Timestamp("2014-01-01 22:00")In [163]: hour = pd.offsets.Hour()In [164]: hour.apply(ts)Out[164]: Timestamp('2014-01-01 23:00:00')In [165]: hour.apply(ts).normalize()Out[165]: Timestamp('2014-01-01 00:00:00')In [166]: hour.apply(pd.Timestamp("2014-01-01 23:30")).normalize()Out[166]: Timestamp('2014-01-02 00:00:00')

作为index

工夫能够作为index,并且作为index的时候会有一些很不便的个性。

能够间接应用工夫来获取相应的数据:

In [99]: ts["1/31/2011"]Out[99]: 0.11920871129693428In [100]: ts[datetime.datetime(2011, 12, 25):]Out[100]: 2011-12-30    0.56702Freq: BM, dtype: float64In [101]: ts["10/31/2011":"12/31/2011"]Out[101]: 2011-10-31    0.2718602011-11-30   -0.4249722011-12-30    0.567020Freq: BM, dtype: float64

获取全年的数据:

In [102]: ts["2011"]Out[102]: 2011-01-31    0.1192092011-02-28   -1.0442362011-03-31   -0.8618492011-04-29   -2.1045692011-05-31   -0.4949292011-06-30    1.0718042011-07-29    0.7215552011-08-31   -0.7067712011-09-30   -1.0395752011-10-31    0.2718602011-11-30   -0.4249722011-12-30    0.567020Freq: BM, dtype: float64

获取某个月的数据:

In [103]: ts["2011-6"]Out[103]: 2011-06-30    1.071804Freq: BM, dtype: float64

DF能够承受工夫作为loc的参数:

In [105]: dftOut[105]:                             A2013-01-01 00:00:00  0.2762322013-01-01 00:01:00 -1.0874012013-01-01 00:02:00 -0.6736902013-01-01 00:03:00  0.1136482013-01-01 00:04:00 -1.478427...                       ...2013-03-11 10:35:00 -0.7479672013-03-11 10:36:00 -0.0345232013-03-11 10:37:00 -0.2017542013-03-11 10:38:00 -1.5090672013-03-11 10:39:00 -1.693043[100000 rows x 1 columns]In [106]: dft.loc["2013"]Out[106]:                             A2013-01-01 00:00:00  0.2762322013-01-01 00:01:00 -1.0874012013-01-01 00:02:00 -0.6736902013-01-01 00:03:00  0.1136482013-01-01 00:04:00 -1.478427...                       ...2013-03-11 10:35:00 -0.7479672013-03-11 10:36:00 -0.0345232013-03-11 10:37:00 -0.2017542013-03-11 10:38:00 -1.5090672013-03-11 10:39:00 -1.693043[100000 rows x 1 columns]

工夫切片:

In [107]: dft["2013-1":"2013-2"]Out[107]:                             A2013-01-01 00:00:00  0.2762322013-01-01 00:01:00 -1.0874012013-01-01 00:02:00 -0.6736902013-01-01 00:03:00  0.1136482013-01-01 00:04:00 -1.478427...                       ...2013-02-28 23:55:00  0.8509292013-02-28 23:56:00  0.9767122013-02-28 23:57:00 -2.6938842013-02-28 23:58:00 -1.5755352013-02-28 23:59:00 -1.573517[84960 rows x 1 columns]

切片和齐全匹配

思考上面的一个精度为分的Series对象:

In [120]: series_minute = pd.Series(   .....:     [1, 2, 3],   .....:     pd.DatetimeIndex(   .....:         ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]   .....:     ),   .....: )   .....: In [121]: series_minute.index.resolutionOut[121]: 'minute'

工夫精度小于分的话,返回的是一个Series对象:

In [122]: series_minute["2011-12-31 23"]Out[122]: 2011-12-31 23:59:00    1dtype: int64

工夫精度大于分的话,返回的是一个常量:

In [123]: series_minute["2011-12-31 23:59"]Out[123]: 1In [124]: series_minute["2011-12-31 23:59:00"]Out[124]: 1

同样的,如果精度为秒的话,小于秒会返回一个对象,等于秒会返回常量值。

工夫序列的操作

Shifting

应用shift办法能够让 time series 进行相应的挪动:

In [275]: ts = pd.Series(range(len(rng)), index=rng)In [276]: ts = ts[:5]In [277]: ts.shift(1)Out[277]: 2012-01-01    NaN2012-01-02    0.02012-01-03    1.0Freq: D, dtype: float64

通过指定 freq , 能够设置shift的形式:

In [278]: ts.shift(5, freq="D")Out[278]: 2012-01-06    02012-01-07    12012-01-08    2Freq: D, dtype: int64In [279]: ts.shift(5, freq=pd.offsets.BDay())Out[279]: 2012-01-06    02012-01-09    12012-01-10    2dtype: int64In [280]: ts.shift(5, freq="BM")Out[280]: 2012-05-31    02012-05-31    12012-05-31    2dtype: int64

频率转换

工夫序列能够通过调用 asfreq 的办法转换其频率:

In [281]: dr = pd.date_range("1/1/2010", periods=3, freq=3 * pd.offsets.BDay())In [282]: ts = pd.Series(np.random.randn(3), index=dr)In [283]: tsOut[283]: 2010-01-01    1.4945222010-01-06   -0.7784252010-01-11   -0.253355Freq: 3B, dtype: float64In [284]: ts.asfreq(pd.offsets.BDay())Out[284]: 2010-01-01    1.4945222010-01-04         NaN2010-01-05         NaN2010-01-06   -0.7784252010-01-07         NaN2010-01-08         NaN2010-01-11   -0.253355Freq: B, dtype: float64

asfreq还能够指定批改频率过后的填充办法:

In [285]: ts.asfreq(pd.offsets.BDay(), method="pad")Out[285]: 2010-01-01    1.4945222010-01-04    1.4945222010-01-05    1.4945222010-01-06   -0.7784252010-01-07   -0.7784252010-01-08   -0.7784252010-01-11   -0.253355Freq: B, dtype: float64

Resampling 从新取样

给定的工夫序列能够通过调用resample办法来从新取样:

In [286]: rng = pd.date_range("1/1/2012", periods=100, freq="S")In [287]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)In [288]: ts.resample("5Min").sum()Out[288]: 2012-01-01    25103Freq: 5T, dtype: int64

resample 能够承受各类统计办法,比方: sum, mean, std, sem, max, min, median, first, last, ohlc

In [289]: ts.resample("5Min").mean()Out[289]: 2012-01-01    251.03Freq: 5T, dtype: float64In [290]: ts.resample("5Min").ohlc()Out[290]:             open  high  low  close2012-01-01   308   460    9    205In [291]: ts.resample("5Min").max()Out[291]: 2012-01-01    460Freq: 5T, dtype: int64

本文已收录于 http://www.flydean.com/15-python-pandas-time/

最艰深的解读,最粗浅的干货,最简洁的教程,泛滥你不晓得的小技巧等你来发现!

欢送关注我的公众号:「程序那些事」,懂技术,更懂你!