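Applying cross-validation to time series takes care: you need to prevent leakage and obtain reliable performance estimates. This article introduces Monte Carlo cross-validation, an alternative to the popular TimeSeriesSplit method.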
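Time series cross-validation
TimeSeriesSplit is usually the go-to method for cross-validating time series data. Figure 1 illustrates how it works. The available series is split into several folds of equal size. Each fold is first used to test a model and is then added to the data used to retrain it. The exception is the first fold, which is used only for training.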
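The main benefits of cross-validating with TimeSeriesSplit are:
- It preserves the order of the observations. This matters a great deal for ordered datasets such as time series.
- It generates many splits. Evaluating over several splits gives a more robust estimate, which is especially important when the dataset is not large.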
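As a rough illustration of both points, the sketch below scores a model over several ordered splits and averages the results. The synthetic data, the Ridge model, and the error metric are assumptions chosen for the example, not part of the original study.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in data; rows are treated as if they were ordered in time.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Five ordered splits: each one validates on rows that come after its training rows.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="neg_mean_absolute_error")

print(scores)         # one score per split
print(scores.mean())  # averaged estimate across splits
```

Each score comes from a different validation period, so the mean is less sensitive to any single stretch of the series than one holdout would be.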
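The main drawback of TimeSeriesSplit is that the training sample size is not consistent across folds. What does this mean?
Suppose the method is applied with the 5 folds shown in Figure 1. In the first iteration, 20% of all available observations are used for training. In the last iteration, that figure is 80%. The initial iterations may therefore not be representative of the full time series, and this can affect the performance estimates.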
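The growth in training size is easy to check with scikit-learn. In this minimal sketch the series length of 100 is an assumption chosen so that the sizes read directly as percentages. Note that scikit-learn's `n_splits=4` corresponds to five equal blocks, because the first block is only ever used for training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 100 observations; only the indices matter here.
X = np.arange(100).reshape(-1, 1)

# n_splits=4 cuts the series into 5 equal blocks of 20 observations.
tscv = TimeSeriesSplit(n_splits=4)

train_sizes = []
for train_index, test_index in tscv.split(X):
    train_sizes.append(len(train_index))
    print(f"train={len(train_index):3d}  test={len(test_index):3d}")

print(train_sizes)  # [20, 40, 60, 80]
```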
So how can this problem be solved?
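Monte Carlo cross-validation
Monte Carlo cross-validation (MonteCarloCV) is a method that can be used for time series. The idea is to take a window of data starting from different random origins. The figure below visualizes this approach:
Like TimeSeriesSplit, MonteCarloCV preserves the temporal order of the observations. It also repeats the estimation process several times.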
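MonteCarloCV differs from TimeSeriesSplit in two main ways:
- Training and validation sample size. With TimeSeriesSplit, the training set grows from one iteration to the next. In MonteCarloCV, the training set size is fixed across iterations, which avoids training sizes that are not representative of the full data;
- Random partitions. In MonteCarloCV, the validation origin is chosen at random. This origin marks the end of the training set and the start of validation. With TimeSeriesSplit, this point is deterministic: it is predefined by the number of iterations.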
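The two differences can be seen in a few lines of NumPy. This is a simplified sketch, not the implementation given later in the article: the helper name `monte_carlo_splits`, the `seed` argument, and the series length of 100 are all assumptions made for illustration.

```python
import numpy as np


def monte_carlo_splits(n_samples, n_splits, train_size, test_size, seed=0):
    """Yield (train_idx, test_idx) pairs around randomly drawn origins."""
    n_train = int(n_samples * train_size)
    n_test = int(n_samples * test_size)
    rng = np.random.default_rng(seed)
    # An origin must leave room for a full training window before it
    # and a full validation window after it (upper bound is exclusive).
    origins = rng.integers(low=n_train, high=n_samples - n_test + 1, size=n_splits)
    indices = np.arange(n_samples)
    for origin in origins:
        yield indices[origin - n_train:origin], indices[origin:origin + n_test]


splits = list(monte_carlo_splits(n_samples=100, n_splits=5,
                                 train_size=0.6, test_size=0.1))
for train_idx, test_idx in splits:
    print(f"origin={test_idx[0]:3d}  train={len(train_idx)}  test={len(test_idx)}")
```

Every split has the same training and validation size; only the origin moves.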
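MonteCarloCV was first used by Picard and Cook. See reference [1] for details.
I have studied MonteCarloCV in detail, including a comparison with other methods such as TimeSeriesSplit. In that study MonteCarloCV gave better estimates, so I have been using it ever since. The complete study is in reference [2].
Unfortunately, scikit-learn does not provide an implementation of MonteCarloCV, so we decided to implement it by hand: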
from typing import Generator, List

import numpy as np
from sklearn.model_selection._split import _BaseKFold
from sklearn.utils.validation import indexable, _num_samples


class MonteCarloCV(_BaseKFold):

    def __init__(self,
                 n_splits: int,
                 train_size: float,
                 test_size: float,
                 gap: int = 0):
        """
        Monte Carlo Cross-Validation

        Holdout applied in multiple testing periods.
        The testing origin (time-step where testing begins) is randomly
        chosen according to a Monte Carlo simulation.

        :param n_splits: (int) Number of Monte Carlo repetitions in the procedure
        :param train_size: (float) Train size, in terms of ratio of the total length of the series
        :param test_size: (float) Test size, in terms of ratio of the total length of the series
        :param gap: (int) Number of samples to exclude from the end of each train set before the test set.
        """
        self.n_splits = n_splits
        self.n_samples = -1
        self.gap = gap
        self.train_size = train_size
        self.test_size = test_size
        self.train_n_samples = 0
        self.test_n_samples = 0

        self.mc_origins = []

    def split(self, X, y=None, groups=None) -> Generator:
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where `n_samples` is the number of samples
            and `n_features` is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.

        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        X, y, groups = indexable(X, y, groups)
        self.n_samples = _num_samples(X)

        self.train_n_samples = int(self.n_samples * self.train_size) - 1
        self.test_n_samples = int(self.n_samples * self.test_size) - 1

        # Make sure we have enough samples for the given split parameters
        if self.n_splits > self.n_samples:
            raise ValueError(
                f'Cannot have number of folds={self.n_splits} greater '
                f'than the number of samples={self.n_samples}.'
            )
        if self.train_n_samples - self.gap <= 0:
            raise ValueError(
                f'The gap={self.gap} is too big for number of training samples'
                f'={self.train_n_samples} with testing samples='
                f'{self.test_n_samples} and gap={self.gap}.'
            )

        indices = np.arange(self.n_samples)

        # Candidate origins leave room for a full training window before
        # the origin and a full test window after it.
        selection_range = np.arange(self.train_n_samples + 1,
                                    self.n_samples - self.test_n_samples - 1)

        # Draw one random origin per repetition (with replacement).
        self.mc_origins = np.random.choice(a=selection_range,
                                           size=self.n_splits,
                                           replace=True)

        for origin in self.mc_origins:
            # Drop `gap` samples between the end of training and the origin,
            # as the docstring describes (gap=0 keeps them adjacent).
            train_end = origin - self.gap
            train_start = origin - self.train_n_samples - 1
            test_end = origin + self.test_n_samples

            yield indices[train_start:train_end], indices[origin:test_end]

    def get_origins(self) -> List[int]:
        # Origins drawn by the most recent call to split().
        return self.mc_origins
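MonteCarloCV takes four parameters:
- n_splits: the number of folds or iterations. This value is typically around 10;
- train_size: the size of the training set in each iteration, as a ratio of the length of the time series;
- test_size: like train_size, but for the validation set;
- gap: the number of observations separating the training set from the validation set. As with TimeSeriesSplit, this parameter defaults to 0 (no gap).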
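To make the meaning of gap concrete, here is a small sketch using the gap parameter of scikit-learn's own TimeSeriesSplit, which has the semantics described above. The 12-point series and gap=2 are assumptions chosen to keep the printout short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 12 observations; only the indices matter here.
X = np.arange(12).reshape(-1, 1)

# gap=2 drops the two observations just before each validation block.
tscv = TimeSeriesSplit(n_splits=3, gap=2)

skipped = []
for train_index, test_index in tscv.split(X):
    # Observations lying between the last training index and the first test index.
    skipped.append(int(test_index[0] - train_index[-1] - 1))
    print("TRAIN:", train_index, "TEST:", test_index)

print(skipped)  # [2, 2, 2]
```

A gap is useful when neighboring observations are strongly dependent, because it keeps the last training points from overlapping in information with the first validation points.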
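The training and validation sizes to use in each iteration depend on the input data. I have found that a 0.6/0.1 partition works well: in each iteration, 60% of the data is used for training and 10% of the observations are used for validation.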
A practical example
Here is an example configuration:
from sklearn.datasets import make_regression
# Adjust this import to wherever the MonteCarloCV class above is saved.
from src.mccv import MonteCarloCV

X, y = make_regression(n_samples=120)

mccv = MonteCarloCV(n_splits=5,
                    train_size=0.6,
                    test_size=0.1,
                    gap=0)

for train_index, test_index in mccv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
The implementation is also compatible with scikit-learn. Here is how to combine it with GridSearchCV:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
model = RandomForestRegressor()
param_search = {'n_estimators': [10, 100]}
gsearch = GridSearchCV(estimator=model, cv=mccv, param_grid=param_search)
gsearch.fit(X, y)
I hope you find MonteCarloCV useful!
References
[1] Picard, Richard R., and R. Dennis Cook. "Cross-validation of regression models." Journal of the American Statistical Association 79.387 (1984): 575–583.
[2] Vitor Cerqueira, Luis Torgo, and Igor Mozetič. "Evaluating time series forecasting models: An empirical study on performance estimation methods." Machine Learning 109.11 (2020): 1997–2028.
https://avoid.overfit.cn/post/d6ab5b4cd0e5476c91cae97c4564deb9
Author: Vitor Cerqueira