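Industrial Steam Volume Prediction (Latest Edition, Part 2)
5. Model Validation
5.1 Model Evaluation Concepts and Regularization
5.1.1 Overfitting and Underfitting
### Generate and plot the dataset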
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(666)
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, size=100)
plt.scatter(x, y)
plt.show()
Fit the data with linear regression
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.score(X, y)
# Output: 0.4953707811865009
The R² score is 0.495, which is low: a straight line fits this data poorly.
### Use mean squared error to judge the fit
from sklearn.metrics import mean_squared_error
y_predict = lin_reg.predict(X)
mean_squared_error(y, y_predict)
# Output: 3.0750025765636577
### Plot the fitted result
y_predict = lin_reg.predict(X)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()
5.1.2 Evaluation Metrics for Regression Models and How to Call Them
### Fit with polynomial regression
# * Wrap the steps in a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
def PolynomialRegression(degree):
    # Expand to polynomial features, standardize them, then fit a linear model
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('lin_reg', LinearRegression())
    ])
- Fit the data with the Pipeline: degree = 2
poly2_reg = PolynomialRegression(degree=2)
poly2_reg.fit(X, y)
y2_predict = poly2_reg.predict(X)
# Mean squared error between the true and predicted values
mean_squared_error(y, y2_predict)
# Output: 1.0987392142417856
- Plot the fitted result
plt.scatter(x, y)
plt.plot(np.sort(x), y2_predict[np.argsort(x)], color='r')
plt.show()
- Set degree = 10
poly10_reg = PolynomialRegression(degree=10)
poly10_reg.fit(X, y)
y10_predict = poly10_reg.predict(X)
mean_squared_error(y, y10_predict)
# Output: 1.0508466763764164
plt.scatter(x, y)
plt.plot(np.sort(x), y10_predict[np.argsort(x)], color='r')
plt.show()
- Set degree = 100
poly100_reg = PolynomialRegression(degree=100)
poly100_reg.fit(X, y)
y100_predict = poly100_reg.predict(X)
mean_squared_error(y, y100_predict)
# Output: 0.6874357783433694
plt.scatter(x, y)
plt.plot(np.sort(x), y100_predict[np.argsort(x)], color='r')
plt.show()
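Analysis
- degree=2: the mean squared error is 1.0987392142417856;
- degree=10: the mean squared error is 1.0508466763764164;
- degree=100: the mean squared error is 0.6874357783433694;
- The larger the degree, the lower the error on the training data. Because the set of sample points is fixed, we can always find a curve that passes through every sample point, which drives the overall mean squared error on those points to 0. This is training error only, so a lower value here signals overfitting rather than a better model;
- The red curve is not the full fitted curve. It only connects the predicted y values at the existing data points, and in regions without data points the connecting line differs from the actual fitted curve;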
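The analysis above measures error only on the data used for fitting. A minimal sketch of how a held-out split exposes the difference between training and test error is given below. It reuses the same synthetic quadratic data; the 75/25 split, the degree list, and the helper name `make_poly_reg` are illustrative assumptions, not part of the original walkthrough.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic data as in the walkthrough
np.random.seed(666)
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, size=100)

# Hold out 25% of the points (assumed ratio) to estimate generalization error
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def make_poly_reg(degree):
    # Hypothetical helper mirroring the PolynomialRegression pipeline
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_scaler", StandardScaler()),
        ("lin_reg", LinearRegression()),
    ])

train_mse, test_mse = {}, {}
for degree in (1, 2, 10, 30):
    model = make_poly_reg(degree).fit(X_tr, y_tr)
    train_mse[degree] = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse[degree] = mean_squared_error(y_te, model.predict(X_te))
    print("degree=%d train MSE=%.3f test MSE=%.3f"
          % (degree, train_mse[degree], test_mse[degree]))
```

Training error can only go down as the degree grows, so the held-out error is the number to watch when choosing a degree.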
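5.1.3 Cross-Validation
- Cross-validation iterators
K-fold cross-validation: KFold divides all samples into k groups of equal size (where possible), called folds (if k = n, this is equivalent to the Leave One Out strategy). The prediction function is trained on k - 1 folds, and the remaining fold is used for testing.
Repeated K-fold: RepeatedKFold repeats K-Fold n times. Use it when you need to run KFold n times, producing a different split in each repetition.
Leave-one-out cross-validation: LeaveOneOut (or LOO) is a simple cross-validation. Each training set is built from all samples except one, and the test set is the sample left out. For n samples there are therefore n different training sets and n different test sets. This procedure wastes little data, because only one sample is removed from the training set.
Leave-P-out cross-validation: LeavePOut is very similar to LeaveOneOut: it creates all possible training/test sets by removing p samples from the complete set. For n samples this produces $\binom{n}{p}$ train-test pairs. Unlike LeaveOneOut and KFold, the test sets overlap when p > 1.
User-defined dataset splits: the ShuffleSplit iterator generates a user-specified number of independent train/test splits. Samples are first shuffled and then split into a training set and a test set.
Reproducible randomness: setting an explicit random_state makes the results of the pseudo-random generator repeatable.
- Stratified cross-validation iterators based on class labels
How do we handle class imbalance? Use stratified sampling with StratifiedKFold and StratifiedShuffleSplit. Some classification problems show a strongly imbalanced distribution of target classes: for example, there may be several times more negative samples than positive ones. In such cases, the stratified sampling implemented in StratifiedKFold and StratifiedShuffleSplit is recommended, so that the relative class frequencies are approximately preserved in each training and validation fold.
StratifiedKFold is a variant of k-fold that returns stratified folds: in each fold, the proportion of samples from each class is roughly the same as in the complete dataset.
StratifiedShuffleSplit is a variant of ShuffleSplit that returns stratified splits, that is, splits in which each class has the same proportion as in the complete dataset.
- Cross-validation iterators for grouped data
How can we test the generalization ability of a model further? Hold out data from specific groups so that no group appears in both the training set and the test set. Sometimes we want to know whether a model trained on a particular set of groups generalizes well to unseen groups. To measure this, we need to make sure that all samples in the validation fold come from groups that are not represented at all in the paired training fold.
GroupKFold is a variant of k-fold that ensures the same group is not represented in both the test and training sets. For example, if the data comes from different subjects with several samples per subject, and the model is flexible enough to learn from highly person-specific features, it may fail to generalize to new subjects. GroupKFold makes it possible to detect this kind of overfitting.
LeaveOneGroupOut is a cross-validation scheme that holds out samples according to an array of integer groups supplied by a third party. This group information can be used to encode arbitrary domain-specific, predefined cross-validation folds.
Each training set consists of all samples except those belonging to one specific group.
LeavePGroupsOut is similar to LeaveOneGroupOut, but removes the samples related to P groups for each training/test set.
GroupShuffleSplit combines ShuffleSplit and LeavePGroupsOut: it generates a sequence of randomized partitions in which a subset of groups is held out for each split.
- Time series split
TimeSeriesSplit is a variant of k-fold that returns the first k folds as the training set and the (k+1)-th fold as the test set. Note that, unlike standard cross-validation methods, each successive training set is a superset of the ones before it. It also adds all surplus data to the first training partition, which is always used to train the model.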
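The properties described above can be checked on a toy array. The sketch below assumes 10 artificial samples and the group labels shown; it counts the splits each iterator produces and verifies the group-disjointness and nested-training-set properties.

```python
from math import comb

import numpy as np
from sklearn.model_selection import (GroupKFold, KFold, LeaveOneOut,
                                     LeavePOut, TimeSeriesSplit)

X = np.arange(20).reshape(10, 2)          # 10 toy samples (assumed data)
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]   # assumed group labels

# Number of train/test pairs produced by each iterator
n_kfold = KFold(n_splits=5).get_n_splits(X)
n_loo = LeaveOneOut().get_n_splits(X)      # one split per sample
n_lpo = LeavePOut(p=2).get_n_splits(X)     # "n choose p" splits
print(n_kfold, n_loo, n_lpo, comb(10, 2))

# GroupKFold: no group may appear on both sides of a split
group_overlap = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    group_overlap.append(train_groups & test_groups)

# TimeSeriesSplit: training sets grow, and the test fold always comes later
ts_splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in ts_splits:
    print("train:", train_idx, "test:", test_idx)
```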
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate  # functions used for cross-validation
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut, ShuffleSplit  # basic splitters
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit  # stratified splitters
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit  # group splitters
from sklearn.model_selection import TimeSeriesSplit  # time series splitter
from sklearn import datasets  # built-in datasets
from sklearn import svm  # SVM algorithms
from sklearn import preprocessing  # preprocessing module
from sklearn.metrics import recall_score  # model metric
iris = datasets.load_iris()  # load the dataset
print('Dataset size:', iris.data.shape, iris.target.shape)
# =================================== Split the dataset and train the model ==========================
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)  # hold-out split; test_size is the fraction used for testing
print('Training set size:', X_train.shape, y_train.shape)
print('Test set size:', X_test.shape, y_test.shape)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)  # train on the training set
print('Accuracy:', clf.score(X_test, y_test))  # accuracy on the test set
# If normalization is involved, the scaler fitted on the training set must also be applied to the test set.
scaler = preprocessing.StandardScaler().fit(X_train)  # learn the mean and standard deviation from the training set only
X_train_transformed = scaler.transform(X_train)
clf = svm.SVC(kernel='linear', C=1).fit(X_train_transformed, y_train)  # train on the scaled training set
X_test_transformed = scaler.transform(X_test)
print(clf.score(X_test_transformed, y_test))  # accuracy on the scaled test set
# =================================== Evaluate the model with cross-validation directly ==========================
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)  # cv is the number of folds
print(scores)  # score (accuracy) of each fold
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))  # mean plus/minus two standard deviations
# =================================== Multiple metrics ======================================
scoring = ['precision_macro', 'recall_macro']  # macro-averaged precision and recall
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5, return_train_score=True)
sorted(scores.keys())
print('Test results:', scores)  # scores is a dict holding fit times, score times, and train/test scores
# ================================== K-fold, leave-one-out, leave-P-out, shuffle split ==========================================
# K-fold split
kf = KFold(n_splits=2)
for train, test in kf.split(iris.data):
    print("K-fold split: %s %s" % (train.shape, test.shape))
    break
# Leave-one-out split
loo = LeaveOneOut()
for train, test in loo.split(iris.data):
    print("Leave-one-out split: %s %s" % (train.shape, test.shape))
    break
# Leave-P-out split
lpo = LeavePOut(p=2)
for train, test in lpo.split(iris.data):
    print("Leave-P-out split: %s %s" % (train.shape, test.shape))
    break
# Shuffle split
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train, test in ss.split(iris.data):
    print("Shuffle split: %s %s" % (train.shape, test.shape))
    break
# ================================== Stratified K-fold, stratified shuffle split ==========================================
skf = StratifiedKFold(n_splits=3)  # class proportions in each fold roughly match the full dataset
for train, test in skf.split(iris.data, iris.target):
    print("Stratified K-fold split: %s %s" % (train.shape, test.shape))
    break
skf = StratifiedShuffleSplit(n_splits=3)  # each split keeps the class proportions of the full dataset
for train, test in skf.split(iris.data, iris.target):
    print("Stratified shuffle split: %s %s" % (train.shape, test.shape))
    break
# ================================== Group k-fold, leave-one-group-out, leave-P-groups-out, group shuffle split ==========================================
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
# Group k-fold
gkf = GroupKFold(n_splits=3)  # training and test sets never share a group
for train, test in gkf.split(X, y, groups=groups):
    print("Group k-fold split: %s %s" % (train, test))
# Leave-one-group-out
logo = LeaveOneGroupOut()
for train, test in logo.split(X, y, groups=groups):
    print("Leave-one-group-out split: %s %s" % (train, test))
# Leave-P-groups-out
lpgo = LeavePGroupsOut(n_groups=2)
for train, test in lpgo.split(X, y, groups=groups):
    print("Leave-P-groups-out split: %s %s" % (train, test))
# Group shuffle split
gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
for train, test in gss.split(X, y, groups=groups):
    print("Group shuffle split: %s %s" % (train, test))
# ================================== Time series split ==========================================
tscv = TimeSeriesSplit(n_splits=3)
for train, test in tscv.split(iris.data):
    print("Time series split: %s %s" % (train, test))
Dataset size: (150, 4) (150,)
Training set size: (90, 4) (90,)
Test set size: (60, 4) (60,)
Accuracy: 0.9666666666666667
0.9333333333333333
[0.96666667 1. 0.96666667 0.96666667 1.]
Accuracy: 0.98 (+/- 0.03)
Test results: {'fit_time': array([0.000494 , 0.0005343 , 0.00048256, 0.00053048, 0.00047898]), 'score_time': array([0.00132895, 0.00126219, 0.00118518, 0.00140405, 0.00118995]), 'test_precision_macro': array([0.96969697, 1. , 0.96969697, 0.96969697, 1.]), 'train_precision_macro': array([0.97674419, 0.97674419, 0.99186992, 0.98412698, 0.98333333]), 'test_recall_macro': array([0.96666667, 1. , 0.96666667, 0.96666667, 1.]), 'train_recall_macro': array([0.975 , 0.975 , 0.99166667, 0.98333333, 0.98333333])}
K-fold split: (75,) (75,)
Leave-one-out split: (149,) (1,)
Leave-P-out split: (148,) (2,)
Shuffle split: (112,) (38,)
Stratified K-fold split: (100,) (50,)
Stratified shuffle split: (135,) (15,)
Group k-fold split: [0 1 2 3 4 5] [6 7 8 9]
Group k-fold split: [0 1 2 6 7 8 9] [3 4 5]
Group k-fold split: [3 4 5 6 7 8 9] [0 1 2]
Leave-one-group-out split: [3 4 5 6 7 8 9] [0 1 2]
Leave-one-group-out split: [0 1 2 6 7 8 9] [3 4 5]
Leave-one-group-out split: [0 1 2 3 4 5] [6 7 8 9]
Leave-P-groups-out split: [6 7 8 9] [0 1 2 3 4 5]
Leave-P-groups-out split: [3 4 5] [0 1 2 6 7 8 9]
Leave-P-groups-out split: [0 1 2] [3 4 5 6 7 8 9]
Group shuffle split: [0 1 2] [3 4 5 6 7 8 9]
Group shuffle split: [3 4 5] [0 1 2 6 7 8 9]
Group shuffle split: [3 4 5] [0 1 2 6 7 8 9]
Group shuffle split: [3 4 5] [0 1 2 6 7 8 9]
Time series split: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38] [39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
63 64 65 66 67 68 69 70 71 72 73 74 75]
Time series split: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75] [ 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93
94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
112]
Time series split: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107
108 109 110 111 112] [113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130
131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
149]
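To make explicit what `cross_val_score` does internally, here is a minimal sketch that runs the K-fold loop by hand and compares the result with the library call. It assumes an explicit shuffled `KFold` with a fixed seed so both runs see identical folds; passing an integer `cv` for a classifier would use stratified folds instead.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Fixed seed so the manual loop and cross_val_score use the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in cv.split(X):
    clf = SVC(kernel="linear", C=1)
    clf.fit(X[train_idx], y[train_idx])            # train on k - 1 folds
    manual_scores.append(clf.score(X[test_idx], y[test_idx]))  # test on the rest
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(SVC(kernel="linear", C=1), X, y, cv=cv)
print(manual_scores)
print(auto_scores)
```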
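5.2 Grid Search
Grid Search is a hyperparameter tuning technique based on exhaustive search: loop over every candidate parameter combination, try each one, and keep the best-performing parameters as the final result. The principle is the same as finding the maximum value in an array.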
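The "maximum of an array" view can be sketched with scikit-learn's `ParameterGrid`, which enumerates every combination in a grid. The scoring function `toy_score` below is a hypothetical stand-in for "train a model and return its validation score"; it is not part of the original text.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
}
candidates = list(ParameterGrid(param_grid))  # every (gamma, C) combination
print(len(candidates), "candidate settings")

def toy_score(params):
    # Hypothetical score that peaks at gamma=0.1, C=10; a real search
    # would fit a model here and return its validation score
    return -abs(params["gamma"] - 0.1) - abs(params["C"] - 10)

# Grid search is then just "take the argmax over all candidates"
best_params = max(candidates, key=toy_score)
print(best_params)
```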
5.2.1 Simple Grid Search
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,random_state=0)
print("Size of training set:{} size of testing set:{}".format(X_train.shape[0],X_test.shape[0]))
#### grid search start
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)  # train once for every parameter combination
        svm.fit(X_train, y_train)
        score = svm.score(X_test, y_test)
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}
#### grid search end
print("Best score:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
Size of training set:112 size of testing set:38
Best score:0.97
Best parameters:{'gamma': 0.001, 'C': 100}
5.2.2 Grid Search with Cross Validation
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=1)
print("Size of training set:{} size of validation set:{} size of testing set:{}".format(X_train.shape[0], X_val.shape[0], X_test.shape[0]))
best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        score = svm.score(X_val, y_val)  # select parameters on the validation set, not the test set
        if score > best_score:
            best_score = score
            best_parameters = {'gamma': gamma, 'C': C}
svm = SVC(**best_parameters)  # build a new model with the best parameters
svm.fit(X_trainval, y_trainval)  # retrain on training + validation data; more data usually helps
test_score = svm.score(X_test, y_test)  # final evaluation on the untouched test set
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Best score on test set:{:.2f}".format(test_score))
Size of training set:84 size of validation set:28 size of testing set:38
Best score on validation set:0.96
Best parameters:{'gamma': 0.001, 'C': 10}
Best score on test set:0.92
from sklearn.model_selection import cross_val_score
best_score = 0.0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)  # 5-fold cross-validation
        score = scores.mean()  # average over the folds
        if score > best_score:
            best_score = score
            best_parameters = {"gamma": gamma, "C": C}
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Score on testing set:{:.2f}".format(test_score))
Best score on validation set:0.97
Best parameters:{'gamma': 0.1, 'C': 10}
Score on testing set:0.97
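Cross-validation is often combined with grid search as a way of evaluating parameters; this is called grid search with cross validation. sklearn provides the GridSearchCV class for this purpose. It implements fit, predict, score and related methods, so it can be used like any other estimator. Calling fit does two things: (1) it searches for the best parameters; (2) it instantiates an estimator with those best parameters;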
from sklearn.model_selection import GridSearchCV
# List the parameters to tune together with their candidate values
param_grid = {"gamma": [0.001, 0.01, 0.1, 1, 10, 100],
              "C": [0.001, 0.01, 0.1, 1, 10, 100]}
print("Parameters:{}".format(param_grid))
grid_search = GridSearchCV(SVC(), param_grid, cv=5)  # instantiate a GridSearchCV object
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=10)
grid_search.fit(X_train, y_train)  # find the best parameters and refit a new SVC estimator with them
print("Test set score:{:.2f}".format(grid_search.score(X_test, y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best score on train set:{:.2f}".format(grid_search.best_score_))
Parameters:{'gamma': [0.001, 0.01, 0.1, 1, 10, 100], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}
Test set score:0.97
Best parameters:{'C': 10, 'gamma': 0.1}
Best score on train set:0.98
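Section 5.1.3 noted that a scaler must be fitted on the training data only. One way to respect that inside a grid search is to wrap the scaler and the model in a `Pipeline`, so the scaler is refitted within every fold. The sketch below is an illustration under that assumption, not the original author's code; the step names `scaler` and `svc` are arbitrary, and parameters are addressed as `<step>__<parameter>`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=10)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = {
    "svc__gamma": [0.001, 0.01, 0.1, 1, 10, 100],
    "svc__C": [0.001, 0.01, 0.1, 1, 10, 100],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test set score: %.2f" % test_score)
```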
5.2.3 Learning Curves
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    # Mean and spread of the scores across CV splits, per training-set size
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
digits = load_digits()
X, y = digits.data, digits.target
title = "Learning Curves (Naive Bayes)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = GaussianNB()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)
title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
# SVC is more expensive so we do a lower number of CV iterations:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)
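For readers who want the numbers behind the plot, the following sketch calls `learning_curve` directly and inspects what it returns. It assumes only 5 shuffle splits (instead of 100) to keep the run short; the gap between mean training and validation scores is one rough way to read the curve.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)  # assumed: 5 splits

train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))

# Rows index the training-set sizes, columns index the CV splits
print(train_sizes)
print(train_scores.shape, test_scores.shape)

# Gap between mean training score and mean validation score per size
gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
print(gap)
```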
5.2.4 Validation Curves
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve
digits = load_digits()
X, y = digits.data, digits.target
param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(SVC(), X, y, param_name="gamma", param_range=param_range,
cv=10, scoring="accuracy", n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.title("Validation Curve with SVM")
plt.xlabel(r"$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2, color="r")
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")
plt.show()
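Besides plotting, the validation curve can be used to pick a parameter value numerically. The sketch below is a minimal illustration that selects the `gamma` with the highest mean cross-validation score; it assumes `cv=3` rather than 10 to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
param_range = np.logspace(-6, -1, 5)

train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    cv=3, scoring="accuracy")  # assumed: 3 folds for speed

# One row per gamma value, one column per fold
test_mean = test_scores.mean(axis=1)
best_gamma = param_range[np.argmax(test_mean)]
print("mean CV accuracy per gamma:", test_mean)
print("best gamma:", best_gamma)
```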
5.3 Model Validation for the Industrial Steam Competition
5.3.1 Overfitting and Underfitting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
from sklearn.linear_model import LinearRegression  # linear regression
from sklearn.neighbors import KNeighborsRegressor  # k-nearest neighbors regression
from sklearn.tree import DecisionTreeRegressor  # decision tree regression
from sklearn.ensemble import RandomForestRegressor  # random forest regression
from sklearn.svm import SVR  # support vector regression
import lightgbm as lgb  # LightGBM model
from sklearn.model_selection import train_test_split  # data splitting
from sklearn.metrics import mean_squared_error  # evaluation metric
from sklearn.linear_model import SGDRegressor
# Download the required datasets
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_test.txt
!wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/Industrial_Steam_Forecast/zhengqi_train.txt
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')
from sklearn import preprocessing
features_columns = [col for col in train_data.columns if col not in ['target']]
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler = min_max_scaler.fit(train_data[features_columns])
train_data_scaler = min_max_scaler.transform(train_data[features_columns])
test_data_scaler = min_max_scaler.transform(test_data[features_columns])
train_data_scaler = pd.DataFrame(train_data_scaler)
train_data_scaler.columns = features_columns
test_data_scaler = pd.DataFrame(test_data_scaler)
test_data_scaler.columns = features_columns
train_data_scaler['target'] = train_data['target']
from sklearn.decomposition import PCA  # principal component analysis
# Reduce dimensionality with PCA
# Keep 16 principal components
pca = PCA(n_components=16)
new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:,0:-1])
new_test_pca_16 = pca.transform(test_data_scaler)
new_train_pca_16 = pd.DataFrame(new_train_pca_16)
new_test_pca_16 = pd.DataFrame(new_test_pca_16)
new_train_pca_16['target'] = train_data_scaler['target']
# Use the data reduced to 16 PCA features
new_train_pca_16 = new_train_pca_16.fillna(0)
train = new_train_pca_16[new_test_pca_16.columns]
target = new_train_pca_16['target']
# Split the data: 80% training, 20% validation
train_data,test_data,train_target,test_target=train_test_split(train,target,test_size=0.2,random_state=0)
### Underfitting
clf = SGDRegressor(max_iter=500, tol=1e-2)
clf.fit(train_data, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data))
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("SGDRegressor train MSE:", score_train)
print("SGDRegressor test MSE:", score_test)
SGDRegressor train MSE: 0.15125847407064866
SGDRegressor test MSE: 0.15565698772176442
### Overfitting
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:", score_train)
print("SGDRegressor test MSE:", score_test)
SGDRegressor train MSE: 0.13230725829556678
SGDRegressor test MSE: 0.14475818228220433
### Good fit
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:", score_train)
print("SGDRegressor test MSE:", score_test)
SGDRegressor train MSE: 0.13399656558429307
SGDRegressor test MSE: 0.14255473176638828
5.3.2 Model Regularization
L2-norm regularization
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3, penalty='l2', alpha=0.0001)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:", score_train)
print("SGDRegressor test MSE:", score_test)
SGDRegressor train MSE: 0.1344679787727263
SGDRegressor test MSE: 0.14283084627234435
L1-norm regularization
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3, penalty='l1', alpha=0.00001)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:", score_train)
print("SGDRegressor test MSE:", score_test)
SGDRegressor train MSE: 0.13516056789895906
SGDRegressor test MSE: 0.14330444056183564
ElasticNet: weighted combination of L1-norm and L2-norm regularization
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000, tol=1e-3, penalty= 'elasticnet', l1_ratio=0.9, alpha=0.00001)
clf.fit(train_data_poly, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data_poly))
score_test = mean_squared_error(test_target, clf.predict(test_data_poly))
print("SGDRegressor train MSE:", score_train)
print("SGDRegressor test MSE:", score_test)
SGDRegressor train MSE: 0.13409834594770004
SGDRegressor test MSE: 0.14238154901534278
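The three penalties differ mainly in how they shrink coefficients: the L1 penalty tends to push some coefficients to exactly zero, while the L2 penalty shrinks them without zeroing them. The sketch below illustrates this on synthetic data (an assumption, since the steam dataset needs a download); the data shape, `alpha=0.05`, and the seeds are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data (assumed): 20 features, only the first 3 are informative
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 20))
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.0, 0.5]
y = X @ true_coef + rng.normal(scale=0.1, size=500)

zero_counts = {}
for penalty in ("l2", "l1"):
    reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=penalty,
                       alpha=0.05, random_state=0)
    reg.fit(X, y)
    # Count coefficients that ended up exactly at zero
    zero_counts[penalty] = int(np.sum(reg.coef_ == 0.0))
    print(penalty, "exact-zero coefficients:", zero_counts[penalty])
```

Note that recent scikit-learn releases expect the penalty names in lowercase (`'l2'`, `'l1'`, `'elasticnet'`).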
5.3.3 Model Cross-Validation
Simple cross-validation (hold-out method)
# Simple cross-validation
from sklearn.model_selection import train_test_split  # data splitting
# Split the data: 80% training, 20% validation
train_data,test_data,train_target,test_target=train_test_split(train,target,test_size=0.2,random_state=0)
clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(train_data, train_target)
score_train = mean_squared_error(train_target, clf.predict(train_data))
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("SGDRegressor train MSE:", score_train)
print("SGDRegressor test MSE:", score_test)
SGDRegressor train MSE: 0.14143759510386256
SGDRegressor test MSE: 0.14691862910491496
K-fold cross-validation (K-fold CV)
# 5-fold cross-validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for k, (train_index, test_index) in enumerate(kf.split(train)):
    train_data,test_data,train_target,test_target = train.values[train_index],train.values[test_index],target[train_index],target[test_index]
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print("Fold", k, "SGDRegressor train MSE:", score_train)
    print("Fold", k, "SGDRegressor test MSE:", score_test, '\n')
Fold 0 SGDRegressor train MSE: 0.14989313756469505
Fold 0 SGDRegressor test MSE: 0.10630068590577227
Fold 1 SGDRegressor train MSE: 0.1335269045335198
Fold 1 SGDRegressor test MSE: 0.18239988520454367
Fold 2 SGDRegressor train MSE: 0.14713477627139634
Fold 2 SGDRegressor test MSE: 0.13314646232843022
Fold 3 SGDRegressor train MSE: 0.14067731027537836
Fold 3 SGDRegressor test MSE: 0.16311142798019898
Fold 4 SGDRegressor train MSE: 0.13809527090941803
Fold 4 SGDRegressor test MSE: 0.16535259610698216
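The per-fold test errors above are usually condensed into a single estimate. As a small worked example, the sketch below averages the five test MSE values printed for the 5-fold run and reports their spread; the variable names are illustrative.

```python
import numpy as np

# Test MSE of each fold, copied from the 5-fold output above
fold_test_mse = np.array([
    0.10630068590577227,
    0.18239988520454367,
    0.13314646232843022,
    0.16311142798019898,
    0.16535259610698216,
])

mean_mse = fold_test_mse.mean()   # cross-validated estimate of the test MSE
std_mse = fold_test_mse.std()     # variation from fold to fold
print("CV test MSE: %.4f (+/- %.4f)" % (mean_mse, std_mse))
```

The mean comes out at about 0.150, close to the hold-out estimate of about 0.147 from the simple split, while the fold-to-fold spread shows how much a single split can vary.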
Leave-one-out cross-validation (LOO CV)
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
# LOO yields one split per sample, so only the first 10 splits are run here
for k, (train_index, test_index) in enumerate(loo.split(train)):
    train_data,test_data,train_target,test_target = train.values[train_index],train.values[test_index],target[train_index],target[test_index]
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print("Split", k, "SGDRegressor train MSE:", score_train)
    print("Split", k, "SGDRegressor test MSE:", score_test, '\n')
    if k >= 9:
        break
Split 0 SGDRegressor train MSE: 0.14167336296809338
Split 0 SGDRegressor test MSE: 0.013368856967176993
Split 1 SGDRegressor train MSE: 0.14158431010604786
Split 1 SGDRegressor test MSE: 0.12481451551630947
Split 2 SGDRegressor train MSE: 0.14150252555121376
Split 2 SGDRegressor test MSE: 0.03855470133268372
Split 3 SGDRegressor train MSE: 0.14164982490586497
Split 3 SGDRegressor test MSE: 0.004218299742968551
Split 4 SGDRegressor train MSE: 0.1415724024144491
Split 4 SGDRegressor test MSE: 0.012171393307787685
Split 5 SGDRegressor train MSE: 0.14164330849085816
Split 5 SGDRegressor test MSE: 0.13457429896691775
Split 6 SGDRegressor train MSE: 0.14162839258823134
Split 6 SGDRegressor test MSE: 0.022584321520003964
Split 7 SGDRegressor train MSE: 0.14156535630118358
Split 7 SGDRegressor test MSE: 0.0007881735114026308
Split 8 SGDRegressor train MSE: 0.14161403732956687
Split 8 SGDRegressor test MSE: 0.09236755222443295
Split 9 SGDRegressor train MSE: 0.1416518678123776
Split 9 SGDRegressor test MSE: 0.049938663947863705
Leave-P-out cross-validation (LPO CV)
from sklearn.model_selection import LeavePOut
lpo = LeavePOut(p=10)
# LPO yields a very large number of splits, so only the first 10 are run here
for k, (train_index, test_index) in enumerate(lpo.split(train)):
    train_data,test_data,train_target,test_target = train.values[train_index],train.values[test_index],target[train_index],target[test_index]
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print("Split", k, "(10 held out)", "SGDRegressor train MSE:", score_train)
    print("Split", k, "(10 held out)", "SGDRegressor test MSE:", score_test, '\n')
    if k >= 9:
        break
Split 0 (10 held out) SGDRegressor train MSE: 0.14188547241073846
Split 0 (10 held out) SGDRegressor test MSE: 0.04919852578302554
Split 1 (10 held out) SGDRegressor train MSE: 0.1419628899970283
Split 1 (10 held out) SGDRegressor test MSE: 0.0452239727984194
Split 2 (10 held out) SGDRegressor train MSE: 0.14213271221606072
Split 2 (10 held out) SGDRegressor test MSE: 0.04699670484045908
Split 3 (10 held out) SGDRegressor train MSE: 0.14197467153253543
Split 3 (10 held out) SGDRegressor test MSE: 0.054453728030175695
Split 4 (10 held out) SGDRegressor train MSE: 0.14187879341894122
Split 4 (10 held out) SGDRegressor test MSE: 0.06924591926518929
Split 5 (10 held out) SGDRegressor train MSE: 0.14201820586737332
Split 5 (10 held out) SGDRegressor test MSE: 0.04544729649569867
Split 6 (10 held out) SGDRegressor train MSE: 0.1420321877668132
Split 6 (10 held out) SGDRegressor test MSE: 0.04932459950875607
Split 7 (10 held out) SGDRegressor train MSE: 0.1419166425781182
Split 7 (10 held out) SGDRegressor test MSE: 0.05328512633699939
Split 8 (10 held out) SGDRegressor train MSE: 0.1413933355339114
Split 8 (10 held out) SGDRegressor test MSE: 0.04634695705557035
Split 9 (10 held out) SGDRegressor train MSE: 0.14188082336683486
Split 9 (10 held out) SGDRegressor test MSE: 0.045133396081342994
5.3.4 Hyperparameter Space and Tuning
Exhaustive grid search
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split  # data splitting
# Split the data: 80% training, 20% validation
train_data,test_data,train_target,test_target=train_test_split(train,target,test_size=0.2,random_state=0)
randomForestRegressor = RandomForestRegressor()
parameters = {'n_estimators':[50, 100, 200],
'max_depth':[1, 2, 3]
}
clf = GridSearchCV(randomForestRegressor, parameters, cv=5)
clf.fit(train_data, train_target)
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("RandomForestRegressor GridSearchCV test MSE:", score_test)
sorted(clf.cv_results_.keys())
RandomForestRegressor GridSearchCV test MSE: 0.2595696984416692
['mean_fit_time',
'mean_score_time',
'mean_test_score',
'param_max_depth',
'param_n_estimators',
'params',
'rank_test_score',
'split0_test_score',
'split1_test_score',
'split2_test_score',
'split3_test_score',
'split4_test_score',
'std_fit_time',
'std_score_time',
'std_test_score']
Randomized parameter optimization
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split  # data splitting
# Split the data: 80% training, 20% validation
train_data,test_data,train_target,test_target=train_test_split(train,target,test_size=0.2,random_state=0)
randomForestRegressor = RandomForestRegressor()
parameters = {'n_estimators':[10, 50],
'max_depth':[1, 2, 5]
}
clf = RandomizedSearchCV(randomForestRegressor, parameters, cv=5)
clf.fit(train_data, train_target)
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("RandomForestRegressor RandomizedSearchCV test MSE:", score_test)
sorted(clf.cv_results_.keys())
RandomForestRegressor RandomizedSearchCV test MSE: 0.1952974248358807
['mean_fit_time',
'mean_score_time',
'mean_test_score',
'param_max_depth',
'param_n_estimators',
'params',
'rank_test_score',
'split0_test_score',
'split1_test_score',
'split2_test_score',
'split3_test_score',
'split4_test_score',
'std_fit_time',
'std_score_time',
'std_test_score']
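Randomized search is most useful when it samples from distributions and the number of trials is capped with `n_iter`, rather than enumerating a short list. The sketch below shows that pattern under stated assumptions: synthetic regression data in place of the steam dataset, `scipy.stats.randint` ranges chosen for illustration, and `scoring="neg_mean_squared_error"` so that the search optimizes the same metric the walkthrough reports.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the 16 PCA features (assumed data)
X, y = make_regression(n_samples=200, n_features=16, noise=0.1,
                       random_state=0)

# Distributions to sample from; the ranges are illustrative
param_distributions = {
    "n_estimators": randint(10, 50),   # integers in [10, 50)
    "max_depth": randint(1, 6),        # integers in [1, 6)
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,                          # evaluate only 5 sampled settings
    cv=3,
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    random_state=0,
)
search.fit(X, y)

best_cv_mse = -search.best_score_
print("Best parameters:", search.best_params_)
print("Best CV MSE:", best_cv_mse)
```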
LightGBM tuning
!pip install lightgbm
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.33.6)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.19.5)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.22.1)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.3.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
import lightgbm as lgb
clf = lgb.LGBMRegressor(num_leaves=21)  # the library default is num_leaves=31
parameters = {'learning_rate': [0.01, 0.1],
              'n_estimators': [20, 40]
              }
clf = GridSearchCV(clf, parameters, cv=5)
clf.fit(train_data, train_target)
print('Best parameters found by grid search are:', clf.best_params_)
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("LGBMRegressor GridSearchCV test MSE:", score_test)
LightGBM Offline Validation
train_data2 = pd.read_csv('./zhengqi_train.txt',sep='\t')
test_data2 = pd.read_csv('./zhengqi_test.txt',sep='\t')
train_data2_f = train_data2[test_data2.columns].values
train_data2_target = train_data2['target'].values
[Error message: TypeError: __init__() got an unexpected keyword argument 'n_folds'](https://blog.csdn.net/qq_35781239/article/details/100866176?o…^v76^insert_down38,201^v4^add_ask,239^v2^insert_chatgpt&utm_term=__init__%28%29%20got%20multiple%20values%20for%20argument%20n_splits&spm=1018.2226.3001.4187)
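# LightGBM model
from sklearn.model_selection import KFold
import lightgbm as lgb
import numpy as np
# 5-fold cross-validation
Folds=5
kf = KFold(n_splits=Folds, random_state=100, shuffle=True)
# Notes on the changed KFold API:
# 1. Wrong import: `from sklearn.cross_validation import KFold` has been removed;
#    use `from sklearn.model_selection import KFold` (see the scikit-learn documentation).
# 2. Wrong arguments: `kf = KFold(titanic.shape[0], n_folds=3, random_state=1)` no longer works.
#    The sample count is no longer passed and `n_folds` was renamed to `n_splits`, so the call
#    becomes `kf = KFold(n_splits=3, shuffle=True, random_state=1)` (recent versions reject
#    random_state when shuffle=False). Keeping `n_folds` raises the "unexpected keyword argument"
#    error above; passing the sample count positionally together with `n_splits` raises
#    TypeError: __init__() got multiple values for argument 'n_splits'
# 3. Iteration: `for train, test in kf:` becomes
#    `for train, test in kf.split(titanic[predictions]):`, which splits the rows of that feature matrix.
# Record the training and prediction MSE
MSE_DICT = {'train_mse':[],
            'test_mse':[]}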
# Offline training and prediction
for i, (train_index, test_index) in enumerate(kf.split(train_data2_f)):
    # LightGBM tree model
    lgb_reg = lgb.LGBMRegressor(
        learning_rate=0.01,
        max_depth=-1,
        n_estimators=100,
        boosting_type='gbdt',
        random_state=100,
        objective='regression',
    )
    # Split into the training fold and the held-out fold
    X_train_KFold, X_test_KFold = train_data2_f[train_index], train_data2_f[test_index]
    y_train_KFold, y_test_KFold = train_data2_target[train_index], train_data2_target[test_index]
    # Train the model
    # reg.fit(X_train_KFold, y_train_KFold)
    lgb_reg.fit(
        X=X_train_KFold,y=y_train_KFold,
        eval_set=[(X_train_KFold, y_train_KFold),(X_test_KFold, y_test_KFold)],
        eval_names=['Train','Test'],
        early_stopping_rounds=10,
        eval_metric='MSE',
        verbose=50
    )
    # Predict on the training fold and on the held-out fold
    y_train_KFold_predict = lgb_reg.predict(X_train_KFold,num_iteration=lgb_reg.best_iteration_)
    y_test_KFold_predict = lgb_reg.predict(X_test_KFold,num_iteration=lgb_reg.best_iteration_)
    print('Fold {}: training MSE and prediction MSE'.format(i))
    train_mse = mean_squared_error(y_train_KFold_predict, y_train_KFold)
    print('------\n', 'Training MSE\n', train_mse, '\n------')
    test_mse = mean_squared_error(y_test_KFold_predict, y_test_KFold)
    print('------\n', 'Prediction MSE\n', test_mse, '\n------\n')
    MSE_DICT['train_mse'].append(train_mse)
    MSE_DICT['test_mse'].append(test_mse)
# Per-fold values followed by their mean
print('------\n', 'Training MSE\n', MSE_DICT['train_mse'], '\n', np.mean(MSE_DICT['train_mse']), '\n------')
print('------\n', 'Prediction MSE\n', MSE_DICT['test_mse'], '\n', np.mean(MSE_DICT['test_mse']), '\n------')
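The manual loop above collects a training MSE and a held-out MSE for each fold. For estimators that do not need early stopping, the same bookkeeping can be sketched with `cross_validate`. The snippet below is only an illustration under stated assumptions: it uses synthetic data and swaps in scikit-learn's `GradientBoostingRegressor` as a stand-in for `LGBMRegressor` so that it runs without LightGBM.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.RandomState(100)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=100)
cv_out = cross_validate(
    GradientBoostingRegressor(n_estimators=50, random_state=100),
    X_demo, y_demo,
    cv=kf_demo,
    scoring="neg_mean_squared_error",
    return_train_score=True,   # also score each model on its own training fold
)
# Scores are negated MSE values, so flip the sign back
train_mse = -cv_out["train_score"]
test_mse = -cv_out["test_score"]
print("train MSE per fold:", train_mse, "mean:", train_mse.mean())
print("test MSE per fold:", test_mse, "mean:", test_mse.mean())
```

A large gap between the two means is the same overfitting signal that the per-fold printout above is meant to expose.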
5.3.5 Learning Curves and Validation Curves
### Learning curve
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
X = train_data2[test_data2.columns].values
y = train_data2['target'].values
title = "LinearRegression"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
# cv = model_selection.ShuffleSplit(X.shape[0], n_splits=100,
# test_size=0.2, random_state=0)
cv = model_selection.ShuffleSplit(n_splits=100,
test_size=0.2, random_state=0)
estimator = SGDRegressor()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)
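Note: the older call commented out above, ShuffleSplit(X.shape[0], n_splits=100, ...), raises "TypeError: __init__() got multiple values for argument 'n_splits'" in current scikit-learn, which is why the sample count is no longer passed.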
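The plot is built from the three arrays that `learning_curve` returns. As a small sketch of how they can be inspected without plotting, the example below prints the gap between the mean training score and the mean cross-validation score at each training size. The synthetic data and the choice of 10 splits are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    SGDRegressor(random_state=0),
    X_demo, y_demo,
    cv=cv_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the training split
)
# Rows are training sizes, columns are CV splits
gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
for n, g in zip(sizes, gap):
    print("n_train={:4d}  train-minus-cv score gap={:.4f}".format(n, g))
```

A gap that shrinks as the training size grows suggests more data helps; a gap that stays small while both scores are low points toward underfitting instead.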
### Validation curve
print(__doc__)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve
X = train_data2[test_data2.columns].values
y = train_data2['target'].values
# max_iter=1000, tol=1e-3, penalty='l1', alpha=0.00001
param_range = [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]
train_scores, test_scores = validation_curve(SGDRegressor(max_iter=1000, tol=1e-3, penalty='l1'), X, y, param_name="alpha", param_range=param_range,
                                             cv=10, scoring='r2', n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2, color="r")
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")
plt.show()
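Beyond reading the plot, the validation-curve arrays can be used to pick a value directly. The following is a minimal sketch, on synthetic data and with a shortened `alpha` list (both assumptions), that selects the `alpha` with the highest mean cross-validation R2.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

alphas = [0.1, 0.01, 0.001, 0.0001]
train_scores, test_scores = validation_curve(
    SGDRegressor(max_iter=1000, tol=1e-3, penalty="l1", random_state=0),
    X_demo, y_demo,
    param_name="alpha",
    param_range=alphas,
    cv=5,
    scoring="r2",
)
# Rows follow the order of `alphas`, columns are CV folds
mean_cv = test_scores.mean(axis=1)
best_alpha = alphas[int(np.argmax(mean_cv))]
print("mean CV R2 per alpha:", dict(zip(alphas, np.round(mean_cv, 4))))
print("best alpha:", best_alpha)
```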
6. Feature Optimization
6.1 Defining a Feature-Construction Method and Building Features
# Load the data
import pandas as pd
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')
epsilon=1e-5
# Pairwise cross-feature operations; define your own as needed, e.g. x*x/y, log(x)/y, and so on
func_dict = {
    'add': lambda x,y: x+y,
    'mins': lambda x,y: x-y,
    'div': lambda x,y: x/(y+epsilon),
    'multi': lambda x,y: x*y
}
### Define the feature-construction function
def auto_features_make(train_data,test_data,func_dict,col_list):
    train_data, test_data = train_data.copy(), test_data.copy()
    for col_i in col_list:
        for col_j in col_list:
            for func_name, func in func_dict.items():
                for data in [train_data,test_data]:
                    func_features = func(data[col_i],data[col_j])
                    col_func_features = '-'.join([col_i,func_name,col_j])
                    data[col_func_features] = func_features
    return train_data,test_data
### Build the features for the training and test sets
train_data2, test_data2 = auto_features_make(train_data,test_data,func_dict,col_list=test_data.columns)
from sklearn.decomposition import PCA  # principal component analysis
# Reduce dimensionality with PCA
pca = PCA(n_components=500)
# Fit on the feature columns shared with the test set. After feature construction 'target' is no
# longer the last column, so iloc[:, 0:-1] would keep 'target' and drop a constructed feature.
train_data2_pca = pca.fit_transform(train_data2[test_data2.columns])
test_data2_pca = pca.transform(test_data2)
train_data2_pca = pd.DataFrame(train_data2_pca)
test_data2_pca = pd.DataFrame(test_data2_pca)
train_data2_pca['target'] = train_data2['target']
X_train2 = train_data2[test_data2.columns].values
y_train = train_data2['target']
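To make the naming and the growth in column count concrete, here is a toy sketch of the same pairwise construction on a two-column frame. The helper `cross_features`, the frame `toy`, and the reduced operation set are hypothetical names introduced only for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
ops = {"add": lambda x, y: x + y, "multi": lambda x, y: x * y}

def cross_features(df, ops, cols):
    """Append one column per (col_i, operation, col_j) triple, named 'col_i-op-col_j'."""
    out = df.copy()
    for ci in cols:
        for cj in cols:
            for name, fn in ops.items():
                out["-".join([ci, name, cj])] = fn(df[ci], df[cj])
    return out

crossed = cross_features(toy, ops, list(toy.columns))
print(crossed.columns.tolist())
print(crossed[["a-add-b", "a-multi-b"]])
```

With k input columns and m operations this adds k*k*m columns, including symmetric duplicates such as `a-add-b` and `b-add-a`, which is why a dimensionality-reduction step such as PCA is applied afterwards.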
6.2 Training and Evaluating the Constructed Features with LightGBM
# LightGBM validation
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
import numpy as np
# 5-fold cross-validation; the arguments follow the updated KFold API
Folds=5
kf = KFold(n_splits=Folds, shuffle=True, random_state=2019)
# The API change affects usage in two ways:
# 1. The n_folds parameter was renamed to n_splits.
# 2. The train_data.shape[0] argument was removed, so the call becomes
#    kf = KFold(n_splits=3, shuffle=True, random_state=1)
# Wherever kf is used afterwards, the old form
#    for i, (train_index, test_index) in enumerate(kf):
# becomes
#    for i, (train_index, test_index) in enumerate(kf.split(train)):  # train is your training data
# Record the training and prediction MSE
MSE_DICT = {'train_mse':[],
            'test_mse':[]}
# Offline training and prediction
for i, (train_index, test_index) in enumerate(kf.split(X_train2)):
    # LightGBM tree model
    lgb_reg = lgb.LGBMRegressor(
        learning_rate=0.01,
        max_depth=-1,
        n_estimators=100,  # remember to adjust
        boosting_type='gbdt',
        random_state=2019,
        objective='regression',
    )
    # Split into the training fold and the held-out fold
    X_train_KFold, X_test_KFold = X_train2[train_index], X_train2[test_index]
    y_train_KFold, y_test_KFold = y_train[train_index], y_train[test_index]
    # Train the model
    lgb_reg.fit(
        X=X_train_KFold,y=y_train_KFold,
        eval_set=[(X_train_KFold, y_train_KFold),(X_test_KFold, y_test_KFold)],
        eval_names=['Train','Test'],
        early_stopping_rounds=10,  # remember to adjust
        eval_metric='MSE',
        verbose=50
    )
    # Predict on the training fold and on the held-out fold
    y_train_KFold_predict = lgb_reg.predict(X_train_KFold,num_iteration=lgb_reg.best_iteration_)
    y_test_KFold_predict = lgb_reg.predict(X_test_KFold,num_iteration=lgb_reg.best_iteration_)
    print('Fold {}: training MSE and prediction MSE'.format(i))
    train_mse = mean_squared_error(y_train_KFold_predict, y_train_KFold)
    print('------\n', 'Training MSE\n', train_mse, '\n------')
    test_mse = mean_squared_error(y_test_KFold_predict, y_test_KFold)
    print('------\n', 'Prediction MSE\n', test_mse, '\n------\n')
    MSE_DICT['train_mse'].append(train_mse)
    MSE_DICT['test_mse'].append(test_mse)
# Per-fold values followed by their mean
print('------\n', 'Training MSE\n', MSE_DICT['train_mse'], '\n', np.mean(MSE_DICT['train_mse']), '\n------')
print('------\n', 'Prediction MSE\n', MSE_DICT['test_mse'], '\n', np.mean(MSE_DICT['test_mse']), '\n------')
7. Model Fusion
Below, the key steps from the previous chapter are run once more.
# Import packages
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
import seaborn as sns
# modelling
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score,cross_val_predict,KFold
from sklearn.metrics import make_scorer,mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import LinearSVR, SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor,AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import PolynomialFeatures,MinMaxScaler,StandardScaler
# load_dataset
with open("./zhengqi_train.txt") as fr:
    data_train=pd.read_table(fr,sep="\t")
with open("./zhengqi_test.txt") as fr_test:
    data_test=pd.read_table(fr_test,sep="\t")
# Merge the data
# merge train_set and test_set
data_train["oringin"]="train"
data_test["oringin"]="test"
data_all=pd.concat([data_train,data_test],axis=0,ignore_index=True)
# Drop the relevant features
data_all.drop(["V5","V9","V11","V17","V22","V28"],axis=1,inplace=True)
# Min-max normalization
# normalise numeric columns
cols_numeric=list(data_all.columns)
cols_numeric.remove("oringin")
def scale_minmax(col):
    return (col-col.min())/(col.max()-col.min())
scale_cols = [col for col in cols_numeric if col!='target']
data_all[scale_cols] = data_all[scale_cols].apply(scale_minmax,axis=0)
# #Check effect of Box-Cox transforms on distributions of continuous variables
# # Plot: explore how each feature relates to the label
# fcols = 6
# frows = len(cols_numeric)-1
# plt.figure(figsize=(4*fcols,4*frows))
# i=0
# for var in cols_numeric:
# if var!='target':
# dat = data_all[[var, 'target']].dropna()
# i+=1
# plt.subplot(frows,fcols,i)
# sns.distplot(dat[var] , fit=stats.norm);
# plt.title(var+'Original')
# plt.xlabel('')
# i+=1
# plt.subplot(frows,fcols,i)
# _=stats.probplot(dat[var], plot=plt)
# plt.title('skew='+'{:.4f}'.format(stats.skew(dat[var])))
# plt.xlabel('')
# plt.ylabel('')
# i+=1
# plt.subplot(frows,fcols,i)
# plt.plot(dat[var], dat['target'],'.',alpha=0.5)
# plt.title('corr='+'{:.2f}'.format(np.corrcoef(dat[var], dat['target'])[0][1]))
# i+=1
# plt.subplot(frows,fcols,i)
# trans_var, lambda_var = stats.boxcox(dat[var].dropna()+1)
# trans_var = scale_minmax(trans_var)
# sns.distplot(trans_var , fit=stats.norm);
# plt.title(var+'Tramsformed')
# plt.xlabel('')
# i+=1
# plt.subplot(frows,fcols,i)
# _=stats.probplot(trans_var, plot=plt)
# plt.title('skew='+'{:.4f}'.format(stats.skew(trans_var)))
# plt.xlabel('')
# plt.ylabel('')
# i+=1
# plt.subplot(frows,fcols,i)
# plt.plot(trans_var, dat['target'],'.',alpha=0.5)
# plt.title('corr='+'{:.2f}'.format(np.corrcoef(trans_var,dat['target'])[0][1]))
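Apply the Box-Cox transform to the features so that they are closer to normally distributed
The Box-Cox transform is a generalized power transform proposed by Box and Cox in 1964. It is a data transformation commonly used in statistical modeling when a continuous response variable does not follow a normal distribution. After the transform, the correlation between the unobservable error and the predictor variables can be reduced to some extent. Its main feature is that it introduces a single parameter, estimated from the data itself, which determines the form of the transformation. The Box-Cox transform can markedly improve the normality, symmetry, and homogeneity of variance of the data, and it has proved effective on many real datasets.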
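For a positive input x and parameter lambda, the transform is (x**lambda - 1) / lambda when lambda is not 0, and log(x) when lambda is 0. The sketch below writes that formula out by hand and checks it against `scipy.stats.boxcox`, which also estimates lambda by maximum likelihood. The helper name `boxcox_manual` and the random sample are assumptions for illustration; the `+ 1.0` shift mirrors how min-max scaled columns, which contain zeros, are made strictly positive before the transform.

```python
import numpy as np
from scipy import stats

def boxcox_manual(x, lam):
    """Box-Cox transform written out from its definition (x must be positive)."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

# Values in [0, 1) shifted by 1 so that every entry is strictly positive
rng = np.random.RandomState(0)
x_demo = rng.uniform(0.0, 1.0, size=200) + 1.0

# With no lambda given, scipy returns the transformed data and the fitted lambda
transformed, lam_hat = stats.boxcox(x_demo)
print("estimated lambda:", lam_hat)
print("matches manual formula:", np.allclose(transformed, boxcox_manual(x_demo, lam_hat)))
```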
cols_transform=data_all.columns[0:-2]
for col in cols_transform:
    # transform column
    data_all.loc[:,col], _ = stats.boxcox(data_all.loc[:,col]+1)
# Summary statistics of the label, followed by its distribution and a normal Q-Q plot
print(data_all.target.describe())
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.distplot(data_all.target.dropna() , fit=stats.norm);
plt.subplot(1,2,2)
_=stats.probplot(data_all.target.dropna(), plot=plt)
# Power transform of the label (1.5 ** target) to bring it closer to a normal distribution, with plots
# Transform target to improve normality
sp = data_train.target
data_train["target1"] = np.power(1.5,sp)
print(data_train.target1.describe())
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.distplot(data_train.target1.dropna(),fit=stats.norm);
plt.subplot(1,2,2)
_=stats.probplot(data_train.target1.dropna(), plot=plt)
# Get the training and test data
# function to get training samples
def get_training_data():
    # extract training samples
    from sklearn.model_selection import train_test_split
    df_train = data_all[data_all["oringin"]=="train"].copy()
    df_train["label"]=data_train.target1
    # split target and features
    y = df_train.target
    X = df_train.drop(["oringin","target","label"],axis=1)
    X_train,X_valid,y_train,y_valid=train_test_split(X,y,test_size=0.3,random_state=100)
    return X_train,X_valid,y_train,y_valid
# extract test data (without target)
def get_test_data():
    df_test = data_all[data_all["oringin"]=="test"].reset_index(drop=True)
    return df_test.drop(["oringin","target"],axis=1)
# Scoring functions
from sklearn.metrics import make_scorer
# metric for evaluation
def rmse(y_true, y_pred):
    diff = y_pred - y_true
    sum_sq = sum(diff**2)
    n = len(y_pred)
    return np.sqrt(sum_sq/n)
def mse(y_true,y_pred):
    return mean_squared_error(y_true,y_pred)
# scorer to be used in sklearn model fitting
rmse_scorer = make_scorer(rmse, greater_is_better=False)
mse_scorer = make_scorer(mse, greater_is_better=False)
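One detail of `make_scorer(..., greater_is_better=False)` is easy to miss: the resulting scorer returns the negated metric, so that larger values are always better for model selection. The sketch below checks this on a tiny hand-made dataset; the data values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 1.0, 2.0, 4.0])

model = LinearRegression().fit(X_demo, y_demo)
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

# A scorer is called as scorer(estimator, X, y_true)
score = neg_mse_scorer(model, X_demo, y_demo)
plain_mse = mean_squared_error(y_demo, model.predict(X_demo))
print("scorer output:", score)
print("plain MSE:", plain_mse)
```

This is why cross-validation results obtained with such scorers, or with `scoring="neg_mean_squared_error"`, are usually passed through `abs()` or negated before being reported.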
# Detect outliers and plot them
# function to detect outliers based on the predictions of a model
def find_outliers(model, X, y, sigma=3):
    # predict y values using model
    try:
        y_pred = pd.Series(model.predict(X), index=y.index)
    # if predicting fails, try fitting the model first
    except Exception:
        model.fit(X,y)
        y_pred = pd.Series(model.predict(X), index=y.index)
    # calculate residuals between the model prediction and true y values
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    # calculate z statistic, define outliers to be where |z|>sigma
    z = (resid - mean_resid)/std_resid
    outliers = z[abs(z)>sigma].index
    # print and plot the results
    print('R2=',model.score(X,y))
    print('rmse=',rmse(y, y_pred))
    print("mse=",mean_squared_error(y,y_pred))
    print('---------------------------------------')
    print('mean of residuals:',mean_resid)
    print('std of residuals:',std_resid)
    print('---------------------------------------')
    print(len(outliers),'outliers:')
    print(outliers.tolist())
    plt.figure(figsize=(15,5))
    ax_131 = plt.subplot(1,3,1)
    plt.plot(y,y_pred,'.')
    plt.plot(y.loc[outliers],y_pred.loc[outliers],'ro')
    plt.legend(['Accepted','Outlier'])
    plt.xlabel('y')
    plt.ylabel('y_pred');
    ax_132=plt.subplot(1,3,2)
    plt.plot(y,y-y_pred,'.')
    plt.plot(y.loc[outliers],y.loc[outliers]-y_pred.loc[outliers],'ro')
    plt.legend(['Accepted','Outlier'])
    plt.xlabel('y')
    plt.ylabel('y - y_pred');
    ax_133=plt.subplot(1,3,3)
    z.plot.hist(bins=50,ax=ax_133)
    z.loc[outliers].plot.hist(color='r',bins=50,ax=ax_133)
    plt.legend(['Accepted','Outlier'])
    plt.xlabel('z')
    plt.savefig('outliers.png')
    return outliers
# get training data
from sklearn.linear_model import Ridge
X_train, X_valid,y_train,y_valid = get_training_data()
test=get_test_data()
# find and remove outliers using a Ridge model
outliers = find_outliers(Ridge(), X_train, y_train)
# permanently remove these outliers from the data
#df_train = data_all[data_all["oringin"]=="train"]
#df_train["label"]=data_train.target1
#df_train=df_train.drop(outliers)
X_outliers=X_train.loc[outliers]
y_outliers=y_train.loc[outliers]
X_t=X_train.drop(outliers)
y_t=y_train.drop(outliers)
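Stripped of its printing and plotting, the outlier rule used above is: fit a model, standardize the residuals, and flag samples whose absolute z-score exceeds `sigma`. The sketch below reproduces just that rule on synthetic data with one planted outlier; the data, the planted index 10, and the offset of 20 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic linear data with one deliberately corrupted target value
rng = np.random.RandomState(0)
X_demo = rng.uniform(-1.0, 1.0, size=(100, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=100)
y_demo[10] += 20.0  # planted outlier

model = Ridge().fit(X_demo, y_demo)
resid = y_demo - model.predict(X_demo)

# Standardize residuals (ddof=1 matches the pandas default used by Series.std)
z = (resid - resid.mean()) / resid.std(ddof=1)
outlier_idx = np.flatnonzero(np.abs(z) > 3)
print("flagged indices:", outlier_idx.tolist())

# Drop the flagged rows before training the final models
X_clean = np.delete(X_demo, outlier_idx, axis=0)
y_clean = np.delete(y_demo, outlier_idx)
```

Note that a single extreme residual inflates the residual standard deviation, so this rule is conservative: it catches gross errors rather than mildly unusual samples.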
# Train the models on the data with outliers removed
def get_trainning_data_omitoutliers():
    y1=y_t.copy()
    X1=X_t.copy()
    return X1,y1
# Train a model with grid search
from sklearn.preprocessing import StandardScaler
def train_model(model, param_grid=[], X=[], y=[],
                splits=5, repeats=5):
    # get unmodified training data, unless data to use already specified
    if len(y)==0:
        X,y = get_trainning_data_omitoutliers()
    #poly_trans=PolynomialFeatures(degree=2)
    #X=poly_trans.fit_transform(X)
    #X=MinMaxScaler().fit_transform(X)
    # create cross-validation method
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)
    # perform a grid search if param_grid given
    if len(param_grid)>0:
        # setup grid search parameters
        gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                               scoring="neg_mean_squared_error",
                               verbose=1, return_train_score=True)
        # search the grid
        gsearch.fit(X,y)
        # extract best model from the grid
        model = gsearch.best_estimator_
        best_idx = gsearch.best_index_
        # get cv-scores for best model
        grid_results = pd.DataFrame(gsearch.cv_results_)
        cv_mean = abs(grid_results.loc[best_idx,'mean_test_score'])
        cv_std = grid_results.loc[best_idx,'std_test_score']
    # no grid search, just cross-val score for given model
    else:
        grid_results = []
        cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=rkfold)
        cv_mean = abs(np.mean(cv_results))
        cv_std = np.std(cv_results)
        # cross_val_score fits clones only, so fit the model itself before predicting below
        model.fit(X, y)
    # combine mean and std cv-score in to a pandas series
    cv_score = pd.Series({'mean':cv_mean,'std':cv_std})
    # predict y using the fitted model
    y_pred = model.predict(X)
    # print stats on model performance
    print('----------------------')
    print(model)
    print('----------------------')
    print('score=',model.score(X,y))
    print('rmse=',rmse(y, y_pred))
    print('mse=',mse(y, y_pred))
    print('cross_val: mean=',cv_mean,', std=',cv_std)
    # residual plots
    y_pred = pd.Series(y_pred,index=y.index)
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    z = (resid - mean_resid)/std_resid
    n_outliers = sum(abs(z)>3)
    plt.figure(figsize=(15,5))
    ax_131 = plt.subplot(1,3,1)
    plt.plot(y,y_pred,'.')
    plt.xlabel('y')
    plt.ylabel('y_pred');
    plt.title('corr = {:.3f}'.format(np.corrcoef(y,y_pred)[0][1]))
    ax_132=plt.subplot(1,3,2)
    plt.plot(y,y-y_pred,'.')
    plt.xlabel('y')
    plt.ylabel('y - y_pred');
    plt.title('std resid = {:.3f}'.format(std_resid))
    ax_133=plt.subplot(1,3,3)
    z.plot.hist(bins=50,ax=ax_133)
    plt.xlabel('z')
    plt.title('{:.0f} samples with z>3'.format(n_outliers))
    return model, cv_score, grid_results
# places to store optimal models and scores
opt_models = dict()
score_models = pd.DataFrame(columns=['mean','std'])
# no. k-fold splits
splits=5
# no. k-fold iterations
repeats=5
7.1 Single-Model Prediction Performance
7.1.1 Ridge Regression
model = 'Ridge'
opt_models[model] = Ridge()
alph_range = np.arange(0.25,6,0.25)
param_grid = {'alpha': alph_range}
opt_models[model],cv_score,grid_results = train_model(opt_models[model], param_grid=param_grid,
splits=splits, repeats=repeats)
cv_score.name = model
score_models = score_models.append(cv_score)
plt.figure()
plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
abs(grid_results['std_test_score'])/np.sqrt(splits*repeats))
plt.xlabel('alpha')
plt.ylabel('score')
7.1.2 Lasso Regression
model = 'Lasso'
opt_models[model] = Lasso()
alph_range = np.arange(1e-4,1e-3,4e-5)
param_grid = {'alpha': alph_range}
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
splits=splits, repeats=repeats)
cv_score.name = model
score_models = score_models.append(cv_score)
plt.figure()
plt.errorbar(alph_range, abs(grid_results['mean_test_score']),abs(grid_results['std_test_score'])/np.sqrt(splits*repeats))
plt.xlabel('alpha')
plt.ylabel('score')
7.1.3 ElasticNet Regression
model ='ElasticNet'
opt_models[model] = ElasticNet()
param_grid = {'alpha': np.arange(1e-4,1e-3,1e-4),
'l1_ratio': np.arange(0.1,1.0,0.1),
'max_iter':[100000]}
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
splits=splits, repeats=1)
cv_score.name = model
score_models = score_models.append(cv_score)
7.1.4 SVR Regression
model='LinearSVR'
opt_models[model] = LinearSVR()
crange = np.arange(0.1,1.0,0.1)
param_grid = {'C':crange,
'max_iter':[1000]}
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
splits=splits, repeats=repeats)
cv_score.name = model
score_models = score_models.append(cv_score)
plt.figure()
plt.errorbar(crange, abs(grid_results['mean_test_score']),abs(grid_results['std_test_score'])/np.sqrt(splits*repeats))
plt.xlabel('C')
plt.ylabel('score')
7.1.5 K-Nearest Neighbors (KNN)
model = 'KNeighbors'
opt_models[model] = KNeighborsRegressor()
param_grid = {'n_neighbors':np.arange(3,11,1)}
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
splits=splits, repeats=1)
cv_score.name = model
score_models = score_models.append(cv_score)
plt.figure()
plt.errorbar(np.arange(3,11,1), abs(grid_results['mean_test_score']),abs(grid_results['std_test_score'])/np.sqrt(splits*1))
plt.xlabel('n_neighbors')
plt.ylabel('score')
7.1.6 GBDT Model
model = 'GradientBoosting'
opt_models[model] = GradientBoostingRegressor()
param_grid = {'n_estimators':[150,250,350],
'max_depth':[1,2,3],
'min_samples_split':[5,6,7]}
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
splits=splits, repeats=1)
cv_score.name = model
score_models = score_models.append(cv_score)
7.1.7 XGB Model
model = 'XGB'
opt_models[model] = XGBRegressor()
param_grid = {'n_estimators':[100,200,300,400,500],
'max_depth':[1,2,3],
}
opt_models[model], cv_score,grid_results = train_model(opt_models[model], param_grid=param_grid,
splits=splits, repeats=1)
cv_score.name = model
score_models = score_models.append(cv_score)
7.1.8 Random Forest Model
model = 'RandomForest'
opt_models[model] = RandomForestRegressor()
param_grid = {'n_estimators':[100,150,200],
'max_features':[8,12,16,20,24],
'min_samples_split':[2,4,6]}
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
splits=5, repeats=1)
cv_score.name = model
score_models = score_models.append(cv_score)
7.2 Model Prediction: Multi-Model Bagging
def model_predict(test_data,test_y=[],stack=False):
    #poly_trans=PolynomialFeatures(degree=2)
    #test_data1=poly_trans.fit_transform(test_data)
    #test_data=MinMaxScaler().fit_transform(test_data)
    i=0
    y_predict_total=np.zeros((test_data.shape[0],))
    for model in opt_models.keys():
        # LinearSVR and KNeighbors are left out of the average
        if model!="LinearSVR" and model!="KNeighbors":
            y_predict=opt_models[model].predict(test_data)
            y_predict_total+=y_predict
            i+=1
            if len(test_y)>0:
                print("{}_mse:".format(model),mean_squared_error(y_predict,test_y))
    # average the predictions of the selected models
    y_predict_mean=np.round(y_predict_total/i,3)
    if len(test_y)>0:
        print("mean_mse:",mean_squared_error(y_predict_mean,test_y))
    else:
        y_predict_mean=pd.Series(y_predict_mean)
    return y_predict_mean
# Bagging prediction
model_predict(X_valid,y_valid)
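What `model_predict` does is a plain average of the predictions of several already-fitted models; unlike bagging in the strict sense, no bootstrap resampling is involved. A minimal self-contained sketch of that averaging follows. The synthetic data, the three models, and their settings are assumptions for illustration and are not the tuned models from section 7.1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3, random_state=100)

models = {
    "Ridge": Ridge(),
    "Lasso": Lasso(alpha=0.001),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
# Fit every model and keep its validation predictions
preds = {name: m.fit(X_tr, y_tr).predict(X_va) for name, m in models.items()}

# Unweighted mean across models
avg_pred = np.mean(list(preds.values()), axis=0)
for name, p in preds.items():
    print("{}_mse:".format(name), mean_squared_error(y_va, p))
print("mean_mse:", mean_squared_error(y_va, avg_pred))
```

Whether the average beats the best single model depends on how different the models' errors are, so it is worth printing both, as done here.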
7.3 Model Fusion: Stacking
7.3.1 A Simple Model-Fusion Example
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
## Requires mlxtend; install it with: pip install mlxtend
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions
%matplotlib inline
# Initializing Classifiers
clf1 = LogisticRegression(random_state=0)
clf2 = RandomForestClassifier(random_state=0)
clf3 = SVC(random_state=0, probability=True)
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], weights=[2, 1, 1], voting='soft')
# Loading some example data
X, y = iris_data()
X = X[:,[0, 2]]
# Plotting Decision Regions
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))
for clf, lab, grd in zip([clf1, clf2, clf3, eclf],
                         ['Logistic Regression', 'Random Forest', 'RBF kernel SVM', 'Ensemble'],
                         itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(lab)
plt.show()
7.3.2 Multi-Model Stacking for Industrial Steam Prediction
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from scipy import sparse
import xgboost
import lightgbm
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,GradientBoostingRegressor,ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
def stacking_reg(clf,train_x,train_y,test_x,clf_name,kf,label_split=None):
    folds = kf.get_n_splits()  # number of folds, taken from the splitter
    train=np.zeros((train_x.shape[0],1))  # out-of-fold predictions for the training rows
    test=np.zeros((test_x.shape[0],1))
    test_pre=np.empty((folds,test_x.shape[0],1))  # one set of test predictions per fold
    cv_scores=[]
    for i,(train_index,test_index) in enumerate(kf.split(train_x,label_split)):
        tr_x=train_x[train_index]
        tr_y=train_y[train_index]
        te_x=train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ["rf","ada","gb","et","lr","lsvc","knn"]:
            clf.fit(tr_x,tr_y)
            pre=clf.predict(te_x).reshape(-1,1)
            train[test_index]=pre
            test_pre[i,:]=clf.predict(test_x).reshape(-1,1)
            cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            # the test set has no labels, so none are attached here
            z = clf.DMatrix(test_x, missing=-1)
            params = {'booster': 'gbtree',
                      'eval_metric': 'rmse',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'nthread': 12
                      }
            num_round = 1000  # remember to adjust
            early_stopping_rounds = 10  # adjust
            watchlist = [(train_matrix, 'train'),
                         (test_matrix, 'eval')
                         ]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round,evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre= model.predict(test_matrix,ntree_limit=model.best_ntree_limit).reshape(-1,1)
                train[test_index]=pre
                test_pre[i, :]= model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            #z = clf.Dataset(test_x, label=te_y)
            #z=test_x
            params = {
                'boosting_type': 'gbdt',
                'objective': 'regression_l2',
                'metric': 'mse',
                'min_child_weight': 1.5,
                'num_leaves': 2**5,
                'lambda_l2': 10,
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 0.7,
                'learning_rate': 0.03,
                'tree_method': 'exact',
                'seed': 2017,
                'nthread': 12,
                'silent': True,
            }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params, train_matrix,num_round,valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre= model.predict(te_x,num_iteration=model.best_iteration).reshape(-1,1)
                train[test_index]=pre
                test_pre[i, :]= model.predict(test_x, num_iteration=model.best_iteration).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y, pre))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:"%clf_name,cv_scores)
    # average the per-fold test predictions
    test[:]=test_pre.mean(axis=0)
    print("%s_score_list:"%clf_name,cv_scores)
    print("%s_score_mean:"%clf_name,np.mean(cv_scores))
    return train.reshape(-1,1),test.reshape(-1,1)
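The core idea in `stacking_reg` is that each base model contributes out-of-fold predictions for the training rows, which then serve as input features for a second-level model. A compact sketch of that idea using only scikit-learn is shown below. It is not the function above: the data, the two base models, and the `LinearRegression` meta-learner are assumptions for illustration, and the test-set features come from models refit on all training rows rather than from averaging per-fold predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.RandomState(0)
X_all = rng.normal(size=(200, 4))
y_all = X_all[:, 0] + np.sin(X_all[:, 1]) + rng.normal(scale=0.2, size=200)
X_train_demo, y_train_demo = X_all[:150], y_all[:150]
X_test_demo = X_all[150:]

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
base_models = [Ridge(), RandomForestRegressor(n_estimators=30, random_state=0)]

# Level 1: every training row is predicted by a model that never saw it
oof = np.column_stack([
    cross_val_predict(m, X_train_demo, y_train_demo, cv=kf_demo) for m in base_models
])
# Test-set meta-features from base models refit on the full training set
test_meta = np.column_stack([
    m.fit(X_train_demo, y_train_demo).predict(X_test_demo) for m in base_models
])

# Level 2: the meta-learner combines the base-model predictions
meta = LinearRegression().fit(oof, y_train_demo)
final_pred = meta.predict(test_meta)
print("meta-learner weights:", meta.coef_)
```

Using out-of-fold predictions matters because in-sample predictions from a flexible base model would look far more accurate than they are, and the meta-learner would learn to trust that model too much.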
Stacking base learners
def rf_reg(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestRegressor(n_estimators=100, max_depth=20, n_jobs=-1, random_state=2017, max_features="auto",verbose=1)
    rf_train, rf_test = stacking_reg(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test,"rf_reg"
def ada_reg(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostRegressor(n_estimators=30, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test,"ada_reg"
def gb_reg(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2017,max_depth=5,verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test,"gb_reg"
def et_reg(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesRegressor(n_estimators=100, max_depth=35, max_features="auto", n_jobs=-1, random_state=2017,verbose=1)
    et_train, et_test = stacking_reg(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test,"et_reg"
def lr_reg(x_train, y_train, x_valid, kf, label_split=None):
    lr_reg=LinearRegression(n_jobs=-1)
    lr_train, lr_test = stacking_reg(lr_reg, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr_reg"
def xgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    # the xgboost module itself is passed; stacking_reg calls its DMatrix and train functions
    xgb_train, xgb_test = stacking_reg(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test,"xgb_reg"
def lgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    # the lightgbm module itself is passed; stacking_reg calls its Dataset and train functions
    lgb_train, lgb_test = stacking_reg(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test,"lgb_reg"
Model fusion: stacking prediction
def stacking_pred(x_train, y_train, x_valid, kf, clf_list, label_split=None, clf_fin="lgb", if_concat_origin=True):
    # Run every base learner and collect its out-of-fold (train) and
    # fold-averaged (test) predictions; the lists are created once so that
    # all base learners contribute, not only the last one
    column_list = []
    train_data_list = []
    test_data_list = []
    for clf in clf_list:
        train_data, test_data, clf_name = clf(x_train, y_train, x_valid, kf, label_split=label_split)
        train_data_list.append(train_data)
        test_data_list.append(test_data)
        column_list.append("clf_%s" % (clf_name))
    train = np.concatenate(train_data_list, axis=1)
    test = np.concatenate(test_data_list, axis=1)
    # Optionally keep the original features next to the base-learner predictions
    if if_concat_origin:
        train = np.concatenate([x_train, train], axis=1)
        test = np.concatenate([x_valid, test], axis=1)
    print(x_train.shape)
    print(train.shape)
    print(column_list)
    # Fit the final (second-level) model on the stacked features
    if clf_fin in ["rf", "ada", "gb", "et", "lr"]:
        if clf_fin in ["rf"]:
            clf = RandomForestRegressor(n_estimators=100, max_depth=20, n_jobs=-1, random_state=2017, max_features=1.0, verbose=1)
        elif clf_fin in ["ada"]:
            clf = AdaBoostRegressor(n_estimators=30, random_state=2017, learning_rate=0.01)
        elif clf_fin in ["gb"]:
            clf = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2017, max_depth=5, verbose=1)
        elif clf_fin in ["et"]:
            clf = ExtraTreesRegressor(n_estimators=100, max_depth=35, max_features=1.0, n_jobs=-1, random_state=2017, verbose=1)
        elif clf_fin in ["lr"]:
            clf = LinearRegression(n_jobs=-1)
        clf.fit(train, y_train)
        pre = clf.predict(test).reshape(-1, 1)
        return pre
    elif clf_fin in ["xgb"]:
        clf = xgboost
        train_matrix = clf.DMatrix(train, label=y_train, missing=-1)
        # No separate hold-out set exists at this level, so the eval set reuses the training data
        test_matrix = clf.DMatrix(train, label=y_train, missing=-1)
        params = {'booster': 'gbtree',
                  'eval_metric': 'rmse',
                  'gamma': 1,
                  'min_child_weight': 1.5,
                  'max_depth': 5,
                  'lambda': 10,
                  'subsample': 0.7,
                  'colsample_bytree': 0.7,
                  'colsample_bylevel': 0.7,
                  'eta': 0.03,
                  'tree_method': 'exact',
                  'seed': 2017,
                  'nthread': 12
                  }
        num_round = 1000
        early_stopping_rounds = 10
        watchlist = [(train_matrix, 'train'),
                     (test_matrix, 'eval')
                     ]
        model = clf.train(params, train_matrix, num_boost_round=num_round, evals=watchlist,
                          early_stopping_rounds=early_stopping_rounds
                          )
        # Booster.predict expects a DMatrix, not a NumPy array;
        # iteration_range replaces ntree_limit in recent XGBoost releases
        pre = model.predict(clf.DMatrix(test, missing=-1),
                            iteration_range=(0, model.best_iteration + 1)).reshape(-1, 1)
        return pre
    elif clf_fin in ["lgb"]:
        clf = lightgbm
        train_matrix = clf.Dataset(train, label=y_train)
        # As in the xgb branch, the validation set reuses the training data
        test_matrix = clf.Dataset(train, label=y_train)
        # XGBoost-only keys (colsample_bylevel, tree_method, silent) are dropped: LightGBM does not use them
        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression_l2',
            'metric': 'mse',
            'min_child_weight': 1.5,
            'num_leaves': 2**5,
            'lambda_l2': 10,
            # LightGBM applies subsample (bagging) only when bagging_freq > 0
            'subsample': 0.7,
            'colsample_bytree': 0.7,
            'learning_rate': 0.03,
            'seed': 2017,
            'nthread': 12,
            'verbose': -1,
        }
        num_round = 1000
        early_stopping_rounds = 10
        # Early stopping goes through a callback; recent LightGBM releases
        # no longer accept early_stopping_rounds in train()
        model = clf.train(params, train_matrix, num_round, valid_sets=[test_matrix],
                          callbacks=[clf.early_stopping(early_stopping_rounds)]
                          )
        pre = model.predict(test, num_iteration=model.best_iteration).reshape(-1, 1)
        return pre
    else:
        raise ValueError("Unknown clf_fin: %s" % clf_fin)
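For comparison, scikit-learn ships a built-in `StackingRegressor` that covers the same idea. The sketch below is not the article's method; it swaps in that class on synthetic data, with the base learners, sizes, and `*_demo` names chosen only for illustration. `passthrough=True` plays the role of `if_concat_origin=True`: the final estimator sees the original features alongside the base-learner predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data, purely for illustration
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=80)
X_new_demo = rng.normal(size=(10, 4))

stack_demo = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    passthrough=True,  # also pass the original features to the final estimator
)
stack_demo.fit(X_demo, y_demo)
pred_demo = stack_demo.predict(X_new_demo)
print(pred_demo.shape)
```

One difference worth keeping in mind: `StackingRegressor` refits each base learner on the full training set for test-time prediction, whereas the hand-written version above averages the predictions of the per-fold models.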
### Load the dataset
with open("./zhengqi_train.txt") as fr:
    data_train = pd.read_table(fr, sep="\t")
with open("./zhengqi_test.txt") as fr_test:
    data_test = pd.read_table(fr_test, sep="\t")
### K-fold cross-validation
from sklearn.model_selection import StratifiedKFold, KFold
folds = 5
seed = 1
# n_splits is tied to folds so the fold count is defined in one place
kf = KFold(n_splits=folds, shuffle=True, random_state=0)
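Stacking relies on one property of `KFold`: across the folds, every sample lands in the validation part exactly once, so each training row receives exactly one out-of-fold prediction. A small check on a toy array (the `*_demo` names are illustrative only) makes that visible:

```python
import numpy as np
from sklearn.model_selection import KFold

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
X_demo = np.arange(20).reshape(-1, 1)

seen = []
for fold, (train_idx, valid_idx) in enumerate(kf_demo.split(X_demo)):
    # Train and validation indices never overlap within a fold
    assert len(set(train_idx) & set(valid_idx)) == 0
    seen.extend(valid_idx.tolist())
    print(fold, len(train_idx), len(valid_idx))

# Every sample index appears in exactly one validation fold
assert sorted(seen) == list(range(20))
```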
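### Training and test set data
# Use only the feature columns that also exist in the test set (i.e. everything except 'target')
x_train = data_train[data_test.columns].values
x_valid = data_test[data_test.columns].values
y_train = data_train['target'].values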
### Fused prediction with lr_reg and lgb_reg
clf_list = [lr_reg, lgb_reg]
#clf_list = [lr_reg, rf_reg]
## Overfits very easily
pred = stacking_pred(x_train, y_train, x_valid, kf, clf_list, label_split=None, clf_fin="lgb", if_concat_origin=True)
print(pred)
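8. Summary
This project covered exploratory data analysis (examining correlations between variables and identifying the key ones); feature engineering to refine the data (outlier handling, normalization, and dimensionality reduction); regression model training with mainstream ML models (decision trees, random forests, LightGBM, and others); model validation (evaluation metrics and cross-validation); feature optimization with lgb; and finally model fusion via stacking.
Original project link: https://www.heywhale.com/home/column/64141d6b1c8c8b518ba97dcc
Reference link: https://tianchi.aliyun.com/course/278/3427
The source code for running locally is available at the link below:
https://download.csdn.net/download/sinat_39620217/87630189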
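I am planning to assemble a systematic set of project-based courses covering ML, DRL, NLP, and related fields, so that newcomers can pick up the relevant knowledge quickly. Note: some of the projects are classic ones from the web, included to help readers learn quickly; more hands-on material (competitions, papers, real-world applications, and so on) will be added over time.
- For machine learning, the planned path is: introductory ML algorithms -> simple hands-on projects -> data modeling competitions -> solving related real-world application problems. A single track to help readers learn and get to practice quickly.
- For deep reinforcement learning, the planned path is: basic single-agent algorithms (mainly in gym environments) -> mainstream multi-agent algorithms (mainly in gym environments) -> single-agent and multi-agent practice (paper reproductions leaning toward business applications, such as UAV scheduling optimization and power resource dispatch).
- For natural language processing: besides individual algorithms, the plan centers on knowledge graph construction: information extraction (including intelligent annotation) -> knowledge fusion -> knowledge reasoning -> graph applications.
What I hope you can do once you have mastered the above:
- For ML, go on to do very well in mathematical modeling competitions (taking part should at least win an award; reaching the top is still hard and takes further study).
- Actually solve real-world optimization and scheduling problems, rather than stopping at game demos in gym environments. (Going deeper may require your own research, and it remains quite difficult.)
- Master the algorithms for each key stage of the full knowledge graph construction pipeline, including graph database knowledge.
These three areas are tightly coupled. Later I will tie them together through a complete project such as a search and recommendation system, in which all of these algorithms come into play. For example, knowledge graphs draw on graph algorithms, NLP, and ML, while search and recommendation systems combine the field's own recall, coarse ranking, fine ranking, re-ranking, and mixed ranking algorithms with reinforcement learning, knowledge graphs, and more. It is an ambitious plan, and I will work through it gradually.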