Linear regression is one of the simplest algorithms in machine learning, and it can be trained in several different ways. In this article we cover the following regression algorithms: linear regression, robust regression, Ridge regression, LASSO regression, Elastic Net, polynomial regression, stochastic gradient descent, the multilayer perceptron, random forest regression, and support vector machines. We also cover the most common metrics for evaluating regression models: mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE).
Importing Libraries and Reading the Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import hvplot.pandas
%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

USAhousing = pd.read_csv('../usa-housing/USA_Housing.csv')
USAhousing.head()
Exploratory Data Analysis (EDA)
Next we create some simple plots to inspect the data. EDA helps us get familiar with the data and learn its characteristics, in particular the outliers that affect a regression model the most.
USAhousing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
 6   Address                       5000 non-null   object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
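To get a quick feel for the data, a minimal EDA sketch (assuming the column names shown above) could pair-plot the numeric columns and draw a correlation heatmap; skewed distributions and collinear features show up immediately:

# pairwise scatter plots of the numeric columns (Address is a string, so drop it)
sns.pairplot(USAhousing.drop(columns='Address'))
plt.show()

# correlation heatmap to spot collinear features
sns.heatmap(USAhousing.drop(columns='Address').corr(), annot=True)
plt.show()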
Take a look at the dataset's descriptive statistics:
USAhousing.describe()
Preparing for Training
We will start by training a linear regression model. Before training we need to separate the data into features and target: the features go into X, the target variable into y; in this example the target is the Price column.
Then we split the data into a training set and a test set. We train the model on the training set and use the test set to evaluate it.
from sklearn.model_selection import train_test_split

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
We also create a few helper functions for evaluating the regression models.
from sklearn import metrics
from sklearn.model_selection import cross_val_score

def cross_val(model):
    # 10-fold cross-validation, scored with the regressor's default R^2
    pred = cross_val_score(model, X, y, cv=10)
    return pred.mean()

def print_evaluate(true, predicted):
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    r2_square = metrics.r2_score(true, predicted)
    print('MAE:', mae)
    print('MSE:', mse)
    print('RMSE:', rmse)
    print('R2 Square', r2_square)
    print('__________________________________')

def evaluate(true, predicted):
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    r2_square = metrics.r2_score(true, predicted)
    return mae, mse, rmse, r2_square
Training Regression Models
Linear regression generally relies on the following assumptions:
Linearity: linear regression assumes the relationship between inputs and output is linear. You may need to transform the data to make the relationship linear (for example, a log transform for an exponential relationship).
Low noise: linear regression assumes the input and output variables are not noisy. This matters most for the output variable; if possible, remove the outliers in the output (y).
No collinearity: linear regression overfits when the input variables are highly correlated. Compute the correlations of the input data and drop the most correlated features (see the sketch after this list).
Gaussian distributions: linear regression makes more reliable predictions when the inputs and output are Gaussian. Transforms (for example, log or Box-Cox) can make a variable's distribution look more Gaussian.
Rescaled inputs: linear regression usually makes more reliable predictions when the input variables are rescaled with standardization or normalization.
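As a minimal sketch of the collinearity check mentioned in the list above (the 0.9 threshold is an arbitrary assumption; tune it to your data):

# absolute pairwise correlations between the input features
corr = X.corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# features whose correlation with another feature exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # candidates to remove before fitting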
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('std_scalar', StandardScaler())])

X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
Now we walk through examples of each regression algorithm.
1. Linear Regression and Evaluation Metrics
from sklearn.linear_model import LinearRegression

# the features were already standardized by the pipeline above, so no extra
# normalization is needed (the normalize= argument was removed from
# LinearRegression in recent scikit-learn versions)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
Now that we have a first model, we need metrics to evaluate it. Here are three common evaluation metrics for regression problems:
Mean absolute error (MAE) is the mean of the absolute values of the errors:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$
Mean squared error (MSE) is the mean of the squared errors:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Root mean squared error (RMSE) is the square root of the mean of the squared errors:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Of these three metrics:
- MAE is the easiest to understand, because it is the average error.
- MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
- RMSE is more popular than MSE, because RMSE is interpretable in the units of "y".
All of these are loss functions, and our training goal is to minimize them.
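As a quick sanity check, the three metrics can be computed by hand on toy numbers (purely illustrative):

y_true = np.array([3.0, 5.0, 2.5])
y_hat = np.array([2.5, 5.0, 4.0])

errors = y_true - y_hat              # [0.5, 0.0, -1.5]
print(np.mean(np.abs(errors)))       # MAE  = 0.667
print(np.mean(errors ** 2))          # MSE  = 0.833
print(np.sqrt(np.mean(errors ** 2))) # RMSE = 0.913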
test_pred = lin_reg.predict(X_test)
train_pred = lin_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df = pd.DataFrame(data=[["Linear Regression", *evaluate(y_test, test_pred),
                                 cross_val(LinearRegression())]],
                          columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
Test set evaluation:
_____________________________________
MAE: 81135.56609336878
MSE: 10068422551.40088
RMSE: 100341.52954485436
R2 Square 0.9146818498754016
__________________________________
Train set evaluation:
_____________________________________
MAE: 81480.49973174892
MSE: 10287043161.197224
RMSE: 101425.06180031257
R2 Square 0.9192986579075526
__________________________________
2. Robust Regression
Robust regression is a form of regression analysis designed to overcome some limitations of traditional parametric and non-parametric methods, so that violations of the regression assumptions by the underlying data-generating process do not unduly affect the fit.
Robust regression is worth considering when the data contain outliers. In the presence of outliers, least-squares estimation is inefficient and can be biased: because the least-squares fit is pulled toward the outliers, and the estimated variance is artificially inflated, the outliers themselves can end up being masked.
Random Sample Consensus (RANSAC)
Random sample consensus (RANSAC) is an iterative method that estimates the parameters of a mathematical model from a set of observations containing outliers, without letting the outliers influence the estimates. It can therefore also be viewed as an outlier-detection method.
A basic assumption is that the data consist of "inliers" and "outliers": inliers are points whose distribution can be explained by some set of model parameters, possibly subject to noise, while outliers are points that do not fit the model. RANSAC further assumes that, given a (usually small) set of inliers, there is a procedure that can estimate model parameters that optimally explain or fit that data.
from sklearn.linear_model import RANSACRegressor

# base_estimator= was renamed to estimator= in recent scikit-learn versions
model = RANSACRegressor(estimator=LinearRegression(), max_trials=100)
model.fit(X_train, y_train)

test_pred = model.predict(X_test)
train_pred = model.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df_2 = pd.DataFrame(data=[["Robust Regression", *evaluate(y_test, test_pred), cross_val(RANSACRegressor())]],
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
# DataFrame.append was removed in pandas 2.0; concat does the same job
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
Test set evaluation:
_____________________________________
MAE: 84645.31069259303
MSE: 10996805871.555056
RMSE: 104865.65630155115
R2 Square 0.9068148829222649
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 84956.48056962446
MSE: 11363196455.35414
RMSE: 106598.29480509592
R2 Square 0.9108562888249323
__________________________________
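A useful by-product of the fitted RANSACRegressor is its inlier mask, which records which training points the final model treated as inliers; a short sketch:

inlier_mask = model.inlier_mask_   # boolean array over the training samples
outlier_mask = ~inlier_mask
print(outlier_mask.sum(), 'of', len(inlier_mask), 'training points flagged as outliers')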
3. Ridge Regression
Ridge regression addresses some of the problems of ordinary least squares by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares:

$$\min_{w} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2$$

Here alpha >= 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of alpha, the greater the shrinkage, and thus the more robust the coefficients are to collinearity.
Ridge regression is an L2-penalized model: it adds the sum of the squared weights to the least-squares cost function.
from sklearn.linear_model import Ridge

model = Ridge(alpha=100, solver='cholesky', tol=0.0001, random_state=42)
model.fit(X_train, y_train)

test_pred = model.predict(X_test)
train_pred = model.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df_2 = pd.DataFrame(data=[["Ridge Regression", *evaluate(y_test, test_pred), cross_val(Ridge())]],
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
Test set evaluation:
_____________________________________
MAE: 81428.64835535336
MSE: 10153269900.892609
RMSE: 100763.43533689494
R2 Square 0.9139628674464607
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 81972.39058585509
MSE: 10382929615.14346
RMSE: 101896.66145239233
R2 Square 0.9185464334441484
__________________________________
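The alpha=100 above was picked by hand. A common alternative is to let cross-validation choose it; a minimal sketch, where the alpha grid is an arbitrary assumption:

from sklearn.linear_model import RidgeCV

ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0, 1000.0])
ridge_cv.fit(X_train, y_train)
print('best alpha:', ridge_cv.alpha_)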
4. LASSO Regression
LASSO regression is a linear model that estimates sparse coefficients. Mathematically, it consists of a linear model trained with an L1 prior as its regularizer. The objective function to minimize is:

$$\min_{w} \frac{1}{2 n_{\text{samples}}} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1$$
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1,
              precompute=True,
              # warm_start=True,
              positive=True,
              selection='random',
              random_state=42)
model.fit(X_train, y_train)

test_pred = model.predict(X_test)
train_pred = model.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df_2 = pd.DataFrame(data=[["Lasso Regression", *evaluate(y_test, test_pred), cross_val(Lasso())]],
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
Test set evaluation:
_____________________________________
MAE: 81135.6985172622
MSE: 10068453390.364521
RMSE: 100341.68321472648
R2 Square 0.914681588551116
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 81480.63002185506
MSE: 10287043196.634295
RMSE: 101425.0619750084
R2 Square 0.9192986576295505
__________________________________
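The defining property of Lasso is that it can drive coefficients exactly to zero. A quick way to see which features survive, assuming the model fitted above (the coefficient order matches X.columns):

coef = pd.Series(model.coef_, index=X.columns)
print(coef[coef != 0])                        # features Lasso kept
print('zeroed out:', (coef == 0).sum(), 'features')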
5. Elastic Net
Elastic Net is trained with both L1 and L2 priors as regularizers. The combination allows it to learn a sparse model in which few weights are non-zero, as with Lasso, while still keeping the regularization properties of Ridge.
Elastic Net is useful when several features are correlated with one another: Lasso is likely to pick one of them at random, while Elastic Net is likely to pick both. The objective function Elastic Net minimizes is:

$$\min_{w} \frac{1}{2 n_{\text{samples}}} \lVert Xw - y \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2$$

where rho corresponds to the l1_ratio parameter below.
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.9, selection='random', random_state=42)
model.fit(X_train, y_train)

test_pred = model.predict(X_test)
train_pred = model.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df_2 = pd.DataFrame(data=[["Elastic Net Regression", *evaluate(y_test, test_pred), cross_val(ElasticNet())]],
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
Test set evaluation:
_____________________________________
MAE: 81184.43147330945
MSE: 10078050168.470106
RMSE: 100389.49232100991
R2 Square 0.9146002670381437
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 81577.88831531754
MSE: 10299274948.101461
RMSE: 101485.34351373829
R2 Square 0.9192027001474953
__________________________________
6. Polynomial Regression
A common pattern in machine learning is to use linear models trained on non-linear functions of the data. This approach keeps the generally fast performance of linear methods, while allowing them to fit a much wider range of data.
Simple linear regression can be extended by constructing polynomial features from the coefficients. In standard linear regression, you might have a two-dimensional data model that looks like this:

$$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2$$

If we want to fit a paraboloid to the data instead of a plane, we can combine the features into second-order polynomials, so that the model looks like this:

$$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$$

This is still a linear model: if we create a new set of variables

$$z = [x_1,\; x_2,\; x_1 x_2,\; x_1^2,\; x_2^2]$$

then, after relabeling the data, the formula can be written as

$$\hat{y}(w, z) = w_0 + w_1 z_1 + w_2 z_2 + w_3 z_3 + w_4 z_4 + w_5 z_5$$

We can see that the resulting polynomial regression belongs to the same class of linear models as above (i.e., the model is linear in w) and can be solved with the same techniques. By considering linear fits in a higher-dimensional space built from these basis functions, the model gains the flexibility to fit a much broader range of data.
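To make the feature expansion concrete, here is a tiny sketch of what PolynomialFeatures produces for one made-up sample with features x1 = 2 and x2 = 3 (the numbers are arbitrary):

from sklearn.preprocessing import PolynomialFeatures

demo = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(demo))                   # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(['x1', 'x2']))   # ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']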
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=2)
X_train_2_d = poly_reg.fit_transform(X_train)
X_test_2_d = poly_reg.transform(X_test)

lin_reg = LinearRegression()  # inputs are already standardized by the pipeline
lin_reg.fit(X_train_2_d, y_train)

test_pred = lin_reg.predict(X_test_2_d)
train_pred = lin_reg.predict(X_train_2_d)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df_2 = pd.DataFrame(data=[["Polynomial Regression", *evaluate(y_test, test_pred), 0]],
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
Test set evaluation:
_____________________________________
MAE: 81174.51844119698
MSE: 10081983997.620703
RMSE: 100409.0832426066
R2 Square 0.9145669324195059
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 81363.0618562117
MSE: 10266487151.007816
RMSE: 101323.67517519198
R2 Square 0.9194599187853729
__________________________________
7. Stochastic Gradient Descent
Gradient descent is a very general optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of gradient descent is to adjust the parameters iteratively in order to minimize a cost function. Gradient descent measures the local gradient of the error function with respect to the parameter vector, and it steps in the direction of the descending gradient. Once the gradient is zero, a minimum has been reached.
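As a minimal sketch of the idea (batch gradient descent on a linear model with an MSE cost, written out by hand; the toy data, learning rate, and step count are all arbitrary assumptions):

# minimize MSE(w, b) = mean((X @ w + b - y)^2) by following its gradient
rng = np.random.default_rng(42)
Xd = rng.normal(size=(200, 2))             # toy inputs
yd = Xd @ np.array([3.0, -2.0]) + 1.0      # true weights [3, -2], bias 1

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    err = Xd @ w + b - yd                  # residuals
    w -= lr * 2 * Xd.T @ err / len(yd)     # gradient of MSE w.r.t. w
    b -= lr * 2 * err.mean()               # gradient of MSE w.r.t. b
print(w, b)                                # converges to roughly [3, -2] and 1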
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(n_iter_no_change=250, penalty=None, eta0=0.0001, max_iter=100000)
sgd_reg.fit(X_train, y_train)

test_pred = sgd_reg.predict(X_test)
train_pred = sgd_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df_2 = pd.DataFrame(data=[["Stochastic Gradient Descent", *evaluate(y_test, test_pred), 0]],
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
Test set evaluation:
_____________________________________
MAE: 81135.56682170597
MSE: 10068422777.172981
RMSE: 100341.53066987259
R2 Square 0.914681847962246
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 81480.49901528798
MSE: 10287043161.228634
RMSE: 101425.06180046742
R2 Square 0.9192986579073061
__________________________________
8. Multilayer Perceptron
The advantage of a multilayer perceptron over simple regression is that a plain linear regression model can only learn linear relationships between the features and the target, so it cannot capture complex non-linear relationships. Because every layer has an activation function, a multilayer perceptron is able to learn complex relationships between the features and the target.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam

X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

model = Sequential()
model.add(Dense(X_train.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1))

model.compile(optimizer=Adam(0.00001), loss='mse')

r = model.fit(X_train, y_train,
              validation_data=(X_test, y_test),
              batch_size=1,  # note: batch_size=1 makes training very slow; larger batches are usual
              epochs=100)

# Keras predict returns an (n, 1) array; flatten it before plotting
pred = model.predict(X_test).flatten()
pd.DataFrame({'True Values': y_test, 'Predicted Values': pred}).hvplot.scatter(x='True Values', y='Predicted Values')

pd.DataFrame(r.history)
pd.DataFrame(r.history).hvplot.line(y=['loss', 'val_loss'])
test_pred = model.predict(X_test).flatten()
train_pred = model.predict(X_train).flatten()

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df_2 = pd.DataFrame(data=[["Artificial Neural Network", *evaluate(y_test, test_pred), 0]],
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
Test set evaluation:
_____________________________________
MAE: 101035.09313018023
MSE: 16331712517.46175
RMSE: 127795.58880282899
R2 Square 0.8616077649459881
__________________________________
Train set evaluation:
_____________________________________
MAE: 102671.5714851714
MSE: 17107402549.511665
RMSE: 130795.2695991398
R2 Square 0.8657932776379376
__________________________________
9. Random Forest Regression
A random forest regressor fits many decision trees on bootstrap samples of the data and averages their predictions, which reduces variance compared with a single tree.
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(n_estimators=1000)
rf_reg.fit(X_train, y_train)

test_pred = rf_reg.predict(X_test)
train_pred = rf_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df_2 = pd.DataFrame(data=[["Random Forest Regressor", *evaluate(y_test, test_pred), 0]],
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
Test set evaluation:
_____________________________________
MAE: 94032.15903928125
MSE: 14073007326.955029
RMSE: 118629.70676417871
R2 Square 0.8807476597554337
__________________________________
Train set evaluation:
_____________________________________
MAE: 35289.68268023927
MSE: 1979246136.9966476
RMSE: 44488.71921056671
R2 Square 0.9844729124701823
__________________________________
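The gap between the train R² (0.98) and test R² (0.88) above suggests the forest is overfitting. A fitted forest also exposes impurity-based feature importances; a quick sketch, assuming the rf_reg fitted above (the order matches X.columns):

importances = pd.Series(rf_reg.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))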
10. Support Vector Machine
Support vector regression (SVR) fits a function whose deviations from the targets beyond an epsilon-tube are penalized (with the strength controlled by C), and the RBF kernel lets it model non-linear relationships; like most kernel methods it is sensitive to feature scaling, which the earlier standardization already handles.
from sklearn.svm import SVR

svm_reg = SVR(kernel='rbf', C=1000000, epsilon=0.001)
svm_reg.fit(X_train, y_train)

test_pred = svm_reg.predict(X_test)
train_pred = svm_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

results_df_2 = pd.DataFrame(data=[["SVM Regressor", *evaluate(y_test, test_pred), 0]],
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
Test set evaluation:
_____________________________________
MAE: 87205.73051021634
MSE: 11720932765.275513
RMSE: 108263.25676458987
R2 Square 0.9006787511983232
__________________________________
Train set evaluation:
_____________________________________
MAE: 73692.5684807321
MSE: 9363827731.411337
RMSE: 96766.87310960986
R2 Square 0.9265412370487783
__________________________________
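The C=1000000 and epsilon=0.001 above are hand-picked. One way to tune them is a small grid search; the grid values below are arbitrary assumptions, and cv=3 keeps the cost manageable:

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [1e3, 1e4, 1e5, 1e6], 'epsilon': [0.001, 0.1, 1.0]}
grid = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=3, scoring='neg_root_mean_squared_error')
grid.fit(X_train, y_train)
print(grid.best_params_)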
Comparing the Results
Those are 10 common regression algorithms; now let's compare the results.
results_df
results_df.set_index('Model', inplace=True)
results_df['R2 Square'].plot(kind='barh', figsize=(12, 8))
As you can see, although the differences in this example are small (which comes down to the dataset), each algorithm still behaves slightly differently, and we can pick whichever performs best for the actual situation at hand.
Summary
In this article we covered the common linear regression algorithms in machine learning, including:
- common linear regression models (Ridge, Lasso, ElasticNet, ...)
- how the models are used
- how learning algorithms estimate the models' coefficients
- how to evaluate linear regression models
If you are interested in the code, the complete source code for this article is here:
https://avoid.overfit.cn/post/80b712f97fce48418be96916262f9f81
Author: Fares Sayah