Predicting Missing Values with Python

Author | Sadrach Pierre, Ph.D.
Translator | VK
Source | Towards Data Science

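For data scientists, handling missing data is an important part of data cleaning and model development. Real-world data often contains fields that are sparsely populated or that hold erroneous values. In this post, we will discuss how to build models that can be used to fill in missing or erroneous values in data.

For our purposes, we will use the wine dataset, which can be found here: https://www.kaggle.com/zynici… To start, let's read the data into a pandas DataFrame: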

import pandas as pd
df = pd.read_csv("winemag-data-130k-v2.csv")

Next, let's print the first five rows of data:

print(df.head())

Let's take a random sample of 500 records from this data. This helps speed up model training and testing, and readers can easily change the sample size:

import pandas as pd
df = pd.read_csv("winemag-data-130k-v2.csv").sample(n=500, random_state = 42)

Now, let's print the information about the data, which tells us which columns have missing values:

print(df.info())
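As a complement to df.info(), the missing entries can also be counted directly per column. The following is a minimal sketch on a small made-up DataFrame (the values are hypothetical and only stand in for the wine sample):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the wine sample
toy = pd.DataFrame({
    "points": [87, 90, 85, 92],
    "price": [15.0, np.nan, 20.0, np.nan],
    "country": ["US", "France", None, "Italy"],
})

# isnull() marks missing cells; sum() counts them column by column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Running the same call on the wine sample (df.isnull().sum()) should list the columns with gaps alongside their counts.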

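Several columns have fewer than 500 non-null values, which corresponds to missing values. First, let's consider building a model that uses "points" to impute the missing "price" values. To begin, let's print the correlation between "price" and "points":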

print("Correlation:", df['points'].corr(df['price']))
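Note that the correlation can be computed even though "price" has gaps, because pandas excludes pairs with a missing value. A small sketch with made-up numbers (not taken from the wine data) illustrates this:

```python
import numpy as np
import pandas as pd

# Hypothetical scores and prices; one price is missing
points = pd.Series([80, 85, 90, 95, 100])
price = pd.Series([10.0, np.nan, 28.0, 45.0, 50.0])

# Series.corr uses Pearson correlation by default and skips NaN pairs
r_all = points.corr(price)

# Same computation after dropping the incomplete pair by hand
complete = price.notnull()
r_complete = points[complete].corr(price[complete])

print(r_all, r_complete)
```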

We see a positive correlation. Let's build a linear regression model that uses "points" to predict "price". First, let's import the "LinearRegression" class from scikit-learn:

from sklearn.linear_model import LinearRegression

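Now, let's split the data for training and testing. We want to be able to predict missing values, but we should use real "price" values to validate our predictions. Let's filter out the missing values by selecting only positive prices: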

import numpy as np 
df_filter = df[df['price'] > 0].copy()
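The filter works because any comparison with NaN evaluates to False, so rows with a missing price are dropped along with non-positive prices. A short sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical prices: one missing, one zero
toy = pd.DataFrame({"price": [12.0, np.nan, 0.0, 35.0]})

# NaN > 0 is False, so the missing row is excluded by the mask
kept = toy[toy["price"] > 0]
print(kept)
```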

We can also initialize lists for storing the predicted and actual values:

y_pred = []
y_true = []

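We will use K-fold cross-validation to validate our model. Let's import the "KFold" class from scikit-learn. We will use 10 folds to validate our model:

from sklearn.model_selection import KFold
# shuffle=True is needed for random_state to have any effect;
# recent scikit-learn versions raise an error otherwise
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in kf.split(df_filter):
    df_test = df_filter.iloc[test_index]
    df_train = df_filter.iloc[train_index]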
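To make the splitting concrete, here is a minimal sketch of what KFold yields on ten dummy rows (the array and the variable names are made up for illustration). Each row lands in exactly one test fold, and the train and test indices of a fold never overlap:

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows standing in for df_filter
X = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
seen = []
for train_idx, test_idx in kf_demo.split(X):
    # Positional indices, which is why the article uses .iloc
    overlap = set(train_idx) & set(test_idx)
    seen.extend(test_idx.tolist())
    print("train:", train_idx, "test:", test_idx, "overlap:", len(overlap))
```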

We can now define our inputs and outputs:

for train_index, test_index in kf.split(df_filter):
    ...
    X_train = np.array(df_train['points']).reshape(-1, 1)     
    y_train = np.array(df_train['price']).reshape(-1, 1)
    X_test = np.array(df_test['points']).reshape(-1, 1)  
    y_test = np.array(df_test['price']).reshape(-1, 1)
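The reshape(-1, 1) calls are there because scikit-learn estimators expect the input as a 2D array of shape (n_samples, n_features). A tiny sketch with hypothetical scores:

```python
import numpy as np

# A single feature starts out as a flat, 1D array
points_1d = np.array([87, 90, 85])

# -1 lets NumPy infer the number of rows; 1 is the number of features
points_2d = points_1d.reshape(-1, 1)

print(points_1d.shape, points_2d.shape)
```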

And fit our linear regression model:

for train_index, test_index in kf.split(df_filter):
    ...
    model = LinearRegression()
    model.fit(X_train, y_train)

Now let's generate and store our predictions:

for train_index, test_index in kf.split(df_filter):
    ...
    # Keep every prediction in the fold, not only the first one
    y_pred.extend(model.predict(X_test).ravel())
    y_true.extend(y_test.ravel())

Now let's evaluate the model's performance using mean squared error:

from sklearn.metrics import mean_squared_error
print("Mean Square Error:", mean_squared_error(y_true, y_pred))
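For reference, mean squared error is the average of the squared differences between actual and predicted values. The sketch below, on three made-up price pairs, computes it by hand and compares the result with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices
y_true_demo = [10.0, 20.0, 30.0]
y_pred_demo = [12.0, 18.0, 33.0]

# Average of squared errors: (4 + 4 + 9) / 3
manual = np.mean((np.array(y_true_demo) - np.array(y_pred_demo)) ** 2)
library = mean_squared_error(y_true_demo, y_pred_demo)

# The square root puts the error back in price units
rmse = np.sqrt(manual)
print(manual, library, rmse)
```

Because the error is squared, a few very expensive wines can dominate the metric, which is worth keeping in mind when reading the numbers.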

Not great. We can improve this by training only on prices up to the mean price plus one standard deviation:

df_filter = df[df['price'] <= df['price'].mean() + df['price'].std()].copy()
...
print("Mean Square Error:", mean_squared_error(y_true, y_pred))
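The effect of this cutoff is easiest to see on a few numbers. In the following sketch (hypothetical prices, with one extreme value), only the outlier exceeds the mean plus one standard deviation:

```python
import pandas as pd

# Hypothetical prices with a single very expensive bottle
prices = pd.Series([10.0, 12.0, 15.0, 20.0, 500.0])

# Same rule as in the article: mean plus one (sample) standard deviation
threshold = prices.mean() + prices.std()
kept = prices[prices <= threshold]

print("threshold:", round(threshold, 2))
print(kept.tolist())
```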

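Although this greatly improves performance, it comes at the cost of not being able to accurately impute prices for wines above that threshold. Instead of predicting price with a single-feature regression model, we can use a tree-based model, such as a random forest, which can handle both categorical and numerical variables.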

Let's build a random forest regression model that uses "country", "province", "variety", "winery", and "points" to predict the wine "price". First, let's increase the random sample size to 5000:

df = pd.read_csv("winemag-data-130k-v2.csv").sample(n=5000, random_state=42)

Next, let's convert the categorical variables into categorical codes that the random forest model can handle. This is done after reloading the data so that the new columns exist on the larger sample:

df['country_cat'] = df['country'].astype('category').cat.codes
df['province_cat'] = df['province'].astype('category').cat.codes
df['winery_cat'] = df['winery'].astype('category').cat.codes
df['variety_cat'] = df['variety'].astype('category').cat.codes

Finally, let's reapply the same price filter so that the training data includes the new columns:

df_filter = df[df['price'] <= df['price'].mean() + df['price'].std()].copy()
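It may help to see what the category codes look like on a small example. In this sketch (made-up country values), each distinct label gets an integer and a missing label becomes -1:

```python
import pandas as pd

# Hypothetical country column with one missing entry
country = pd.Series(["US", "France", None, "US"])

# Categories are sorted (France=0, US=1); missing values are coded as -1
codes = country.astype("category").cat.codes
print(codes.tolist())
```

The integers carry an arbitrary ordering, which tree-based models tolerate reasonably well; one-hot encoding is a common alternative when that ordering is a concern.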

Next, let's import the random forest regressor from scikit-learn. We can also define the list of features used to train the model:

from sklearn.ensemble import RandomForestRegressor
features = ['points', 'country_cat', 'province_cat', 'winery_cat', 'variety_cat']

Let's train a random forest with 1000 estimators and a maximum depth of 1000. Then, let's generate predictions and append them to new lists:

# Lists for the random forest predictions and the matching actual prices
y_pred_rf = []
y_true_rf = []
for train_index, test_index in kf.split(df_filter):
    df_test = df_filter.iloc[test_index]
    df_train = df_filter.iloc[train_index]
    
    X_train = np.array(df_train[features])
    y_train = np.array(df_train['price'])
    X_test = np.array(df_test[features])
    y_test = np.array(df_test['price'])
    model = RandomForestRegressor(n_estimators = 1000, max_depth = 1000, random_state = 42)
    model.fit(X_train, y_train)
    
    # Keep every prediction in the fold, not only the first one
    y_pred_rf.extend(model.predict(X_test))
    y_true_rf.extend(y_test)
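As an alternative to writing the loop by hand, scikit-learn's cross_val_predict returns one out-of-fold prediction per row. This is a different route from the one used in this post, shown here as a sketch on synthetic data (the arrays, the noise level, and the smaller forest of 50 trees are assumptions chosen to keep the example fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-ins for "points" and "price"
rng = np.random.default_rng(0)
X_demo = rng.uniform(80, 100, size=(200, 1))
y_demo = 2.0 * X_demo[:, 0] - 150 + rng.normal(0, 5, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=42)
model_demo = RandomForestRegressor(n_estimators=50, random_state=42)

# Each row is predicted by a model that never saw it during training
oof_pred = cross_val_predict(model_demo, X_demo, y_demo, cv=cv)
mse_demo = mean_squared_error(y_demo, oof_pred)
print("Out-of-fold MSE:", mse_demo)
```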

Finally, let's evaluate the mean squared error of the random forest and linear regression models:

print("Mean Square Error (Linear Regression):", mean_squared_error(y_true, y_pred))
print("Mean Square Error (Random Forest):", mean_squared_error(y_true_rf, y_pred_rf))

We see that the random forest model performs better. Now, let's use our models to predict the missing price values and display the price predictions:

df_missing = df[df['price'].isnull()].copy()

X_test_lr = np.array(df_missing['points']).reshape(-1, 1)
X_test_rf = np.array(df_missing[features])

X_train_lr = np.array(df_filter['points']).reshape(-1, 1)    
y_train_lr = np.array(df_filter['price']).reshape(-1, 1)

X_train_rf = np.array(df_filter[features])
y_train_rf = np.array(df_filter['price'])

model_lr = LinearRegression()
model_lr.fit(X_train_lr, y_train_lr)
print("Linear regression predictions:", model_lr.predict(X_test_lr)[0][0])

model_rf = RandomForestRegressor(n_estimators = 1000, max_depth = 1000, random_state = 42)
model_rf.fit(X_train_rf, y_train_rf)
print("Random forests regression predictions:", model_rf.predict(X_test_rf)[0])
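The code above only prints a single prediction. To actually fill the gaps, the predictions can be written back into the DataFrame. The following is one possible sketch on a toy table; the column names "price_imputed" and "price_was_missing" are hypothetical and not part of the original code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with two missing prices
toy = pd.DataFrame({
    "points": [85, 88, 90, 92, 95, 87],
    "price": [15.0, 20.0, np.nan, 30.0, 40.0, np.nan],
})
missing = toy["price"].isnull()

# Fit on rows with a known price, predict for the rest
model_demo = LinearRegression()
model_demo.fit(toy.loc[~missing, ["points"]], toy.loc[~missing, "price"])

# Keep the original column untouched and flag which rows were imputed
toy["price_imputed"] = toy["price"]
toy.loc[missing, "price_imputed"] = model_demo.predict(toy.loc[missing, ["points"]])
toy["price_was_missing"] = missing
print(toy)
```

Keeping a flag column makes it possible to tell observed prices from imputed ones later on.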

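I'll stop here, but I encourage you to experiment with feature selection and hyperparameter tuning to see whether you can improve performance. I also encourage you to extend this imputation approach to fill in missing values in categorical fields such as "region_1" and "designation". There, you can build a tree-based classification model that uses categorical and numerical features to predict the missing values of the listed categories.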
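As a starting point for that exercise, here is a rough sketch of categorical imputation with a random forest classifier. The data is made up, and the "region" column and feature list are placeholders rather than the actual wine fields:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: the last two rows have no region
toy = pd.DataFrame({
    "points": [85, 86, 87, 95, 96, 97, 86, 96],
    "price": [12.0, 14.0, 13.0, 60.0, 65.0, 70.0, 13.0, 66.0],
    "region": ["A", "A", "A", "B", "B", "B", None, None],
})
features_demo = ["points", "price"]
known = toy["region"].notnull()

# Train on rows where the category is known
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(toy.loc[known, features_demo], toy.loc[known, "region"])

# Fill the missing categories with the predicted labels
toy.loc[~known, "region"] = clf.predict(toy.loc[~known, features_demo])
print(toy)
```

On real data the predictions should be validated the same way as the prices above, for example with K-fold cross-validation and a classification metric such as accuracy.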

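To summarize, in this post we discussed how to build machine learning models that can be used to fill in missing values in data. First, we built a linear regression model to predict wine prices. Then, we built a random forest model that uses "points" and other categorical variables to predict wine prices. We found that the random forest model significantly outperformed the imputation model based on linear regression. The code from this post is available on GitHub. Thanks for reading!

GitHub link: https://github.com/spierre91/…

Original article: https://towardsdatascience.co…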

