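💡 Author: 韩信子 @ShowMeAI
📘 Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
📘 Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
📘 This article: https://www.showmeai.tech/article-detail/300
📢 Notice: All rights reserved. To repost, please contact the platform and the author and credit the source.
📢 Bookmark ShowMeAI for more content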
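A used-car report from RESEARCH AND MARKETS projects that the global used-car market will grow at a compound annual growth rate of 6.1% from 2022 to 2030, reaching USD 2.67 trillion by 2030. The widespread adoption of AI has increased transparency between sellers and buyers and improved the buying experience, which has strongly driven the growth of the used-car market.
Machine-learning-based price estimation is already widely used on used-car trading platforms. In this article, ShowMeAI builds a complete used-car price prediction model and deploys it as a web application.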
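💡 Data Analysis & Feature Engineering
The dataset used in this case study is available from 🏆 the Kaggle car price prediction dataset, or can be downloaded directly from ShowMeAI's Baidu Netdisk.
🏆 Dataset download (Baidu Netdisk): reply『实战』to the WeChat official account『ShowMeAI 研究中心』, or click here to get the『CarPrice used-car price prediction dataset』for this article: [[11] Build an AI Model and Deploy a Web App to Predict Used-Car Prices](https://www.showmeai.tech/art…)
⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub
① Data Exploration
For the tools and skills used in data analysis, see the corresponding ShowMeAI tutorials and cheat sheets.
- Illustrated Data Analysis: a beginner-to-expert tutorial series
- Data science cheat sheets | Pandas cheat sheet
- Data science cheat sheets | Seaborn cheat sheet
We first load the data and take a first look.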
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
%matplotlib inline
df = pd.read_csv('CarPrice_Assignment.csv')
df.head()
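A preview of the DataFrame:
Let's analyze the fields to see which ones correlate most strongly with price. We start with the correlation matrix of the numeric columns:
df.corr(numeric_only=True)
Then we visualize the correlations as a heatmap.
sns.set(rc={"figure.figsize": (20, 20)})
sns.heatmap(df.corr(numeric_only=True), annot=True)
The figure below shows how each field correlates with price; some fields are very strongly correlated with the target.
For each numeric field we can also plot it against the price target for a closer look: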
for col in df.columns:
    if df[col].dtypes != 'object':
        sns.lmplot(data=df, x=col, y='price')
The resulting plots are shown below:
We drop the fields that are only weakly correlated with price (|r| < 0.15), along with the car_ID identifier:
df.drop(['car_ID'], axis=1, inplace=True)
to_drop = ['peakrpm', 'compressionratio', 'stroke', 'symboling']
df.drop(columns=to_drop, inplace=True)
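The list of weakly correlated columns above was written by hand. As a rough illustration of how the same cut-off could be applied programmatically, here is a minimal sketch on made-up toy data; the toy DataFrame and the THRESHOLD constant are assumptions for this example, not part of the original workflow.

```python
import pandas as pd

# Toy data standing in for the car table (values are made up for illustration)
toy = pd.DataFrame({
    "enginesize": [100, 125, 135, 165, 175, 200],
    "noise":      [3, 1, 2, 2, 1, 3],
    "price":      [8000, 10000, 12000, 14000, 16000, 18000],
})

THRESHOLD = 0.15  # same cut-off as in the text

# Absolute Pearson correlation of every numeric feature with the target
corr_with_price = toy.corr(numeric_only=True)["price"].drop("price").abs()

# Columns whose |r| falls below the cut-off are candidates for removal
weak_cols = corr_with_price[corr_with_price < THRESHOLD].index.tolist()
reduced = toy.drop(columns=weak_cols)
```

Taking the absolute value matters here: a strongly negative correlation is just as informative as a strongly positive one and should not be dropped.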
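② Feature Engineering
For the methods used in feature engineering, see the corresponding ShowMeAI tutorial.
- Machine Learning in Practice | A Complete Guide to Feature Engineering
The CarName column contains both brand and model; we split it and keep only the brand:
df['CarName'] = df['CarName'].apply(lambda x: x.split()[0])
Output:
We find that some brands appear under aliases or with misspellings, so we clean them up: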
df['CarName'] = df['CarName'].str.lower()
df['CarName'] = df['CarName'].replace({'vw': 'volkswagen', 'vokswagen': 'volkswagen', 'toyouta': 'toyota', 'maxda': 'mazda', 'porcshce': 'porsche'})
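The replacement table above was put together by hand. As a rough aid for spotting such misspellings, Python's standard difflib module can suggest the closest known brand for each raw value. This is a minimal sketch rather than part of the original workflow; the known_brands list and the suggest_brand helper are hypothetical names introduced for illustration.

```python
import difflib

# Hypothetical list of canonical brand names (an assumption for this sketch)
known_brands = ["toyota", "mazda", "volkswagen", "porsche"]

def suggest_brand(raw_name, cutoff=0.6):
    """Return the closest canonical brand, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(raw_name.lower(), known_brands, n=1, cutoff=cutoff)
    return matches[0] if matches else None

raw_values = ["toyouta", "maxda", "vokswagen", "porcshce", "vw"]
suggestions = {name: suggest_brand(name) for name in raw_values}
```

Typos such as toyouta are close enough to be matched, whereas an abbreviation like vw shares too few characters with volkswagen and still needs an explicit entry in the replacement dictionary.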
Next we plot the number of cars per brand:
sns.set(rc={'figure.figsize': (30, 10)})
sns.countplot(data=df, x='CarName')
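③ Feature Encoding & Data Transformation
Next comes further feature engineering:
- Categorical features
Most machine learning models cannot handle categorical data directly, so we encode it ourselves. Categorical features can be encoded with ordinal encoding or one-hot encoding (see the ShowMeAI article Machine Learning in Practice | A Complete Guide to Feature Engineering). One-hot encoding is illustrated below:
- Numeric features
Different models call for different treatment, such as feature scaling and distribution adjustment.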
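To make the difference between the two encodings concrete, here is a small sketch on a made-up column; the toy values are an assumption for illustration only.

```python
import pandas as pd

# Toy column standing in for a categorical field such as fueltype
toy = pd.DataFrame({"fueltype": ["gas", "diesel", "gas"]})

# Ordinal encoding: each category becomes one integer code
ordinal = toy["fueltype"].astype("category").cat.codes.tolist()

# One-hot encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(toy, columns=["fueltype"], dtype=int)
```

Ordinal codes impose an order on the categories (diesel < gas here), which linear models will interpret literally; one-hot columns avoid that at the cost of more features.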
We first split the dataset's fields into two groups, categorical and numeric:
categorical = []
numerical = []
for col in df.columns:
    if df[col].dtypes == 'object':
        categorical.append(col)
    else:
        numerical.append(col)
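The same split can also be written with pandas' select_dtypes. The sketch below uses a made-up four-column table and is only an alternative formulation, not the article's own code.

```python
import pandas as pd

# Toy table with two text columns and two numeric columns (made-up values)
toy = pd.DataFrame({
    "CarName": ["audi", "bmw"],
    "fueltype": ["gas", "diesel"],
    "horsepower": [102, 121],
    "price": [13950.0, 16430.0],
})

# Numeric columns are selected by dtype; everything else is treated as categorical
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()
```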
Next we use pandas' dummy-variable transformation to one-hot encode every feature marked "categorical".
# One-hot encoding
x1 = pd.get_dummies(df[categorical], drop_first=False)
x2 = df[numerical]
X = pd.concat([x2, x1], axis=1)
X.drop('price', axis=1, inplace=True)
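Now for the numeric features. We start with the label field, price, and plot its distribution:
sns.histplot(data=df, x="price", kde=True)
The plot shows a skewed distribution. We apply a log transform to bring it closer to a normal distribution. (Another consideration: if we train on the log-transformed label, mapping predictions back to price uses the exponential, which guarantees that the predicted price is always positive.) The code is as follows:
# Fix the skewed distribution
df["price_log"] = np.log(df["price"])
sns.histplot(data=df, x="price_log", kde=True)
After the correction the distribution is closer to normal. With this basic preprocessing done, we are ready to start modeling.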
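The positivity argument can be checked numerically. The following sketch uses made-up prices and made-up log-scale predictions to show that the exponential maps any real number back to a positive price, and that log followed by exp recovers the original values.

```python
import numpy as np

# Made-up prices with a long right tail
prices = np.array([5000.0, 7000.0, 9000.0, 12000.0, 45000.0])
log_prices = np.log(prices)

# Any real-valued prediction on the log scale maps back to a positive price
log_predictions = np.array([-3.0, 0.0, 9.2])
recovered_predictions = np.exp(log_predictions)

# The round trip log -> exp returns the original prices
round_trip = np.exp(log_prices)
```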
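💡 Machine Learning Modeling
① Train/Test Split & Data Transformation
Let's split the dataset into training and test sets and apply basic data transformations: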
# Split the data
from sklearn.model_selection import train_test_split
y = df['price_log']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333, random_state=1)
X_train, X_test = X_train.copy(), X_test.copy()
# Feature engineering - scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Scale only the numeric features; the one-hot columns stay as 0/1
num_cols = [col for col in numerical if col != 'price']
X_train[num_cols] = sc.fit_transform(X_train[num_cols])
X_test[num_cols] = sc.transform(X_test[num_cols])
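The key point in the scaling step is that the scaler learns its mean and standard deviation from the training set only and then reuses them on the test set. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two numeric features (e.g. horsepower, curb weight)
X_train = np.array([[100.0, 2000.0], [150.0, 2500.0], [200.0, 3000.0]])
X_test = np.array([[150.0, 3000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training set
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics
```

Calling fit_transform on the test set instead would leak test-set statistics into the evaluation and make the two sets inconsistent with each other.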
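② Modeling & Tuning
For the modeling methods used here, see the corresponding ShowMeAI tutorial.
- Machine Learning in Practice | A Complete Guide to SKLearn
Our dataset is small (few samples), so, weighing model complexity against performance, we first test 4 models to see which performs best.
- Lasso regression
- Ridge regression
- Random forest regressor
- XGBoost regressor
We first import the models from scikit-learn and XGBoost: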
# Regression models
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
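③ Modeling pipeline
To keep the whole modeling process compact and concise, we create a pipeline to train and tune the models. The steps are:
- Train and evaluate each model with arbitrarily chosen hyperparameters.
- Tune each model's hyperparameters with grid search.
- Retrain and evaluate each model with the best parameters found.
We first import grid search from scikit-learn: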
from sklearn.model_selection import GridSearchCV
Next we build an evaluation function that prints the metrics of each fitted model (R², root mean squared error, and mean absolute error):
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
def metrics(model):
    model.fit(X_train, y_train)
    Y_pred = model.predict(X_test)
    # R² on the log-price scale
    r2 = round(r2_score(y_test, Y_pred), 4)
    print('R2_Score:', r2)
    # RMSE on the original price scale
    rmse = round(np.sqrt(mean_squared_error(np.exp(y_test), np.exp(Y_pred))), 2)
    print("RMSE:", rmse)
    # MAE on the original price scale
    mae = round(mean_absolute_error(np.exp(y_test), np.exp(Y_pred)), 2)
    print("MAE:", mae)
    return {'r2': r2, 'rmse': rmse, 'mae': mae}
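For reference, the three metrics can be computed directly from their definitions. The sketch below uses four made-up values so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_pred

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(residuals))

# Root mean squared error: penalizes large errors more strongly
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: one minus the ratio of residual to total sum of squares
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```

Here the residuals are 1, 0, -1, -1, so MAE is 0.75, RMSE is the square root of 0.75, and R² is 1 - 3/20 = 0.85.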
Now let's build the pipeline:
# Candidate models with hand-picked starting hyperparameters
models = {
    'rfr': RandomForestRegressor(bootstrap=False, max_depth=15, max_features='sqrt',
                                 min_samples_split=2, n_estimators=100),
    'lasso': Lasso(alpha=0.005, fit_intercept=True),
    'ridge': Ridge(alpha=10, fit_intercept=True),
    # XGBRegressor has no bootstrap / max_features / min_samples_split parameters
    'xgb': xgb.XGBRegressor(max_depth=2, n_estimators=100),
}
# Search grids: tree-based and linear models are tuned over different hyperparameters
tree_sizes = {"n_estimators": [10, 100, 1000, 2000, 4000, 6000],
              "max_depth": [2, 4, 8, 12, 15]}
linear_grid = {"alpha": [0.005, 0.05, 0.1, 1, 10, 100, 290, 500],
               "fit_intercept": [True, False]}
param_grids = {
    'rfr': {**tree_sizes,
            "max_features": [1.0, "sqrt", "log2"],  # 1.0 (all features) replaces the removed "auto"
            "min_samples_split": [2, 4, 8],
            "bootstrap": [True, False]},
    'xgb': tree_sizes,
    'lasso': linear_grid,
    'ridge': linear_grid,
}
# Evaluate each model as-is, then run a grid search over its parameters
for name, model in models.items():
    print('Untuned metrics for:', name)
    metrics(model)
    print('\n')
    print('Starting grid search for:', name)
    grid = GridSearchCV(model, param_grids[name], verbose=5, cv=2)
    grid.fit(X_train, y_train)
    print("Best score:", grid.best_score_)
    print("Best params:", grid.best_params_)
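Here are the results with the initial hyperparameters:
Before tuning, the R² scores differ only slightly across models, but Lasso has the smallest error.
Next, the grid search results, which give the best parameters for each model:
Now let's apply these parameters to each model and check the results: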
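One way to apply the best parameters without retyping them is to take the refitted estimator from the grid search itself. The following is a minimal sketch on synthetic data (the data and the alphas list are assumptions made only for this example), not the exact code used for the results in this article.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption made only for this sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=60)

alphas = [0.005, 0.05, 0.1, 1, 10]
grid = GridSearchCV(Lasso(), {"alpha": alphas}, cv=2)
grid.fit(X, y)

# With the default refit=True, GridSearchCV retrains on all the data it was given
# using the best parameters and exposes the result as best_estimator_
best_model = grid.best_estimator_
predictions = best_model.predict(X[:5])
```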
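After tuning, every model improves over its initial hyperparameters, but Lasso regression still performs best (which has to do with the sample size and feature correlations of this dataset). We keep the Lasso regression model and save it to disk.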
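lasso_reg = Lasso(alpha=0.005, fit_intercept=True)
# Fit before saving; an unfitted model cannot predict after it is loaded
lasso_reg.fit(X_train, y_train)
with open('model.pkl', 'wb') as f:
    pickle.dump(lasso_reg, f)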
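Because the model was trained on scaled features, the web app needs the same scaling at prediction time. One possible approach, sketched here on synthetic data, is to pickle the fitted scaler together with the fitted model. The bundle.pkl file name and the dictionary layout are hypothetical choices for this example and differ from the model.pkl file used in the article.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption made only for this sketch)
rng = np.random.default_rng(1)
X = rng.normal(loc=100.0, scale=20.0, size=(50, 2))
y = 0.01 * X[:, 0] + rng.normal(scale=0.01, size=50)

scaler = StandardScaler().fit(X)
model = Lasso(alpha=0.005).fit(scaler.transform(X), y)

# Store the fitted scaler and the fitted model together in one file
path = os.path.join(tempfile.mkdtemp(), "bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"scaler": scaler, "model": model}, f)

# Load both back and predict with exactly the same preprocessing
with open(path, "rb") as f:
    bundle = pickle.load(f)

restored = bundle["model"].predict(bundle["scaler"].transform(X[:3]))
original = model.predict(scaler.transform(X[:3]))
```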
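💡 Web Application Development
Next we deploy the model to a web page as an application that produces estimates in real time. We use the gradio library to build the web app. A prediction in the app involves the following steps:
- The user enters data in the web form
- The data is processed (feature encoding & transformation)
- The processed data is shaped to match the model's input format
- The price is predicted and shown to the user
① Basic Development
First, we load the original dataset and the processed (one-hot encoded) dataset, and keep the columns of each.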
# Columns of the original df
df = pd.read_csv('df_columns')
df.drop(['Unnamed: 0', 'price'], axis=1, inplace=True)
cols = df.columns
# Dummy-variable columns of the df
dummy = pd.read_csv('dummy_df')
dummy.drop('Unnamed: 0', axis=1, inplace=True)
cols_to_use = dummy.columns
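The article does not show how the df_columns and dummy_df files were written. The 'Unnamed: 0' column that gets dropped suggests they were saved with DataFrame.to_csv and its default index=True. Under that assumption, and with made-up data, this sketch shows where that column comes from:

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned car table
toy = pd.DataFrame({"CarName": ["audi", "bmw"], "price": [13950.0, 16430.0]})

path = os.path.join(tempfile.mkdtemp(), "df_columns")
toy.to_csv(path)            # the row index is written as an unnamed first column
loaded = pd.read_csv(path)  # pandas names that column "Unnamed: 0"

cleaned = loaded.drop(["Unnamed: 0", "price"], axis=1)
```

Saving with index=False would avoid the extra column, in which case the drop of 'Unnamed: 0' would no longer be needed.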
Next, for the categorical features, we build the dropdown options for the web app:
# Build the candidate values used in the app
# Capitalize the car brands
cars = df['CarName'].unique().tolist()
carNameCap = []
for col in cars:
    carNameCap.append(col.capitalize())
# fueltype field
fuel = df['fueltype'].unique().tolist()
fuelCap = []
for fu in fuel:
    fuelCap.append(fu.capitalize())
# carbody, enginetype, and fuelsystem fields
carb = df['carbody'].unique().tolist()
engtype = df['enginetype'].unique().tolist()
fuelsys = df['fuelsystem'].unique().tolist()
For the categorical fields the model needs, listed above, we will build dropdowns and add these candidate values as options.
Next we define a function that processes the data and returns the estimated price:
# Transform the input data to match the model
def transform(data):
    # Load the model
    with open('model.pkl', 'rb') as f:
        model = pickle.load(f)
    # DataFrame holding the new sample
    new_df = pd.DataFrame([data], columns=cols)
    # Separate categorical and numeric features
    cat = []
    num = []
    for col in new_df.columns:
        if new_df[col].dtypes == 'object':
            cat.append(col)
        else:
            num.append(col)
    x1_new = pd.get_dummies(new_df[cat], drop_first=False)
    x2_new = new_df[num]
    X_new = pd.concat([x2_new, x1_new], axis=1)
    # Align with the training columns; dummy columns absent from this row become 0
    final_df = X_new.reindex(columns=cols_to_use, fill_value=0).astype(float)
    # Scale the numeric features with statistics from the saved feature table
    # (assumes dummy_df holds unscaled features; fitting on the single new row
    # would turn every value into 0)
    sc = StandardScaler()
    sc.fit(dummy[num])
    final_df[num] = sc.transform(final_df[num])
    output = model.predict(final_df.values)
    return "The price of the car: " + str(round(np.exp(output)[0], 2)) + " $"
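The step that needs the most care in this function is making a single encoded row line up with the columns the model was trained on. Here is a minimal sketch of that alignment using DataFrame.reindex; the column names and values are hypothetical and much smaller than the real feature table.

```python
import pandas as pd

# Hypothetical column layout of the training feature table
training_columns = ["horsepower", "fueltype_diesel", "fueltype_gas",
                    "carbody_sedan", "carbody_wagon"]

# A single new sample, as it might arrive from the web form
new_row = pd.DataFrame([{"horsepower": 110, "fueltype": "gas", "carbody": "wagon"}])

# One-hot encoding a single row only produces columns for the categories it contains
encoded = pd.get_dummies(new_row, columns=["fueltype", "carbody"], dtype=int)

# reindex restores the full training layout and fills the missing dummies with 0
aligned = encoded.reindex(columns=training_columns, fill_value=0)
```

The aligned row has the same columns in the same order as the training data, which is what the model's predict method expects.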
Next we create the elements of the gradio web app: dropdown menus or radio buttons for the categorical fields, and sliders for the numeric fields. Sample code:
# Categorical
car = gr.Dropdown(label="Car brand", choices=carNameCap)
# Numeric
curbweight = gr.Slider(label="Weight of the car (in pounds)", minimum=500, maximum=6000)
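Now let's add everything to the interface:
With everything in place, we are ready to deploy!
② Deployment
Now we deploy the application. First we configure the IP address and port the app listens on: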
export GRADIO_SERVER_NAME=0.0.0.0
export GRADIO_SERVER_PORT="$PORT"
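Gradio reads these two environment variables on its own when the app is launched. If you prefer to make the settings explicit in Python, one option is a small helper like the one sketched below, whose result could be passed to app.launch(server_name=host, server_port=port). The resolve_server_settings helper is a hypothetical name introduced for this example, and the fallback values are meant to mirror Gradio's usual defaults.

```python
import os

def resolve_server_settings(env=None):
    """Return (host, port) from the Gradio environment variables, with fallbacks."""
    env = os.environ if env is None else env
    host = env.get("GRADIO_SERVER_NAME", "127.0.0.1")
    port = int(env.get("GRADIO_SERVER_PORT", "7860"))
    return host, port

# Example with explicit values instead of the real process environment
host, port = resolve_server_settings(
    {"GRADIO_SERVER_NAME": "0.0.0.0", "GRADIO_SERVER_PORT": "8080"}
)
```

Binding to 0.0.0.0 makes the app reachable from outside the machine, which hosting platforms that inject a PORT variable typically require.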
Make sure the following dependencies are installed with pip:
numpy
pandas
scikit-learn
gradio
Flask
argparse
gunicorn
rq
Then run python WebApp.py to test the application. The contents of WebApp.py are as follows:
import gradio as gr
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
# Lookup dictionaries: form labels -> values used in the dataset
asp = {
    'Standard': 'std',
    'Turbo': 'turbo'
}
drivew = {
    'Rear wheel drive': 'rwd',
    'Front wheel drive': 'fwd',
    '4 wheel drive': '4wd'
}
cylnum = {
    2: 'two',
    3: 'three',
    4: 'four',
    5: 'five',
    6: 'six',
    8: 'eight',
    12: 'twelve'
}
# Column names of the original df
df = pd.read_csv('df_columns')
df.drop(['Unnamed: 0', 'price'], axis=1, inplace=True)
cols = df.columns
# Column names after one-hot encoding
dummy = pd.read_csv('dummy_df')
dummy.drop('Unnamed: 0', axis=1, inplace=True)
cols_to_use = dummy.columns
# Car brand names
cars = df['CarName'].unique().tolist()
carNameCap = []
for col in cars:
    carNameCap.append(col.capitalize())
# Fuel types
fuel = df['fueltype'].unique().tolist()
fuelCap = []
for fu in fuel:
    fuelCap.append(fu.capitalize())
# Car body, engine type, fuel system
carb = df['carbody'].unique().tolist()
engtype = df['enginetype'].unique().tolist()
fuelsys = df['fuelsystem'].unique().tolist()
# Reshape the input data to fit the model
def transform(data):
    # Load the model
    with open('model.pkl', 'rb') as f:
        lasso_reg = pickle.load(f)
    # DataFrame holding the new sample
    new_df = pd.DataFrame([data], columns=cols)
    # Separate categorical and numeric fields
    cat = []
    num = []
    for col in new_df.columns:
        if new_df[col].dtypes == 'object':
            cat.append(col)
        else:
            num.append(col)
    # Build the data in the format the model expects
    x1_new = pd.get_dummies(new_df[cat], drop_first=False)
    x2_new = new_df[num]
    X_new = pd.concat([x2_new, x1_new], axis=1)
    # Align with the training columns; dummy columns absent from this row become 0
    final_df = X_new.reindex(columns=cols_to_use, fill_value=0).astype(float)
    # Scale the numeric features with statistics from the saved feature table
    # (assumes dummy_df holds unscaled features)
    sc = StandardScaler()
    sc.fit(dummy[num])
    final_df[num] = sc.transform(final_df[num])
    output = lasso_reg.predict(final_df.values)
    return "The price of the car: " + str(round(np.exp(output)[0], 2)) + " $"
# Main function for estimating the price
def predict_price(car, fueltype, aspiration, doornumber, carbody, drivewheel, enginelocation, wheelbase, carlength, carwidth,
                  carheight, curbweight, enginetype, cylindernumber, enginesize, fuelsystem, boreratio, horsepower, citympg, highwaympg):
    new_data = [car.lower(), fueltype.lower(), asp[aspiration], doornumber.lower(), carbody, drivew[drivewheel], enginelocation.lower(),
                wheelbase, carlength, carwidth, carheight, curbweight, enginetype, cylnum[cylindernumber], enginesize, fuelsystem,
                boreratio, horsepower, citympg, highwaympg]
    return transform(new_data)
car = gr.Dropdown(label="Car brand", choices=carNameCap)
fueltype = gr.Radio(label="Fuel Type", choices=fuelCap)
aspiration = gr.Radio(label="Aspiration type", choices=["Standard", "Turbo"])
doornumber = gr.Radio(label="Number of doors", choices=["Two", "Four"])
carbody = gr.Dropdown(label="Car body type", choices=carb)
drivewheel = gr.Radio(label="Drive wheel", choices=['Rear wheel drive', 'Front wheel drive', '4 wheel drive'])
enginelocation = gr.Radio(label="Engine location", choices=['Front', 'Rear'])
wheelbase = gr.Slider(label="Distance between the wheels on the side of the car (in inches)", minimum=50, maximum=300)
carlength = gr.Slider(label="Length of the car (in inches)", minimum=50, maximum=300)
carwidth = gr.Slider(label="Width of the car (in inches)", minimum=50, maximum=300)
carheight = gr.Slider(label="Height of the car (in inches)", minimum=50, maximum=300)
curbweight = gr.Slider(label="Weight of the car (in pounds)", minimum=500, maximum=6000)
enginetype = gr.Dropdown(label="Engine type", choices=engtype)
cylindernumber = gr.Radio(label="Cylinder number", choices=[2, 3, 4, 5, 6, 8, 12])
enginesize = gr.Slider(label="Engine size (swept volume of all the pistons inside the cylinders)", minimum=50, maximum=500)
fuelsystem = gr.Dropdown(label="Fuel system", choices=fuelsys)
boreratio = gr.Slider(label="Bore ratio (ratio between cylinder bore diameter and piston stroke)", minimum=1, maximum=6)
horsepower = gr.Slider(label="Horse power of the car", minimum=25, maximum=400)
citympg = gr.Slider(label="City mileage (in mpg)", minimum=0, maximum=100)
highwaympg = gr.Slider(label="Highway mileage (in mpg)", minimum=0, maximum=100)
Output = gr.Textbox()
app = gr.Interface(title="Predict the price of a car based on its specs",
                   fn=predict_price,
                   inputs=[car,
                           fueltype,
                           aspiration,
                           doornumber,
                           carbody,
                           drivewheel,
                           enginelocation,
                           wheelbase,
                           carlength,
                           carwidth,
                           carheight,
                           curbweight,
                           enginetype,
                           cylindernumber,
                           enginesize,
                           fuelsystem,
                           boreratio,
                           horsepower,
                           citympg,
                           highwaympg
                           ],
                   outputs=Output)
app.launch()
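The final application looks like this; you can select and enter the feature values yourself and get a price estimate from the model!
References
- 🏆 Dataset download (Baidu Netdisk): reply『实战』to the WeChat official account『ShowMeAI 研究中心』, or click here to get the『CarPrice used-car price prediction dataset』for this article: [[11] Build an AI Model and Deploy a Web App to Predict Used-Car Prices](https://www.showmeai.tech/art…)
- ⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub
- 📘 Illustrated Data Analysis: a beginner-to-expert tutorial series https://www.showmeai.tech/tutorials/33
- 📘 Data science cheat sheets | Pandas cheat sheet https://www.showmeai.tech/article-detail/101
- 📘 Data science cheat sheets | Seaborn cheat sheet https://www.showmeai.tech/article-detail/105
- 📘 Machine Learning in Practice | A Complete Guide to Feature Engineering https://www.showmeai.tech/article-detail/208
- 📘 Machine Learning in Practice | A Complete Guide to SKLearn https://www.showmeai.tech/article-detail/203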