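Machine Learning: Used Car Price Prediction, Building an AI Model and Deploying a Web App ⛵

💡 Author: 韩信子 @ShowMeAI
📘 Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
📘 Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
📘 This article: https://www.showmeai.tech/article-detail/300
📢 Notice: All rights reserved. To repost, please contact the platform and the author and credit the source.
📢 Bookmark ShowMeAI for more great content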

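A used car report from『RESEARCH AND MARKETS』projects that the global used car market will grow at a compound annual growth rate of 6.1% from 2022 to 2030, reaching USD 2.67 trillion by 2030. The widespread adoption of AI has increased transparency between owners and buyers and improved the buying experience, which has strongly driven the growth of the used car market.

Estimating used car transaction prices with machine learning is already common practice on used car trading platforms. In this article, ShowMeAI builds a complete model for used car price estimation and deploys it as a web application.

The dataset used in this case study is available from 🏆 Kaggle car price prediction, and can also be downloaded directly from ShowMeAI's Baidu Netdisk.

🏆 Dataset download (Baidu Netdisk): reply『实战』to the WeChat official account『ShowMeAI 研究中心』, or click here to get [[11] Building an AI Model and Deploying a Web App to Predict Used Car Prices](https://www.showmeai.tech/art…)『CarPrice used car price prediction dataset』

⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub

For the tools and skills involved in data analysis and processing, see ShowMeAI's tutorials and cheat sheets for quick reference.

Illustrated Data Analysis: From Beginner to Master tutorial series

Data Science Toolkit Cheat Sheets | Pandas Cheat Sheet

Data Science Toolkit Cheat Sheets | Seaborn Cheat Sheet

We start by loading the data and taking a first look.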

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
# Jupyter magic for inline plots (written with a space, not a dot)
%matplotlib inline

df = pd.read_csv('CarPrice_Assignment.csv')
df.head()

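A preview of the DataFrame:

Let's analyze the fields to see which ones correlate most with price. First we compute the correlation matrix:

# numeric_only=True skips text columns (pandas >= 2.0 raises an error on them otherwise)
df.corr(numeric_only=True)

Then we visualize the correlations as a heatmap.

sns.set(rc={"figure.figsize": (20, 20)})
sns.heatmap(df.corr(numeric_only=True), annot=True)

The figure below shows the correlation of each field with price. Some fields correlate very strongly with the target.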

We can plot each numeric field against the target field price for a closer look:

for col in df.columns:
    if df[col].dtypes != 'object':
        sns.lmplot(data=df, x=col, y='price')

The resulting plots:

We drop the fields whose correlation with price is low (|r| < 0.15):

df.drop(['car_ID'], axis=1, inplace=True)
to_drop = ['peakrpm', 'compressionratio', 'stroke', 'symboling']
# Pass the list of column names directly rather than a sub-DataFrame
df.drop(to_drop, axis=1, inplace=True)
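The list of columns to drop can also be derived from the correlation matrix instead of being typed by hand. The following is a minimal sketch under assumptions: the helper name `low_correlation_columns` and the toy frame are hypothetical and are not part of the original article.

```python
import pandas as pd


def low_correlation_columns(frame, target, threshold=0.15):
    """Return numeric columns whose |Pearson r| with `target` is below `threshold`."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() < threshold].index.tolist()


# Tiny made-up frame, only to show the call; this is not the CarPrice data.
toy = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "horsepower": [10.0, 19.0, 31.0, 42.0, 48.0],  # rises with price
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],          # uncorrelated with price
    "brand": ["a", "b", "c", "d", "e"],            # non-numeric, ignored
})
weak = low_correlation_columns(toy, "price")
print(weak)
```

Using the absolute value matters here: a strongly negative correlation is as informative as a strongly positive one, so only values near zero are candidates for removal.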

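For the methods and skills involved in feature engineering, see the corresponding ShowMeAI tutorial.

Machine Learning in Practice | A Complete Guide to Feature Engineering

The car name column contains both brand and model. We split it and keep only the brand: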

df['CarName'] = df['CarName'].apply(lambda x: x.split()[0])

Output:

We find some brand aliases and misspellings, so we clean the data as follows:

df['CarName'] = df['CarName'].str.lower()
df['CarName'] = df['CarName'].replace({'vw':'volkswagen','vokswagen':'volkswagen','toyouta':'toyota','maxda':'mazda','porcshce':'porsche'})

Then we plot the number of cars per brand:

sns.set(rc={'figure.figsize':(30,10)})
sns.countplot(data=df, x='CarName')

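Next we do further feature engineering:

Categorical features

Most machine learning models cannot handle categorical data directly, so we encode it ourselves. Categorical features can be encoded with ordinal encoding or one-hot encoding (see the ShowMeAI article Machine Learning in Practice | A Complete Guide to Feature Engineering). A diagram of one-hot encoding:

Numeric features

Different models call for different treatments, such as scaling and distribution adjustment.

We first split the dataset's fields into two groups, categorical and numeric: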

categorical = []
numerical = []
for col in df.columns:
    if df[col].dtypes == 'object':
        categorical.append(col)
    else:
        numerical.append(col)

Next we use the dummy-variable transform in pandas to one-hot encode every feature marked as "categorical".

# One-hot encoding
x1 = pd.get_dummies(df[categorical], drop_first=False)
x2 = df[numerical]
# Numeric columns first, one-hot columns after them
X = pd.concat([x2, x1], axis=1)
X.drop('price', axis=1, inplace=True)
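To make the effect of one-hot encoding concrete, here is a small sketch on a made-up frame (the values are illustrative assumptions, not rows from the dataset). The `dtype=int` argument is only there to print 0/1 instead of booleans.

```python
import pandas as pd

# Two categorical columns with two distinct values each (hypothetical data)
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "sedan", "wagon"],
})

# Each distinct value becomes its own 0/1 column named <column>_<value>
encoded = pd.get_dummies(toy, drop_first=False, dtype=int)
print(encoded)
```

With `drop_first=False`, every category keeps its own column, which is what the article's code does. Regularized linear models such as Lasso and Ridge tolerate the resulting redundancy.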

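Now we handle the numeric features. We start with the label field price and plot its distribution:

sns.histplot(data=df, x="price", kde=True)

The plot shows a skewed distribution. We apply a log transform to bring it closer to a normal distribution. (Another consideration: if we train on the log of price, recovering price from a prediction uses exponentiation, which guarantees that the predicted price is always positive.) The code is as follows:

# Fix the skewed distribution
df["price_log"] = np.log(df["price"])
sns.histplot(data=df, x="price_log", kde=True)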
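The round trip between price and log-price can be checked with a few lines of NumPy. This is a sketch with made-up prices, intended only to illustrate why exponentiating a log-scale prediction always yields a positive value.

```python
import numpy as np

# Made-up prices, for illustration only
prices = np.array([5118.0, 10295.0, 45400.0])

log_prices = np.log(prices)   # what the model is trained on
restored = np.exp(log_prices)  # how a prediction is mapped back to a price

# Even a very negative log-scale prediction maps back to a positive price
extreme = np.exp(-50.0)
print(restored, extreme)
```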

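After the correction, the distribution is closer to normal. With this basic processing done, we are ready to start modeling.

Let's split the dataset into training and test sets and apply basic data transformations: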

# Split the data
from sklearn.model_selection import train_test_split

y = df['price_log']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333, random_state=1)
X_train, X_test = X_train.copy(), X_test.copy()

# Feature engineering: scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# X_train is a DataFrame, so select the numeric columns by name;
# the one-hot columns are left as they are
num_cols = x2.columns.drop('price')
X_train[num_cols] = sc.fit_transform(X_train[num_cols])
X_test[num_cols] = sc.transform(X_test[num_cols])
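The pattern of fitting the scaler on the training split and reusing it on the test split can be seen on a toy example. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows with two numeric features
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows
train_scaled = scaler.fit_transform(X_train_toy)
# transform reuses those statistics, so no test information leaks into them
test_scaled = scaler.transform(X_test_toy)

print(train_scaled.mean(axis=0), test_scaled)
```

The test row is expressed in units of the training distribution: a value equal to the training mean becomes 0, and a value outside the training range lands well beyond 1.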

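For the methods and skills involved in modeling, see the corresponding ShowMeAI tutorial.

Machine Learning in Practice | A Complete Guide to Using SKLearn

Our dataset is small (few samples), so with model complexity and performance in mind, we first test 4 models to see which one does best.

Lasso regression
Ridge regression
Random forest regressor
XGBoost regressor

We first import the models from scikit-learn and XGBoost: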

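# Regression models
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

To keep the modeling process compact and concise, we create a pipeline to train and tune the models. The steps are:

Train and evaluate each model with arbitrary, untuned hyperparameters.
Tune each model's hyperparameters with grid search.
Retrain and evaluate each model with the best parameters found.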
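The three steps above can be sketched on synthetic data. Everything in this example (the random data, the alpha grid, `cv=3`) is an assumption chosen to keep it small and is not the article's configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: y depends on the first two of three features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=60)

# Step 1: fit with an arbitrary, untuned hyperparameter
baseline = Lasso(alpha=1.0).fit(X_toy, y_toy)

# Step 2: grid search over alpha with cross-validation
grid = GridSearchCV(Lasso(), {"alpha": [0.001, 0.01, 0.1, 1.0]}, cv=3)
grid.fit(X_toy, y_toy)

# Step 3: refit with the best parameters found
tuned = Lasso(**grid.best_params_).fit(X_toy, y_toy)

print("best params:", grid.best_params_)
print("baseline R2:", baseline.score(X_toy, y_toy))
print("tuned R2:", tuned.score(X_toy, y_toy))
```

`GridSearchCV` already refits the best estimator on the full data by default (`grid.best_estimator_`), so step 3 is written out here only to mirror the list.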

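We first import grid search from scikit-learn:

from sklearn.model_selection import GridSearchCV

Next we build an evaluation function that prints the metrics of each fitted model (R squared, root mean squared error, and mean absolute error):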

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def metrics(model):
    model.fit(X_train, y_train)
    Y_pred = model.predict(X_test)

    # R squared, computed on the log scale
    r2 = round(r2_score(y_test, Y_pred), 4)
    print('R2_Score:', r2)

    # RMSE on the original price scale (np.exp undoes the log transform);
    # np.sqrt is used because the squared=False argument was removed in recent scikit-learn
    rmse = round(np.sqrt(mean_squared_error(np.exp(y_test), np.exp(Y_pred))), 2)
    print("RMSE:", rmse)

    # MAE on the original price scale
    mae = round(mean_absolute_error(np.exp(y_test), np.exp(Y_pred)), 2)
    print("MAE:", mae)

    return r2, rmse, mae
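Because the model predicts log-price, the error metrics are computed after mapping both the labels and the predictions back with `np.exp`. A minimal NumPy sketch with made-up numbers shows the arithmetic:

```python
import numpy as np

# Made-up true and predicted prices, stored on the log scale as the model sees them
y_true_log = np.log(np.array([10000.0, 20000.0]))
y_pred_log = np.log(np.array([11000.0, 18000.0]))

# Back on the price scale the errors are +1000 and -2000
errors = np.exp(y_pred_log) - np.exp(y_true_log)

rmse = np.sqrt(np.mean(errors ** 2))  # penalizes the larger error more
mae = np.mean(np.abs(errors))         # average absolute error in price units
print(rmse, mae)
```

Reporting RMSE and MAE in price units makes them directly interpretable, while R squared in the article's function is computed on the log scale.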

Now we build the pipeline:

# Candidate models
models = {
    'rfr': RandomForestRegressor(bootstrap=False, max_depth=15, max_features='sqrt',
                                 min_samples_split=2, n_estimators=100),
    'lasso': Lasso(alpha=0.005, fit_intercept=True),
    'ridge': Ridge(alpha=10, fit_intercept=True),
    # bootstrap, max_features and min_samples_split are random forest options,
    # not XGBRegressor parameters, so they are omitted here
    'xgb': xgb.XGBRegressor(max_depth=2, n_estimators=100),
}

# Search grid for each model
rfr_params = {
    "n_estimators": [10, 100, 1000, 2000, 4000, 6000],
    # 1.0 means "all features"; the old "auto" value was removed in scikit-learn 1.3
    "max_features": [1.0, "sqrt", "log2"],
    "max_depth": [2, 4, 8, 12, 15],
    "min_samples_split": [2, 4, 8],
    "bootstrap": [True, False],
}
xgb_params = {
    "n_estimators": [10, 100, 1000, 2000, 4000, 6000],
    "max_depth": [2, 4, 8, 12, 15],
}
linear_params = {
    "alpha": [0.005, 0.05, 0.1, 1, 10, 100, 290, 500],
    "fit_intercept": [True, False],
}
param_grids = {'rfr': rfr_params, 'xgb': xgb_params,
               'lasso': linear_params, 'ridge': linear_params}

# Evaluate each untuned model, then grid-search its hyperparameters
for mod, model in models.items():
    print('Untuned metrics for:', mod)
    metrics(model)
    print('\n')
    print('Starting grid search for:', mod)
    grid = GridSearchCV(model, param_grids[mod], verbose=5, cv=2)
    grid.fit(X_train, y_train)
    print("Best score:", grid.best_score_)
    print("Best params:", grid.best_params_)

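Results for the untuned models:

Without hyperparameter tuning, the R squared values differ little, but Lasso has the smallest error.

Now let's look at the grid search results to find the best parameters for each model:

Let's apply these parameters to each model and check the results:

After tuning, every model improves over its untuned version, but Lasso regression still performs best (this is related to the sample size and feature correlations of this dataset). We therefore keep the Lasso regression model and save it locally.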

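lasso_reg = Lasso(alpha=0.005, fit_intercept=True)
# Fit before saving; otherwise the pickle would hold an untrained model
lasso_reg.fit(X_train, y_train)
with open('model.pkl', 'wb') as f:
    pickle.dump(lasso_reg, f)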
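A quick way to confirm that a saved model is usable is a pickle round trip on a fitted estimator. The sketch below uses toy data and an in-memory byte string instead of a file; both are assumptions made to keep it self-contained.

```python
import pickle

import numpy as np
from sklearn.linear_model import Lasso

# Toy data where y equals x, only to have something to fit
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0])

fitted = Lasso(alpha=0.005).fit(X_toy, y_toy)

# Serialize and restore; pickle.dump / pickle.load do the same with a file object
blob = pickle.dumps(fitted)
restored = pickle.loads(blob)

print(restored.predict(X_toy))
```

An estimator that was never fitted has no `coef_` attribute, and calling `predict` on it after loading raises an error, so checking for `coef_` is a cheap sanity test.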

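Next we deploy the model to a web page as an app that returns estimates in real time. We use the gradio library to develop the web application. Prediction in the web app involves the following steps:

The user enters data in the web form
The data is processed (feature encoding and transformation)
The data is shaped to match the model's input format
The price is predicted and shown to the user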
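The encoding and format-matching steps come down to giving the model exactly the columns it saw during training, in the same order. One way to do that for a single input row is `DataFrame.reindex`. This is a sketch with hypothetical column names, and it is an alternative to the concatenation approach used in the article's code.

```python
import pandas as pd

# Hypothetical training-time column order (numeric first, then one-hot columns)
training_columns = ["horsepower", "fueltype_diesel", "fueltype_gas"]

# One row as it might arrive from the web form
row = pd.DataFrame([{"horsepower": 111, "fueltype": "gas"}])

# Encoding a single row only produces the dummy column for the value present
encoded = pd.get_dummies(row, dtype=int)

# reindex adds the missing dummy columns as 0 and fixes the column order
aligned = encoded.reindex(columns=training_columns, fill_value=0)
print(aligned)
```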

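First, we load the original dataset and the processed (one-hot encoded) dataset and keep their respective columns. The two CSV files read here, 'df_columns' and 'dummy_df', are expected to have been exported during training; that export step is not shown in this article.

# Columns of the df
df = pd.read_csv('df_columns')
df.drop(['Unnamed: 0','price'], axis=1, inplace=True)
cols = df.columns

# Dummy-variable columns of the df
dummy = pd.read_csv('dummy_df')
dummy.drop('Unnamed: 0', axis=1, inplace=True)
cols_to_use = dummy.columns

Next, for the categorical features, we build the dropdown options of the web app: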

# Build the candidate values used in the app

# Car brands, capitalized
cars = df['CarName'].unique().tolist()
carNameCap = []
for col in cars:
    carNameCap.append(col.capitalize())

# fueltype field
fuel = df['fueltype'].unique().tolist()
fuelCap = []
for fu in fuel:
    fuelCap.append(fu.capitalize())

# carbody, enginetype and fuelsystem fields
carb = df['carbody'].unique().tolist()
engtype = df['enginetype'].unique().tolist()
fuelsys = df['fuelsystem'].unique().tolist()

For the categorical fields that the model needs, we build dropdowns and add these candidate options.

Next we define a function that processes the data and returns the estimated price:

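# Transform the input data to match the model
def transform(data):
    # Scaler for the numeric features
    sc = StandardScaler()

    # Load the model
    with open('model.pkl', 'rb') as f:
        model = pickle.load(f)

    # New data as a DataFrame
    new_df = pd.DataFrame([data], columns=cols)
    # Separate categorical and numeric features
    cat = []
    num = []
    for col in new_df.columns:
        if new_df[col].dtypes == 'object':
            cat.append(col)
        else:
            num.append(col)
    x1_new = pd.get_dummies(new_df[cat], drop_first=False)
    x2_new = new_df[num]

    X_new = pd.concat([x2_new, x1_new], axis=1)
    # Align with the training columns; dummy columns absent from this row become 0
    final_df = pd.DataFrame(columns=cols_to_use)
    final_df = pd.concat([final_df, X_new])
    final_df = final_df.fillna(0)
    # Append the training rows so the scaler is not fitted on a single row
    final_df = pd.concat([final_df, dummy])
    X_new = final_df.values.astype(float)
    # Numeric columns come first (assuming dummy_df keeps the training column order)
    X_new[:, :len(num)] = sc.fit_transform(X_new[:, :len(num)])
    # Row 0 is the user's input
    output = model.predict(X_new[0].reshape(1, -1))
    return "The price of the car is $" + str(round(np.exp(output)[0], 2))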

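Next we create the elements of the gradio web app. We build dropdowns or radio buttons for the categorical fields and input controls (sliders here) for the numeric fields. Reference code:

# Categorical
car = gr.Dropdown(label = "Car brand", choices=carNameCap)
# Numeric
curbweight = gr.Slider(label = "Weight of the car (in pounds)", minimum = 500, maximum = 6000)

Now let's add everything to the interface:

With everything in place, we can deploy!

Let's deploy the app built above. First we set the app's IP and port: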

 export GRADIO_SERVER_NAME=0.0.0.0
export GRADIO_SERVER_PORT="$PORT"
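Gradio reads `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT` from the environment on its own, so no extra code is strictly needed. If you prefer to make the settings explicit, a small helper can read them and pass them to `launch`. The helper name `launch_settings` is hypothetical, and the fallback values assume gradio's documented defaults of 127.0.0.1 and port 7860.

```python
import os


def launch_settings(env=os.environ):
    """Read the server host and port from environment-style variables."""
    return {
        "server_name": env.get("GRADIO_SERVER_NAME", "127.0.0.1"),
        "server_port": int(env.get("GRADIO_SERVER_PORT", "7860")),
    }


# A plain dict stands in for the real environment in this example
settings = launch_settings({"GRADIO_SERVER_NAME": "0.0.0.0", "GRADIO_SERVER_PORT": "8080"})
print(settings)
```

The result can then be passed as `app.launch(**launch_settings())`. Binding to 0.0.0.0 makes the app reachable from outside the host, which is what a hosted deployment usually requires.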

Make sure the following dependencies are installed with pip:

 numpy                            
pandas                             
scikit-learn                             
gradio                             
Flask                             
argparse                             
gunicorn                             
rq

Then run python WebApp.py to test the application. The content of WebApp.py is as follows:

 import gradio as gr
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
 
# Lookup dictionaries that map form labels to dataset values
asp = {
    'Standard': 'std',
    'Turbo': 'turbo'
}
 
drivew = {
    'Rear wheel drive': 'rwd',
    'Front wheel drive': 'fwd', 
    '4 wheel drive': '4wd'
}
 
cylnum = {
    2: 'two',
    3: 'three', 
    4: 'four',
    5: 'five', 
    6: 'six', 
    8: 'eight',
    12: 'twelve'
}
 
# Column names of the original df
df = pd.read_csv('df_columns')
df.drop(['Unnamed: 0','price'], axis = 1, inplace=True)
cols = df.columns
 
# Column names after one-hot encoding
dummy = pd.read_csv('dummy_df')
dummy.drop('Unnamed: 0', axis = 1, inplace=True)
cols_to_use = dummy.columns
 
# Car brand names
cars = df['CarName'].unique().tolist()
carNameCap = []
for col in cars:
    carNameCap.append(col.capitalize())
 
# fuel
fuel = df['fueltype'].unique().tolist()
fuelCap = []
for fu in fuel:
    fuelCap.append(fu.capitalize())
 
# carbody, enginetype and fuelsystem
carb = df['carbody'].unique().tolist() 
engtype = df['enginetype'].unique().tolist()
fuelsys = df['fuelsystem'].unique().tolist()
 
# Transform the input data to match the model
def transform(data):
    # Scaler for the numeric features
    sc = StandardScaler()

    # Load the model
    with open('model.pkl', 'rb') as f:
        lasso_reg = pickle.load(f)

    # New data as a DataFrame
    new_df = pd.DataFrame([data], columns=cols)

    # Separate categorical and numeric fields
    cat = []
    num = []
    for col in new_df.columns:
        if new_df[col].dtypes == 'object':
            cat.append(col)
        else:
            num.append(col)

    # Build the data in the format the model expects
    x1_new = pd.get_dummies(new_df[cat], drop_first=False)
    x2_new = new_df[num]
    X_new = pd.concat([x2_new, x1_new], axis=1)

    # Align with the training columns; dummy columns absent from this row become 0
    final_df = pd.DataFrame(columns=cols_to_use)
    final_df = pd.concat([final_df, X_new])
    final_df = final_df.fillna(0)
    # Append the training rows so the scaler is not fitted on a single row
    final_df = pd.concat([final_df, dummy])

    X_new = final_df.values.astype(float)
    # Numeric columns come first (assuming dummy_df keeps the training column order)
    X_new[:, :len(num)] = sc.fit_transform(X_new[:, :len(num)])
    # Row 0 is the user's input; the rows after it are the training data
    output = lasso_reg.predict(X_new[0].reshape(1, -1))
    return "The price of the car is $" + str(round(np.exp(output)[0], 2))
 
# Main function for price prediction
def predict_price(car, fueltype, aspiration, doornumber, carbody, drivewheel, enginelocation, wheelbase, carlength, carwidth, 
                carheight, curbweight, enginetype, cylindernumber, enginesize, fuelsystem, boreratio, horsepower, citympg, highwaympg): 
 
    new_data = [car.lower(), fueltype.lower(), asp[aspiration], doornumber.lower(), carbody, drivew[drivewheel], enginelocation.lower(),
                wheelbase, carlength, carwidth, carheight, curbweight, enginetype, cylnum[cylindernumber], enginesize, fuelsystem, 
                boreratio, horsepower, citympg, highwaympg]
    
    return transform(new_data) 
 
 
car = gr.Dropdown(label = "Car brand", choices=carNameCap)
 
fueltype = gr.Radio(label = "Fuel Type", choices = fuelCap)
 
aspiration = gr.Radio(label = "Aspiration type", choices = ["Standard", "Turbo"])
 
doornumber = gr.Radio(label = "Number of doors", choices = ["Two", "Four"])
 
carbody = gr.Dropdown(label ="Car body type", choices = carb)
 
drivewheel = gr.Radio(label = "Drive wheel", choices = ['Rear wheel drive', 'Front wheel drive', '4 wheel drive'])
 
enginelocation = gr.Radio(label = "Engine location", choices = ['Front', 'Rear'])
 
wheelbase = gr.Slider(label = "Distance between the wheels on the side of the car (in inches)", minimum = 50, maximum = 300)
 
carlength = gr.Slider(label = "Length of the car (in inches)", minimum = 50, maximum = 300)
 
carwidth = gr.Slider(label = "Width of the car (in inches)", minimum = 50, maximum = 300)
 
carheight = gr.Slider(label = "Height of the car (in inches)", minimum = 50, maximum = 300)
 
curbweight = gr.Slider(label = "Weight of the car (in pounds)", minimum = 500, maximum = 6000)
 
enginetype = gr.Dropdown(label = "Engine type", choices = engtype)
 
cylindernumber = gr.Radio(label = "Cylinder number", choices = [2, 3, 4, 5, 6, 8, 12])
 
enginesize = gr.Slider(label = "Engine size (swept volume of all the pistons inside the cylinders)", minimum = 50, maximum = 500)
 
fuelsystem = gr.Dropdown(label = "Fuel system", choices = fuelsys)
 
boreratio = gr.Slider(label = "Bore ratio (ratio between cylinder bore diameter and piston stroke)", minimum = 1, maximum = 6)
 
horsepower = gr.Slider(label = "Horse power of the car", minimum = 25, maximum = 400)
 
citympg = gr.Slider(label = "Fuel economy in city (in miles per gallon)", minimum = 0, maximum = 100)
 
highwaympg = gr.Slider(label = "Fuel economy on highway (in miles per gallon)", minimum = 0, maximum = 100)
 
Output = gr.Textbox()
 
app = gr.Interface(title="Predict the price of a car based on its specs", 
                    fn=predict_price,
                    inputs=[car,
                            fueltype,
                            aspiration,
                            doornumber,
                            carbody,
                            drivewheel, 
                            enginelocation, 
                            wheelbase,
                            carlength, 
                            carwidth, 
                            carheight, 
                            curbweight,
                            enginetype, 
                            cylindernumber, 
                            enginesize,
                            fuelsystem,
                            boreratio,
                            horsepower, 
                            citympg, 
                            highwaympg
                            ],
                    outputs=Output)
 
app.launch()

The final app looks as follows. You can select and fill in the features yourself to get a model estimate!

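📘 Illustrated Data Analysis: From Beginner to Master tutorial series https://www.showmeai.tech/tutorials/33
📘 Data Science Toolkit Cheat Sheets | Pandas Cheat Sheet https://www.showmeai.tech/article-detail/101
📘 Data Science Toolkit Cheat Sheets | Seaborn Cheat Sheet https://www.showmeai.tech/article-detail/105
📘 Machine Learning in Practice | A Complete Guide to Feature Engineering https://www.showmeai.tech/article-detail/208
📘 Machine Learning in Practice | A Complete Guide to Using SKLearn https://www.showmeai.tech/article-detail/203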
