Artificial Intelligence: Travel on a Budget with AI by Accurately Predicting Homestay Listing Prices

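Author: 韩信子@ShowMeAI
Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
Article URL: https://www.showmeai.tech/article-detail/316
Notice: All rights reserved. For reprints, please contact the platform and the author and credit the source.
Bookmark ShowMeAI to see more great content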

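Accommodation is one of the things people care about most when they travel. Abroad, the online homestay model represented by Airbnb has transformed the hotel industry, and many travellers now prefer booking an Airbnb to booking a hotel. In China, platforms such as Meituan and Fliggy also host large numbers of homestay listings.

In today's era of open and transparent information on the internet, can we collect data and build a machine learning model that predicts listing prices, so that we have smarter information when planning a trip? We certainly can. Below, ShowMeAI uses Airbnb listing data for the Greater Manchester area (as of March 2022) to walk through the full process of data analysis, data mining and modelling. The same approach can be applied to the domestic platforms readers are familiar with.

The project and the Airbnb homestay data below come from Inside Airbnb, which publishes data and advocacy about Airbnb's impact on residential communities. The data can be obtained from that source, or from ShowMeAI's Baidu Netdisk, where we have stored the project data for you.

Dataset download (Baidu Netdisk): reply 『实战』 to the WeChat official account 『ShowMeAI Research Center』, or click here to get the 『Airbnb homestay data』 for this article: [[22] A homestay price prediction model based on Airbnb data](https://www.showmeai.tech/art...)
⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub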

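Business questions

Before mining and modelling, we generally need a solid understanding of the business scenario and the data. We first list the business questions we care about in this scenario, and then answer them through data analysis and mining.

Which areas or towns have the most Airbnb listings?
What is the most popular property type?
What characterises Airbnb listing prices in Greater Manchester?
How are listings distributed across hosts?
What room types are available in Greater Manchester?
How can a machine learning model be built to predict Airbnb listing prices in the area?
Which features matter most when predicting Airbnb listing prices in Greater Manchester?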

Loading and first look at the data

We first import the analysis, mining and modelling libraries used in this project.

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm, trange
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.inspection import permutation_importance
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Next we read the listing data for the Greater Manchester area.

gm_listings = pd.read_csv('gm_listings-2.csv')
gm_calendar = pd.read_csv('calendar-2.csv')
gm_reviews = pd.read_csv('reviews-2.csv')

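Basic information about the data is shown below.

gm_listings.head()

gm_listings.shape
# (3584, 74)
gm_listings.columns

gm_calendar.head()

gm_reviews.head()

From this first look, the Greater Manchester listings dataset contains 3,584 rows and 74 columns, with information about hosts, property types, areas and ratings.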

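Data cleaning

Data cleaning is a core step of the feature engineering stage of a machine learning project. For the methods and techniques involved, see the corresponding ShowMeAI tutorial.
Machine Learning in Practice | A Complete Guide to Feature Engineering for Machine Learning

Column cleaning

The dataset has many columns and some of them are messy, so some cleaning is needed. Several columns contain URLs and contribute little to the final prediction, so we drop them, together with a few other columns that are not used later (free-text descriptions, coordinates, availability counts).

# Drop URL columns and other columns not used for prediction
def drop_function(df):
    df = df.drop(columns=['listing_url', 'description', 'host_thumbnail_url', 'host_picture_url',
                          'latitude', 'longitude', 'picture_url', 'host_url', 'host_location',
                          'neighbourhood', 'neighbourhood_cleansed', 'host_about', 'has_availability',
                          'availability_30', 'availability_60', 'availability_90', 'availability_365',
                          'calendar_last_scraped'])
    return df

gm_df = drop_function(gm_listings)

After dropping these columns the data is much cleaner.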

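Handling missing values

The data also contains missing values, which we analyse and handle:

# Percentage of missing values per column
(gm_df.isnull().sum() / gm_df.shape[0]) * 100

This gives the following result: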

id                                                0.000000
scrape_id                                         0.000000
last_scraped                                      0.000000
name                                              0.000000
neighborhood_overview                            41.266741
host_id                                           0.000000
host_name                                         0.000000
host_since                                        0.000000
host_response_time                               10.212054
host_response_rate                               10.212054
host_acceptance_rate                              5.636161
host_is_superhost                                 0.000000
host_neighbourhood                               91.657366
host_listings_count                               0.000000
host_total_listings_count                         0.000000
host_verifications                                0.000000
host_has_profile_pic                              0.000000
host_identity_verified                            0.000000
neighbourhood_group_cleansed                      0.000000
property_type                                     0.000000
room_type                                         0.000000
accommodates                                      0.000000
bathrooms                                       100.000000
bathrooms_text                                    0.306920
bedrooms                                          4.687500
beds                                              2.120536
amenities                                         0.000000
price                                             0.000000
minimum_nights                                    0.000000
maximum_nights                                    0.000000
minimum_minimum_nights                            0.000000
maximum_minimum_nights                            0.000000
minimum_maximum_nights                            0.000000
maximum_maximum_nights                            0.000000
minimum_nights_avg_ntm                            0.000000
maximum_nights_avg_ntm                            0.000000
calendar_updated                                100.000000
number_of_reviews                                 0.000000
number_of_reviews_ltm                             0.000000
number_of_reviews_l30d                            0.000000
first_review                                     19.810268
last_review                                      19.810268
review_scores_rating                             19.810268
review_scores_accuracy                           20.089286
review_scores_cleanliness                        20.089286
review_scores_checkin                            20.089286
review_scores_communication                      20.089286
review_scores_location                           20.089286
review_scores_value                              20.089286
license                                         100.000000
instant_bookable                                  0.000000
calculated_host_listings_count                    0.000000
calculated_host_listings_count_entire_homes       0.000000
calculated_host_listings_count_private_rooms      0.000000
calculated_host_listings_count_shared_rooms       0.000000
reviews_per_month                                19.810268
dtype: float64

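We handle missing values differently depending on the situation:

Columns with a high share of missing values: license, calendar_updated, bathrooms and host_neighbourhood are more than 90% NaN, and neighborhood_overview is 41% NaN and holds free text. We drop these columns outright.
Numeric columns with few missing values are filled with the column mean, which leaves each column's mean unchanged. These columns include bedrooms, beds, review_scores_rating, review_scores_accuracy and the other review score columns.
Categorical columns such as bathrooms_text and host_response_time are filled with the mode.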

# Drop columns with a high share of missing values
def drop_function_2(df):
    df = df.drop(columns=['license', 'calendar_updated', 'bathrooms', 'host_neighbourhood', 'neighborhood_overview'])
    return df

gm_df = drop_function_2(gm_df)

# Fill with the column mean
def input_mean(df, column_list):
    for column in column_list:
        # Assign the result back; fillna(inplace=True) on a column selection
        # may not update the parent frame in newer pandas versions
        df[column] = df[column].fillna(df[column].mean())
    return df

column_list = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
               'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
               'review_scores_value', 'reviews_per_month',
               'bedrooms', 'beds']
gm_df = input_mean(gm_df, column_list)

# Fill with the mode (most frequent value)
def input_mode(df, column_list):
    for column in column_list:
        df[column] = df[column].fillna(df[column].mode()[0])
    return df

column_list = ['first_review', 'last_review', 'bathrooms_text', 'host_acceptance_rate',
               'host_response_rate', 'host_response_time']
gm_df = input_mode(gm_df, column_list)
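The drop list above is written out by hand. As a minimal sketch of the same idea, the columns to drop can also be derived from a missing-value threshold. The helper name `columns_above_missing_threshold`, the 40% threshold and the toy data are assumptions made for illustration, not part of the original project.

```python
import pandas as pd

def columns_above_missing_threshold(df: pd.DataFrame, threshold_pct: float) -> list:
    """Return the columns whose share of missing values exceeds threshold_pct."""
    missing_pct = df.isnull().mean() * 100
    return missing_pct[missing_pct > threshold_pct].index.tolist()

# Toy frame with hypothetical values (not the Airbnb data)
toy = pd.DataFrame({
    "license": [None, None, None, None],  # 100% missing
    "beds": [1.0, None, 2.0, 3.0],        # 25% missing
    "price": [50, 80, 120, 60],           # complete
})

to_drop = columns_above_missing_threshold(toy, threshold_pct=40)
print(to_drop)
```

A threshold keeps the rule explicit and reusable when the dataset is refreshed, although free-text columns may still need to be reviewed by hand.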

Encoding boolean columns

Columns such as host_is_superhost and instant_bookable hold the strings 't' and 'f', meaning true and false. We encode them as 1 and 0.

gm_df = gm_df.replace({'host_is_superhost': 't', 'host_has_profile_pic': 't', 'host_identity_verified': 't', 'instant_bookable': 't'}, 1)
gm_df = gm_df.replace({'host_is_superhost': 'f', 'host_has_profile_pic': 'f', 'host_identity_verified': 'f', 'instant_bookable': 'f'}, 0)

We check the value distribution after the replacement.

gm_df['host_is_superhost'].value_counts()

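Converting column formats

The price column is still a string that contains symbols such as "$". We clean it and convert it to a numeric type.

def string_to_int(df, column):
    # Strip the currency symbol and thousands separator;
    # regex=False makes "$" a literal character rather than a regex anchor
    df[column] = df[column].str.replace("$", "", regex=False)
    df[column] = df[column].str.replace(",", "", regex=False)
    # Convert to a numeric type (astype(int) drops any decimal part)
    df[column] = pd.to_numeric(df[column]).astype(int)
    return df

gm_df = string_to_int(gm_df, 'price')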
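To make the conversion concrete, here is a small sketch of the same cleaning steps on a toy Series. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw price strings in the same style as the listings file
raw = pd.Series(["$79.00", "$1,250.00", "$8.00"])

# Remove the currency symbol and thousands separator as literal characters
clean = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
)

# Parse to float first, then truncate to whole units
prices = pd.to_numeric(clean).astype(int)
print(prices.tolist())
```

Passing `regex=False` explicitly avoids depending on the pandas version, since "$" is an end-of-string anchor when interpreted as a regular expression.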

Encoding list-valued columns

Columns such as host_verifications and amenities hold list-formatted values. We encode them as dummy variables.

# Look at the list-valued columns
gm_df_copy = gm_df.copy()
gm_df_copy['amenities'].head()

gm_df_copy['host_verifications'].head()

# Dummy-variable encoding
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('"', '', regex=False)
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace(']', '', regex=False)
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('[', '', regex=False)
# Note: splitting on "," leaves a leading space on every item after the first
df_amenities = gm_df_copy['amenities'].str.get_dummies(sep=",")
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace("'", '', regex=False)
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace(']', '', regex=False)
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace('[', '', regex=False)
df_host_ver = gm_df_copy['host_verifications'].str.get_dummies(sep=",")

The encoded result looks like this:

df_amenities.head()
df_host_ver.head()

# Drop the original columns
gm_df = gm_df.drop(['host_verifications', 'amenities'], axis=1)
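The following sketch shows, on two made-up rows, how the choice of separator affects the dummy columns. Splitting on "," alone keeps the space that follows each comma, so the same amenity can end up in two columns; splitting on ", " is one possible variation that avoids this. It is offered as an option, not as what the original code does.

```python
import pandas as pd

# Hypothetical amenities strings in the list-like format of the listings file
amenities = pd.Series(['["Wifi", "Kitchen"]', '["Gym", "Wifi"]'])

# Remove brackets and double quotes in one pass
cleaned = amenities.str.replace(r'[\[\]"]', "", regex=True)

# Splitting on "," keeps leading spaces: "Wifi" and " Wifi" become separate columns
naive = cleaned.str.get_dummies(sep=",")

# Splitting on ", " yields one column per amenity
dummies = cleaned.str.get_dummies(sep=", ")

print(naive.columns.tolist())
print(dummies.columns.tolist())
```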

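Exploratory data analysis

Next we carry out a more thorough exploratory data analysis.

For the libraries used in the EDA part, see ShowMeAI's cheat sheets and tutorials to learn them and get started quickly.
Data Science Tools Cheat Sheets | Pandas Cheat Sheet
Illustrated Data Analysis: From Beginner to Expert tutorial series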

Which neighbourhoods have the most listings?

gm_df['neighbourhood_group_cleansed'].value_counts()

counts = gm_df['neighbourhood_group_cleansed'].value_counts().sort_values()
# Build a new DataFrame from the counts (explicit column names work across pandas versions)
bar_data = pd.DataFrame({'Towns': counts.index, 'number_of_listings': counts.values})
bar_data['fraction_of_total'] = bar_data['number_of_listings'] / gm_df['neighbourhood_group_cleansed'].count()
# Plot
bar_data.plot(kind='barh', x='Towns', y='fraction_of_total', figsize=(8, 6))
plt.title('Towns with the Most listings');
plt.xlabel('Fraction of Total Listings');

The town of Manchester has most of Greater Manchester's listings, with 53% of the total (1,849), followed by Salford with 17% and Trafford with 9%.

Price distribution of Airbnb listings in Greater Manchester

gm_df['price'].mean(), gm_df['price'].min(), gm_df['price'].max(), gm_df['price'].median()
# (143.47600446428572, 8, 7372, 79.0)

The mean listing price is $143 and the median is $79. The highest price observed in the dataset is $7,372.

# Define price bands
labels = ['$0 - $100', '$100 - $200', '$200 - $300', '$300 - $400', '$400 - $500', '$500 - $1000', '$1000 - $8000']
price_cuts = pd.cut(gm_df['price'], bins=[0, 100, 200, 300, 400, 500, 1000, 8000], right=True, labels=labels)
# Build a DataFrame from the price bands
price_clusters = pd.DataFrame(price_cuts).rename(columns={'price': 'price_clusters'})
# Concatenate onto the original DataFrame
gm_df = pd.concat([gm_df, price_clusters], axis=1)

# Plot the distribution
def price_cluster_plot(df, column, title):
    plt.figure(figsize=(8, 6))
    yx = sb.histplot(data=df[column])
    total = float(df[column].count())
    # Annotate each bar with its share of all listings
    for p in yx.patches:
        height = p.get_height()
        yx.text(p.get_x() + p.get_width() / 2., height + 5, '{:1.1f}%'.format((height / total) * 100), ha='center')
    yx.set_title(title)
    plt.xticks(rotation=90)
    return yx

price_cluster_plot(gm_df, column='price_clusters',
                   title='Price distribution of Airbnb Listings in the Greater Manchester Area');

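The analysis and plot above show that 65.4% of listings are priced between $0 and $100, and 23.4% between $100 and $200. The distribution also has a pronounced long tail. The very expensive listings can be treated as outliers, and they may affect our analysis.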
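As a small illustration of how the banding works, the sketch below applies `pd.cut` with right-inclusive bins to a handful of made-up prices and computes the share of each band. The bin edges are shortened and the sample prices are hypothetical.

```python
import pandas as pd

# Hypothetical sample of prices, including one extreme value
prices = pd.Series([8, 79, 100, 143, 250, 7372])

bins = [0, 100, 200, 300, 8000]
labels = ["0-100", "100-200", "200-300", "300-8000"]

# right=True means each band includes its upper edge, so 100 falls in "0-100"
bands = pd.cut(prices, bins=bins, right=True, labels=labels)

# Share of listings per band, in percent, in band order
share = bands.value_counts(normalize=True).sort_index() * 100
print(share.round(1).to_dict())
```

With right-inclusive bins, a price that sits exactly on an edge belongs to the lower band, which matters when reading the percentages per band.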

What is the most popular property type?

# Rank property types by review count
ax = gm_df.groupby('property_type').agg(
    median_rating=('review_scores_rating', 'median'),
    number_of_reviews=('number_of_reviews', 'max')
).sort_values(by='number_of_reviews', ascending=False).reset_index()
ax.head()

Among the 10 property types with the most reviews (measured by the largest review count of any single listing of that type), Entire rental unit comes first, followed by Private room in rental unit.

# Visualise the top 10 property types
bx = ax.head(10)
bx = sb.boxplot(data=bx, x='median_rating', y='property_type')
bx.set_xlim(4.5, 5)
plt.title('Most Enjoyed Property types');
plt.xlabel('Median Rating');
plt.ylabel('Property Type')

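Distribution of listings across hosts

# Hosts with the most listings, as a percentage of all listings
host_share = gm_df['host_name'].value_counts(normalize=True) * 100
host_df = pd.DataFrame({'name': host_share.index, 'perc_count': host_share.values})
host_df.head(10)

# Note: .loc[:10] is label-based and inclusive, so this sums rows 0 through 10;
# use .iloc[:10] for exactly ten rows
host_df['perc_count'].loc[:10].sum()

From this analysis, the top 10 hosts by number of listings account for 13.6% of all listings.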

Room types on offer in Greater Manchester

gm_df['room_type'].value_counts()

# Plot the distribution
zx = sb.countplot(data=gm_df, x='room_type')
total = float(gm_df['room_type'].count())
# Annotate each bar with its share of all listings
for p in zx.patches:
    height = p.get_height()
    zx.text(p.get_x() + p.get_width() / 2., height + 5, '{:1.1f}%'.format((height / total) * 100), ha='center')
zx.set_title('Plot showing different type of rooms available');
plt.xlabel('Room')

Most listings are entire homes or apartments, at 60% of the total, followed by private rooms at 39%. Shared rooms and hotel rooms account for 0.7% and 0.5% of listings respectively.

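Machine learning modelling

Below we use regression models to estimate homestay listing prices.

Feature engineering

For more on feature engineering, see the corresponding ShowMeAI tutorial.
Machine Learning in Practice | A Complete Guide to Feature Engineering for Machine Learning

We first apply feature engineering to the raw data to obtain features suitable for modelling.

# Look at the dataset at this point
gm_df.head()

# Dataset for regression
gm_regression_df = gm_df.copy()
# Drop columns that are not useful for modelling
gm_regression_df = gm_regression_df.drop(columns=['id', 'scrape_id', 'last_scraped', 'name', 'host_id', 'host_since',
                                                  'first_review', 'last_review', 'price_clusters', 'host_name'])
# Look at the data again
gm_regression_df.head()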

The host_response_rate and host_acceptance_rate columns contain percent signs, so a little more cleaning is needed.

# Remove the percent sign and convert to a numeric type
gm_regression_df['host_response_rate'] = gm_regression_df['host_response_rate'].str.replace("%", "")
gm_regression_df['host_acceptance_rate'] = gm_regression_df['host_acceptance_rate'].str.replace("%", "")
# Convert to int
gm_regression_df['host_response_rate'] = pd.to_numeric(gm_regression_df['host_response_rate']).astype(int)
gm_regression_df['host_acceptance_rate'] = pd.to_numeric(gm_regression_df['host_acceptance_rate']).astype(int)
# Check the converted result
gm_regression_df['host_response_rate'].head()

The bathrooms_text column mixes numbers and text, so we process it further.

# Look at the original column
gm_regression_df['bathrooms_text'].value_counts()

# Split the column by keyword
def split_bathroom(df, column, text, new_column):
    # Copy the original text into new_column for rows whose text contains `text`
    mask = df[column].str.contains(text, case=False)
    df.loc[mask, new_column] = df.loc[mask, column]
    return df

# Apply the function
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='shared', new_column='shared_bath')
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='private', new_column='private_bath')
# Look at the shared_bath column
gm_regression_df['shared_bath'].value_counts()

# Look at the private_bath column
gm_regression_df['private_bath'].value_counts()

# Shorten the shared/private labels so that only plain bathroom entries still contain "bath"
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("private bath", "pb", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("private baths", "pbs", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("shared bath", "sb", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("shared baths", "sb", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("shared half-bath", "sb", case=False)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("private half-bath", "sb", case=False)
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='bath', new_column='bathrooms_new')

# Keep only the first token (the count, or "Shared" / "Private" / "Half-bath")
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].str.split(" ").str[0]
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].str.split(" ").str[0]
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].str.split(" ").str[0]

# Fill missing values with 0
gm_regression_df = gm_regression_df.fillna(0)

# Half-baths have no leading number; count them as 0.5
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].replace(to_replace='Shared', value=0.5)
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].replace(to_replace='Private', value=0.5)
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].replace(to_replace='Half-bath', value=0.5)

# Convert to numeric (note: astype(int) truncates, so 0.5 becomes 0 and 1.5 becomes 1)
gm_regression_df['shared_bath'] = pd.to_numeric(gm_regression_df['shared_bath']).astype(int)
gm_regression_df['private_bath'] = pd.to_numeric(gm_regression_df['private_bath']).astype(int)
gm_regression_df['bathrooms_new'] = pd.to_numeric(gm_regression_df['bathrooms_new']).astype(int)

# Look at the processed columns
gm_regression_df[['shared_bath', 'private_bath', 'bathrooms_new']].head()
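The chain of string replacements above can be hard to follow. As an alternative sketch of the same parsing idea, a small pure-Python function can extract the count and the kind of bathroom from one text value. The function name `parse_bathrooms`, the "regular" label and the sample strings are assumptions for illustration; unlike the code above, this sketch keeps half-baths as 0.5 instead of truncating them.

```python
import re

def parse_bathrooms(text: str) -> tuple:
    """Return (count, kind) for a bathrooms_text-style string.

    kind is 'shared', 'private' or 'regular'; a half-bath without a number counts as 0.5.
    """
    lowered = text.lower()
    if "shared" in lowered:
        kind = "shared"
    elif "private" in lowered:
        kind = "private"
    else:
        kind = "regular"

    # A leading number such as "1" or "1.5" gives the count directly
    match = re.search(r"\d+(\.\d+)?", lowered)
    if match:
        count = float(match.group())
    elif "half" in lowered:
        count = 0.5
    else:
        count = 0.0
    return count, kind

# Hypothetical values in the style of the bathrooms_text column
for sample in ["1 shared bath", "1.5 baths", "Shared half-bath", "1 private bath"]:
    print(sample, "->", parse_bathrooms(sample))
```

The result could then be spread into separate numeric columns, for example with `Series.apply`, if this representation is preferred.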

Next we encode the categorical columns. Depending on what each column means, we use ordinal encoding or one-hot encoding.

# Ordinal encoding
def encoder(df):
    for column in ['neighbourhood_group_cleansed', 'property_type']:
        labels = df[column].astype('category').cat.categories.tolist()
        # Map each category to an integer code starting at 1
        replace_map = {column: {k: v for k, v in zip(labels, range(1, len(labels) + 1))}}
        df.replace(replace_map, inplace=True)
        print(replace_map)
    return df

gm_regression_df = encoder(gm_regression_df)

For host_response_time and room_type we use one-hot encoding (dummy variables).

host_dummy = pd.get_dummies(gm_regression_df['host_response_time'], prefix='host_response')
room_dummy = pd.get_dummies(gm_regression_df['room_type'], prefix='room_type')
# Concatenate the encoded columns
gm_regression_df = pd.concat([gm_regression_df, host_dummy, room_dummy], axis=1)
# Drop the original columns
gm_regression_df = gm_regression_df.drop(columns=['host_response_time', 'room_type'])

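We also take the df_amenities frame built earlier, process it slightly, and concatenate it onto the features.

# Total count per amenity, as a frame with columns 'amenities' and 'count'
df_3 = df_amenities.sum().rename_axis('amenities').reset_index(name='count')
# Keep the first 150 amenity columns (in column order; no sorting by count is applied here)
features = df_3['amenities'][:150].to_list()
amenities_updated = df_amenities.filter(items=features)
gm_regression_df = pd.concat([gm_regression_df, amenities_updated], axis=1)

Check the dimensions of the final data.

gm_regression_df.shape
# (3584, 198)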

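We end up with 198 columns. To avoid multicollinearity among the features, we use the variance inflation factor (VIF) to select features for the machine learning models. Features with a VIF above 10 are removed, because their variance can largely be expressed and explained by the other features in the dataset.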

# Compute the VIF of every feature (this assumes all remaining columns are numeric)
vif_model = gm_regression_df.drop(['price'], axis=1)
vif_df = pd.DataFrame()
vif_df['feature'] = vif_model.columns
vif_df['VIF'] = [variance_inflation_factor(vif_model.values, i) for i in range(len(vif_model.columns))]
# Keep the features with a VIF of at most 10
vif_df_new = vif_df[vif_df['VIF'] <= 10]
feature_list = vif_df_new['feature'].to_list()
# Select the data for those features
model_df = gm_regression_df.filter(items=feature_list)
model_df.head()
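To show what the VIF measures, here is a minimal NumPy sketch based on the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The function `vif` and the synthetic data are assumptions for illustration. The sketch always adds an intercept, so its numbers may differ from the statsmodels function used above, which, as far as I know, works on the design matrix exactly as given.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic features: c is almost a linear combination of a and b
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.05, size=500)

independent_scores = vif(np.column_stack([a, b]))    # close to 1
collinear_scores = vif(np.column_stack([a, b, c]))   # far above 10
print(independent_scores, collinear_scores)
```

Note that once c is present, all three features get a large VIF, because each one can be reconstructed from the other two. Dropping every feature above the threshold at once, as the article does, can therefore remove a whole group of related features; removing them one at a time and recomputing is a common alternative.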

We join the price target column back on to build the complete dataset.

price_col = gm_regression_df['price']
model_df = model_df.join(price_col)

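Machine learning algorithms

Here we use several typical regression algorithms: linear regression, RandomForestRegressor, Lasso regression and GradientBoostingRegressor.

For how to apply machine learning algorithms, see the corresponding ShowMeAI tutorials and articles.
Machine Learning in Practice: A Hands-On Guide to Machine Learning series
Machine Learning in Practice | Getting Started with SKLearn and Simple Use Cases
Machine Learning in Practice | The Complete SKLearn Application Guide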

Linear regression

def linear_reg(df, test_size=0.3, random_state=42):
    '''
    Build the model and return the evaluation results.
    Input: data as a DataFrame
    Output: feature importance and evaluation metrics (RMSE and R-squared)
    '''
    X = df.drop(columns=['price'])
    y = df[['price']]
    X_columns = X.columns

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Linear regression model
    clf = LinearRegression()

    # Candidate parameters
    parameters = {
        'n_jobs': [1, 2, 5, 10, 100],
        'fit_intercept': [True, False]
    }

    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=3, verbose=3)
    cv.fit(X_train, y_train)

    # Predict on the test set
    pred = cv.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse ** .5

    # Best parameters
    best_par = cv.best_params_
    coefficients = cv.best_estimator_.coef_

    # Feature importance (absolute value of the coefficients)
    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, columns=X_columns).T
    feature_importance.columns = ['importance']
    feature_importance = feature_importance.sort_values('importance', ascending=False)

    print("The model performance for testing set")
    print("--------------------------------------")
    print('RMSE is {}'.format(rmse))
    print('R2 score is {}'.format(r2))
    print("\n")

    return feature_importance, rmse, r2

linear_feat_importance, linear_rmse, linear_r2 = linear_reg(model_df)
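All models in this article are compared on RMSE and R². The sketch below computes both from their definitions with NumPy on made-up numbers, and then adds a single extreme listing to show how strongly one badly predicted outlier moves the RMSE. The helper names and values are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Hypothetical prices and predictions
y_true = [50, 80, 120, 60]
y_pred = [60, 70, 110, 70]
base_rmse = rmse(y_true, y_pred)
base_r2 = r2(y_true, y_pred)

# Add one extreme listing that the model badly underestimates
outlier_rmse = rmse(y_true + [7372], y_pred + [150])
print(base_rmse, base_r2, outlier_rmse)
```

Because the errors are squared before averaging, a single listing in the long tail can dominate the RMSE of the whole test set.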

Random forest

# Random forest model
def random_forest(df):
    '''
    Build the model and return the evaluation results.
    Input: data as a DataFrame
    Output: evaluation metrics (RMSE and R-squared), best parameters and feature importance
    '''
    X = df.drop(['price'], axis=1)
    X_columns = X.columns
    y = df['price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Random forest model
    clf = RandomForestRegressor()

    # Candidate parameters
    parameters = {
        'n_estimators': [50, 100, 200, 300, 400],
        'max_depth': [80, 90, 100]
    }

    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)
    model = cv
    model.fit(X_train, y_train)

    # Predict on the test set
    pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, pred)
    rmse = mse ** .5
    r2 = r2_score(y_test, pred)

    # Best hyperparameters
    best_par = model.best_params_

    # Feature importance (permutation importance on the test set)
    r = permutation_importance(model, X_test, y_test,
                               n_repeats=10,
                               random_state=0)
    perm = pd.DataFrame(columns=['AVG_Importance'], index=[i for i in X_train.columns])
    perm['AVG_Importance'] = r.importances_mean
    perm = perm.sort_values(by='AVG_Importance', ascending=False)

    return rmse, r2, best_par, perm

# Run the model
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(model_df)

The run produces the following output:

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV 1/5] END ..................max_depth=80, n_estimators=50; total time=   2.4s
[CV 2/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 4/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 5/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 1/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 2/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 3/5] END .................max_depth=80, n_estimators=100; total time=   3.9s
[CV 4/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 5/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 1/5] END .................max_depth=80, n_estimators=200; total time=   7.5s
[CV 2/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 3/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 4/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 5/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 1/5] END .................max_depth=80, n_estimators=300; total time=  11.3s
[CV 2/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 3/5] END .................max_depth=80, n_estimators=300; total time=  11.7s
[CV 4/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 5/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 1/5] END .................max_depth=80, n_estimators=400; total time=  15.1s
[CV 2/5] END .................max_depth=80, n_estimators=400; total time=  16.4s
[CV 3/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 4/5] END .................max_depth=80, n_estimators=400; total time=  15.2s
[CV 5/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 1/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 2/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 4/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 5/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 1/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 2/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 3/5] END .................max_depth=90, n_estimators=100; total time=   4.0s
[CV 4/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 5/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 1/5] END .................max_depth=90, n_estimators=200; total time=   8.7s
[CV 2/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 3/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 4/5] END .................max_depth=90, n_estimators=200; total time=   7.7s
[CV 5/5] END .................max_depth=90, n_estimators=200; total time=   8.0s
[CV 1/5] END .................max_depth=90, n_estimators=300; total time=  11.6s
[CV 2/5] END .................max_depth=90, n_estimators=300; total time=  11.8s
[CV 3/5] END .................max_depth=90, n_estimators=300; total time=  12.2s
[CV 4/5] END .................max_depth=90, n_estimators=300; total time=  12.0s
[CV 5/5] END .................max_depth=90, n_estimators=300; total time=  13.2s
[CV 1/5] END .................max_depth=90, n_estimators=400; total time=  15.6s
[CV 2/5] END .................max_depth=90, n_estimators=400; total time=  15.9s
[CV 3/5] END .................max_depth=90, n_estimators=400; total time=  16.1s
[CV 4/5] END .................max_depth=90, n_estimators=400; total time=  15.7s
[CV 5/5] END .................max_depth=90, n_estimators=400; total time=  15.8s
[CV 1/5] END .................max_depth=100, n_estimators=50; total time=   1.9s
[CV 2/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 3/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 4/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 5/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 1/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 2/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 3/5] END ................max_depth=100, n_estimators=100; total time=   4.1s
[CV 4/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 5/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 1/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 2/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 3/5] END ................max_depth=100, n_estimators=200; total time=   8.1s
[CV 4/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 5/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 1/5] END ................max_depth=100, n_estimators=300; total time=  11.8s
[CV 2/5] END ................max_depth=100, n_estimators=300; total time=  12.0s
[CV 3/5] END ................max_depth=100, n_estimators=300; total time=  12.8s
[CV 4/5] END ................max_depth=100, n_estimators=300; total time=  11.4s
[CV 5/5] END ................max_depth=100, n_estimators=300; total time=  11.5s
[CV 1/5] END ................max_depth=100, n_estimators=400; total time=  15.1s
[CV 2/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 3/5] END ................max_depth=100, n_estimators=400; total time=  15.6s
[CV 4/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 5/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
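The first line of the log reports 15 candidates and 75 fits. The sketch below reproduces that arithmetic with the standard library: a grid search tries every combination of the listed values, and each combination is fitted once per cross-validation fold. The variable names are chosen for illustration.

```python
from itertools import product

# The grid that the log above corresponds to
param_grid = {
    "n_estimators": [50, 100, 200, 300, 400],
    "max_depth": [80, 90, 100],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```

The cost grows multiplicatively with each added parameter, which is worth keeping in mind given that a single 400-tree fit already takes about 15 seconds in the log.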

The final random forest results are:

r_forest_rmse, r_forest_r2
# (218.7941962807868, 0.4208644494689676)

GBDT

def GBDT_model(df):
    '''
    Build the model and return the evaluation results.
    Input: data as a DataFrame
    Output: evaluation metrics (R-squared, MSE, RMSE) and feature importance
    '''
    X = df.drop(['price'], axis=1)
    Y = df['price']
    X_columns = X.columns
    X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

    clf = GradientBoostingRegressor()

    # Candidate parameters
    parameters = {
        'learning_rate': [0.1, 0.5, 1],
        'min_samples_leaf': [10, 20, 40, 60]
    }

    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)
    model = cv
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse ** .5

    # Top 10 features by importance
    coefficients = model.best_estimator_.feature_importances_
    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, index=X_columns,
                                      columns=['importance']).sort_values('importance', ascending=False)[:10]

    return r2, mse, rmse, feature_importance

GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(model_df)
GBDT_r2, GBDT_rmse
# (0.46352992147034244, 210.58063809645563)

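Results & analysis

So far the random forest performs most consistently. The GradientBoostingRegressor ensemble reaches a higher R² (0.46 versus 0.42), yet its RMSE of about 211 is still large compared with a median price of 79. Boosting models are strongly affected by outliers, so this is probably caused by the outliers in the dataset.

Next we try an optimisation: remove the outliers from the dataset and see whether model performance improves.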

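Improving performance

The outliers were already identified earlier. We handle them with a statistical method.

# Compute the upper price bound with the IQR rule
q3, q1 = np.percentile(model_df['price'], [75, 25])
iqr = q3 - q1
q3 + (iqr * 1.5)
# Result: 245.0

We treat any price above the 245 bound as an outlier and remove it (the code keeps listings priced below 245).

new_model_df = model_df[model_df['price'] < 245]
# Plot the price distribution after filtering
sb.histplot(new_model_df['price'])
plt.title('New price distribution in the dataset')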
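The bound used above is the usual interquartile-range rule: upper fence = Q3 + 1.5 × (Q3 − Q1). Here is a minimal sketch on made-up prices; the helper name `iqr_upper_fence` and the sample values are assumptions for illustration.

```python
import numpy as np

def iqr_upper_fence(values, k: float = 1.5) -> float:
    """Upper outlier fence: Q3 + k * (Q3 - Q1)."""
    q1, q3 = np.percentile(values, [25, 75])
    return float(q3 + k * (q3 - q1))

# Hypothetical prices with one extreme listing
prices = np.array([40, 50, 60, 70, 80, 90, 100, 110, 120, 7372])

fence = iqr_upper_fence(prices)
kept = prices[prices < fence]  # same strict comparison as in the article
print(fence, kept.tolist())
```

Because quartiles are barely affected by extreme values, the fence stays close to the bulk of the data even when the maximum is very large.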

Re-run the algorithms

linear_feat_importance, linear_rmse, linear_r2 = linear_reg(new_model_df)
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(new_model_df)
GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(new_model_df)

The new results are as follows

Feature attribution

So, based on our model, which factors matter most when predicting the price of Airbnb listings in Greater Manchester?

r_feature_importance = r_forest_importance.reset_index()
r_feature_importance = r_feature_importance.rename(columns={'index': 'Feature'})
r_feature_importance[:15]

# Plot the 15 most important factors
r_feature_importance[:15].sort_values(by='AVG_Importance').plot(kind='barh', x='Feature', y='AVG_Importance', figsize=(8, 6));
plt.title('Top 15 Most Important Features');

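The important factors identified by our model include:

accommodates: the maximum number of guests the listing can accommodate.
bathrooms_new: the number of bathrooms that are not labelled as shared or private.
minimum_nights: the minimum number of nights that can be booked.
number_of_reviews: the total number of reviews.
Free street parking: the availability of free street parking is the amenity with the greatest influence on the model's pricing.
Gym: gym facilities.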
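The ranking above comes from permutation importance: shuffle one feature at a time and measure how much the model's score drops. The following is a minimal NumPy sketch of that idea under stated assumptions: the data is synthetic, a least-squares fit stands in for the random forest, and the function names are hypothetical. It is not the scikit-learn implementation used in the article.

```python
import numpy as np

def r2_score_np(y, pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def permutation_importance_np(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = r2_score_np(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - r2_score_np(y, predict(X_perm))
    return drops / n_repeats

# Synthetic data: the target depends strongly on column 0 and not on column 1
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = 30 * X[:, 0] + rng.normal(scale=1.0, size=300)

# Stand-in model: ordinary least squares with an intercept
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

def predict(data):
    return np.column_stack([np.ones(len(data)), data]) @ beta

importances = permutation_importance_np(predict, X, y)
print(importances)
```

A feature the model relies on produces a large score drop when shuffled, while an irrelevant feature leaves the score almost unchanged. Correlated features can share or mask each other's importance, so rankings like the one above are best read as indicative.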

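Summary & outlook

By mining, analysing and modelling the Airbnb data in depth, we built an AI-based understanding and price estimate for the homestay rental scenario. There is more that can be done to improve model performance and obtain more accurate estimates, for example:

More thorough feature engineering, building more effective features based on the business scenario.
Using models such as xgboost, lightgbm and catboost.
Tuning hyperparameters more deeply with methods such as Bayesian optimisation.
Introducing deep learning and neural network methods.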

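References

Data Science Tools Cheat Sheets | Pandas Cheat Sheet: https://www.showmeai.tech/article-detail/101
Illustrated Data Analysis: From Beginner to Expert tutorial series: https://www.showmeai.tech/tutorials/33
Machine Learning in Practice: A Hands-On Guide to Machine Learning series: https://www.showmeai.tech/tutorials/41
Machine Learning in Practice | Getting Started with SKLearn and Simple Use Cases: https://www.showmeai.tech/article-detail/202
Machine Learning in Practice | The Complete SKLearn Application Guide: https://www.showmeai.tech/article-detail/203
Machine Learning in Practice | A Complete Guide to Feature Engineering for Machine Learning: https://www.showmeai.tech/article-detail/208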