Author: 韩信子 @ShowMeAI
Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
Article link: https://www.showmeai.tech/article-detail/316
Notice: All rights reserved. To reprint, please contact the platform and the author, and cite the source.
Bookmark ShowMeAI for more great content

For travelers, accommodation is one of the biggest concerns. Abroad, the home-stay model represented by Airbnb has thoroughly changed the hotel industry, and many travelers now prefer booking an Airbnb over a hotel; on domestic platforms such as Meituan and Fliggy, large numbers of home-stay listings are available as well.

In today's open and transparent internet era, can we collect the data and build a machine learning model to predict listing prices, giving our own trips smarter information? We certainly can. Below, ShowMeAI uses Airbnb listing data for the Greater Manchester area (as of March 2022) to demonstrate the full process of data analysis, mining, and modeling; the same approach can be applied to the domestic platforms you are familiar with.

The business scenario and the Airbnb home-stay data below come from Inside Airbnb, which provides data and advocacy about Airbnb's impact on residential communities. The data can be obtained from the link above; you can also visit ShowMeAI's Baidu Netdisk address, where we have stored the project data for you.

Dataset download (Baidu Netdisk): reply 『实战』 to the official account 『ShowMeAI研究中心』, or click here to get the 『Airbnb民宿数据』 dataset for this article: [[22]基于Airbnb数据的民宿房价预测模型](https://www.showmeai.tech/art...)

ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub

Business Questions

Before starting to mine and model, we generally need a deep understanding of the business scenario and the data. We first summarize the business questions we care about in this scenario; the data analysis and mining below are organized around answering them.

  • Which areas or towns have the most Airbnb listings?
  • What is the most popular property type?
  • What are the price characteristics of Airbnb listings in Greater Manchester?
  • How are listings distributed across hosts?
  • What room types are available in Greater Manchester?
  • How should a machine learning model approach predicting Airbnb listing prices in this area?
  • Which features matter most when predicting the prices of Airbnb listings in Greater Manchester?

Data Loading and First Look

We first import the analysis, mining, and modeling libraries used in this project:

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm, trange
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.inspection import permutation_importance

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Next we load the Greater Manchester listing data:

gm_listings = pd.read_csv('gm_listings-2.csv')
gm_calendar = pd.read_csv('calendar-2.csv')
gm_reviews = pd.read_csv('reviews-2.csv')

Basic information about the data:

gm_listings.head()

gm_listings.shape
# (3584, 74)
gm_listings.columns

gm_calendar.head()

gm_reviews.head()

From this first look, the Greater Manchester listing dataset contains 3584 rows and 74 columns, covering information about hosts, property types, neighbourhoods, and ratings.

Data Cleaning

Data cleaning is a core step of the feature engineering stage in applied machine learning. For the methods and techniques involved, see the corresponding ShowMeAI tutorial:

  • Machine Learning in Practice | The Complete Guide to Feature Engineering

Column Cleaning

Because the data has many columns and some of them are messy, we need to do some cleanup work. The data contains several columns with URLs, which contribute little to the final prediction, so we drop them:

# Drop URL and other low-signal columns
def drop_function(df):
    df = df.drop(columns=['listing_url', 'description', 'host_thumbnail_url', 'host_picture_url',
                          'latitude', 'longitude', 'picture_url', 'host_url', 'host_location',
                          'neighbourhood', 'neighbourhood_cleansed', 'host_about', 'has_availability',
                          'availability_30', 'availability_60', 'availability_90', 'availability_365',
                          'calendar_last_scraped'])
    return df

gm_df = drop_function(gm_listings)

The data is much cleaner after these columns are removed.

Handling Missing Values

The data also contains missing values, which we analyze and handle:

# Percentage of missing values per column
(gm_df.isnull().sum() / gm_df.shape[0]) * 100

This produces the following result:

id                                                0.000000
scrape_id                                         0.000000
last_scraped                                      0.000000
name                                              0.000000
neighborhood_overview                            41.266741
host_id                                           0.000000
host_name                                         0.000000
host_since                                        0.000000
host_response_time                               10.212054
host_response_rate                               10.212054
host_acceptance_rate                              5.636161
host_is_superhost                                 0.000000
host_neighbourhood                               91.657366
host_listings_count                               0.000000
host_total_listings_count                         0.000000
host_verifications                                0.000000
host_has_profile_pic                              0.000000
host_identity_verified                            0.000000
neighbourhood_group_cleansed                      0.000000
property_type                                     0.000000
room_type                                         0.000000
accommodates                                      0.000000
bathrooms                                       100.000000
bathrooms_text                                    0.306920
bedrooms                                          4.687500
beds                                              2.120536
amenities                                         0.000000
price                                             0.000000
minimum_nights                                    0.000000
maximum_nights                                    0.000000
minimum_minimum_nights                            0.000000
maximum_minimum_nights                            0.000000
minimum_maximum_nights                            0.000000
maximum_maximum_nights                            0.000000
minimum_nights_avg_ntm                            0.000000
maximum_nights_avg_ntm                            0.000000
calendar_updated                                100.000000
number_of_reviews                                 0.000000
number_of_reviews_ltm                             0.000000
number_of_reviews_l30d                            0.000000
first_review                                     19.810268
last_review                                      19.810268
review_scores_rating                             19.810268
review_scores_accuracy                           20.089286
review_scores_cleanliness                        20.089286
review_scores_checkin                            20.089286
review_scores_communication                      20.089286
review_scores_location                           20.089286
review_scores_value                              20.089286
license                                         100.000000
instant_bookable                                  0.000000
calculated_host_listings_count                    0.000000
calculated_host_listings_count_entire_homes      0.000000
calculated_host_listings_count_private_rooms     0.000000
calculated_host_listings_count_shared_rooms      0.000000
reviews_per_month                                19.810268
dtype: float64

We handle the missing values differently depending on the missing ratio; the code after this list implements all three strategies:

  • Columns with a very high missing ratio: license, calendar_updated, bathrooms, and host_neighbourhood are more than 90% NaN, and neighborhood_overview is 41% NaN and holds free text. We drop these columns outright.
  • Numeric columns with only a few missing values are filled with the column mean, which preserves their distributions. These include bedrooms, beds, review_scores_rating, review_scores_accuracy, and the other rating columns.
  • Categorical columns such as bathrooms_text and host_response_time are filled with the mode.

# Drop columns with a high missing ratio
def drop_function_2(df):
    df = df.drop(columns=['license', 'calendar_updated', 'bathrooms', 'host_neighbourhood',
                          'neighborhood_overview'])
    return df

gm_df = drop_function_2(gm_df)

# Fill numeric columns with the column mean
def input_mean(df, column_list):
    for columns in column_list:
        df[columns].fillna(value=df[columns].mean(), inplace=True)
    return df

column_list = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
               'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
               'review_scores_value', 'reviews_per_month',
               'bedrooms', 'beds']
gm_df = input_mean(gm_df, column_list)

# Fill categorical columns with the mode
def input_mode(df, column_list):
    for columns in column_list:
        df[columns].fillna(value=df[columns].mode()[0], inplace=True)
    return df

column_list = ['first_review', 'last_review', 'bathrooms_text', 'host_acceptance_rate',
               'host_response_rate', 'host_response_time']
gm_df = input_mode(gm_df, column_list)

Field Encoding

Columns such as host_is_superhost and has_availability store the strings 't' (true) or 'f' (false); we encode them as 1 or 0.

gm_df = gm_df.replace({'host_is_superhost': 't', 'host_has_profile_pic': 't', 'host_identity_verified': 't',
                       'has_availability': 't', 'instant_bookable': 't'}, 1)
gm_df = gm_df.replace({'host_is_superhost': 'f', 'host_has_profile_pic': 'f', 'host_identity_verified': 'f',
                       'has_availability': 'f', 'instant_bookable': 'f'}, 0)

We check the distribution after the replacement:

gm_df['host_is_superhost'].value_counts()

Field Type Conversion

The price-related column is still stored as strings containing symbols such as "$"; we clean it and convert it to a numeric type.

def string_to_int(df, column):
    # Strip the currency symbol and thousands separators
    # (regex=False so "$" is treated literally, not as a regex end-of-string anchor)
    df[column] = df[column].str.replace("$", "", regex=False)
    df[column] = df[column].str.replace(",", "", regex=False)
    # Convert to a numeric type
    df[column] = pd.to_numeric(df[column]).astype(int)
    return df

gm_df = string_to_int(gm_df, 'price')

Encoding List-valued Fields

Fields such as host_verifications and amenities hold list-like values; we encode them with dummy variables.

# Inspect the list-valued fields
gm_df_copy = gm_df.copy()
gm_df_copy['amenities'].head()

gm_df_copy['host_verifications'].head()

# Dummy-variable encoding
# (regex=False so the bracket characters are replaced literally; "[" is a regex metacharacter)
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('"', '', regex=False)
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace(']', '', regex=False)
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('[', '', regex=False)
df_amenities = gm_df_copy['amenities'].str.get_dummies(sep=",")

gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace("'", '', regex=False)
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace(']', '', regex=False)
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace('[', '', regex=False)
df_host_ver = gm_df_copy['host_verifications'].str.get_dummies(sep=",")

The encoded results look like this:

df_amenities.head()
df_host_ver.head()

# Drop the original columns
gm_df = gm_df.drop(['host_verifications', 'amenities'], axis=1)

Exploratory Data Analysis

Next we carry out a more comprehensive exploratory data analysis.

For the tool libraries used in the EDA below, you can refer to ShowMeAI's cheat sheets and tutorials to learn and apply them quickly:

  • Data Science Cheat Sheets | Pandas Cheat Sheet
  • Illustrated Data Analysis: From Beginner to Master Tutorial Series

Which neighbourhoods have the most listings?

gm_df['neighbourhood_group_cleansed'].value_counts()

bar_data = gm_df['neighbourhood_group_cleansed'].value_counts().sort_values()

# Build a new dataframe from bar_data
bar_data = pd.DataFrame(bar_data).reset_index()
bar_data['size'] = bar_data['neighbourhood_group_cleansed'] / gm_df['neighbourhood_group_cleansed'].count()

# Sort (no-op: the result is not assigned; the data was already sorted above)
bar_data.sort_values(by='size', ascending=False)
bar_data = bar_data.rename(columns={'index': 'Towns',
                                    'neighbourhood_group_cleansed': 'number_of_listings',
                                    'size': 'fraction_of_total'})

# Plot
# plt.figure(figsize=(10,10));
bar_data.plot(kind='barh', x='Towns', y='fraction_of_total', figsize=(8, 6))
plt.title('Towns with the Most listings');
plt.xlabel('Fraction of Total Listings');

The town of Manchester holds the majority of listings in Greater Manchester, 53% of the total (1849), followed by Salford at 17% and Trafford at 9%.

Price distribution of Airbnb listings in Greater Manchester

gm_df['price'].mean(), gm_df['price'].min(), gm_df['price'].max(), gm_df['price'].median()
# (143.47600446428572, 8, 7372, 79.0)

The average Airbnb listing price is $143, the median is $79, and the highest price observed in the dataset is $7372.

# Define price buckets
labels = ['$0 - $100', '$100 - $200', '$200 - $300', '$300 - $400', '$400 - $500', '$500 - $1000', '$1000 - $8000']
price_cuts = pd.cut(gm_df['price'], bins=[0, 100, 200, 300, 400, 500, 1000, 8000], right=True, labels=labels)

# Build a dataframe from the price buckets
price_clusters = pd.DataFrame(price_cuts).rename(columns={'price': 'price_clusters'})

# Concatenate onto the original dataframe
gm_df = pd.concat([gm_df, price_clusters], axis=1)

# Plot the distribution
def price_cluster_plot(df, column, title):
    plt.figure(figsize=(8, 6));
    yx = sb.histplot(data=df[column]);
    total = float(df[column].count())
    for p in yx.patches:
        width = p.get_width()
        height = p.get_height()
        yx.text(p.get_x() + p.get_width() / 2., height + 5,
                '{:1.1f}%'.format((height / total) * 100), ha='center')
    yx.set_title(title);
    plt.xticks(rotation=90)
    return yx

price_cluster_plot(gm_df, column='price_clusters',
                   title='Price distribution of Airbnb Listings in the Greater Manchester Area');

The analysis and visualization show that 65.4% of all listings are priced between $0 and $100, and another 23.4% between $100 and $200. We also observe a pronounced long tail in the distribution; the extremely high prices can be treated as outliers, and they may affect our analysis.

What is the most popular property type?

# Aggregate and sort by review counts
ax = gm_df.groupby('property_type').agg(
    median_rating=('review_scores_rating', 'median'),
    number_of_reviews=('number_of_reviews', 'max')
).sort_values(by='number_of_reviews', ascending=False).reset_index()
ax.head()

Among the 10 property types with the most reviews, Entire rental unit has the highest review count, followed by Private room in rental unit.

# Visualization
bx = ax.loc[:10]
bx = sb.boxplot(data=bx, x='median_rating', y='property_type')
bx.set_xlim(4.5, 5)
plt.title('Most Enjoyed Property types');
plt.xlabel('Median Rating');
plt.ylabel('Property Type')

Host and Listing Distribution

# Hosts with the most listings
host_df = pd.DataFrame(gm_df['host_name'].value_counts() / gm_df['host_name'].count() * 100).reset_index()
host_df = host_df.rename(columns={'index': 'name', 'host_name': 'perc_count'})
host_df.head(10)

host_df['perc_count'].loc[:10].sum()

The analysis shows that the top 10 hosts by listing count hold 13.6% of all listings.

Room Type Distribution in Greater Manchester

gm_df['room_type'].value_counts()

# Plot the distribution
zx = sb.countplot(data=gm_df, x='room_type')
total = float(gm_df['room_type'].count())
for p in zx.patches:
    width = p.get_width()
    height = p.get_height()
    zx.text(p.get_x() + p.get_width() / 2., height + 5,
            '{:1.1f}%'.format((height / total) * 100), ha='center')
zx.set_title('Plot showing different type of rooms available');
plt.xlabel('Room')

Most rooms are Entire home/apartment, 60% of all listings, followed by Private room at 39%; Shared room and Hotel room account for 0.7% and 0.5% of listings respectively.

Machine Learning Modeling

Below we use regression modeling to estimate home-stay listing prices.

Feature Engineering

For feature engineering techniques, see the corresponding ShowMeAI tutorial:

  • Machine Learning in Practice | The Complete Guide to Feature Engineering

We first apply feature engineering to the raw data to produce features suitable for modeling.

# Inspect the current dataset
gm_df.head()

# Regression dataset
gm_regression_df = gm_df.copy()

# Drop columns not useful for modeling
gm_regression_df = gm_regression_df.drop(columns=['id', 'scrape_id', 'last_scraped', 'name', 'host_id',
                                                  'host_since', 'first_review', 'last_review',
                                                  'price_clusters', 'host_name'])

# Inspect the data again
gm_regression_df.head()

We notice that the host_response_rate and host_acceptance_rate columns carry a percent sign, so a little more cleaning is needed.

# Strip the percent sign and convert to a numeric type
gm_regression_df['host_response_rate'] = gm_regression_df['host_response_rate'].str.replace("%", "")
gm_regression_df['host_acceptance_rate'] = gm_regression_df['host_acceptance_rate'].str.replace("%", "")

# Convert to int
gm_regression_df['host_response_rate'] = pd.to_numeric(gm_regression_df['host_response_rate']).astype(int)
gm_regression_df['host_acceptance_rate'] = pd.to_numeric(gm_regression_df['host_acceptance_rate']).astype(int)

# Check the converted result
gm_regression_df['host_response_rate'].head()

The bathrooms_text column mixes numbers and text, so we process it as well.

# Inspect the raw field
gm_regression_df['bathrooms_text'].value_counts()

# Split shared/private bathroom info into separate columns
def split_bathroom(df, column, text, new_column):
    df_2 = df[df[column].str.contains(text, case=False)]
    df.loc[df[column].str.contains(text, case=False), new_column] = df_2[column]
    return df

# Apply the function
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='shared', new_column='shared_bath')
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='private', new_column='private_bath')

# Inspect the shared_bath column
gm_regression_df['shared_bath'].value_counts()

# Inspect the private_bath column
gm_regression_df['private_bath'].value_counts()

# Normalize the bathroom descriptions
# (regex=True is required when case=False is passed; the patterns contain no regex metacharacters)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("private bath", "pb", case=False, regex=True)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("private baths", "pbs", case=False, regex=True)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("shared bath", "sb", case=False, regex=True)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("shared baths", "sb", case=False, regex=True)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("shared half-bath", "sb", case=False, regex=True)
gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace("private half-bath", "sb", case=False, regex=True)

gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='bath', new_column='bathrooms_new')

# Keep only the leading token (the number) of each value
# (fix: select column 0 of the expanded split instead of assigning a whole dataframe)
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].str.split(" ", expand=True)[0]
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].str.split(" ", expand=True)[0]
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].str.split(" ", expand=True)[0]

# Fill missing values with 0
gm_regression_df = gm_regression_df.fillna(0)
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].replace(to_replace='Shared', value=0.5)
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].replace(to_replace='Private', value=0.5)
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].replace(to_replace='Half-bath', value=0.5)

# Convert to numeric types
gm_regression_df['shared_bath'] = pd.to_numeric(gm_regression_df['shared_bath']).astype(int)
gm_regression_df['private_bath'] = pd.to_numeric(gm_regression_df['private_bath']).astype(int)
gm_regression_df['bathrooms_new'] = pd.to_numeric(gm_regression_df['bathrooms_new']).astype(int)

# Inspect the processed columns
gm_regression_df[['shared_bath', 'private_bath', 'bathrooms_new']].head()

Next we encode the categorical fields; depending on each field's meaning, we use ordinal encoding or one-hot encoding. Note that ordinal encoding imposes an arbitrary order on unordered categories; tree-based models tolerate this well, while linear models are more sensitive to it.

# Ordinal encoding
def encoder(df):
    for column in df[['neighbourhood_group_cleansed', 'property_type']].columns:
        labels = df[column].astype('category').cat.categories.tolist()
        replace_map = {column: {k: v for k, v in zip(labels, list(range(1, len(labels) + 1)))}}
        df.replace(replace_map, inplace=True)
        print(replace_map)
    return df

gm_regression_df = encoder(gm_regression_df)

For the host_response_time and room_type fields, we use one-hot encoding (dummy variables):

host_dummy = pd.get_dummies(gm_regression_df['host_response_time'], prefix='host_response')
room_dummy = pd.get_dummies(gm_regression_df['room_type'], prefix='room_type')

# Concatenate the encoded columns
gm_regression_df = pd.concat([gm_regression_df, host_dummy, room_dummy], axis=1)

# Drop the original columns
gm_regression_df = gm_regression_df.drop(columns=['host_response_time', 'room_type'], axis=1)

We then process the df_amenities table built earlier a little further and join it onto the feature set:

# Count each amenity's frequency and keep the 150 most common ones
# (two small fixes to the original: name the summed column explicitly, and take the
# amenity names via .index; sorting by frequency keeps the most common amenities)
df_3 = df_amenities.sum().sort_values(ascending=False).to_frame('amenities')
features = df_3['amenities'][:150].index.to_list()
amenities_updated = df_amenities.filter(items=features)
gm_regression_df = pd.concat([gm_regression_df, amenities_updated], axis=1)

Check the final dimensions of the data:

gm_regression_df.shape
# (3584, 198)

We end up with 198 columns. To avoid multicollinearity among features, we use the variance inflation factor (VIF) to select features for the machine learning models. Features with a VIF above 10 are removed, since their variance can be represented and explained by other features in the dataset.
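For reference, the VIF of feature i is defined as VIF_i = 1 / (1 − R_i²), where R_i² is the R-squared obtained by regressing feature i on all the other features. A VIF of 1 means no collinearity at all, while a VIF above 10 means the feature's variance is almost entirely explained by the remaining features, which is exactly why such features are dropped below.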

# Compute VIF for every candidate feature
vif_model = gm_regression_df.drop(['price'], axis=1)
vif_df = pd.DataFrame()
vif_df['feature'] = vif_model.columns
vif_df['VIF'] = [variance_inflation_factor(vif_model.values, i) for i in range(len(vif_model.columns))]

# Keep features with VIF <= 10
vif_df_new = vif_df[vif_df['VIF'] <= 10]
feature_list = vif_df_new['feature'].to_list()

# Select the corresponding data for these features
model_df = gm_regression_df.filter(items=feature_list)
model_df.head()

We join the price target column back on to build the complete modeling dataset:

price_col = gm_regression_df['price']
model_df = model_df.join(price_col)

Machine Learning Algorithms

Here we try several typical regression algorithms: LinearRegression, RandomForestRegressor, Lasso, and GradientBoostingRegressor.

For how to apply these machine learning algorithms, see the corresponding ShowMeAI tutorials and articles:

  • Machine Learning in Practice: A Hands-on Tutorial Series
  • Machine Learning in Practice | Getting Started with SKLearn and Simple Application Cases
  • Machine Learning in Practice | The Complete SKLearn Application Guide

Linear Regression Modeling

def linear_reg(df, test_size=0.3, random_state=42):
    '''
    Build the model and return evaluation results
    Input: dataframe
    Output: feature importances and evaluation metrics (RMSE and R-squared)
    '''
    X = df.drop(columns=['price'])
    y = df[['price']]
    X_columns = X.columns

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Linear regression model
    clf = LinearRegression()

    # Candidate parameters (note: n_jobs only affects speed, not fit quality)
    parameters = {
        'n_jobs': [1, 2, 5, 10, 100],
        'fit_intercept': [True, False]
    }

    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=3, verbose=3)
    cv.fit(X_train, y_train)

    # Predict on the test set
    pred = cv.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse ** .5

    # Best parameters
    best_par = cv.best_params_
    coefficients = cv.best_estimator_.coef_

    # Feature importance (absolute coefficients)
    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, columns=X_columns).T
    feature_importance.columns = ['importance']
    feature_importance = feature_importance.sort_values('importance', ascending=False)

    print("The model performance for testing set")
    print("--------------------------------------")
    print('RMSE is {}'.format(rmse))
    print('R2 score is {}'.format(r2))
    print("\n")

    return feature_importance, rmse, r2

linear_feat_importance, linear_rmse, linear_r2 = linear_reg(model_df)
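For reading the metrics this function prints (standard definitions, stated here for reference): RMSE = sqrt((1/n) · Σ(y_i − ŷ_i)²) is in the same units as the target, i.e. dollars of price error, and R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)² is the fraction of price variance the model explains (1 is perfect; 0 is no better than always predicting the mean price).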

Random Forest Modeling

# Random forest modeling
def random_forest(df):
    '''
    Build the model and return evaluation results
    Input: dataframe
    Output: feature importances and evaluation metrics (RMSE and R-squared)
    '''
    X = df.drop(['price'], axis=1)
    X_columns = X.columns
    y = df['price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Random forest model
    clf = RandomForestRegressor()

    # Candidate parameters (the original listed 'max_depth' twice; only the
    # second list, [80, 90, 100], takes effect, as the CV log below confirms)
    parameters = {
        'n_estimators': [50, 100, 200, 300, 400],
        'max_depth': [80, 90, 100]
    }

    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)
    model = cv
    model.fit(X_train, y_train)

    # Predict on the test set
    pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, pred)
    rmse = mse ** .5
    r2 = r2_score(y_test, pred)

    # Best hyperparameters
    best_par = model.best_params_

    # Permutation feature importance
    r = permutation_importance(model, X_test, y_test,
                               n_repeats=10,
                               random_state=0)
    perm = pd.DataFrame(columns=['AVG_Importance'], index=[i for i in X_train.columns])
    perm['AVG_Importance'] = r.importances_mean
    perm = perm.sort_values(by='AVG_Importance', ascending=False)

    return rmse, r2, best_par, perm

# Run the model
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(model_df)

The run produces the following log:

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV 1/5] END ..................max_depth=80, n_estimators=50; total time=   2.4s
[CV 2/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 4/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 5/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 1/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 2/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
... (similar CV logs for the remaining parameter combinations) ...
[CV 3/5] END ................max_depth=100, n_estimators=400; total time=  15.6s
[CV 4/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 5/5] END ................max_depth=100, n_estimators=400; total time=  15.3s

The final random forest results:

r_forest_rmse, r_forest_r2
# (218.7941962807868, 0.4208644494689676)

GBDT Modeling

def GBDT_model(df):
    '''
    Build the model and return evaluation results
    Input: dataframe
    Output: feature importances and evaluation metrics (RMSE and R-squared)
    '''
    X = df.drop(['price'], axis=1)
    Y = df['price']
    X_columns = X.columns
    X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

    clf = GradientBoostingRegressor()

    parameters = {
        'learning_rate': [0.1, 0.5, 1],
        'min_samples_leaf': [10, 20, 40, 60]
    }
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)

    model = cv
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse ** .5

    # Top 10 features by impurity-based importance
    coefficients = model.best_estimator_.feature_importances_
    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, index=X_columns,
                                      columns=['importance']).sort_values('importance', ascending=False)[:10]

    return r2, mse, rmse, feature_importance

GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(model_df)
GBDT_r2, GBDT_rmse
# (0.46352992147034244, 210.58063809645563)

Results & Analysis

So far the random forest is the most stable performer, while the GradientBoostingRegressor achieves the higher R² (0.46); still, the RMSE values of both models remain high. Boosting models are strongly affected by outliers, so the extreme values in the dataset are a likely cause.
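As an aside (a sketch only, not part of the original pipeline): instead of deleting outliers, gradient boosting can also be made more robust to them by switching to sklearn's built-in Huber loss, which penalizes extreme errors linearly rather than quadratically:

# Sketch: an outlier-robust GBDT via the Huber loss
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X = model_df.drop(['price'], axis=1)
y = model_df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# alpha is the quantile where the loss switches from squared to linear error
robust_gbdt = GradientBoostingRegressor(loss='huber', alpha=0.9, random_state=42)
robust_gbdt.fit(X_train, y_train)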

Next, let's try an optimization: remove the outliers from the dataset and see whether model performance improves.

Improving the Results

The outliers were already identified earlier; we handle them with a statistical rule based on the interquartile range (IQR): values above Q3 + 1.5 × IQR are treated as outliers.

# Compute the price boundary with the IQR rule
q3, q1 = np.percentile(model_df['price'], [75, 25])
iqr = q3 - q1
q3 + (iqr * 1.5)
# Result: 245.0

We treat any price above $245 as an outlier and remove it.

new_model_df = model_df[model_df['price'] < 245]

# Plot the new price distribution
sb.histplot(new_model_df['price'])
plt.title('New price distribution in the dataset')

Re-run the algorithms:

linear_feat_importance, linear_rmse, linear_r2 = linear_reg(new_model_df)
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(new_model_df)
GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(new_model_df)

The new results are as follows.

Feature Importance Analysis

So, according to our model, which factors matter most when predicting the prices of Airbnb listings in Greater Manchester?

r_feature_importance = r_forest_importance.reset_index()
r_feature_importance = r_feature_importance.rename(columns={'index': 'Feature'})
r_feature_importance[:15]

# Plot the 15 most important features
r_feature_importance[:15].sort_values(by='AVG_Importance').plot(kind='barh', x='Feature',
                                                                y='AVG_Importance', figsize=(8, 6));
plt.title('Top 15 Most Important Features');

The most important factors according to our model include:

  • accommodates: the maximum number of guests a listing can host.
  • bathrooms_new: the number of bathrooms that are neither shared nor private.
  • minimum_nights: the minimum number of nights for a booking.
  • number_of_reviews: the total number of reviews.
  • Free street parking: free street parking is the amenity with the largest impact on the model's price predictions.
  • Gym: gym facilities.

Summary & Outlook

By mining, analyzing, and modeling the Airbnb data in depth, we have built an AI understanding and price-prediction pipeline for the home-stay rental scenario. There are more things we could do afterwards to improve model performance and make predictions more accurate, for example:

  • More complete feature engineering, building more effective features grounded in the business scenario.
  • Trying models such as xgboost, lightgbm, and catboost (see the sketch after this list).
  • Deeper hyperparameter tuning, e.g. with Bayesian optimization.
  • Introducing deep learning and neural network methods.
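As a quick illustration of the model-swap idea above, here is a minimal sketch using LightGBM. It assumes the lightgbm package is installed and reuses new_model_df from the outlier-removal step; the hyperparameter values are placeholders, not tuned results.

# Minimal LightGBM sketch (assumes: pip install lightgbm; new_model_df from above)
import re
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

X = new_model_df.drop(columns=['price'])
# LightGBM rejects feature names containing JSON special characters,
# so sanitize the amenity-derived column names first
X.columns = [re.sub(r'[^0-9a-zA-Z_]', '_', c) for c in X.columns]
y = new_model_df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Placeholder hyperparameters; these would still need proper tuning
lgbm = LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
lgbm.fit(X_train, y_train)

pred = lgbm.predict(X_test)
print('R2  :', r2_score(y_test, pred))
print('RMSE:', mean_squared_error(y_test, pred) ** 0.5)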

References

  • Data Science Cheat Sheets | Pandas Cheat Sheet: https://www.showmeai.tech/article-detail/101
  • Illustrated Data Analysis: From Beginner to Master Tutorial Series: https://www.showmeai.tech/tutorials/33
  • Machine Learning in Practice: A Hands-on Tutorial Series: https://www.showmeai.tech/tutorials/41
  • Machine Learning in Practice | Getting Started with SKLearn and Simple Application Cases: https://www.showmeai.tech/article-detail/202
  • Machine Learning in Practice | The Complete SKLearn Application Guide: https://www.showmeai.tech/article-detail/203
  • Machine Learning in Practice | The Complete Guide to Feature Engineering: https://www.showmeai.tech/article-detail/208