Introduction

To get more fluent with pandas in practical data analysis, today we will walk through how to use pandas to analyze restaurant rating data.

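About the restaurant rating data

The data comes from the UCI ML Repository. It contains a little over a thousand records with 5 attributes:

userID: user ID

placeID: restaurant ID

rating: overall rating

food_rating: food rating

service_rating: service rating

We use pandas to read the data: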

import numpy as np
import pandas as pd

path = '../data/restaurant_rating_final.csv'
df = pd.read_csv(path)
df
     userID  placeID  rating  food_rating  service_rating
0     U1077   135085       2            2               2
1     U1077   135038       2            2               1
2     U1077   132825       2            2               2
3     U1077   135060       1            2               2
4     U1068   135104       1            1               2
...     ...      ...     ...          ...             ...
1156  U1043   132630       1            1               1
1157  U1011   132715       1            1               0
1158  U1068   132733       1            1               0
1159  U1068   132594       1            1               1
1160  U1068   132660       0            0               0

1161 rows × 5 columns
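Before aggregating, it can help to confirm that the file parsed the way the table above suggests. The following is only a sketch: it inlines the first five rows shown above as CSV text (so it runs without the file on disk), and the names `sample_csv`, `sample` and `rating_cols` are made up for illustration.

```python
import io

import pandas as pd

# First five rows of the table above, inlined so the sketch is self-contained.
sample_csv = """userID,placeID,rating,food_rating,service_rating
U1077,135085,2,2,2
U1077,135038,2,2,1
U1077,132825,2,2,2
U1077,135060,1,2,2
U1068,135104,1,1,2
"""

sample = pd.read_csv(io.StringIO(sample_csv))

rating_cols = ['rating', 'food_rating', 'service_rating']

# Shape, column types, and missing values are cheap checks to run first.
print(sample.shape)
print(sample.dtypes)
print(sample[rating_cols].isna().sum())
```

The same three checks can be run on the full `df` once the real file has been loaded.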

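Analyzing the rating data

If we care about the overall rating and the food rating of each restaurant, we can start with the mean of these ratings per restaurant. Here we use the pivot_table method: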

mean_ratings = df.pivot_table(values=['rating', 'food_rating'],
                              index='placeID', aggfunc='mean')
mean_ratings[:5]
         food_rating  rating
placeID
132560          1.00    0.50
132561          1.00    0.75
132564          1.25    1.25
132572          1.00    1.00
132583          1.00    1.00
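For readers more used to groupby, the same per-restaurant means can be obtained without pivot_table. The sketch below uses a tiny made-up frame (`toy` is not the real dataset) to show that the two approaches give the same numbers; note that pivot_table orders the value columns alphabetically, which is why food_rating appears before rating in the output above.

```python
import numpy as np
import pandas as pd

# Made-up ratings for two hypothetical places, for illustration only.
toy = pd.DataFrame({
    'placeID': [1, 1, 2, 2, 2],
    'rating': [0, 1, 2, 2, 1],
    'food_rating': [1, 1, 2, 1, 0],
})

# Mean per place via pivot_table, as in the article.
by_pivot = toy.pivot_table(values=['rating', 'food_rating'],
                           index='placeID', aggfunc='mean')

# The same aggregation expressed with groupby.
by_groupby = toy.groupby('placeID')[['food_rating', 'rating']].mean()

print(by_pivot)
print(np.allclose(by_pivot.to_numpy(), by_groupby.to_numpy()))
```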

Next, let's count how many votes each placeID received:

ratings_by_place = df.groupby('placeID').size()
ratings_by_place[:10]
placeID
132560     4
132561     4
132564     4
132572    15
132583     4
132584     6
132594     5
132608     6
132609     5
132613     6
dtype: int64

If too few people voted, the resulting averages are not very objective, so let's keep only the restaurants with at least 4 votes:

active_place = ratings_by_place.index[ratings_by_place >= 4]
active_place
Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
            132609, 132613,
            ...
            135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
            135108, 135109],
           dtype='int64', name='placeID', length=124)
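The filter works by applying a boolean mask to the index of the count Series. A minimal sketch on made-up data (`toy`, `counts`, `active` and `active_rows` are illustrative names) shows the mechanism, including that `>= 4` keeps a place with exactly 4 votes. It also shows one possible alternative that filters the original rows directly with transform.

```python
import pandas as pd

# Made-up votes: place 10 has 4 votes, place 20 has 2, place 30 has 5.
toy = pd.DataFrame({
    'placeID': [10] * 4 + [20] * 2 + [30] * 5,
    'rating': [2, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1],
})

# Same pattern as the article: count votes, then mask the index.
counts = toy.groupby('placeID').size()
active = counts.index[counts >= 4]
print(active.tolist())

# Alternative: broadcast each group's vote count back onto its rows
# and keep only the rows that belong to well-voted places.
votes_per_row = toy.groupby('placeID')['rating'].transform('count')
active_rows = toy[votes_per_row >= 4]
print(sorted(active_rows['placeID'].unique().tolist()))
```

The index-based version is convenient when an already aggregated table has to be filtered, as the article does next; the transform version is handy when the raw rows are needed afterwards.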

Select the mean ratings for these restaurants:

mean_ratings = mean_ratings.loc[active_place]
mean_ratings
         food_rating    rating
placeID
132560      1.000000  0.500000
132561      1.000000  0.750000
132564      1.250000  1.250000
132572      1.000000  1.000000
132583      1.000000  1.000000
...              ...       ...
135088      1.166667  1.000000
135104      1.428571  0.857143
135106      1.200000  1.200000
135108      1.181818  1.181818
135109      1.250000  1.000000

124 rows × 2 columns

Sort by rating and select the 10 highest-rated restaurants:

top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
top_ratings[:10]
         food_rating    rating
placeID
132955      1.800000  2.000000
135034      2.000000  2.000000
134986      2.000000  2.000000
132922      1.500000  1.833333
132755      2.000000  1.800000
135074      1.750000  1.750000
135013      2.000000  1.750000
134976      1.750000  1.750000
135055      1.714286  1.714286
135075      1.692308  1.692308
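Sorting the whole table and slicing is one way to get a top-N list; `DataFrame.nlargest` is another. The sketch below compares the two on a small made-up table (`toy_means` is hypothetical, with distinct rating values). When several rows tie on the sort column, as several do in the output above, the two approaches may order the tied rows differently.

```python
import pandas as pd

# Made-up mean ratings for four hypothetical places.
toy_means = pd.DataFrame(
    {'food_rating': [1.8, 2.0, 1.5, 1.0],
     'rating': [2.0, 1.75, 1.25, 0.5]},
    index=pd.Index([101, 102, 103, 104], name='placeID'),
)

# Approach used in the article: sort everything, then slice.
top_sorted = toy_means.sort_values(by='rating', ascending=False)[:2]

# Alternative: ask directly for the n rows with the largest rating.
top_nlargest = toy_means.nlargest(2, 'rating')

print(top_sorted.index.tolist())
print(top_nlargest.index.tolist())
```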

We can also compute the difference between the mean overall rating and the mean food rating, and store it in a column named diff:

mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]
         food_rating    rating      diff
placeID
132667      2.000000  1.250000 -0.750000
132594      1.200000  0.600000 -0.600000
132858      1.400000  0.800000 -0.600000
135104      1.428571  0.857143 -0.571429
132560      1.000000  0.500000 -0.500000
135027      1.375000  0.875000 -0.500000
132740      1.250000  0.750000 -0.500000
134992      1.500000  1.000000 -0.500000
132706      1.250000  0.750000 -0.500000
132870      1.000000  0.600000 -0.400000

Reverse the order to get the 10 restaurants whose overall rating most exceeds their food rating:

sorted_by_diff[::-1][:10]
         food_rating    rating      diff
placeID
134987      0.500000  1.000000  0.500000
132937      1.000000  1.500000  0.500000
135066      1.000000  1.500000  0.500000
132851      1.000000  1.428571  0.428571
135049      0.600000  1.000000  0.400000
132922      1.500000  1.833333  0.333333
135030      1.333333  1.583333  0.250000
135063      1.000000  1.250000  0.250000
132626      1.000000  1.250000  0.250000
135000      1.000000  1.250000  0.250000
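The two tables above look at each end of the signed difference separately. If only the size of the gap matters, regardless of direction, one option is to order by the absolute value instead. This is a sketch on made-up numbers, not part of the original analysis; `toy_means`, `with_diff` and `by_gap` are illustrative names.

```python
import pandas as pd

# Made-up mean ratings for three hypothetical places.
toy_means = pd.DataFrame(
    {'food_rating': [2.0, 0.5, 1.0],
     'rating': [1.25, 1.0, 1.0]},
    index=pd.Index([201, 202, 203], name='placeID'),
)

# assign returns a new frame, leaving toy_means itself unchanged.
with_diff = toy_means.assign(
    diff=toy_means['rating'] - toy_means['food_rating'])

# Order rows by the magnitude of the gap, largest first.
order = with_diff['diff'].abs().sort_values(ascending=False).index
by_gap = with_diff.loc[order]

print(by_gap)
```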

Compute the standard deviation of rating per restaurant and select the 10 largest:

# Standard deviation of rating grouped by placeID
rating_std_by_place = df.groupby('placeID')['rating'].std()
# Filter down to active_place
rating_std_by_place = rating_std_by_place.loc[active_place]
# Order Series by value in descending order
rating_std_by_place.sort_values(ascending=False)[:10]
placeID
134987    1.154701
135049    1.000000
134983    1.000000
135053    0.991031
135027    0.991031
132847    0.983192
132767    0.983192
132884    0.983192
135082    0.971825
132706    0.957427
Name: rating, dtype: float64
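When reading these numbers, it is worth knowing that the pandas `std` method computes the sample standard deviation (ddof=1) by default. The sketch below uses made-up votes to show the difference from the population version, and why groups with very few votes are a problem: a group with a single vote has no sample standard deviation at all.

```python
import pandas as pd

# Made-up votes: place 1 has four votes split between 0 and 2,
# place 2 has a single vote.
toy = pd.DataFrame({
    'placeID': [1, 1, 1, 1, 2],
    'rating': [0, 0, 2, 2, 1],
})

grouped = toy.groupby('placeID')['rating']

sample_std = grouped.std()            # default ddof=1 (sample)
population_std = grouped.std(ddof=0)  # divide by n instead of n - 1

print(sample_std)
print(population_std)
```

For the four votes of place 1 the sample standard deviation is sqrt(4/3), about 1.154701. That is the same value as the largest entry in the output above, although the individual votes behind that entry are not shown here.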

This article is archived at http://www.flydean.com/02-pandas-restaurant/
