乐趣区

关于python:数据分析实际案例之pandas在餐厅评分数据中的使用

简介

为了更好的熟练掌握 pandas 在理论数据分析中的利用,明天咱们再介绍一下怎么应用 pandas 做美国餐厅评分数据的剖析。

餐厅评分数据简介

数据的起源是 UCI ML Repository,蕴含了一千多条数据,有 5 个属性,别离是:

userID:用户 ID

placeID:餐厅 ID

rating:总体评分

food_rating:食物评分

service_rating:服务评分

咱们应用 pandas 来读取数据:

import numpy as np

path = '../data/restaurant_rating_final.csv'
df = pd.read_csv(path)
df
userID placeID rating food_rating service_rating
0 U1077 135085 2 2 2
1 U1077 135038 2 2 1
2 U1077 132825 2 2 2
3 U1077 135060 1 2 2
4 U1068 135104 1 1 2
1156 U1043 132630 1 1 1
1157 U1011 132715 1 1 0
1158 U1068 132733 1 1 0
1159 U1068 132594 1 1 1
1160 U1068 132660 0 0 0

1161 rows × 5 columns

剖析评分数据

如果咱们关注的是不同餐厅的总评分和食物评分,咱们能够先看下这些餐厅评分的平均数,这里咱们应用 pivot_table 办法:

mean_ratings = df.pivot_table(values=['rating','food_rating'], index='placeID',
                                 aggfunc='mean')
mean_ratings[:5]
food_rating rating
placeID
132560 1.00 0.50
132561 1.00 0.75
132564 1.25 1.25
132572 1.00 1.00
132583 1.00 1.00

而后再看一下各个 placeID,投票人数的统计:

ratings_by_place = df.groupby('placeID').size()
ratings_by_place[:10]
placeID
132560     4
132561     4
132564     4
132572    15
132583     4
132584     6
132594     5
132608     6
132609     5
132613     6
dtype: int64

如果投票人数太少,那么这些数据其实是不主观的,咱们来筛选一下投票人数超过 4 个的餐厅:

active_place = ratings_by_place.index[ratings_by_place >= 4]
active_place
Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
            132609, 132613,
            ...
            135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
            135108, 135109],
           dtype='int64', name='placeID', length=124)

抉择这些餐厅的均匀评分数据:

mean_ratings = mean_ratings.loc[active_place]
mean_ratings
food_rating rating
placeID
132560 1.000000 0.500000
132561 1.000000 0.750000
132564 1.250000 1.250000
132572 1.000000 1.000000
132583 1.000000 1.000000
135088 1.166667 1.000000
135104 1.428571 0.857143
135106 1.200000 1.200000
135108 1.181818 1.181818
135109 1.250000 1.000000

124 rows × 2 columns

对 rating 进行排序,抉择评分最高的 10 个:

top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
top_ratings[:10]
food_rating rating
placeID
132955 1.800000 2.000000
135034 2.000000 2.000000
134986 2.000000 2.000000
132922 1.500000 1.833333
132755 2.000000 1.800000
135074 1.750000 1.750000
135013 2.000000 1.750000
134976 1.750000 1.750000
135055 1.714286 1.714286
135075 1.692308 1.692308

咱们还能够计算均匀总评分和均匀食物评分的差值,并以一栏 diff 进行保留:

mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']

sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]
food_rating rating diff
placeID
132667 2.000000 1.250000 -0.750000
132594 1.200000 0.600000 -0.600000
132858 1.400000 0.800000 -0.600000
135104 1.428571 0.857143 -0.571429
132560 1.000000 0.500000 -0.500000
135027 1.375000 0.875000 -0.500000
132740 1.250000 0.750000 -0.500000
134992 1.500000 1.000000 -0.500000
132706 1.250000 0.750000 -0.500000
132870 1.000000 0.600000 -0.400000

将数据进行反转,抉择差距最大的前 10:

sorted_by_diff[::-1][:10]
food_rating rating diff
placeID
134987 0.500000 1.000000 0.500000
132937 1.000000 1.500000 0.500000
135066 1.000000 1.500000 0.500000
132851 1.000000 1.428571 0.428571
135049 0.600000 1.000000 0.400000
132922 1.500000 1.833333 0.333333
135030 1.333333 1.583333 0.250000
135063 1.000000 1.250000 0.250000
132626 1.000000 1.250000 0.250000
135000 1.000000 1.250000 0.250000

计算 rating 的标准差,并抉择最大的前 10 个:

# Standard deviation of rating grouped by placeID
rating_std_by_place = df.groupby('placeID')['rating'].std()
# Filter down to active_titles
rating_std_by_place = rating_std_by_place.loc[active_place]
# Order Series by value in descending order
rating_std_by_place.sort_values(ascending=False)[:10]
placeID
134987    1.154701
135049    1.000000
134983    1.000000
135053    0.991031
135027    0.991031
132847    0.983192
132767    0.983192
132884    0.983192
135082    0.971825
132706    0.957427
Name: rating, dtype: float64

本文已收录于 http://www.flydean.com/02-pandas-restaurant/

最艰深的解读,最粗浅的干货,最简洁的教程,泛滥你不晓得的小技巧等你来发现!

欢送关注我的公众号:「程序那些事」, 懂技术,更懂你!

退出移动版