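Competition background
皇包车 (HI GUIDES) is a platform that offers Chinese-language chartered-car tours worldwide to outbound travellers from China. It works with 100,000 overseas Chinese driver-guides (司导) and covers more than 90 countries, 1,600 cities and 300 international airports. As of June 2017 it had served a cumulative 4 million outbound Chinese travellers.
As consumers' spending power grows and travel information becomes more transparent, traveller behaviour is getting harder to predict, and the fixed itineraries of traditional travel agencies no longer meet demand. Offering more popular and better-suited chartered-tour routes calls for big data: combining users' personal preferences, the popularity of attractions, weather, traffic and similar dimensions to design multiple data-driven travel solutions and products.
Competition page: https://www.dcjingsai.com/com…
Task
皇包车 provides more than 50,000 records of customers browsing the app. Some customers placed an order after browsing and used the boutique travel service; others did not order.
Participants analyse users' profile information and browsing behaviour in order to predict whether a user will buy the boutique travel service in the near future.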
Data import and preview
import pandas as pd
import numpy as np
from sklearn import preprocessing
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False
user_train = pd.read_csv(r'Data\trainingset\userProfile_train.csv')
action_train = pd.read_csv(r'Data\trainingset\action_train.csv')
comment_train = pd.read_csv(r'Data\trainingset\userComment_train.csv')
orderFuture_train= pd.read_csv(r'Data\trainingset\orderFuture_train.csv')
orderHistory_train= pd.read_csv(r'Data\trainingset\orderHistory_train.csv')
user_test = pd.read_csv(r'Data\test\userProfile_test.csv')
action_test = pd.read_csv(r'Data\test\action_test.csv')
comment_test = pd.read_csv(r'Data\test\userComment_test.csv')
orderFuture_test = pd.read_csv(r'Data\test\orderFuture_test.csv')
orderHistory_test = pd.read_csv(r'Data\test\orderHistory_test.csv')
user = pd.concat([user_train,user_test])
action = pd.concat([action_train,action_test])
comment = pd.concat([comment_train,comment_test])
orderHistory = pd.concat([orderHistory_train,orderHistory_test])
orderFuture = pd.concat([orderFuture_train,orderFuture_test])
user.head()
| | userid | gender | province | age |
|---|---|---|---|---|
| 0 | 100000000013 | 男 | NaN | 60 后 |
| 1 | 100000000111 | NaN | 上海 | NaN |
| 2 | 100000000127 | NaN | 上海 | NaN |
| 3 | 100000000231 | 男 | 北京 | 70 后 |
| 4 | 100000000379 | 男 | 北京 | NaN |
action.head()
| | userid | actionType | actionTime |
|---|---|---|---|
| 0 | 100000000013 | 1 | 1474300753 |
| 1 | 100000000013 | 5 | 1474300763 |
| 2 | 100000000013 | 6 | 1474300874 |
| 3 | 100000000013 | 5 | 1474300911 |
| 4 | 100000000013 | 6 | 1474300936 |
orderHistory.head()
| | userid | orderid | orderTime | orderType | city | country | continent |
|---|---|---|---|---|---|---|---|
| 0 | 100000000013 | 1000015 | 1481714516 | 0 | 柏林 | 德国 | 欧洲 |
| 1 | 100000000013 | 1000014 | 1501959643 | 0 | 旧金山 | 美国 | 北美洲 |
| 2 | 100000000393 | 1000033 | 1499440296 | 0 | 巴黎 | 法国 | 欧洲 |
| 3 | 100000000459 | 1000036 | 1480601668 | 0 | 纽约 | 美国 | 北美洲 |
| 4 | 100000000459 | 1000034 | 1479146723 | 0 | 巴厘岛 | 印度尼西亚 | 亚洲 |
orderFuture.head()
| | orderType | userid |
|---|---|---|
| 0 | 0.0 | 100000000013 |
| 1 | 0.0 | 100000000111 |
| 2 | 0.0 | 100000000127 |
| 3 | 0.0 | 100000000231 |
| 4 | 0.0 | 100000000379 |
comment.head()
| | userid | orderid | rating | tags | commentsKeyWords |
|---|---|---|---|---|---|
| 0 | 100000000013 | 1000015 | 4.0 | NaN | ['很','简陋','太','随便'] |
| 1 | 100000000231 | 1000024 | 5.0 | 提前联系\|耐心等候 | ['很','细心'] |
| 2 | 100000000471 | 1000038 | 5.0 | NaN | NaN |
| 3 | 100000000637 | 1000040 | 5.0 | 主动热情\|提前联系\|举牌迎接\|主动搬运行李 | NaN |
| 4 | 100000000755 | 1000045 | 1.0 | 未举牌服务 | NaN |
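EDA and visualization
User profile
The combined user table has 50,383 rows (40,307 from the training set plus 10,076 from the test set), uniquely keyed by userid. Missing values are widespread, especially in gender and age.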
user.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50383 entries, 0 to 10075
Data columns (total 4 columns):
userid 50383 non-null int64
gender 19769 non-null object
province 45484 non-null object
age 5961 non-null object
dtypes: int64(1), object(3)
memory usage: 1.9+ MB
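The non-null counts above imply that roughly 61% of gender values, 10% of province values and 88% of age values are missing. A minimal sketch of how such shares can be computed directly, using a small made-up frame (`toy_user` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the user table; the values are illustrative only.
toy_user = pd.DataFrame({
    "userid": [1, 2, 3, 4],
    "gender": ["男", np.nan, np.nan, "女"],
    "province": [np.nan, "上海", "上海", "北京"],
    "age": ["60后", np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy_user.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real `user` frame, the same expression would give the per-column missing share in one line.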
User distribution by province
(user.province.value_counts()/user.province.value_counts().sum()).head().sum()
0.7712162518687891
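Users are concentrated in developed regions such as Beijing, Shanghai, Guangdong, Jiangsu and Zhejiang; the five most common provinces account for about 77% of the users with a recorded province.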
fig,axes = plt.subplots(figsize=(20,10))
sns.countplot(x='province',data=user,order=user.province.value_counts().index.tolist())
<matplotlib.axes._subplots.AxesSubplot at 0x28b00198198>
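User gender
Gender is recorded for only 19,769 of the 50,383 users (15,760 presumably being the training-set figure); 54.7% are women and 45.3% are men.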
fig,axes = plt.subplots(1,2,figsize=(12,4))
user.gender.value_counts().plot.bar(ax=axes[0])
axes[0].set_xticklabels(['女','男'],rotation=0)
user.gender.value_counts().plot.pie(ax=axes[1],autopct='%.2f%%')
<matplotlib.axes._subplots.AxesSubplot at 0x28b02fbc208>
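User age
Age is recorded for only 5,961 users (4,742 presumably being the training-set figure), mostly in the 60 后, 70 后, 80 后 and 90 后 groups (born in the 1960s to 1990s).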
fig,axes = plt.subplots(figsize=(10,4))
user.age.value_counts().plot.bar()
plt.xticks(rotation=0)
(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)
Among users who provided their age, men outnumber women, and most of the data is still missing.
fig,axes = plt.subplots(figsize=(10,4))
sns.countplot(x='age',data=user,hue='gender')
<matplotlib.axes._subplots.AxesSubplot at 0x28b030b2c50>
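User browsing behaviour
There are nine action types: 1 is waking the app; 2 to 4 are product browsing, in no fixed order; 5 to 9 are sequential, running from filling in the form, through submitting the order, to final payment.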
import time
def time_convert(timestamp):
str_time =time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(timestamp))
return str_time
action.actionTime = action.actionTime.map(lambda x: time_convert(x))
action['year']=action.actionTime.str[:4]
action['month']=action.actionTime.str[5:7]
action['day']=action.actionTime.str[8:10]
action['date']=action.actionTime.str[:10]
action['time']=action.actionTime.str[11:]
action['year_month']=action.actionTime.str[:7]
action['hour']=action.actionTime.str[11:13]
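`time.localtime` converts using the time zone of the machine running the notebook, so the derived `hour` column can shift from one environment to another. A minimal sketch of the same conversion with an explicit zone, using two timestamps from the `action.head()` preview; the zone `Asia/Shanghai` and the frame name `toy_action` are assumptions, not something the source states:

```python
import pandas as pd

toy_action = pd.DataFrame({
    "userid": [100000000013, 100000000013],
    "actionTime": [1474300753, 1474300874],  # Unix seconds, taken from action.head()
})

# Parse as UTC first, then convert to an explicit zone (Asia/Shanghai is an assumption).
ts = pd.to_datetime(toy_action["actionTime"], unit="s", utc=True).dt.tz_convert("Asia/Shanghai")

# Derive the same helper columns as the string slicing above.
toy_action["date"] = ts.dt.strftime("%Y-%m-%d")
toy_action["year_month"] = ts.dt.strftime("%Y-%m")
toy_action["hour"] = ts.dt.hour
print(toy_action)
```

With this zone the two rows fall on 2016-09-19 (hour 23) and 2016-09-20 (hour 0), which suggests the original conversion ran on a machine set to UTC+8.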
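Monthly visits
MAU is measured as the number of distinct user IDs with any activity in the month, and PV as the number of app wake-ups (action type 1). Because the code below deduplicates on userid across the whole log, each user is counted only once, in the month of their first recorded action. Activity peaks in April to May and in October: the short holiday breaks are the preferred time for travelling abroad.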
fig,axes = plt.subplots(2,1,figsize=(10,10))
action[action['year_month'] !='2016-08'].drop_duplicates(['userid']).groupby('year_month').userid.count().plot(ax=axes[0])
axes[0].set_title('独立用户月访问量(MAU)')
action[action.actionType==1].groupby('year_month').userid.count().plot(ax=axes[1])
axes[1].set_title('用户月访问量(PV)')
Text(0.5, 1.0, '用户月访问量(PV)')
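`drop_duplicates(['userid'])` keeps only the first row for each user, so grouping afterwards counts users in the month they first appear. A minimal sketch contrasting that with a per-month distinct count via `nunique`, on made-up data (the frame `toy` and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "userid":     [1, 1, 2, 1, 2, 3],
    "year_month": ["2017-03", "2017-03", "2017-03", "2017-04", "2017-04", "2017-04"],
    "actionType": [1, 5, 1, 1, 1, 2],
})

# Distinct users active in each month.
mau = toy.groupby("year_month")["userid"].nunique()

# Users counted only in the month of their first recorded action.
first_seen = toy.drop_duplicates(["userid"]).groupby("year_month")["userid"].count()

# App wake-ups (actionType == 1) per month.
pv = toy[toy["actionType"] == 1].groupby("year_month")["userid"].count()

print(pd.DataFrame({"mau": mau, "first_seen": first_seen, "pv": pv}))
```

In the toy data, April has three active users but only one newly seen user, which shows how far the two measures can diverge.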
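Daily visits
DAU is the number of distinct user IDs active on a given day, and PV is the number of type-1 actions on that day. DAU peaks in early April, yet PV over the same period is lower than at its own peak in early May. This means the average number of wake-ups per user was low in early April, possibly because of a user-acquisition campaign. Dividing one metric by the other is a way to check this. Likewise, PV/DAU was relatively high before December 2016 and fairly stable afterwards, which suggests the app had entered a healthy, steady phase.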
action.head()
| | userid | actionType | actionTime | year | month | day | date | time | year_month | hour |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100000000013 | 1 | 2016-09-19 23:59:13 | 2016 | 09 | 19 | 2016-09-19 | 23:59:13 | 2016-09 | 23 |
| 1 | 100000000013 | 5 | 2016-09-19 23:59:23 | 2016 | 09 | 19 | 2016-09-19 | 23:59:23 | 2016-09 | 23 |
| 2 | 100000000013 | 6 | 2016-09-20 00:01:14 | 2016 | 09 | 20 | 2016-09-20 | 00:01:14 | 2016-09 | 00 |
| 3 | 100000000013 | 5 | 2016-09-20 00:01:51 | 2016 | 09 | 20 | 2016-09-20 | 00:01:51 | 2016-09 | 00 |
| 4 | 100000000013 | 6 | 2016-09-20 00:02:16 | 2016 | 09 | 20 | 2016-09-20 | 00:02:16 | 2016-09 | 00 |
fig,axes = plt.subplots(2,1,figsize=(10,10))
action.drop_duplicates(['userid']).groupby('date').userid.count().plot(ax=axes[0])
axes[0].set_title('独立用户日访问量(DAU)')
action[action['actionType']==1].groupby('date').userid.count().plot(ax=axes[1])
axes[1].set_title('用户日访问量(PV)')
Text(0.5, 1.0, '用户日访问量(PV)')
fig,axes = plt.subplots(figsize=(10,5))
(action[action['actionType']==1].groupby('date').userid.count()/action.drop_duplicates(['userid']).groupby('date').userid.count()).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x28b03211080>
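The wake-ups-per-user ratio discussed above can be sketched as PV divided by a per-day distinct user count. This is an illustration on made-up data, and it uses `nunique` for DAU rather than the deduplicated count in the notebook; `toy` and `wakeups_per_user` are hypothetical names:

```python
import pandas as pd

toy = pd.DataFrame({
    "userid":     [1, 1, 2, 1, 3, 3],
    "date":       ["2017-04-01"] * 3 + ["2017-04-02"] * 3,
    "actionType": [1, 1, 1, 1, 1, 5],
})

# Distinct active users per day.
dau = toy.groupby("date")["userid"].nunique()

# App wake-ups (actionType == 1) per day.
pv = toy[toy["actionType"] == 1].groupby("date")["userid"].count()

# Average wake-ups per active user; both Series align on the date index.
wakeups_per_user = (pv / dau).rename("pv_per_dau")
print(wakeups_per_user)
```

A day dominated by newly acquired users would be expected to show a lower value of this ratio than a day dominated by returning users.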
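Hourly visits
Click volume has a very odd shape: visits are low during the daytime (08:00 to 16:00) and bottom out around 12:00, which points to a large amount of missing data.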
fig,axes = plt.subplots(2,1,figsize=(10,10))
action.drop_duplicates(['userid']).groupby('hour').userid.count().plot(ax=axes[0])
axes[0].set_title('独立用户小时访问量(HAU)')
action[action['actionType']==1].groupby('hour').userid.count().plot(ax=axes[1])
axes[1].set_title('用户日小时访问量(PV)')
Text(0.5, 1.0, '用户日小时访问量(PV)')
Visits by user type
# Group the action types: browsing actions 2, 3 and 4 are merged into group 2
def vis_type(x):
if x in [2,3,4]:
return 2
else:
return x
action['visitor_type']=action['actionType'].map(lambda x: vis_type(x))
fig,axes = plt.subplots(figsize=(10,5))
diff_visitor = action.groupby(['hour','visitor_type']).userid.count().unstack()
plt.plot(diff_visitor)
plt.title('用户日小时访问量(PV)')
Text(0.5, 1.0, '用户日小时访问量(PV)')
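User conversion funnel
The conversion from waking the app (1) to browsing pages (2) looks normal. However, form-filling (5) occurs far more often than actions 1 and 2, meaning that many forms were filled in without any recorded app use: users may have reached the form page through other channels, or much of the data may be missing. In addition, step 7 has fewer records (35,036) than step 8 (35,867), which again suggests substantial missing data.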
from pyecharts import options as opts
from pyecharts.charts import Funnel
df = action.groupby('visitor_type',as_index=False).userid.count().values.tolist()
def funnel_base() -> Funnel:
c = (Funnel()
.add("访问量",df)
.set_global_opts(title_opts=opts.TitleOpts(title="访问转化"))
)
return c
funnel_base().render_notebook()
Funnel chart "访问转化" (visit conversion), record counts by visitor_type:
| visitor_type | count |
|---|---|
| 1 | 479374 |
| 2 | 209297 |
| 5 | 599224 |
| 6 | 284216 |
| 7 | 35036 |
| 8 | 35867 |
| 9 | 23046 |
def funnel_base() -> Funnel:
c = (Funnel()
.add("访问量",df[:2])
.set_global_opts(title_opts=opts.TitleOpts(title="访问转化"))
)
return c
funnel_base().render_notebook()
Funnel chart "访问转化" (visit conversion), steps 1 and 2 only:
| visitor_type | count |
|---|---|
| 1 | 479374 |
| 2 | 209297 |
def funnel_base() -> Funnel:
c = (Funnel()
.add("访问量",df[2:])
.set_global_opts(title_opts=opts.TitleOpts(title="访问转化"))
)
return c
funnel_base().render_notebook()
Funnel chart "访问转化" (visit conversion), steps 5 to 9 only:
| visitor_type | count |
|---|---|
| 5 | 599224 |
| 6 | 284216 |
| 7 | 35036 |
| 8 | 35867 |
| 9 | 23046 |
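The step-to-step conversion in the 5 to 9 funnel can be sketched as the ratio of each step's record count to the previous one. The counts are copied from the funnel output; they are event counts rather than distinct users, so the rates are only indicative, and the names `counts`, `step_rate` and `overall` are illustrative:

```python
import pandas as pd

# Record counts per action type, copied from the funnel output (events, not distinct users).
counts = pd.Series({5: 599224, 6: 284216, 7: 35036, 8: 35867, 9: 23046})

# Ratio of each step to the step before it; the first step has no predecessor.
step_rate = (counts / counts.shift(1)).dropna()

# Share of form-fill events that end in a payment event.
overall = counts[9] / counts[5]

print(step_rate.round(4))
print(round(overall, 4))
```

The rate from step 7 to step 8 comes out above 1, which is the anomaly noted in the text: a later step cannot have more events than the step before it unless records are missing or steps can be skipped.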
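User reviews
Ratings
The combined review table has 12,337 rows (9,863 presumably being the training-set figure). No rating is missing, the mean rating is about 4.92, and five-star reviews make up the vast majority.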
comment.rating.mean()
4.916672610845424
from pyecharts.charts import Bar
bar = Bar()
bar.add_xaxis(comment.rating.value_counts().index.tolist())
bar.add_yaxis("评分", comment.rating.value_counts().values.tolist())
bar.render_notebook()
Bar chart "评分" (rating), number of reviews per rating value:
| rating | count |
|---|---|
| 5.0 | 11833 |
| 4.0 | 247 |
| 1.0 | 118 |
| 3.0 | 97 |
| 2.0 | 35 |
| 4.33 | 3 |
| 2.33 | 2 |
| 3.67 | 2 |
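As a consistency check, the mean rating and the five-star share can be recomputed from the per-rating counts in the bar chart. A small sketch in plain Python; the counts are copied from the chart output and the variable names are illustrative:

```python
# Number of reviews per rating value, copied from the bar chart output.
rating_counts = {5.0: 11833, 4.0: 247, 1.0: 118, 3.0: 97, 2.0: 35, 4.33: 3, 2.33: 2, 3.67: 2}

total = sum(rating_counts.values())

# Count-weighted mean rating.
mean_rating = sum(rating * n for rating, n in rating_counts.items()) / total

# Share of reviews that gave the full five stars.
five_star_share = rating_counts[5.0] / total

print(total, round(mean_rating, 4), round(five_star_share, 4))
```

The weighted mean agrees with the `comment.rating.mean()` output of about 4.9167, and five-star reviews come to roughly 96% of all ratings.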
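Review tags
Taking a rating of 4 as the threshold between positive reviews (4 and above) and negative reviews (below 4), a word cloud is drawn for each group: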
tags_count = comment[comment.rating>=4].tags.str.split("|").dropna().apply(pd.value_counts).sum()
path=r'C:\Windows\Fonts\simhei.ttf'
import wordcloud
w = wordcloud.WordCloud(font_path=path,width=1400, height=1400, margin=2)
w.fit_words(tags_count)
plt.figure(dpi=1000)
plt.imshow(w)
plt.axis('off')
(-0.5, 1399.5, 1399.5, -0.5)
tags_count = comment[comment.rating<4].tags.str.split("|").dropna().apply(pd.value_counts).sum()
path=r'C:\Windows\Fonts\simhei.ttf'
import wordcloud
w = wordcloud.WordCloud(font_path=path,width=1400, height=1400, margin=2)
w.fit_words(tags_count)
plt.figure(dpi=500)
plt.imshow(w)
plt.axis('off')
(-0.5, 1399.5, 1399.5, -0.5)
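The tag frequencies behind the word clouds can also be computed without the row-wise `apply`. A minimal sketch using `explode`, on a toy frame built from the tag strings in the `comment.head()` preview (the helper `tag_counts` and the frame `toy_comment` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy_comment = pd.DataFrame({
    "rating": [4.0, 5.0, 5.0, 5.0, 1.0],
    "tags": [np.nan, "提前联系|耐心等候", np.nan,
             "主动热情|提前联系|举牌迎接|主动搬运行李", "未举牌服务"],
})

def tag_counts(frame):
    """Count how often each '|'-separated tag occurs in frame['tags']."""
    return frame["tags"].dropna().str.split("|").explode().value_counts()

good = tag_counts(toy_comment[toy_comment["rating"] >= 4])
bad = tag_counts(toy_comment[toy_comment["rating"] < 4])

# A plain dict of tag -> frequency, the form a word cloud expects.
good_freq = good.to_dict()
print(good_freq)
print(bad.to_dict())
```

The resulting dictionary can be passed to `WordCloud.fit_words` in the same way as `tags_count` above.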
Review keywords
Review keywords are split at the same threshold of 4, and a word cloud is drawn for each group.
Keyword_count=comment[comment['rating']>=4].commentsKeyWords.dropna().str[1:-1].str.split(',').apply(pd.value_counts).sum()
path=r'C:\Windows\Fonts\simhei.ttf'
import wordcloud
w = wordcloud.WordCloud(font_path=path,width=1400, height=1400, margin=2)
w.fit_words(Keyword_count)
plt.figure(dpi=1000)
plt.imshow(w)
plt.axis('off')
(-0.5, 1399.5, 1399.5, -0.5)
Keyword_count=comment[comment['rating']<4].commentsKeyWords.dropna().str[1:-1].str.split(',').apply(pd.value_counts).sum()
path=r'C:\Windows\Fonts\simhei.ttf'
import wordcloud
w = wordcloud.WordCloud(font_path=path,width=1400, height=1400, margin=2)
w.fit_words(Keyword_count)
plt.figure(dpi=1000)
plt.imshow(w)
plt.axis('off')
(-0.5, 1399.5, 1399.5, -0.5)
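Slicing off the brackets and splitting on commas leaves the quote characters attached to each keyword. A minimal sketch of one way to extract clean tokens, assuming the raw strings use straight single quotes as in `['很','细心']`; that format, the helper `parse_keywords` and the toy Series are assumptions:

```python
import re

import numpy as np
import pandas as pd

# Toy keyword strings in the assumed raw format.
toy_keywords = pd.Series(["['很','简陋','太','随便']", "['很','细心']", np.nan])

def parse_keywords(raw):
    """Return the quoted tokens of a string such as "['很','细心']" as a list."""
    return [word.strip() for word in re.findall(r"'([^']*)'", raw)]

# One row per keyword, then a frequency count.
keyword_counts = toy_keywords.dropna().map(parse_keywords).explode().value_counts()
print(keyword_counts)
```

If the real column uses a different quoting style, the regular expression would need to be adjusted accordingly.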
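Order data
This table describes users' historical orders. It has 7 columns: user id, order id, order time, order type, destination city, country and continent. In orderType, 1 means a boutique travel service was purchased and 0 means an ordinary travel service.
Repeat purchases
The order history is reported as 20,653 orders from 10,637 users (presumably training-set figures; the concatenated table used below is larger). The repeat-purchase chart is as follows: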
order_number=orderHistory.groupby(['userid'],as_index=False).orderid.count().groupby('orderid',as_index=False).userid.count().rename(columns={'orderid':'order_quantity','userid':'count'})
# Sum the user counts of every row past the first eight (more than 8 orders) and fix the column order so each pair is [name, value]
order_number=pd.concat([order_number[:8],pd.DataFrame([{'order_quantity':'8 次以上','count':order_number[8:]['count'].sum()}])])[['order_quantity','count']]
from pyecharts.charts import Pie
def pie_base() -> Pie:
c = (Pie()
.add('',order_number.values.tolist())
.set_global_opts(title_opts=opts.TitleOpts(title="所有服务用户复购图"))
.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)
return c
pie_base().render_notebook()
Pie chart "所有服务用户复购图" (repeat purchases, all services), number of users by orders placed:
| orders per user | users |
|---|---|
| 1 | 7992 |
| 2 | 2638 |
| 3 | 1244 |
| 4 | 566 |
| 5 | 349 |
| 6 | 181 |
| 7 | 113 |
| 8 | 64 |
The user count for the "8 次以上" (more than 8 orders) bucket is not recoverable from the exported chart.
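The repeat-purchase distribution can be sketched compactly as orders per user, bucketed, then counted. This is an illustration on made-up orders; the frame `toy_orders`, the bucket label `"8+"` and the `repeat_rate` measure are assumptions rather than part of the original analysis:

```python
import pandas as pd

# Toy orders: users 1 and 4 have one order each, user 2 has two, user 3 has three, user 5 has ten.
toy_orders = pd.DataFrame({
    "userid": [1, 2, 2, 3, 3, 3, 4] + [5] * 10,
    "orderid": range(17),
})

# Number of orders placed by each user.
orders_per_user = toy_orders.groupby("userid")["orderid"].count()

# Keep 1..8 as their own buckets and pool everything above 8.
buckets = orders_per_user.map(lambda n: str(n) if n <= 8 else "8+")
distribution = buckets.value_counts()

# Share of users with more than one order.
repeat_rate = (orders_per_user > 1).mean()

print(distribution)
print(repeat_rate)
```

Bucketing on the per-user counts before counting keeps the pooled bucket a true user count, whatever the largest order quantity is.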
order_number=orderHistory[orderHistory.orderType==1].groupby(['userid'],as_index=False).orderid.count().groupby('orderid',as_index=False).userid.count().rename(columns={'orderid':'order_quantity','userid':'count'})
# Sum the user counts of every row past the first eight (more than 8 orders) and fix the column order so each pair is [name, value]
order_number=pd.concat([order_number[:8],pd.DataFrame([{'order_quantity':'8 次以上','count':order_number[8:]['count'].sum()}])])[['order_quantity','count']]
from pyecharts.charts import Pie
def pie_base() -> Pie:
c = (Pie()
.add('',order_number.values.tolist())
.set_global_opts(title_opts=opts.TitleOpts(title="精品服务用户复购图"))
.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)
return c
pie_base().render_notebook()
Pie chart "精品服务用户复购图" (repeat purchases, boutique service), number of users by boutique orders placed:
| orders per user | users |
|---|---|
| 1 | 1359 |
| 2 | 407 |
| 3 | 182 |
| 4 | 81 |
| 5 | 50 |
| 6 | 20 |
| 7 | 11 |
| 8 | 7 |
The user count for the "8 次以上" (more than 8 orders) bucket is not recoverable from the exported chart.
Travel destinations
orderHistory.orderTime = orderHistory.orderTime.map(lambda x: time_convert(x))
orderHistory['year']=orderHistory.orderTime.str[:4]
orderHistory['month']=orderHistory.orderTime.str[5:7]
orderHistory['day']=orderHistory.orderTime.str[8:10]
orderHistory['date']=orderHistory.orderTime.str[:10]
orderHistory['time']=orderHistory.orderTime.str[11:]
orderHistory['year_month']=orderHistory.orderTime.str[:7]
orderHistory['hour']=orderHistory.orderTime.str[11:13]
from pyecharts.charts import Bar
from pyecharts import options as opts
jingpin_top10 = orderHistory[orderHistory.orderType==1].city.value_counts()[:10]
bar = Bar()
bar.add_xaxis(jingpin_top10.index.tolist())
bar.add_yaxis("精品游十大热门城市", jingpin_top10.values.tolist())
bar.render_notebook()
Bar chart "精品游十大热门城市" (top 10 cities for boutique tours), number of orders:
| city | orders |
|---|---|
| 东京 (Tokyo) | 445 |
| 大阪 (Osaka) | 248 |
| 曼谷 (Bangkok) | 163 |
| 台北 (Taipei) | 158 |
| 巴厘岛 (Bali) | 150 |
| 京都 (Kyoto) | 147 |
| 墨尔本 (Melbourne) | 147 |
| 悉尼 (Sydney) | 133 |
| 吉隆坡 (Kuala Lumpur) | 131 |
| 巴黎 (Paris) | 130 |
from pyecharts.globals import ThemeType
putong_top10 = orderHistory[orderHistory.orderType==0].city.value_counts()[:10]
bar = Bar({"theme": ThemeType.ESSOS})
bar.add_xaxis(putong_top10.index.tolist())
bar.add_yaxis("普通游十大热门城市", putong_top10.values.tolist())
bar.render_notebook()
Bar chart "普通游十大热门城市" (top 10 cities for ordinary tours), number of orders:
| city | orders |
|---|---|
| 新加坡 (Singapore) | 2319 |
| 东京 (Tokyo) | 1951 |
| 台北 (Taipei) | 1274 |
| 香港 (Hong Kong) | 1250 |
| 纽约 (New York) | 1240 |
| 吉隆坡 (Kuala Lumpur) | 1237 |
| 悉尼 (Sydney) | 1151 |
| 大阪 (Osaka) | 1006 |
| 曼谷 (Bangkok) | 900 |
| 墨尔本 (Melbourne) | 869 |
continent_jingpin=orderHistory[orderHistory['orderType']==1].groupby(['continent'],as_index=False).orderid.count()
continent_putong = orderHistory[orderHistory['orderType']==0].groupby(['continent'],as_index=False).orderid.count()
# South America has no ordinary orders; add it with a count of 0 so both series cover the same continents in the same order
continent_putong = pd.concat([continent_putong,pd.DataFrame([{'continent':'南美洲','orderid':0}])]).sort_values('continent')
bar = Bar()
bar.add_xaxis(continent_jingpin.continent.tolist())
bar.add_yaxis("精品游大陆分布", continent_jingpin.orderid.values.tolist())
bar.add_yaxis("普通游大陆分布", continent_putong.orderid.values.tolist())
bar.render_notebook()
Bar chart, number of orders by continent ("精品游大陆分布" = boutique tours, "普通游大陆分布" = ordinary tours):
| continent | boutique orders | ordinary orders |
|---|---|---|
| 亚洲 (Asia) | 2204 | 12936 |
| 北美洲 (North America) | 558 | 3638 |
| 南美洲 (South America) | 2 | 0 |
| 大洋洲 (Oceania) | 370 | 2878 |
| 欧洲 (Europe) | 763 | 2351 |
| 非洲 (Africa) | 4 | 8 |
country_boutique = orderHistory[orderHistory['orderType']==1].groupby('country').country.count().sort_values(ascending = False)[:10]
country_ordinary = orderHistory[orderHistory['orderType']==0].groupby('country').country.count().sort_values(ascending = False)[:10]
bar = Bar()
bar.add_xaxis(country_boutique.index.tolist())
bar.add_yaxis("精品游十大热门国家", country_boutique.values.tolist())
bar.render_notebook()
Chart output: 精品游十大热门国家 (top 10 countries by number of boutique-tour orders)
country | orders
---|---
Japan (日本) | 1030
United States (美国) | 486
Australia (澳大利亚) | 337
Taiwan, China (中国台湾) | 325
Thailand (泰国) | 222
France (法国) | 165
United Kingdom (英国) | 157
Malaysia (马来西亚) | 157
Indonesia (印度尼西亚) | 150
South Korea (韩国) | 118
bar = Bar()
bar.add_xaxis(country_ordinary.index.tolist())
bar.add_yaxis("普通游十大热门国家", country_ordinary.values.tolist())
bar.render_notebook()
from pyecharts.globals import ThemeType
# Same chart as above, drawn with the MACARONS theme
def bar_base_dict_config() -> Bar:
    c = (Bar({"theme": ThemeType.MACARONS})
         .add_xaxis(country_ordinary.index.tolist())
         .add_yaxis("普通游十大热门国家", country_ordinary.values.tolist())
    )
    return c
bar_base_dict_config().render_notebook()
Chart output: 普通游十大热门国家 (top 10 countries by number of ordinary-tour orders, MACARONS theme)
country | orders
---|---
Japan (日本) | 3372
United States (美国) | 3228
Australia (澳大利亚) | 2675
Singapore (新加坡) | 2319
Thailand (泰国) | 1781
Malaysia (马来西亚) | 1634
Taiwan, China (中国台湾) | 1395
Hong Kong, China (中国香港) | 1250
France (法国) | 760
United Kingdom (英国) | 738
Feature engineering
import pandas as pd
import numpy as np
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')
user_train = pd.read_csv(r'Data\trainingset\userProfile_train.csv')
action_train = pd.read_csv(r'Data\trainingset\action_train.csv')
comment_train = pd.read_csv(r'Data\trainingset\userComment_train.csv')
orderFuture_train= pd.read_csv(r'Data\trainingset\orderFuture_train.csv')
orderHistory_train= pd.read_csv(r'Data\trainingset\orderHistory_train.csv')
user_test = pd.read_csv(r'Data\test\userProfile_test.csv')
action_test = pd.read_csv(r'Data\test\action_test.csv')
comment_test = pd.read_csv(r'Data\test\userComment_test.csv')
orderFuture_test = pd.read_csv(r'Data\test\orderFuture_test.csv')
orderHistory_test = pd.read_csv(r'Data\test\orderHistory_test.csv')
user = pd.concat([user_train,user_test])
action = pd.concat([action_train,action_test])
comment = pd.concat([comment_train,comment_test])
orderHistory = pd.concat([orderHistory_train,orderHistory_test])
orderFuture = pd.concat([orderFuture_train,orderFuture_test])
orderHistory = orderHistory.sort_values(by=['userid','orderTime'])
# Number of historical orders per user, plus summary statistics of the order timestamps
orderHistory_internal_table = orderHistory.groupby('userid').orderTime.agg(['count','max','min','std','mean']).reset_index().rename(columns = {'count':'order_count',
'max':'ordertime_max',
'min':'ordertime_min',
'std':'orderTime_std',
'mean':'ordertime_mean'}).fillna(0)
# Number of ordinary (orderType == 0) and boutique (orderType == 1) historical orders per user
orderHistory_internal_table = orderHistory_internal_table.merge(orderHistory[orderHistory['orderType']==0].groupby('userid').orderid.count().reset_index().rename(columns={'orderid':'ordinary_count'}),how='left',on='userid')
orderHistory_internal_table = orderHistory_internal_table.merge(orderHistory[orderHistory['orderType']==1].groupby('userid').orderid.count().reset_index().rename(columns={'orderid':'unordinary_count'}),how='left',on='userid')
# How many times each user has visited each country, continent and city
orderHistory_internal_table = orderHistory_internal_table.merge(pd.get_dummies(orderHistory[['userid','country','continent','city']]).groupby('userid',as_index=False).sum(),on='userid',how='left')
# Information about each user's most recent trip
orderHistory_internal_table = orderHistory_internal_table.merge(pd.get_dummies(orderHistory.groupby('userid',as_index=False).apply(lambda x:x.iloc[-1])[['userid','orderType','city','country','continent']]),on='userid',how='left')
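The visit-count dummies and the last-trip dummies above are both derived from the same `country`, `continent` and `city` columns, so `pd.get_dummies` gives them identical column names and the second merge falls back on pandas' `_x`/`_y` suffixes. The following is a minimal sketch of one way to keep the two feature groups apart, on toy data. The `toy_orders` frame and the `visited`/`last` prefixes are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy order history with the same column names as orderHistory (values are made up).
toy_orders = pd.DataFrame({
    'userid':    [1, 1, 2],
    'orderTime': [100, 200, 150],
    'orderType': [0, 1, 0],
    'country':   ['Japan', 'France', 'Japan'],
})
toy_orders = toy_orders.sort_values(['userid', 'orderTime'])

# Visit counts: one dummy column per country, summed per user.
visits = (pd.get_dummies(toy_orders[['userid', 'country']], prefix='visited', dtype=int)
          .groupby('userid', as_index=False).sum())

# Last trip: after sorting by time, the last row of each user is the most recent order.
last_trip = toy_orders.drop_duplicates('userid', keep='last')
last = pd.get_dummies(last_trip[['userid', 'country']], prefix='last', dtype=int)

# Distinct prefixes mean the merge needs no _x/_y suffixes.
order_features = visits.merge(last, on='userid', how='left')
print(order_features)
```

`drop_duplicates(..., keep='last')` is used here in place of `groupby(...).apply(lambda x: x.iloc[-1])`; on sorted data both select the same row per user.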
data = orderFuture.copy() # start from orderFuture
data = data.merge(user) # join the user profile
data = data.merge(comment,how = 'left') # join the comments
data['tags'] = data.tags.apply(lambda x : 0 if pd.isnull(x) else 1) # reduce tags to present / absent
data['commentsKeyWords'] = data.commentsKeyWords.apply(lambda x:0 if pd.isnull(x) else 1) # reduce comment keywords to present / absent
del data['orderid'] # drop the orderid column
action = action.sort_values(by=['userid','actionTime']) # sort by userid, then actionTime
# Build an intermediate table of action features: per user, the number of actions and the max, min, std and mean of the action timestamps
action_internal_table = action.groupby('userid').actionTime.agg(['count','max','min','std','mean']).reset_index().rename(columns = {'count':'action_count',
'max':'time_last_action',
'min':'time_first_action',
'std':'actiontime_std',
'mean':'actiontime_mean'})
# Ratio of action types 2-4 to types 5-9 (no code follows for this feature)
# Action type of each user's last 1-20 actions
for i in range(20):
    last_type = (action.groupby('userid').actionType
                 .apply(lambda x: x.iloc[-i-1] if len(x) > i else np.nan)
                 .reset_index()
                 .rename(columns={'actionType': 'last_but{}_action_type'.format(i)}))
    action_internal_table = action_internal_table.merge(last_type, on='userid', how='left')
# Share of each action type (1-9) in a user's action history
count = action.groupby('userid').actionType.count()
for i in range(1,10):
    rate = ((action[action['actionType']==i].groupby('userid').actionType.count()/count)
            .reset_index().rename(columns={'actionType':'rate_{}'.format(i)}).fillna(0))
    action_internal_table = action_internal_table.merge(rate, on='userid', how='left')
# Timestamp of each user's last 1-20 actions
# (named *_action_time: reusing the *_action_type names would turn them into merge keys and add no columns)
for i in range(20):
    last_time = (action.groupby('userid').actionTime
                 .apply(lambda x: x.iloc[-i-1] if len(x) > i else np.nan)
                 .reset_index()
                 .rename(columns={'actionTime': 'last_but{}_action_time'.format(i)}))
    action_internal_table = action_internal_table.merge(last_time, on='userid', how='left')
data = data.merge(action_internal_table,on='userid',how='left')
data = data.merge(orderHistory_internal_table,on='userid',how='left')
data = data.fillna(-999)
data = pd.get_dummies(data)
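The "last 1-20 actions" features above are built with one `groupby(...).apply(...)` pass per position, which is slow on a large action log. As an alternative, the sketch below computes the same kind of features in a single pass with `cumcount` and `pivot`. It runs on toy data; the helper name `last_k_features` and the `*_action_time` column suffix are illustrative assumptions, not names taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy action log with the same column names as action_train.csv (values are made up).
toy_action = pd.DataFrame({
    'userid':     [1, 1, 1, 2, 2, 3],
    'actionType': [1, 5, 6, 2, 3, 1],
    'actionTime': [10, 20, 35, 5, 50, 7],
})

def last_k_features(action, k):
    """Return one row per user with the type and time of that user's last k actions."""
    action = action.sort_values(['userid', 'actionTime'])
    # Position counted from the end of each user's history: 0 is the most recent action.
    pos = action.groupby('userid').cumcount(ascending=False)
    recent = action.assign(pos=pos)[pos < k]
    wide = recent.pivot(index='userid', columns='pos',
                        values=['actionType', 'actionTime'])
    # Flatten the (value, position) column MultiIndex into single names.
    wide.columns = [
        'last_but{}_action_{}'.format(p, 'type' if v == 'actionType' else 'time')
        for v, p in wide.columns
    ]
    return wide.reset_index()

features = last_k_features(toy_action, k=2)
print(features)
```

Users with fewer than `k` actions get `NaN` in the missing positions, which matches the `np.nan` fallback in the loops above.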
Model evaluation and improvement
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
# Select the label and features by name: the position of orderType after pd.concat depends on the pandas version
trainval = data[data['userid'].isin(orderFuture_train.userid)]
X_trainval = trainval.drop(columns=['userid','orderType'])
y_trainval = trainval['orderType'].astype(int)
X_train,X_val,y_train,y_val = train_test_split(X_trainval,y_trainval,random_state=88,stratify=y_trainval)
xgb_cla = xgb.XGBClassifier(learning_rate=0.1, # eta is an alias of learning_rate, so the conflicting eta=0.05 is dropped
                            n_estimators=1000,
                            max_depth=3,
                            min_child_weight=5,
                            gamma=0,
                            subsample= 0.8,
                            colsample_bytree=0.8,
                            verbosity=0, # replaces the deprecated silent=1
                            objective='binary:logistic',
                            scale_pos_weight=1).fit(X_train,y_train)
# AUC on the held-out validation split
print(roc_auc_score(y_val,xgb_cla.predict_proba(X_val)[:,1]))
# Predict purchase probabilities for the test users and write the submission file
test = data[data['userid'].isin(orderFuture_test.userid)]
X_test = test.drop(columns=['userid','orderType'])
predict = xgb_cla.predict_proba(X_test)[:,1]
submission = pd.DataFrame({'userid':test['userid'].values,'orderType':predict})
submission.to_csv('submission.csv',encoding='utf-8',index=False)
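The evaluation above relies on a single train/validation split, so the reported AUC depends on `random_state`. `KFold` is imported above but never used. A stratified K-fold loop is one common way to get a more stable estimate, and the sketch below shows the pattern. To stay self-contained it uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost model. Both are assumptions for illustration, not the original setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in for X_trainval / y_trainval.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=88)

def cross_validated_auc(X, y, n_splits=5, seed=88):
    """Return the validation AUC of each stratified fold."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        # A fresh model per fold, so no fold sees another fold's validation rows.
        model = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                           random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx], prob))
    return np.array(scores)

auc_scores = cross_validated_auc(X, y)
print('fold AUCs:', auc_scores.round(3))
print('mean AUC: {:.3f} +/- {:.3f}'.format(auc_scores.mean(), auc_scores.std()))
```

With the real data, the same loop would take `X_trainval.values` and `y_trainval.values` and could construct an `xgb.XGBClassifier` inside the loop instead of the stand-in model.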
# from sklearn.linear_model import Lasso
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import roc_auc_score
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import MinMaxScaler
# from sklearn.pipeline import make_pipeline
# X_trainval = orderFuture[orderFuture.userid.isin(orderFuture_train.userid)].iloc[:,2:]
# y_trainval = orderFuture[orderFuture.userid.isin(orderFuture_train.userid)].iloc[:,0]
# X_test = orderFuture[orderFuture.userid.isin(orderFuture_test.userid)].iloc[:,2:]
# X_train,X_val,y_train,y_val = train_test_split(X_trainval,y_trainval,stratify = y_trainval,random_state = 42)
# pipe = make_pipeline(MinMaxScaler(),RandomForestClassifier()).fit(X_train,y_train)
# #RFC = RandomForestClassifier().fit(X_train,y_train)
# print('Model AUC: {}'.format(roc_auc_score(y_val,pipe.predict_proba(X_val)[:,1])))
# result = pd.DataFrame({'userid':orderFuture[orderFuture.userid.isin(orderFuture_test.userid)].iloc[:,1],'orderType':pipe.predict(X_test)}).reset_index(drop=True)
# result.to_csv('submission.csv')
# from sklearn.model_selection import GridSearchCV
# from sklearn.pipeline import make_pipeline
# from sklearn.preprocessing import MinMaxScaler
# from sklearn.svm import SVC
# X_trainval = orderFuture[orderFuture.userid.isin(orderFuture_train.userid)].iloc[:,2:]
# y_trainval = orderFuture[orderFuture.userid.isin(orderFuture_train.userid)].iloc[:,0]
# param_grid = {'svc__C':[0.001,0.01,0.1,1,10,100],
# 'svc__gamma':[0.001,0.01,0.1,1,10,100]}
# pipe = make_pipeline(MinMaxScaler(),SVC())
# grid = GridSearchCV(pipe,param_grid=param_grid,cv=5).fit(X_train,y_train)
# grid.score(X_train,y_train)