Getting Started with Data Science Competitions: Predicting Orders for a Premium Travel Service


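Competition background

  HI GUIDES (皇包车) is a platform that offers Chinese-language chartered car tours worldwide to outbound travelers from China. It works with 100,000 overseas Chinese driver-guides and covers more than 90 countries, 1,600 cities, and 300 international airports. As of June 2017 it had served a cumulative 4 million Chinese outbound travelers.

  As consumers' spending power grows and travel information becomes more transparent, traveler behavior has become harder to predict, and the fixed itineraries of traditional travel agencies no longer meet demand. Offering users more popular and better-suited chartered tour routes calls for big data: combining dimensions such as personal preferences, attraction popularity, weather, and traffic to build multiple data-driven travel solutions and products.

Competition page: https://www.dcjingsai.com/com…

Task

  HI GUIDES provides more than 50,000 records of customers' app browsing behavior. Some of these customers placed an order after browsing and used the premium travel service, while others did not order.
  Participants analyze users' profile information and browsing behavior to predict whether a user will buy the premium travel service in the near term.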

Data import and preview

import pandas as pd
import numpy as np
from sklearn import preprocessing
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False

user_train = pd.read_csv(r'Data\trainingset\userProfile_train.csv')
action_train = pd.read_csv(r'Data\trainingset\action_train.csv')
comment_train = pd.read_csv(r'Data\trainingset\userComment_train.csv')
orderFuture_train= pd.read_csv(r'Data\trainingset\orderFuture_train.csv')
orderHistory_train= pd.read_csv(r'Data\trainingset\orderHistory_train.csv')

user_test = pd.read_csv(r'Data\test\userProfile_test.csv')
action_test = pd.read_csv(r'Data\test\action_test.csv')
comment_test = pd.read_csv(r'Data\test\userComment_test.csv')
orderFuture_test = pd.read_csv(r'Data\test\orderFuture_test.csv')
orderHistory_test = pd.read_csv(r'Data\test\orderHistory_test.csv')

user = pd.concat([user_train,user_test])
action = pd.concat([action_train,action_test])
comment = pd.concat([comment_train,comment_test])
orderHistory = pd.concat([orderHistory_train,orderHistory_test])
orderFuture = pd.concat([orderFuture_train,orderFuture_test])
user.head()
userid gender province age
0 100000000013 NaN 60 后
1 100000000111 NaN 上海 NaN
2 100000000127 NaN 上海 NaN
3 100000000231 北京 70 后
4 100000000379 北京 NaN


action.head()

         userid  actionType  actionTime
0  100000000013           1  1474300753
1  100000000013           5  1474300763
2  100000000013           6  1474300874
3  100000000013           5  1474300911
4  100000000013           6  1474300936

orderHistory.head()

         userid  orderid   orderTime  orderType  city    country      continent
0  100000000013  1000015  1481714516          0  柏林    德国         欧洲
1  100000000013  1000014  1501959643          0  旧金山  美国         北美洲
2  100000000393  1000033  1499440296          0  巴黎    法国         欧洲
3  100000000459  1000036  1480601668          0  纽约    美国         北美洲
4  100000000459  1000034  1479146723          0  巴厘岛  印度尼西亚   亚洲

orderFuture.head()

   orderType        userid
0        0.0  100000000013
1        0.0  100000000111
2        0.0  100000000127
3        0.0  100000000231
4        0.0  100000000379

comment.head()

         userid  orderid  rating  tags                                   commentsKeyWords
0  100000000013  1000015     4.0  NaN                                    ['很','简陋','太','随便']
1  100000000231  1000024     5.0  提前联系|耐心等候                      ['很','细心']
2  100000000471  1000038     5.0  NaN                                    NaN
3  100000000637  1000040     5.0  主动热情|提前联系|举牌迎接|主动搬运行李  NaN
4  100000000755  1000045     1.0  未举牌服务                             NaN

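EDA and visualization

User information

  The user profile table, with train and test concatenated, has 50,383 rows (40,307 of them from the training set), each uniquely identified by userid. Missing values are severe.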

user.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50383 entries, 0 to 10075
Data columns (total 4 columns):
userid      50383 non-null int64
gender      19769 non-null object
province    45484 non-null object
age         5961 non-null object
dtypes: int64(1), object(3)
memory usage: 1.9+ MB
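
The non-null counts above translate directly into missing-value shares. The following is a minimal sketch of that calculation; it uses a small made-up frame in place of the real `user` table, and the helper name `missing_share` is illustrative rather than part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy stand-in for the user table (values are invented for illustration)
toy_user = pd.DataFrame({
    "userid": [1, 2, 3, 4],
    "gender": ["男", np.nan, np.nan, "女"],
    "province": ["上海", "北京", np.nan, "上海"],
    "age": [np.nan, np.nan, np.nan, "80后"],
})
shares = missing_share(toy_user)
print(shares)
```

Applied to the counts printed by user.info(), the same arithmetic gives roughly 60.8% missing for gender, 9.7% for province, and 88.2% for age.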

User distribution by region

(user.province.value_counts()/user.province.value_counts().sum()).head().sum()
0.7712162518687891


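  Users come mainly from developed regions such as Beijing, Shanghai, Guangdong, Jiangsu, and Zhejiang; the top five regions account for about 77% of users with a known province.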
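
The top-five share can also be computed with normalized value counts, which skip missing provinces by default. This is a small sketch on invented data; `top_k_share` is a hypothetical helper name.

```python
import pandas as pd

def top_k_share(values: pd.Series, k: int = 5) -> float:
    """Share of non-missing rows covered by the k most frequent values."""
    # value_counts(normalize=True) drops NaN/None before computing shares
    return values.value_counts(normalize=True).head(k).sum()

# Toy province column: 5 known values and 1 missing
toy_province = pd.Series(["上海", "上海", "北京", "广东", None, "四川"])
share = top_k_share(toy_province, k=2)
print(round(share, 2))
```

On the real data the call would be `top_k_share(user.province, k=5)`.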

fig,axes = plt.subplots(figsize=(20,10))
sns.countplot(x='province',data=user,order=user.province.value_counts().index.tolist())

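User gender

  Gender is recorded for only a minority of users (19,769 of the 50,383 rows in the user.info() output above; the original count of 15,760 presumably covers the training set only). Women make up 54.7% and men 45.3%.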

fig,axes = plt.subplots(1,2,figsize=(12,4))
user.gender.value_counts().plot.bar(ax=axes[0])
axes[0].set_xticklabels(['女','男'],rotation=0)
user.gender.value_counts().plot.pie(ax=axes[1],autopct='%.2f%%')

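User age

  Age is recorded for even fewer users (5,961 non-null rows in the user.info() output; the original count of 4,742 presumably covers the training set only). Most of them fall in the 60后, 70后, 80后, and 90后 groups, that is, born in the 1960s through the 1990s.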

fig,axes = plt.subplots(figsize=(10,4))
user.age.value_counts().plot.bar()
plt.xticks(rotation=0)

  Among users who provide their age, men outnumber women, and most values are missing.

fig,axes = plt.subplots(figsize=(10,4))
sns.countplot(x='age',data=user,hue='gender')

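User browsing behavior

  There are 9 action types. Type 1 is waking the app; types 2-4 are browsing products, with no fixed order among them; types 5-9 are sequential, running from filling in the form through submitting the order to final payment.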

import time
def time_convert(timestamp):
    str_time =time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(timestamp))
    return str_time
action.actionTime = action.actionTime.map(lambda x: time_convert(x))
action['year']=action.actionTime.str[:4]
action['month']=action.actionTime.str[5:7]
action['day']=action.actionTime.str[8:10]
action['date']=action.actionTime.str[:10]
action['time']=action.actionTime.str[11:]
action['year_month']=action.actionTime.str[:7]
action['hour']=action.actionTime.str[11:13]
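
Because `time.localtime` depends on the time zone of the machine running the notebook, the same conversion can be made reproducible with a vectorized pandas call. The sketch below assumes the timestamps are Unix seconds and should be displayed in UTC+8, which matches the converted preview rows shown in this post; `add_time_parts` is an illustrative helper name.

```python
from datetime import timedelta, timezone

import pandas as pd

# Assumption: display times in UTC+8 regardless of the local machine setting
UTC_PLUS_8 = timezone(timedelta(hours=8))

def add_time_parts(df: pd.DataFrame, col: str = "actionTime") -> pd.DataFrame:
    """Return a copy with string date parts derived from Unix-second timestamps."""
    out = df.copy()
    ts = pd.to_datetime(out[col], unit="s", utc=True).dt.tz_convert(UTC_PLUS_8)
    out[col] = ts.dt.strftime("%Y-%m-%d %H:%M:%S")
    out["date"] = ts.dt.strftime("%Y-%m-%d")
    out["year_month"] = ts.dt.strftime("%Y-%m")
    out["hour"] = ts.dt.strftime("%H")
    return out

# Two raw timestamps taken from the action preview
toy_action = pd.DataFrame({"actionTime": [1474300753, 1474300874]})
converted = add_time_parts(toy_action)
print(converted)
```

Keeping the parts as strings mirrors the slicing approach above; the remaining columns (year, month, day, time) could be added the same way.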

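Monthly visits

  MAU is measured as the number of distinct user IDs with any action in the month, and PV as the number of app wake-ups (action type 1). User activity peaks twice, in April-May and in October: short holidays are the preferred time for trips abroad.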

fig,axes = plt.subplots(2,1,figsize=(10,10))
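# MAU: distinct userids with any action in each month (2016-08 excluded); nunique() counts a user in every month they are active
action[action['year_month'] !='2016-08'].groupby('year_month').userid.nunique().plot(ax=axes[0])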
axes[0].set_title('独立用户月访问量(MAU)')
action[action.actionType==1].groupby('year_month').userid.count().plot(ax=axes[1])
axes[1].set_title('用户月访问量(PV)')
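
When counting active users per period, deduplicating on userid across the whole table before grouping keeps only each user's first record, so every user is counted in a single period. The following sketch, on made-up data, contrasts that "first seen" count with a per-period distinct count; the variable names are illustrative.

```python
import pandas as pd

# Toy action log: users 1 and 2 are active in both months, user 3 only in October
toy = pd.DataFrame({
    "userid":     [1, 1, 2, 2, 3],
    "year_month": ["2016-09", "2016-10", "2016-09", "2016-10", "2016-10"],
})

# Global de-duplication: each user is counted once, in the month of first appearance
first_seen = toy.drop_duplicates(["userid"]).groupby("year_month").userid.count()

# Per-month distinct count: each user is counted in every month with activity
active = toy.groupby("year_month").userid.nunique()

print(first_seen.to_dict())
print(active.to_dict())
```

The first series behaves like a new-user count, while the second matches the usual definition of monthly active users.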

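Daily visits

  DAU is the number of distinct user IDs with any action on a given day, and PV is the number of type 1 actions that day. DAU peaks in early April, yet PV in that period is lower than at the PV peak in early May, which means fewer wake-ups per user in early April, possibly because of a user-acquisition campaign. Dividing one metric by the other lets us check this. Likewise, PV/DAU is high before December 2016 and fairly stable afterwards, suggesting the app has entered a healthy, steady phase.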

action.head()

         userid  actionType           actionTime  year month day        date      time year_month hour
0  100000000013           1  2016-09-19 23:59:13  2016    09  19  2016-09-19  23:59:13    2016-09   23
1  100000000013           5  2016-09-19 23:59:23  2016    09  19  2016-09-19  23:59:23    2016-09   23
2  100000000013           6  2016-09-20 00:01:14  2016    09  20  2016-09-20  00:01:14    2016-09   00
3  100000000013           5  2016-09-20 00:01:51  2016    09  20  2016-09-20  00:01:51    2016-09   00
4  100000000013           6  2016-09-20 00:02:16  2016    09  20  2016-09-20  00:02:16    2016-09   00

fig,axes = plt.subplots(2,1,figsize=(10,10))
# DAU: distinct userids with any action on each day
action.groupby('date').userid.nunique().plot(ax=axes[0])
axes[0].set_title('独立用户日访问量(DAU)')
action[action['actionType']==1].groupby('date').userid.count().plot(ax=axes[1])
axes[1].set_title('用户日访问量(PV)')

fig,axes = plt.subplots(figsize=(10,5))
# PV per active user (PV/DAU) by day: type 1 actions divided by distinct userids
(action[action['actionType']==1].groupby('date').userid.count()/action.groupby('date').userid.nunique()).plot()

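Hourly visits

  The hourly click volume has a very odd shape: traffic is low during the day (8:00 to 16:00) and bottoms out around 12:00, which points to substantial missing data.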

fig,axes = plt.subplots(2,1,figsize=(10,10))
# Distinct userids with any action in each hour of the day
action.groupby('hour').userid.nunique().plot(ax=axes[0])
axes[0].set_title('独立用户小时访问量(HAU)')
action[action['actionType']==1].groupby('hour').userid.count().plot(ax=axes[1])
axes[1].set_title('用户日小时访问量(PV)')

Visits by visitor type

# Group action types: the browsing types 2, 3 and 4 are merged into type 2
def vis_type(x):
    if x in [2,3,4]:
        return 2
    else:
        return x
action['visitor_type']=action['actionType'].map(lambda x: vis_type(x))
fig,axes = plt.subplots(figsize=(10,5))
diff_visitor = action.groupby(['hour','visitor_type']).userid.count().unstack()
plt.plot(diff_visitor)
plt.title('用户日小时访问量(PV)')

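User conversion model

  The conversion from waking the app (1) to browsing pages (2) looks normal, but form fills (5) far outnumber actions 1 and 2. In other words, many forms were filled in without the app being used, perhaps because users landed on the form page through other channels, or because a lot of data is missing. In addition, step 7 has fewer records than step 8, which again suggests substantial missing data.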

from pyecharts import options as opts
from pyecharts.charts import Funnel
df = action.groupby('visitor_type',as_index=False).userid.count().values.tolist()

def funnel_base() -> Funnel:
    c = (Funnel()
        .add("访问量",df)
        .set_global_opts(title_opts=opts.TitleOpts(title="访问转化"))
    )
    return c
funnel_base().render_notebook()

[Funnel chart "访问转化" (visit conversion), action counts by type: 1: 479,374; 2 (types 2-4 combined): 209,297; 5: 599,224; 6: 284,216; 7: 35,036; 8: 35,867; 9: 23,046]

def funnel_base() -> Funnel:
    c = (Funnel()
        .add("访问量",df[:2])
        .set_global_opts(title_opts=opts.TitleOpts(title="访问转化"))
    )
    return c
funnel_base().render_notebook()

[Funnel chart "访问转化" (visit conversion), wake-up to browsing: 1: 479,374; 2 (types 2-4 combined): 209,297]

def funnel_base() -> Funnel:
    c = (Funnel()
        .add("访问量",df[2:])
        .set_global_opts(title_opts=opts.TitleOpts(title="访问转化"))
    )
    return c
funnel_base().render_notebook()

[Funnel chart "访问转化" (visit conversion), ordering steps: 5: 599,224; 6: 284,216; 7: 35,036; 8: 35,867; 9: 23,046]
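
The step-to-step conversion can be read off the action counts directly. A minimal sketch, using the counts from the funnel chart data in this post (type 2 stands for the merged browsing types 2-4); `step_rates` is an illustrative helper name.

```python
# Action counts copied from the funnel chart data in this post
counts = {1: 479374, 2: 209297, 5: 599224, 6: 284216, 7: 35036, 8: 35867, 9: 23046}

def step_rates(counts: dict, steps: list) -> dict:
    """Ratio of each step's count to the previous step's count."""
    return {(a, b): counts[b] / counts[a] for a, b in zip(steps, steps[1:])}

browse = step_rates(counts, [1, 2])          # wake-up -> browsing
order = step_rates(counts, [5, 6, 7, 8, 9])  # form fill -> ... -> payment

for (a, b), rate in {**browse, **order}.items():
    print(f"{a} -> {b}: {rate:.1%}")
```

A ratio above 100% between two consecutive steps, as for step 7 to step 8 here, is the kind of anomaly the text attributes to missing records.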

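User reviews

User ratings

  The review table holds 9,863 rows in the original count (the rating counts charted in this section sum to 12,337 for train and test combined). Ratings have no missing values, the mean is about 4.92, and five-star reviews make up the vast majority.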

comment.rating.mean()
4.916672610845424



from pyecharts.charts import Bar

bar = Bar()
bar.add_xaxis(comment.rating.value_counts().index.tolist())
bar.add_yaxis("评分", comment.rating.value_counts().values.tolist())
bar.render_notebook()

[Bar chart "评分" (rating), number of reviews per rating: 5.0: 11,833; 4.0: 247; 1.0: 118; 3.0: 97; 2.0: 35; 4.33: 3; 2.33: 2; 3.67: 2]
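
The mean rating and the five-star share follow from the per-rating counts alone. A short sketch using the counts from the rating bar chart data in this post; the variable names are illustrative.

```python
# Number of reviews per rating, copied from the bar chart data in this post
rating_counts = {5.0: 11833, 4.0: 247, 1.0: 118, 3.0: 97, 2.0: 35,
                 4.33: 3, 2.33: 2, 3.67: 2}

total = sum(rating_counts.values())
# Weighted mean: each rating value weighted by how often it occurs
mean_rating = sum(r * n for r, n in rating_counts.items()) / total
five_star_share = rating_counts[5.0] / total

print(total, round(mean_rating, 4), f"{five_star_share:.1%}")
```

The weighted mean reproduces the value returned by `comment.rating.mean()` above.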

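Review tags

  Using a rating of 4 as the cutoff (4 and above counts as positive, below 4 as negative), the word clouds for each group are built as follows: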

tags_count = comment[comment.rating>=4].tags.str.split("|").dropna().apply(pd.value_counts).sum()
path=r'C:\Windows\Fonts\simhei.ttf'
import wordcloud
w = wordcloud.WordCloud(font_path=path,width=1400, height=1400, margin=2)
w.fit_words(tags_count)
plt.figure(dpi=1000)
plt.imshow(w)
plt.axis('off')

tags_count = comment[comment.rating<4].tags.str.split("|").dropna().apply(pd.value_counts).sum()
path=r'C:\Windows\Fonts\simhei.ttf'
import wordcloud
w = wordcloud.WordCloud(font_path=path,width=1400, height=1400, margin=2)
w.fit_words(tags_count)
plt.figure(dpi=500)
plt.imshow(w)
plt.axis('off')

Review keywords

  Review keywords are split at the same cutoff of 4, with one word cloud per group.

Keyword_count=comment[comment['rating']>=4].commentsKeyWords.dropna().str[1:-1].str.split(',').apply(pd.value_counts).sum()
path=r'C:\Windows\Fonts\simhei.ttf'
import wordcloud
w = wordcloud.WordCloud(font_path=path,width=1400, height=1400, margin=2)
w.fit_words(Keyword_count)
plt.figure(dpi=1000)
plt.imshow(w)
plt.axis('off')

Keyword_count=comment[comment['rating']<4].commentsKeyWords.dropna().str[1:-1].str.split(',').apply(pd.value_counts).sum()
path=r'C:\Windows\Fonts\simhei.ttf'
import wordcloud
w = wordcloud.WordCloud(font_path=path,width=1400, height=1400, margin=2)
w.fit_words(Keyword_count)
plt.figure(dpi=1000)
plt.imshow(w)
plt.axis('off')
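
The frequency tables that feed the word clouds can also be built with `explode`, which avoids applying `value_counts` row by row. This is a sketch on invented tag strings, assuming tags are joined with "|" as in the tags column; `count_tags` is an illustrative helper name.

```python
import pandas as pd

def count_tags(tags: pd.Series, sep: str = "|") -> pd.Series:
    """Count individual tags in a column of sep-joined tag strings."""
    return (
        tags.dropna()
            .str.split(sep)   # a single-character separator is treated literally
            .explode()        # one row per tag
            .str.strip()
            .value_counts()
    )

# Toy tags column with one missing value
toy_tags = pd.Series(["提前联系|耐心等候", None, "主动热情|提前联系", "未举牌服务"])
tag_counts = count_tags(toy_tags)
print(tag_counts)
```

The result could then be converted with `tag_counts.to_dict()` before being handed to the word cloud.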

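Order data

  This table describes users' historical orders. It has 7 columns: user ID, order ID, order time, order type, city, country, and continent. For order type, 1 means a premium travel service was purchased and 0 means a regular travel service.

Repeat purchases

  The order data contains 20,653 orders from 10,637 users. The repeat-purchase charts are shown below: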

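order_number=orderHistory.groupby(['userid'],as_index=False).orderid.count().groupby('orderid',as_index=False).userid.count().rename(columns={'orderid':'order_quantity','userid':'count'})
# Bucket users with more than 8 orders: sum their user counts (count().sum() would count cells instead) and keep [label, value] column order for the pie
over_8=pd.DataFrame([{'order_quantity':'8 次以上','count':order_number['count'].iloc[8:].sum()}])
order_number=pd.concat([order_number[:8],over_8])[['order_quantity','count']]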

from pyecharts.charts import Page, Pie
def pie_base() -> Pie:
    c = (Pie()
        .add('',order_number.values.tolist())
        .set_global_opts(title_opts=opts.TitleOpts(title="所有服务用户复购图"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    return c
pie_base().render_notebook()

[Pie chart "所有服务用户复购图" (repeat purchases, all services), users by number of orders: 1 order: 7,992; 2: 2,638; 3: 1,244; 4: 566; 5: 349; 6: 181; 7: 113; 8: 64; the final "8 次以上" (more than 8) bucket is rendered as 46]

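order_number=orderHistory[orderHistory.orderType==1].groupby(['userid'],as_index=False).orderid.count().groupby('orderid',as_index=False).userid.count().rename(columns={'orderid':'order_quantity','userid':'count'})
# Bucket users with more than 8 premium orders: sum their user counts (count().sum() would count cells instead) and keep [label, value] column order for the pie
over_8=pd.DataFrame([{'order_quantity':'8 次以上','count':order_number['count'].iloc[8:].sum()}])
order_number=pd.concat([order_number[:8],over_8])[['order_quantity','count']]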

from pyecharts.charts import Page, Pie
def pie_base() -> Pie:
    c = (Pie()
        .add('',order_number.values.tolist())
        .set_global_opts(title_opts=opts.TitleOpts(title="精品服务用户复购图"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    return c
pie_base().render_notebook()

[Pie chart "精品服务用户复购图" (repeat purchases, premium service), users by number of premium orders: 1 order: 1,359; 2: 407; 3: 182; 4: 81; 5: 50; 6: 20; 7: 11; 8: 7; the final "8 次以上" (more than 8) bucket is rendered as 22]
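
The orders-per-user distribution behind these charts can be sketched more compactly by capping the per-user order count, so that everything at or above the cap lands in one bucket. The example below runs on made-up orders; `order_count_distribution` and the `cap` parameter are illustrative names, not part of the original notebook.

```python
import pandas as pd

def order_count_distribution(orders: pd.DataFrame, cap: int = 9) -> pd.Series:
    """Number of users per order count; counts >= cap share one bucket."""
    per_user = orders.groupby("userid").orderid.count()
    return per_user.clip(upper=cap).value_counts().sort_index()

# Toy order history: user 1 has 1 order, user 2 has 2, user 3 has 4
toy_orders = pd.DataFrame({
    "userid":  [1, 2, 2, 3, 3, 3, 3],
    "orderid": [10, 11, 12, 13, 14, 15, 16],
})
dist = order_count_distribution(toy_orders, cap=3)

# Share of users who ordered at least twice
repeat_share = dist[dist.index >= 2].sum() / dist.sum()
print(dist.to_dict(), round(repeat_share, 3))
```

With the default `cap=9`, the last bucket corresponds to users with more than 8 orders.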

User travel destinations

orderHistory.orderTime = orderHistory.orderTime.map(lambda x: time_convert(x))

orderHistory['year']=orderHistory.orderTime.str[:4]
orderHistory['month']=orderHistory.orderTime.str[5:7]
orderHistory['day']=orderHistory.orderTime.str[8:10]
orderHistory['date']=orderHistory.orderTime.str[:10]
orderHistory['time']=orderHistory.orderTime.str[11:]
orderHistory['year_month']=orderHistory.orderTime.str[:7]
orderHistory['hour']=orderHistory.orderTime.str[11:13]
from pyecharts.charts import Bar
from pyecharts import options as opts
jingpin_top10 = orderHistory[orderHistory.orderType==1].city.value_counts()[:10]
bar = Bar()
bar.add_xaxis(jingpin_top10.index.tolist())
bar.add_yaxis("精品游十大热门城市", jingpin_top10.values.tolist())
bar.render_notebook()

[Bar chart "精品游十大热门城市" (top 10 cities for premium tours), number of premium orders: Tokyo 445; Osaka 248; Bangkok 163; Taipei 158; Bali 150; Kyoto 147; Melbourne 147; Sydney 133; Kuala Lumpur 131; Paris 130]

from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

# Top 10 destination cities for ordinary tours (orderType == 0)
putong_top10 = orderHistory[orderHistory.orderType==0].city.value_counts()[:10]
bar = Bar({"theme": ThemeType.ESSOS})
bar.add_xaxis(putong_top10.index.tolist())
bar.add_yaxis("Top 10 cities for ordinary tours", putong_top10.values.tolist())
bar.render_notebook()

[Chart: "Top 10 cities for ordinary tours", bar chart of order counts. Singapore 2319, Tokyo 1951, Taipei 1274, Hong Kong 1250, New York 1240, Kuala Lumpur 1237, Sydney 1151, Osaka 1006, Bangkok 900, Melbourne 869.]

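# Order counts per continent: boutique tours (orderType == 1) vs ordinary tours (orderType == 0)
continent_jingpin = orderHistory[orderHistory['orderType']==1].groupby(['continent'],as_index=False).orderid.count()
continent_putong = orderHistory[orderHistory['orderType']==0].groupby(['continent'],as_index=False).orderid.count()
# South America (南美洲) has no ordinary orders; add it with a count of 0 so both series line up on the x-axis.
# The filler row must use the 'orderid' key, because that is the column plotted below.
continent_putong = pd.concat([continent_putong,pd.DataFrame([{'continent':'南美洲','orderid':0}])]).sort_values('continent')
bar = Bar()
bar.add_xaxis(continent_jingpin.continent.tolist())
bar.add_yaxis("Boutique tours by continent", continent_jingpin.orderid.values.tolist())
bar.add_yaxis("Ordinary tours by continent", continent_putong.orderid.values.tolist())
bar.render_notebook()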

[Chart: order counts by continent, two bar series.
Boutique tours: Asia 2204, North America 558, South America 2, Oceania 370, Europe 763, Africa 4.
Ordinary tours: Asia 12936, North America 3638, South America none (null in the original output), Oceania 2878, Europe 2351, Africa 8.]
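Both series in the continent chart share one x-axis, so the two count tables need the same continent order, with 0 for any continent that has no orders of a given type. A minimal sketch of one way to get that with `reindex`, assuming a small made-up `orders` frame (hypothetical data, not the competition files):

```python
import pandas as pd

# Hypothetical toy data standing in for orderHistory (not the real competition data).
orders = pd.DataFrame({
    "orderid":   [1, 2, 3, 4, 5, 6],
    "orderType": [1, 1, 0, 0, 0, 1],
    "continent": ["Asia", "South America", "Asia", "Europe", "Asia", "Europe"],
})

# Count orders per continent for each order type.
boutique = orders[orders["orderType"] == 1].groupby("continent")["orderid"].count()
ordinary = orders[orders["orderType"] == 0].groupby("continent")["orderid"].count()

# Build one shared, sorted category list and fill missing continents with 0.
continents = sorted(set(boutique.index) | set(ordinary.index))
boutique = boutique.reindex(continents, fill_value=0)
ordinary = ordinary.reindex(continents, fill_value=0)

print(continents)
print(boutique.tolist())
print(ordinary.tolist())
```

Because both series are reindexed against the same list, no filler row has to be added by hand, and a continent missing from either side shows up as 0 instead of null.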

# Top 10 destination countries by order count, for boutique (orderType == 1) and ordinary (orderType == 0) tours
country_boutique = orderHistory[orderHistory['orderType']==1].groupby('country').country.count().sort_values(ascending = False)[:10]
country_ordinary = orderHistory[orderHistory['orderType']==0].groupby('country').country.count().sort_values(ascending = False)[:10]
bar = Bar()
bar.add_xaxis(country_boutique.index.tolist())
bar.add_yaxis("Top 10 countries for boutique tours", country_boutique.values.tolist())
bar.render_notebook()

[Chart: "Top 10 countries for boutique tours", bar chart of order counts. Japan 1030, United States 486, Australia 337, Taiwan (China) 325, Thailand 222, France 165, United Kingdom 157, Malaysia 157, Indonesia 150, South Korea 118.]

# Top 10 countries for ordinary tours, default theme
bar = Bar()
bar.add_xaxis(country_ordinary.index.tolist())
bar.add_yaxis("Top 10 countries for ordinary tours", country_ordinary.values.tolist())
bar.render_notebook()

# The same chart built with chained calls and the MACARONS theme
def bar_base_dict_config() -> Bar:
    c = (Bar({"theme": ThemeType.MACARONS})
        .add_xaxis(country_ordinary.index.tolist())
        .add_yaxis("Top 10 countries for ordinary tours", country_ordinary.values.tolist())
        )
    return c

bar_base_dict_config().render_notebook()

[Chart: "Top 10 countries for ordinary tours", bar chart of order counts. Japan 3372, United States 3228, Australia 2675, Singapore 2319, Thailand 1781, Malaysia 1634, Taiwan (China) 1395, Hong Kong (China) 1250, France 760, United Kingdom 738.]

Feature Engineering

import pandas as pd
import numpy as np
from sklearn import preprocessing

import warnings
warnings.filterwarnings('ignore')

user_train = pd.read_csv(r'Data\trainingset\userProfile_train.csv')
action_train = pd.read_csv(r'Data\trainingset\action_train.csv')
comment_train = pd.read_csv(r'Data\trainingset\userComment_train.csv')
orderFuture_train= pd.read_csv(r'Data\trainingset\orderFuture_train.csv')
orderHistory_train= pd.read_csv(r'Data\trainingset\orderHistory_train.csv')

user_test = pd.read_csv(r'Data\test\userProfile_test.csv')
action_test = pd.read_csv(r'Data\test\action_test.csv')
comment_test = pd.read_csv(r'Data\test\userComment_test.csv')
orderFuture_test = pd.read_csv(r'Data\test\orderFuture_test.csv')
orderHistory_test = pd.read_csv(r'Data\test\orderHistory_test.csv')

user = pd.concat([user_train,user_test])
action = pd.concat([action_train,action_test])
comment = pd.concat([comment_train,comment_test])
orderHistory = pd.concat([orderHistory_train,orderHistory_test])
orderFuture = pd.concat([orderFuture_train,orderFuture_test])

orderHistory = orderHistory.sort_values(by=['userid','orderTime'])
# Per-user order count and timestamp statistics (std is NaN for users with a single order, hence fillna(0))
orderHistory_internal_table = (orderHistory.groupby('userid').orderTime
                               .agg(['count','max','min','std','mean'])
                               .reset_index()
                               .rename(columns={'count':'order_count',
                                                'max':'ordertime_max',
                                                'min':'ordertime_min',
                                                'std':'orderTime_std',
                                                'mean':'ordertime_mean'})
                               .fillna(0))
# Number of ordinary (orderType == 0) and boutique (orderType == 1) orders per user
orderHistory_internal_table = orderHistory_internal_table.merge(orderHistory[orderHistory['orderType']==0].groupby('userid').orderid.count().reset_index().rename(columns={'orderid':'ordinary_count'}),how='left',on='userid')
orderHistory_internal_table = orderHistory_internal_table.merge(orderHistory[orderHistory['orderType']==1].groupby('userid').orderid.count().reset_index().rename(columns={'orderid':'unordinary_count'}),how='left',on='userid')
# How many times each user visited each country, continent and city (dtype=int keeps the dummies numeric)
orderHistory_internal_table = orderHistory_internal_table.merge(pd.get_dummies(orderHistory[['userid','country','continent','city']],dtype=int).groupby('userid',as_index=False).sum(),on='userid',how='left')
# Details of each user's most recent order (rows are sorted by orderTime, so the last row per user is the latest).
# The columns get a 'last_' prefix so they do not collide with the visit counts above or with the label column.
last_order = orderHistory.groupby('userid').tail(1)[['userid','orderType','city','country','continent']]
last_order = last_order.rename(columns={'orderType':'last_orderType','city':'last_city','country':'last_country','continent':'last_continent'})
orderHistory_internal_table = orderHistory_internal_table.merge(pd.get_dummies(last_order,dtype=int),on='userid',how='left')

data = orderFuture.copy()  # start from orderFuture

data = data.merge(user, on='userid')  # join the user profiles

data = data.merge(comment, on='userid', how='left')  # join the user comments
data['tags'] = data.tags.apply(lambda x : 0 if pd.isnull(x) else 1)  # reduce tags to a has-tags flag
data['commentsKeyWords'] = data.commentsKeyWords.apply(lambda x:0 if pd.isnull(x) else 1)  # reduce comment keywords to a has-keywords flag
del data['orderid']  # drop the orderid column


action = action.sort_values(by=['userid','actionTime'])  # sort by userid and actionTime
# Intermediate table of action features: per-user action count, last/first timestamp, std and mean of timestamps
action_internal_table = (action.groupby('userid').actionTime
                         .agg(['count','max','min','std','mean'])
                         .reset_index()
                         .rename(columns={'count':'action_count',
                                          'max':'time_last_action',
                                          'min':'time_first_action',
                                          'std':'actiontime_std',
                                          'mean':'actiontime_mean'}))
# Ratio of action types 2-4 to types 5-9 (left unimplemented)

# Type of each user's last 1-20 actions (NaN when the user has fewer actions)
for i in range(20):
    last_type = (action.groupby('userid').actionType
                 .apply(lambda x: x.iloc[-i-1] if len(x) > i else np.nan)
                 .reset_index()
                 .rename(columns={'actionType':'last_but{}_action_type'.format(i)}))
    action_internal_table = action_internal_table.merge(last_type, on='userid', how='left')
# Share of each action type (1-9) among a user's actions
count = action.groupby('userid').actionType.count()
for i in range(1,10):
    action_internal_table = action_internal_table.merge((action[action['actionType']==i].groupby('userid').actionType.count()/count).reset_index().rename(columns={'actionType':'rate_{}'.format(i)}).fillna(0),on='userid',how='left')

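# Timestamp of each user's last 1-20 actions (NaN when the user has fewer actions).
# These columns need their own '_action_time' names: reusing the '_action_type' names would make
# a merge on shared columns match on them and silently drop the timestamps.
for i in range(20):
    last_time = (action.groupby('userid').actionTime
                 .apply(lambda x: x.iloc[-i-1] if len(x) > i else np.nan)
                 .reset_index()
                 .rename(columns={'actionTime':'last_but{}_action_time'.format(i)}))
    action_internal_table = action_internal_table.merge(last_time, on='userid', how='left')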


data = data.merge(action_internal_table,on='userid',how='left')
data = data.merge(orderHistory_internal_table,on='userid',how='left')
data = data.fillna(-999)
data = pd.get_dummies(data)
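The "last k actions" features above are built with one `groupby().apply()` per position, so the action log is rescanned once for every column. As a hedged alternative sketch (not the author's code), the same kind of columns can be produced in a single pass by ranking each user's actions from the end with `cumcount(ascending=False)` and pivoting. The toy `action` frame and the `K` constant below are made up for illustration, and the output column names are illustrative too.

```python
import pandas as pd

# Hypothetical toy action log; column names follow action_train.csv.
action = pd.DataFrame({
    "userid":     [1, 1, 1, 2, 2],
    "actionType": [1, 5, 6, 2, 9],
    "actionTime": [100, 110, 130, 200, 260],
}).sort_values(["userid", "actionTime"])

K = 3  # number of trailing actions to keep (the article uses 20)

# 0 = the user's most recent action, 1 = the one before it, and so on.
action["rank_from_end"] = action.groupby("userid").cumcount(ascending=False)
recent = action[action["rank_from_end"] < K]

# One row per user, one column per position; users with fewer than K actions get NaN.
last_types = recent.pivot(index="userid", columns="rank_from_end", values="actionType")
last_types = last_types.reindex(columns=range(K))
last_types.columns = ["last_but{}_action_type".format(i) for i in range(K)]

last_times = recent.pivot(index="userid", columns="rank_from_end", values="actionTime")
last_times = last_times.reindex(columns=range(K))
last_times.columns = ["last_but{}_action_time".format(i) for i in range(K)]

features = last_types.join(last_times).reset_index()
print(features)
```

The resulting `features` frame has one row per `userid` and can be merged into the feature table with `on='userid', how='left'`, in the same way as the per-position merges above.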

Model Evaluation and Improvement

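import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Pick the label and features by name rather than by position: the column order produced by
# pd.concat depends on the pandas version. The label is 'orderType', or 'orderType_x' if the
# order-history table contributed a column of the same name during the merge.
label_col = 'orderType' if 'orderType' in data.columns else 'orderType_x'
feature_cols = [c for c in data.columns if c not in ('userid', label_col)]
train_rows = data[data['userid'].isin(orderFuture_train.userid)]
test_rows = data[data['userid'].isin(orderFuture_test.userid)]

X_trainval = train_rows[feature_cols]
y_trainval = train_rows[label_col].astype(int)
X_train,X_val,y_train,y_val = train_test_split(X_trainval,y_trainval,random_state=88,stratify=y_trainval)

# learning_rate and eta are the same XGBoost parameter, so only learning_rate is set;
# verbosity=0 replaces the deprecated silent=1.
xgb_cla = xgb.XGBClassifier(learning_rate=0.1,
        n_estimators=1000,
        max_depth=3,
        min_child_weight=5,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        verbosity=0,
        objective='binary:logistic',
        scale_pos_weight=1).fit(X_train,y_train)
print('Validation AUC: {:.4f}'.format(roc_auc_score(y_val,xgb_cla.predict_proba(X_val)[:,1])))

# Predict purchase probabilities for the test users and write the submission file
X_test = test_rows[feature_cols]
predict = xgb_cla.predict_proba(X_test)[:,1]
submission = pd.DataFrame({'userid': test_rows['userid'].values, 'orderType': predict})
submission.to_csv('submission.csv',encoding='utf-8',index=False)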
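A single train/validation split gives one AUC estimate, which can shift noticeably with the random seed. A common complement is stratified K-fold cross-validation. The sketch below is illustrative only: it assumes synthetic data from `make_classification` and a `LogisticRegression` stand-in so that it runs without the competition files. In the article's setting, `X_trainval`, `y_trainval` and the `XGBClassifier` would take their place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary data standing in for X_trainval / y_trainval.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.8, 0.2],
                           random_state=88)

# Stratification keeps the positive rate roughly equal in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=88)
fold_aucs = []
for train_idx, val_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)  # stand-in for the XGBoost model
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[val_idx], proba))

print("AUC per fold:", np.round(fold_aucs, 4))
print("mean AUC: {:.4f} (std {:.4f})".format(np.mean(fold_aucs), np.std(fold_aucs)))
```

Reporting the mean and spread across folds makes it easier to tell whether a change in features or hyperparameters is a real improvement or just split-to-split noise.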
# from sklearn.linear_model import Lasso
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import roc_auc_score
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import MinMaxScaler
# from sklearn.pipeline import make_pipeline


# X_trainval = orderFuture[orderFuture.userid.isin(orderFuture_train.userid)].iloc[:,2:]
# y_trainval = orderFuture[orderFuture.userid.isin(orderFuture_train.userid)].iloc[:,0]
# X_test = orderFuture[orderFuture.userid.isin(orderFuture_test.userid)].iloc[:,2:]


# X_train,X_val,y_train,y_val = train_test_split(X_trainval,y_trainval,stratify = y_trainval,random_state = 42)
# pipe = make_pipeline(MinMaxScaler(),RandomForestClassifier()).fit(X_train,y_train)


# #RFC = RandomForestClassifier().fit(X_train,y_train)
# print('Model AUC: {}'.format(roc_auc_score(y_val,pipe.predict_proba(X_val)[:,1])))
# result = pd.DataFrame({'userid':orderFuture[orderFuture.userid.isin(orderFuture_test.userid)].iloc[:,1],'orderType':pipe.predict(X_test)}).reset_index(drop=True)
# result.to_csv('submission.csv')
# from sklearn.model_selection import GridSearchCV
# from sklearn.pipeline import make_pipeline
# from sklearn.preprocessing import MinMaxScaler
# from sklearn.svm import SVC

# X_trainval = orderFuture[orderFuture.userid.isin(orderFuture_train.userid)].iloc[:,2:]
# y_trainval = orderFuture[orderFuture.userid.isin(orderFuture_train.userid)].iloc[:,0]
# param_grid = {'svc__C':[0.001,0.01,0.1,1,10,100],
#             'svc__gamma':[0.001,0.01,0.1,1,10,100]}

# pipe = make_pipeline(MinMaxScaler(),SVC())
# grid = GridSearchCV(pipe,param_grid=param_grid,cv=5).fit(X_train,y_train)
# grid.score(X_train,y_train)
