Original link: http://tecdat.cn/?p=5521

Data background

A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.

The data set is Churn. The fields are as follows:

state: discrete
account length: continuous
area code: continuous
phone number: discrete
international plan: discrete
voice mail plan: discrete
number vmail messages: continuous
total day minutes: continuous
total day calls: continuous
total day charge: continuous
total eve minutes: continuous
total eve calls: continuous
total eve charge: continuous
total night minutes: continuous
total night calls: continuous
total night charge: continuous
total intl minutes: continuous
total intl calls: continuous
total intl charge: continuous
number customer service calls: continuous
churn: discrete

Data Preparation and Exploration 

View the data overview:

##      state      account.length    area.code        phone.number
##  WV     : 158   Min.   :  1.0   Min.   :408.0    327-1058:   1
##  MN     : 125   1st Qu.: 73.0   1st Qu.:408.0    327-1319:   1
##  AL     : 124   Median :100.0   Median :415.0    327-2040:   1
##  ID     : 119   Mean   :100.3   Mean   :436.9    327-2475:   1
##  VA     : 118   3rd Qu.:127.0   3rd Qu.:415.0    327-3053:   1
##  OH     : 116   Max.   :243.0   Max.   :510.0    327-3587:   1
##  (Other):4240                                   (Other)  :4994

##  international.plan voice.mail.plan number.vmail.messages
##   no :4527           no :3677       Min.   : 0.000
##   yes: 473           yes:1323       1st Qu.: 0.000
##                                     Median : 0.000
##                                     Mean   : 7.755
##                                     3rd Qu.:17.000
##                                     Max.   :52.000

##  total.day.minutes total.day.calls total.day.charge total.eve.minutes
##  Min.   :  0.0     Min.   :  0     Min.   : 0.00    Min.   :  0.0
##  1st Qu.:143.7     1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4
##  Median :180.1     Median :100     Median :30.62    Median :201.0
##  Mean   :180.3     Mean   :100     Mean   :30.65    Mean   :200.6
##  3rd Qu.:216.2     3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1
##  Max.   :351.5     Max.   :165     Max.   :59.76    Max.   :363.7

##  total.eve.calls total.eve.charge total.night.minutes total.night.calls
##  Min.   :  0.0   Min.   : 0.00    Min.   :  0.0       Min.   :  0.00
##  1st Qu.: 87.0   1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00
##  Median :100.0   Median :17.09    Median :200.4       Median :100.00
##  Mean   :100.2   Mean   :17.05    Mean   :200.4       Mean   : 99.92
##  3rd Qu.:114.0   3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00
##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.00

##  total.night.charge total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.000     Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.: 7.510     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300
##  Median : 9.020     Median :10.30      Median : 4.000   Median :2.780
##  Mean   : 9.018     Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:10.560     3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400

##  number.customer.service.calls     churn
##  Min.   :0.00                   False.:4293
##  1st Qu.:1.00                   True. : 707
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00

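The data overview shows that there are no missing values. It also shows that the phone number and area code are variables with no predictive value, so they can be removed.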
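The check and the column removal described here can be sketched as follows. This is a Python sketch, not the article's own code (the outputs in the article look like R console output); the two rows are invented stand-ins for the Churn data, and `count_missing` and `drop_columns` are hypothetical helper names.

```python
# Toy rows standing in for the Churn data; the values are invented for illustration.
rows = [
    {"state": "WV", "area.code": 415, "phone.number": "327-1058",
     "total.day.minutes": 180.1, "churn": "False."},
    {"state": "MN", "area.code": 408, "phone.number": "327-1319",
     "total.day.minutes": 143.7, "churn": "True."},
]


def count_missing(rows):
    """Return {column: number of missing cells}; None and "" count as missing."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            is_missing = value is None or value == ""
            counts[col] = counts.get(col, 0) + int(is_missing)
    return counts


def drop_columns(rows, to_drop):
    """Return copies of the rows without the columns named in to_drop."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


missing = count_missing(rows)
# Identifier-like columns are removed before modelling.
cleaned = drop_columns(rows, {"phone.number", "area.code"})

print(missing)
print(cleaned[0])
```

Dropping the columns on a copy leaves the raw rows available for later inspection.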

Examine the variables graphically

   

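From the results, the number of samples with churn = no is far greater than the number with churn = yes (4,293 versus 707 in the data overview), so non-churners make up the large majority of the sample.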
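The class imbalance can be quantified directly from the churn counts printed in the data overview. The following is a small Python sketch rather than the article's own code; only the two counts are taken from the article.

```python
# Counts of the churn column as printed in the data overview.
churn_counts = {"False.": 4293, "True.": 707}

total = sum(churn_counts.values())
# Share of each class in the whole sample.
shares = {label: n / total for label, n in churn_counts.items()}
# The majority class is the label with the largest count.
majority = max(churn_counts, key=churn_counts.get)

for label, share in shares.items():
    print(f"{label:7s} {share:.2%}")
print("majority class:", majority)
```

A churn share of roughly 14% is worth keeping in mind later, because a model that always predicts the majority class already reaches a high accuracy on data like this.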

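From the results, apart from emailcode (presumably number.vmail.messages) and areacode (area.code), the remaining numeric variables are approximately normally distributed.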
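One rough way to back this reading with numbers is a quartile-based skewness (Bowley's coefficient), which is close to 0 for a symmetric distribution. The sketch below is in Python and is not the article's method; the quartiles are copied from the summary output, and the variables chosen are only examples. Symmetry is necessary for normality but does not prove it.

```python
def bowley_skewness(q1, median, q3):
    """Quartile skewness: 0 for symmetric quartiles, +1 or -1 at the extremes."""
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero")
    return (q3 + q1 - 2 * median) / iqr


# (1st quartile, median, 3rd quartile) as printed in the summary output.
quartiles = {
    "total.day.minutes": (143.7, 180.1, 216.2),
    "total.eve.minutes": (166.4, 201.0, 234.1),
    "number.vmail.messages": (0.0, 0.0, 17.0),
    "area.code": (408.0, 415.0, 415.0),
}

skew = {name: bowley_skewness(*q) for name, q in quartiles.items()}
for name, value in skew.items():
    print(f"{name:24s} {value:+.3f}")
```

The usage variables come out near 0, while number.vmail.messages and area.code sit at the extremes of the coefficient, which matches the two exceptions named in the text.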

##  account.length    area.code     number.vmail.messages total.day.minutes
##  Min.   :  1.0   Min.   :408.0   Min.   : 0.000        Min.   :  0.0
##  1st Qu.: 73.0   1st Qu.:408.0   1st Qu.: 0.000        1st Qu.:143.7
##  Median :100.0   Median :415.0   Median : 0.000        Median :180.1
##  Mean   :100.3   Mean   :436.9   Mean   : 7.755        Mean   :180.3
##  3rd Qu.:127.0   3rd Qu.:415.0   3rd Qu.:17.000        3rd Qu.:216.2
##  Max.   :243.0   Max.   :510.0   Max.   :52.000        Max.   :351.5

##  total.day.calls total.day.charge total.eve.minutes total.eve.calls
##  Min.   :  0     Min.   : 0.00    Min.   :  0.0     Min.   :  0.0
##  1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4     1st Qu.: 87.0
##  Median :100     Median :30.62    Median :201.0     Median :100.0
##  Mean   :100     Mean   :30.65    Mean   :200.6     Mean   :100.2
##  3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1     3rd Qu.:114.0
##  Max.   :165     Max.   :59.76    Max.   :363.7     Max.   :170.0

##  total.eve.charge total.night.minutes total.night.calls total.night.charge
##  Min.   : 0.00    Min.   :  0.0       Min.   :  0.00    Min.   : 0.000
##  1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00    1st Qu.: 7.510
##  Median :17.09    Median :200.4       Median :100.00    Median : 9.020
##  Mean   :17.05    Mean   :200.4       Mean   : 99.92    Mean   : 9.018
##  3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00    3rd Qu.:10.560
##  Max.   :30.91    Max.   :395.0       Max.   :175.00    Max.   :17.770

##  total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300
##  Median :10.30      Median : 4.000   Median :2.780
##  Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :20.00      Max.   :20.000   Max.   :5.400

##  number.customer.service.calls
##  Min.   :0.00
##  1st Qu.:1.00
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00

Relationships between variables

The results show a significant positive linear relationship between the two variables.


 

Using the statistics node, report

##                               account.length    area.code
## account.length                  1.0000000000 -0.018054187
## area.code                      -0.0180541874  1.000000000
## number.vmail.messages          -0.0145746663 -0.003398983
## total.day.minutes              -0.0010174908 -0.019118245
## total.day.calls                 0.0282402279 -0.019313854
## total.day.charge               -0.0010191980 -0.019119256
## total.eve.minutes              -0.0095913331  0.007097877
## total.eve.calls                 0.0091425790 -0.012299947
## total.eve.charge               -0.0095873958  0.007114130
## total.night.minutes             0.0006679112  0.002083626
## total.night.calls              -0.0078254785  0.014656846
## total.night.charge              0.0006558937  0.002070264
## total.intl.minutes              0.0012908394 -0.004153729
## total.intl.calls                0.0142772733 -0.013623309
## total.intl.charge               0.0012918112 -0.004219099
## number.customer.service.calls  -0.0014447918  0.020920513

##                               number.vmail.messages total.day.minutes
## account.length                        -0.0145746663      -0.001017491
## area.code                             -0.0033989831      -0.019118245
## number.vmail.messages                  1.0000000000       0.005381376
## total.day.minutes                      0.0053813760       1.000000000
## total.day.calls                        0.0008831280       0.001935149
## total.day.charge                       0.0053767959       0.999999951
## total.eve.minutes                      0.0194901208      -0.010750427
## total.eve.calls                       -0.0039543728       0.008128130
## total.eve.charge                       0.0194959757      -0.010760022
## total.night.minutes                    0.0055413838       0.011798660
## total.night.calls                      0.0026762202       0.004236100
## total.night.charge                     0.0055349281       0.011782533
## total.intl.minutes                     0.0024627018      -0.019485746
## total.intl.calls                       0.0001243302      -0.001303123
## total.intl.charge                      0.0025051773      -0.019414797
## number.customer.service.calls         -0.0070856427       0.002732576

##                               total.day.calls total.day.charge
## account.length                   0.0282402279     -0.001019198
## area.code                       -0.0193138545     -0.019119256
## number.vmail.messages            0.0008831280      0.005376796
## total.day.minutes                0.0019351487      0.999999951
## total.day.calls                  1.0000000000      0.001935884
## total.day.charge                 0.0019358844      1.000000000
## total.eve.minutes               -0.0006994115     -0.010747297
## total.eve.calls                  0.0037541787      0.008129319
## total.eve.charge                -0.0006952217     -0.010756893
## total.night.minutes              0.0028044650      0.011801434
## total.night.calls               -0.0083083467      0.004234934
## total.night.charge               0.0028018169      0.011785301
## total.intl.minutes               0.0130972198     -0.019489700
## total.intl.calls                 0.0108928533     -0.001306635
## total.intl.charge                0.0131613976     -0.019418755
## number.customer.service.calls   -0.0107394951      0.002726370

##                               total.eve.minutes total.eve.calls
## account.length                    -0.0095913331     0.009142579
## area.code                          0.0070978766    -0.012299947
## number.vmail.messages              0.0194901208    -0.003954373
## total.day.minutes                 -0.0107504274     0.008128130
## total.day.calls                   -0.0006994115     0.003754179
## total.day.charge                  -0.0107472968     0.008129319
## total.eve.minutes                  1.0000000000     0.002763019
## total.eve.calls                    0.0027630194     1.000000000
## total.eve.charge                   0.9999997749     0.002778097
## total.night.minutes               -0.0166391160     0.001781411
## total.night.calls                  0.0134202163    -0.013682341
## total.night.charge                -0.0166420421     0.001799380
## total.intl.minutes                 0.0001365487    -0.007458458
## total.intl.calls                   0.0083881559     0.005574500
## total.intl.charge                  0.0001593155    -0.007507151
## number.customer.service.calls     -0.0138234228     0.006234831

##                               total.eve.charge total.night.minutes
## account.length                   -0.0095873958        0.0006679112
## area.code                         0.0071141298        0.0020836263
## number.vmail.messages             0.0194959757        0.0055413838
## total.day.minutes                -0.0107600217        0.0117986600
## total.day.calls                  -0.0006952217        0.0028044650
## total.day.charge                 -0.0107568931        0.0118014339
## total.eve.minutes                 0.9999997749       -0.0166391160
## total.eve.calls                   0.0027780971        0.0017814106
## total.eve.charge                  1.0000000000       -0.0166489191
## total.night.minutes              -0.0166489191        1.0000000000
## total.night.calls                 0.0134220174        0.0269718182
## total.night.charge               -0.0166518367        0.9999992072
## total.intl.minutes                0.0001320238       -0.0067209669
## total.intl.calls                  0.0083930603       -0.0172140162
## total.intl.charge                 0.0001547783       -0.0066545873
## number.customer.service.calls    -0.0138363623       -0.0085325365
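If highly correlated variables are all kept, they may cause multicollinearity, so one variable from each highly correlated pair needs to be removed. In the correlation matrix, for example, total.day.minutes and total.day.charge correlate at about 0.99999995, and total.eve.minutes and total.eve.charge at about 0.9999998.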
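The removal step can be sketched as a greedy filter on the Pearson correlation. This is a Python sketch under stated assumptions, not the article's own code: the five data points are invented, the 0.17 charge rate is inferred from the summary output (mean charge 30.65 against mean minutes 180.3), the 0.95 threshold is an arbitrary choice, and `drop_highly_correlated` is a hypothetical helper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_highly_correlated(columns, threshold=0.95):
    """Keep a column only if |r| with every already kept column is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Invented toy values; the charge is modelled as 0.17 per minute (an assumption).
minutes = [143.7, 180.1, 216.2, 351.5, 95.0]
columns = {
    "total.day.minutes": minutes,
    "total.day.charge": [round(0.17 * m, 2) for m in minutes],
    "total.day.calls": [100, 87, 113, 90, 165],
}

kept = drop_highly_correlated(columns)
print(kept)
```

The filter keeps whichever member of a correlated pair it meets first, so the column order decides whether minutes or charge survives.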

Data Manipulation

The results show some correlation between total.day.calls and total.day.charge.
In particular, there is a negative correlation among the observations where voice.mail.plan is no.

Discretize (make categorical) a relevant numeric variable

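Discretize the variable. In the model output further down, total.day.minutes appears in binned form as total.day.minutes1, with levels that include medium and short.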
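A discretization of this kind can be sketched as below. The article does not state its bin edges, so the cut points here are an assumption (the first and third quartiles of total.day.minutes from the summary output), as is the label long for the top bin. This is Python, not the article's own code.

```python
from bisect import bisect_right

# Assumed cut points: 1st and 3rd quartile of total.day.minutes from the summary.
CUT_POINTS = [143.7, 216.2]
# "short" and "medium" appear in the model output; "long" is an assumed third level.
LABELS = ["short", "medium", "long"]


def discretize(value, cut_points=CUT_POINTS, labels=LABELS):
    """Map a numeric value to the label of the interval it falls into.

    Intervals are [-inf, c1), [c1, c2), [c2, +inf), so a value equal to a
    cut point goes into the higher bin.
    """
    return labels[bisect_right(cut_points, value)]


binned = [discretize(v) for v in (100.0, 143.7, 180.1, 300.0)]
print(binned)
```

Using `bisect_right` keeps the lookup correct for any number of sorted cut points, provided there is exactly one more label than there are cut points.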

Construct a distribution of the variable with a churn overlay

Construct a histogram of the variable with a churn overlay
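Without a plotting library, a churn overlay can be approximated by counting churners and non-churners inside each bin. The sketch below is Python, not the article's own code; the eight observations are invented, and `churn_overlay` and `churn_rate_by_bin` are hypothetical helpers.

```python
from collections import Counter


def churn_overlay(bins, churn):
    """Cross-tabulate bin labels against churn labels: {bin: {churn: count}}."""
    table = {}
    for (bin_label, churn_label), n in Counter(zip(bins, churn)).items():
        table.setdefault(bin_label, {})[churn_label] = n
    return table


def churn_rate_by_bin(table, positive="True."):
    """Share of churners inside each bin."""
    return {b: cells.get(positive, 0) / sum(cells.values())
            for b, cells in table.items()}


# Invented toy observations, for illustration only.
bins = ["short", "short", "medium", "medium", "medium", "long", "long", "long"]
churn = ["False.", "False.", "False.", "False.", "True.", "True.", "True.", "False."]

table = churn_overlay(bins, churn)
rates = churn_rate_by_bin(table)

# Text histogram: '#' marks a churner, '.' a non-churner.
for b, cells in table.items():
    print(f"{b:7s} {'#' * cells.get('True.', 0)}{'.' * cells.get('False.', 0)}")
```

Comparing the churn share across bins, rather than the raw counts, avoids being misled by bins that simply contain more customers.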

Find a pair of numeric variables which are interesting with respect to churn.

The results show some correlation between total.day.calls and total.day.charge.

Model Building

In particular, the correlation appears among the observations with churn = no.
 

##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    0.3082150  0.0735760   4.189 2.85e-05 ***
## stateAL                        0.0151188  0.0462343   0.327 0.743680
## stateAR                        0.0894792  0.0490897   1.823 0.068399 .
## stateAZ                        0.0329566  0.0494195   0.667 0.504883
## stateCA                        0.1951511  0.0567439   3.439 0.000588 ***
## international.plan yes         0.3059341  0.0151677  20.170  < 2e-16 ***
## voice.mail.plan yes           -0.1375056  0.0337533  -4.074 4.70e-05 ***
## number.vmail.messages          0.0017068  0.0010988   1.553 0.120402
## total.day.minutes              0.3796323  0.2629027   1.444 0.148802
## total.day.calls                0.0002191  0.0002235   0.981 0.326781
## total.day.charge              -2.2207671  1.5464583  -1.436 0.151056
## total.eve.minutes              0.0288233  0.1307496   0.220 0.825533
## total.eve.calls               -0.0001585  0.0002238  -0.708 0.478915
## total.eve.charge              -0.3316041  1.5382391  -0.216 0.829329
## total.night.minutes            0.0083224  0.0695916   0.120 0.904814
## total.night.calls             -0.0001824  0.0002225  -0.820 0.412290
## total.night.charge            -0.1760782  1.5464674  -0.114 0.909355
## total.intl.minutes            -0.0104679  0.4192270  -0.025 0.980080
## total.intl.calls              -0.0063448  0.0018062  -3.513 0.000447 ***
## total.intl.charge              0.0676460  1.5528267   0.044 0.965254
## number.customer.service.calls  0.0566474  0.0033945  16.688  < 2e-16 ***
## total.day.minutes1medium       0.0502681  0.0160228   3.137 0.001715 **
## total.day.minutes1short        0.2404020  0.0322293   7.459 1.02e-13 ***
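From the results, state (stateCA among the levels shown), total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. The coefficient table also marks international.plan and voice.mail.plan as significant at the 0.001 level.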
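The t value column in the coefficient table is the estimate divided by its standard error, which can be checked for a few rows. The sketch below is Python, not the article's own code; the four rows are copied from the table, and the cutoff of 1.96 is the usual large-sample approximation for a two-sided 5% test, used here as an assumption instead of the exact t distribution.

```python
# (Estimate, Std. Error) pairs copied from the coefficient table.
coefficients = {
    "international.plan yes": (0.3059341, 0.0151677),
    "number.customer.service.calls": (0.0566474, 0.0033945),
    "total.intl.calls": (-0.0063448, 0.0018062),
    "total.day.charge": (-2.2207671, 1.5464583),
}

# t value = estimate / standard error.
t_values = {name: est / se for name, (est, se) in coefficients.items()}

# Approximate two-sided 5% cutoff (normal approximation, an assumption).
CUTOFF = 1.96
large = [name for name, t in t_values.items() if abs(t) > CUTOFF]

for name, t in t_values.items():
    print(f"{name:30s} t = {t:8.3f}")
print("beyond cutoff:", large)
```

The total.day.charge row shows a large estimate paired with a large standard error, which gives a small t value despite the size of the coefficient.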

Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn

##         Direction.2005
## knn.pred   1   2
##        1 760  97
##        2 100  43
[1] 0.803
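A confusion matrix is a visualization tool used especially in supervised learning; in unsupervised learning it is usually called a matching matrix. In the common convention, each column of the matrix represents the instances of a predicted class and each row represents the instances of an actual class. In the tables printed here the layout is the other way round: the rows are the predictions (knn.pred) and the columns are the actual classes.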
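A minimal sketch of how a K-NN prediction and a confusion matrix in this layout (rows are predictions, columns are actual classes) could be produced is shown below. It is written in Python and is not the article's own code; the two-dimensional points, the labels 1 and 2, and k = 3 are all invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, point, k=3):
    """Predict the label of point by majority vote among its k nearest neighbours."""
    neighbours = sorted(zip(train_x, train_y),
                        key=lambda pair: math.dist(pair[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def confusion_matrix(predicted, actual, labels):
    """Return table[predicted_label][actual_label] = count."""
    table = {p: {a: 0 for a in labels} for p in labels}
    for p, a in zip(predicted, actual):
        table[p][a] += 1
    return table


# Invented toy data: class 1 clusters near (0, 0), class 2 near (1, 1).
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [1, 1, 1, 2, 2, 2]
test_x = [(0.05, 0.1), (1.0, 0.95), (0.4, 0.4)]
test_y = [1, 2, 2]

predicted = [knn_predict(train_x, train_y, p) for p in test_x]
table = confusion_matrix(predicted, test_y, labels=[1, 2])
accuracy = sum(table[c][c] for c in table) / len(test_y)

print(predicted)
print(table)
print(round(accuracy, 3))
```

Because K-NN relies on distances, features measured on very different scales would normally be standardized first; the toy points here already share one scale.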
##         Direction.2005
## knn.pred   1   2
##        1 827 104
##        2  33  36
[1] 0.863
On the test set, the accuracy reaches 86.3%.
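The two printed accuracies can be re-derived from the cells of the confusion matrices, and the same cells give two figures that help put the accuracy in context: the share of the larger actual class and the recall on class 2. The sketch below is Python, not the article's own code; the cell counts are copied from the two tables, and the helper names are hypothetical.

```python
def accuracy(table):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(table[label][label] for label in table)
    total = sum(sum(row.values()) for row in table.values())
    return correct / total


def majority_baseline(table):
    """Accuracy of always predicting the most frequent actual class."""
    labels = list(table)
    column_totals = {a: sum(table[p][a] for p in labels) for a in labels}
    return max(column_totals.values()) / sum(column_totals.values())


def recall(table, label):
    """Share of actual instances of label that were predicted as label."""
    actual_total = sum(table[p][label] for p in table)
    return table[label][label] / actual_total


# table[predicted][actual], copied from the two printed confusion matrices.
first = {1: {1: 760, 2: 97}, 2: {1: 100, 2: 43}}
second = {1: {1: 827, 2: 104}, 2: {1: 33, 2: 36}}

print(accuracy(first), accuracy(second))
print(majority_baseline(second))
print(round(recall(second, 2), 3))
```

In the second table 860 of the 1,000 cases belong to class 1, so the 86.3% accuracy is best read next to that 86.0% share and to the class 2 recall of 36 out of 140.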

Findings

We find some correlation between total.day.calls and total.day.charge, particularly among the observations with churn = no. We also find that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. Finally, the K-NN model reaches an accuracy of about 80% on the training set and about 86% on the test set, which suggests that the model predicts well.
 
