Original link: http://tecdat.cn/?p=5521
Data background
A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.
The data set is Churn. The fields are as follows:
state: discrete
account length: continuous
area code: continuous
phone number: discrete
international plan: discrete
voice mail plan: discrete
number vmail messages: continuous
total day minutes: continuous
total day calls: continuous
total day charge: continuous
total eve minutes: continuous
total eve calls: continuous
total eve charge: continuous
total night minutes: continuous
total night calls: continuous
total night charge: continuous
total intl minutes: continuous
total intl calls: continuous
total intl charge: continuous
number customer service calls: continuous
churn: discrete
Data Preparation and Exploration
- View the data summary
- ## state account.length area.code phone.number
- ## WV : 158 Min. : 1.0 Min. :408.0 327-1058: 1
- ## MN : 125 1st Qu.: 73.0 1st Qu.:408.0 327-1319: 1
- ## AL : 124 Median :100.0 Median :415.0 327-2040: 1
- ## ID : 119 Mean :100.3 Mean :436.9 327-2475: 1
- ## VA : 118 3rd Qu.:127.0 3rd Qu.:415.0 327-3053: 1
- ## OH : 116 Max. :243.0 Max. :510.0 327-3587: 1
- ## (Other):4240 (Other) :4994
- ## international.plan voice.mail.plan number.vmail.messages
- ## no :4527 no :3677 Min. : 0.000
- ## yes: 473 yes:1323 1st Qu.: 0.000
- ## Median : 0.000
- ## Mean : 7.755
- ## 3rd Qu.:17.000
- ## Max. :52.000
- ## total.day.minutes total.day.calls total.day.charge total.eve.minutes
- ## Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0
- ## 1st Qu.:143.7 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4
- ## Median :180.1 Median :100 Median :30.62 Median :201.0
- ## Mean :180.3 Mean :100 Mean :30.65 Mean :200.6
- ## 3rd Qu.:216.2 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1
- ## Max. :351.5 Max. :165 Max. :59.76 Max. :363.7
- ## total.eve.calls total.eve.charge total.night.minutes total.night.calls
- ## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
- ## 1st Qu.: 87.0 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00
- ## Median :100.0 Median :17.09 Median :200.4 Median :100.00
- ## Mean :100.2 Mean :17.05 Mean :200.4 Mean : 99.92
- ## 3rd Qu.:114.0 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00
- ## Max. :170.0 Max. :30.91 Max. :395.0 Max. :175.00
- ## total.night.charge total.intl.minutes total.intl.calls total.intl.charge
- ## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. :0.000
- ## 1st Qu.: 7.510 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
- ## Median : 9.020 Median :10.30 Median : 4.000 Median :2.780
- ## Mean : 9.018 Mean :10.26 Mean : 4.435 Mean :2.771
- ## 3rd Qu.:10.560 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
- ## Max. :17.770 Max. :20.00 Max. :20.000 Max. :5.400
- ## number.customer.service.calls churn
- ## Min. :0.00 False.:4293
- ## 1st Qu.:1.00 True. : 707
- ## Median :1.00
- ## Mean :1.57
- ## 3rd Qu.:2.00
- ## Max. :9.00
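The data summary shows that there are no missing values. It also shows that the phone number and area code carry no predictive value, so both variables can be dropped.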
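The check described above can be sketched as follows. This is a minimal Python illustration, not the original analysis code (the output in this article comes from R). The two sample rows are hypothetical and only mimic the shape of the Churn data; `count_missing` and `drop_columns` are helper names invented for this sketch.

```python
# Hypothetical sample rows; the values only mimic the shape of the Churn data.
rows = [
    {"state": "WV", "area.code": 415, "phone.number": "327-1058",
     "total.day.minutes": 180.1, "churn": "False."},
    {"state": "MN", "area.code": 408, "phone.number": "327-1319",
     "total.day.minutes": 143.7, "churn": "True."},
]

def count_missing(rows):
    """Count None or empty-string values per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            is_missing = value is None or value == ""
            counts[col] = counts.get(col, 0) + int(is_missing)
    return counts

def drop_columns(rows, to_drop):
    """Return copies of the rows without the listed columns."""
    return [{c: v for c, v in row.items() if c not in to_drop} for row in rows]

missing = count_missing(rows)
cleaned = drop_columns(rows, {"phone.number", "area.code"})

print(missing)             # per-column count of missing values
print(sorted(cleaned[0]))  # remaining column names
```

The phone number is a unique identifier per customer, which is why it cannot generalize to new customers.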
Examine the variables graphically
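These results show that far more samples have churn = no than churn = yes (4293 versus 707 in the summary above), so non-churners make up the large majority of the sample.
These results also show that, apart from emailcode (presumably number.vmail.messages) and area.code, the numeric variables are approximately normally distributed.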
- ## account.length area.code number.vmail.messages total.day.minutes
- ## Min. : 1.0 Min. :408.0 Min. : 0.000 Min. : 0.0
- ## 1st Qu.: 73.0 1st Qu.:408.0 1st Qu.: 0.000 1st Qu.:143.7
- ## Median :100.0 Median :415.0 Median : 0.000 Median :180.1
- ## Mean :100.3 Mean :436.9 Mean : 7.755 Mean :180.3
- ## 3rd Qu.:127.0 3rd Qu.:415.0 3rd Qu.:17.000 3rd Qu.:216.2
- ## Max. :243.0 Max. :510.0 Max. :52.000 Max. :351.5
- ## total.day.calls total.day.charge total.eve.minutes total.eve.calls
- ## Min. : 0 Min. : 0.00 Min. : 0.0 Min. : 0.0
- ## 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4 1st Qu.: 87.0
- ## Median :100 Median :30.62 Median :201.0 Median :100.0
- ## Mean :100 Mean :30.65 Mean :200.6 Mean :100.2
- ## 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1 3rd Qu.:114.0
- ## Max. :165 Max. :59.76 Max. :363.7 Max. :170.0
- ## total.eve.charge total.night.minutes total.night.calls total.night.charge
- ## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
- ## 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00 1st Qu.: 7.510
- ## Median :17.09 Median :200.4 Median :100.00 Median : 9.020
- ## Mean :17.05 Mean :200.4 Mean : 99.92 Mean : 9.018
- ## 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00 3rd Qu.:10.560
- ## Max. :30.91 Max. :395.0 Max. :175.00 Max. :17.770
- ## total.intl.minutes total.intl.calls total.intl.charge
- ## Min. : 0.00 Min. : 0.000 Min. :0.000
- ## 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
- ## Median :10.30 Median : 4.000 Median :2.780
- ## Mean :10.26 Mean : 4.435 Mean :2.771
- ## 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
- ## Max. :20.00 Max. :20.000 Max. :5.400
- ## number.customer.service.calls
- ## Min. :0.00
- ## 1st Qu.:1.00
- ## Median :1.00
- ## Mean :1.57
- ## 3rd Qu.:2.00
- ## Max. :9.00
Relationships between variables
The results show a clear positive linear relationship between the two variables.
Using the statistics node, report
- ## account.length area.code
- ## account.length 1.0000000000 -0.018054187
- ## area.code -0.0180541874 1.000000000
- ## number.vmail.messages -0.0145746663 -0.003398983
- ## total.day.minutes -0.0010174908 -0.019118245
- ## total.day.calls 0.0282402279 -0.019313854
- ## total.day.charge -0.0010191980 -0.019119256
- ## total.eve.minutes -0.0095913331 0.007097877
- ## total.eve.calls 0.0091425790 -0.012299947
- ## total.eve.charge -0.0095873958 0.007114130
- ## total.night.minutes 0.0006679112 0.002083626
- ## total.night.calls -0.0078254785 0.014656846
- ## total.night.charge 0.0006558937 0.002070264
- ## total.intl.minutes 0.0012908394 -0.004153729
- ## total.intl.calls 0.0142772733 -0.013623309
- ## total.intl.charge 0.0012918112 -0.004219099
- ## number.customer.service.calls -0.0014447918 0.020920513
- ## number.vmail.messages total.day.minutes
- ## account.length -0.0145746663 -0.001017491
- ## area.code -0.0033989831 -0.019118245
- ## number.vmail.messages 1.0000000000 0.005381376
- ## total.day.minutes 0.0053813760 1.000000000
- ## total.day.calls 0.0008831280 0.001935149
- ## total.day.charge 0.0053767959 0.999999951
- ## total.eve.minutes 0.0194901208 -0.010750427
- ## total.eve.calls -0.0039543728 0.008128130
- ## total.eve.charge 0.0194959757 -0.010760022
- ## total.night.minutes 0.0055413838 0.011798660
- ## total.night.calls 0.0026762202 0.004236100
- ## total.night.charge 0.0055349281 0.011782533
- ## total.intl.minutes 0.0024627018 -0.019485746
- ## total.intl.calls 0.0001243302 -0.001303123
- ## total.intl.charge 0.0025051773 -0.019414797
- ## number.customer.service.calls -0.0070856427 0.002732576
- ## total.day.calls total.day.charge
- ## account.length 0.0282402279 -0.001019198
- ## area.code -0.0193138545 -0.019119256
- ## number.vmail.messages 0.0008831280 0.005376796
- ## total.day.minutes 0.0019351487 0.999999951
- ## total.day.calls 1.0000000000 0.001935884
- ## total.day.charge 0.0019358844 1.000000000
- ## total.eve.minutes -0.0006994115 -0.010747297
- ## total.eve.calls 0.0037541787 0.008129319
- ## total.eve.charge -0.0006952217 -0.010756893
- ## total.night.minutes 0.0028044650 0.011801434
- ## total.night.calls -0.0083083467 0.004234934
- ## total.night.charge 0.0028018169 0.011785301
- ## total.intl.minutes 0.0130972198 -0.019489700
- ## total.intl.calls 0.0108928533 -0.001306635
- ## total.intl.charge 0.0131613976 -0.019418755
- ## number.customer.service.calls -0.0107394951 0.002726370
- ## total.eve.minutes total.eve.calls
- ## account.length -0.0095913331 0.009142579
- ## area.code 0.0070978766 -0.012299947
- ## number.vmail.messages 0.0194901208 -0.003954373
- ## total.day.minutes -0.0107504274 0.008128130
- ## total.day.calls -0.0006994115 0.003754179
- ## total.day.charge -0.0107472968 0.008129319
- ## total.eve.minutes 1.0000000000 0.002763019
- ## total.eve.calls 0.0027630194 1.000000000
- ## total.eve.charge 0.9999997749 0.002778097
- ## total.night.minutes -0.0166391160 0.001781411
- ## total.night.calls 0.0134202163 -0.013682341
- ## total.night.charge -0.0166420421 0.001799380
- ## total.intl.minutes 0.0001365487 -0.007458458
- ## total.intl.calls 0.0083881559 0.005574500
- ## total.intl.charge 0.0001593155 -0.007507151
- ## number.customer.service.calls -0.0138234228 0.006234831
- ## total.eve.charge total.night.minutes
- ## account.length -0.0095873958 0.0006679112
- ## area.code 0.0071141298 0.0020836263
- ## number.vmail.messages 0.0194959757 0.0055413838
- ## total.day.minutes -0.0107600217 0.0117986600
- ## total.day.calls -0.0006952217 0.0028044650
- ## total.day.charge -0.0107568931 0.0118014339
- ## total.eve.minutes 0.9999997749 -0.0166391160
- ## total.eve.calls 0.0027780971 0.0017814106
- ## total.eve.charge 1.0000000000 -0.0166489191
- ## total.night.minutes -0.0166489191 1.0000000000
- ## total.night.calls 0.0134220174 0.0269718182
- ## total.night.charge -0.0166518367 0.9999992072
- ## total.intl.minutes 0.0001320238 -0.0067209669
- ## total.intl.calls 0.0083930603 -0.0172140162
- ## total.intl.charge 0.0001547783 -0.0066545873
- ## number.customer.service.calls -0.0138363623 -0.0085325365
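Keeping highly correlated variables in the model can cause multicollinearity, so one variable from each highly correlated pair should be removed. In the matrix above, each minutes variable is almost perfectly correlated with its charge variable (for example, total.day.minutes and total.day.charge at 0.99999995).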
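The screening step described above can be sketched in Python as follows. The 0.9 threshold, the 0.17 per-minute rate, and the five sample values are assumptions made for illustration; `pearson` and `highly_correlated` are hypothetical helper names, not functions from the original analysis.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def highly_correlated(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs

# Illustrative values only; the charge is built as an assumed 0.17 per minute.
minutes = [143.7, 180.1, 216.2, 351.5, 99.0]
columns = {
    "total.day.minutes": minutes,
    "total.day.charge": [round(0.17 * m, 2) for m in minutes],
    "total.day.calls": [100, 87, 113, 95, 104],
}

flagged = highly_correlated(columns)
for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:.6f}")
```

Only one variable of each flagged pair needs to be kept, since the other adds almost no independent information.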
Data Manipulation
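The results suggest some relationship between total.day.calls and total.day.charge, although their linear correlation in the matrix above is only about 0.002.
In particular, the observations with voice.mail.plan = no show a negative correlation.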
Discretize (make categorical) a relevant numeric variable
Discretize the variable.
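A minimal Python sketch of such a discretization is shown here. The labels short, medium, and long follow the factor levels that appear later in the model output (total.day.minutes1medium, total.day.minutes1short); the cut points 120 and 240 are purely hypothetical, since the article does not state the ones actually used.

```python
def discretize(value, cut_points=(120.0, 240.0), labels=("short", "medium", "long")):
    """Map a numeric value to a label; the cut points are hypothetical."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

# Minimum, quartiles, and maximum of total.day.minutes from the summary above.
day_minutes = [0.0, 143.7, 180.1, 216.2, 351.5]
bins = [discretize(m) for m in day_minutes]
print(list(zip(day_minutes, bins)))
```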
construct a distribution of the variable with a churn overlay
construct a histogram of the variable with a churn overlay
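A churn overlay amounts to counting, within each category or bin, how many customers churned and how many did not. The sketch below does this in Python on made-up records and prints a small text bar chart; the records and the helper name `churn_rate_by_bin` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical (bin, churn) records; not taken from the Churn data.
records = [
    ("short", "True."), ("short", "False."),
    ("medium", "False."), ("medium", "False."), ("medium", "True."),
    ("long", "False."), ("long", "True."), ("long", "True."),
]

def churn_rate_by_bin(records):
    """Share of churners (churn == 'True.') within each bin."""
    totals = Counter(b for b, _ in records)
    churned = Counter(b for b, c in records if c == "True.")
    return {b: churned[b] / totals[b] for b in totals}

counts = Counter(records)
rates = churn_rate_by_bin(records)

for b in rates:
    stayed = counts[(b, "False.")]
    left = counts[(b, "True.")]
    # '.' marks customers who stayed, '#' marks customers who churned
    print(f"{b:<7}{'.' * stayed}{'#' * left}  churn rate {rates[b]:.2f}")
```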
Find a pair of numeric variables which are interesting with respect to churn.
The results suggest some relationship between total.day.calls and total.day.charge.
Model Building
In particular, the relationship appears among the observations with churn = no.
- ## Estimate Std. Error t value Pr(>|t|)
- ## (Intercept) 0.3082150 0.0735760 4.189 2.85e-05 ***
- ## stateAL 0.0151188 0.0462343 0.327 0.743680
- ## stateAR 0.0894792 0.0490897 1.823 0.068399 .
- ## stateAZ 0.0329566 0.0494195 0.667 0.504883
- ## stateCA 0.1951511 0.0567439 3.439 0.000588 ***
- ## international.plan yes 0.3059341 0.0151677 20.170 < 2e-16 ***
- ## voice.mail.plan yes -0.1375056 0.0337533 -4.074 4.70e-05 ***
- ## number.vmail.messages 0.0017068 0.0010988 1.553 0.120402
- ## total.day.minutes 0.3796323 0.2629027 1.444 0.148802
- ## total.day.calls 0.0002191 0.0002235 0.981 0.326781
- ## total.day.charge -2.2207671 1.5464583 -1.436 0.151056
- ## total.eve.minutes 0.0288233 0.1307496 0.220 0.825533
- ## total.eve.calls -0.0001585 0.0002238 -0.708 0.478915
- ## total.eve.charge -0.3316041 1.5382391 -0.216 0.829329
- ## total.night.minutes 0.0083224 0.0695916 0.120 0.904814
- ## total.night.calls -0.0001824 0.0002225 -0.820 0.412290
- ## total.night.charge -0.1760782 1.5464674 -0.114 0.909355
- ## total.intl.minutes -0.0104679 0.4192270 -0.025 0.980080
- ## total.intl.calls -0.0063448 0.0018062 -3.513 0.000447 ***
- ## total.intl.charge 0.0676460 1.5528267 0.044 0.965254
- ## number.customer.service.calls 0.0566474 0.0033945 16.688 < 2e-16 ***
- ## total.day.minutes1medium 0.0502681 0.0160228 3.137 0.001715 **
- ## total.day.minutes1short 0.2404020 0.0322293 7.459 1.02e-13 ***
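The results show that state (for example, stateCA), total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have significant effects. In the table above, international.plan yes and voice.mail.plan yes are also significant at the 0.001 level.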
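The t values in the coefficient table are simply each estimate divided by its standard error. The short Python check below reproduces a few of them from the numbers printed above; the 1.96 cutoff is the usual large-sample approximation for a two-sided 5% test and is an assumption of this sketch, not something stated in the article.

```python
# (estimate, standard error) pairs copied from the coefficient table above.
coefficients = {
    "international.plan yes": (0.3059341, 0.0151677),
    "total.intl.calls": (-0.0063448, 0.0018062),
    "number.customer.service.calls": (0.0566474, 0.0033945),
    "total.day.charge": (-2.2207671, 1.5464583),
}

# t value = estimate / standard error
t_values = {name: est / se for name, (est, se) in coefficients.items()}

# Approximate two-sided 5% cutoff for a large sample (assumed here).
strong = [name for name, t in t_values.items() if abs(t) > 1.96]

for name, t in t_values.items():
    print(f"{name:<32}t = {t:8.3f}")
print("significant at roughly 5%:", strong)
```

The large standard errors on the minutes and charge coefficients are consistent with the multicollinearity between those pairs noted earlier.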
Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn
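As a rough illustration of the technique named in this heading, the following Python sketch classifies a point by majority vote among its k nearest training points. It is not the code that produced the output that follows; the feature values, the choice k = 3, and the function name `knn_predict` are all assumptions, and the features are taken to be already scaled to comparable ranges.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, already scaled two-feature training set.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.8, 0.9), (0.9, 0.7), (0.85, 0.8)]
train_y = ["False.", "False.", "False.", "True.", "True.", "True."]

pred_low = knn_predict(train_X, train_y, (0.2, 0.2))
pred_high = knn_predict(train_X, train_y, (0.8, 0.8))
print(pred_low, pred_high)
```

Because K-NN relies on distances, variables measured on large scales (such as minutes) would dominate unscaled features, which is why scaling is assumed here.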
- ## Direction.2005
- ## knn.pred 1 2
- ## 1 760 97
- ## 2 100 43
- [1] 0.803
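A confusion matrix is a tabular summary used mainly in supervised learning; in unsupervised learning it is usually called a matching matrix. By the usual convention, each column represents the instances of a predicted class and each row the instances of an actual class. In the tables shown here the layout is transposed: rows are the predictions (knn.pred) and columns are the actual classes.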
- ## Direction.2005
- ## knn.pred 1 2
- ## 1 827 104
- ## 2 33 36
- [1] 0.863
On the test set, the accuracy reaches 86%.
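The accuracy figures can be recomputed from the two confusion matrices above. The Python sketch below does so and also reports two quantities the article does not show: the share of the majority class and the recall of class 2. Treating class 2 as the churners is an assumption, since the output only labels the classes 1 and 2.

```python
def accuracy(matrix):
    """Share of correctly classified cases (diagonal over total)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def recall(matrix, cls):
    """Recall for one class; rows are predictions, columns are actual classes."""
    actual_total = sum(row[cls] for row in matrix)
    return matrix[cls][cls] / actual_total

# Rows: knn.pred 1, 2. Columns: actual class 1, 2 (copied from the output above).
first = [[760, 97], [100, 43]]
second = [[827, 104], [33, 36]]

acc_first = accuracy(first)
acc_second = accuracy(second)

# Accuracy of always predicting the majority class on the second matrix.
majority_share = (second[0][0] + second[1][0]) / sum(sum(r) for r in second)

# Recall of class 2, assumed here to be the churners.
recall_class2 = recall(second, 1)

print(acc_first, acc_second, majority_share, round(recall_class2, 3))
```

The recomputed accuracies match the printed values 0.803 and 0.863. On the second matrix, always predicting class 1 would already score 0.86, so accuracy alone is worth reading alongside the per-class recall.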
Findings
We find some relationship between total.day.calls and total.day.charge, particularly among the observations with churn = no. We also find that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have significant effects. Finally, the K-NN model reaches an accuracy of about 80% on the training set and 86% on the test set, which suggests that the model predicts well.