Data Mining in R: Predicting House Prices with Logistic Regression, GAM, LDA, KNN, and PCA-Based Classification, with Cross-Validation

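<article class="article fmt article-content"><h2>Full text: https://tecdat.cn/?p=35263</h2>
<h2>Original source: 拓端数据部落 WeChat official account</h2>
<p>This study helps a client analyze the Ames Housing data set, which contains 82 variables and 2,930 observations. The goal is to use classification algorithms to split sale prices into two classes. During preprocessing, Order and PID were dropped and SalePrice was recoded as a binary class label; related variables were combined and transformed to accommodate nonlinear relationships. Logistic regression, GAM, LDA, and KNN models were then fitted and evaluated.</p>
<p>The same models were also fitted on PCA scores and cross-validated, so that their performance could be compared and the best model chosen for further analysis and prediction. Overall, logistic regression and LDA performed well, GAM had the highest cross-validated accuracy after PCA, and KNN performed worst. These results offer guidance for model selection and for improving predictive accuracy and generalization.</p>
<p>Analysis goal:</p>
<p>Use classification algorithms to split SalePrice into 2 classes: one above USD 200,000 and one at or below USD 200,000.</p>
<p>Analysis requirements:</p>
<p>1. Remove the following variables: Order, PID, and SalePrice (as a numeric variable).</p>
<p>2. Define the training data for this analysis with the supplied code, and use the remaining data for validation.</p>
<p>3. Combine related variables, for example by summing the square-footage columns.</p>
<p>4. Transform the data where a relationship is nonlinear.</p>
<p>5. Fit logistic regression, GAM, LDA, and KNN models.</p>
<p>Order and PID are removed from the variables. SalePrice is also removed as a numeric variable, because it is recoded into the class label below.</p>
<pre><code># drop the first two columns (Order and PID)
AmesHousing=AmesHousing[,-c(1,2)]</code></pre>
<p>One class is above USD 200,000 and the other is at or below USD 200,000:</p>
<pre><code># 1 = sale price above USD 200,000, 0 = otherwise
AmesHousing$SalePrice <- ifelse(AmesHousing$SalePrice>200000,1,0)</code></pre>
<p>Check whether the relationships look linear; where they do not, consider a transformation.</p>
<pre><code>head(AmesHousing2)</code></pre>
<h2>Combining Variables by Keyword</h2>
<p>Some variables may need to be combined, such as those whose names contain the keywords "Flr", "Porch", "Bath", "Overall", "Sold", "SF", "Year", "AbvGr", "Garage", and "Area".</p>
<pre><code># sum all columns whose name contains "Flr" into a single Flr column
AmesHousing2$Flr=apply(AmesHousing2[,grep("Flr",colnames(AmesHousing2))],1,sum)
# drop the original "Flr" columns, keeping only the new total (the last match)
AmesHousing2=AmesHousing2[,-grep("Flr",colnames(AmesHousing2))[-length(grep("Flr",colnames(AmesHousing2)))]]</code></pre>
<pre><code>plot(AmesHousing2)</code></pre>
<h2>Fitting the Logistic Regression, GAM, LDA, and KNN Models</h2>
<p>Once the data are prepared, the different models can be fitted. The accuracy of the logistic regression, GAM, LDA, and KNN models is evaluated below:</p>
<h3>1. Logistic Regression:</h3>
<p>Fit a logistic regression to the data as follows:</p>
<pre><code>model.glm <- glm(as.factor(SalePrice) ~ ., data = AmesHousing, family = "binomial")</code></pre>
<p>After training and validation, the logistic regression reaches an accuracy of 0.932166301969365, so it classifies sale prices fairly accurately.</p>
<h3>2. 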
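Generalized Additive Model (GAM):</h3>
<p>Fit the GAM and compute its accuracy as follows:</p>
<pre><code># share of validation rows whose predicted class differs from the true class
misClasificError <- mean(fitted.results != Ames.test$SalePrice, na.rm = TRUE)
print(paste('Accuracy', 1 - misClasificError))</code></pre>
<p>The GAM reaches an accuracy of 0.911062906724512 when classifying sale prices.</p>
<h3>3. K-Nearest Neighbors (KNN):</h3>
<p>Load the kknn library to fit and evaluate the KNN model:</p>
<pre><code>library(kknn)
print(paste('Accuracy', 1 - misClasificError))</code></pre>
<p>The KNN model reaches an accuracy of 0.585284280936455, which is comparatively low; its parameters or the data preprocessing may need further tuning to improve accuracy.</p>
<h3>4. Linear Discriminant Analysis (LDA):</h3>
<p>Evaluate the accuracy of the LDA model:</p>
<pre><code>misClasificError <- mean(fitted.results != Ames.test$SalePrice, na.rm = TRUE)
print(paste('Accuracy', 1 - misClasificError))</code></pre>
<p>The LDA model reaches an accuracy of 0.923413566739606, so it classifies sale prices well.</p>
<p>These results show how the algorithms compare when classifying sale prices. Logistic regression and LDA perform best, while KNN is much less accurate and may need further optimization. The model best suited to the data set and the analysis goal can then be chosen for further study.</p>
<ul><li>Logistic regression: accuracy 0.932166301969365.</li><li>Generalized additive model (GAM): accuracy 0.911062906724512.</li><li>K-nearest neighbors (KNN): accuracy 0.585284280936455.</li><li>Linear discriminant analysis (LDA): accuracy 0.923413566739606.</li></ul>
<p>These accuracies also help clarify how the sale-price class relates to the other variables, and they serve as a reference for later prediction and decisions.</p>
<h2>Cross-Validation (criterion: the smallest test error)</h2>
<p>Cross-validation is a common machine-learning method for evaluating model performance and choosing hyperparameters. Here it is used to evaluate the four classifiers: logistic regression, LDA, KNN, and GAM.</p>
<h3>logistic regression</h3>
<p>Logistic regression is cross-validated first. The data set is split into 10 subsets; each time, 9 subsets are used for training and the remaining one for testing. This is repeated 10 times, the accuracy of each test is computed, and the mean of the 10 accuracies is taken as the final accuracy. The mean accuracy of logistic regression is 0.9410194.</p>
<pre><code class="js">precisek=0
k=10
for(kk in 1:k){
  ….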
  precisek=precisek+1-misClasificError
}</code></pre>
<pre><code class="js"> 1 th accuracy of logistic regression is 0.9491525
 2 th accuracy of logistic regression is 0.9321267
 3 th accuracy of logistic regression is 0.9434783
 4 th accuracy of logistic regression is 0.9244444
 5 th accuracy of logistic regression is 0.9480519
 6 th accuracy of logistic regression is 0.9480519
 7 th accuracy of logistic regression is 0.9356223
 8 th accuracy of logistic regression is 0.9516129
 9 th accuracy of logistic regression is 0.940678
10 th accuracy of logistic regression is 0.9369748</code></pre>
<pre><code>precisek/k  # calculate the mean accuracy</code></pre>
<pre><code>[1] 0.9410194</code></pre>
<h3>LDA</h3>
<p>Next, the LDA model is cross-validated in the same way: the data set is split into 10 subsets, 9 are used for training and the remaining one for testing, and the process is repeated 10 times. The mean accuracy of the LDA model is 0.937719.</p>
<pre><code>precisek=0
k=10
for(kk in 1:k){
  …
  cat(kk," th accuracy of LDA is ",1-misClasificError,"\n")
  precisek=precisek+1-misClasificError
}</code></pre>
<pre><code> 1 th accuracy of LDA is 0.9537815
 2 th accuracy of LDA is 0.9324324
 3 th accuracy of LDA is 0.9497908
 4 th accuracy of LDA is 0.9141631
 5 th accuracy of LDA is 0.9304348
 6 th accuracy of LDA is 0.9240506
 7 th accuracy of LDA is 0.9396552
 8 th accuracy of LDA is 0.9471366
 9 th accuracy of LDA is 0.9672897
10 th accuracy of LDA is 0.9184549</code></pre>
<pre><code>precisek/k  # calculate the mean accuracy
[1] 0.937719</code></pre>
<h3>knn</h3>
<p>The KNN model is then cross-validated with the same 10-subset procedure. The mean accuracy of the KNN model is 0.5928328.</p>
<pre><code>precisek=0
k=10
for(kk in 1:k){
  cat(kk," th accuracy of KNN is ",1-misClasificError,"\n")
  precisek=precisek+1-misClasificError
}</code></pre>
<pre><code> 1 th accuracy of KNN is 0.6382253
 2 th accuracy of KNN is 0.5870307
 3 th accuracy of KNN is 0.5631399
 4 th accuracy of KNN is 0.556314
 5 th accuracy of KNN is 0.6143345
 6 th accuracy of KNN is 0.6075085
 7 th accuracy of KNN is 0.5733788
 8 
th accuracy of KNN is 0.556314
 9 th accuracy of KNN is 0.6143345
10 th accuracy of KNN is 0.6177474</code></pre>
<pre><code>precisek/k  # calculate the mean accuracy
[1] 0.5928328</code></pre>
<h3>GAM</h3>
<p>Finally, the GAM is cross-validated with the same 10-subset procedure. The mean accuracy of the GAM is 0.9217754.</p>
<pre><code>precisek=0
k=10
for(kk in 1:k){
  # draw a random 1/k of the rows as the test index; the rest of this call is truncated in the source
  index=sample(1:dim(AmesHousing2)[1],floor(dim(AmesHousing2)[1]*(1/k)),
  cat(kk," th accuracy of GAM is ",1-misClasificError,"\n")
  precisek=precisek+1-misClasificError
}</code></pre>
<pre><code> 1 th accuracy of GAM is 0.9429825
 2 th accuracy of GAM is 0.8974359
 3 th accuracy of GAM is 0.9116279
 4 th accuracy of GAM is 0.9230769
 5 th accuracy of GAM is 0.9173913
 6 th accuracy of GAM is 0.8826087
 7 th accuracy of GAM is 0.9531915
 8 th accuracy of GAM is 0.9282511
 9 th accuracy of GAM is 0.9469027
10 th accuracy of GAM is 0.9142857</code></pre>
<pre><code>precisek/k  # calculate the mean accuracy
[1] 0.9217754</code></pre>
<p>Taken together, logistic regression (0.9410194) and LDA (0.937719) perform best on this data set, GAM (0.9217754) is slightly behind, and KNN (0.5928328) is far worse. Model choice should therefore follow the cross-validation results, picking the best-performing model for further analysis and prediction.</p>
<h2>PCA</h2>
<p>Principal component analysis (PCA) is a widely used dimensionality-reduction technique that helps reveal patterns in the data and reduce the number of features. The components are assessed first through their variance and cumulative variance. In the summary below, PC1 alone explains 32.16% of the variance, and the first 10 components together explain 89.27%.</p>
<pre><code>summary(pr.out)</code></pre>
<pre><code>Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6
Standard deviation     2.1963 1.3088 1.14690 1.01946 0.99468 0.98157
Proportion of Variance 0.3216 0.1142 0.08769 0.06929 0.06596 0.06423
Cumulative Proportion  0.3216 0.4358 0.52348 0.59277 0.65873 0.72296
                           PC7     PC8    PC9    PC10   PC11    PC12
Standard deviation     0.88807 0.83298 0.7520 0.70579 0.6686 0.63593
Proportion of Variance 0.05258 0.04626 0.0377 0.03321 0.0298 0.02696
Cumulative Proportion  0.77554 0.82179 0.8595 0.89270 0.9225 0.94946
                          PC13   PC14    PC15
Standard deviation     0.58684 0.5254 0.37095
Proportion of Variance 0.02296 0.0184 0.00917
Cumulative Proportion  0.97242 0.9908 
1.00000</code></pre>
<pre><code># proportion of variance explained by each component
pve=pr.var/sum(pr.var)</code></pre>
<p>Next, the four classifiers (logistic regression, LDA, KNN, and GAM) are fitted to the PCA-reduced data and their accuracy is evaluated.</p>
<h3>logistic regression</h3>
<pre><code>misClasificError <- mean(fitted.results != Ames.test$SalePrice,na.rm=T)
print(paste('Accuracy',1-misClasificError))</code></pre>
<pre><code>[1] "Accuracy 0.984210526315789"</code></pre>
<h3>GAM</h3>
<pre><code>library("mgcv")
model.gam=gam()  # model arguments are elided in the source
print(paste('Accuracy',1-misClasificError))</code></pre>
<pre><code>[1] "Accuracy 0.975438596491228"</code></pre>
<h3>knn</h3>
<pre><code>library(kknn)
model.kknn <- train.kknn(  # call truncated in the source
print(paste('Accuracy',1-misClasificError))</code></pre>
<pre><code>[1] "Accuracy 0.554385964912281"</code></pre>
<h3>LDA</h3>
<pre><code>misClasificError <- mean(fitted.results != Ames.test$SalePrice,na.rm=T)
print(paste('Accuracy',1-misClasificError))</code></pre>
<pre><code>[1] "Accuracy 0.978947368421053"</code></pre>
<p>The accuracy is 0.984210526315789 for logistic regression, 0.975438596491228 for GAM, 0.554385964912281 for KNN, and 0.978947368421053 for LDA. Comparing these values, logistic regression and LDA perform best, while KNN performs worst.</p>
<h2>Cross-Validation (criterion: the smallest test error)</h2>
<p>Cross-validation was then run again, computing the accuracy of ten validation rounds and averaging them to evaluate each model.</p>
<h3>logistic regression</h3>
<pre><code>precisek=0
k=10
for(kk in 1:k){
  cat(kk," th accuracy of logistic regression is ",1-misClasificError,"\n")
  precisek=precisek+1-misClasificError
}
precisek/k  # calculate the mean accuracy</code></pre>
<pre><code>[1] 0.9779736</code></pre>
<h3>LDA</h3>
<pre><code>precisek=0
k=10
for(kk in 1:k){
  cat(kk," th accuracy of LDA is ",1-misClasificError,"\n")
  precisek=precisek+1-misClasificError
}</code></pre>
<pre><code>precisek/k  # calculate the mean accuracy</code></pre>
<pre><code>[1] 0.9792952</code></pre>
<h3>knn</h3>
<pre><code>precisek=0
k=10
for(kk in 1:k){
  cat(kk," th accuracy of KNN is ",1-misClasificError,"\n")
  precisek=precisek+1-misClasificError
}</code></pre>
<p>
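</p>
<pre><code>precisek/k  # calculate the mean accuracy</code></pre>
<pre><code>[1] 0.9656388</code></pre>
<h3>GAM</h3>
<pre><code>precisek=0
k=10
for(kk in 1:k){
  cat(kk," th accuracy of GAM is ",1-misClasificError,"\n")
  precisek=precisek+1-misClasificError
}</code></pre>
<pre><code>precisek/k  # calculate the mean accuracy</code></pre>
<pre><code>[1] 0.9814978</code></pre>
<p>The mean accuracy over the ten validation rounds is 0.9779736 for logistic regression, 0.9792952 for LDA, 0.9656388 for KNN, and 0.9814978 for GAM. By this measure GAM performs best on this data set and KNN performs worst, although all four models are within about 1.6 percentage points of each other.</p>
<p>In summary, PCA combined with fitting and cross-validating several classifiers makes it possible to compare model performance and choose the best model for further analysis and prediction. In practice, the model should be chosen to fit the actual situation and requirements, and then tuned further to improve predictive accuracy and generalization.</p>
</article>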
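<p>The workflow summarized above (PCA, then a classifier scored by 10-fold cross-validation) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the original analysis code: synthetic data stands in for the Ames Housing predictors, logistic regression stands in for all four models, and the names <code>z</code>, <code>X</code>, <code>y</code>, <code>fold</code>, <code>acc</code>, and <code>cv_accuracy</code>, as well as the choice of 3 components, are illustrative only.</p>

```r
# Minimal sketch: PCA scores + logistic regression, scored by 10-fold CV.
# ASSUMPTION: synthetic data stands in for the Ames Housing predictors.
set.seed(1)
n <- 500
z <- rnorm(n)                                          # hypothetical latent factor
X <- sapply(1:6, function(j) z + rnorm(n, sd = 0.7))   # six correlated predictors
colnames(X) <- paste0("x", 1:6)
y <- as.integer(z + rnorm(n, sd = 0.5) > 0)            # binary class label

k <- 10
# Assign every row to exactly one fold, so the k test sets do not overlap
fold <- sample(rep(1:k, length.out = n))
acc <- numeric(k)

for (kk in 1:k) {
  test <- fold == kk

  # Fit PCA on the training rows only, then project the test rows,
  # so that no information from the test fold leaks into the components
  pr <- prcomp(X[!test, ], center = TRUE, scale. = TRUE)
  train_pc <- as.data.frame(pr$x[, 1:3])
  test_pc  <- as.data.frame(predict(pr, newdata = X[test, ])[, 1:3])
  train_pc$y <- y[!test]

  # Logistic regression on the first three principal components
  fit  <- glm(y ~ ., data = train_pc, family = binomial)
  prob <- predict(fit, newdata = test_pc, type = "response")
  pred <- as.integer(prob > 0.5)

  acc[kk] <- mean(pred == y[test])
  cat(kk, "th accuracy is", acc[kk], "\n")
}

cv_accuracy <- mean(acc)   # mean accuracy over the k folds
print(cv_accuracy)
```

<p>Two design choices in this sketch differ from the loops shown in the article and are worth stating: the rows are partitioned into non-overlapping folds rather than drawing a fresh random sample in each round, and the PCA is refitted inside each fold. Swapping <code>glm</code> for another classifier would leave the rest of the loop unchanged.</p>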