原文链接:http://tecdat.cn/?p=8531

  1. 执行多项式回归应用age预测wage。应用穿插验证为多项式抉择最佳次数。抉择了什么水平,这与应用ANOVA进行假设检验的后果相比如何?对所得多项式拟合数据进行绘图。

加载工资数据集。保留所有穿插验证误差的数组。咱们执行K=10  K倍穿插验证。

1.  rm(list = ls())    2.  set.seed(1)    5.  # 测试误差    6.  cv.MSE <- NA    8.  # 循环遍历年龄    9.  for (i in 1:15) {    10.    glm.fit <-  glm(wage ~ poly(age, i), data = Wage)    11.    # 咱们应用cv.glm的穿插验证并保留cv误差    12.    cv.MSE[i] <-  cv.glm(Wage, glm.fit, K = 10)$delta[1]    13.  }    14.  # 查看后果对象    15.  cv.MSE    
1.  ##  [1] 1675.837 1601.012 1598.801 1594.217 1594.625 1594.888 1595.500    2.  ##  [8] 1595.436 1596.335 1595.835 1595.970 1597.971 1598.713 1599.253    3.  ## [15] 1595.332    

咱们通过绘制type = "b"点与线之间的关系图来阐明后果。

1.  # 用线图阐明后果    2.  plot( x = 1:15, y = cv.MSE, xlab = "power of age", ylab = "CV error",     3.        type = "b", pch = 19, lwd = 2, bty = "n",     4.        ylim = c( min(cv.MSE) - sd(cv.MSE), max(cv.MSE) + sd(cv.MSE) ) )    6.  # 水平线    7.  abline(h = min(cv.MSE) + sd(cv.MSE) , lty = "dotted")    9.  # 最小值    10.  points( x = which.min(cv.MSE), y = min(cv.MSE), col = "red", pch = "X", cex = 1.5 )    

咱们再次以较高的年龄权重对模型进行拟合以进行方差分析。

1.  ## Analysis of Deviance Table    2.  ##     3.  ## Model  1: wage ~ poly(age, a)    4.  ## Model  2: wage ~ poly(age, a)    5.  ## Model  3: wage ~ poly(age, a)    6.  ## Model  4: wage ~ poly(age, a)    7.  ## Model  5: wage ~ poly(age, a)    8.  ## Model  6: wage ~ poly(age, a)    9.  ## Model  7: wage ~ poly(age, a)    10.  ## Model  8: wage ~ poly(age, a)    11.  ## Model  9: wage ~ poly(age, a)    12.  ## Model 10: wage ~ poly(age, a)    13.  ## Model 11: wage ~ poly(age, a)    14.  ## Model 12: wage ~ poly(age, a)    15.  ## Model 13: wage ~ poly(age, a)    16.  ## Model 14: wage ~ poly(age, a)    17.  ## Model 15: wage ~ poly(age, a)    18.  ##    Resid. Df Resid. Dev Df Deviance        F    Pr(>F)     19.  ## 1       2998    5022216     20.  ## 2       2997    4793430  1   228786 143.5637 < 2.2e-16 ***    21.  ## 3       2996    4777674  1    15756   9.8867  0.001681 **     22.  ## 4       2995    4771604  1     6070   3.8090  0.051070 .     23.  ## 5       2994    4770322  1     1283   0.8048  0.369731     24.  ## 6       2993    4766389  1     3932   2.4675  0.116329     25.  ## 7       2992    4763834  1     2555   1.6034  0.205515     26.  ## 8       2991    4763707  1      127   0.0795  0.778016     27.  ## 9       2990    4756703  1     7004   4.3952  0.036124 *     28.  ## 10      2989    4756701  1        3   0.0017  0.967552     29.  ## 11      2988    4756597  1      103   0.0648  0.799144     30.  ## 12      2987    4756591  1        7   0.0043  0.947923     31.  ## 13      2986    4756401  1      190   0.1189  0.730224     32.  ## 14      2985    4756158  1      243   0.1522  0.696488     33.  ## 15      2984    4755364  1      795   0.4986  0.480151     34.  ## ---    35.  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1    

依据F测验,咱们应该抉择年龄进步到3次的模型,通过穿插验证 。

当初,咱们绘制多项式拟合的后果。

1.  plot(wage ~ age, data = Wage, col = "darkgrey",  bty = "n")    2.  ...    

  1. 拟合step函数以wage应用进行预测age 。绘制取得的拟合图。
1.  cv.error <-  NA    2.  ...    4.  # 突出显示最小值    5.  points( x = which.min(cv.error), y = min(cv.error, na.rm = TRUE),    

穿插验证表明,在k = 8的状况下,测试误差最小。最小值的1sd之内的最简洁模型具备k = 4,因而将数据分为5个不同的区域。

44

1.  lm.fit <-  glm(wage ~ cut(age, 4), data = Wage)    3.  matlines(age.grid, cbind( lm.pred$fit + 2* lm.pred$se.fit,    4.                            lm.pred$fit - 2* lm.pred$se.fit),    5.           col = "red", lty ="dashed")    

Wage数据集蕴含了一些其余的变量,如婚姻状况(maritl),工作级别(jobclass),等等。摸索其中一些其余预测变量与的关系wage,并应用非线性拟合技术将模型拟合到数据中。

1.  ...    2.  summary(Wage[, c("maritl", "jobclass")] )    
1.  ##               maritl               jobclass     2.  ##  1. Never Married: 648   1. Industrial :1544     3.  ##  2. Married      :2074   2. Information:1456     4.  ##  3. Widowed      :  19     5.  ##  4. Divorced     : 204     6.  ##  5. Separated    :  55    
1.  # 关系箱线图    2.  par(mfrow=c(1,2))    3.  plot(Wage$maritl, Wage$wage, frame.plot = "FALSE")    

看来一对已婚夫妇均匀比其余群体挣更多的钱。信息类工作的工资均匀高于工业类工作。

多项式和step函数

1.  m1 <-  lm(wage ~ maritl, data = Wage)    2.  deviance(m1) # fit (RSS in linear; -2*logLik)    

## [1] 4858941

1.  m2 <-  lm(wage ~ jobclass, data = Wage)    2.  deviance(m2)    

## [1] 4998547

1.  m3 <-  lm(wage ~ maritl + jobclass, data = Wage)    2.  deviance(m3)    

## [1] 4654752

正如预期的那样,应用最简单的模型能够使样本内数据拟合最小化。

咱们不能使样条曲线拟合分类变量。

咱们不能将样条曲线拟合到因子,但能够应用一个样条曲线拟合一个连续变量并增加其余预测变量的模型。

1.  library(gam)    2.  m4 <-  gam(...)    3.  deviance(m4)    

## [1] 4476501

anova(m1, m2, m3, m4) 
1.  ## Analysis of Variance Table    2.  ##     3.  ## Model 1: wage ~ maritl    4.  ## Model 2: wage ~ jobclass    5.  ## Model 3: wage ~ maritl + jobclass    6.  ## Model 4: wage ~ maritl + jobclass + s(age, 4)    7.  ##   Res.Df     RSS      Df Sum of Sq      F    Pr(>F)     8.  ## 1   2995 4858941     9.  ## 2   2998 4998547 -3.0000   -139606 31.082 < 2.2e-16 ***    10.  ## 3   2994 4654752  4.0000    343795 57.408 < 2.2e-16 ***    11.  ## 4   2990 4476501  4.0002    178252 29.764 < 2.2e-16 ***    12.  ## ---    13.  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1    

F测验表明,咱们从模型四到模型一统计显著改善的变量有年龄,wagemaritl,和jobclass

Boston数据回归

这个问题应用的变量dis(到五个波士顿待业核心的间隔的加权平均值)和nox(每百万人口中一氧化氮的浓度,单位为百万)。咱们将dis作为预测变量,将nox作为因变量。

1.  rm(list = ls())    2.  set.seed(1)    3.  library(MASS)    4.  attach(Boston)    
1.  ## The following objects are masked from Boston (pos = 14):    2.  ##     3.  ##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio,    4.  ##     rad, rm, tax, zn    
  1. 应用poly()函数拟合三次多项式回归来预测nox应用dis。报告回归输入,并绘制后果数据和多项式拟合。
1.  m1 <-  lm(nox ~ poly(dis, 3))    2.  summary(m1)
1.  ##     2.  ## Call:    3.  ## lm(formula = nox ~ poly(dis, 3))    4.  ##     5.  ## Residuals:    6.  ##       Min        1Q    Median        3Q       Max     7.  ## -0.121130 -0.040619 -0.009738  0.023385  0.194904     8.  ##     9.  ## Coefficients:    10.  ##                Estimate Std. Error t value Pr(>|t|)     11.  ## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***    12.  ## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***    13.  ## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***    14.  ## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***    15.  ## ---    16.  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1    17.  ##     18.  ## Residual standard error: 0.06207 on 502 degrees of freedom    19.  ## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131     20.  ## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16    
1.  dislim <-  range(dis)    2.  ...    3.  lines(x = dis.grid, y = lm.pred$fit, col = "red", lwd = 2)    4.  matlines(x = dis.grid, y = cbind(lm.pred$fit + 2* lm.pred$se.fit,    5.                                   lm.pred$fit - 2* lm.pred$se.fit)     6.           , col = "red", lwd = 1.5, lty = "dashed")    

摘要显示,在nox应用进行预测时,所有多项式项都是无效的dis。该图显示了一条平滑的曲线,很好地拟合了数据。

  1. 绘制多项式适宜不同多项式次数的范畴(例如,从1到10),并报告相干的残差平方和。

咱们绘制1到10度的多项式并保留RSS。

1.  #     2.  train.rss <-  NA    3.  ...    4.  # 在训练集中显示模型拟合    5.  train.rss    
1.  ##  [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484    2.  ##  [8] 1.835630 1.833331 1.832171    

正如预期的那样,RSS随多项式次数枯燥递加。

  1. 执行穿插验证或其余办法来抉择多项式的最佳次数,并解释您的后果。

咱们执行LLOCV并手工编码:

1.  cv.error <- matrix(NA, nrow = nrow(Boston), ncol = 10)    3.  ...    4.  names(result) <- paste( "^", 1:10, sep= "" )    5.  result    
1.  ##          ^1          ^2          ^3          ^4          ^5          ^6     2.  ## 0.005471468 0.004022257 0.003822345 0.003820121 0.003785158 0.003711971     3.  ##          ^7          ^8          ^9         ^10     4.  ## 0.003655106 0.003627727 0.003623183 0.003620892    
1.  plot(result ~ seq(1,10), type = "b", pch = 19, bty = "n", xlab = "powers of dist to empl. center",    2.       ylab = "cv error")    3.  abline(h = min(cv.error) + sd(cv.error), col = "red", lty = "dashed")    

基于穿插验证,咱们将抉择dis平方。

  1. 应用bs()函数拟合回归样条曲线应用nox进行预测dis

咱们以[3,6,9]大抵相等的4个区间划分此范畴

1.  library(splines)    2.  m4 <-  lm(nox ~ bs(dis, knots = c(3, 6, 9)))    3.  summary(m4)    
1.  ##     2.  ## Call:    3.  ## lm(formula = nox ~ bs(dis, knots = c(3, 6, 9)))    4.  ##     5.  ## Residuals:    6.  ##       Min        1Q    Median        3Q       Max     7.  ## -0.132134 -0.039466 -0.009042  0.025344  0.187258     8.  ##     9.  ## Coefficients:    10.  ##                               Estimate Std. Error t value Pr(>|t|)     11.  ## (Intercept)                   0.709144   0.016099  44.049  < 2e-16 ***    12.  ## bs(dis, knots = c(3, 6, 9))1  0.006631   0.025467   0.260    0.795     13.  ## bs(dis, knots = c(3, 6, 9))2 -0.258296   0.017759 -14.544  < 2e-16 ***    14.  ## bs(dis, knots = c(3, 6, 9))3 -0.233326   0.027248  -8.563  < 2e-16 ***    15.  ## bs(dis, knots = c(3, 6, 9))4 -0.336530   0.032140 -10.471  < 2e-16 ***    16.  ## bs(dis, knots = c(3, 6, 9))5 -0.269575   0.058799  -4.585 5.75e-06 ***    17.  ## bs(dis, knots = c(3, 6, 9))6 -0.303386   0.062631  -4.844 1.70e-06 ***    18.  ## ---    19.  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1    20.  ##     21.  ## Residual standard error: 0.0612 on 499 degrees of freedom    22.  ## Multiple R-squared:  0.7244, Adjusted R-squared:  0.7211     23.  ## F-statistic: 218.6 on 6 and 499 DF,  p-value: < 2.2e-16    
1.  # 绘图后果    2.  ...    3.  #     4.  matlines( dis.grid,    5.  ...    6.            col = "black", lwd = 2, lty = c("solid", "dashed", "dashed"))    

  1. 当初针对肯定范畴的自由度拟合样条回归,并绘制后果拟合并报告后果RSS。形容取得的后果。

咱们应用3到16之间的dfs拟合回归样条曲线。

1.  box <-  NA    3.  for (i in 3:16) {    4.  ...    5.  }    7.  box[-c(1, 2)]    
1.  ##  [1] 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653    2.  ##  [8] 1.792535 1.796992 1.788999 1.782350 1.781838 1.782798 1.783546    

ISLR包中的College数据集。

  1. 将数据分为训练集和测试集。应用学费作为因变量,应用其余变量作为预测变量,对训练集执行前向逐渐抉择,确定仅应用预测变量子集的令人满意的模型。
1.  rm(list = ls())    2.  set.seed(1)    3.  library(leaps)    4.  attach(College)    
1.  ## The following objects are masked from College (pos = 14):    2.  ##     3.  ##     Accept, Apps, Books, Enroll, Expend, F.Undergrad, Grad.Rate,    4.  ##     Outstate, P.Undergrad, perc.alumni, Personal, PhD, Private,    5.  ##     Room.Board, S.F.Ratio, Terminal, Top10perc, Top25perc    
1.  # 训练/测试拆分行的索引号    2.  train <-  sample( length(Outstate), length(Outstate)/2)    3.  test <-  -train    4.  ...    5.  abline(h=max.adjr2 - std.adjr2, col="red", lty=2)    

所有cp,BIC和adjr2得分均显示6是该子集的最小大小。然而,依据1个标准误差规定,咱们将抉择具备4个预测变量的模型。

1.  ...    2.  coefi <-  coef(m5, id = 4)    3.  names(coefi)    

## [1] "(Intercept)" "PrivateYes" "Room.Board" "perc.alumni" "Expend"

  1. 将GAM拟合到训练数据上,应用学费作为响应,并应用在上一步中抉择的函数作为预测变量。绘制后果,并解释您的发现。
1.  library(gam)    2.  ...    3.  plot(gam.fit, se=TRUE, col="blue")    

  1. 评估在测试集上取得的模型,并解释取得的后果。
1.  gam.pred <-  predict(gam.fit, College.test)    2.  gam.err <-  mean((College.test$Outstate - gam.pred)^2)    3.  gam.err    

## [1] 3745460

1.  gam.tss <-  mean((College.test$Outstate - mean(College.test$Outstate))^2)    2.  test.rss <-  1 - gam.err / gam.tss    3.  test.rss    

## [1] 0.7696916

  1. 对于哪些变量(如果有),是否存在与因变量呈非线性关系的证据?
summary(gam.fit) 
1.  ##     2.  ## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,     3.  ##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,     4.  ##     df = 2), data = College.train)    5.  ## Deviance Residuals:    6.  ##      Min       1Q   Median       3Q      Max     7.  ## -4977.74 -1184.52    58.33  1220.04  7688.30     8.  ##     9.  ## (Dispersion Parameter for gaussian family taken to be 3300711)    10.  ##     11.  ##     Null Deviance: 6221998532 on 387 degrees of freedom    12.  ## Residual Deviance: 1231165118 on 373 degrees of freedom    13.  ## AIC: 6941.542     14.  ##     15.  ## Number of Local Scoring Iterations: 2     16.  ##     17.  ## Anova for Parametric Effects    18.  ##                         Df     Sum Sq    Mean Sq F value    Pr(>F)     19.  ## Private                  1 1779433688 1779433688 539.106 < 2.2e-16 ***    20.  ## s(Room.Board, df = 2)    1 1221825562 1221825562 370.171 < 2.2e-16 ***    21.  ## s(PhD, df = 2)           1  382472137  382472137 115.876 < 2.2e-16 ***    22.  ## s(perc.alumni, df = 2)   1  328493313  328493313  99.522 < 2.2e-16 ***    23.  ## s(Expend, df = 5)        1  416585875  416585875 126.211 < 2.2e-16 ***    24.  ## s(Grad.Rate, df = 2)     1   55284580   55284580  16.749 5.232e-05 ***    25.  ## Residuals              373 1231165118    3300711     26.  ## ---    27.  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1    28.  ##     29.  ## Anova for Nonparametric Effects    30.  ##                        Npar Df  Npar F     Pr(F)     31.  ## (Intercept)     32.  ## Private     33.  ## s(Room.Board, df = 2)        1  3.5562   0.06010 .     34.  ## s(PhD, df = 2)               1  4.3421   0.03786 *     35.  ## s(perc.alumni, df = 2)       1  1.9158   0.16715     36.  ## s(Expend, df = 5)            4 16.8636 1.016e-12 ***    37.  ## s(Grad.Rate, df = 2)         1  3.7208   0.05450 .     38.  ## ---    39.  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1    

非参数Anova测验显示了因变量与收入之间存在非线性关系的无力证据,以及因变量与Grad.Rate或PhD之间具备中等强度的非线性关系(应用p值为0.05)。


最受欢迎的见解

1.R语言多元Logistic逻辑回归 利用案例

2.面板平滑转移回归(PSTR)剖析案例实现

3.matlab中的偏最小二乘回归(PLSR)和主成分回归(PCR)

4.R语言泊松Poisson回归模型剖析案例

5.R语言回归中的Hosmer-Lemeshow拟合优度测验

6.r语言中对LASSO回归,Ridge岭回归和Elastic Net模型实现

7.在R语言中实现Logistic逻辑回归

8.python用线性回归预测股票价格

9.R语言如何在生存剖析与Cox回归中计算IDI,NRI指标