On data mining: R programming tutoring, M3S2 Spring Assessed Coursework (linear model), with answers


全文链接:http://tecdat.cn/?p=30888

For this coursework you are required to download a dataset personal to you. Your dataset is available at:
http://wwwf.imperial.ac.uk/~f…<CID>.RData
where you must replace <CID> with your CID number. Any problems, email me. This
dataset contains a dataframe called mydat: it consists of a response y and 3 columns of
covariates x1, x2 and x3. Be aware!

Q1) (a) In R fit the normal linear model with:

Based upon the summary of the model, do you think that the model fits the data
well? Explain your reasoning using the values reported in the R summary, but
do not include the whole summary in your report.
(b) Perform a hypothesis test to ascertain whether or not to include the intercept
term; use a 5% significance level. Include your code.
(c) Conduct a hypothesis test comparing the models:
E(Y) = β1 against E(Y) = β1 + β2x1 + β3x2 + β4x3
at a 5% significance level. Include your code.
(d) By inspecting the leverages and residuals, identify any potential outliers. Name
these data points by their index number. Give your reasoning as to why you
believe these are potential outliers. You may present up to three plots if necessary

    mod <- lm(y ~ x1 + x2 + x3, data = mydat)  # fit the normal linear model
    summary(mod)



Judging from the residuals, the differences between the fitted values and the observed values are small, so the model fits the data reasonably well.



The p-values for the intercept, x1 and x2 are all below 0.05, indicating that these variables have a significant effect on y.


Judging from the R-squared value, there is still room to improve the model's fit.

 

B) #b.r

    mod2 <- lm(y ~ x1 + x2 + x3 - 1, data = mydat)  # drop the intercept term
    t.test(mod2$fitted.values, mod$fitted.values, conf.level = 0.95)


From the test results, at the 5% significance level the two models differ.



Compared with the fit of model 1, after removing the intercept model 2's R-squared is larger than model 1's, i.e., its fit appears better. (Note, however, that R-squared is computed differently for models without an intercept, so the two values are not directly comparable.)
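As an aside, the more standard way to test the intercept at the 5% level is the t-test already reported in the coefficient table, rather than comparing fitted values. A minimal sketch; the data below are simulated stand-ins, since the personal mydat dataset is not reproduced here:

```r
set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 1 + 2 * x1 + 3 * x2 - x3 + rnorm(n)   # simulated stand-in for mydat
mydat <- data.frame(y, x1, x2, x3)

mod <- lm(y ~ x1 + x2 + x3, data = mydat)

# t-test of H0: intercept = 0, read off the summary coefficient table
p_int <- summary(mod)$coefficients["(Intercept)", "Pr(>|t|)"]
keep_intercept <- p_int < 0.05   # TRUE -> reject H0 at the 5% level, keep the intercept
```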


C) #c.r

    mod3 <- lm(y ~ 1, data = mydat)  # intercept-only model
    summary(mod3)




The model with covariates and the intercept-only model turn out to be very similar. The p-value is greater than 0.05, so we cannot reject the null hypothesis; that is, the two models are similar.
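The nested comparison in (c) is usually carried out with an F-test via anova(); a sketch on simulated stand-in data (the personal mydat is not reproduced here):

```r
set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 1 + 2 * x1 + 3 * x2 - x3 + rnorm(n)   # simulated stand-in for mydat
mydat <- data.frame(y, x1, x2, x3)

mod  <- lm(y ~ x1 + x2 + x3, data = mydat)  # full model
mod3 <- lm(y ~ 1,            data = mydat)  # intercept-only model E(Y) = beta1

tab   <- anova(mod3, mod)      # F-test of the nested comparison
p_val <- tab[["Pr(>F)"]][2]
reject_h0 <- p_val < 0.05      # TRUE -> the covariates significantly improve the fit
```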

D)#d.r



Observations 6, 57 and 38 have standardized residuals (between fitted and observed values) noticeably larger than the rest, so observations 6, 57 and 38 can be regarded as potential outliers.



Observations 38 and 101 have Cook's distances markedly different from the other observations, so observations 38 and 101 have a strong influence on the model and can also be regarded as potential outliers.
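The outlier hunt in (d) can be made systematic by extracting the leverages, standardized residuals and Cook's distances directly; a sketch, again on simulated stand-in data:

```r
set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 1 + 2 * x1 + 3 * x2 - x3 + rnorm(n)   # simulated stand-in for mydat
mydat <- data.frame(y, x1, x2, x3)

mod <- lm(y ~ x1 + x2 + x3, data = mydat)

h <- hatvalues(mod)        # leverages
r <- rstandard(mod)        # standardized residuals
d <- cooks.distance(mod)   # Cook's distances

# common rules of thumb: leverage > 2p/n, |standardized residual| > 2,
# Cook's distance > 4/n
p <- length(coef(mod))
flagged <- which(h > 2 * p / n | abs(r) > 2 | d > 4 / n)
```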

Q2) We shall now consider a GLM with a Gamma response distribution.
(a) Show that a random variable Y where Y follows a Gamma distribution with
probability density function:



(c) Rewrite (by hand) the IWLS algorithm (similar to Algorithm 3.1 in the notes on
page 38) specifically for the Gamma response and using the link:



This is called the inverse link function.  

Continue to use the inverse link function for the remainder of the
questions.
(d) Write the components of the total score U1, …, Up and the Fisher information
matrix for this model.
(e) Given the observations y, what is a sensible initial guess to begin the IWLS
algorithm in general?
(f) Manually write an IWLS algorithm to fit a Gamma GLM using your data, mydat,
using the inverse link and the same linear predictor as in Q1a). Use the deviance as the
convergence criterion and an initial guess of β of (0.5, 0.5, 0.5, 0.5). Present your code
along with your final estimate of β and the final deviance.
(g) Based on your IWLS results, compute the dispersion estimates φ̂D and φ̂P and the estimates of var(β̂2)




In R fit the model again with a Gamma response i.e.  

glm(y~x1+x2+x3,family=Gamma,data=mydat)
Note the capital G in Gamma. Verify the results with your IWLS results.
(h) Give a prediction for the response given by the model for x1 = 13, x2 = 5, x3 = 0.255
and give a 91% confidence interval for this prediction. Include your code.
(i) Perform a hypothesis test between this model and another model with the same
link and response distribution but with linear predictor η where
ηi = β1 + β2xi1 + β3xi2 for i = 1, …, n.
Use a 5% significance level. You may use the deviance function here. Include
your code.
(j) Using your IWLS results, manually compute the leverages of the observations for this model; present your code (but not the values) and plot the leverages
against the observation index number.
(k) Proceed to investigate diagnostic plots for your Gamma GLM. Identify any potential outliers and give your reasoning. Remove the most suspicious data point (you must remove 1 and only 1) and refit the same model. Compare and
comment on the change in the model with and without this data point; you
may wish to refer to the relative change in the estimated coefficients. You may present up to three plots if necessary.

        
    x1 <- mydat$x1
    x2 <- mydat$x2
    x3 <- mydat$x3
    y  <- mydat$y
    X  <- cbind(1, x1, x2, x3)  # design matrix

    ilogit <- function(u) 1 / (1 + exp(-u))  # logistic function (not needed for the inverse link)

    D <- function(mu) {  # Gamma deviance function
      a <- (y - mu) / mu
      b <- -log(y / mu)
      2 * sum(a + b)
    }

G) #g.r

    eta <- cbind(1, x1, x2, x3) %*% beta
    mu  <- 1 / eta                       # inverse link: mu = 1/eta
    z   <- eta + (y - mu) / (-mu^2)      # form the adjusted variate z = eta + (y - mu) g'(mu)
    w   <- mu^2                          # working weights w = 1 / (V(mu) g'(mu)^2) = mu^2
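Putting the pieces above together, one way to write the full IWLS loop for the Gamma GLM with the inverse link, using the deviance as the convergence criterion and starting from β = (0.5, 0.5, 0.5, 0.5) as required in (f). The data below are simulated stand-ins for the personal mydat, and the result is checked against glm(..., family = Gamma), whose default link for Gamma is exactly the inverse link:

```r
set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
X  <- cbind(1, x1, x2, x3)
eta_true <- drop(X %*% c(1, 0.5, 0.5, 0.5))       # keep the linear predictor positive
y  <- rgamma(n, shape = 5, rate = 5 * eta_true)   # so E(Y) = 1/eta_true
mydat <- data.frame(y, x1, x2, x3)

dev <- function(mu) 2 * sum((y - mu) / mu - log(y / mu))  # Gamma deviance

beta    <- c(0.5, 0.5, 0.5, 0.5)  # initial guess
dev_old <- Inf
for (it in 1:100) {
  eta  <- drop(X %*% beta)
  mu   <- 1 / eta                     # inverse link
  z    <- eta - (y - mu) / mu^2       # adjusted variate: eta + (y - mu) * g'(mu)
  w    <- mu^2                        # working weights
  beta <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))  # weighted LS update
  dev_new <- dev(1 / drop(X %*% beta))
  if (abs(dev_new - dev_old) < 1e-8) break  # deviance-based stopping rule
  dev_old <- dev_new
}

fit <- glm(y ~ x1 + x2 + x3, family = Gamma, data = mydat)  # reference fit
beta; dev_new   # should agree with coef(fit) and deviance(fit)
```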

       
H) #h.r

    mod <- glm(y ~ x1 + x2 + x3, family = Gamma, data = mydat)

    # predict.glm does not return prediction intervals directly;
    # build a 91% confidence interval on the link scale and transform back
    newd <- data.frame(x1 = 13, x2 = 5, x3 = 0.255)
    pr   <- predict(mod, newdata = newd, se.fit = TRUE, type = "link")
    zq   <- qnorm(1 - (1 - 0.91) / 2)
    eta_ci <- pr$fit + c(-1, 1) * zq * pr$se.fit
    mu_ci  <- rev(1 / eta_ci)  # the inverse link is decreasing, so the limits swap
    c(fit = 1 / pr$fit, lower = mu_ci[1], upper = mu_ci[2])
   
   
   

I) #i.r

    mod2 <- glm(y ~ x1 + x2, family = Gamma, data = mydat)  # reduced model, same link and response

   
   
Because the p-value is greater than 0.05, we cannot reject the null hypothesis H0; judging by the difference in deviance, the two models are not significantly different.
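The deviance-based test behind this conclusion can be written as follows; a sketch on simulated stand-in data (for a Gamma GLM with estimated dispersion, anova(..., test = "F") scales the change in deviance appropriately):

```r
set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
eta_true <- 1 + 0.5 * x1 + 0.5 * x2 + 0.5 * x3
y  <- rgamma(n, shape = 5, rate = 5 * eta_true)  # simulated stand-in, E(Y) = 1/eta_true
mydat <- data.frame(y, x1, x2, x3)

mod  <- glm(y ~ x1 + x2 + x3, family = Gamma, data = mydat)  # full model
mod2 <- glm(y ~ x1 + x2,      family = Gamma, data = mydat)  # reduced model

tab   <- anova(mod2, mod, test = "F")  # F-test on the scaled change in deviance
p_val <- tab[["Pr(>F)"]][2]
p_val < 0.05   # TRUE -> reject H0 and prefer the full model
```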
J) #j.r

    plot(mod)  # built-in diagnostic plots
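Since (j) asks for the leverages computed manually rather than read off plot(mod), one option is to form the hat matrix from the final IWLS working weights, H = W^(1/2) X (X'WX)^(-1) X' W^(1/2); a sketch on simulated stand-in data, checked against hatvalues():

```r
set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
eta_true <- 1 + 0.5 * x1 + 0.5 * x2 + 0.5 * x3
y  <- rgamma(n, shape = 5, rate = 5 * eta_true)  # simulated stand-in, E(Y) = 1/eta_true
mydat <- data.frame(y, x1, x2, x3)

mod <- glm(y ~ x1 + x2 + x3, family = Gamma, data = mydat)

X   <- model.matrix(mod)
w   <- mod$weights                        # final IWLS working weights
WX  <- sqrt(w) * X                        # W^(1/2) X
H   <- WX %*% solve(t(WX) %*% WX) %*% t(WX)
lev <- diag(H)                            # leverages

plot(lev, xlab = "observation index", ylab = "leverage")
```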

K) #k.r

    # fitted means under the inverse link: mu = 1/eta
    # (exp(...) here would correspond to a log link, not the inverse link used throughout)
    y1 <- 1 / (beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x3)

   
   
   
Judging from the residuals-vs-fitted plot, observations 44, 28 and 81 have large residuals and may be outliers; of these, observation 81 has the largest residual.

   
   
From the normal Q-Q plot, most points lie close to the reference line, so the residuals can be regarded as roughly normal overall; observations 44, 28 and 81 lie far from the line, so they depart from normality and may be outliers.

 

From the residuals-vs-leverage plot, observations 57, 101 and 40 have large Cook's distances, i.e., they all have a large influence on the fitted values.

   
   
Computing the leverage statistic for these three observations shows that observation 44 has a larger value than the other two, so observation 44 is treated as the anomalous point and removed.

Comparing the model fitted without observation 44 against the original model:

    mod2 <- glm(y ~ x1 + x2 + x3, family = Gamma, data = mydat1)  # mydat1 = mydat with observation 44 removed
    summary(mod2)

The refitted model's deviance residuals are smaller and the effects of the covariates on the response are more pronounced, so the fit of the model has improved.
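The relative change in the estimated coefficients mentioned in (k) can be computed directly; a sketch on simulated stand-in data (index 44 stands for whichever observation is removed):

```r
set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
eta_true <- 1 + 0.5 * x1 + 0.5 * x2 + 0.5 * x3
y  <- rgamma(n, shape = 5, rate = 5 * eta_true)  # simulated stand-in, E(Y) = 1/eta_true
mydat <- data.frame(y, x1, x2, x3)

mod    <- glm(y ~ x1 + x2 + x3, family = Gamma, data = mydat)
mydat1 <- mydat[-44, ]                                         # drop observation 44
mod2   <- glm(y ~ x1 + x2 + x3, family = Gamma, data = mydat1)

rel_change <- (coef(mod2) - coef(mod)) / coef(mod)  # relative change per coefficient
round(rel_change, 3)
```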
