On data mining: R programming tutoring for STATGM003 Statistical Computing In-Course Assessment, with answers


Full text link: http://tecdat.cn/?p=30885

• Your solutions should be your own work and are to be handed in by yourself to the Statistical Science Departmental office by 1600hrs on MONDAY, 23rd FEBRUARY

Declaration: I am aware of the UCL Statistical Science Department’s regulations on plagiarism for assessed coursework. I have read the guidelines in the student handbook and understand what constitutes plagiarism. I hereby affirm that the work I am submitting for this in-course assessment is entirely my own.

G3 Assessment 2: Instructions

  1. Answer both questions.
  2. For Question 1 you should hand in a paper copy of your report for part (d), and you should also upload an electronic copy of your report to the course Moodle page. You are NOT required to submit your R script for this question.
  3. For Question 2 you should submit:
    • an electronic copy of your Surname.r file containing your R script (see below),
    • a printout of your Surname.r file containing your R script (see below),
    • a printout of the graph in part (b) that is produced by your script (black and white copy is sufficient), and
    • a printout of your Surname_out.txt file (see below) containing output to parts (d), (e) and (f) produced by your R script.
    Your R script should be saved in a file named Surname.r, where Surname is your surname. For example, if your name is Karl Pearson, your R script should be saved in the file Pearson.r. Only one file is required/allowed. Do not submit a zip file. Your code should create an output file called Surname_out.txt, where again Surname is your surname. This file should contain the output from parts (d), (e) and (f) and it should include text and comments indicating what the results are. This should be done by inserting remarks in the source file using the cat() function.
    Your program should be well commented and have a header section summarising the logical structure, followed by the main body of the script, which should also contain comments. You should clearly indicate the question and part numbers in both your code and output.
    It should execute without the need for user intervention when the R command source() is run on it.
  4. Paper copies of your answers and printouts should be handed in to the Statistical Science departmental office. Your answers to both questions should be submitted in a single document (not in loose pages) that is clearly identified with your name. Remember to complete a plagiarism declaration, and to attach it to your work.
  5. Electronic copies of your script (i.e., just the file Surname.r) should be submitted via the Moodle page for the course: look for the heading "Submission of in-course assessment 2" and follow the link beneath it to obtain instructions on how to do this.
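The output-file requirement in item 3 can be met in more than one way. The following is a minimal sketch only, assuming the hypothetical surname Pearson from the example in the instructions and using placeholder numbers rather than real results. It shows cat() writing labelled text to the file and sink() capturing printed output.

```r
# Hypothetical output file name, following the Pearson example in the instructions
out_file <- "Pearson_out.txt"

# Start a fresh file, then append labelled results with cat()
cat("Question 2, part (d)\n", file = out_file)
x <- c(1.2, 0.7, 3.1)  # placeholder values, not real results
cat("Mean of placeholder values:", mean(x), "\n", file = out_file, append = TRUE)

# sink() redirects printed output (e.g. from print() or summary()) to the same file
sink(out_file, append = TRUE)
print(summary(x))
sink()  # restore output to the console
```

Because nothing here waits for input, a script written this way still runs unattended under source().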

STATG003/M003 Assessment 2: Questions

  1. The file lungfunction.dat contains data from 50 people. Lung function can be measured by a forced out breath of air into a device called a spirometer and is used as a marker for lung health. There are four quantitative variables: forced expiratory volume in 1 second (FEV1) measurement (to be denoted by fev and measured in litres), the person’s height in metres (to be denoted by height), age in years (to be denoted by age) and average time spent exercising in hours per week (to be denoted by exercise). In addition there is a variable indicating if the person has used an inhaler (to be denoted inhaler) within the last 24 hours (1=no, 2=yes). Clinicians are interested in how a person’s lung health depends on height, age, exercise and inhaler usage.

(a) Download the file lungfunction.dat from the G3 Moodle page. Read the data into R using read.table and then name the columns as fev, height, inhaler, age, exercise.
(b) Obtain summary statistics for each quantitative variable and make useful plots of the data, i.e., plots that are relevant to the objectives of the study. Such plots may include, but are not necessarily restricted to, pairwise scatter plots with different plotting symbols for those who have or haven’t used an inhaler recently. Put plots together in a single figure where appropriate and consider possibly using log scales for the quantitative variables.
(c) Find a linear model that enables fev to be predicted from the other variables and that is not more complicated than necessary. You may wish to consider using log transformations of one or more of the explanatory variables. All your models should be fitted using the lm function, and a wide range of models should be considered to make your choice of model convincing, with the use of appropriate diagnostics to assess them. Ultimately you are required to recommend a single model that is suitable for interpretation and to justify your recommendation.
(d) Write a brief report on your analysis in three sections:
I Describe briefly what you found in your exploratory analysis in part (b),
II Describe briefly (without too many technical details) what models you considered in part (c) and why you chose the model you did, and
III State your final model clearly and describe it in words. Remember to include an estimate of the error standard deviation and say what this means. Give an estimate of what would be the effect on the average FEV1 of being older (e.g., by 1 year of age). Give an appropriate assessment of the uncertainty in your estimate.
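As an illustration of the exploratory step in part (b), here is a rough sketch. Since lungfunction.dat is not reproduced here, it uses a simulated stand-in data frame called lung; the name and every coefficient in the simulation are invented for illustration and say nothing about the real data.

```r
# Simulated stand-in for lungfunction.dat (the real file is not reproduced here);
# all coefficients below are invented purely for illustration
set.seed(1)
n <- 50
lung <- data.frame(
  height   = runif(n, 1.5, 1.9),
  inhaler  = sample(1:2, n, replace = TRUE),  # 1 = no, 2 = yes
  age      = round(runif(n, 20, 70)),
  exercise = runif(n, 0.5, 10)
)
lung$fev <- 2 * lung$height - 0.02 * lung$age + 0.03 * lung$exercise +
  rnorm(n, sd = 0.3)

# Summary statistics for the quantitative variables only
quant <- c("fev", "height", "age", "exercise")
print(summary(lung[, quant]))

# Three scatter plots in one figure, with a different symbol per inhaler group
op <- par(mfrow = c(1, 3))
for (v in c("height", "age", "exercise")) {
  plot(lung[[v]], lung$fev, pch = c(1, 17)[lung$inhaler],
       xlab = v, ylab = "fev")
}
legend("topright", legend = c("no inhaler", "inhaler"), pch = c(1, 17))
par(op)  # restore the previous graphics settings
```

Passing log = "x" or log = "xy" to plot() is one way to try the log scales the question mentions.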

  2. The file trnormal.dat contains observations from a truncated normal distribution. Each observation was originally drawn from a normal N(µ, σ²) distribution, and any values less than u are replaced by u. In this instance, u is taken to be 1. So if X ∼ N(µ, σ²) then the observations are from the distribution of W = max(u, X) where u = 1. Hence:

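$$P(W = u) = P(X \le u) = \Phi\!\left(\frac{u - \mu}{\sigma}\right)$$

and, for w > u, W has density

$$f_W(w) = \frac{1}{\sigma}\,\phi\!\left(\frac{w - \mu}{\sigma}\right),$$

where Φ and φ are the standard normal distribution function and density.

The log-likelihood of µ and σ given a set of observations w1, ..., wn is

$$l(\mu, \sigma \mid w) = \sum_{i=1}^{n} \left[ I(w_i = u)\,\log \Phi\!\left(\frac{u - \mu}{\sigma}\right) + I(w_i > u)\,\log\!\left\{ \frac{1}{\sigma}\,\phi\!\left(\frac{w_i - \mu}{\sigma}\right) \right\} \right]$$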

The function I(C) is the indicator function, taking the value 1 if the condition C is
true and 0 if the condition C is false.
(a) Download the data trnormal.dat from the G3 Moodle page. Read it into R
using scan.
(b) Obtain summary statistics for the data and plot a histogram.
(c) Write a function called negll that takes two arguments
(i) params, a vector containing the values of the two parameters (µ, σ), and
(ii) dat, a vector w of the data,
and returns the negative log-likelihood, −l(µ, σ | w). (Hint: the R functions pnorm and dnorm may be useful in computing the negative log-likelihood.)
(d) Use your function negll to evaluate and print out the negative log-likelihood
for the data in trnormal.dat for a few sensible values of µ and σ.
(e) Use the R function nlm to find and print out the maximum likelihood
estimates of µ and σ for the data in trnormal.dat by minimising the negative
log likelihood.
(f) Obtain and print out approximate standard errors for these estimates.
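One possible shape for the negll function of parts (c) and (d) is sketched below. It is a sketch under stated assumptions, not the model answer: trnormal.dat is not reproduced here, so the vector w is simulated, and the true values 2 and 1.5 are invented for illustration.

```r
# Negative log-likelihood for W = max(u, X) with X ~ N(mu, sigma^2) and u = 1
negll <- function(params, dat) {
  u <- 1
  mu <- params[1]
  sigma <- params[2]
  if (sigma <= 0) return(Inf)  # guard: sigma must be positive
  at_u <- dat <= u             # observations that were replaced by u
  # Each replaced value contributes log P(X <= u)
  ll_at_u <- sum(at_u) * pnorm(u, mean = mu, sd = sigma, log.p = TRUE)
  # Each value above u contributes the usual normal log-density
  ll_above <- sum(dnorm(dat[!at_u], mean = mu, sd = sigma, log = TRUE))
  -(ll_at_u + ll_above)
}

# Simulated stand-in for trnormal.dat (invented true values: mu = 2, sigma = 1.5)
set.seed(3)
w <- pmax(1, rnorm(200, mean = 2, sd = 1.5))

# Part (d): evaluate at a few candidate parameter values
for (p in list(c(1, 1), c(2, 1.5), c(2, 2))) {
  cat("negll at mu =", p[1], ", sigma =", p[2], ":", negll(p, w), "\n")
}
```

Using log.p = TRUE and log = TRUE lets pnorm and dnorm return logs directly, which is more stable than taking log() of a very small probability.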


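#1

# Part (a): read the data (assumes lungfunction.dat is in the working directory)
data=read.table("lungfunction.dat")
colnames(data)=c("fev", "height","inhaler", "age", "exercise")# assign column names
summary(data)
cor(data)# pairwise correlations between the variables
plot(data)

boxplot(fev ~ inhaler, data = data,
          col = "yellow",
         main = "Boxplot of fev by inhaler",
         xlab = "inhaler",
         ylab = "fev",
         xlim = c(0, 3), ylim = c(5, 9), yaxs = "i")
# The original listing is truncated here: an unfinished boxplot(fev ~ height, col = "red", ...)
# call runs straight into summary(lm1). lm1 is assumed below to be the full additive model.
lm1=lm(fev~height+inhaler+age+exercise,data=data)
summary(lm1)
# Stepwise selection in both directions to drop irrelevant variables
lm2=step(lm1,direction="both")
summary(lm2)
# Log-transform an explanatory variable. The original used log(inhaler), which is
# collinear with the binary inhaler term; log(height) is an assumed substitute.
lm3=lm(fev~log(height)+inhaler+age,data=data)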
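The following sketch shows one way to compare candidate models, check diagnostics and report the quantities asked for in part (d) III. It is illustrative only: the data frame lung is simulated with invented coefficients, and the model names full and reduced are hypothetical, so the numbers it prints say nothing about the real lungfunction.dat.

```r
# Simulated stand-in data (invented coefficients; not the real lungfunction.dat)
set.seed(2)
n <- 50
lung <- data.frame(height   = runif(n, 1.5, 1.9),
                   inhaler  = sample(1:2, n, replace = TRUE),
                   age      = round(runif(n, 20, 70)),
                   exercise = runif(n, 0.5, 10))
lung$fev <- 2 * lung$height - 0.02 * lung$age + rnorm(n, sd = 0.3)

full    <- lm(fev ~ height + factor(inhaler) + age + exercise, data = lung)
reduced <- lm(fev ~ height + age, data = lung)

# F-test for dropping inhaler and exercise, plus AIC for both fits
print(anova(reduced, full))
print(AIC(full, reduced))

# Standard residual diagnostics in one 2 x 2 figure
op <- par(mfrow = c(2, 2))
plot(reduced)
par(op)

# Error standard deviation, and the age effect with a 95% confidence interval
sigma_hat <- sigma(reduced)
age_ci <- confint(reduced, "age", level = 0.95)
cat("Residual SD:", sigma_hat, "\n")
cat("Change in mean fev per extra year of age:", coef(reduced)["age"],
    "with 95% CI", age_ci[1], "to", age_ci[2], "\n")
```

The residual standard deviation is the typical size of the deviation of an observed fev from its fitted mean, in litres, and the confidence interval is one conventional way to express the uncertainty in the age effect.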



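#2

# Evaluate the negative log-likelihood at a few parameter values
# (negll is the function from part (c); data holds the trnormal.dat observations)
print(negll(c(1,1),data))
print(negll(c(2,2),data))
print(negll(c(1,3),data))
# Estimate the parameters by maximum likelihood, minimising negll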
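The listing stops before the estimation step. As a hedged sketch of how parts (e) and (f) could proceed, the block below minimises a negll function with nlm and takes approximate standard errors from the inverse of the numerically estimated Hessian. The data vector w is simulated with invented true values (µ = 2, σ = 1.5), and the compact negll is repeated only so that the block runs on its own.

```r
# Compact negative log-likelihood for W = max(1, X), X ~ N(mu, sigma^2)
negll <- function(params, dat) {
  if (params[2] <= 0) return(Inf)  # guard: sigma must be positive
  at_u <- dat <= 1
  -(sum(at_u) * pnorm(1, params[1], params[2], log.p = TRUE) +
      sum(dnorm(dat[!at_u], params[1], params[2], log = TRUE)))
}

# Simulated stand-in for trnormal.dat (invented true values)
set.seed(4)
w <- pmax(1, rnorm(500, mean = 2, sd = 1.5))

# Crude starting values: sample mean and standard deviation of the data
start <- c(mean(w), sd(w))
fit <- nlm(negll, start, dat = w, hessian = TRUE)

# The Hessian of the negative log-likelihood at the minimum estimates the
# observed information; its inverse approximates the covariance matrix
mle <- fit$estimate
se <- sqrt(diag(solve(fit$hessian)))
cat("MLE of mu:   ", mle[1], " approx. SE:", se[1], "\n")
cat("MLE of sigma:", mle[2], " approx. SE:", se[2], "\n")
```

It is worth checking fit$code after the call, since nlm reports there whether it stopped at a probable minimum or hit the iteration limit.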
