About Machine Learning: STAT-5701

STAT 5701 Homework 4 – Fall 2021
This homework is due on Tuesday November 16 at 11:59pm. There is a total of 38 points. Submit
your solutions in a pdf document on Canvas. Include your R code (which must be commented and
properly indented) in the pdf file. Copying code from websites is not permitted. Cite all sources
(including lecture notes). Show all of the steps that you took to solve each problem. Please name
the pdf file -HW4.pdf. Please also submit one text file with your R code, which
must be commented and properly indented.

1. You will analyze a reduced version of a dataset from Karagas et al. (1996). There are n = 21
subjects. The response is arsenic.toenail, which is the level of arsenic in the subject’s
toenail. There are three explanatory variables:
• arsenic.water, the level of arsenic in the subject’s household water supply;
• gender, the gender of the subject;
• age, the age of the subject in years.
The dataset is in the dataframe object arsenic in the R binary file “arsenic.rdata” posted on
Canvas. If this file is in R’s current working directory, then the command load("arsenic.rdata")
puts the dataframe object arsenic in R’s workspace. Calling the functions lm() or glm() is
not allowed in this problem.
(a) (4 points) Fit a linear regression model to these data, where the response is the natural
logarithm of arsenic.toenail, and the explanatory variables are those listed above with
the addition of an interaction between gender and arsenic.water. Report estimates
of the regression coefficients and the error variance.
(b) (5 points) What does the model used in part 1a assume about these data? We are
looking for a full specification of the data-generating model here, where all symbols are
defined and it is clear what is unknown. Phrases like “realization of” should be used.
(c) (5 points) Let the model with the three explanatory variables listed (without interactions)
be our full model. Determine the submodel of this full model (which has a subset
of the explanatory variables) that is selected by AIC. Ensure that all possible submodels
that respect the hierarchy of terms are evaluated.
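Since lm() and glm() are off limits here, one possible route is to build the design matrix by hand and solve the least-squares problem directly. The sketch below is a minimal illustration on synthetic data, not the arsenic dataset: the data frame toy, its column names, and the coefficients used to simulate it are all invented for the example. It also shows the Gaussian AIC computed from the residual sum of squares, counting the error variance as one parameter (the same convention as stats::AIC).

```r
# Minimal sketch: least squares without lm() or glm(), on synthetic data.
# The data frame `toy` and every value in it are invented for illustration.
set.seed(1)
n <- 21
toy <- data.frame(
  water  = runif(n, 0, 0.1),
  gender = factor(rep(c("Female", "Male"), length.out = n)),
  age    = round(runif(n, 30, 80))
)
toy$log.y <- -1 + 8 * toy$water + 0.01 * toy$age + rnorm(n, sd = 0.5)

# Design matrix: intercept, main effects, and the gender-by-water interaction
X <- model.matrix(~ water + gender + age + gender:water, data = toy)
y <- toy$log.y

# Solve the least-squares problem through a QR decomposition of X
beta.hat <- qr.coef(qr(X), y)

# Unbiased error-variance estimate: RSS / (n - number of coefficients)
rss <- sum((y - X %*% beta.hat)^2)
sigma2.hat <- rss / (n - ncol(X))

# Gaussian AIC: -2 * (maximized loglikelihood) + 2 * (number of parameters),
# where the parameters are the coefficients plus the error variance
aic <- n * (log(2 * pi * rss / n) + 1) + 2 * (ncol(X) + 1)
```

To compare submodels, the same steps can be repeated with a different formula passed to model.matrix(), keeping the response and n fixed so that the AIC values are comparable.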
2. Suppose that the yet-to-be observed measurements of a response $X_1, \ldots, X_n$ are iid $N(\mu_*, \mu_*)$,
where $\mu_* \in (0, \infty)$ is unknown. We will study three competing estimators of $\mu_*$:
$\bar{X} = n^{-1} \sum_{i=1}^n X_i$, $S^2 = (n-1)^{-1} \sum_{i=1}^n (X_i - \bar{X})^2$, and $\hat{\mu}$, defined as the maximum likelihood
estimator of $\mu_*$. The negative loglikelihood function $f : (0, \infty) \to \mathbb{R}$ is defined by
$$f(\mu) = \frac{n}{2} \log(2\pi) + \frac{n}{2} \log(\mu) + \frac{1}{2\mu} \sum_{i=1}^n (X_i - \mu)^2.$$
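As a check on this definition, the expression is what results from taking the negative logarithm of the joint density of $n$ independent normal observations whose mean and variance are both equal to $\mu$:

$$f(\mu) = -\sum_{i=1}^n \log\left\{ (2\pi\mu)^{-1/2} \exp\left( -\frac{(X_i - \mu)^2}{2\mu} \right) \right\} = \frac{n}{2} \log(2\pi) + \frac{n}{2} \log(\mu) + \frac{1}{2\mu} \sum_{i=1}^n (X_i - \mu)^2.$$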
(a) (3 points) A statistician claims that $\operatorname{cov}(\bar{X}, S^2) = 0$. Perform a simulation study to
see if there is simulation-based statistical evidence that $\operatorname{cov}(\bar{X}, S^2) \neq 0$. Set $\mu_* = 0.5$
and $n = 10$. It is recommended that you make a 95% approximate simulation-based
confidence interval for $\operatorname{cov}(\bar{X}, S^2) = E((\bar{X} - \mu_*)(S^2 - \mu_*))$ based on 10000 independent
replications.
(b) (2 points) Show that every convex combination of $\bar{X}$ and $S^2$ is unbiased for $\mu_*$.
(c) (5 points) Consider the competing unbiased estimator of $\mu_*$ defined by $\lambda_* \bar{X} + (1 - \lambda_*) S^2$,
where
$$\lambda_* = \arg\min_{\lambda \in [0,1]} E\left[ \left( \lambda \bar{X} + (1 - \lambda) S^2 - \mu_* \right)^2 \right].$$
Using the fact that $\operatorname{cov}(\bar{X}, S^2) = 0$, derive a simple formula for $\lambda_*$. This formula should
involve $n$ and $\mu_*$. Since $\mu_*$ is unknown in practice, this estimator would need to be
modified for practical use, e.g. by replacing $\mu_*$ with its maximum likelihood estimator
in the formula for $\lambda_*$.
(d) (4 points) Find the convex subset of $(0, \infty)$ over which the negative loglikelihood is a
convex function. At least one endpoint for this interval should involve $n$ and $X_1, \ldots, X_n$.
(e) (2 points) Set $n = 10$ and $\mu_* = 0.5$. Generate a realization of $X_1, \ldots, X_{10}$ and graph
the realization of $f$ over the interval derived in part 2d. Since the left boundary of
this interval is zero, which is not in the domain of $f$, I recommend choosing the left
endpoint close to 0.05 or 0.1 (instead of values very close to zero like $10^{-7}$) to improve
the illustration.
(f) (2 points) Let $\hat{\mu}$ be the maximum likelihood estimator of $\mu_*$. Derive a simplified expression
for $\hat{\mu}$.
(g) (6 points) Set $n = 10$. For each $\mu_* \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$, perform a simulation
study that computes 99% approximate simulation-based confidence intervals, based on
10,000 replications, for the following five expected values: $E(|\bar{X} - \mu_*|)$, $E(|S^2 - \mu_*|)$,
$E(|\hat{\mu} - \mu_*|)$, $E(|\bar{X} - \mu_*| - |\hat{\mu} - \mu_*|)$, and $E(|S^2 - \mu_*| - |\hat{\mu} - \mu_*|)$. In addition, for each
value of $\mu_*$ used, report the value of $\lambda_*$ derived in part 2c. Based on the results of this
simulation study, which of the three estimators of $\mu_*$ is the best? Explain.
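The simulation-based confidence intervals requested in this problem can all be built the same way: simulate many independent replications, compute one draw of the quantity of interest per replication, and apply a normal approximation to the sample mean of those draws. The following is a rough sketch under that assumption, using the settings of part 2a. The helper name sim.ci and the seed are hypothetical choices, not part of the assignment.

```r
# Hypothetical helper: normal-approximation confidence interval for an
# expected value, computed from independent simulated draws.
sim.ci <- function(draws, level = 0.95) {
  reps <- length(draws)
  z <- qnorm(1 - (1 - level) / 2)
  half.width <- z * sd(draws) / sqrt(reps)
  c(lower = mean(draws) - half.width, upper = mean(draws) + half.width)
}

set.seed(5701)  # arbitrary seed, chosen only for reproducibility
n <- 10
mu.star <- 0.5
reps <- 10000

# Each column of x is one replication of X_1, ..., X_n, iid N(mu.star, mu.star).
# Note that rnorm() takes the standard deviation, not the variance.
x <- matrix(rnorm(n * reps, mean = mu.star, sd = sqrt(mu.star)), nrow = n)
xbar <- colMeans(x)
s2 <- apply(x, 2, var)

# One draw per replication of (Xbar - mu.star) * (S^2 - mu.star)
prod.draws <- (xbar - mu.star) * (s2 - mu.star)
ci <- sim.ci(prod.draws, level = 0.95)
```

The same helper applies to any per-replication quantity, for instance abs(xbar - mu.star), by passing a different vector of draws and setting level = 0.99 where a 99% interval is wanted.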
