About machine learning: STAT-5701


STAT 5701 Homework 4 – Fall 2021
This homework is due on Tuesday November 16 at 11:59pm. There is a total of 38 points. Submit
your solutions in a pdf document on Canvas. Include your R code (which must be commented and
properly indented) in the pdf file. Copying code from websites is not permitted. Cite all sources
(including lecture notes). Show all of the steps that you took to solve each problem. Please name
the pdf file -HW4.pdf. Please also submit one text file with your R code, which
must be commented and properly indented.

  1. You will analyze a reduced version of a dataset from Karagas et al. (1996). There are n = 21
    subjects. The response is arsenic.toenail, which is the level of arsenic in the subject’s
    toenail. There are three explanatory variables:
    • arsenic.water, the level of arsenic in the subject’s household water supply;
    • gender, the gender of the subject;
    • age, the age of the subject in years.
    The dataset is in the dataframe object arsenic in the R binary file “arsenic.rdata” posted on
    Canvas. If this file is in R’s current working directory, then the command load("arsenic.rdata")
    puts the dataframe object arsenic in R’s workspace. Calling the functions lm() or glm() is
    not allowed in this problem.
    (a) (4 points) Fit a linear regression model to these data, where the response is the natural
    logarithm of arsenic.toenail, and the explanatory variables are those listed above with
    the addition of an interaction between gender and arsenic.water. Report estimates
    of the regression coefficients and the error variance.
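
For illustration only, the following sketch shows one way to compute least-squares estimates with matrix algebra instead of lm(). It is not the official solution. Because arsenic.rdata is not available here, the data frame below is a synthetic stand-in with made-up values; only the column names come from the problem statement. Using model.matrix() to build the design matrix and the divisor n - p for the error variance are assumptions of this sketch.

```r
# Synthetic stand-in for the arsenic data frame (all values are made up).
set.seed(1)
n <- 21
arsenic <- data.frame(
  arsenic.water = rexp(n, rate = 20),
  gender = factor(rep(c("Female", "Male"), length.out = n)),
  age = sample(30:80, n, replace = TRUE)
)
arsenic$arsenic.toenail <- exp(-2 + 5 * arsenic$arsenic.water + rnorm(n, sd = 0.5))

# Design matrix: main effects plus the gender-by-arsenic.water interaction.
X <- model.matrix(~ arsenic.water * gender + age, data = arsenic)
y <- log(arsenic$arsenic.toenail)

# Least-squares coefficients from the normal equations (X'X) b = X'y.
beta.hat <- solve(crossprod(X), crossprod(X, y))

# Error-variance estimate: residual sum of squares over (n - p).
res <- y - X %*% beta.hat
sigma2.hat <- sum(res^2) / (nrow(X) - ncol(X))

beta.hat
sigma2.hat
```

With the real data, the synthetic block would be replaced by load("arsenic.rdata").
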
    (b) (5 points) What does the model used in part 1a assume about these data? We are
    looking for a full specification of the data-generating model here, where all symbols are
    defined and it is clear what is unknown. Phrases like “realization of” should be used.
    (c) (5 points) Let the model with the three explanatory variables listed (without interactions)
    be our full model. Determine the submodel of this full model (which has a subset
    of the explanatory variables) that is selected by AIC. Ensure that all possible submodels
    that respect the hierarchy of terms are evaluated.
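
As a rough sketch of how an exhaustive AIC search could be organized without lm(), the code below enumerates all 2^3 subsets of the three main effects and evaluates a Gaussian AIC by hand. The data are again a synthetic stand-in, the response is assumed to be the log of arsenic.toenail as in part 1a, and the helper name aic.ls is hypothetical. With no interaction terms in the full model, every subset respects the hierarchy.

```r
# Synthetic stand-in data (made-up values); y plays the role of log(arsenic.toenail).
set.seed(1)
n <- 21
dat <- data.frame(
  arsenic.water = rexp(n, rate = 20),
  gender = factor(rep(c("Female", "Male"), length.out = n)),
  age = sample(30:80, n, replace = TRUE)
)
y <- -2 + 5 * dat$arsenic.water + rnorm(n, sd = 0.5)

vars <- c("arsenic.water", "gender", "age")

# Gaussian AIC: -2 * maximized loglikelihood + 2 * (number of parameters),
# counting the regression coefficients plus the error variance.
aic.ls <- function(X, y) {
  n <- length(y)
  b <- solve(crossprod(X), crossprod(X, y))
  rss <- sum((y - X %*% b)^2)
  n * log(2 * pi * rss / n) + n + 2 * (ncol(X) + 1)
}

# Label and right-hand side for a subset (empty subset = intercept only).
rhs.of <- function(keep) if (any(keep)) paste(vars[keep], collapse = " + ") else "1"

# All 2^3 inclusion patterns of the main effects.
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), length(vars)))
aic <- apply(subsets, 1, function(keep) {
  X <- model.matrix(as.formula(paste("~", rhs.of(keep))), data = dat)
  aic.ls(X, y)
})
names(aic) <- apply(subsets, 1, rhs.of)

sort(aic)  # smallest AIC first
```
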
  2. Suppose that the yet-to-be observed measurements of a response $X_1, \ldots, X_n$ are iid
    $N(\mu_*, \mu_*)$, where $\mu_* \in (0, \infty)$ is unknown. We will study three competing
    estimators of $\mu_*$: $\bar{X} = n^{-1} \sum_{i=1}^n X_i$, $S^2 = (n-1)^{-1} \sum_{i=1}^n (X_i - \bar{X})^2$,
    and $\hat{\mu}$, defined as the maximum likelihood estimator of $\mu_*$. The negative
    loglikelihood function $f : (0, \infty) \to \mathbb{R}$ is defined by
    $$f(\mu) = \frac{n}{2}\log(2\pi) + \frac{n}{2}\log(\mu) + \frac{1}{2\mu}\sum_{i=1}^n (X_i - \mu)^2.$$
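
The negative loglikelihood above can be written as a short R function. This is a minimal sketch: the function name negloglik, the seed, and the simulated sample are choices made here for illustration. The last line checks the hand-coded formula against R's dnorm() for a normal distribution whose variance equals its mean.

```r
# Negative loglikelihood for iid N(mu, mu) data, vectorized over mu.
negloglik <- function(mu, x) {
  n <- length(x)
  sapply(mu, function(m)
    n / 2 * log(2 * pi) + n / 2 * log(m) + sum((x - m)^2) / (2 * m))
}

set.seed(5701)
x <- rnorm(10, mean = 0.5, sd = sqrt(0.5))  # variance equals the mean

negloglik(c(0.25, 0.5, 1), x)

# Sanity check against the built-in normal log-density.
all.equal(negloglik(0.5, x), -sum(dnorm(x, mean = 0.5, sd = sqrt(0.5), log = TRUE)))
```
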
    (a) (3 points) A statistician claims that $\mathrm{cov}(\bar{X}, S^2) = 0$. Perform a simulation study to
    see if there is simulation-based statistical evidence that $\mathrm{cov}(\bar{X}, S^2) \neq 0$. Set $\mu_* = 0.5$
    and $n = 10$. It is recommended that you make a 95% approximate simulation-based
    confidence interval for $\mathrm{cov}(\bar{X}, S^2) = E[(\bar{X} - \mu_*)(S^2 - \mu_*)]$ based on 10000 independent
    replications.
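
A simulation-based interval of the kind recommended in part 2a might be sketched as follows. Each replication produces one realization of the product whose expectation is the covariance, and a central-limit-theorem interval is formed from the 10000 products. The seed and variable names are arbitrary choices for this sketch.

```r
set.seed(5701)
mu.star <- 0.5
n <- 10
reps <- 10000

# One realization of (Xbar - mu*)(S^2 - mu*) per replication.
prods <- replicate(reps, {
  x <- rnorm(n, mean = mu.star, sd = sqrt(mu.star))
  (mean(x) - mu.star) * (var(x) - mu.star)
})

# Approximate 95% interval for the expected product, based on the CLT.
est <- mean(prods)
half <- qnorm(0.975) * sd(prods) / sqrt(reps)
c(estimate = est, lower = est - half, upper = est + half)
```

If the interval contains zero, the simulation gives no evidence against the statistician's claim at this level.
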
    (b) (2 points) Show that every convex combination of $\bar{X}$ and $S^2$ is unbiased for $\mu_*$.
    (c) (5 points) Consider the competing unbiased estimator of $\mu_*$ defined by
    $\lambda_* \bar{X} + (1 - \lambda_*) S^2$, where
    $$\lambda_* = \arg\min_{\lambda \in [0,1]} E\left[\left(\lambda \bar{X} + (1 - \lambda) S^2 - \mu_*\right)^2\right].$$
    Using the fact that $\mathrm{cov}(\bar{X}, S^2) = 0$, derive a simple formula for $\lambda_*$. This formula should
    involve $n$ and $\mu_*$. Since $\mu_*$ is unknown in practice, this estimator would need to be
    modified for practical use, e.g. by replacing $\mu_*$ with its maximum likelihood estimator
    in the formula for $\lambda_*$.
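
One possible way to organize the derivation in part 2c is outlined below. It is a sketch, not the official solution. It uses only the unbiasedness from part 2b and the stated zero covariance. The final substitution relies on the standard normal-sample facts that the variance of the sample mean is $\sigma^2/n$ and the variance of the sample variance is $2\sigma^4/(n-1)$, with $\sigma^2 = \mu_*$ here; those facts should be verified or cited rather than taken from this note.

```latex
\begin{align*}
E\left[\left(\lambda \bar{X} + (1-\lambda) S^2 - \mu_*\right)^2\right]
  &= \mathrm{var}\left(\lambda \bar{X} + (1-\lambda) S^2\right)
     && \text{(the combination is unbiased)} \\
  &= \lambda^2 \,\mathrm{var}(\bar{X}) + (1-\lambda)^2 \,\mathrm{var}(S^2)
     && \text{(zero covariance)} \\
0 &= 2\lambda \,\mathrm{var}(\bar{X}) - 2(1-\lambda)\,\mathrm{var}(S^2)
     && \text{(set the derivative in } \lambda \text{ to zero)} \\
\lambda &= \frac{\mathrm{var}(S^2)}{\mathrm{var}(\bar{X}) + \mathrm{var}(S^2)} \in [0, 1].
\end{align*}
```

Substituting the two variances and simplifying gives a formula in $n$ and $\mu_*$ only.
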
    (d) (4 points) Find the convex subset of $(0, \infty)$ over which the negative loglikelihood is a
    convex function. At least one endpoint for this interval should involve $n$ and $X_1, \ldots, X_n$.
    (e) (2 points) Set $n = 10$ and $\mu_* = 0.5$. Generate a realization of $X_1, \ldots, X_{10}$ and graph
    the realization of $f$ over the interval derived in part 2d. Since the left boundary of
    this interval is zero, which is not in the domain of $f$, I recommend choosing the left
    endpoint close to 0.05 or 0.1 (instead of values very close to zero like $10^{-7}$) to improve
    the illustration.
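
A plot of this kind could be produced along the following lines. The right endpoint upper is a placeholder chosen only so that the sketch runs; it is not the endpoint asked for in part 2d and should be replaced by the derived value. The seed and object names are likewise arbitrary.

```r
set.seed(5701)
n <- 10
mu.star <- 0.5
x <- rnorm(n, mean = mu.star, sd = sqrt(mu.star))  # one realization of X_1, ..., X_10

# Negative loglikelihood for iid N(mu, mu) data, vectorized over mu.
negloglik <- function(mu, x)
  sapply(mu, function(m)
    length(x) / 2 * log(2 * pi * m) + sum((x - m)^2) / (2 * m))

lower <- 0.05  # left endpoint suggested in the problem text
upper <- 1     # PLACEHOLDER: substitute the endpoint derived in part 2d

mu.grid <- seq(lower, upper, length.out = 400)
f.vals <- negloglik(mu.grid, x)

plot(mu.grid, f.vals, type = "l",
     xlab = expression(mu), ylab = expression(f(mu)),
     main = "Negative loglikelihood for one realization")
```
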
    (f) (2 points) Let $\hat{\mu}$ be the maximum likelihood estimator of $\mu_*$. Derive a simplified expression
    for $\hat{\mu}$.
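
A derived closed form for the maximum likelihood estimator can be checked numerically by minimizing the negative loglikelihood directly. The sketch below uses R's optimize(); the search interval is a generous heuristic assumed here, not something given in the problem, and the names are illustrative.

```r
set.seed(5701)
x <- rnorm(10, mean = 0.5, sd = sqrt(0.5))

# Negative loglikelihood for iid N(mu, mu) data at a single value of mu.
negloglik <- function(mu, x)
  length(x) / 2 * log(2 * pi * mu) + sum((x - mu)^2) / (2 * mu)

# Numerical minimizer over a wide interval (the upper bound is a heuristic).
fit <- optimize(negloglik, interval = c(1e-8, 10 * max(1, mean(x^2))),
                x = x, tol = 1e-10)
mu.hat.num <- fit$minimum
mu.hat.num
```

Comparing mu.hat.num with the value of a candidate formula on the same sample gives a quick consistency check.
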
    (g) (6 points) Set $n = 10$. For each $\mu_* \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$, perform a simulation
    study that computes 99% approximate simulation-based confidence intervals, based on
    10,000 replications, for the following five expected values: $E(|\bar{X} - \mu_*|)$, $E(|S^2 - \mu_*|)$,
    $E(|\hat{\mu} - \mu_*|)$, $E(|\bar{X} - \mu_*| - |\hat{\mu} - \mu_*|)$, and $E(|S^2 - \mu_*| - |\hat{\mu} - \mu_*|)$. In addition, for each
    value of $\mu_*$ used, report the value of $\lambda_*$ derived in part 2c. Based on the results of this
    simulation study, which of the three estimators of $\mu_*$ is the best? Explain.
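
The structure of such a simulation study might look like the sketch below. The maximum likelihood estimate is computed numerically with optimize() so that no particular closed form is assumed; a derived formula could replace the helper mle. All function names, the seed, and the search interval are assumptions of this sketch, and no conclusions about the estimators are implied.

```r
set.seed(5701)
n <- 10
reps <- 10000

# Negative loglikelihood for iid N(mu, mu) data at a single value of mu.
negloglik <- function(mu, x)
  length(x) / 2 * log(2 * pi * mu) + sum((x - mu)^2) / (2 * mu)

# Numerical MLE; the search interval is a heuristic choice.
mle <- function(x)
  optimize(negloglik, c(1e-8, 10 * max(1, mean(x^2))), x = x, tol = 1e-8)$minimum

# Approximate CLT-based interval for the mean of simulated values.
sim.ci <- function(v, level = 0.99) {
  half <- qnorm(1 - (1 - level) / 2) * sd(v) / sqrt(length(v))
  c(estimate = mean(v), lower = mean(v) - half, upper = mean(v) + half)
}

# Absolute errors of the three estimators, then intervals for the five expectations.
one.study <- function(mu.star) {
  err <- replicate(reps, {
    x <- rnorm(n, mean = mu.star, sd = sqrt(mu.star))
    abs(c(xbar = mean(x), s2 = var(x), mle = mle(x)) - mu.star)
  })
  rbind(xbar = sim.ci(err["xbar", ]),
        s2 = sim.ci(err["s2", ]),
        mle = sim.ci(err["mle", ]),
        xbar.minus.mle = sim.ci(err["xbar", ] - err["mle", ]),
        s2.minus.mle = sim.ci(err["s2", ] - err["mle", ]))
}

mu.values <- 10^(-2:2)
results <- lapply(mu.values, one.study)
names(results) <- paste0("mu.star=", mu.values)
results[["mu.star=1"]]
```

The two difference rows use the same simulated samples for both estimators, so their intervals reflect paired comparisons rather than independent ones.
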