On algorithms: MATH-523 Linear Models


Generalized Linear Models MATH 523
McGill University, Winter Term 2022
Assignment 2 due on February 16 at noon.
Q1 Lecture 5a
Consider a Poisson GLM with the log link and linear predictor of the form
$$\eta_i = \beta_1 + \beta_2 a_i, \quad i \in \{1, \dots, n\},$$
where $a_i$ is the value of a factor predictor with two levels, such that $a_i = 1$ for
$i \in \{1, \dots, n_1\}$ and $a_i = 0$ for $i \in \{n_1 + 1, \dots, n\}$.
Suppose that at the beginning of the t-th iteration of the Fisher Scoring algorithm
(formulated as iterative reweighted least squares), we get
$$\beta^{(t+1)} = (\beta_1^{(t+1)}, \beta_2^{(t+1)}) = (\log \bar{y}_2, \; \log \bar{y}_1 - \log \bar{y}_2),$$
where $\bar{y}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} y_i$ and $\bar{y}_2 = \frac{1}{n_2} \sum_{i=n_1+1}^{n} y_i$.
(1) Calculate the remaining part of the iteration step of the algorithm: $\eta^{(t+1)}$, $\mu^{(t+1)}$,
$z^{(t+1)}$, $W^{(t+1)}$, $D^{(t+1)}$, and $u^{(t+1)}$.
(2) Does the algorithm terminate after this iteration? Justify your answer.
(3) Did the algorithm find the exact solution after this iteration? Justify your answer.
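The quantities in part (1) can be checked numerically. The following is a minimal R sketch, not the course's reference solution: the counts in `y` are made up, and the conventions used for $D$ (taken here as $d\mu/d\eta$), $W$ and $u$ are common textbook choices that may differ from those in Lecture 5a.

```r
# Made-up counts (an assumption): first n1 = 6 observations have a = 1.
y <- c(3, 5, 4, 6, 2, 7, 1, 0, 2, 3)
a <- c(rep(1, 6), rep(0, 4))
X <- cbind(1, a)                      # design matrix: intercept and indicator

ybar1 <- mean(y[a == 1])              # group mean for a = 1
ybar2 <- mean(y[a == 0])              # group mean for a = 0
beta  <- c(log(ybar2), log(ybar1) - log(ybar2))

eta <- drop(X %*% beta)               # linear predictor
mu  <- exp(eta)                       # inverse of the log link
D   <- diag(mu)                       # assumed convention: d mu / d eta
W   <- diag(mu)                       # assumed convention: (d mu / d eta)^2 / var
z   <- eta + (y - mu) / mu            # working response
u   <- drop(t(X) %*% (y - mu))        # score vector under the canonical link

# One weighted least squares update, as in iterative reweighted least squares.
beta_next <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))

print(u)
print(rbind(beta, beta_next))
```

Comparing `beta` with `beta_next` and inspecting `u` gives a numerical sanity check for the arguments asked for in parts (2) and (3).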
Q2 Lecture 6a
Consider the Gamma GLM (viz. page 2 of Lecture 3a) with a linear predictor of the
form
$$\eta_i = \beta_1 + \beta_2 x_i,$$
where $x_i$ is the value of a continuous predictor corresponding to the $i$th response.
(1) Calculate the standard error of $\hat{\beta}_1$ using (a) the reciprocal link ($g(\mu) = 1/\mu$) and
(b) the identity link ($g(\mu) = \mu$).
(2) Calculate the deviance. Does it depend on the link function? Explain.
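As a point of reference for part (2), the unscaled deviance of a Gamma GLM is usually written as below. This is the standard textbook form and is stated here as an assumption; the parametrization on page 2 of Lecture 3a may differ, for instance by a factor involving the dispersion parameter.

```latex
D(y; \hat{\mu}) = 2 \sum_{i=1}^{n} \left\{ -\log\frac{y_i}{\hat{\mu}_i} + \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i} \right\},
\qquad \hat{\mu}_i = g^{-1}(\hat{\beta}_1 + \hat{\beta}_2 x_i).
```

In this form the link function $g$ enters only through the fitted means $\hat{\mu}_i$.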
Q3 R exercise
Consider the data from the German General Social Survey on the number of children.
This data set contains 3548 observations on the following 6 variables: child (number
of children), age (age of the woman in years), dur (years of education), nation
(nationality of the woman; 0 = German, 1 = otherwise), god (belief of the woman
in God: 1 = Strong agreement, 2 = Agreement, 3 = No definite opinion, 4 = Rather
no agreement, 5 = No agreement at all, 6 = Never thought about it), and univ (whether
the woman visited university: 0 = no, 1 = yes).
The dataset is available in the catdata library in R and can be loaded as follows (after
having installed the catdata library):
Johanna G. Nešlehová
library(catdata)
## Loading required package: MASS
data(children)
attach(children)
head(children)
##    child age dur nation god univ
## 6      2  33   9      0   6    0
## 10     2  80   7      0   1    0
## 11     1  63   8      0   1    0
## 12     2  82   7      0   1    0
## 13     2  49   8      0   1    0
## 14     1  54   9      0   5    0


The data can be used to investigate the effect of age, dur, nation, god and univ on
the number of children. Therefore, we will treat child as a response in the analysis
below.
(1) Fit a Poisson GLM with the canonical link to the data with child as a response,
and age, dur, nation, god and univ as main effects but no interactions and
display the summary of the fit.
(2) List the predictors of the model in part (1). For each of age, dur, nation, god
and univ, decide whether it is a factor or a continuous predictor and determine
whether it is significant at the 5% level using appropriate Wald tests.
(3) Using the model fitted in part (1), estimate the expected number of children
of a German woman aged 44 with 12 years of education, who did not attend
university and is in agreement with the statement that she believes in God, along
with a two-sided 95% approximate (large-sample) confidence interval.
(4) Fit a Poisson GLM to these data using only the intercept and dur predictors,
and the identity link. Why do you think the glm function produced an error message?
Try to fix the problem by supplying starting values to glm using glm(…, start=c(,)).
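The mechanics behind part (3) can be sketched on simulated data. The example below does not use the children data: the data frame `sim`, its variables `x`, `g`, `y`, and the coefficients used to simulate are all hypothetical. It shows one common way to get a large-sample interval for a Poisson mean, namely a Wald interval on the link scale that is then mapped back through the inverse link.

```r
# Simulated data (an assumption, unrelated to the catdata children data).
set.seed(523)
n   <- 200
sim <- data.frame(x = runif(n, 0, 2),
                  g = factor(sample(1:3, n, replace = TRUE)))
sim$y <- rpois(n, lambda = exp(0.3 + 0.5 * sim$x))

# Poisson GLM with the canonical (log) link, main effects only.
fit <- glm(y ~ x + g, family = poisson(link = "log"), data = sim)

# Covariate profile at which the mean is to be estimated.
new <- data.frame(x = 1, g = factor(2, levels = levels(sim$g)))

# Prediction and standard error on the scale of the linear predictor.
pr <- predict(fit, newdata = new, type = "link", se.fit = TRUE)

zq    <- qnorm(0.975)                         # two-sided 95% normal quantile
est   <- unname(exp(pr$fit))                  # estimated mean
lower <- unname(exp(pr$fit - zq * pr$se.fit)) # back-transformed lower limit
upper <- unname(exp(pr$fit + zq * pr$se.fit)) # back-transformed upper limit

print(c(estimate = est, lower = lower, upper = upper))
```

Building the interval on the link scale keeps both limits positive, which an interval built directly on the response scale would not guarantee.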

Johanna G. Nešlehová
Generalized Linear Models MATH 523
McGill University, Winter Term 2022
Assignment 1 due on January 28 at noon.
Q1 Lecture 2b
Consider the exponential distribution with density
$$f(y; \lambda) = \frac{1}{\lambda} e^{-y/\lambda}, \quad y \ge 0,$$
with parameter $\lambda > 0$.
(1) Determine whether the family of exponential distributions is an exponential
dispersion family. If it is not, explain why. If it is, identify the canonical and the
dispersion parameters, and the functions a, b, and c.
(2) Using the methods discussed in Lecture 2b, calculate the mean and variance of
an exponential random variable.
(3) Determine the mean-variance relationship.
(4) Consider the location extension of the exponential distribution, viz.
$$f(y; \lambda, \mu) = \frac{1}{\lambda} e^{-(y - \mu)/\lambda}, \quad y \ge \mu,$$
with parameters $\lambda > 0$ and $\mu \in \mathbb{R}$. Is this family an exponential dispersion family?
Why or why not?
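For orientation, one widely used way of writing an exponential dispersion family in terms of functions a, b, and c is given below. This is an assumption about notation; Lecture 2b may use a slightly different but equivalent form.

```latex
f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\},
\qquad
\mathrm{E}(Y) = b'(\theta),
\qquad
\mathrm{var}(Y) = b''(\theta)\, a(\phi).
```

Here $\theta$ is the canonical parameter and $\phi$ the dispersion parameter. Note that in this form the support of $Y$ is not allowed to depend on the parameters, which is worth keeping in mind for part (4).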
Q2 Lecture 3a
Consider the Negative Binomial distribution with parameters $\mu > 0$ and $\theta_z > 0$; the
corresponding probability mass function is given by
$$f(y; \mu, \theta_z) = \frac{\Gamma(y + \theta_z)}{\Gamma(y + 1)\,\Gamma(\theta_z)} \left(\frac{\theta_z}{\mu + \theta_z}\right)^{\theta_z} \left(\frac{\mu}{\mu + \theta_z}\right)^{y}, \quad y = 0, 1, \dots,$$
where $\Gamma(\cdot)$ denotes the Gamma function. Assume throughout that $\theta_z$, the "number of
successes until the experiment is stopped", is known.
(1) Show that the Negative Binomial family with known $\theta_z$ is an exponential
dispersion family. Identify the functions a, b, and c, and the canonical and dispersion
parameters.
(2) Using the formulas derived in class, calculate E(Y) and var(Y) of a Negative
Binomial random variable Y.
(3) Find the mean-variance relationship.
(4) Find the canonical link for a Negative Binomial GLM and discuss its pros and
cons.
(5) Can you think of another link function that might be more appropriate than the
canonical link?
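The probability mass function in Q2 can be sanity-checked in R. The sketch below assumes that the parametrization of Q2 corresponds to `dnbinom(y, size = theta, mu = mu)`; the values of `mu` and `theta` are arbitrary illustrations, and the support is truncated at 200, where the remaining tail mass is negligible for these values.

```r
mu    <- 2.5      # arbitrary mean parameter (assumption)
theta <- 1.7      # arbitrary value for theta_z (assumption)
y     <- 0:200    # truncated support

# Log of the pmf from Q2, computed with lgamma for numerical stability.
log_pmf <- lgamma(y + theta) - lgamma(y + 1) - lgamma(theta) +
  theta * log(theta / (mu + theta)) + y * log(mu / (mu + theta))
pmf <- exp(log_pmf)

# Compare with R's built-in negative binomial density.
print(all.equal(pmf, dnbinom(y, size = theta, mu = mu)))

# Numerical moments, to compare against the closed forms derived by hand.
total <- sum(pmf)
m     <- sum(y * pmf)
v     <- sum((y - m)^2 * pmf)
print(c(total = total, mean = m, variance = v))
```

The numerical mean and variance give a quick check on the answers to parts (2) and (3) for any chosen pair of parameter values.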
Q3 Lecture 4a
Consider a Negative Binomial GLM with known $\theta_z$, viz. Q2. Suppose
the model contains the intercept and one factor predictor, $A$, with three levels, viz.
$A \in \{1, 2, 3\}$. This means that
$$g(\mu_i) = \alpha + \beta_1 \mathbf{1}(A_i = 2) + \beta_2 \mathbf{1}(A_i = 3).$$
To simplify notation, suppose that $A_i = 1$ for $i = 1, \dots, n_1$, $A_i = 2$ for
$i = n_1 + 1, \dots, n_1 + n_2$, and $A_i = 3$ for $i = n_1 + n_2 + 1, \dots, n$ for some
$n_1, n_2 \in \{1, \dots, n\}$ such that $n_1 + n_2 < n$.
(1) Write down the log-likelihood for $\alpha, \beta_1, \beta_2$ when (i) the canonical link is used and
(ii) when the log link is used.
(2) Write down the likelihood equations for $\alpha, \beta_1, \beta_2$ when (i) the canonical link is
used and (ii) when the log link is used.
(3) Solve the likelihood equations in part (2) explicitly (it’s indeed possible to do
this in this case) when (i) the canonical link is used and (ii) when the log link is
used.
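The indicator coding of the three-level factor in Q3 corresponds to R's default treatment contrasts. A small sketch, with a made-up factor `A` of seven observations, shows the resulting design matrix:

```r
# Made-up factor with levels 1, 2, 3 (an assumption for illustration).
A <- factor(c(1, 1, 2, 2, 2, 3, 3))

# Default treatment contrasts: intercept plus indicators for levels 2 and 3.
X <- model.matrix(~ A)
print(X)

# Number of observations in each level (n1, n2, and n - n1 - n2).
print(table(A))
```

The columns `A2` and `A3` are the indicators multiplying $\beta_1$ and $\beta_2$, so level 1 acts as the baseline absorbed into the intercept $\alpha$.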
Q4 Lecture 4b
Consider again the Negative Binomial GLM with known $\theta_z$ and one factor predictor
with three levels described in Q3.
(1) Calculate the Fisher Information Matrix when (i) the canonical link is used and
(ii) when the log link is used.
(2) Calculate the Hessian when (i) the canonical link is used and (ii) when the log
link is used.
Q5 R Exercise
Load the data crabs2.txt available on myCourses in the Assignments unit under
Content. These data were collected with the goal to explore the effect of various
characteristics of a female horseshoe crab on the number of her satellites, i.e., male
mates attached to her nest. The data contain the following variables:
satell: number of satellites
color: color of the female crab, with values 1=light, 2=light medium, 3=medium, 4=dark medium, 5=dark
spine: condition of the two spines of the female horseshoe crab, with values 1=both good, 2=one worn or broken, 3=both worn or broken
width: carapace width of the female horseshoe crab in cm
weight: weight of the female horseshoe crab in g
Analyze these data with linear regression models, with satell as the response, in
the following steps:
(1) Explain which explanatory variables are factors and which are continuous.
Calculate the correlation between width and weight and explain why it is advisable
to keep only one of these variables in the model (and keep width henceforth).
(2) Using satell as the response and color, spine, and width as explanatory variables,
build the most suitable linear regression model for these data. Don’t forget that
you can include interactions between the explanatory variables.
(3) Redo the analysis in part (2), but treating color and spine as continuous
explanatory variables this time. Explain why this makes sense. Do you obtain a
different model than in part (2)?
(4) Using your analyses in parts (2) and (3), single out a linear regression model that
you find the most appropriate for these data. Using various model diagnostics,
comment on the quality of the fit. Interpret your final model and formulate which
drawbacks it has, in your opinion.
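The comparison between factor and continuous coding in parts (2) to (4) can be sketched as follows. The example uses simulated data only: the data frame `crabs_sim` mimics the variable names of crabs2.txt, but the sample size, the width distribution, and the simulation coefficients are hypothetical and carry no information about the real data.

```r
# Simulated stand-in for the crab data (all numbers are assumptions).
set.seed(2022)
n <- 150
crabs_sim <- data.frame(
  color = sample(1:5, n, replace = TRUE),
  spine = sample(1:3, n, replace = TRUE),
  width = rnorm(n, mean = 26, sd = 2)
)
crabs_sim$satell <- rpois(n, lambda = exp(-3 + 0.15 * crabs_sim$width))

# color and spine treated as factors (one coefficient per non-baseline level).
fit_factor <- lm(satell ~ factor(color) + factor(spine) + width,
                 data = crabs_sim)

# color and spine treated as numeric scores (one slope each).
fit_numeric <- lm(satell ~ color + spine + width, data = crabs_sim)

# The numeric-score model is nested in the factor model, so an F-test applies.
print(anova(fit_numeric, fit_factor))
print(AIC(fit_numeric, fit_factor))

# Standard residual diagnostics could then be drawn with, for example:
# par(mfrow = c(2, 2)); plot(fit_factor)
```

Because the response is a count, residual plots from such linear fits often show non-constant variance, which is one kind of drawback that part (4) asks about.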
