On Machine Learning: STA302/1001 Questions

STA302/1001 – Final Assignment
Due Wednesday April 22 by 11:59PM EST on Crowdmark
Student Name:
Student Email:
Instructions:
This final assignment must be completed individually. Any sharing or discussion of questions or
answers with other students will be considered an academic integrity offence. To ensure that all
students understand the consequences of violating academic integrity, you will need to upload the
attached academic integrity acknowledgment (on page 2), signed at the beginning and at the completion
of this assignment. In the event of suspected integrity violations, this will serve as evidence
that a student knowingly committed an act of academic misconduct. So please ensure that you read
and understand what constitutes academic misconduct as well as the consequences of so doing.
Assignments must be submitted electronically through Crowdmark. Each student will receive a
personalized link to view the assignment (this is where you will submit your assignment when
finished). If you do not receive this email from Crowdmark, check your spam/junk folder. Instructions
for how to upload completed assignments can be found here:
https://crowdmark.com/help/completing-and-submitting-an-assignment/. Note that only PDF, PNG or JPG
file types are accepted by Crowdmark. You will need to upload certain questions into certain
places, so make sure you are submitting pages in the right place.
The assignment is divided into five questions. Each question needs to be uploaded under the correct
section in Crowdmark, otherwise it may be overlooked when graded. For questions that require
hand calculations or proofs/derivations, you must show all your work. You may submit handwritten
answers for these questions, but they must be legible and neat. For questions involving R, you must
provide an appendix that contains all the R code used to complete the question. We need to be
able to verify that your answers can actually be produced from your code. Please do not have R
code or unnecessary R output in your solutions. This should be in the appendix.
Note that as this is meant to replace a final exam and we are keeping the due date of the original
final exam, this means that NO EXTENSIONS WILL BE GRANTED. Therefore, if you have not
submitted by April 22 at 11:59PM EST, you will receive a grade of zero. To ensure that
you submit on time, please start the submission process early, especially if you have unreliable
internet access.
Academic Integrity Acknowledgement Form
Academic integrity is a fundamental value of learning and scholarship at the UofT. Participating
honestly, respectfully, responsibly, and fairly in this academic community ensures that your UofT
degree is valued and respected as a true signifier of your individual academic achievement.
Prior to beginning this final assignment, you must attest that you will follow the Code of Behaviour
on Academic Matters and will not commit academic misconduct in the completion of this assessment.
Affirm your agreement to this by completing the following Statement:
By signing this Statement, I, ____________________, agree to fully
abide by the Code of Behaviour on Academic Matters. I will not commit academic
misconduct and am aware of the penalties that may be imposed if I commit an academic
offence.
The University of Toronto’s Code of Behaviour on Academic Matters outlines the behaviours that
constitute academic misconduct, the processes for addressing academic offences, and the penalties
that may be imposed. You are expected to be familiar with the contents of this document.
Potential offences include, but are not limited to:
• Using someone else's ideas or words without appropriate acknowledgement (this includes from
internet sources or textbooks).
• Submitting your own work in more than one course without the permission of the instructor.
• Making up sources or facts.
• Obtaining or providing unauthorized assistance on any assignment (this includes working in
groups on assignments that are supposed to be individual work).
• Looking at someone else’s answers, or working together to answer questions.
• Letting someone else look at your answers.
• Misrepresenting your identity or having someone else complete your test or exam.
All suspected cases of academic dishonesty will be investigated following the procedures outlined in
the Code of Behaviour on Academic Matters.
Please sign the Statement below to complete your assessment.
By signing this Statement, I, ____________________, am attesting to
the fact that I have abided fully by the Code of Behaviour on Academic Matters. I
have not committed academic misconduct, and am aware of the penalties that may
be imposed if I have committed an academic offence.
Question 1 (12 points) – This question must be done by hand (but may be typed for
submission)
Consider a study design in which we have collected multiple response measurements at each value of
the predictor. Suppose we have $n_i$ observed responses at each value $x_i$ of the predictor, indexed by
$i = 1, \ldots, m$, and $y_{ij}$ denotes the $j$-th observation on the response, $j = 1, \ldots, n_i$, for the
$i$-th value of the predictor. This means we have $m$ unique predictor values, and $n_i$ response
measurements for each of the $m$ values of the predictor. In this situation, it is possible to create a
test for how poorly the regression line captures the linear relationship.
(a) (4 points) Consider the traditional variance decomposition of a simple regression model:
SST = SSReg + RSS. Show that we can further decompose the residual sum of squares
into
• the pure error (i.e. deviations of the individual responses from the average response at
each unique value of the predictor), denoted by SSPure
• and the lack of fit error (i.e. deviations of the average response at each x value from the
regression line), denoted by SSLack.
(b) (1 point) Determine the degrees of freedom for the pure error and the lack of fit error.
(c) (3 points) Determine the expected values of the mean squares of the pure error (MSPure) and
the lack of fit error (MSLack). You may assume that model assumptions are satisfied.
(d) (2 points) The test statistic for this test is
$$F = \frac{\mathrm{MSLack}}{\mathrm{MSPure}}.$$
Explain why this should follow an F distribution.
(e) (2 points) Based on the test statistic in (d) and the expected values in (c), explain why a large
value of the test statistic implies that the true regression function is not linear, and thus the
fit of our regression model is poor.
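The decomposition in part (a) can be checked numerically before attempting the proof. The following is a minimal Python sketch, not a derivation: the replicated data are made up for illustration, and the names `ss_pure`, `ss_lack`, `df_pure` and `df_lack` are labels chosen here rather than notation from the course. It fits a simple linear regression by least squares and compares the residual sum of squares with the sum of the pure error and lack of fit pieces.

```python
# Synthetic replicated design (illustrative numbers only, not assignment data):
# each key is a unique predictor value x_i, each list holds the n_i responses.
data = {
    1.0: [2.1, 1.9, 2.4],
    2.0: [3.8, 4.1],
    3.0: [5.2, 4.9, 5.6, 5.1],
    4.0: [9.0, 8.4, 8.8],
}

# Flatten to (x, y) pairs and fit simple linear regression by least squares.
pairs = [(x, y) for x, ys in data.items() for y in ys]
n = len(pairs)
x_bar = sum(x for x, _ in pairs) / n
y_bar = sum(y for _, y in pairs) / n
sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar


def fitted(x):
    """Fitted value on the least-squares line."""
    return b0 + b1 * x


# Residual sum of squares from the regression line.
rss = sum((y - fitted(x)) ** 2 for x, y in pairs)

# Pure error: responses around their own group mean.
group_means = {x: sum(ys) / len(ys) for x, ys in data.items()}
ss_pure = sum((y - group_means[x]) ** 2 for x, ys in data.items() for y in ys)

# Lack of fit: group means around the regression line, weighted by n_i.
ss_lack = sum(len(ys) * (group_means[x] - fitted(x)) ** 2 for x, ys in data.items())

# Degrees of freedom under the usual lack-of-fit setup.
m = len(data)
df_pure = n - m
df_lack = m - 2
f_stat = (ss_lack / df_lack) / (ss_pure / df_pure)

print(f"RSS = {rss:.6f}, SSPure + SSLack = {ss_pure + ss_lack:.6f}")
print(f"F = {f_stat:.4f} on ({df_lack}, {df_pure}) degrees of freedom")
```

The two printed sums agree because the cross term vanishes: within each group, the deviations of the responses from their group mean sum to zero.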
Question 2 (15 points) – This question must be done by hand (but may be typed for
submission)
A study was run to compare the effect of three different drugs on reducing the pain caused by
a particular condition. The drugs are labelled A, B, and C, and the response of interest is a pain
scale rating (integer-valued), where higher values imply more pain. The goal of the study was
to determine whether there exists a difference in the average pain rating between the three drug
treatments. We can answer this question using multiple linear regression methods. The data can
be found below:
Drug A 4 5 4 3 2 4 3 4 4
Drug B 6 8 4 5 4 6 5 8 6
Drug C 6 7 6 6 7 5 6 5 5
(a) (2 points) Show that we can represent the three treatments/drugs in the form of two indicator
variables. Why don’t we require the use of a third indicator variable?
(b) (2 points) Find the $X'X$ and $X'Y$ matrices for these data.
(c) (3 points) Estimate the regression coefficients for a multiple linear regression model relating
the pain response Y to the three drugs, X.
(d) (3 points) Show that the above regression model can be re-expressed as
$$y_{ij} = \mu + \tau_i + \epsilon_{ij}$$
where $\mu$ is the overall average pain rating, $\tau_i$ is the average pain rating for drug $i$, $\epsilon_{ij}$ is the
random error in the pain rating for individual $j$ and drug $i$, and $y_{ij}$ is the pain rating for
individual $j$ on drug $i$.
(e) (5 points) Perform an appropriate hypothesis test using your model from (c) to determine
whether the average pain ratings for each drug are equal (i.e. $\tau_i = 0$ for all $i$). Use a
significance level $\alpha = 0.05$ and the residual standard error of 1.089.
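The indicator coding in parts (a) and (b) can be sketched in Python as a check on the hand calculation the question asks for. This is one possible coding, not the only one: taking drug A as the baseline is an assumption made here (any of the three drugs could serve), and the column order intercept, indicator for B, indicator for C is likewise a choice. The data are those in the table above.

```python
# Pain ratings from the table in the question.
pain = {
    "A": [4, 5, 4, 3, 2, 4, 3, 4, 4],
    "B": [6, 8, 4, 5, 4, 6, 5, 8, 6],
    "C": [6, 7, 6, 6, 7, 5, 6, 5, 5],
}

# Design matrix rows: [intercept, is_B, is_C].
# Assumption: drug A is the baseline, so it gets zeros in both indicators.
X, Y = [], []
for drug, ratings in pain.items():
    for y in ratings:
        X.append([1, int(drug == "B"), int(drug == "C")])
        Y.append(y)

p = len(X[0])

# X'X: sums of products of each pair of columns.
xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]

# X'Y: sums of products of each column with the response.
xty = [sum(row[j] * y for row, y in zip(X, Y)) for j in range(p)]

for row in xtx:
    print(row)
print(xty)
```

With this coding the entries of $X'X$ are group counts and the entries of $X'Y$ are response totals, which is why a third indicator would add no new information: it would equal the intercept column minus the other two.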
Question 3 (8 points) – This question must be done by hand (but may be typed for
submission)
For each of the parts below, please provide a concise (up to three sentences) but detailed explanation
for each of the concepts. Make sure you use your own words for your answers.
(a) (2 points) Suppose we have the following correlations between a response variable and two
predictor variables. Explain which predictor the forward selection method would add to the
model first. Would the method then add the second predictor variable? Why or why not?
        Y       X1      X2
Y       1       0.93    -0.99
X1      0.93    1       0.985
X2      -0.99   0.985   1
(b) (2 points) Explain how violations in the model assumptions affect the ANOVA test of overall
significance in simple linear regression.
(c) (2 points) In the event that condition 1 or 2 fails, explain why we are unable to use the specific
patterns seen in the residual plots to tell us in what way the model assumptions are violated.
(d) (2 points) Explain why, when you have response measurements that are means or medians,
using a weight equal to the number of observations used to create that value can correct for
violations of constant variance.
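The variance fact behind part (d) can be written as a short worked equation. This sketch assumes that the $n_i$ observations behind each mean $\bar{y}_i$ are independent with common variance $\sigma^2$, which the question does not state explicitly. It covers means only; for medians the same proportionality to $1/n_i$ holds only approximately, in large samples.

```latex
\operatorname{Var}(\bar{y}_i) = \frac{\sigma^2}{n_i},
\qquad
w_i = n_i
\;\Longrightarrow\;
\operatorname{Var}\!\left(\sqrt{w_i}\,\bar{y}_i\right)
  = n_i \cdot \frac{\sigma^2}{n_i}
  = \sigma^2 .
```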
Question 4 (10 points) – This question must be completed using R
Consider the New York City menu dataset, which can be found on the assignment page on Quercus
or attached with this question.
(a) (1 point) Fit a multiple linear regression model to predict Price from the variables Food,
Decor, and East. Extract the residuals from this model and save them. What do they
represent in the context of this model?
(b) (1 point) Fit a multiple linear model to predict Service from Decor, Food and East. Extract
the residuals from this model and save them. What do they represent in the context of this
model?
(c) (1 point) What can we say about the predictors based on the model from (b)?
(d) (2 points) Plot the residuals saved from part (a) against the residuals saved from part (b).
Add a line representing the simple linear regression relationship between these two sets of
residuals. What relationship do you see between the two sets of residuals?
(e) (3 points) Compare the relationship in your plot from (d) to a multiple linear model predicting
Price from the variables Food, Decor, Service and East. What similarities do you see? What
does the plot represent and how does it achieve this?
(f) (2 points) How else might this plot be used for diagnostic purposes?
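The residual-on-residual construction in parts (a) to (e) can be explored numerically. The following is a minimal Python sketch and not a substitute for the R analysis the question requires: the data are synthetic (generated with a fixed seed, not the New York City menu dataset), the variable names only mirror those in the question, and the helpers `solve` and `ols` are hypothetical functions written here for illustration. It compares the slope from regressing one set of residuals on the other with the Service coefficient in the full model.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns.

    Returns (coefficients, residuals); coefficients start with the intercept.
    """
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    return beta, resid


# Synthetic stand-in data; the generating coefficients are arbitrary.
random.seed(1)
n = 60
food = [random.uniform(16, 25) for _ in range(n)]
decor = [random.uniform(10, 25) for _ in range(n)]
east = [float(random.random() < 0.5) for _ in range(n)]
service = [0.6 * f + 0.3 * d + random.gauss(0, 1) for f, d in zip(food, decor)]
price = [-20 + 1.5 * f + 1.9 * d + 2 * e + 0.4 * s + random.gauss(0, 3)
         for f, d, e, s in zip(food, decor, east, service)]

# Analogue of part (a): Price on Food, Decor, East.
_, e_price = ols([food, decor, east], price)
# Analogue of part (b): Service on Food, Decor, East.
_, e_service = ols([food, decor, east], service)

# Analogue of part (d): simple regression of one residual set on the other.
(av_intercept, av_slope), _ = ols([e_service], e_price)

# Analogue of part (e): full model, coefficients ordered as
# intercept, Food, Decor, Service, East.
full_beta, _ = ols([food, decor, service, east], price)
service_coef = full_beta[3]

print(f"slope from residual regression: {av_slope:.6f}")
print(f"Service coefficient in full model: {service_coef:.6f}")
```

The two printed values match up to rounding error, and the intercept of the residual regression is essentially zero because both sets of residuals have mean zero.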
Question 5 (20 points) – This question must be completed using R
For this question, you will be using the housing.proper.csv dataset which can be found on the
assignment page on Quercus or attached to this question on Crowdmark. These data consist of the
median value of owner-occupied homes (Y) in suburbs of Boston, along with a number of different
neighbourhood characteristics. It contains 506 observations on 13 covariates. You are asked by a
real estate developer to build the best possible model to predict the median value of homes in a
new subdivision being built, but that is also interpretable so they can justify the use of this model
to shareholders. The possible predictors for this model include:
• X1 = per capita crime rate by town
• X2 = proportion of residential land zoned for lots over 25000 square feet.
• X3 = proportion of non-retail business acres per town
• X4 = Charles River indicator variable (1 = near river, 0 = far from river)
• X5 = nitric oxide concentration (parts per 10 million)
• X6 = average number of rooms per dwelling
• X7 = proportion of owner occupied units built prior to 1940
• X8 = weighted distance to five Boston employment centres
• X9 = index of accessibility to radial highways
• X10 = full-value property-tax rate
• X11 = pupil-teacher ratio by town
• X12 = $1000(B - 0.63)^2$, where B is the proportion of African Americans by town
• X13 = a numeric vector of percentage values of lower status population
You may use any technique shown in class to arrive at your final model, but you must justify every
decision you make. You will be asked to interpret your final model, explain how you arrived at this
model and defend why you think this is the best possible model. You may use up to 5 plots in
your explanations and each plot must have a reason for being presented. Please do not include too
much R output (ideally fewer than 5 outputs) as all your decisions and model diagnostics should
be discussed in the text rather than presented with R output. The discussion of your model should
be no longer than 500 words. All R code should be at the end in an appendix so we can verify
your final model and the steps you took to arrive there. Your report with plots and output should
reasonably be no longer than 3 pages, with the appendix attached after. Do not overload your
appendix with code or output that is not relevant to the creation of your final model.