Survey Sampling
Statistics 4234/5234 — Fall 2018
Take-home final exam
The following problems are due in Room 203 Mathematics between 7:10pm and 8:00pm on
Tuesday, December 18. You can also submit your paper to the course mailbox in Room 904
SSW, any time before 7:00pm on Tuesday, December 18.
You are not to discuss these problems with anyone other than the instructor, nor consult any
published or on-line reference other than Sampling: Design and Analysis, by Sharon L. Lohr.
Please refer to the Homework requirements section of the Course Information document posted
at the beginning of the course. A portion of your score on the final exam will be based on
presentation; any paper that fails to comply with items 2 through 8 of those requirements will
not earn the presentation points.
1. Consider the population of size $N = 50$ given in the file FinalPop.csv, in the “Data”
folder on Courseworks. Verify that the population mean and variance are $\bar{y}_U = 14.882$
and $S^2 = 0.977457$, respectively.
(a) Find the design-based bias and MSE of the sample mean for an SRS of size $n = 10$;
that is, find $E(\bar{y} - \bar{y}_U)$ and $E[(\bar{y} - \bar{y}_U)^2]$.
(b) Consider a stratified random sample of 2 units drawn from each of the 5 strata,
defined as units 1–10, 11–20, 21–30, 31–40, and 41–50. Find the design-based bias
and MSE of $\bar{y}_{str} = \frac{1}{5} \sum_{h=1}^{5} \bar{y}_h$; that is, find
$E(\bar{y}_{str} - \bar{y}_U)$ and $E[(\bar{y}_{str} - \bar{y}_U)^2]$.
Now suppose that the population $Y_1, Y_2, \ldots, Y_N$ is itself a random sample of size $N = 50$
from a normally distributed superpopulation with mean $\mu = 15$ and variance $\sigma^2 = 1$.
(c) Find the model-based bias and MSE of the estimator in part (a); that is, find
$E_M(\bar{y} - \bar{y}_U)$ and $E_M[(\bar{y} - \bar{y}_U)^2]$.
(d) Find the model-based bias and MSE of the estimator in part (b); that is, find
$E_M(\bar{y}_{str} - \bar{y}_U)$ and $E_M[(\bar{y}_{str} - \bar{y}_U)^2]$.
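As a rough aid for the design-based parts of this problem: an SRS mean and an equal-weight stratified mean over equal-sized strata are both design-unbiased, so their design-based MSE equals their variance, which can be computed directly from the full population. The sketch below assumes a numeric vector `y` holding the population values in file order. The helper names `srs_mean_var` and `str_mean_var` and the simulated stand-in population are illustrative assumptions, not part of the exam materials.

```r
# Design-based variance of the SRS sample mean: (1 - n/N) * S^2 / n.
srs_mean_var <- function(y, n) {
  N <- length(y)
  (1 - n / N) * var(y) / n            # var() uses the N - 1 divisor, i.e. S^2
}

# Design-based variance of the stratified mean with weights N_h / N,
# drawing n_h units by SRS within each stratum.
str_mean_var <- function(y, stratum, n_h) {
  N <- length(y)
  N_h <- tapply(y, stratum, length)   # stratum sizes
  S2_h <- tapply(y, stratum, var)     # within-stratum variances
  sum((N_h / N)^2 * (1 - n_h / N_h) * S2_h / n_h)
}

# Hypothetical stand-in population (NOT the FinalPop.csv data).
set.seed(1)
y <- rnorm(50, mean = 15, sd = 1)
stratum <- rep(1:5, each = 10)        # units 1-10, 11-20, ..., 41-50

srs_mean_var(y, n = 10)
str_mean_var(y, stratum, n_h = 2)
```

With the actual population read into `y`, the same two calls would apply unchanged.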
2. The data for this problem are in the SDaA package. Enter
library(SDaA)
Data <- counties[,c(2,3,5,17)]
and Data will contain total population and number of veterans for a random sample of - of
the 3141 counties in the United States. The total population at the time of this
data set was 255,077,536.
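For orientation, ratio estimation of a total from an SRS can be sketched as follows. This is a minimal sketch, not a prescribed solution: the function name `ratio_total`, its argument names, and the simulated sample are assumptions made for illustration, and the column names of `Data` are deliberately not relied on. It uses the usual linearization standard error with the sample mean of x in the denominator.

```r
# Ratio estimate of a population total from an SRS, with its standard error.
# x, y: sample values; N: population size; t_x: known population total of x.
ratio_total <- function(x, y, N, t_x) {
  n <- length(y)
  B_hat <- mean(y) / mean(x)               # estimated ratio
  e <- y - B_hat * x                       # residuals about the ratio line
  s2_e <- sum(e^2) / (n - 1)
  se <- (t_x / mean(x)) * sqrt((1 - n / N) * s2_e / n)
  list(total = B_hat * t_x, se = se)
}

# Hypothetical sample (NOT the counties data): y roughly proportional to x.
set.seed(2)
x <- runif(30, min = 5000, max = 160000)
y <- 0.1 * x + rnorm(30, sd = 2000)
est <- ratio_total(x, y, N = 3141, t_x = 255077536)
est$total + c(-1, 1) * qnorm(0.975) * est$se   # approximate 95% interval
```

The regression estimator differs in that it fits an intercept as well; it is not shown here.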
(a) Using ratio estimation, find an approximate 95% confidence interval for the total
number of veterans in the United States. Report your answer in millions of veterans,
rounded to the nearest 10,000.
(b) Using regression estimation, find an approximate 95% confidence interval for the total
number of veterans in the United States. Report your answer in millions of veterans,
rounded to the nearest 10,000.
(c) Assume the population values are themselves a random sample from a superpopulation
in which
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
where $E(\varepsilon_i) = 0$ and $V(\varepsilon_i) = \sigma^2$. Find the model-based standard error of the estimate
you computed for part (b). How does it compare to the design-based SE?
(d) Assume the population values are themselves a random sample from a superpopulation
in which
$Y_i = \beta_1 x_i + \varepsilon_i$
where $E(\varepsilon_i) = 0$ and $V(\varepsilon_i) = \sigma^2 x_i$. Find the model-based standard error of the
estimate you computed for part (a). How does it compare to the design-based SE?
3. A fisherman is interested in N, the number of fish in a certain pond. He catches 100 fish,
tags them, and throws them back. A few days later, he returns and catches 80 fish, of
which 18 are tagged.
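The tagging setup just described is a two-sample capture-recapture design. A minimal sketch of the Lincoln-Petersen estimate and a delta-method standard error follows. The function name `lincoln_petersen` is an assumption for illustration, and the unrounded ratio is used here even though the exact hypergeometric maximum likelihood estimate is an integer.

```r
# Lincoln-Petersen estimate of population size and an approximate SE.
# n1: number tagged in the first catch; n2: size of the second catch;
# m: number of tagged fish found in the second catch.
lincoln_petersen <- function(n1, n2, m) {
  N_hat <- n1 * n2 / m
  # Delta-method approximation: Var(N_hat) ~ n1^2 * n2 * (n2 - m) / m^3.
  se <- sqrt(n1^2 * n2 * (n2 - m) / m^3)
  list(N_hat = N_hat, se = se)
}

lincoln_petersen(n1 = 100, n2 = 80, m = 18)
```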
(a) Find the maximum likelihood estimate of N, along with its standard error.
(b) Explain in plain English what your answer to part (a) means, particularly the standard
error. The fisherman does not know or care about maximum likelihood theory,
or sampling distributions, or any such things — he just wants to know how many
fish are in the pond. Help him.
(c) Find an approximate 90% confidence interval for N by inverting the acceptance region
of a level .10 Pearson’s chi-square test for independence between the binary variables
In first day’s catch and In second day’s catch.
4. The file statepop.csv, available in the “Data” folder on Courseworks, lists the 1992
population for the 50 states plus the District of Columbia. The file counties.csv contains
the number of counties for a sample of size 12 with replacement, with probabilities
proportional to population.
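For a with-replacement sample drawn with unequal probabilities, the standard estimator averages the per-draw estimates t_i / psi_i. The sketch below is illustrative only: the function name `pps_wr_total`, the symbol `psi` for the draw probabilities, and the example numbers are assumptions, not values from statepop.csv or counties.csv.

```r
# Estimate of a population total from a with-replacement sample drawn with
# unequal probabilities. t: observed totals, one entry per draw (a unit drawn
# twice appears twice); psi: the matching draw probabilities.
pps_wr_total <- function(t, psi) {
  n <- length(t)
  u <- t / psi                      # each draw's own estimate of the total
  list(total = mean(u), se = sqrt(var(u) / n))
}

# Hypothetical draws (NOT the exam data); the first unit was drawn twice.
t_draw <- c(10, 10, 40, 25)
psi_draw <- c(0.05, 0.05, 0.20, 0.10)
pps_wr_total(t_draw, psi_draw)
```

Keeping one entry per draw, rather than one per distinct unit, is what lets `var(u) / n` serve as the variance estimate.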
(a) Estimate the total number of counties in the United States, and find the standard
error of your estimate.
(b) With California being sampled three times and New Jersey twice, there were nine
distinct states in the sample. Writing your estimate in part (a) as $\hat{t} = \sum_{i \in R} w_i Q_i t_i$
with R = {CA, CO, CT, MA, MO, NJ, TN, VA, WI} and $t_i$ = number of counties
in state i, find $w_i Q_i$ for each of those nine states. What is $\sum_{i \in R} w_i Q_i$ for this sample?
What is its expected value over repeated random sampling?
5. Investigators selected a random sample of 200 teenagers from a population of 2000 for a
survey of screen time on smartphones and other handheld devices; the overall response rate
was 75%. A follow-up sample was taken of 10 of the 50 nonrespondents, with responses
obtained from all 10.
In the data file ScreenTime.csv, the variable Group takes the value 1 for respondents, 2
for nonrespondents included in the follow-up survey, and 3 for nonrespondents not in the
follow-up sample; the variable Minutes gives that individual’s average daily screen time
in minutes.
Give an approximate 95% confidence interval for the average screen time per day among
these 2000 teenagers.
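A minimal sketch of the point estimate only, treating respondents and nonrespondents as two phase-one strata and using the follow-up sample to represent all nonrespondents. The function name `two_phase_mean` and the toy vectors are assumptions for illustration; the variance needed for the confidence interval is not shown.

```r
# Two-phase (callback) estimate of the mean: weight the respondent mean and
# the follow-up mean by the respondent and nonrespondent shares of the sample.
# group: 1 = respondent, 2 = followed-up nonrespondent, 3 = other nonrespondent.
two_phase_mean <- function(group, minutes) {
  n_R <- sum(group == 1)                 # respondents
  n_M <- sum(group %in% c(2, 3))         # all nonrespondents
  ybar_R <- mean(minutes[group == 1])
  ybar_M <- mean(minutes[group == 2])    # follow-up subsample only
  (n_R * ybar_R + n_M * ybar_M) / (n_R + n_M)
}

# Hypothetical toy data (NOT ScreenTime.csv); group 3 has no measurements.
group <- c(1, 1, 1, 2, 3, 3)
minutes <- c(100, 120, 140, 200, NA, NA)
two_phase_mean(group, minutes)
```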