On Data Mining: STAT-440 Statistical Analysis

STAT 440 – Spring 2019 – Midterm Project
Recall that you may use your notes, books, or even the internet to help answer these questions, but all of the
work should be your own, and you should not ask anyone for help or about any details related to the class
and project during this 60-hour period (this includes face-to-face interactions, emails, internet forums, etc.).
You have half of Wednesday and all of Thursday and Friday to work on this project. You are to turn it in by
midnight on Friday. You will be graded on the accuracy of your answers, the efficiency of your coding, and
the organization/clarity of your write-up. You should have plenty of writing to convey what you are doing as
well as comments in your code to explain and make it easier to read. Show all of your work.
You should turn in an RMarkdown file as well as the output through Canvas. If you have any clarifying
questions or notice any issues, don’t hesitate to contact me and Nick. Note that in some of these questions, it
is up to you to select certain things; this is done on purpose so choose wisely and explain your decisions.

Consider the following density
$f(x) = C x^3 e^{-x^4}$
for $x \ge 0$.
a. Find the CDF and determine the normalizing constant C.
b. Use the inverse CDF method to simulate 100,000 draws from f. Plot the histogram of your sample.
Create a second plot where you zoom in on the x-axis and plot a kernel density estimate as well as
the true density. Comment on the results.
c. Suppose you want to use the normal density with mean 0 and variance $\sigma^2$, call it $g(x \mid \sigma^2)$, to
produce samples from f. Find a constant M such that
$f(x) / g(x \mid \sigma^2) \le M$
for all x. Note this constant can/should depend on $\sigma^2$. Feel free to either do this analytically or
do this numerically for a few different values of $\sigma^2$. Try to find a $\sigma^2$ that produces a small value
of M. Provide a few plots to justify your choice (or show the mathematics if you can).
d. Using your choice of $\sigma^2$ from above, produce a sample of size 10,000 from f using the accept/reject
method. Produce a histogram and use the sample to estimate the mean of f and produce a
standard error of your estimate.
e. Using the same $\sigma$ and g, use importance sampling (sample size 10,000) to again estimate the
mean of f and produce a standard error for your estimate. Compare with what you saw in (d).
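
The three sampling ideas in parts (b), (d), and (e) can be sketched as follows. The assignment expects R/RMarkdown; this is only an illustrative Python sketch using the standard library. It assumes the density reads $f(x) = C x^3 e^{-x^4}$, in which case $F(x) = 1 - e^{-x^4}$ and $C = 4$. The value `sigma = 1.0`, the seed, and all function names are hypothetical choices for illustration, not a recommended answer.

```python
import math
import random
import statistics

def f(x):
    """Assumed target density: 4 x^3 exp(-x^4) for x >= 0, and 0 otherwise."""
    return 4.0 * x**3 * math.exp(-x**4) if x >= 0 else 0.0

def g(x, sigma):
    """Normal(0, sigma^2) proposal density."""
    return math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def inverse_cdf_sample(n, rng):
    # F(x) = 1 - exp(-x^4), so F^{-1}(u) = (-log(1 - u))^(1/4)
    return [(-math.log(1.0 - rng.random())) ** 0.25 for _ in range(n)]

def bound_M(sigma):
    # sup over x > 0 of f(x)/g(x|sigma). The log-ratio
    # 3 log x - x^4 + x^2/(2 sigma^2) + const has a single stationary point,
    # the positive root of 4 x^4 - x^2/sigma^2 - 3 = 0, which is the maximum.
    s2 = sigma**2
    x_star = math.sqrt((1 / s2 + math.sqrt(1 / s2**2 + 48)) / 8)
    return f(x_star) / g(x_star, sigma)

def accept_reject(n, sigma, rng):
    # Propose from g; accept x with probability f(x) / (M g(x)).
    # Negative proposals are always rejected because f is 0 there.
    M = bound_M(sigma)
    out = []
    while len(out) < n:
        x = rng.gauss(0.0, sigma)
        if x > 0 and rng.random() * M * g(x, sigma) <= f(x):
            out.append(x)
    return out

def mean_and_se(xs):
    # Sample mean and its standard error
    return statistics.fmean(xs), statistics.stdev(xs) / math.sqrt(len(xs))

def importance_mean(n, sigma, rng):
    # E_f[X] = E_g[X f(X)/g(X)]; every proposal is kept and reweighted
    terms = []
    for _ in range(n):
        x = rng.gauss(0.0, sigma)
        terms.append(x * f(x) / g(x, sigma))
    return mean_and_se(terms)

rng = random.Random(440)  # arbitrary seed
sigma = 1.0               # illustrative choice only
print("M:", bound_M(sigma))
print("inverse CDF:", mean_and_se(inverse_cdf_sample(10_000, rng)))
print("accept/reject:", mean_and_se(accept_reject(10_000, sigma, rng)))
print("importance sampling:", importance_mean(10_000, sigma, rng))
```

Under the assumed density the exact mean is $\Gamma(5/4)$, which gives a reference point when comparing the estimates and their standard errors.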

The file Szeged_Weather_Summary.csv contains monthly averages for different weather metrics in
Szeged, Hungary.
a. The variable WindBearing denotes the direction from which the wind originates. The units are
degrees, with 0 denoting due north, 90 due east, 180 due south, and 270 due west. All of the winds
come from either the southeast or the southwest. Create a new variable, Direction, which indicates whether
the direction is southeast (<= 180) or southwest (> 180). Construct boxplots of Temperature vs.
Direction.
b. Use a permutation test to determine if Direction is associated with Temperature (measured in
Celsius).
c. Use a bootstrap method to construct a 95% confidence interval for the effect of Direction on
Temperature (response variable here is Temperature). Do both a parametric and nonparametric
bootstrap. Compare the results.
d. Pick another variable of your choice to associate with Temperature while also including Direction as
another predictor. Explain why you think this variable is either important or interesting (to you).
Fit a linear regression model with the two predictors and use a bootstrap method to construct
95% confidence intervals for the coefficients of the two predictors. Interpret your results in the context of the problem.
e. Suppose we wish to compare Temperature and ApparentTemp, as we suspect they may be quite
similar. Let $\mu_1$ be the true mean of Temperature and $\mu_2$ be the true mean of ApparentTemp. Use
the nonparametric bootstrap to test $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \ne \mu_2$. Use a 5% significance level.
Perform similar two-tailed tests using nonparametric bootstrap for the median and IQR. For each
test, be sure to include the test statistic, p-value, and a proper conclusion.
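
The resampling ideas in parts (b) and (c) can be sketched as follows. The assignment expects R/RMarkdown; this is only an illustrative Python sketch using the standard library. The two temperature lists are hypothetical placeholder values, not data from Szeged_Weather_Summary.csv, and the function names, seed, and resample counts are assumptions. The sketch shows a two-sided permutation test and a nonparametric percentile bootstrap interval for a difference in group means; it does not cover the parametric bootstrap or the regression in part (d).

```python
import random
import statistics

def mean_diff(a, b):
    # Test statistic: difference in group means
    return statistics.fmean(a) - statistics.fmean(b)

def permutation_test(a, b, n_perm, rng):
    """Two-sided permutation p-value for a difference in group means."""
    observed = mean_diff(a, b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the group memberships
        stat = mean_diff(pooled[:len(a)], pooled[len(a):])
        if abs(stat) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (extreme + 1) / (n_perm + 1)

def bootstrap_ci(a, b, n_boot, rng, level=0.95):
    """Percentile bootstrap interval, resampling with replacement within each group."""
    stats = sorted(
        mean_diff(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Hypothetical monthly temperatures (Celsius) by Direction; placeholders only
southeast = [2.1, 5.4, 11.0, 16.3, 20.5, 22.8, 21.9, 17.2]
southwest = [-0.5, 1.2, 6.8, 12.4, 4.3, 9.9]

rng = random.Random(440)  # arbitrary seed
obs, p_value = permutation_test(southeast, southwest, 5000, rng)
print("observed difference:", obs, "permutation p-value:", p_value)
print("95% bootstrap CI:", bootstrap_ci(southeast, southwest, 5000, rng))
```

The same `bootstrap_ci` pattern extends to the median or IQR by swapping the statistic inside `mean_diff`.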
