On Data Mining: STAT-440 Statistical Analysis


STAT 440 – Spring 2019 – Midterm Project
Recall that you may use your notes, books, or even the internet to help answer these questions, but all of the
work should be your own and you should not ask anyone for help or about any details related to the class
and project during this 60-hour period (this includes face-to-face interactions, emails, internet forums, etc.).
You have half of Wednesday and all of Thursday and Friday to work on this project. You are to turn it in by
midnight on Friday. You will be graded on the accuracy of your answers, the efficiency of your coding, and
the organization/clarity of your write-up. You should have plenty of writing to convey what you are doing as
well as comments in your code to explain and make it easier to read. Show all of your work.
You should turn in an RMarkdown file as well as the output through Canvas. If you have any clarifying
questions or notice any issues, don’t hesitate to contact me or Nick. Note that in some of these questions, it
is up to you to select certain things; this is done on purpose so choose wisely and explain your decisions.

  1. Consider the following density
    f(x) = C x^3 e^(-x^4)
    for x ≥ 0.
    a. Find the CDF and determine the normalizing constant C.
    b. Use the inverse CDF method to simulate 100,000 draws from f. Plot the histogram of your sample.
    Create a second plot where you zoom in on the x-axis and plot a kernel density estimate as well as
    the true density. Comment on the results.
    c. Suppose you want to use the normal density with mean 0 and variance σ^2, call it g(x|σ^2), to
    produce samples from f. Find a constant such that
    f(x) / g(x|σ^2) ≤ M,
    for all x. Note this constant can/should depend on σ^2. Feel free to either do this analytically or
    do this numerically for a few different values of σ^2. Try to find a σ^2, which produces a small value
    of M. Provide a few plots to justify your choice (or show the mathematics if you can).
    d. Using your choice of σ^2 from above, produce a sample of size 10,000 from f using the accept/reject
    method. Produce a histogram and use the sample to estimate the mean of f and produce a
    standard error of your estimate.
    e. Using the same σ^2 and g, use importance sampling (sample size 10,000) to again estimate the
    mean of f and produce a standard error for your estimate. Compare with what you saw in (d).
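
The sampling steps in problem 1 can be sketched as follows. This is only an illustrative sketch in Python (the assignment itself asks for R), and it assumes the density is read as f(x) = C x^3 e^(-x^4), which gives C = 4 and F(x) = 1 - e^(-x^4). The choice σ = 1, the grid-based bound on f/g, the seed, and all function names are assumptions made for illustration and are not part of the assignment.

```python
import math
import random
import statistics

random.seed(440)  # arbitrary seed, used only for reproducibility


def f(x):
    """Assumed target density 4 x^3 exp(-x^4) on x >= 0, zero elsewhere."""
    return 4.0 * x**3 * math.exp(-(x**4)) if x >= 0 else 0.0


def g(x, sigma):
    """Normal(0, sigma^2) proposal density."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def inverse_cdf_draw():
    """Invert F(x) = 1 - exp(-x^4), giving x = (-log(1 - u))^(1/4)."""
    u = random.random()
    return (-math.log(1.0 - u)) ** 0.25


def envelope_constant(sigma, upper=5.0, steps=5000):
    """Grid approximation to sup f/g on (0, upper], padded by 0.1%."""
    grid = (i * upper / steps for i in range(1, steps + 1))
    return 1.001 * max(f(x) / g(x, sigma) for x in grid)


def accept_reject(n, sigma, M):
    """Draw n values from f using Normal(0, sigma^2) proposals."""
    accepted = []
    while len(accepted) < n:
        x = random.gauss(0.0, sigma)
        # Negative proposals are always rejected because f is zero there.
        if x > 0 and random.random() * M * g(x, sigma) <= f(x):
            accepted.append(x)
    return accepted


def mean_and_se(values):
    """Sample mean and its standard error."""
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(len(values))


def importance_sampling(n, sigma):
    """Estimate E_f[X] as the average of x * f(x) / g(x) over proposal draws."""
    terms = []
    for _ in range(n):
        x = random.gauss(0.0, sigma)
        terms.append(x * f(x) / g(x, sigma))
    return mean_and_se(terms)


sigma = 1.0  # illustrative choice, not claimed to minimise M
M = envelope_constant(sigma)
icdf_sample = [inverse_cdf_draw() for _ in range(100_000)]
ar_sample = accept_reject(10_000, sigma, M)
ar_mean, ar_se = mean_and_se(ar_sample)
is_mean, is_se = importance_sampling(10_000, sigma)

print(f"M for sigma = {sigma}: {M:.3f}")
print(f"inverse-CDF sample mean: {statistics.fmean(icdf_sample):.4f}")
print(f"accept/reject mean: {ar_mean:.4f} (SE {ar_se:.4f})")
print(f"importance sampling mean: {is_mean:.4f} (SE {is_se:.4f})")
```

Under the assumed density the exact mean is Γ(5/4), roughly 0.906, which gives a reference value for both estimates. Repeating `envelope_constant` over several values of `sigma` is one way to compare candidate proposals numerically.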
  2. The file Szeged_Weather_Summary.csv contains monthly averages for different weather metrics in
    Szeged, Hungary.
    a. The variable WindBearing denotes the direction from which the wind originates. The units are
    degrees, with 0 denoting due north, 90 due east, 180 due south, and 270 due west. All of the winds
    come from either the southeast or the southwest. Create a new variable, Direction, which indicates
    whether the direction is southeast (<=180) or southwest (>180). Construct boxplots of Temperature
    versus Direction.
    b. Use a permutation test to determine if Direction is associated with Temperature (measured in
    Celsius).
    c. Use a bootstrap method to construct a 95% confidence interval for the effect of Direction on
    Temperature (response variable here is Temperature). Do both a parametric and nonparametric
    bootstrap. Compare the results.
    d. Pick another variable of your choice to associate with Temperature while also including Direction as
    another predictor. Explain why you think this variable is either important or interesting (to you).
    Fit a linear regression model with the two predictors and use a bootstrap method to construct
    95% confidence intervals for the coefficients of the two predictors. Interpret your results in the
    context of the problem.
    e. Suppose we wish to compare Temperature and ApparentTemp, as we suspect they may be quite
    similar. Let μ1 be the true mean of Temperature and μ2 be the true mean of ApparentTemp. Use
    the nonparametric bootstrap to test H0 : μ1 = μ2 versus H1 : μ1 ≠ μ2. Use a 5% significance level.
    Perform similar two-tailed tests using nonparametric bootstrap for the median and IQR. For each
    test, be sure to include the test statistic, p-value, and a proper conclusion.
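
The permutation test and nonparametric bootstrap in problem 2 can be sketched on made-up data. This is a hedged illustration in Python (the assignment asks for R): the numbers below are synthetic and are NOT the Szeged data, the difference in group means is only one possible test statistic, and the percentile interval is only one of several bootstrap intervals. All function and variable names are hypothetical.

```python
import random
import statistics

random.seed(440)  # arbitrary seed, used only for reproducibility

# Synthetic stand-in for the CSV columns WindBearing and Temperature.
wind_bearing = [random.uniform(120.0, 240.0) for _ in range(96)]
temperature = [
    12.0 + (3.0 if b > 180 else 0.0) + random.gauss(0.0, 4.0) for b in wind_bearing
]
direction = ["southeast" if b <= 180 else "southwest" for b in wind_bearing]


def mean_diff(temps, labels):
    """Mean temperature for southwest minus mean temperature for southeast."""
    sw = [t for t, d in zip(temps, labels) if d == "southwest"]
    se = [t for t, d in zip(temps, labels) if d == "southeast"]
    return statistics.fmean(sw) - statistics.fmean(se)


def permutation_p_value(temps, labels, reps=2000):
    """Two-sided p-value obtained by shuffling the Direction labels."""
    observed_abs = abs(mean_diff(temps, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(reps):
        random.shuffle(shuffled)
        if abs(mean_diff(temps, shuffled)) >= observed_abs:
            hits += 1
    # Adding 1 to both counts keeps the p-value strictly positive.
    return (hits + 1) / (reps + 1)


def bootstrap_percentile_ci(temps, labels, reps=2000, level=0.95):
    """Percentile interval from resampling (temperature, direction) pairs."""
    pairs = list(zip(temps, labels))
    stats = []
    while len(stats) < reps:
        resample = random.choices(pairs, k=len(pairs))
        t, d = zip(*resample)
        if len(set(d)) == 2:  # skip resamples that lost one of the groups
            stats.append(mean_diff(t, d))
    stats.sort()
    low = stats[int((1 - level) / 2 * reps)]
    high = stats[int((1 + level) / 2 * reps) - 1]
    return low, high


observed = mean_diff(temperature, direction)
p_value = permutation_p_value(temperature, direction)
ci_low, ci_high = bootstrap_percentile_ci(temperature, direction)

print(f"observed difference in means: {observed:.2f}")
print(f"permutation p-value: {p_value:.4f}")
print(f"95% bootstrap percentile interval: ({ci_low:.2f}, {ci_high:.2f})")
```

Resampling whole (temperature, direction) pairs keeps each observation's group label attached to its value. A parametric bootstrap would instead simulate new temperatures from a fitted model.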