About backend: STA314 explained


STA314 Fall 2021 Homework 4
Homework 4
Deadline: Monday, Nov. 29, at 11:59pm.
Submission: You need to submit one file through Quercus with your answers to Questions 1, 2,
and 3, as well as the code requested for Question 3. You can produce the PDF file however you like
(e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.
Neatness Point: One point will be given for neatness. You will receive this point as long as we
don’t have a hard time reading your solutions or understanding the structure of your code.
Late Submission: 10% of the total possible marks will be deducted for each day late, up to a
maximum of 3 days. After that, no submissions will be accepted.
Computing: To install Python and required libraries, see the instructions on the course web page.
Homeworks are to be done alone or in pairs. See the Course Information handout
(https://www.cs.toronto.edu/~c…) for detailed policies.

  1. [6pts] Categorical Distribution. In this problem you will consider a Bayesian approach to
    modelling categorical outcomes. Let's consider fitting the categorical distribution, which is
    a discrete distribution over $K$ outcomes, numbered 1 through $K$. The probability
    of each category is explicitly represented with a parameter $\theta_k$. For this to be a valid probability
    distribution, we clearly need $\theta_k \geq 0$ and $\sum_k \theta_k = 1$. We'll represent each observation $\mathbf{x}$ as
    a 1-of-$K$ encoding, i.e., a vector where one of the entries is 1 and the rest are 0. Under this
    model, the probability of an observation can be written in the following form:

    $$p(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}.$$
    Suppose you observe a dataset $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^{N}$.
    Denote the count for outcome $k$ as $N_k = \sum_{i=1}^{N} x_k^{(i)}$, and let $N = \sum_{k=1}^{K} N_k$. Recall that each data
    point is in the 1-of-$K$ encoding, i.e., $x_k^{(i)} = 1$ if the $i$th data point represents outcome $k$,
    and $x_k^{(i)} = 0$ otherwise.
    (a) [2pts] First, derive $\hat{\theta}_k$, the maximum likelihood estimator (MLE) for the class
    probabilities $\theta_k$. You may assume that $N_k > 0$ for this question. Derivations should be
    rigorous. (An optional numerical sanity check of this estimator is sketched after this question.)
    Hint 1: We saw in lecture that the MLE can be thought of as a 'ratio of counts' for the data,
    so what should $\hat{\theta}_k$ be counting?
    Hint 2: Similar to the binary case, write $p(\mathbf{x}^{(i)} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k^{(i)}}$. The challenge in
    maximizing the log-likelihood is that this is a constrained maximization problem under
    the constraint $\sum_{k=1}^{K} \theta_k = 1$. To overcome this, we will use a generalization of the
    idea from the coin-flip example in lecture: we can turn the MLE maximization problem
    into an easier constrained problem by setting $\theta_K = 1 - \sum_{k=1}^{K-1} \theta_k$. Note that this is a
    maximization problem over the set $\{(\theta_k)_{k \neq K} \mid \theta_k > 0, \sum_{k \neq K} \theta_k < 1\}$. Although this is
    still constrained, the log-likelihood is a concave function that goes to $-\infty$ as we approach
    the boundary of the constraint set, so the function attains its maximum in the interior
    of the region, where $\partial \log p(\mathcal{D} \mid \boldsymbol{\theta}) / \partial \theta_k = 0$.
    (b) [2pts] Now we will take a Bayesian approach. For the prior, we'll use the Dirichlet
    distribution, which is defined over the set of probability vectors (i.e., vectors that are
    nonnegative and whose entries sum to 1). Its PDF is as follows:

    $$p(\boldsymbol{\theta}) \propto \theta_1^{a_1 - 1} \cdots \theta_K^{a_K - 1}.$$

    What is the posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$? Don't just
    give the density of the posterior; say which family of distributions it belongs to.
    (c) [1pt] Still assuming the Dirichlet prior distribution, determine the MAP estimate of the
    parameter vector $\boldsymbol{\theta}$. For this question, you may assume each $a_k > 1$.
    (d) [1pt] Now suppose that your friend said they had a hidden $(N+1)$st outcome,
    $\mathbf{x}^{(N+1)}$, drawn from the same distribution as the previous $N$ outcomes. Your friend does
    not want to reveal the value of $\mathbf{x}^{(N+1)}$ to you, so you want to use your Bayesian model
    to predict what you think $\mathbf{x}^{(N+1)}$ is likely to be. The "proper" Bayesian predictor is the
    so-called posterior predictive distribution:

    $$p(\mathbf{x}^{(N+1)} \mid \mathcal{D}) = \int p(\mathbf{x}^{(N+1)} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathcal{D}) \, d\boldsymbol{\theta}$$

    What is the probability that the $(N+1)$st outcome was $k$, i.e., the probability that $x_k^{(N+1)} = 1$,
    under your posterior predictive distribution? Hint: A useful fact is that if
    $\boldsymbol{\theta} \sim \mathrm{Dirichlet}(a_1, \ldots, a_K)$, then

    $$\mathbb{E}[\theta_k] = \frac{a_k}{\sum_{k'} a_{k'}}.$$

    Report your answers to the above questions.
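The following optional sketch is not part of the handout: it is a minimal numerical sanity check, assuming only NumPy, of the "ratio of counts" idea from Hint 1 and of the Dirichlet posterior update asked about in part (b). All names in it (e.g. `log_likelihood`) are illustrative and do not come from the course starter code.

```python
# Hypothetical sanity check for Question 1 (not part of the handout).
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 1000
theta_true = np.array([0.2, 0.3, 0.5])

# Draw N observations and convert them to 1-of-K encodings.
outcomes = rng.choice(K, size=N, p=theta_true)
X = np.eye(K)[outcomes]            # shape (N, K), rows are one-hot

counts = X.sum(axis=0)             # N_k for each k
theta_ratio = counts / N           # the 'ratio of counts' candidate from Hint 1

def log_likelihood(theta):
    # log p(D | theta) = sum_k N_k log theta_k for 1-of-K data
    return np.sum(counts * np.log(theta))

# The count-ratio estimate should score at least as high as any random
# point on the probability simplex.
candidates = rng.dirichlet(np.ones(K), size=10000)
best_random = max(log_likelihood(t) for t in candidates)
assert log_likelihood(theta_ratio) >= best_random

# Dirichlet prior/posterior check (part (b)): with a Dirichlet(a) prior,
# the posterior parameters simply add the observed counts.
a = np.array([2.0, 2.0, 2.0])
posterior_a = a + counts
print("theta_ratio:   ", theta_ratio)
print("posterior mean:", posterior_a / posterior_a.sum())   # E[theta_k | D]
```

Passing the assertion on simulated data is consistent with, though of course not a proof of, the estimator you are asked to derive rigorously in part (a).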
  2. [5pts] Gaussian Naïve Bayes. In this question, you will derive the maximum likelihood
    estimates for Gaussian Naïve Bayes, which is just like the naïve Bayes model from lecture,
    except that the features are continuous, and the conditional distribution of each feature given
    the class is (univariate) Gaussian rather than Bernoulli. Start with the following generative
    model for a random discrete class label $t \in \{1, 2, \ldots, K\}$ and a random real-valued vector of
    $D$ features $\mathbf{x} \in \mathbb{R}^D$:

    $$p(t = k) = \alpha_k \qquad (0.1)$$

    $$p(\mathbf{x} \mid t = k) = \left( \prod_{d=1}^{D} 2\pi\sigma_d^2 \right)^{-1/2} \exp\left( -\sum_{d=1}^{D} \frac{1}{2\sigma_d^2} (x_d - \mu_{kd})^2 \right) \qquad (0.2)$$

    where $\alpha_k \geq 0$ is the prior on class $k$, $\sigma_d^2 > 0$ are the variances for each feature, which are
    shared between all classes, and $\mu_{kd} \in \mathbb{R}$ is the mean of feature $d$ conditioned on class $k$.
    We write $\boldsymbol{\alpha}$ for the vector with elements $\alpha_k$, and similarly $\boldsymbol{\sigma}$ is the vector of variances.
    The matrix of class means is written $\boldsymbol{\mu}$, where the $k$th row of $\boldsymbol{\mu}$ is the mean for class $k$.
    (a) [1pt] Use Bayes' rule to derive an expression for $p(t = k \mid \mathbf{x})$. Hint: Use the law of total
    probability to derive an expression for $p(\mathbf{x})$.
    (b) [1pt] Write down an expression for the log-likelihood

    $$\ell(\boldsymbol{\theta}) = \log p(t^{(1)}, \mathbf{x}^{(1)}, t^{(2)}, \mathbf{x}^{(2)}, \cdots, t^{(N)}, \mathbf{x}^{(N)}) \qquad (0.3)$$

    of a particular dataset $\mathcal{D} = \{(t^{(1)}, \mathbf{x}^{(1)}), (t^{(2)}, \mathbf{x}^{(2)}), \cdots, (t^{(N)}, \mathbf{x}^{(N)})\}$ with parameters
    $\boldsymbol{\theta} = \{\boldsymbol{\alpha}, \boldsymbol{\mu}, \boldsymbol{\sigma}\}$ under this model. (Assume the data are i.i.d.) A numerical
    illustration of this objective is sketched after this question.
    (c) [3pts] Take partial derivatives of the likelihood with respect to each of the parameters
    $\mu_{kd}$ and with respect to the shared variances $\sigma_d^2$. Report these partial derivatives and
    find the maximum likelihood estimates for $\mu_{kd}$ and $\sigma_d^2$. You may assume that each class
    appears at least once in the dataset, i.e., the number of times $N_k$ that class $k$ appears
    in the dataset satisfies $N_k > 0$.
    Report your answers to the above questions.
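To make the objective in (0.3) concrete, here is a hedged, minimal sketch (again not part of the handout) that evaluates the joint log-likelihood of a toy dataset under given parameters. The name `gnb_log_likelihood` is hypothetical, and only NumPy is assumed:

```python
# Hypothetical illustration of the log-likelihood (0.3) for the Gaussian
# Naive Bayes model above; not part of the official starter code.
import numpy as np

def gnb_log_likelihood(X, t, alpha, mu, sigma2):
    """Joint log-likelihood sum_i [log alpha_{t_i} + log p(x_i | t_i)].

    X:      (N, D) feature matrix
    t:      (N,) integer class labels in {0, ..., K-1}
    alpha:  (K,) class priors
    mu:     (K, D) class means
    sigma2: (D,) shared per-feature variances
    """
    diff = X - mu[t]   # class-specific mean row selected per example, (N, D)
    # log p(x_i | t_i) under equation (0.2)
    log_cond = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + diff**2 / sigma2, axis=1)
    return np.sum(np.log(alpha[t]) + log_cond)

# Tiny toy check: the per-class averages should score at least as high as a
# perturbed set of means, since the objective is quadratic in each mu_kd.
rng = np.random.default_rng(0)
K, D, N = 2, 3, 500
t = rng.integers(0, K, size=N)
X = rng.normal(loc=t[:, None], scale=1.0, size=(N, D))

alpha = np.bincount(t, minlength=K) / N
mu_hat = np.stack([X[t == k].mean(axis=0) for k in range(K)])
sigma2_hat = np.mean((X - mu_hat[t]) ** 2, axis=0)   # shared across classes

ll_hat = gnb_log_likelihood(X, t, alpha, mu_hat, sigma2_hat)
ll_perturbed = gnb_log_likelihood(X, t, alpha, mu_hat + 0.1, sigma2_hat)
assert ll_hat >= ll_perturbed
print("log-likelihood at per-class means:", ll_hat)
```

The comparison only probes one direction of perturbation; the derivation requested in part (c) is what actually establishes the maximizers.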
  3. [6pts] Gaussian Discriminant Analysis. For this question you will build classifiers to
    label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector
    of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale
    and the pixel values are between 0 and 1. The labels $y$ are $0, 1, 2, \ldots, 9$, corresponding to
    which character was written in the image. There are 700 training cases and 400 test cases for
    each digit; they can be found in the .txt files provided.
    A skeleton (q3.py) is provided for each question that you should use to structure your code.
    Starter code to help you load the data is provided (data.py). Note: the get_digits_by_label
    function in data.py returns the subset of digits that belong to a given class.
    Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full
    covariance matrix for each class. Remember that the conditional multivariate Gaussian probability
    density is given by
    $$p(\mathbf{x} \mid t = k) = (2\pi)^{-D/2} \, |\Sigma_k|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right) \qquad (0.4)$$

    where $\boldsymbol{\mu}_k \in \mathbb{R}^D$ and $\Sigma_k \in \mathbb{R}^{D \times D}$ is positive definite. You should take $p(t = k) = \frac{1}{10}$. You
    will compute parameters $\mu_{kj}$ and $\Sigma_k$ for $k \in \{0, \ldots, 9\}$, $j \in \{1, \ldots, 64\}$. You should implement the
    covariance computation yourself (i.e., without the aid of np.cov). Hint: To ensure numerical
    stability you may have to add a small multiple of the identity to each covariance matrix. For
    this assignment you should add $0.01 I$ to each covariance matrix.
    (a) [5pts] Complete the 5 functions that are not complete in the starter code ([1pt] each).
    Include the function body for each completed function in your report.
    (b) [1pt] Report the average conditional log-likelihood, i.e., $\frac{1}{N} \sum_{i=1}^{N} \log p(t^{(i)} \mid \mathbf{x}^{(i)})$, of the
    trained model on both the train and test set. Report the accuracy, on both the train and
    test set, of the classifier that selects the most likely posterior class for each data point
    using the trained model. (A hedged sketch of these fitting and evaluation steps follows below.)
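For orientation only, here is a self-contained sketch of the pipeline described above. It deliberately does not use the official q3.py / data.py interfaces (whose function names are not reproduced here), synthetic data stands in for the digit files, and all function names below are hypothetical. Your submission must follow the provided skeleton instead.

```python
# Hedged sketch of fit-and-evaluate for class-conditional Gaussians.
import numpy as np

def fit_class_gaussians(X, t, n_classes=10, reg=0.01):
    """Per-class mean and full covariance, computed manually (no np.cov),
    with the 0.01*I stabilizer from the hint."""
    D = X.shape[1]
    means = np.zeros((n_classes, D))
    covs = np.zeros((n_classes, D, D))
    for k in range(n_classes):
        Xk = X[t == k]
        means[k] = Xk.mean(axis=0)
        centered = Xk - means[k]
        covs[k] = centered.T @ centered / Xk.shape[0] + reg * np.eye(D)
    return means, covs

def class_log_densities(X, means, covs):
    """log p(x | t=k) from equation (0.4) for every x and k; shape (N, K)."""
    N, D = X.shape
    K = means.shape[0]
    out = np.zeros((N, K))
    for k in range(K):
        diff = X - means[k]
        _, logdet = np.linalg.slogdet(covs[k])
        solved = np.linalg.solve(covs[k], diff.T).T   # Sigma_k^{-1} (x - mu_k)
        quad = np.sum(diff * solved, axis=1)
        out[:, k] = -0.5 * (D * np.log(2 * np.pi) + logdet + quad)
    return out

def avg_conditional_ll_and_accuracy(X, t, means, covs):
    """Uniform prior p(t=k)=1/10; posterior via log-sum-exp for stability."""
    log_joint = class_log_densities(X, means, covs) + np.log(1.0 / means.shape[0])
    m = log_joint.max(axis=1, keepdims=True)
    log_post = log_joint - (m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True)))
    avg_cll = log_post[np.arange(len(t)), t].mean()
    acc = (log_post.argmax(axis=1) == t).mean()
    return avg_cll, acc

# Synthetic stand-in for the digit data (the real data loads from .txt files).
rng = np.random.default_rng(0)
t_train = np.repeat(np.arange(10), 50)
X_train = rng.normal(loc=t_train[:, None] * 0.1, scale=0.2, size=(500, 64))
means, covs = fit_class_gaussians(X_train, t_train)
print(avg_conditional_ll_and_accuracy(X_train, t_train, means, covs))
```

The log-sum-exp step is one reasonable way to keep the 64-dimensional posterior numerically stable, complementing the $0.01 I$ regularizer the handout requires for the covariance matrices.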
