On algorithms: the veil of the DSC-190 process


DSC 190 – “Super Homework”
Due: Wednesday, June 8
Write your solutions to the following problems by either typing them up or handwriting them on another
piece of paper. Unless otherwise noted by the problem’s instructions, show your work or provide some
justification for your answer. Homeworks are due via Gradescope at 11:59 PM.
Problem 1.
On Midterm 02, there was a question along the following lines. Suppose you have a data set of points $X$ in
$\mathbb{R}^{100}$ and wish to use PCA to reduce the dimensionality to 50. Consider these two approaches:
• Approach 1: Run PCA once to go directly from $\mathbb{R}^{100}$ to $\mathbb{R}^{50}$, constructing a new data set $Z_1$.
• Approach 2: First run PCA with $k = 75$ to create an intermediate data set $Z'$ of points in $\mathbb{R}^{75}$, then
run PCA with $k = 50$ on $Z'$ to create a new data set $Z_2$.
Is there any difference between the two approaches?
The correct answer is: no, there is not. That is, $Z_1 = Z_2$. The exam’s answer key gave an intuitive geometric
explanation of the answer; here we will derive it more rigorously.
In this problem, assume that $X$ is an $n \times d$ matrix of $n$ data points in $\mathbb{R}^d$; furthermore, assume the data are
centered. Let $C$ be the covariance matrix of the original data. Let $C'$ be the covariance matrix of $Z'$ (the
intermediate data in approach #2). Let $U_{75}$ be a $100 \times 75$ matrix consisting of the top 75 eigenvectors of $C$,
and let $U_{50}$ be a $100 \times 50$ matrix consisting of the top 50 eigenvectors of $C$. Then the new PCA features in
approach 1 are $Z_1 = X U_{50}$, and the intermediate PCA features in approach 2 are $Z' = X U_{75}$.
Throughout this problem you may assume for simplicity that all eigenvalues are unique.
a) Recall that $C'$ is the covariance matrix of $Z'$, the intermediate data in approach #2. Show that $C'$ is
a diagonal matrix.
Hint: $C' = (Z')^T Z'$. Also remember that for general matrices $A$ and $B$, $(AB)^T = B^T A^T$.
b) The data set $Z_2$ is computed by multiplying the intermediate data set $Z'$ by a $75 \times 50$ matrix $U'$
consisting of the top 50 eigenvectors of the covariance matrix $C'$.
Argue that $U'$ is the matrix where entry $u'_{ii} = 1$ and all other entries are zero. That is, it is a kind of
rectangular identity matrix.
c) Using what we have learned above, show that $Z_2 = X U_{50}$, and is therefore equal to $Z_1$.
Hint: $Z_2 = Z' U'$. Start by substituting for both $U'$ and $Z'$.
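As a numerical sanity check (not a substitute for the derivation asked for above), the following sketch compares the two approaches on synthetic data. The data, the random seed, and the helper top_eigenvectors are assumptions made for illustration; none of them come from the assignment. Because numerically computed eigenvectors are only determined up to sign, the comparisons use absolute values.

```python
import numpy as np

def top_eigenvectors(C, k):
    """Return, as columns, the k eigenvectors of symmetric C with the largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    return eigvecs[:, order]

rng = np.random.default_rng(0)                # arbitrary seed, for reproducibility only
n, d = 500, 100
# Synthetic data with a different scale per column, so the eigenvalues are well separated.
X = rng.normal(size=(n, d)) * np.linspace(1.0, 10.0, d)
X = X - X.mean(axis=0)                        # center the data, as the problem assumes

C = X.T @ X                                   # covariance up to a constant factor, as in the hint
U50 = top_eigenvectors(C, 50)
U75 = top_eigenvectors(C, 75)

Z1 = X @ U50                                  # approach 1
Z_prime = X @ U75                             # approach 2: intermediate data
C_prime = Z_prime.T @ Z_prime
U_prime = top_eigenvectors(C_prime, 50)
Z2 = Z_prime @ U_prime                        # approach 2: final data

# Part a: largest off-diagonal entry of C' (in absolute value).
max_off_diag = np.abs(C_prime - np.diag(np.diag(C_prime))).max()
# Part b: distance of |U'| from the 75 x 50 rectangular identity.
identity_err = np.abs(np.abs(U_prime) - np.eye(75, 50)).max()
# Part c: distance between Z1 and Z2, ignoring column signs.
max_diff = np.abs(np.abs(Z1) - np.abs(Z2)).max()

print(max_off_diag, identity_err, max_diff)
```

On this synthetic data, the off-diagonal entries of $C'$ should be negligible relative to its diagonal, and the last two quantities should be close to zero, up to floating-point error.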
Problem 2.
As a data scientist you will have the opportunity to work on problems that are of great importance to society.
This is not one of those problems.
The menu-match dataset consists of 646 images of food from three different restaurants: an Asian restaurant,
an Italian restaurant, and a soup restaurant. The data set was constructed by employees of Microsoft
Research (see http://neelj.com/projects/menumatch/).
The data set is available at the following link:
https://f000.backblazeb2.com/file/dsc-data/menu-match.npz
The file is in compressed numpy format; it can be loaded with np.load. Once loaded, it behaves like a
dictionary with four keys: X_train, X_test, y_train, and y_test, corresponding to the training data, test
data, training labels, and test labels, respectively. For example, to get the training data:

    import numpy as np
    data = np.load("menu-match.npz")  # dictionary-like object holding the four arrays
    data["X_train"]                   # the training images
The training data is a 480 x 48 x 48 x 3 tensor, since it consists of 480 images, each 48 x 48 pixels with
3 colors.
Note that the given labels are strings: "a" for Asian restaurant, "i" for Italian, and "s" for soup.
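Since the task posed in this problem is binary (Asian restaurant or not), the string labels need to be mapped to 0/1 before training. A minimal sketch, assuming the labels load as a NumPy array of ordinary strings; the helper name binarize_labels and the example array are hypothetical and stand in for data["y_train"]:

```python
import numpy as np

def binarize_labels(y, positive="a"):
    """Map string labels to 1 for the positive class and 0 for every other class."""
    return (np.asarray(y) == positive).astype(np.int32)

# Made-up labels for illustration only; not taken from the data set.
y_example = np.array(["a", "i", "s", "a"])
y_binary = binarize_labels(y_example)
print(y_binary)
```

If the labels turn out to be stored as bytes rather than strings, the comparison value would need to be adjusted accordingly.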
You may display an image with matplotlib using plt.imshow. For example:

    import matplotlib.pyplot as plt
    # plot test example #30
    plt.imshow(data["X_test"][30])
Using tensorflow, train a convolutional neural network to predict whether a given image is from the Asian
restaurant or not (thus turning the problem into a binary classification problem). You will need to determine
the network architecture, but the example code given in lecture and in the discussion is a good starting point.
Report the test accuracy – your model should be able to get above 70% of the test set correct (preferably
higher!). Show your code.
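One possible starting point is sketched below using the Keras API in tensorflow. The architecture (two convolution/pooling stages followed by a small dense layer), the layer sizes, and the helper name build_model are assumptions chosen for illustration; this is not the architecture from lecture, and the sketch makes no claim about the accuracy it reaches.

```python
import tensorflow as tf

def build_model(input_shape=(48, 48, 3)):
    """Build and compile a small CNN for binary classification (hypothetical architecture)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),                 # 48 x 48 images with 3 color channels
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),    # estimated probability of "Asian restaurant"
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
model.summary()
```

Training would then be a call to model.fit on the training images and 0/1 labels, followed by model.evaluate on the test images and 0/1 labels. Scaling the pixel values to [0, 1] first may help, depending on how the images are stored.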
