Fall 2021 CS 5340/4340 Project 6
Points 100 (UG) or 200 (G)
Due: Dec 9, 11:59 pm
This project is on implementing regression with regularization, with the regularization
parameter estimated from cross-validation. Do not use any package, except for the use of
random number generators and matrix operations (if any). That is, do not use any package
where regression, cross-validation, ridge or lasso are implemented as built-in functions. Use the
function $y = f(x) = x^2 + 10$ to create a random sample $(x_j, y_j)$ (training data) of size 12 (that is, j = 1
to 12) by uniform random sampling of x in -2 <= x <= 10. Obtain another five points (x-y pairs)
(by uniform random sampling of x in the same interval) to be used as the test set. Make sure
that the training set and the test set have no points in common; in the unlikely event of an
overlap, just re-sample.
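
The sampling step described above could be sketched as follows. This is only an illustration, assuming NumPy is an acceptable source of random numbers; the helper name `sample_data`, the fixed seed, and the noise-free targets are assumptions, not requirements of the assignment.

```python
import numpy as np

def f(x):
    # Target function given in the handout: y = x^2 + 10
    return x ** 2 + 10

def sample_data(n_train=12, n_test=5, low=-2.0, high=10.0, seed=0):
    # Hypothetical helper; the seed is fixed only to make the run repeatable.
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(low, high, size=n_train)
    x_test = rng.uniform(low, high, size=n_test)
    # In the unlikely event of an overlap, re-sample the test points.
    while np.intersect1d(x_train, x_test).size > 0:
        x_test = rng.uniform(low, high, size=n_test)
    return x_train, f(x_train), x_test, f(x_test)

x_train, y_train, x_test, y_test = sample_data()
```

Changing `seed` gives a different sample; the report should list whichever 12 + 5 points were actually used.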
- Obtain a non-linear (quadratic) regression of y on x. Do this without regularization.
- Next, implement linear regression of y on x with regularization (use “ridge” regression,
with penalty $\lambda \sum_i w_i^2$, where the $w_i$ are the regression weights), obtaining your $\lambda$ from three-fold cross-validation. Try the following values of
$\lambda$ and choose one using cross-validation: 0.1, 1, 10, 100.
- [For the grad section only; this part will require you to look up, read and implement
algorithms not covered in class] Finally, replace ridge in part 2 above with “lasso,”
with penalty $\lambda \sum_i |w_i|$, and re-do the computation. See the ISLR book for a good description of lasso.
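
The three fits listed above might look like the minimal sketch below. It is one possible reading, not the required method. Assumptions: the intercept is left unpenalized, the penalty is added to the sum (not the mean) of squared residuals, and `np.linalg.solve` counts as a permitted matrix operation. The function names are hypothetical. For a single feature with an unpenalized intercept, the lasso slope reduces to a soft-threshold of the centered cross-product, which is what `fit_lasso_line` uses.

```python
import numpy as np

def fit_quadratic(x, y):
    # Unregularized least squares on the features 1, x, x^2 (normal equations).
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    return np.linalg.solve(X.T @ X, X.T @ y)  # [c0, c1, c2]

def fit_ridge_line(x, y, lam):
    # Minimizes sum (y - b - w*x)^2 + lam * w^2.
    X = np.column_stack([np.ones_like(x), x])
    D = np.diag([0.0, 1.0])  # assumption: the intercept b is not penalized
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)  # [b, w]

def fit_lasso_line(x, y, lam):
    # Minimizes sum (y - b - w*x)^2 + lam * |w| (intercept not penalized).
    xc, yc = x - x.mean(), y - y.mean()
    sxy, sxx = xc @ yc, xc @ xc
    # Soft-thresholding: shrink the cross-product toward zero by lam / 2.
    w = np.sign(sxy) * max(abs(sxy) - lam / 2.0, 0.0) / sxx
    return np.array([y.mean() - w * x.mean(), w])
```

With more than one penalized weight the lasso has no such one-line solution, and an iterative method such as coordinate descent would be needed instead.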
Note that this project uses no single, fixed validation set because 3-fold cross-validation
arranges for multiple validation sets to be implicitly obtained from the training set; that is, we
are not using the traditional three-set (train/validate/test) scenario but the two-set
scenario (train with cross-validation, then test).
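
One way to arrange the three folds is sketched below. The 12 training points are shuffled once and split into three folds of four points; each fold serves once as the validation set while the other eight points are used for fitting. The helper names, the fixed seed, the unpenalized intercept, and the choice to average the three fold errors are all assumptions made for illustration.

```python
import numpy as np

def fit_ridge_line(x, y, lam):
    # Ridge line fit; assumption: the intercept is not penalized.
    X = np.column_stack([np.ones_like(x), x])
    D = np.diag([0.0, 1.0])
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)  # [b, w]

def mse(y_true, y_pred):
    # Average of squared differences.
    return float(np.mean((y_true - y_pred) ** 2))

def cross_validate(x, y, lam, k=3, seed=0):
    # The same seed gives the same folds for every lam, so errors are comparable.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        b, w = fit_ridge_line(x[trn], y[trn], lam)
        errors.append(mse(y[val], b + w * x[val]))
    return float(np.mean(errors))

def choose_lambda(x, y, lambdas=(0.1, 1, 10, 100)):
    # Returns the candidate with the smallest cross-validation error,
    # plus the full table of (lam, error) pairs.
    cv = {lam: cross_validate(x, y, lam) for lam in lambdas}
    best = min(cv, key=cv.get)
    return best, cv
```

After `choose_lambda` returns, the final model would be re-fit with the chosen value on all 12 training points, for example `fit_ridge_line(x_train, y_train, best)`.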
Clearly provide the following in your report as an itemized list (in addition to the source code):
(a) The twelve (x,y) values in the training set and the five (x,y) values in the test set.
(b) The equation of the non-linear regression relation you obtained in part 1; the in-sample
error; and the test error.
(c) The values of $\lambda$ and the corresponding cross-validation errors (four such pairs) obtained
from the cross-validation phase in part 2 (also in part 3 for the grad section).
(d) The final (best) $\lambda$ you chose after performing cross-validation in part 2 (and part 3 for
the grad section).
(e) The final equation of the regression relation you obtained after regularization (you
should have re-trained with the final $\lambda$ on all 12 points in the training set); the
corresponding (final) in-sample error (state how many points were used in the
calculation of this error); and the corresponding test error.
(f) State whether or not you obtained your final (best) $\lambda$ analytically. Just state (no
explanation).
(g) For ridge (and lasso for the grad section), arrange your three numerical errors (the
cross-validation error, the final in-sample (training) error, the test error) in ascending
order.
Plot your data points and the final solutions (by hand or with plotting software; no extra
credit for using the software). Specifically, your plot should show 12 + 5 points on the x-y plane
and two (three for the grad section) final solutions (along with their respective equations) on the
same plot: one curve for regression without regularization and one (two) line(s) for regression
with regularization.
Note that our original x is one-dimensional here. Take the average of squared differences as the
error.
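
Written out, the error measure described above is the mean squared error. The symbols $\hat{y}_j$ (the model's prediction at $x_j$) and $N$ (the number of points the error is computed over) are notation assumed here, not taken from the handout.

```latex
E = \frac{1}{N} \sum_{j=1}^{N} \bigl( y_j - \hat{y}_j \bigr)^2
```

Under this reading, $N = 12$ for the final in-sample error, $N = 5$ for the test error, and $N = 4$ for each validation fold in 3-fold cross-validation.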
Submission instructions as before. Please submit a single pdf (or Word or text) file on Canvas. If
you find it hard to merge text/figure and code into a single file, please submit multiple pdf files
(do NOT upload a zip file). Do not just provide a link to your code somewhere on the web.
Please write legibly if you are submitting any hand-written (and scanned) stuff.