Assignment 2
Machine Learning and Big Data for Economics and Finance
Exercise 1. In this exercise, all the cross-validation simulations should involve a random split
of the original sample into a training subsample corresponding to 90% of the observations and
a testing subsample corresponding to the remaining 10% of the observations.

1. Generate a sample of size $n = 100$ from the following model:
   - $X$ is a uniform random variable over the interval $(-1, 1)$.
   - $\varepsilon$ is a normal random variable with mean 0 and standard deviation 1.
   - $\varepsilon$ is generated independently of $X$.
   - $Y$ is a random variable linked to $X$ through the following equation:
     $Y = 12X^3 - 5X^2 - 10X + \varepsilon$
   From now on, delete $\varepsilon$ and keep the generated $X$ and $Y$ samples in an R data frame
   object that you will call dp. We will now consider a supervised learning setup where $Y$
   is the output variable and $X$ is the input variable. From now on, denote the observations
   of the input variable by $x_i$ and those of the output variable by $y_i$ (for $i = 1, \ldots, n$).

2. Fit linear, quadratic and cubic regressions to the data in dp. Create three plots (one for
   each model), where each plot has $x_i$ on the X-axis and the true $y_i$ and predicted
   $\hat{y}_i$ on the Y-axis. Compute the training and testing mean squared errors for each model.

3. We are interested in constructing a step function learner as follows:
   - First draw a random number $U$ uniformly on the interval spanned by the minimum
     and maximum values of the input values $(x_1, \ldots, x_n)$ and then use it to construct the
     following function, whose purpose is to give the prediction of $Y$ given $X = x$:
     $f(x) = \beta_1 I(U \le x) + \beta_2 I(U > x)$,
     where $\beta_1$ and $\beta_2$ are unknown constants to be learned, and
     $I(\text{statement})$ is the indicator function that equals 1 when the statement is true
     and 0 otherwise.
   - Construct an R function called wst that implements an estimate
     $\hat{f}(x) = \hat{\beta}_1 I(U \le x) + \hat{\beta}_2 I(U > x)$ of $f$.
     This R function must take the following inputs:
     - A vector x of values at which you would like to compute $\hat{f}$.
     - A data frame data containing the input and output variables.
     - An optional numerical input argument u that overrides the behavior of the
       learner by forcing the cutpoint of $f$ to be at u instead of the randomly generated
       cutpoint $U$.
   - The output of the function wst should be a list that contains the following:
     - A vector fitted that contains the predictions at x.
     - A vector coefficients that contains $\hat{\beta}_1$ and $\hat{\beta}_2$.
     - A number cutpoint that is either equal to the input u provided by the function
       user or equal to the randomly generated $U$ if the input argument u is not provided.

4. Assess the variability of both the testing and the training mean squared errors of $\hat{f}$
   when evaluated at the data dp by drawing $B = 1000$ bootstrap samples.
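The setup in parts 1 and 3 can be sketched as follows. This is only one possible reading, not a reference solution. It assumes a noise standard deviation of 1, data frame columns named x and y, and least-squares estimates of the two constants (the mean of the outputs on each side of the cutpoint). The seed and the helper names idx, train, test and test_mse are illustrative choices that the assignment does not prescribe.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Part 1: simulate the sample (noise sd = 1 is an assumption)
n <- 100
x <- runif(n, min = -1, max = 1)
eps <- rnorm(n, mean = 0, sd = 1)
dp <- data.frame(x = x, y = 12 * x^3 - 5 * x^2 - 10 * x + eps)
rm(eps)  # the noise is discarded once y has been built

# Part 3: step-function learner; the column names x and y are assumed
wst <- function(x, data, u = NULL) {
  # Use the supplied cutpoint, otherwise draw one over the range of the inputs
  cutpoint <- if (is.null(u)) runif(1, min(data$x), max(data$x)) else u
  # Least-squares estimates: mean of y on each side of the cutpoint
  # (a side with no observations would give NaN; not handled in this sketch)
  b1 <- mean(data$y[data$x >= cutpoint])  # region where U <= x
  b2 <- mean(data$y[data$x < cutpoint])   # region where U > x
  fitted <- ifelse(cutpoint <= x, b1, b2)
  list(fitted = fitted, coefficients = c(b1, b2), cutpoint = cutpoint)
}

# One random 90/10 split, as required for the cross-validation simulations
idx <- sample(n, size = round(0.9 * n))
train <- dp[idx, ]
test <- dp[-idx, ]
fit <- wst(test$x, train)                  # learn on train, predict at test inputs
test_mse <- mean((test$y - fit$fitted)^2)  # testing mean squared error
```

Repeating the split (or resampling dp with replacement, as part 4 asks) and storing the resulting errors gives a picture of how much they vary with the random cutpoint and the sample.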
Exercise 2. Describe mathematically what the following code does. Add comments to each
line describing what each line accomplishes. Design a scenario where this code would be useful
and write that scenario in R. (The scenario should not involve more than four lines of R code)
OpSp = function(y,x){
  n = length(y)
  s = n - 1
  xs = sort(x)
  ci = rep(0,s)
  ssri = rep(0,s)
  for (i in 1:s){
    ci[i] = (xs[i] + xs[i+1])/2
    y1 = y[x<ci[i]] - mean(y[x<ci[i]] )
    y2 = y[x>ci[i]] - mean(y[x>ci[i]] )
    ssri[i] = sum(y1^2) + sum(y2^2)
  }
  return(ci[which.min(ssri)])
}
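To see how the function behaves, it can be run on a small made-up sample with an obvious jump. The toy vectors x_toy and y_toy below are hypothetical and serve only as an illustration. The function is repeated with ASCII minus signs so that the snippet runs on its own.

```r
OpSp = function(y, x){
  n = length(y)
  s = n - 1
  xs = sort(x)
  ci = rep(0, s)
  ssri = rep(0, s)
  for (i in 1:s){
    ci[i] = (xs[i] + xs[i+1])/2
    y1 = y[x < ci[i]] - mean(y[x < ci[i]])
    y2 = y[x > ci[i]] - mean(y[x > ci[i]])
    ssri[i] = sum(y1^2) + sum(y2^2)
  }
  return(ci[which.min(ssri)])
}

# Hypothetical toy data: the output jumps from 0 to 5 between x = 3 and x = 4
x_toy <- c(1, 2, 3, 4, 5, 6)
y_toy <- c(0, 0, 0, 5, 5, 5)
cut_toy <- OpSp(y_toy, x_toy)
cut_toy  # 3.5
```

The value returned is the midpoint between the two neighbouring inputs where the jump occurs, since splitting there leaves no residual variation on either side.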