On Machine Learning: ECOM151 Big Data Algorithms

ECOM151: Big Data Applications in Finance
Individual Assignment
Vimal Balasubramaniam
25 February 2020
Details
Grading: 20% of your final grade.
Deadline: 31 March, 2020. Time: 22:00hrs.
Submission mode: QMplus
Submission files: 1) An assignment_results.R file, and 2) A “Prediction Report” of at most
5 pages with the interpretation of your results.
Please refrain from copying code from each other. Your code and output will give
us sufficient information to detect such practices, and they will be penalized.
What am I grading you on?
I am looking to test your ability to use the models learned during the first half of the course, both
to apply them to real data and to synthesize insights into a report.
This assignment requires you to apply all the skills learned during the first half of the term,
and the models learned during the lectures. Your ability to estimate the model is only half of the
challenge. Your interpretation of the findings matters equally.
Can I ask assignment related questions during office hours?
The questions that your TA and I will answer are only those that clarify the meaning of a question. We
will not provide any clue on whether what you are doing is “right” or “wrong”, or suggest ways to
improve your code. However, if you have doubts about estimating a model, we ask you to phrase
your questions in terms of the tutorial datasets and models, so that we are able to provide guidance.
The dataset provided to you should not present many challenges to resolve before you estimate a
model. However, if you face any challenges while using the data on Rstudio.cloud, do reach out to
us. I strongly recommend that you work on your assignment on your own computer, and not on the cloud,
so that you have complete control over the file. Any excuses that the file was not “saved”, or went
“missing” on the cloud will not be entertained.
Submission files
You are expected to create an R file as part of your submission. The R file should be self-contained,
i.e., I should be able to run the code on my computer, loading the same data file, without
any error.
In addition to all your code, you are expected to store your answers in objects named as
specifically instructed in each question. The name of the object is provided in blue colour next to each
question. All “R objects” or columns in objects will also be referred to in blue in the text below.
In addition to the R code, you are expected to create a PDF submission of no more than 5 pages
that presents your evaluation of the models.
The Assignment
LendingClub is an American peer-to-peer lending company, currently the world’s largest
platform that allows individuals both to invest and to borrow. Borrowers can obtain
unsecured personal loans from the platform, and this assignment is set up to assess your
ability to predict defaulters using the predictors provided in the data.
The data is a random sample of loans issued on the platform between 2007 and 2015, including
the loan status and payment information. The data also contains a number of predictors that are
documented in the variable description file provided to you, named “ECOM151-AssignmentVariableDescription.xlsx”.
For tractability, your assignment focuses only on a small set of the variables
available for prediction.
You have been provided with one RData file named “ECOM151-Assignment-Data.rda” that
contains three objects:
trainData: This is the dataset on which you will train all your models.
testData: This is the dataset on which you will evaluate your model’s fit.
varDescription: This is a replication of the variable description available in the excel spreadsheet
provided to you.
Question A (10 points)
This question expects you to estimate five different classes of models to identify the best model to
predict default on the LendingClub platform.
Set up the data
Load the RData file provided to you on to your work environment.
Questions:

Create a new variable in trainData called “y” which takes the value 1 if loan_status is
“Charged Off” and 0 otherwise.
All variables provided to you other than loan_status are referred to as “predictors”.
Find the 10 variables most positively correlated with y and store them as a4.
Find the 10 variables most negatively correlated with y and store them as a5.
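As a minimal sketch of the steps above, the following builds a 0/1 outcome and ranks numeric predictors by their correlation with it. The data frame toy is a made-up stand-in for trainData, and every column name other than loan_status and y is hypothetical. Only numeric columns enter cor(), so categorical predictors would need converting first.

```r
# Toy stand-in for trainData; all columns except loan_status are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(
  loan_status = sample(c("Fully Paid", "Charged Off", "Current"), n, replace = TRUE),
  int_rate    = runif(n, 5, 25),
  annual_inc  = rlnorm(n, meanlog = 11, sdlog = 0.5),
  dti         = runif(n, 0, 40),
  stringsAsFactors = FALSE
)

# y = 1 for charged-off loans, 0 for every other status.
toy$y <- as.integer(toy$loan_status == "Charged Off")

# Correlation of each numeric predictor with y (y itself is excluded).
is_num <- sapply(toy, is.numeric) & names(toy) != "y"
cors   <- sapply(toy[is_num], function(x) cor(x, toy$y, use = "complete.obs"))

# Most positive and most negative correlations: up to 10 of each,
# fewer if fewer predictors have that sign.
top_pos <- head(sort(cors[cors > 0], decreasing = TRUE), 10)
top_neg <- head(sort(cors[cors < 0]), 10)
```

With simulated noise like this the correlations are all close to zero; on real data the same pattern returns named vectors whose names are the variables to report.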
Now, we are ready to run the five models of interest. Spend time visualizing the data. Inspect
it for potential reasons why a model may fail to estimate.
Pay particular attention to whether you would like to transform your variables (for example, a
logarithmic transformation). This may also help with interpreting coefficients in your “Prediction
Report”.
You may also want to consider converting some of the categorical variables (for example,
emp_length) into continuous variables.
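The two transformations mentioned above might look like the sketch below. It assumes emp_length holds strings such as “< 1 year” or “10+ years”; the example values and the annual_inc vector are hypothetical, so check the actual levels in trainData before reusing the pattern.

```r
# Hypothetical emp_length strings; check the actual levels in trainData first.
emp_length <- c("< 1 year", "1 year", "3 years", "10+ years", "n/a")

# Keep only the digits; strings without digits (e.g. "n/a") become NA.
emp_years <- suppressWarnings(as.numeric(gsub("[^0-9]", "", emp_length)))

# Design choice for this sketch: treat "< 1 year" as 0 years.
emp_years[emp_length == "< 1 year"] <- 0

# Log transform of a right-skewed variable. log1p(x) = log(1 + x),
# which stays finite when x is 0.
annual_inc     <- c(0, 25000, 60000, 250000)
log_annual_inc <- log1p(annual_inc)
```

Mapping “10+ years” to 10 censors longer tenures at 10, which is worth stating in the report if the variable turns out to matter.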
LINEAR REGRESSION MODEL: Fit a linear regression model to the trainData, with y as the
outcome variable, with the predictors.
(a) What is the Mean squared error for the training data? Store this value in object named
m1.1.
(b) What is the Mean squared error for the testing data? Store this value in object named
m1.2.
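For reference, the mean squared error is MSE = (1/n) * sum((y_i - yhat_i)^2), computed once on the data used to fit the model and once on held-out data. The sketch below shows the mechanics on simulated data; toy_train, toy_test, x1 and x2 are hypothetical stand-ins, not the assignment data.

```r
set.seed(42)

# Simulated stand-ins for trainData and testData (hypothetical columns).
make_toy <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  p  <- plogis(-1 + 0.8 * x1 - 0.5 * x2)
  data.frame(y = rbinom(n, 1, p), x1 = x1, x2 = x2)
}
toy_train <- make_toy(500)
toy_test  <- make_toy(200)

# Linear probability model: ordinary least squares with a 0/1 outcome.
fit <- lm(y ~ ., data = toy_train)

# Mean squared error: average squared gap between outcome and prediction.
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse_train <- mse(toy_train$y, predict(fit, newdata = toy_train))
mse_test  <- mse(toy_test$y,  predict(fit, newdata = toy_test))
```

The same mse() helper can be reused for every model class, which keeps the training and test errors comparable across models.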
BEST SUBSET MODEL: Fit a “Best subset selection” model to the trainData, with y as the
outcome variable, with the predictors.
Explore all approaches: “forward”, “backward”, and “unconstrained” best subset. Note that it
may take some time to execute the model in R, especially with all of the predictors.
(a) What is the Mean squared error for the “best” model of this class for the training data?
Store this value in object named m2.1.
(b) What is the Mean squared error for the “best” model of this class for the test data? Store
this value in object named m2.2.
(c) What are the variables in the “best” model of this class? Store this in object named
m2.3.
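One possible way to run the three searches is regsubsets() from the leaps package, where method = "exhaustive" corresponds to the unconstrained best-subset search. This is a sketch under assumptions: the toy data is simulated with a continuous outcome, the variable names are hypothetical, and choosing the model size by BIC is only one of several reasonable criteria.

```r
library(leaps)  # provides regsubsets()

set.seed(7)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 1.5 * toy$x3 + rnorm(n)

# Run all three search strategies with up to 4 predictors each.
methods <- c("forward", "backward", "exhaustive")
fits <- lapply(methods, function(m)
  regsubsets(y ~ ., data = toy, nvmax = 4, method = m))
names(fits) <- methods

# One way to pick a model size: the lowest BIC.
best_size <- which.min(summary(fits$exhaustive)$bic)
best_coef <- coef(fits$exhaustive, id = best_size)

# regsubsets objects have no predict() method, so predict by hand:
# multiply the selected design-matrix columns by their coefficients.
X    <- model.matrix(y ~ ., data = toy)
pred <- drop(X[, names(best_coef), drop = FALSE] %*% best_coef)
mse  <- mean((toy$y - pred)^2)
```

To get a test-set error, build the model matrix from the test data in the same way and reuse best_coef. The names of best_coef (minus the intercept) are the variables in the selected model.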
RIDGE REGRESSION MODEL: Fit a ridge regression model to the trainData, with y as the
outcome variable, with the predictors.
Explore all values of lambda (10^10 to 10^-2), setting alpha = 0, as in the tutorials. Hint:
glmnet() cannot handle “factor” or categorical variables. You will have to convert them into
dummies to be used.
(a) What is the Mean squared error for the “best” model of this class for the training data?
Store this value in object named m3.1.
(b) What is the Mean squared error for the “best” model of this class for the test data? Store
this value in object named m3.2.
(c) What are the 10 most important variables in the “best” model of this class? Store this in
object named m3.3.
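A ridge fit along these lines could be sketched as follows. The toy data and the grade factor are hypothetical, and using cv.glmnet() to choose lambda by cross-validation is an assumption about how the “best” model is picked, not a requirement stated here.

```r
library(glmnet)

set.seed(123)
n <- 300
toy <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  grade = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$y <- rbinom(n, 1, plogis(-1 + toy$x1 + 0.7 * (toy$grade == "C")))

# model.matrix() expands factors into dummy columns; [, -1] drops the
# intercept column because glmnet adds its own.
X <- model.matrix(y ~ ., data = toy)[, -1]

# Lambda grid from 10^10 down to 10^-2.
grid <- 10^seq(10, -2, length.out = 100)

# alpha = 0 selects the ridge penalty; cross-validation picks lambda.
cv_fit      <- cv.glmnet(X, toy$y, alpha = 0, lambda = grid)
best_lambda <- cv_fit$lambda.min

# Predictions and training MSE at the chosen lambda.
pred <- predict(cv_fit, newx = X, s = "lambda.min")
mse  <- mean((toy$y - pred)^2)

# Coefficients at the chosen lambda (a sparse one-column matrix).
ridge_coef <- coef(cv_fit, s = "lambda.min")
```

For the test error, build the test matrix with the same model.matrix() call so that the dummy columns line up with those used in training.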
LASSO: Fit a LASSO to the trainData, with y as the outcome variable, with the predictors.
Explore all values of lambda, setting the alpha parameter to 1 (the lasso penalty) in glmnet().
Hint: glmnet() cannot handle “factor” or categorical variables. You will have to convert them
into dummies to be used.
(a) What is the Mean squared error for the “best” model of this class for the training data?
Store this value in object named m4.1.
(b) What is the Mean squared error for the “best” model of this class for the test data? Store
this value in object named m4.2.
(c) What are the 10 most important variables in the “best” model of this class? Store this in
object named m4.3.
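Because the lasso sets some coefficients exactly to zero, one rough way to rank variables is by the absolute size of the surviving coefficients. The sketch below assumes simulated predictors x1 to x6 that share the same scale; with real data on different scales, the predictors would need standardizing before magnitudes are compared. Ranking by coefficient size is an illustrative choice, not the only definition of “important”.

```r
library(glmnet)

set.seed(99)
n <- 400
X <- matrix(rnorm(n * 6), n, 6, dimnames = list(NULL, paste0("x", 1:6)))
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * X[, "x1"] - 0.9 * X[, "x4"]))

# alpha = 1 selects the lasso penalty; lambda is chosen by cross-validation.
cv_fit <- cv.glmnet(X, y, alpha = 1)

# Coefficients at the chosen lambda, as a named vector without the intercept.
b <- as.matrix(coef(cv_fit, s = "lambda.min"))[-1, 1]

# Keep the non-zero coefficients and rank them by magnitude.
nonzero <- b[b != 0]
ranked  <- names(sort(abs(nonzero), decreasing = TRUE))
head(ranked, 10)
```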
RANDOMFOREST: Fit a randomForest to the trainData, with y as the outcome variable, with
the predictors.
Explore and fit the best model of this class. Hint: randomForest() cannot handle “factor” or
categorical variables. You will have to convert them into dummies to be used.
(a) What is the Mean squared error for the “best” model of this class for the training data?
Store this value in object named m5.1.
(b) What is the Mean squared error for the “best” model of this class for the test data? Store
this value in object named m5.2.
(c) How important are the variables in predicting default? Store this value in object named
m5.3.
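A regression forest keeps the error measure comparable with the other models, since y stays numeric and MSE applies directly. The following is a sketch on simulated data; the variable names and the ntree and mtry values are hypothetical placeholders rather than recommended settings.

```r
library(randomForest)

set.seed(2020)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + 1.5 * toy$x1))
train <- toy[1:200, ]
test  <- toy[201:300, ]

# Numeric y gives a regression forest. randomForest warns that y has few
# unique values; that is expected for a 0/1 outcome, so it is silenced here.
rf <- suppressWarnings(
  randomForest(y ~ ., data = train, ntree = 200, mtry = 2, importance = TRUE)
)

mse_train <- mean((train$y - predict(rf, newdata = train))^2)
mse_test  <- mean((test$y  - predict(rf, newdata = test))^2)

# Variable importance: columns %IncMSE and IncNodePurity for regression.
imp <- importance(rf)
```

Predicting back onto the training rows tends to give an optimistic training error for a forest. Calling predict(rf) without newdata returns out-of-bag predictions, which are usually a fairer in-sample benchmark.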
Compare and contrast the predictive power of all approaches and identify the best model to
predict default in the LendingClub data. Store this model in object named finalModel.
Question B (10 points)
You are required to synthesize all the work in Question A into a “Prediction Report” for
your manager on your ability to predict default for borrowers on the LendingClub platform. Use
all the information you have generated to write a report no longer than 5 pages and present your
best model to your manager. Take care to explain why it is the best model in terms of its
out-of-sample predictive power, and visualize the model’s predictive power compared to the other models
on hand.
