DATA3888 (2022): Assignment 1
Instructions
- Your assignment submission needs to be an HTML document that you have compiled using R Markdown. Name your file as “SIDXXX_Assignment.Rmd”, where XXX is your Student ID.
- Under author, put your Student ID at the top of the Rmd file (NOT your name).
- For your assignment, please use set.seed(3888).
- Do not upload the Rmd file (i.e. the code file).
- You must use code folding so that we can inspect your code where required.
- Your assignment should make sense and provide all the relevant information in the text when the code is hidden. Don’t rely on the marker to understand your code.
- Any output that you include needs to be explained in the text of the document. If your code chunk generates unnecessary output, you can suppress it by specifying results='hide' in the chunk options.
- Start each question in a separate section.
Question 1: Brain-box
A physics instructor, Louis, has created a data set stored under “Spiker_box_Louis.zip” that has a series
of sequences of varying lengths. The file name encodes the eye movements. For example, the file
“LRL_L3.wav” corresponds to left-right-left eye movements; the file “LLRLRLRL_L.wav” corresponds to
left-left-right-left-right-left-right-left eye movements. There are a total of 31 files. Build a classification
rule for detecting {L, R} under streaming conditions, where the function will take a signal sequence as input.
• (i) Estimate the accuracy of your classifier. Is your value reasonable?
• (ii) Does the length of the sequence impact the classification accuracy?
Hint: (a) Consider what metric you will use to define “performance”. You will need to explain your choice
and justify your answer.
dir("data/Spiker_box_Louis/Short")
[1] "LLL_L1.wav" "LLL_L2.wav" "LLL_L3.wav" "LLR_L1.wav" "LLR_L2.wav"
[6] "LLR_L3.wav" "LRL_L1.wav" "LRL_L2.wav" "LRL_L3.wav" "LRR_L1.wav"
[11] "LRR_L2.wav" "LRR_L3.wav" "RLL_L1.wav" "RLL_L2.wav" "RLL_L3.wav"
[16] "RLR_L1.wav" "RLR_L2.wav" "RLR_L3.wav" "RRL_L1.wav" "RRL_L2.wav"
[21] "RRL_L3.wav" "RRR_L1.wav" "RRR_L2.wav" "RRR_L3.wav"
dir("data/Spiker_box_Louis/Medium")
[1] "LLRLRLRL_L.wav" "LLRRLLLR_L.wav" "LLRRRLLL_L.wav" "LRRRLLRL_L.wav"
[5] "RRRLRLLR_L.wav"
dir("data/Spiker_box_Louis/Long")
[1] "LLLRLLLRLRRLRRRLRLLL_L.wav" "RRLRRLRLRLLLLLLRRLRL_L.wav"
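Because the file name encodes the true sequence, the labels needed to score a classifier can be recovered from the listing above. The following is a minimal sketch only: the helper names labels_from_filename and event_accuracy are hypothetical, and position-wise agreement is just one possible definition of “performance”, assumed here for illustration.

```r
# Hypothetical helper: recover the true eye-movement sequence from a file name.
# "LRL_L3.wav" -> c("L", "R", "L"); everything from the first underscore on is dropped.
labels_from_filename <- function(file) {
  stem <- sub("_.*$", "", basename(file))
  strsplit(stem, "")[[1]]
}

# Hypothetical metric: proportion of positions where the predicted event matches
# the true event. Missing or extra predicted events count as errors.
event_accuracy <- function(predicted, truth) {
  n <- max(length(predicted), length(truth))
  if (n == 0) return(NA_real_)
  m <- min(length(predicted), length(truth))
  hits <- if (m > 0) sum(predicted[seq_len(m)] == truth[seq_len(m)]) else 0
  hits / n
}

truth <- labels_from_filename("data/Spiker_box_Louis/Short/LRL_L3.wav")
acc <- event_accuracy(c("L", "R", "R"), truth)  # two of three events agree
```

A position-wise score penalises a single missed event heavily, since every later event shifts by one. An edit-distance-based score (for example via utils::adist) is a possible alternative if that behaviour is not wanted.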
Question 2: Prevalidated model
(from Week 3 lecture) The Kidney Transplant data from “GSE46474” contains the gene expression profiles
of 40 blood samples. Of those, 20 patients rejected their kidney and 20 had stable grafts and will be treated
as controls. Using this gene expression data, let’s build a classification model incorporating two types of
data using the prevalidation principle. Here, we first build a molecular signature (set of features) from
the gene expression platform to obtain a single variable known as the prevalidated outcome. Next, we model this
prevalidated outcome in combination with the other clinical variables to build a classifier of the outcome
of interest.
• (a) Build a classifier using a support vector machine (SVM) to predict the outcome of graft survival
and generate a prevalidated outcome from the gene expression data.
• (b) Use it together with the clinical variables in a logistic regression to build a risk model. Describe
your final model for classifying graft survival in different individuals and your estimate of its
accuracy.
• (c) What is the final prediction based on your final model for a 70-year-old male whose transcriptomics
profile is predicted to have a favourable survival outcome?
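The two-stage structure of prevalidation can be sketched as follows. This is an illustration under stated assumptions, not a model answer: the data are simulated (not GSE46474), the clinical variable age is hypothetical, and a simple nearest-centroid rule stands in for the SVM so that the example runs in base R. An SVM (for example from the e1071 package) could be substituted for fit_centroid and predict_centroid.

```r
set.seed(3888)

# Simulated stand-in data (NOT GSE46474): 40 samples, 100 genes.
n <- 40; p <- 100
y <- factor(rep(c("Stable", "Rejection"), each = n / 2))
X <- matrix(rnorm(n * p), nrow = n)
shift <- y == "Rejection"
X[shift, 1:5] <- X[shift, 1:5] + 1   # planted signal in five genes
age <- round(runif(n, 20, 75))       # hypothetical clinical variable

# Stand-in classifier (assumption): nearest centroid instead of an SVM.
fit_centroid <- function(X, y) {
  centroids <- sapply(levels(y), function(l) colMeans(X[y == l, , drop = FALSE]))
  list(levels = levels(y), centroids = centroids)
}
predict_centroid <- function(fit, Xnew) {
  d <- sapply(fit$levels, function(l) rowSums(sweep(Xnew, 2, fit$centroids[, l])^2))
  d <- matrix(d, nrow = nrow(Xnew))  # keep a matrix even for a single row
  fit$levels[apply(d, 1, which.min)]
}

# Stage 1: each sample's prevalidated outcome comes from a model
# that was fitted without that sample's fold.
folds <- sample(rep(1:5, length.out = n))
preval <- character(n)
for (k in 1:5) {
  held_out <- folds == k
  fit <- fit_centroid(X[!held_out, , drop = FALSE], y[!held_out])
  preval[held_out] <- predict_centroid(fit, X[held_out, , drop = FALSE])
}
preval <- factor(preval, levels = levels(y))

# Stage 2: logistic regression on the prevalidated outcome plus clinical data.
dat <- data.frame(y = y, preval = preval, age = age)
risk_model <- glm(y ~ preval + age, data = dat, family = binomial)

# Predicted probability of the second factor level ("Stable") for a new patient.
new_patient <- data.frame(preval = factor("Stable", levels = levels(y)), age = 70)
prob_stable <- predict(risk_model, newdata = new_patient, type = "response")
```

The key design point is that the gene expression model never sees the sample it is scoring, so the prevalidated variable can be compared fairly with the clinical variables in the second stage.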
Question 3 – Blood vs Biopsy Biomarker
In the data GSE46474, we estimated the accuracy of our predictive model for graft rejection from a peripheral
blood gene expression dataset. However, rejection is a very active process that occurs in the kidney itself.
Here we will look at a similar kidney microarray dataset: instead of genes being isolated and
sequenced from blood, we examine another dataset, GSE138043, where the samples have been sequenced from
a kidney biopsy. Select the top 50 most variable genes in each of the datasets GSE138043 and GSE46474
and use the selected genes to build a classifier using randomForest to predict the outcome of graft survival.
Visualize your results. We have broken this task into the following four subsections.
• (a) Select the top 50 most variable genes in each of the datasets GSE138043 and GSE46474. Combine
the two sets of genes. How many genes are in the union of these two lists?
• (b) Build two classifiers using randomForest to predict the outcome of graft survival using the genes
selected in part (a).
• (c) Perform repeated 5-fold cross-validation for each dataset and calculate the accuracy. What is
the average accuracy for the blood and biopsy biomarker models?
• (d) Select an appropriate graphic to communicate the difference between these two classification
accuracies.
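The gene selection in part (a) can be sketched in base R as below. The matrices are simulated stand-ins (genes in rows, samples in columns), and the helper name top_variable is hypothetical. The sketch also assumes both datasets share the same gene identifiers, which may need checking for the real data.

```r
set.seed(3888)

# Simulated stand-ins for the two expression matrices (NOT the GEO data).
genes <- paste0("gene", 1:500)
blood  <- matrix(rnorm(500 * 40), nrow = 500, dimnames = list(genes, NULL))
biopsy <- matrix(rnorm(500 * 30), nrow = 500, dimnames = list(genes, NULL))

# Hypothetical helper: names of the n_top genes with the largest variance.
top_variable <- function(expr, n_top = 50) {
  v <- apply(expr, 1, var)                       # per-gene variance across samples
  names(sort(v, decreasing = TRUE))[seq_len(min(n_top, length(v)))]
}

top_blood  <- top_variable(blood)
top_biopsy <- top_variable(biopsy)

# Union of the two lists; genes selected in both datasets are counted once.
gene_union <- union(top_blood, top_biopsy)
n_union <- length(gene_union)
```

The size of the union is 100 minus the number of genes that appear in both top-50 lists, so it lies between 50 and 100.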
Question 4 – Visualisation on world map
Sully and colleagues have curated a public dataset containing characteristics linked to coral bleaching over the
last two decades. The data is in the file “Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv”,
and the authors curated coral bleaching events at 3351 locations in 81 countries from 1998 to 2017. The
column “Average bleaching” records the percentage of coral reefs worldwide that were bleached during the
sampling periods, while the column “ClimSST” quantifies the sea-surface temperature (SST) at various
locations.
• (a) Use ggplot to visualize the percentage of bleaching in coral reefs on a world map and look at
which areas of the world have the most severe coral bleaching.
• (b) The team of scientists believes that “coral bleaching is less common in localities with a
high variance of sea-surface temperature anomaly (SSTA) over time.” Use one or
two appropriate graphics together to demonstrate this point. Please explain your choice. Hint:
Determine which column of the data measures “sea-surface temperature anomaly (SSTA)”.
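One possible layout for the map in part (a) is sketched below, assuming the ggplot2 and maps packages are installed. The data frame reefs is simulated, and its column names (lon, lat, bleaching) are hypothetical placeholders; the real file's column names need to be checked before adapting this.

```r
library(ggplot2)  # map_data() also requires the 'maps' package to be installed

set.seed(3888)

# Simulated stand-in for the survey table; column names are hypothetical.
reefs <- data.frame(
  lon = runif(200, -180, 180),
  lat = runif(200, -30, 30),
  bleaching = runif(200, 0, 100)
)

# Country outlines as a data frame of polygon vertices.
world <- map_data("world")

p <- ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group),
               fill = "grey85", colour = "white") +
  geom_point(data = reefs, aes(x = lon, y = lat, colour = bleaching),
             alpha = 0.7, size = 1) +
  scale_colour_viridis_c(name = "Bleaching (%)") +
  coord_quickmap() +
  labs(x = "Longitude", y = "Latitude") +
  theme_minimal()
p
```

Drawing the land polygons first and the survey points on top keeps the points visible. With thousands of real locations, lowering alpha or the point size may help with overplotting.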