DATA3888 (2022): Assignment 1
Instructions

  1. Your assignment submission needs to be an HTML document that you have compiled using R Markdown.
    Name your file "SIDXXX_Assignment.Rmd", where XXX is your Student ID.
  2. Under author, put your Student ID at the top of the Rmd file (NOT your name).
  3. For your assignment, please use set.seed(3888). Do not upload the Rmd file (i.e. the code file).
  4. You must use code folding so that we can inspect your code where required.
  5. Your assignment should make sense and provide all the relevant information in the text when the code
    is hidden. Don’t rely on the marker to understand your code.
  6. Any output that you include needs to be explained in the text of the document. If your code chunk
    generates unnecessary output, you can suppress it by specifying results='hide' in the chunk options.
  7. Start each question in a separate section.
Question 1: Brain-box
A physics instructor, Louis, has created a data set stored under "Spiker_box_Louis.zip" that contains a series
of sequences of varying lengths. The file name encodes the eye movements. For example, the file
"LRL_L3.wav" corresponds to left-right-left eye movements; the file "LLRLRLRL_L.wav" corresponds to
left-left-right-left-right-left-right-left eye movements. There are a total of 31 files. Build a classification
rule for detecting {L, R} under streaming conditions, where the function takes a signal sequence as input.
• (i) Estimate the accuracy of your classifier. Is your value reasonable?
• (ii) Does the length of the sequence affect the classification accuracy?
Hint: (a) Consider what metric you will use to define "performance". You will need to explain your choice
and justify your answer.

    dir("data/Spiker_box_Louis/Short")
     [1] "LLL_L1.wav" "LLL_L2.wav" "LLL_L3.wav" "LLR_L1.wav" "LLR_L2.wav"
     [6] "LLR_L3.wav" "LRL_L1.wav" "LRL_L2.wav" "LRL_L3.wav" "LRR_L1.wav"
    [11] "LRR_L2.wav" "LRR_L3.wav" "RLL_L1.wav" "RLL_L2.wav" "RLL_L3.wav"
    [16] "RLR_L1.wav" "RLR_L2.wav" "RLR_L3.wav" "RRL_L1.wav" "RRL_L2.wav"
    [21] "RRL_L3.wav" "RRR_L1.wav" "RRR_L2.wav" "RRR_L3.wav"

    dir("data/Spiker_box_Louis/Medium")
    [1] "LLRLRLRL_L.wav" "LLRRLLLR_L.wav" "LLRRRLLL_L.wav" "LRRRLLRL_L.wav"
    [5] "RRRLRLLR_L.wav"

    dir("data/Spiker_box_Louis/Long")
    [1] "LLLRLLLRLRRLRRRLRLLL_L.wav" "RRLRRLRLRLLLLLLRRLRL_L.wav"
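
One possible starting point for Question 1 is sketched below. It reads the true labels from a file name and runs a simple window-by-window detector over a signal. Everything in it is an assumption rather than the expected solution: the helpers `labels_from_file`, `classify_stream` and `simulate_signal` are hypothetical, as are the window size, the threshold, and the rule that a left movement deflects upward first. The signal is simulated, not read from the .wav files. Edit distance (base R `adist`) is used as one candidate performance metric, because a streaming classifier can miss or insert events and so return a sequence whose length differs from the truth.

```r
# Labels are the characters before the first underscore in the file name.
labels_from_file <- function(path) {
  stem <- sub("_.*$", "", basename(path))
  strsplit(stem, "")[[1]]
}

# Hypothetical streaming rule: move a window along the signal; when a window
# first contains a sample above the threshold, call "L" if that first large
# deflection is positive and "R" otherwise (an assumed polarity).
classify_stream <- function(signal, window = 100, step = 50, thresh = 0.5) {
  calls <- character(0)
  in_event <- FALSE
  for (s in seq(1, length(signal) - window + 1, by = step)) {
    w <- signal[s:(s + window - 1)]
    big <- which(abs(w) > thresh)
    if (length(big) > 0 && !in_event) {
      calls <- c(calls, if (w[big[1]] > 0) "L" else "R")
    }
    in_event <- length(big) > 0
  }
  calls
}

# Simulated signal (NOT the real recordings): each movement is a square
# up-down or down-up bump surrounded by quiet baseline plus small noise.
simulate_signal <- function(labels, gap = 300, noise_sd = 0.02) {
  bump <- c(rep(1, 50), rep(-1, 50))
  pieces <- lapply(labels, function(l) {
    c(rep(0, gap), if (l == "L") bump else -bump, rep(0, gap))
  })
  x <- unlist(pieces)
  x + rnorm(length(x), sd = noise_sd)
}

truth <- labels_from_file("data/Spiker_box_Louis/Short/LRL_L3.wav")
set.seed(3888)
pred <- classify_stream(simulate_signal(truth))

# Edit distance between predicted and true sequences, scaled by true length.
edits <- drop(adist(paste(pred, collapse = ""), paste(truth, collapse = "")))
seq_accuracy <- 1 - edits / length(truth)
```

Repeating the last step per file and grouping the results by folder (Short, Medium, Long) would give one way to look at part (ii).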

Question 2: Prevalidated model
(from Week 3 lecture) The Kidney Transplant data from "GSE46474" contains the gene expression profiles
of 40 blood samples. Of those, 20 patients rejected their kidney and 20 had stable grafts and will be treated
as controls. Using this gene expression data, let's build a classification model incorporating two types of
data using the prevalidation principle. Here, we first build a molecular signature (set of features) from
the gene expression platform to obtain a single variable known as the prevalidated outcome. Next, we model this
prevalidated outcome in combination with the other clinical variables to build a classifier of the outcome
of interest.
• (a) Build a classifier using a support vector machine (SVM) to predict the outcome of graft survival
and generate a prevalidated outcome from the gene expression data.
• (b) Use it together with the clinical variables in a logistic regression to build a risk model. Describe
your final model for classifying graft survival in different individuals and your estimate of its
accuracy.
• (c) What is the final prediction based on your final model for a 70-year-old male whose transcriptomics
profile is predicted to have a favourable survival outcome?
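
The prevalidation steps in Question 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the expected answer. The data are simulated in place of GSE46474. The clinical columns `age` and `sex` are hypothetical names. A base-R nearest-centroid classifier is swapped in for the SVM so that the sketch runs without extra packages; with e1071 installed, `fit_predict` could instead wrap `e1071::svm`. "Favourable" in part (c) is taken here to mean a prevalidated call of "Stable".

```r
set.seed(3888)

# Simulated stand-in for GSE46474: 40 samples x 200 genes, 20 per group.
n <- 40; p <- 200
y <- factor(rep(c("Rejection", "Stable"), each = n / 2))
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("gene", seq_len(p))))
X[y == "Rejection", 1:10] <- X[y == "Rejection", 1:10] + 1  # planted signal

# Hypothetical clinical variables; the real column names may differ.
clin <- data.frame(age = round(runif(n, 20, 75)),
                   sex = factor(sample(c("F", "M"), n, replace = TRUE)))

# Stand-in classifier (nearest centroid), NOT an SVM. With e1071 one could use:
#   function(xtr, ytr, xte) predict(e1071::svm(xtr, ytr, kernel = "linear"), xte)
fit_predict <- function(xtr, ytr, xte) {
  d <- sapply(levels(ytr), function(l) {
    centre <- colMeans(xtr[ytr == l, , drop = FALSE])
    rowSums(sweep(xte, 2, centre)^2)          # squared distance to class centre
  })
  d <- matrix(d, nrow = nrow(xte))
  factor(levels(ytr)[max.col(-d, ties.method = "first")], levels = levels(ytr))
}

# Prevalidation: each sample's molecular prediction comes from a model
# that never saw that sample during training.
K <- 5
fold <- sample(rep(seq_len(K), length.out = n))
preval <- factor(rep(NA, n), levels = levels(y))
for (k in seq_len(K)) {
  te <- fold == k
  preval[te] <- fit_predict(X[!te, , drop = FALSE], y[!te], X[te, , drop = FALSE])
}

# Risk model: prevalidated outcome plus clinical variables.
# With a factor response, glm models the probability of the second level ("Stable").
dat <- data.frame(outcome = y, preval = preval, clin)
risk <- glm(outcome ~ preval + age + sex, data = dat, family = binomial)

# Part (c) style query: 70-year-old male with a favourable prevalidated call.
new_patient <- data.frame(preval = factor("Stable", levels = levels(y)),
                          age = 70,
                          sex = factor("M", levels = levels(clin$sex)))
p_stable <- predict(risk, newdata = new_patient, type = "response")
```

Because the prevalidated variable is built only from held-out predictions, it can be compared with the clinical variables in the logistic regression on a more equal footing than an in-sample molecular score.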
Question 3 – Blood vs Biopsy Biomarker
In the data GSE46474, we estimated the accuracy of our predictive model for graft rejection from a peripheral
blood gene expression dataset. However, rejection is a very active process that occurs in the kidney itself.
Here we will look at a similar kidney microarray dataset. Instead of genes being isolated and
sequenced from blood, we examine another dataset, GSE138043, where the samples have been sequenced from
a kidney biopsy. Select the top 50 most variable genes in each of the datasets GSE138043 and GSE46474
and use the selected genes to build a classifier using randomForest to predict the outcome of graft survival.
Visualize your results. We have broken this task into the following four subsections.
• (a) Select the top 50 most variable genes in each of the datasets GSE138043 and GSE46474. Combine
the two sets of genes. How many genes are in the union of the two lists?
• (b) Build two classifiers using randomForest to predict the outcome of graft survival using the genes
selected in part (a).
• (c) Perform repeated 5-fold cross-validation for each dataset and calculate the accuracy. What is
the average accuracy for the blood vs biopsy biomarker models?
• (d) Select an appropriate graphic to communicate the difference between the two classification
accuracies.
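
The gene selection in part (a) can be sketched in base R as below. This is an illustration on simulated matrices, not the real GEO data. It assumes genes are in rows and samples in columns, and that both datasets share the same gene identifiers (something to check for the real platforms). `sim_expr` and `top_variable` are hypothetical helpers, and the biopsy sample count is made up.

```r
set.seed(3888)

# Simulate a genes x samples matrix where each gene has its own spread.
sim_expr <- function(genes, n_samples) {
  gene_sd <- runif(length(genes), 0.2, 2)
  # The matrix is filled column by column, so repeating gene_sd once per
  # sample gives every gene (row) a consistent standard deviation.
  values <- rnorm(length(genes) * n_samples, sd = rep(gene_sd, n_samples))
  matrix(values, nrow = length(genes), dimnames = list(genes, NULL))
}

genes  <- paste0("gene", 1:500)
blood  <- sim_expr(genes, 40)  # stand-in for GSE46474
biopsy <- sim_expr(genes, 30)  # stand-in for GSE138043 (sample count assumed)

# Rank genes by their variance across samples and keep the top k names.
top_variable <- function(expr, k = 50) {
  v <- apply(expr, 1, var)
  names(sort(v, decreasing = TRUE))[seq_len(k)]
}

top_blood  <- top_variable(blood)
top_biopsy <- top_variable(biopsy)

# Union of the two lists: 50 if they coincide, 100 if they are disjoint.
selected <- union(top_blood, top_biopsy)
n_selected <- length(selected)
```

The rows `blood[selected, ]` and `biopsy[selected, ]`, transposed so that samples are rows, would then be the feature matrices passed to the two classifiers in part (b).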
Question 4 – Visualisation on world map
Sully and colleagues have curated a public dataset containing characteristics linked to coral bleaching over the
last two decades. The data is in the file "Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv",
and the authors curated coral bleaching events at 3351 locations in 81 countries from 1998 to 2017. The
column "Average bleaching" records the percentage of coral reefs worldwide that were bleached during the
sampling periods, while the column "ClimSST" quantifies the sea-surface temperature (SST) at various
locations.
• (a) Use ggplot to visualize the percentage of bleaching in coral reefs on a world map and look at
which areas of the world have the most severe coral bleaching.
• (b) The team of scientists believes that "coral bleaching is less common in localities with a
high variance of sea-surface temperature anomaly (SSTA) over time." Use one or
two appropriate graphics together to demonstrate this point. Please explain your choice. Hint:
Determine which column of the data measures "sea-surface temperature anomaly (SSTA)".
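
For part (a), a world-map layer plus a point layer is one common ggplot2 pattern, sketched below. It assumes the ggplot2 and maps packages are installed (`map_data("world")` relies on maps). The `reef` data frame is simulated, and its column names `lon`, `lat` and `bleaching` are hypothetical stand-ins for whichever columns in the CSV hold the coordinates and the bleaching percentage.

```r
library(ggplot2)

set.seed(3888)
# Simulated stand-in for the Reef Check data; column names are hypothetical.
reef <- data.frame(lon = runif(200, -180, 180),
                   lat = runif(200, -30, 30),
                   bleaching = pmin(rexp(200, rate = 1 / 10), 100))

# Country outlines as a data frame of polygon vertices (needs the maps package).
world <- map_data("world")

p <- ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group),
               fill = "grey85", colour = "white") +
  geom_point(data = reef, aes(x = lon, y = lat, colour = bleaching),
             alpha = 0.7, size = 1.5) +
  scale_colour_viridis_c(name = "Bleaching (%)") +
  coord_quickmap() +
  theme_minimal()
```

Printing `p` draws the map. Since many sites report little or no bleaching, sorting the data so that high-bleaching sites are drawn last can help them stay visible.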