On algorithms: BINF90001 - Machine Learning

BINF90001 – Semester 1, 2022 – Assignment 1
Due date: 17:00, Thursday 14 April 2022
Instructions
This assignment contains 5 questions worth a total of 50 points. It will contribute 20% to your
assessment for this subject.
Submit your assignment as a PDF file via the LMS. Please refer to the submission instructions
on the LMS for more information.
Late assignments will only be accepted under exceptional circumstances. Usually a medical
certificate will be required. A late penalty may be imposed.
Your submission should clearly show your name and student ID number.
Provide tables, graphs, R code and concise text explanations to support your answers. Graphs
must be clear and well-labelled, including informative axis labels and titles. All tables, graphs
and code must be accompanied by explanation and interpretation. Satisfactory presentation
forms part of the marking criteria: points will be deducted for excessive and/or poorly organised
work.
Present all material related to each answer (such as R code and plots) together: do not provide
any appendix or supplementary information at the end. Use of R Markdown is highly
recommended.
Do not discuss solutions to these problems on the online discussion forum. However, you can
use the forum to seek clarifications of the questions.
Data
In this assignment you will analyse some SNP genotype and phenotype data. The data files are
available from the Assignments page on the LMS. There are data from three studies:
Study 1. 500 individuals each genotyped at 200 SNPs (genotypes1.csv) and their body mass index
(BMI) recorded in kg/m² (phenotype1.csv).
Study 2. 1,500 individuals (not including any of the 500 above) each genotyped at the same 200 SNPs
(genotypes2.csv) and their overweight status recorded as either true or false (phenotype2.csv).
For this study, a person is defined as being overweight if their BMI is greater than 25 kg/m².
Study 3. 100 individuals each genotyped at the same 200 SNPs (genotypes3.csv) and no phenotypes
recorded.
Questions

1. (11 points)
(a) Read all of the data into R as data frames (one data frame per file). Check that the study
sizes reported above are correct. List the first 5 phenotypes in both Study 1 and Study 2.
(b) For each SNP in Study 1, fit a simple linear regression model for BMI against the SNP
genotype (i.e. 200 separate models each with a single predictor) and record the p-value from
testing the null hypothesis of no association between the SNP and BMI. Consider only the
additive model (1 parameter) for each SNP, rather than the general model (2 parameters).
(c) Draw a Manhattan plot to visualise all of the p-values from these tests on a log10 scale.
Briefly describe what you conclude from this plot.
(d) Which SNP has the smallest p-value? What is the p-value?
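
As an illustration of the per-SNP scan described in this question, the following is a minimal sketch on simulated data, not on the assignment files. The genotype matrix, the SNP names (`snp1`, `snp2`, ...) and the single causal SNP are assumptions made up for the example; the actual CSV layout may differ.

```r
# Simulated stand-in for Study 1; the real CSV layout is not shown in this sheet.
set.seed(1)
n <- 500   # individuals
m <- 200   # SNPs
geno <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n,
               dimnames = list(NULL, paste0("snp", seq_len(m))))
# Hypothetical truth for the simulation: only snp10 affects BMI.
bmi <- 25 + 1.5 * geno[, "snp10"] + rnorm(n, sd = 3)

# Additive model: the 0/1/2 allele count enters as one numeric predictor,
# so each fit has a single slope parameter per SNP.
pvals <- apply(geno, 2, function(g) {
  summary(lm(bmi ~ g))$coefficients["g", "Pr(>|t|)"]
})

# Manhattan-style plot: SNP index against -log10(p-value).
plot(seq_along(pvals), -log10(pvals), pch = 20,
     xlab = "SNP index", ylab = "-log10(p-value)",
     main = "Simulated association of SNPs with BMI")

# SNP with the smallest p-value, and that p-value.
best <- which.min(pvals)
names(pvals)[best]
pvals[best]
```

Treating the genotype as numeric gives the 1-parameter additive model; wrapping it in `factor()` would instead fit the 2-parameter general model.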
2. (18 points)
(a) You decide to combine Studies 1 and 2. This requires making the phenotypes
equivalent. Convert the phenotype from Study 1 to match that of Study 2, and
then combine the two studies by creating a single data frame for the phenotype and one
for the genotypes.
(b) For the combined data, test each SNP for association with overweight status and record
the 200 p-values. Again, consider only an additive genetic model.
(c) Draw a Manhattan plot to visualise the p-values from these tests on a log10 scale. Briefly
describe what you conclude from this plot.
(d) Which SNP has the smallest p-value? What is the p-value?
(e) i. Report the number of SNPs that are significant using the Bonferroni method to control
the family-wise error rate at 5% across the 200 tests.
ii. Report the number of SNPs that are significant using the Benjamini & Hochberg
method to control the false discovery rate (FDR) at 5% across the 200 tests.
iii. Using the Storey method with λ = 0.1, what is the expected number of null SNPs that
are significant at level α = 0.001? How many SNPs are observed to be significant at
this level? What is the resulting FDR estimate?
(f) Describe how you would report the number of significantly associated SNPs from the
association analysis in the combined study. How would you decide which SNPs to report as
significant and how would you summarise the possibility of error?
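
The three multiple-testing calculations in part (e) can be sketched as follows, assuming a vector of 200 p-values is already available. The p-values below are simulated purely for illustration. The Storey step uses the usual estimator π0 = #{p > λ} / (m(1 − λ)), capped at 1; the cap is an assumption that may differ from the convention used in lectures.

```r
set.seed(2)
m <- 200
# Hypothetical p-values: 190 null SNPs (uniform) and 10 strongly associated ones.
pvals <- c(runif(190), runif(10, min = 0, max = 1e-5))

# Bonferroni: controls the family-wise error rate at 5%.
n_bonf <- sum(p.adjust(pvals, method = "bonferroni") <= 0.05)

# Benjamini & Hochberg: controls the false discovery rate at 5%.
n_bh <- sum(p.adjust(pvals, method = "BH") <= 0.05)

# Storey: estimate the proportion of null SNPs from the p-values above lambda.
lambda <- 0.1
alpha <- 0.001
pi0_hat <- min(1, sum(pvals > lambda) / (m * (1 - lambda)))

expected_null <- pi0_hat * m * alpha          # expected null SNPs with p <= alpha
observed <- sum(pvals <= alpha)               # SNPs observed with p <= alpha
fdr_hat <- expected_null / max(observed, 1)   # estimated FDR at this threshold

c(bonferroni = n_bonf, BH = n_bh, pi0 = pi0_hat,
  expected_null = expected_null, observed = observed, FDR = fdr_hat)
```

The `max(observed, 1)` guard only avoids division by zero when no SNP passes the threshold.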
3. (7 points)
(a) Identify the SNPs with the 8 smallest p-values from the previous question, and report their
p-values.
(b) Some of the significant SNPs may be in high linkage disequilibrium (LD). To investigate
this, calculate the 8 × 8 correlation matrix between these 8 SNPs. What do you conclude
about the results from the association analysis?
(c) Defining 'high LD' to be a squared correlation coefficient > 0.5, identify the set of SNPs
among the top 8 with the lowest p-values subject to no two SNPs in the set being in high
LD with each other. (This process is called clumping and the resulting SNPs are called tag
SNPs.)
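
One way to read the clumping step in part (c) is as a greedy pass through the SNPs in order of increasing p-value, keeping a SNP only if it is not in high LD with any SNP already kept. The sketch below assumes that reading; the genotypes, SNP names and p-values are all hypothetical, and five SNPs are used instead of eight to keep it short.

```r
set.seed(3)
n <- 2000
base1 <- rbinom(n, 2, 0.3)
base2 <- rbinom(n, 2, 0.4)

# Helper (hypothetical): copy a SNP but redraw 5% of its genotypes,
# which gives a SNP in high LD with the original.
noisy_copy <- function(g, frac = 0.05) {
  idx <- sample(length(g), round(frac * length(g)))
  g[idx] <- rbinom(length(idx), 2, 0.3)
  g
}

top <- cbind(snpA = base1, snpB = noisy_copy(base1),
             snpC = base2, snpD = noisy_copy(base2),
             snpE = rbinom(n, 2, 0.2))
# Made-up p-values for illustration only.
pv <- c(snpA = 1e-12, snpB = 1e-10, snpC = 1e-8, snpD = 1e-7, snpE = 1e-5)

r2 <- cor(top)^2   # squared correlation as the LD measure

# Greedy clumping: visit SNPs from smallest to largest p-value and keep a
# SNP only if its r^2 with every SNP kept so far is at most 0.5.
tags <- character(0)
for (s in names(sort(pv))) {
  if (all(r2[s, tags] <= 0.5)) tags <- c(tags, s)
}
tags
```

In this simulation `snpB` and `snpD` are dropped because each is a near copy of a SNP with a smaller p-value.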
4. (8 points)
(a) Fit a logistic regression model that includes all of the tag SNPs as predictors (additive effects
only, like the previous models). Report the parameter estimates and standard errors.
(b) For each SNP, report the effect size as an odds ratio (OR) and explain how to interpret it.
(c) Report 95% confidence intervals for each OR.
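
The relationship between the logistic regression coefficients, odds ratios and their confidence intervals can be sketched as below. The data are simulated and the predictor names `tagA` and `tagB` are hypothetical. The intervals are Wald intervals from `confint.default()`, which is one reasonable choice; profile-likelihood intervals are another.

```r
set.seed(4)
n <- 2000
tagA <- rbinom(n, 2, 0.3)
tagB <- rbinom(n, 2, 0.4)
# Hypothetical true log-odds model used only to simulate the outcome (0/1).
overweight <- rbinom(n, 1, plogis(-0.5 + 0.4 * tagA - 0.3 * tagB))

# Joint additive model: each SNP enters as a numeric 0/1/2 allele count.
fit <- glm(overweight ~ tagA + tagB, family = binomial)

# Parameter estimates and standard errors on the log-odds scale.
est <- summary(fit)$coefficients[, c("Estimate", "Std. Error")]

# Odds ratio per additional copy of the counted allele.
or <- exp(coef(fit))

# 95% Wald confidence intervals, exponentiated onto the OR scale.
ci <- exp(confint.default(fit, level = 0.95))

cbind(est, OR = or, ci)
```

An OR of, say, 1.5 for a SNP means that each additional copy of the counted allele multiplies the odds of being overweight by 1.5, holding the other SNPs in the model fixed. The intercept row is the baseline odds, not an odds ratio.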
5. (6 points)
(a) Using your fitted model from question 4, compute the risk (i.e. probability of being
overweight) for all individuals in Study 3.
(b) Plot the risks with the individuals sorted in order of increasing risk.
(c) What is the risk for indiv2001?
(d) Lifestyle factors also have an impact on the risk of being overweight, not just genetic
factors. Positive lifestyle (such as a good diet and regular exercise) can reduce the risk.
For simplicity, suppose that lifestyle and genetic factors act independently. For indiv2001,
how strong would the lifestyle factors collectively need to be (expressed as an odds ratio)
in order to counteract the genetic risk (i.e. to make the overall risk be equivalent to the
lowest predicted risk based on the SNP genotypes)?
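
The prediction and the lifestyle calculation in part (d) can be sketched as follows, under two assumptions: that "act independently" means the lifestyle odds ratio multiplies the genetic odds, and that the Study 3 individuals are labelled `indiv2001` to `indiv2100` (the labels here are invented for the simulation). The model and data are simulated, not the assignment's.

```r
set.seed(5)
n <- 2000
train <- data.frame(tagA = rbinom(n, 2, 0.3), tagB = rbinom(n, 2, 0.4))
train$overweight <- rbinom(n, 1,
                           plogis(-0.5 + 0.4 * train$tagA - 0.3 * train$tagB))
fit <- glm(overweight ~ tagA + tagB, family = binomial, data = train)

# Hypothetical Study 3: genotypes only, row names assumed for illustration.
study3 <- data.frame(tagA = rbinom(100, 2, 0.3), tagB = rbinom(100, 2, 0.4),
                     row.names = paste0("indiv", 2001:2100))

# Predicted probability of being overweight for each individual.
risk <- predict(fit, newdata = study3, type = "response")

plot(sort(risk), pch = 20,
     xlab = "Individuals (sorted by risk)", ylab = "Predicted risk",
     main = "Simulated predicted risk of being overweight")

# Lifestyle OR needed to bring indiv2001 down to the lowest predicted risk:
# ratio of the target odds to the individual's current odds.
odds <- function(p) p / (1 - p)
p_i <- unname(risk["indiv2001"])
p_min <- min(risk)
lifestyle_or <- odds(p_min) / odds(p_i)
lifestyle_or
```

A value below 1 indicates a protective lifestyle effect; a value of exactly 1 would mean the individual already has the lowest predicted risk.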