School of Mathematics and Statistics
University of Sheffield
MAS6006 Statistical Consultancy, 2018/19
Project 1: Emulators in Computer Modelling
Background
You are a statistician working at an engineering consultancy firm. Your company regularly
uses computer modelling in its work, but often has problems with the computational
expense involved in running the models. A single run of a computer model at one
choice of input values can take hours of CPU time. This causes problems whenever it is
necessary to run a model at a large number of different input values, for example, when
searching for an optimal input value to optimise the model output, or in assessing the
sensitivity of the model prediction to uncertainty in the choice of input values. Simply
investing in more computing resources isn’t thought to be the solution: more computing
power will instead be used to improve the accuracy of the models.
The models are deterministic, and so if run twice at the same input value, they
will produce the same output value. Currently, an approach used within the company
is to run a model at as many inputs as it can, and then use multi-dimensional linear
interpolation to predict (instantly) the model output at any desired new input value.
This method requires evaluations of the model over a regular grid of input values. For
example, if a model has three inputs, each scaled to take values between 0 and 1, one
could choose four evenly spaced values 0, 1/3, 2/3, 1, then run the model 4 × 4 × 4 times, at
each possible combination of these values for the three inputs. This can still be costly
in terms of how many times the model must be run: if the model has d inputs, and the
model is to be run over a regular grid with n values per input, this requires n^d
model runs in total.
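As a rough illustration of how quickly this cost grows, the sketch below builds the regular grid from the example above and counts the runs it needs. It uses base R only; the helper name regular_grid is hypothetical and is not part of any code used by the company.

```r
# Hypothetical helper: every combination of n evenly spaced values in [0, 1]
# for each of d inputs (one row per model run).
regular_grid <- function(n, d) {
  levels <- seq(0, 1, length.out = n)
  expand.grid(rep(list(levels), d))
}

# The example from the text: 3 inputs, 4 values per input.
grid <- regular_grid(n = 4, d = 3)
nrow(grid)          # 64 runs, i.e. 4 * 4 * 4

# The number of runs is n^d, so it grows very quickly with the number of inputs.
c(4^3, 3^8, 4^8)    # 64, 6561, 65536
```

Even three values per input becomes expensive for a model with eight inputs, which is the situation in the data set described below.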
Your line manager has come across another technique that she thinks may work
better: “Gaussian process emulation”, and she wants you to investigate. She has found
the method described in the following paper:
O’Hagan, A. (2006). Bayesian analysis of computer code outputs: a tutorial.
Reliability Engineering and System Safety, 91, 1290-1300.
She has provided you with a data set to evaluate this method, which includes some
predictions obtained with the linear interpolation approach, and she has asked for a
short report that describes your findings.
The data
A data set from a computer model has been provided. The computer model takes a
vector of 8 inputs, each continuous over the interval [0,1], and returns a scalar output.
The model is deterministic: if run at the same input value twice, it will return the same
output value; there is no noise in the data. There are two files.
- training.csv contains 100 runs for fitting the emulator. Each row is one run of
the model. The first 8 columns give the value of the model input, and the 9th
column gives the model output.
- test.csv contains 100 runs for testing the emulator. Each row is one run of the
model. The first 8 columns give the value of the model input, and the 9th column
gives the model output. The 10th column gives an estimate of the model output,
obtained using the linear interpolation method. The interpolation method used a
regular grid of 3^8 = 6561 model runs (so a different training data set to that in
training.csv).
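To make the column layout concrete, the sketch below extracts the true output (9th column) and the interpolation estimate (10th column) from rows laid out like test.csv and summarises their disagreement with a root-mean-square error. The rows are made-up numbers rather than values from the real file, the function name rmse is hypothetical, and root-mean-square error is only one possible summary, chosen here as an assumption rather than a requirement of the brief.

```r
# Hypothetical helper: root-mean-square error between predictions and true values.
rmse <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  sqrt(mean((predicted - actual)^2))
}

# Made-up rows in the test.csv layout:
# columns 1-8 are inputs, column 9 is the model output,
# column 10 is the linear interpolation estimate.
test_rows <- rbind(
  c(rep(0.1, 8), 1.00, 1.30),
  c(rep(0.5, 8), 2.00, 1.60)
)

actual       <- test_rows[, 9]
interpolated <- test_rows[, 10]
rmse(interpolated, actual)
```

The same function could be applied to emulator predictions at the test inputs, giving a like-for-like comparison with the interpolation column. The real file would first need to be read into R, for example with read.csv(); the brief does not say whether the files have a header row, so that should be checked.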
Using the emulator method
You should start with the article your line manager has suggested, but you can make
use of any other literature you wish. You will need to find an R package to implement
the method. There are various different names used, so you may wish to try several
searches: “Gaussian process emulator”, “Gaussian process meta-model”, “Gaussian
process regression”. Use one package only: comparing different packages is outside
the scope of the project. (Gaussian process methods have been implemented in other
languages, but you are required to use R for this project.)
The report
The maximum report length is 6 pages, excluding references. You do not need to
include any R code in your report.
Your report must be structured according to the guidelines in Chapter 5 of the
module handbook (you should also follow carefully the advice and instructions in
Chapters 4, 6, 7, and 8).
Do not provide a detailed, technical account of Gaussian process emulators in your
report. You should, however, include an overview of how the method works. Your
target readers are not statisticians, but do have degree-level mathematics.
Your line manager is not interested in an analysis of the computer model itself:
do not write about the relationship between the model inputs and output. She
is interested in the performance of the emulator compared with the linear interpolation
method, and wants to know about the advantages and disadvantages of
emulators, including the circumstances in which the company might use them in other projects.
For reference, the multivariate linear interpolation method was implemented in
MATLAB, using the function interpn(). You do not need to write about this in
your report.
Submit your report on MOLE in the usual way. Separately, you should email your
R code (script, knitr or RMarkdown file) to j.oakley@sheffield.ac.uk.
Asking for help
In addition to your line manager, you also have a mentor within the company, who is a
more senior statistician. Your mentor has not used Gaussian process emulators before,
but may be able to advise you if you need help with the technical aspects of the project.
Of course, like all your other colleagues, your mentor is busy, and may not have the time
if you ask too much of him!
All questions should be posted on MOLE, on the Project 1 discussion board. Please
do not ask for help by email. This discussion board will be moderated, so your
message will not appear on the board until we have approved it (so you can ask anything
you like). State in any message who your question is addressed to:
- your line manager, for questions about the project brief;
- your mentor, for technical questions;
- the module leader (Keith), for administrative questions.
Otherwise, the project must be entirely your own work. Do not ask anyone else for
help, or show your work to anyone else.