About machine learning: ECMM422-Machine-Learning


ECMM422 Machine Learning
Course Assessment 1
This course assessment (CA1) represents 40% of the overall module assessment.
This is an individual exercise and your attention is drawn to the College and University
guidelines on collaboration and plagiarism, which are available from the College website.
Note:
- do not change the name of this notebook, i.e. the notebook file has to be named: ca1.ipynb
- do not remove/delete any cell
- do not add any cell (you can work on a draft notebook and only copy the function implementations here)
- do not add your name or student code in the notebook or in the file name
Evaluation criteria:
Each question asks for one or more functions to be implemented.
Each question is awarded a number of marks.
A (hidden) unit test is going to evaluate if all desired properties of the required function(s) are
met.
If the test passes all the associated marks are awarded, if it fails 0 marks are awarded. The large
number of questions allows a fine grading.
Notes:
In the rest of the notebook, the term data matrix refers to a two dimensional numpy array
where instances are encoded as rows, e.g. a data matrix with 100 rows and 4 columns is to be
interpreted as a collection of 100 instances each with four features.
When a required function can be implemented directly by a library function it is intended that
the candidate should write her own implementation of the function, e.g. a function to compute
the accuracy or the cross validation.
Some questions are just a check-point, i.e. it is for you to see that you are correctly
implementing all functions. Since those check-points use functions that you have already
implemented and that have already been marked, those questions are not going to be marked
(i.e. they appear as having marks 0).
In []: %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
2021/3/2 1
localhost:8888/nbconvert/html/Downloads/1.ipynb?download=false 2/16
Question 1 [marks 6]
a) Make a function data_matrix = make_data_classification(mean, std,
n_centres, inner_std, n_samples, random_seed=42) to create a data matrix
according to the following rules:
- mean is an n-dimensional vector (say [1,1], but the function should allow vectors of any dimension)
- n_centres is the number of centres (say 3)
- std is the standard deviation (say 1)
- the centres are sampled from a Normal distribution with mean mean and standard deviation std
- from each centre sample n_samples from a Normal distribution with the centre as the mean and standard deviation inner_std, so if mean=[1,1], n_centres=3 and n_samples=10 then the data matrix will be a 30 rows x 2 columns numpy array.
b) Make a function data_matrix, targets = make_data_regression(mean, std,
n_centres, inner_std, n_samples_list, random_seed=42) to create a data matrix
and a target vector according to the following rules:
- the data matrix is constructed in the same way as in make_data_classification
- the targets are the Euclidean distance between the sample and the centre of the generating Normal distribution
See Question 3 for a graphical example of the expected output.
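The sampling scheme of Question 1 can be sketched as below. This is a hedged illustration, not the reference solution: the `_sketch` names, the helper `sample_points_and_centres`, and the use of `np.random.default_rng` are assumptions, and the hidden unit tests may rely on a different random number generator, so the exact values need not match.

```python
import numpy as np

def sample_points_and_centres(mean, std, n_centres, inner_std, n_samples, random_seed=42):
    """Hypothetical helper: return the sampled points and, row by row, their generating centre."""
    rng = np.random.default_rng(random_seed)  # assumption: any seeded generator is acceptable
    mean = np.asarray(mean, dtype=float)
    dim = mean.shape[0]
    # One centre per row, drawn from a Normal distribution around `mean`.
    centres = rng.normal(loc=mean, scale=std, size=(n_centres, dim))
    # n_samples points per centre, drawn from a Normal distribution around that centre.
    blocks = [rng.normal(loc=c, scale=inner_std, size=(n_samples, dim)) for c in centres]
    points = np.vstack(blocks)
    point_centres = np.repeat(centres, n_samples, axis=0)  # same row order as `points`
    return points, point_centres

def make_data_classification_sketch(mean, std, n_centres, inner_std, n_samples, random_seed=42):
    points, _ = sample_points_and_centres(mean, std, n_centres, inner_std, n_samples, random_seed)
    return points

def make_data_regression_sketch(mean, std, n_centres, inner_std, n_samples, random_seed=42):
    points, point_centres = sample_points_and_centres(mean, std, n_centres, inner_std, n_samples, random_seed)
    # Target: Euclidean distance between each point and the centre that generated it.
    targets = np.linalg.norm(points - point_centres, axis=1)
    return points, targets

data_matrix = make_data_classification_sketch([1, 1], std=1, n_centres=3, inner_std=0.5, n_samples=10)
reg_matrix, reg_targets = make_data_regression_sketch([1, 1], std=1, n_centres=3, inner_std=0.5, n_samples=10)
print(data_matrix.shape, reg_targets.shape)  # (30, 2) (30,)
```

Sharing one helper keeps the two functions consistent, so that the regression data matrix is built exactly like the classification one.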
Question 2 [marks 2]
import scipy as sp

# unit test utilities: you can ignore these functions

def is_approximately_equal(test, target, eps=1e-2):
    return np.mean(np.fabs(np.array(test) - np.array(target))) < eps
def assert_test_equality(test, target):
    assert is_approximately_equal(test, target), 'Expected:\n %s \nbut got:\n %s' % (target, test)
In []:
def make_data_classification(mean, std, n_centres, inner_std, n_samples, random_seed=42):
    # YOUR CODE HERE
    raise NotImplementedError()
def make_data_regression(mean, std, n_centres, inner_std, n_samples, random_seed=42):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

a) Make a function data_matrix, targets =
get_dataset_classification(n_samples, std, inner_std) to create a data matrix
and a target vector for a binary classification problem according to the following rules:
the instances from the positive class are generated according to the same rules provided
for make_data_classification ; so are the instances from the negative class
instances from the positive class have as mean the vector [10,10] and those from the
negative class, vector [-10,-10]
the number of centres is fixed to 3
the random seed is fixed to 42
n_samples indicates the total number of instances finally available in the output
data_matrix
b) Make a function data_matrix, targets = get_dataset_regression(n_samples,
std, inner_std) to create a data matrix according to the following rules:
the instances are generated according to the same rules provided for
make_data_regression
the targets are generated according to the same rules provided for
make_data_regression
instances have as mean the vector [10,10]
the number of centres is fixed to 3
the random seed is fixed to 42
n_samples indicates the total number of instances finally available in the output
data_matrix
Question 3 [marks 1]
Make a function plot(X,y) to display the scatter plot of a data matrix of two dimensional
instances using the array y to assign the colour to the instances.
When running
X, y = get_dataset_regression(n_samples=600, std=30, inner_std=5)
plot(X,y)
you should get something like
In []:
def get_dataset_classification(n_samples, std, inner_std):
    # YOUR CODE HERE
    raise NotImplementedError()
def get_dataset_regression(n_samples, std, inner_std):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

and when running
X, y = get_dataset_classification(n_samples=600, std=30, inner_std=5)
plot(X,y)
you should get something like
Question 4 [marks 1]
Make a function classification_error(targets, preds) to compute the fraction of
times that the entries in targets do not agree with the corresponding entries in preds .
Note: do not use library functions to compute the result directly but implement your own
version.
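A minimal sketch of the classification error, assuming targets and predictions are sequences of the same length (the `_sketch` name is hypothetical, and this is an illustration rather than the expected solution):

```python
import numpy as np

def classification_error_sketch(targets, preds):
    targets = np.asarray(targets)
    preds = np.asarray(preds)
    # Count the positions where prediction and target differ, then divide by the total.
    n_wrong = np.sum(targets != preds)
    return n_wrong / len(targets)

err = classification_error_sketch([1, -1, 1, 1], [1, 1, 1, -1])
print(err)  # 0.5 (two of the four entries disagree)
```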
Question 5 [marks 2]
Make a function regression_error(targets, preds) to compute the mean squared error
between targets and preds .
Note: do not use library functions to compute the result directly but implement your own
version.
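The mean squared error averages the squared differences between each target and its prediction. A minimal sketch under that definition (the `_sketch` name is hypothetical):

```python
import numpy as np

def regression_error_sketch(targets, preds):
    diff = np.asarray(targets, dtype=float) - np.asarray(preds, dtype=float)
    # Sum of squared differences divided by the number of instances.
    return np.sum(diff ** 2) / diff.size

mse_value = regression_error_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(mse_value)  # (0 + 0 + 4) / 3
```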
Question 6 [marks 7]
Make a function make_bootstrap(data_matrix, targets) to extract a bootstrapped
replicate of an input dataset.
In []:
def plot(X,y):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def classification_error(targets, preds):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (T_i - P_i)^2$$
In []:
def regression_error(targets, preds):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

The function should return the following 6 elements (in this order):
bootstrap_data_matrix, bootstrap_targets, bootstrap_sample_ids,
oob_data_matrix, oob_targets, oob_samples_ids , where:
- bootstrap_data_matrix : a data matrix encoding the bootstrapped replicate of the data matrix
- bootstrap_targets : the corresponding bootstrapped replicate of the target vector
- bootstrap_sample_ids : an array containing the instance indices of the bootstrapped replicate of the data matrix
- oob_data_matrix : a data matrix encoding the out of bag instances
- oob_targets : the corresponding out of bag instances of the target vector
- oob_samples_ids : an array containing the instance indices of the out of bag instances
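The six return values can be illustrated with the sketch below. It assumes the usual bootstrap convention (draw as many rows as the dataset has, with replacement; the out of bag instances are the rows never drawn). The `random_seed` argument and the `_sketch` name are hypothetical additions, not part of the required signature.

```python
import numpy as np

def make_bootstrap_sketch(data_matrix, targets, random_seed=None):
    data_matrix = np.asarray(data_matrix)
    targets = np.asarray(targets)
    n = data_matrix.shape[0]
    rng = np.random.default_rng(random_seed)  # `random_seed` is a hypothetical extra argument
    # Draw n row indices with replacement: this defines the bootstrap replicate.
    bootstrap_sample_ids = rng.integers(0, n, size=n)
    # Out of bag instances are the rows that were never drawn.
    oob_samples_ids = np.setdiff1d(np.arange(n), bootstrap_sample_ids)
    return (data_matrix[bootstrap_sample_ids], targets[bootstrap_sample_ids], bootstrap_sample_ids,
            data_matrix[oob_samples_ids], targets[oob_samples_ids], oob_samples_ids)

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
b_X, b_y, b_ids, oob_X, oob_y, oob_ids = make_bootstrap_sketch(X_demo, y_demo, random_seed=0)
print(b_X.shape, oob_X.shape[1])
```

By construction, the bootstrap indices and the out of bag indices never overlap and together cover every row of the input.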
Question 7 [marks 10]
Consider the following functional blueprints estimator = train(X_train, y_train,
param) and test(X_test, estimator) . A function of type train takes as input a data
matrix X_train , a target vector y_train and a single value param (not a list of
parameters). A function of type train outputs an object that represents an estimator. A
function of type test takes as input a data matrix X_test and the fitted object estimator , and
outputs the predicted targets.
Using this blueprint, write the specialised train and test functions for the following classifiers
and regressors (use the function signature provided in the next cell, e.g. train_ab for training
an adaboost classifier):
Classifiers:
a) k-nearest-neighbor: the parameter controls the number of neighbors (you may use
KNeighborsClassifier from scikit) [train_knn, test_knn]
b) adaboost: the parameter controls the maximal depth of the decision tree used as the weak
classifier (you may use the DecisionTreeClassifier from scikit but you should provide your
own implementation of the boosting algorithm) [train_ab, test_ab]
c) random forest: the parameter controls the maximal depth of the tree (you may use the
DecisionTreeClassifier from scikit but you should provide your own implementation of
the bagging algorithm) [train_rfc, test_rfc]
Regressors:
In []:
def make_bootstrap(data_matrix, targets):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

d) decision tree: the parameter controls the maximal depth of the tree (you may use the
DecisionTreeRegressor from scikit) [train_dt, test_dt]
e) svm linear: the parameter controls the regularization constant C (you may use SVR from
scikit) [train_svm_1, test_svm]
f) svm with a polynomial kernel of degree 2: the parameter controls the regularization
constant C (you may use SVR from scikit) [train_svm_2, test_svm]
g) svm with a polynomial kernel of degree 3: the parameter controls the regularization
constant C (you may use SVR from scikit) [train_svm_3, test_svm]
h) random forest: the parameter controls the maximal depth of the tree (you may use the
DecisionTreeRegressor from scikit but you should provide your own implementation of
the bagging algorithm) [train_rf, test_rf]
For the algorithms adaboost and random forest , the size of the ensemble should be fixed
to 100.
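The train/test blueprint combined with a hand-written bagging loop might look like the sketch below for the regression case. It is only an illustration under stated assumptions: it implements plain bagging (bootstrap replicates plus averaging) without per-split feature subsampling, the function names and the `random_seed` argument are hypothetical, and `predict_bagged_trees_sketch` plays the role that a function of type test has in the blueprint.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

N_ESTIMATORS = 100  # ensemble size fixed by the assignment text

def train_bagged_trees_sketch(X_train, y_train, param, random_seed=0):
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    rng = np.random.default_rng(random_seed)  # hypothetical extra argument for reproducibility
    n = X_train.shape[0]
    models = []
    for _ in range(N_ESTIMATORS):
        ids = rng.integers(0, n, size=n)               # bootstrap replicate of the row indices
        tree = DecisionTreeRegressor(max_depth=param)  # `param` controls the maximal depth
        tree.fit(X_train[ids], y_train[ids])
        models.append(tree)
    return models

def predict_bagged_trees_sketch(X_test, models):
    # Average the predictions of all trees in the ensemble.
    return np.mean([m.predict(X_test) for m in models], axis=0)

rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(200, 2))
y_demo = X_demo[:, 0] + X_demo[:, 1]
models = train_bagged_trees_sketch(X_demo, y_demo, param=3)
preds = predict_bagged_trees_sketch(X_demo[:5], models)
print(len(models), preds.shape)  # 100 (5,)
```

For a classifier the averaging step would be replaced by a vote over the predicted labels.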
In []:

# classifiers

from sklearn.neighbors import KNeighborsClassifier
def train_knn(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()
def test_knn(X_test, est):
    # YOUR CODE HERE
    raise NotImplementedError()
from sklearn.tree import DecisionTreeClassifier
def train_ab(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()
def test_ab(X_test, models):
    # YOUR CODE HERE
    raise NotImplementedError()
from sklearn.tree import DecisionTreeClassifier
def train_rfc(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()
def test_rfc(X_test, models):
    # YOUR CODE HERE
    raise NotImplementedError()

# regressors

from sklearn.tree import DecisionTreeRegressor
def train_dt(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()
def test_dt(X_test, est):
    # YOUR CODE HERE
    raise NotImplementedError()
Question 8 [marks 0]
This is just a check-point, i.e. it is for you to see that you are correctly implementing all
functions. Since this cell uses functions that you have already implemented and that have
already been marked, this Question is not going to be marked.
Make a dataset using
X, y = get_dataset_classification(n_samples=240, std=30, inner_std=10)
from sklearn.svm import SVR
def train_svm_1(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()
def train_svm_2(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()
def train_svm_3(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()

# Note: you do not need to specialise the svm test function for each degree

def test_svm(X_test, est):
    # YOUR CODE HERE
    raise NotImplementedError()
from sklearn.tree import DecisionTreeRegressor
def train_rf(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()
def test_rf(X_test, models):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)
and check that the classification error for these classifiers
k-nearest-neighbor
random forest classifier
adaboost
is approximately comparable.
Question 9 [marks 0]
This is just a check-point, i.e. it is for you to see that you are correctly implementing all
functions. Since this cell uses functions that you have already implemented and that have
already been marked, this Question is not going to be marked.
Make a dataset using
X, y = get_dataset_regression(n_samples=120, std=30, inner_std=10)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)
and check that the regression error for these regressors
decision tree
svm with polynomial kernel of degree 2
svm with polynomial kernel of degree 3
is approximately comparable.
Question 10 [marks 10]
In []:

# Just run the following code, do not modify it

X, y = get_dataset_classification(n_samples=240, std=30, inner_std=10)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)
param=3
e_knn = classification_error(y_test, test_knn(X_test, train_knn(X_train, y_train, param)))
e_rfc = classification_error(y_test, test_rfc(X_test, train_rfc(X_train, y_train, param)))
e_ab = classification_error(y_test, test_ab(X_test, train_ab(X_train, y_train, param)))
print(e_knn, e_rfc, e_ab)
In []:

# Just run the following code, do not modify it

X, y = get_dataset_regression(n_samples=120, std=30, inner_std=10)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)
param=3
e_dt = regression_error(y_test, test_dt(X_test, train_dt(X_train, y_train, param)))
e_svm2 = regression_error(y_test, test_svm(X_test, train_svm_2(X_train, y_train, param)))
e_svm3 = regression_error(y_test, test_svm(X_test, train_svm_3(X_train, y_train, param)))
print(e_dt, e_svm2, e_svm3)
Make a function sizes, train_errors, test_errors =
compute_learning_curve(train_func, test_func, param, X, y, test_size,
n_steps, n_repetitions) to compute the train and test errors as mandated in the learning
curve approach.
The regressor will be trained via train_func on the problem X , y with
parameter param . The estimate will be done averaging a number of replicates equal to
n_repetitions , i.e. the code needs to repeat the process n_repetitions times (say 10)
and average the error.
Note that a fraction of the data as indicated by test_size (say 0.3 for 30%) is going to be
reserved for testing purposes. The remaining amount of data can be used in the training phase.
The learning curve should be computed for an amount of training material that varies from a
minimum of 2 instances up to all the instances available for training.
You should use the function regression_error to compute the error.
Note: do not use library functions (e.g. learning_curve in scikit) to compute the result
directly but implement your own version.
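One possible reading of the learning curve procedure is sketched below. It is an illustration under assumptions, not the marked solution: the training sizes are spaced evenly with `np.linspace`, a fresh random split is drawn in every repetition, and the ridge learner (`train_ridge`, `predict_ridge`) is a hypothetical stand-in used only to make the example self-contained.

```python
import numpy as np

def mse(targets, preds):
    return float(np.mean((np.asarray(targets) - np.asarray(preds)) ** 2))

def compute_learning_curve_sketch(train_func, test_func, param, X, y,
                                  test_size, n_steps, n_repetitions, random_seed=0):
    rng = np.random.default_rng(random_seed)  # hypothetical extra argument
    n = X.shape[0]
    n_test = int(round(test_size * n))
    n_train = n - n_test
    # Training-set sizes from 2 instances up to all instances available for training.
    sizes = np.linspace(2, n_train, n_steps).astype(int)
    train_errors = np.zeros(n_steps)
    test_errors = np.zeros(n_steps)
    for _ in range(n_repetitions):
        perm = rng.permutation(n)  # fresh random train/test split per repetition
        test_ids, train_ids = perm[:n_test], perm[n_test:]
        for i, size in enumerate(sizes):
            ids = train_ids[:size]
            est = train_func(X[ids], y[ids], param)
            train_errors[i] += mse(y[ids], test_func(X[ids], est))
            test_errors[i] += mse(y[test_ids], test_func(X[test_ids], est))
    return sizes, train_errors / n_repetitions, test_errors / n_repetitions

# Hypothetical stand-in learner: ridge-regularised least squares, `param` is the ridge strength.
def train_ridge(X_train, y_train, param):
    A = np.c_[X_train, np.ones(len(X_train))]
    return np.linalg.solve(A.T @ A + param * np.eye(A.shape[1]), A.T @ y_train)

def predict_ridge(X_test, w):
    return np.c_[X_test, np.ones(len(X_test))] @ w

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)
sizes, tr_err, te_err = compute_learning_curve_sketch(
    train_ridge, predict_ridge, 1e-3, X_demo, y_demo, test_size=0.3, n_steps=5, n_repetitions=10)
print(sizes, te_err[0] > te_err[-1])
```

Keeping the test split fixed within a repetition while growing the training subset makes the errors at different sizes directly comparable.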
Question 11 [marks 1]
Make a function plot_learning_curve(sizes, train_errors, test_errors) to
display the train and test error as a function of the size of the training set.
You should get something like:
Question 12 [marks 3]
Make a function estimate_asymptotic_error(sizes, train_errors, test_errors)
that returns an estimate of the asymptotic error, i.e. the error made in the limit of an infinitely
large training set.
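The text does not prescribe how the asymptotic error should be estimated. One crude possibility, assumed here purely for illustration, is that at the largest training size the train error lies below the asymptote and the test error above it, so their midpoint serves as a rough estimate:

```python
import numpy as np

def estimate_asymptotic_error_sketch(sizes, train_errors, test_errors):
    # Assumption: the two curves bracket the asymptote at the largest training size,
    # so the midpoint of the last train and test errors is used as the estimate.
    last = int(np.argmax(sizes))
    return 0.5 * (train_errors[last] + test_errors[last])

# Hypothetical learning-curve values, chosen only to show the call.
e_demo = estimate_asymptotic_error_sketch([2, 50, 100], [0.0, 0.5, 0.75], [3.0, 1.5, 1.25])
print(e_demo)  # 1.0
```

Other choices, such as fitting a decaying curve to the errors and extrapolating, are equally plausible.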
In []:
def compute_learning_curve(train_func, test_func, param, X, y, test_size, n_steps, n_repetitions):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def plot_learning_curve(sizes, train_errors, test_errors):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

Question 13 [marks 0]
This is just a check-point, i.e. it is for you to see that you are correctly implementing all
functions. Since this cell uses functions that you have already implemented and that have
already been marked, this Question is not going to be marked.
When you run:
X, y = get_dataset_regression(n_samples=800, std=30, inner_std=10)
train_func, test_func = train_dt, test_dt
param=5
sizes, train_errors, test_errors = compute_learning_curve(train_func,
test_func, param, X, y, test_size=.3, n_steps=10, n_repetitions=100)
e = estimate_asymptotic_error(sizes, train_errors, test_errors)
print('Asymptotic error: %.1f'%e)
plot_learning_curve(sizes, train_errors, test_errors)
you should get something like
Question 14 [marks 6]
Make a function bias2, variance = compute_bias_variance(predictions_dict,
targets) that takes as input a dictionary of lists of predictions indexed by the instance index,
and the target vector. The function should compute the squared bias component of the error
and the variance component of the error for each instance.
As a toy example consider: predictions_dict={0:[1,1,1], 1:[1,-1], 2:
[-1,-1,-1,1]} and targets=[1,1,-1] , that is, for instance with index 0 there are 3
predictions available [1,1,1] , instead for instance with index 1 there are only 2 predictions
available [1,-1] , etc. In this case, you should get bias2=[0. , 1. , 0.25] and
variance=[0. , 1. , 0.75] .
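The toy example can be reproduced with the sketch below, which takes the squared bias as the squared gap between the average prediction and the target, and the variance as the average squared deviation of the predictions from their own mean. The `_sketch` name is hypothetical, and the sketch assumes every instance index has at least one prediction.

```python
import numpy as np

def compute_bias_variance_sketch(predictions_dict, targets):
    n = len(targets)
    bias2 = np.zeros(n)
    variance = np.zeros(n)
    for i in range(n):
        preds = np.asarray(predictions_dict[i], dtype=float)
        mean_pred = preds.mean()
        bias2[i] = (mean_pred - targets[i]) ** 2          # squared gap: average prediction vs target
        variance[i] = np.mean((preds - mean_pred) ** 2)   # spread of predictions around their average
    return bias2, variance

predictions_dict = {0: [1, 1, 1], 1: [1, -1], 2: [-1, -1, -1, 1]}
targets = [1, 1, -1]
bias2, variance = compute_bias_variance_sketch(predictions_dict, targets)
print(bias2, variance)  # bias2 -> 0, 1, 0.25 and variance -> 0, 1, 0.75
```

For instance 2 the average prediction is -0.5, so the squared bias is (-0.5 + 1)^2 = 0.25 and the variance is (3 * 0.25 + 2.25) / 4 = 0.75.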
In []:
def estimate_asymptotic_error(sizes, train_errors, test_errors):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# Just run the following code, do not modify it

X, y = get_dataset_regression(n_samples=800, std=30, inner_std=10)
train_func, test_func = train_dt, test_dt
param=5
sizes, train_errors, test_errors = compute_learning_curve(train_func, test_func, param, X, y, test_size=.3, n_steps=10, n_repetitions=100)
e = estimate_asymptotic_error(sizes, train_errors, test_errors)
print('Asymptotic error: %.1f'%e)
plot_learning_curve(sizes, train_errors, test_errors)
In []:
def compute_bias_variance(predictions_dict, targets):
    # YOUR CODE HERE
    raise NotImplementedError()
Question 15 [marks 10]
Make a function bias2, variance = bias_variance_decomposition(train_func,
test_func, param, data_matrix, targets, n_bootstraps) to compute the bias
variance decomposition of the error of a regressor on a given problem. The regressor will be
trained via train_func on the problem data_matrix , targets with parameter param .
The estimate will be done using a number of replicates equal to n_bootstraps .
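A hedged sketch of how the bootstrap replicates could feed the decomposition: each replicate trains one model, the model predicts its out of bag instances, and the predictions are collected per instance before the squared bias and variance are computed. Predicting only the out of bag instances is an assumption of this sketch, as are the `random_seed` argument and the mean-predictor stand-in learner.

```python
import numpy as np

def bias_variance_decomposition_sketch(train_func, test_func, param, data_matrix, targets,
                                       n_bootstraps, random_seed=0):
    rng = np.random.default_rng(random_seed)  # hypothetical extra argument
    n = data_matrix.shape[0]
    predictions = {i: [] for i in range(n)}
    for _ in range(n_bootstraps):
        boot_ids = rng.integers(0, n, size=n)                 # bootstrap replicate
        oob_ids = np.setdiff1d(np.arange(n), boot_ids)        # instances left out of it
        if len(oob_ids) == 0:
            continue
        est = train_func(data_matrix[boot_ids], targets[boot_ids], param)
        for i, p in zip(oob_ids, test_func(data_matrix[oob_ids], est)):
            predictions[int(i)].append(p)
    bias2 = np.full(n, np.nan)      # NaN marks instances that were never out of bag
    variance = np.full(n, np.nan)
    for i, preds in predictions.items():
        if preds:
            preds = np.asarray(preds, dtype=float)
            bias2[i] = (preds.mean() - targets[i]) ** 2
            variance[i] = preds.var()
    return bias2, variance

# Hypothetical stand-in learner: always predicts the mean of the training targets.
def train_mean(X_train, y_train, param):
    return float(np.mean(y_train))

def predict_mean(X_test, est):
    return np.full(len(X_test), est)

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0]
bias2, variance = bias_variance_decomposition_sketch(train_mean, predict_mean, None,
                                                     X_demo, y_demo, n_bootstraps=30)
print(np.nanmean(bias2) > np.nanmean(variance))  # a constant predictor is bias-dominated
```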
Question 16 [marks 2]
Consider the following regression problem (it does not matter that the target is only 1 and -1):
from sklearn.datasets import load_iris
def make_iris_data():
    X,y = load_iris(return_X_y=True)
    X=X[:,[0,2]]
    y[y==2]=0
    y[y==0]=-1
    return X,y
Estimate the squared bias and variance component for each instance.
Consider as regressor a linear svm and a polynomial svm with degree 3.
What is the class of the instances that have the highest bias error on average?
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def bias_variance_decomposition(train_func, test_func, param, data_matrix, targets, n_bootstraps):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# Just run the following code, do not modify it

from sklearn.datasets import load_iris
def make_iris_data():
    X,y = load_iris(return_X_y=True)
    X=X[:,[0,2]]
    y[y==2]=0
    y[y==0]=-1
    return X,y
X,y = make_iris_data()
bias2, variance = bias_variance_decomposition(train_svm_1, test_svm, param=2, da
print(np.mean(bias2[y==1]) , np.mean(bias2[y==-1]))
bias2, variance = bias_variance_decomposition(train_svm_3, test_svm, param=2, da
print(np.mean(bias2[y==1]) , np.mean(bias2[y==-1]))
Question 17 [marks 6]
Make a function bs,vs = compute_bias_variance_decomposition(train_func,
test_func, params, data_matrix, targets, n_bootstraps) to compute the average
squared bias error component and the average variance component of the error for each
parameter setting in the vector params . The regressor will be trained via train_func on the
problem data_matrix , targets with parameter param . The estimate will be done using a
number of replicates equal to n_bootstraps . To be clear, the vector bs contains the
average square bias error for each parameter in params and the vector vs contains the
average variance error for each parameter in params .
Question 18 [marks 1]
Make a function plot_bias_variance_decomposition(train_func, test_func,
params, data_matrix, targets, n_bootstraps, logscale=False) .
You should plot the individual components of the error, i.e. the squared bias and the variance, together with the total error.
You should allow the possibility to employ a logarithmic scale for the horizontal axis via the
logscale flag.
You should get something like:
Question 19 [marks 2]
Make a function find_best_param_with_bias_variance_decomposition(train_func,
test_func, params, data_matrix, targets, n_bootstraps) that uses the bias
variance decomposition analysis to determine which parameter among params achieves the
smallest estimated predictive error.
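Once the average squared bias and the average variance are available for each parameter, the selection step can be sketched as picking the parameter whose summed components are smallest. The helper name and the example numbers below are hypothetical; treating the total error as squared bias plus variance is the assumption being illustrated.

```python
import numpy as np

def best_param_from_components_sketch(params, bs, vs):
    # Estimated predictive error per parameter: squared bias plus variance.
    total = np.asarray(bs, dtype=float) + np.asarray(vs, dtype=float)
    return params[int(np.argmin(total))]

# Hypothetical component values: bias falls and variance rises as the parameter grows.
best_demo = best_param_from_components_sketch([1, 2, 3, 4], [9.0, 4.0, 2.0, 1.5], [0.5, 1.0, 2.0, 5.0])
print(best_demo)  # 3 (totals are 9.5, 5.0, 4.0, 6.5)
```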
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def compute_bias_variance_decomposition(train_func, test_func, params, data_matrix, targets, n_bootstraps):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def plot_bias_variance_decomposition(train_func, test_func, params, data_matrix, targets, n_bootstraps, logscale=False):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def find_best_param_with_bias_variance_decomposition(train_func, test_func, params, data_matrix, targets, n_bootstraps):
    # YOUR CODE HERE
    raise NotImplementedError()
Question 20 [marks 6]
When you execute the following code
data_matrix, targets = get_dataset_regression(n_samples=400, std=10, inner_std=7)
params = np.linspace(1,30,30).astype(int)
train_func, test_func = train_dt, test_dt
p = find_best_param_with_bias_variance_decomposition(train_func,
    test_func, params, data_matrix, targets, n_bootstraps=60)
print('Best parameter:%s'%p)
plot_bias_variance_decomposition(train_func, test_func, params,
    data_matrix, targets, n_bootstraps=50, logscale=False)
You should get something like:
The next unit tests will run your function
find_best_param_with_bias_variance_decomposition on an undisclosed dataset
using as regressors:
decision tree
svm degree 3
and 3 marks will be awarded for each correct optimal parameter identified.
Question 21 [marks 5]
Make a function conf_mtx = confusion_table(targets, preds) to output the
confusion matrix as a 2 x 2 Numpy array. Rows indicate the prediction and columns the target.
The cell element with index [0,0] should report the true positive count.
Running the following code:
from sklearn.datasets import load_iris
X,y = load_iris(return_X_y=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)
models = train_knn(X_train, y_train, param=3)
preds = test_knn(X_test, models)
conf_mtx = confusion_table(y_test, preds)
print(conf_mtx)
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:

# This cell is reserved for the unit tests. Do not consider this cell.

you should obtain something similar to
[[16. 1.]
[0. 28.]]
Note: the exact values can differ in your run
Note: do not use library functions to compute the result directly but implement your own
version.
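The layout described in Question 21 (rows for predictions, columns for targets, true positives in cell [0,0]) can be sketched as follows. The text does not say which label counts as positive, so the `positive_label` argument is a hypothetical addition; every other label is treated as negative.

```python
import numpy as np

def confusion_table_sketch(targets, preds, positive_label=1):
    # Assumption: `positive_label` marks the positive class, all other labels are negative.
    targets = np.asarray(targets) == positive_label
    preds = np.asarray(preds) == positive_label
    table = np.zeros((2, 2))
    table[0, 0] = np.sum(preds & targets)     # predicted positive, target positive (TP)
    table[0, 1] = np.sum(preds & ~targets)    # predicted positive, target negative (FP)
    table[1, 0] = np.sum(~preds & targets)    # predicted negative, target positive (FN)
    table[1, 1] = np.sum(~preds & ~targets)   # predicted negative, target negative (TN)
    return table

table_demo = confusion_table_sketch([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(table_demo)  # TP=2, FP=1, FN=1, TN=1
```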
Question 22 [marks 1]
Make a function error_from_confusion_table(confusion_table_func, targets,
preds) that takes as input the previous confusion_table function and returns the error, i.e.
the fraction of predictions that do not agree with the targets.
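Since the diagonal of the confusion table counts the agreements, the error can be read off as one minus the diagonal share. A minimal sketch, with a hypothetical stand-in table function that returns the example values printed for Question 21:

```python
import numpy as np

def error_from_confusion_table_sketch(confusion_table_func, targets, preds):
    table = np.asarray(confusion_table_func(targets, preds), dtype=float)
    # Diagonal cells are agreements; every off-diagonal cell is an error.
    return 1.0 - np.trace(table) / np.sum(table)

def example_table(targets, preds):
    # Hypothetical stand-in that ignores its inputs and returns a fixed table.
    return np.array([[16.0, 1.0], [0.0, 28.0]])

err_demo = error_from_confusion_table_sketch(example_table, None, None)
print(err_demo)  # 1 error out of 45 predictions
```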
Question 23 [marks 12]
Make a function predictions, out_targets =
cross_validation_prediction(train_func, test_func, param, data_matrix,
targets, kfold) that estimates the predictions of a classifier trained via the function
train_func with parameter param on the problem data_matrix, targets using a kfold
cross validation strategy with the number of folds indicated by kfold .
Since the order of the instances associated to the predictions can be different from the original
order, the function is required to output also the corresponding target values in the array
out_targets (i.e. the value in position 10 in predictions corresponds to the target value
in position 10 in out_targets )
Note: do not use library functions (such as KFold or StratifiedKFold) but implement
your own version of the cross validation.
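A hand-written k-fold loop might look like the sketch below. It shuffles the instance indices, splits them into `kfold` parts, and uses each part once as the test fold. The shuffling, the `random_seed` argument, and the nearest-neighbour stand-in learner are assumptions made for the illustration; the sketch also assumes `kfold` is at least 2.

```python
import numpy as np

def cross_validation_prediction_sketch(train_func, test_func, param, data_matrix, targets,
                                       kfold, random_seed=0):
    rng = np.random.default_rng(random_seed)  # hypothetical extra argument
    ids = rng.permutation(data_matrix.shape[0])
    folds = np.array_split(ids, kfold)        # kfold index blocks of nearly equal size
    predictions, out_targets = [], []
    for k in range(kfold):
        test_ids = folds[k]
        train_ids = np.concatenate([folds[j] for j in range(kfold) if j != k])
        est = train_func(data_matrix[train_ids], targets[train_ids], param)
        predictions.extend(test_func(data_matrix[test_ids], est))
        out_targets.extend(targets[test_ids])  # keep targets aligned with the predictions
    return np.array(predictions), np.array(out_targets)

# Hypothetical stand-in classifier: 1-nearest-neighbour that memorises the training set.
def train_memorise(X_train, y_train, param):
    return X_train, y_train

def predict_nearest(X_test, est):
    X_train, y_train = est
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(dists, axis=1)]

rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(-5, 1, size=(20, 2)), rng.normal(5, 1, size=(20, 2))])
y_demo = np.array([0] * 20 + [1] * 20)
predictions, out_targets = cross_validation_prediction_sketch(
    train_memorise, predict_nearest, None, X_demo, y_demo, kfold=5)
print(predictions.shape, np.mean(predictions == out_targets))
```

Every instance appears in exactly one test fold, so the output arrays have one entry per input instance, in shuffled order.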
In []:
def confusion_table(targets, preds):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def error_from_confusion_table(confusion_table_func, targets, preds):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def cross_validation_prediction(train_func, test_func, param, data_matrix, targets, kfold):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.
Question 24 [marks 5]
Make a function mean_errors =
compute_errors_with_crossvalidation(train_func, test_func, params,
data_matrix, targets, kfold, n_repetitions) that returns the estimated average
error for each parameter in params . The classifier is trained via the function train_func
with parameters taken from params on the problem data_matrix, targets using a k-fold
cross validation strategy with the number of folds indicated by kfold . The error estimate is
repeated a number of times indicated in n_repetitions . The error should be computed
using the function error_from_confusion_table . The output vector mean_errors has
as many entries as there are parameters in params .
Note: do not use library functions (such as cross_val_score) but implement your own
version of the code.
Question 25 [marks 2]
Make a function find_best_param_with_crossvalidation(train_func, test_func,
params, data_matrix, targets, kfold, n_repetitions) that uses crossvalidation to
determine which parameter among params achieves the smallest estimated predictive error.
Question 26 [marks 0]
This is just a check-point, i.e. it is for you to see that you are correctly implementing all
functions. Since this cell uses functions that you have already implemented and that have
already been marked, this Question is not going to be marked.
You should be able to run the following code:
from sklearn.datasets import load_wine
data_matrix, targets = load_wine(return_X_y=True)
params = [3,5,7,9,11]
train_func, test_func = train_knn, test_knn
kfold = 5
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def compute_errors_with_crossvalidation(train_func, test_func, params, data_matrix, targets, kfold, n_repetitions):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

In []:
def find_best_param_with_crossvalidation(train_func, test_func, params, data_matrix, targets, kfold, n_repetitions):
    # YOUR CODE HERE
    raise NotImplementedError()
In []:

# This cell is reserved for the unit tests. Do not consider this cell.

n_repetitions = 5
best_param = find_best_param_with_crossvalidation(train_func, test_func,
params, data_matrix, targets, kfold, n_repetitions)
print(best_param)
and get a value around 3.
In []:

# Just run the following code, do not modify it

from sklearn.datasets import load_wine
data_matrix, targets = load_wine(return_X_y=True)
params = [3,5,7,9,11]
train_func, test_func = train_knn, test_knn
kfold = 5
n_repetitions = 5
best_param = find_best_param_with_crossvalidation(train_func, test_func, params, data_matrix, targets, kfold, n_repetitions)
print(best_param)
