
School of Computing Sciences
Module: CMP-6002B Machine Learning
Assignment: Classification with Decision Trees
Date set: Tuesday 11th March 2022
Value: 50%
Date due: Wednesday 18th May 2022 3pm
Returned by: Wednesday 17th June 2022
Submission: Blackboard
Learning outcomes
Students will learn about implementing variants of decision trees and ensemble techniques, and about the general issues involved in implementing any classifier. They will better understand how to evaluate and compare classifiers on a range of data and for a specific problem. They will appreciate the difficulty of presenting technical results in a way that supports or refutes a hypothesis. The exercises will improve the following transferable skills: programming in Java; analysis of data and results; writing clearly and concisely; and presenting data coherently. Students will also learn the basics of how to use GitHub to work with open source software.
Specification
Overview
The task for this assignment is to implement components of decision trees and ensembles, then to evaluate and compare them on a range of real-world data.
Part 1: Building a Tree by Hand (10%)
The purpose of this exercise is to make sure you understand how the basic algorithms work and
to provide you with a bench check for later implementations.
The data in Table 1 shows a training data set for a classification problem: predicting the region of a whisky from three taste attributes: Peaty, Woody and Sweet.
Table 1: Whisky Region Classification Example
Peaty Woody Sweet Region
yes no yes Islay
yes yes yes Islay
yes no no Islay
yes no no Islay
no yes no Islay
no yes yes Speyside
no yes yes Speyside
no yes yes Speyside
no no yes Speyside
no no yes Speyside
Construct decision trees for this data using

  1. the chi-squared statistic as the splitting criterion;
  2. the information gain as the splitting criterion.

The stopping criterion for both trees should be to stop at a pure node (a node containing just one class) or when there are no more attributes to use. Include all calculations and draw the final decision trees.
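As a bench check before the detailed calculations, note that the root entropy can be computed directly from the class counts (5 Islay, 5 Speyside); for the information gain tree this is the starting point:

```latex
H(\text{Region}) = -\tfrac{5}{10}\log_2\tfrac{5}{10} - \tfrac{5}{10}\log_2\tfrac{5}{10} = 1 \text{ bit}
```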
Part 2: Implement Variants of Decision Trees and Ensembles (50%)
This exercise involves implementing, debugging and testing variants of decision tree classifiers. A Java code base, tsml, is provided. To get started, you need to clone the ml6002b-coursework branch of the tsml GitHub repository. There is guidance on how to do this on Blackboard. Please follow the method and class names specified in the instructions. Failure to do so may result in a marks penalty.
Static functions to help assess attribute quality (15%)
Implement a class called AttributeMeasures in the package ml_6002b_coursework that contains four static methods, each of which measures the quality of an attribute split at a node. Each method should take a two-dimensional array of integers as an argument and return a double. You can assume that the rows represent the different values of the attribute being assessed and the columns the class counts. The methods are:
  1. measureInformationGain returns the information gain for the contingency table.
  2. measureInformationGainRatio returns the information gain ratio for the contingency table.
  3. measureGini returns the Gini measure for the contingency table.
  4. measureChiSquared returns the chi-squared statistic for the contingency table.

The formal definitions for information gain, gain ratio, Gini and chi-squared are given in the lecture on Decision Trees, and I want you to follow these formulae. Comment the code to indicate how you have dealt with any edge cases. Include a main method test harness that tests each function by finding each measure for the attribute Peaty in terms of the class (Region). Print each measure to the console in the form "measure <insert> for Peaty = <insert>".
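To make the expected behaviour concrete, here is a minimal sketch of two of the four measures, applied to the Peaty contingency table from Table 1 (rows = attribute values yes/no, columns = class counts for Islay/Speyside). The class and method names follow the specification; the edge-case handling shown (skipping zero counts and empty rows) is one reasonable choice, not a prescribed one:

```java
// Sketch of two of the four required measures; the edge-case handling
// (zero counts contribute nothing, empty nodes have zero entropy) is an
// assumption, not part of the specification.
public class AttributeMeasures {

    // Entropy of a vector of class counts, in bits; empty nodes give 0.
    private static double entropy(int[] counts, int total) {
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** Information gain for the contingency table (rows = attribute values). */
    public static double measureInformationGain(int[][] table) {
        int numClasses = table[0].length;
        int[] classTotals = new int[numClasses];
        int n = 0;
        for (int[] row : table) {
            for (int j = 0; j < numClasses; j++) {
                classTotals[j] += row[j];
                n += row[j];
            }
        }
        double gain = entropy(classTotals, n);   // entropy before the split
        for (int[] row : table) {
            int rowTotal = 0;
            for (int c : row) rowTotal += c;
            gain -= (double) rowTotal / n * entropy(row, rowTotal);
        }
        return gain;
    }

    /** Chi-squared statistic comparing observed with expected class counts. */
    public static double measureChiSquared(int[][] table) {
        int rows = table.length, cols = table[0].length;
        int[] rowTotals = new int[rows];
        int[] colTotals = new int[cols];
        int n = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowTotals[i] += table[i][j];
                colTotals[j] += table[i][j];
                n += table[i][j];
            }
        }
        double chi = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = (double) rowTotals[i] * colTotals[j] / n;
                if (expected > 0) {
                    chi += (table[i][j] - expected) * (table[i][j] - expected) / expected;
                }
            }
        }
        return chi;
    }

    public static void main(String[] args) {
        // Peaty from Table 1: rows {yes, no}, columns {Islay, Speyside}.
        int[][] peaty = {{4, 0}, {1, 5}};
        System.out.printf("measure information gain for Peaty = %.4f%n",
                measureInformationGain(peaty));   // 0.6100
        System.out.printf("measure chi squared for Peaty = %.4f%n",
                measureChiSquared(peaty));        // 6.6667
    }
}
```

Gain ratio and Gini follow the same pattern, reusing the row and column totals already computed here.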
Alternative attribute selection methods for CourseworkTree (15%)
Your task is to facilitate using your attribute measures as selection mechanisms within the CourseworkTree classifier provided in the package ml_6002b_coursework. It is advisable to revisit the lab sheet on decision trees and to look at the original source for ID3. There is an interface called AttributeSplitMeasure that is used in CourseworkTree. You need to implement variants of this interface and use them within CourseworkTree. The interface contains a single abstract method, computeAttributeQuality, that takes an Instances and an Attribute and should return the quality of the attribute passed. There is also a default method, splitData, that splits Instances by an Attribute.
  1. Implement and test the skeleton class IGAttributeSplitMeasure so that the split is performed using information gain or information gain ratio. The choice of measure should be controlled by a boolean variable called useGain.
  2. Implement and test a class ChiSquaredAttributeSplitMeasure that implements AttributeSplitMeasure and measures the quality using the chi-squared statistic.
  3. Implement and test a class GiniAttributeSplitMeasure that implements AttributeSplitMeasure and measures the quality using the Gini index.
  4. Configure CourseworkTree so that it can be used with any of the four attribute split measures you have implemented (including both information gain and information gain ratio). Adjust CourseworkTree so that the split criterion can be set through setOptions.
  5. Currently, CourseworkTree can only be used with nominal attributes. Implement a method in AttributeSplitMeasure called splitDataOnNumeric that makes a binary split of instances into those below a value and those above it. This should be done prior to measuring the attribute quality. Please research how best to do this.
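The real splitDataOnNumeric must operate on Weka Instances objects, but the counting logic behind a binary numeric split can be sketched independently of Weka. The flat-array representation and the mean-valued threshold below are illustrative assumptions only; choosing a good threshold is part of the research asked for above:

```java
// Illustrative sketch only: the real splitDataOnNumeric works on
// weka.core.Instances. The array representation and the mean-valued
// threshold are assumptions made for the sake of a small example.
public class NumericSplitSketch {

    /**
     * Builds a 2 x numClasses contingency table for a binary split:
     * row 0 counts instances with the attribute value below the threshold,
     * row 1 counts those at or above it.
     */
    public static int[][] binarySplitCounts(double[] values, int[] classes,
                                            int numClasses, double threshold) {
        int[][] table = new int[2][numClasses];
        for (int i = 0; i < values.length; i++) {
            int row = values[i] < threshold ? 0 : 1;
            table[row][classes[i]]++;
        }
        return table;
    }

    /** One simple threshold choice: the mean of the observed values. */
    public static double meanThreshold(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 7.0, 8.0, 9.0};
        int[] classes = {0, 0, 1, 1, 0};
        double t = meanThreshold(values);                        // 5.4
        int[][] table = binarySplitCounts(values, classes, 2, t);
        // Below 5.4: {2, 0}; at or above: {1, 2}.
        System.out.println(table[0][0] + " " + table[0][1] + " "
                + table[1][0] + " " + table[1][1]);              // 2 0 1 2
    }
}
```

The resulting table can be fed straight into your AttributeMeasures methods; other threshold choices (for example, the midpoint that maximises the chosen measure) are worth investigating.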
The majority of the marks will be for correctness. However, some consideration will be given to efficiency and code quality. Include a main method in all three AttributeSplitMeasure classes that prints out the split criterion value for Peaty, Woody and Sweet for the whole data set from Section 1. Print in the form "measure <insert> for attribute <insert> splitting diagnosis = <insert>". Add a main method to the class DecisionTreeCoursework that loads the problem optdigits (all nominal attributes) from the directory test data (do not use absolute file paths for this), performs a random split, then builds classifiers using the IG, Chi-Squared and Gini split criteria, outputting the test accuracy of each to the console in the form "DT using measure <insert> on optdigits problem has test accuracy = <insert>". Repeat this for the data set ChinaTown (all continuous attributes).
Implement an Ensemble Classifier (20%)
The task is to implement an ensemble classifier that can be used with variants of your enhanced CourseworkTree classifier. You must implement this classifier from scratch rather than use any existing classifiers in tsml. Implement a classifier, TreeEnsemble, that extends AbstractClassifier and consists of an ensemble of CourseworkTree classifiers stored in an array or List. The size of the ensemble should be controlled by an integer attribute, numTrees, with a default value of 50. TreeEnsemble should be in the package ml_6002b_coursework.
  1. The method buildClassifier should construct a new set of instances for each base classifier of the ensemble by selecting a random subset (without replacement) of the attributes. The proportion of attributes to select should be a parameter (defaulting to 50%). The TreeEnsemble will need to record which attributes are used with which classifier in order to recreate the attribute selections in classifyInstance and distributionForInstance.
  2. Further diversity should be injected into the ensemble by randomising the decision tree parameters. This diversification should include the attribute selection mechanisms you have implemented, but can also involve other tree parameters such as stopping conditions.
  3. Implement classifyInstance so that it returns the majority vote of the ensemble: classify a new instance with each of the decision trees, count how many classifiers predict each class value, then return the class that receives the largest number of votes.
  4. Implement distributionForInstance so that it returns the proportion of votes for each class.
  5. Implement an option where, rather than counting votes, the ensemble averages the probability distributions of the base classifiers. This should be controlled by a boolean attribute called averageDistributions.
  6. Implement a main method in the class TreeEnsemble that loads the problems optdigits and ChinaTown, prints out the test accuracy and also prints the probability estimates for the first five test cases.
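A minimal sketch of the two voting schemes, with the per-tree outputs passed in directly; in TreeEnsemble itself these would come from calling classifyInstance and distributionForInstance on each CourseworkTree:

```java
// Illustrative sketch of the voting logic for TreeEnsemble. The per-tree
// predictions and distributions are supplied as plain arrays here rather
// than produced by real CourseworkTree classifiers.
public class EnsembleVotingSketch {

    /** Proportion of votes for each class (majority vote = argmax of this). */
    public static double[] voteDistribution(int[] treePredictions, int numClasses) {
        double[] dist = new double[numClasses];
        for (int p : treePredictions) dist[p]++;
        for (int j = 0; j < numClasses; j++) dist[j] /= treePredictions.length;
        return dist;
    }

    /** Alternative controlled by averageDistributions: mean of tree distributions. */
    public static double[] averageDistributions(double[][] treeDistributions) {
        int numClasses = treeDistributions[0].length;
        double[] dist = new double[numClasses];
        for (double[] d : treeDistributions) {
            for (int j = 0; j < numClasses; j++) dist[j] += d[j];
        }
        for (int j = 0; j < numClasses; j++) dist[j] /= treeDistributions.length;
        return dist;
    }

    public static void main(String[] args) {
        // Five trees vote over three classes: class 1 wins 3 votes to 2.
        int[] votes = {1, 0, 1, 2, 1};
        double[] dist = voteDistribution(votes, 3);
        System.out.printf("%.1f %.1f %.1f%n", dist[0], dist[1], dist[2]); // 0.2 0.6 0.2
    }
}
```

Note that distributionForInstance is then just voteDistribution (or averageDistributions), and classifyInstance is the index of its largest entry.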
Evaluation of Decision Trees and Ensemble Classifiers (40%)
Your task is to perform a series of classification experiments and write them up as a research paper. Your experiments will address the question of whether the variations you have implemented for decision trees improve performance over a range of problems, and whether your ensemble is a good approach for a specific case study data set. You have been assigned a specific data set (see Blackboard) and a link to further information. Please note that for this part of the coursework we mark the paper itself, not the code used to generate the results. We advise you to reserve significant time for reading and checking your paper, and recommend you ask someone else to read it before submission. Imagine you are writing a paper for a wide audience, not just for the marker. Aim for a tight focus on the specific questions the paper addresses. We have provided two sets of data sets for these experiments, in the files UCIDiscrete.zip and UCIContinuous.zip. A list of the problems in these files is given in the otherwise empty class DatasetLists.java. You will be assigned a completely different problem from timeseriesclassification.com for the case study.
There are four issues to investigate:
  1. Decision Trees: Test whether there is any difference in average accuracy between the attribute selection methods on the classification problems we have provided. Compare your versions of DecisionTree to the Weka ID3 and J48 classifiers.
  2. Tree Ensemble vs Tree Tuning: Test whether tuning CourseworkTree, including choosing the attribute selection criterion, is better than ensembling. It is up to you to select the ranges of values to tune and ensemble over, but you should ensure that the range of parameters you tune over is the same as the one you ensemble over. Perform this experiment with the proportion of attributes set to 100%, then repeat it with the proportion set to 50%. You can use any code available in tsml to help you do the tuning.
  3. Compare your ensemble against a range of built-in Weka classifiers, including other ensembles, on the provided classification problems.
  4. Perform a case study on your assigned data set to propose which classifier, from those you have used, would be best for this particular problem.
Experiments
You should compare classifiers based on accuracy (or, equivalently, error) and any other metrics described in the evaluation lecture that you feel are appropriate. You can use the built-in classifier evaluation tools provided in tsml, or any other mechanism. You should think about the experimental design for each experiment, including deciding on a sampling method (train/test split, cross-validation or resampling) and performance measures. The choice of classifiers for part 3 is up to you, but we recommend you include random forest and rotation forest in the list. Include a table of all parameter settings of all classifiers in the paper. You can also compare the time each algorithm takes to build a model. You can use both the discrete data and the continuous data, although it may be advisable to split the analysis into two parts if you use both. For the case study, the data you have been given has all continuous attributes. You should experiment with the data in this format, and compare results to those obtained by first discretising the data so that it is converted to nominal attributes. The conversion can be done using the Weka filter Discretize or by any other means.
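For intuition about what discretisation does, equal-width binning (one of the strategies the Weka Discretize filter supports) can be sketched as follows; the class and method names here are illustrative only:

```java
// Illustrative sketch of equal-width discretisation: each continuous value
// is mapped to one of numBins equally wide intervals between the observed
// minimum and maximum. The Weka Discretize filter does this (and more) on
// real Instances objects.
public class EqualWidthBinning {

    /** Maps each value to a bin index in [0, numBins). */
    public static int[] discretise(double[] values, int numBins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / numBins;
        int[] bins = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = width == 0 ? 0 : (int) ((values[i] - min) / width);
            bins[i] = Math.min(b, numBins - 1); // clamp the maximum value
        }
        return bins;
    }

    public static void main(String[] args) {
        // Width 2.5 over [0, 10] gives bins 0, 1, 2, 3, 3.
        int[] bins = discretise(new double[]{0.0, 2.5, 5.0, 7.5, 10.0}, 4);
        System.out.println(java.util.Arrays.toString(bins));
    }
}
```

Comparing results on the raw continuous attributes against results on such binned versions is exactly the case study comparison asked for above.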
The Write Up
You should write a paper in LaTeX using the style file available on Blackboard. The paper should be called "An experimental assessment of decision trees and decision tree ensembles". It should be around 4 pages, and should not be longer than 8 pages, including references and appendices. Your paper should include the following sections:
  1. Introduction: start with the aims of the paper, a description of the hypotheses you are testing and an overview of the structure. Include your prior beliefs as to what you think the outcome of the tests will be, with some rationalisation as to why. For example, do you think attribute evaluation will make a difference? Can you find any published literature that claims to answer any of the questions?
  2. Data Description: an overview of the data, including data characteristics summarised in a table (number of attributes, number of train/test cases, number of classes, class distribution). For your case study data set you should give more detail, including a description of the source of the data and the nature of the problem the data describes. Look for references to similar types of problem. Do not simply copy the data description from timeseriesclassification.com.
  3. Classifier Description: include a description of your CourseworkTree classifier, including details of the design choices you made and the data structures you employed. You should include an example of how to use the classifier with the refinements you have included. Also provide an overview of the other classifiers you use in the case study, including references.
  4. Results: a description of the experimental procedure you used and details of the results. Remember, graphs are good. There should be a subsection for each hypothesis. The case study section should go into greater depth, and remember, accuracy is not the only criterion.
  5. Conclusions: how do you answer the questions posed? Are there any biases in your experiment and, if so, how would you improve/refine them?
Presenting the output of experiments is a key transferable skill in any technical area. I stress again that for this section we mark the paper, not the experimental code. Avoid writing a blow-by-blow description of what you did. We do not need to know about any blind alleys you went down or problems you encountered, unless they are relevant to the experiments (e.g. memory constraints). Please also take care to include references where appropriate, and to check the formatting of references.
These experiments will require a reasonable amount of computation. We suggest you store the results of experiments in the format used on the module, to avoid data loss in the case of computer failure. You can remotely log on to the lab machines and run experiments on that hardware if yours is limited.
Submission Requirements
You should submit a single zipped file containing your solutions to parts 1, 2 and 3 to the Blackboard submission point.
For part 1, submit a single pdf with your worked solution. For part 2, submit a zipped IntelliJ project with your code. For part 3, submit a single pdf of your paper. A template format is here:
https://www.overleaf.com/read…
Feel free to copy this project.
The Blackboard submission portal will go live one week before the deadline.
Plagiarism and collusion: Stack Overflow and other similar forums (such as the module discussion board) are valuable resources for learning and for completing formative work. However, for assessed work such as this assignment, you may not post questions relating to coursework solutions in such forums. Using code from other sources, or written by others, without acknowledgement is plagiarism, which is not allowed (General Regulation 18).
Marking Scheme
  1. Part 1: Decision trees by hand (10 marks)
  2. Part 2: Implement tree variants (50 marks)
  3. Part 3: Evaluate classifiers (40 marks)
Total: 100 marks, 50% of the overall module mark