Data Analytics: Principles and Tools
Assignment #3
Decision Trees & Visualizations
Posted: March 18th 2019
Due: April 5th 2019 11:55PM
Total: 100 Points (10% of Final Grade)
CS2034 – Data Analytics: Principles and Tools Assignment #3
Learning Outcomes
By completing this assignment, you will gain and demonstrate skills relating to:
Creating Decision Trees
Applying Information Theory Concepts
Calculating Entropy and Information Gain
Creating Visualizations
Processing Data to be Visualized
Instructions
This assignment is divided into two distinct activities, one dealing with decision trees and
one with visualizations. In both activities, it is left to you to decide the best way to process
the data and do the required calculations using the techniques that have been covered in
class and the labs. Precise step-by-step instructions are intentionally not given so you can
demonstrate the skills you have acquired in this course.
For both activities you should use Excel (and optionally VBA) as your primary tool for
processing data and making any calculations. You must provide full details on the steps
you took to process the data and make any calculations clear to the reader of the Excel
document. This can be done by including notes in the Excel sheet (e.g. cells with text in
them explaining the calculations being done in other cells), documenting/commenting any
VBA code (if used) or by including a PDF with text explaining your work. For each activity,
you are expected to include an Excel sheet with your processed data and calculations as part
of your final deliverables.
You should check that your Excel documents and any VBA code (if used) work correctly
and are compatible with the GenLab computers and Excel 2016 for Windows.
You will be assessed on the following:
Using the correct file from OWL (activity 2).
Showing your work, calculations and steps taken to process the data.
Your Excel formulas and operations.
Your VBA code (if used).
Completion of each task correctly.
Using appropriate visualizations (activity 2).
Producing the final deliverables as described.
Assignment submission via OWL before deadline.
Activity 1: Decision Trees
Below is a table of observations of 18 objects numbered (O1 to O18).
Table 1: Object Observation
Object Colour Roundness Size Texture Class
O1 Yellow Round Small Rough Duck
O2 Blue Round Large Rough Not a Duck
O3 Yellow Round Small Smooth Duck
O4 Red Round Medium Rough Duck
O5 Blue Square Small Smooth Not a Duck
O6 Blue Square Large Rough Not a Duck
O7 Red Round Small Rough Duck
O8 Blue Square Medium Rough Not a Duck
O9 Red Square Small Smooth Not a Duck
O10 Yellow Square Large Rough Duck
O11 Yellow Square Medium Rough Not a Duck
O12 Yellow Round Large Rough Not a Duck
O13 Red Square Large Smooth Duck
O14 Yellow Square Medium Smooth Duck
O15 Red Square Medium Rough Not a Duck
O16 Yellow Round Large Smooth Duck
O17 Blue Round Large Smooth Not a Duck
O18 Blue Round Medium Smooth Not a Duck
Colour, Roundness, Size and Texture are attributes of the objects (features) and Class
denotes if the object is a rubber duck (Duck) or some other object (Not a Duck). Assume
that only the values shown in this table are possible for each attribute (i.e. “Green” is not
a valid Colour and “Medium” is not a valid value for Roundness).
Task 1.1
Using Table 1 as your training data, create a full decision tree to classify an object as Duck or
Not a Duck based on the attributes Colour, Roundness, Size, and Texture. Use the method
based on Information Theory described in the week 10 lecture and Lab 10.
You are required to show your work and calculations for each step of the process, including
the Entropy and Information Gain values needed to find each node in the tree (even if you
could “eyeball it” accurately).
Do all of your calculations in Excel. You may use VBA, including the code for the entropy
function from Lab 10 (you would have to modify it to work with this data) but this is not
required (you can do all calculations with just Excel formulas).
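The following Python sketch may be useful as a cross-check of the Entropy and Information Gain values you compute in Excel; it does not replace the required Excel work. It assumes the standard Shannon entropy in bits (log base 2) and the usual definition of information gain as the entropy before a split minus the weighted entropy after it; the week 10 lecture may present these slightly differently. The function names and the dictionary layout are illustrative choices, and only objects O1 to O4 from Table 1 are entered here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target="Class"):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


# Objects O1 to O4 from Table 1 only; the remaining rows would be entered the same way.
rows = [
    {"Colour": "Yellow", "Roundness": "Round", "Size": "Small", "Texture": "Rough", "Class": "Duck"},
    {"Colour": "Blue", "Roundness": "Round", "Size": "Large", "Texture": "Rough", "Class": "Not a Duck"},
    {"Colour": "Yellow", "Roundness": "Round", "Size": "Small", "Texture": "Smooth", "Class": "Duck"},
    {"Colour": "Red", "Roundness": "Round", "Size": "Medium", "Texture": "Rough", "Class": "Duck"},
]

for attribute in ("Colour", "Roundness", "Size", "Texture"):
    print(attribute, round(information_gain(rows, attribute), 4))
```

Because all four of these objects are Round, splitting on Roundness separates nothing and its gain is zero for this subset. The same reasoning applies to any attribute that takes a single value within a subset.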
You are required to make your calculations clear and understandable to any reader of the
Excel document. You should include notes as text in cells to explain any complicated calculations
and make it clear what you are calculating. Use multiple sheets such that the calculations
for each node in the tree are on a different sheet in your Excel work book and make it
clear what node the sheet is for. If you use VBA code it should be documented/commented.
You may include a PDF with notes (see the Deliverables section for details) to the TA about
how you did your calculations and processed the data but you should still have notes in the
Excel sheet.
You are allowed to do some manual processing of the data and hard coding. For example, you
can manually copy the table of observations and delete rows to create a subset of the data
(you are not required to automate this). However, the more you automate the easier/faster
it will be to calculate the next node in the tree.
Note that the same attribute can appear multiple times in a decision tree so long as it
appears only once on any given path from the root node to a leaf node.
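The rule about attributes and paths can be illustrated with a short recursive sketch in Python. It assumes an ID3-style procedure (pick the attribute with the highest information gain, split, and repeat on each subset), which may differ in detail from the method in the week 10 lecture. The five rows are made-up toy data, not Table 1, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target):
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy([row[target] for row in rows]) - after


def build_tree(rows, attributes, target="Class"):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:      # pure subset: this branch ends in a leaf
        return labels[0]
    if not attributes:             # nothing left to split on: use the majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    # `best` is removed only for the subtrees below this node, so it cannot
    # repeat on this path but is still available on sibling paths.
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value], remaining, target)
                   for value in sorted({r[best] for r in rows})}}


# Made-up toy rows (NOT Table 1), chosen so that one branch needs a second split.
toy_rows = [
    {"Roundness": "Round", "Texture": "Rough", "Class": "Duck"},
    {"Roundness": "Round", "Texture": "Smooth", "Class": "Not a Duck"},
    {"Roundness": "Square", "Texture": "Rough", "Class": "Not a Duck"},
    {"Roundness": "Square", "Texture": "Smooth", "Class": "Not a Duck"},
    {"Roundness": "Square", "Texture": "Rough", "Class": "Not a Duck"},
]

tree = build_tree(toy_rows, ["Roundness", "Texture"])
print(tree)
```

In this toy tree the Square branch is already pure and becomes a leaf, while the Round branch needs a further split on Texture. Each recursive call receives its own list of remaining attributes, which is what keeps an attribute from repeating along a single path.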
Task 1.2
After you have completed Task 1.1 and believe your calculations to be correct, create a
diagram of your decision tree that clearly labels all attributes (nodes), classes (leaf nodes)
and values (branches/edges).
You may use any software you are comfortable with to create this diagram so long as everything
is labelled clearly. See the Deliverables section for details on format and file name.
Below is an example decision tree diagram from the week 10 slides (for different data, your
tree will look different and have different attributes/values).
Task 1.3
Use your decision tree to classify the following new objects (N1 to N6) based on their
attributes.
Table 2: New Object Observation
New Object Colour Roundness Size Texture
N1 Yellow Square Small Rough
N2 Red Square Medium Smooth
N3 Blue Round Small Smooth
N4 Yellow Square Large Smooth
N5 Yellow Round Large Rough
N6 Red Round Large Rough
Give your answers in a PDF file (see the Deliverables section for details) and include a brief
explanation (two to three sentences) of how you classify new observations using a decision
tree.
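As an illustration of what classifying with a decision tree involves, here is a minimal Python sketch. It assumes the tree is stored as nested dictionaries of the form {attribute: {value: subtree or class label}}, which is one possible representation and not one prescribed by the course. The tree shown is a made-up toy example, not the answer to Task 1.1.

```python
def classify(tree, observation):
    """Follow branches from the root until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                    # attribute tested at this node
        tree = tree[attribute][observation[attribute]]  # follow the matching branch
    return tree


# Made-up toy tree (NOT the tree for Table 1).
toy_tree = {
    "Roundness": {
        "Round": {"Texture": {"Rough": "Duck", "Smooth": "Not a Duck"}},
        "Square": "Not a Duck",
    }
}

print(classify(toy_tree, {"Roundness": "Round", "Texture": "Rough"}))
print(classify(toy_tree, {"Roundness": "Square", "Texture": "Rough"}))
```

Only the attributes that lie on the path actually taken are consulted; in the second call Texture is never examined because the Square branch is already a leaf.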
Activity 1 Deliverables
For this activity you must submit:
An Excel workbook, named userid act1.xlsx or userid act1.xlsm (if you used VBA)
where userid is your UWO user id, that contains all of your calculations, data processing
and VBA code (if used) for Task 1.1.
A PDF named userid act1.pdf, where userid is your UWO user id, that contains any
notes for Task 1.1, your diagram for Task 1.2 and your answers to Task 1.3. The
diagram for Task 1.2 must be legible, not overly pixelated or cut off/cropped. This
PDF should be easy for the TA to read and understand what answers are for what
Task.
You must submit these deliverables via OWL with the deliverables from Activity 2.
Activity 2: Visualization
Download the file tweetdata.xlsx from OWL. This file contains the processed tweet data
from Assignment #2 with two enhancements. The location column has been cleaned up and
split into City, Province and Country columns. The string “NULL” is used in cases where
the City, Province or Country could not be determined. The sentiment values have also
been updated using a sentimentCalc function that considers far more positive and negative
keywords.
Base your visualizations and work in the following tasks on this updated tweetdata.xlsx file
and not your own work from assignment #2.
For this activity, you may create your visualizations using any tool you are comfortable with
and have access to. The following tools are recommended and you may use more than one:
The RAW site (used in Lab 9)
Excel (Charts, Power View, Power Map, 3D Map, etc.)
HeatMapper.ca
Any other visualization mentioned in the week 9 slides.
Task 2.1: Country Visualizations
Process the Data
Using techniques we have covered in lectures, labs and assignments, create a new sheet in
the tweetdata.xlsx workbook titled “Country” that contains a list of all of the countries in
the data (containing each country only once). You are allowed to use VBA (but you are not
required to) and do some manual steps (e.g. copy and pasting, using Excel’s sort feature,
etc.).
For each country in the list, calculate the average sentiment, number of tweets in the data
set, number and percentage of positive, negative and neutral tweets and any other value you
need to create the visualizations in the next steps.
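The per-country calculations can be sketched in Python as a way to sanity-check your Excel results (in Excel itself, functions such as COUNTIFS and AVERAGEIF cover the same ground). This sketch makes several assumptions that are not stated in this handout: that each tweet has a numeric sentiment value, that a value above zero counts as positive, below zero as negative and exactly zero as neutral, and that the columns are called Country and Sentiment. The rows are made-up examples, not values from tweetdata.xlsx.

```python
from collections import defaultdict


def summarize_by_country(tweets):
    """Group sentiment scores by country and compute counts, average and percentages."""
    scores_by_country = defaultdict(list)
    for tweet in tweets:
        scores_by_country[tweet["Country"]].append(tweet["Sentiment"])

    summary = {}
    for country, scores in scores_by_country.items():
        n = len(scores)
        positive = sum(1 for s in scores if s > 0)   # assumed threshold
        negative = sum(1 for s in scores if s < 0)   # assumed threshold
        neutral = n - positive - negative
        summary[country] = {
            "tweets": n,
            "avg_sentiment": sum(scores) / n,
            "positive": positive,
            "negative": negative,
            "neutral": neutral,
            # Percentages are relative to this country's own tweet count.
            "pct_positive": 100 * positive / n,
            "pct_negative": 100 * negative / n,
            "pct_neutral": 100 * neutral / n,
        }
    return summary


# Made-up example rows (hypothetical values).
tweets = [
    {"Country": "Canada", "Sentiment": 1},
    {"Country": "Canada", "Sentiment": -1},
    {"Country": "Canada", "Sentiment": 0},
    {"Country": "Canada", "Sentiment": 2},
    {"Country": "Japan", "Sentiment": -3},
]

summary = summarize_by_country(tweets)
print(summary["Canada"])
```

Dividing by the country's own tweet count, rather than the overall total, is what makes the three percentages for a country sum to 100.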
You must include notes in your Excel workbook detailing how you processed your data (e.g.
you need to describe how you created the list of countries). You may also include notes in
a PDF file (see the deliverables section for this activity for details).
Example of what your “Country” sheet might look like (not all countries shown and data
intentionally blurred out):
Create the Visualizations
Create the following visualizations using the Country data:
1. A visualization that best shows the rank of the top 10 countries by total tweets.
2. A visualization that best shows the rank of the top 10 countries by average sentiment.
3. A visualization that best shows the percentage of positive, negative and neutral tweets
for Canada (out of the total number of tweets for Canada).
4. A visualization that best shows the total number of tweets for each country geospatially
(e.g. on a map).
5. A visualization that best shows the percentage of negative tweets for each country
geospatially.
6. A visualization that best shows the percentage of positive tweets for each country
geospatially.
The percentages for visualizations 3, 5 and 6 should be based on the number of tweets for
that country and not the total number of tweets (i.e. the positive percentage, negative
percentage and neutral percentage should add up to exactly 100% for each country).
Note that you may have to do more processing and clean up the data more depending on
the tools you use to create the visualization. For example, you may need to edit the country
names slightly to get them all to work with mapping tools.
For each visualization, include a title and appropriate labels. If a legend is required or
appropriate for the visualization type you pick, include that as well.
Task 2.2: Hierarchical Visualizations
Process the Data
Using techniques we have covered in lectures, labs and assignments, create a new sheet in
the tweetdata.xlsx workbook titled “Hierarchy” that contains a list of all of the unique city,
province and country combinations in the data. That is, each combination of a city, province and
country found in the data should be listed exactly once. Any row with a “NULL” value for
city, province or country should be ignored. You are allowed to use VBA (but you are not
required to) and do some manual steps (e.g. copy and pasting, using Excel’s sort feature,
etc.).
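The logic of building this list can be sketched in Python as follows. This is only an illustration of the de-duplication and “NULL” filtering described above, not a required approach; the column names City, Province and Country follow the description of tweetdata.xlsx, while the function name and the example rows are made up.

```python
def unique_locations(tweets):
    """Return each (city, province, country) combination once, skipping rows with "NULL"."""
    seen = set()
    ordered = []
    for tweet in tweets:
        key = (tweet["City"], tweet["Province"], tweet["Country"])
        if "NULL" in key:       # ignore rows where any of the three parts is unknown
            continue
        if key not in seen:     # keep the first occurrence only
            seen.add(key)
            ordered.append(key)
    return ordered


# Made-up example rows (hypothetical values).
tweets = [
    {"City": "London", "Province": "Ontario", "Country": "Canada"},
    {"City": "Toronto", "Province": "Ontario", "Country": "Canada"},
    {"City": "London", "Province": "Ontario", "Country": "Canada"},
    {"City": "NULL", "Province": "Ontario", "Country": "Canada"},
    {"City": "London", "Province": "England", "Country": "United Kingdom"},
]

locations = unique_locations(tweets)
print(locations)
```

Treating the three columns together as one key matters: a city name such as London can legitimately appear under more than one province and country, and each of those is a separate combination.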
For each combination in the list calculate the average sentiment, number of tweets in the
data set, number and percentage of positive, negative and neutral tweets and any other value
you need to create the visualizations in the next steps.
You must include notes in your Excel workbook detailing how you processed your data (e.g.
you need to describe how you created the list of combinations). You may also include notes
in a PDF file (see the deliverables section for this activity for details).
Example of what your “Hierarchy” sheet might look like (not all combinations shown and
data intentionally blurred out):
Create the Visualizations
Create the following visualizations using the Hierarchy data:
1. A visualization that shows the hierarchical relationship between cities, provinces
and countries. No data values (e.g. average sentiment, number of tweets, etc.) should
be shown or used (i.e. the hierarchy should not be weighted).
2. A visualization that shows the hierarchical relationship between cities, provinces
and countries weighted by the total number of tweets.
3. A visualization that shows the hierarchical relationship between cities, provinces
and countries weighted by the total number of positive tweets. This visualization
must use a different visualization type than the one you used for the last visualization.
4. A visualization that shows the flow of negative tweets from cities to provinces to
countries.
Note that you may have to do more processing and/or clean up the data more depending
on the tools you use to create the visualization.
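As one example of such extra processing, some flow-diagram tools expect a list of source, target and value links rather than one row per tweet. The Python sketch below builds such links for negative tweets. Whether your chosen tool needs this format is an assumption to check against its documentation. The sketch also assumes that a sentiment value below zero means a negative tweet; the function name and example rows are made up.

```python
from collections import Counter


def negative_flow_links(tweets):
    """Count negative tweets along city -> province and province -> country links."""
    links = Counter()
    for tweet in tweets:
        parts = (tweet["City"], tweet["Province"], tweet["Country"])
        if "NULL" in parts or tweet["Sentiment"] >= 0:   # assumed: negative means < 0
            continue
        city, province, country = parts
        links[(city, province)] += 1
        links[(province, country)] += 1
    return links


# Made-up example rows (hypothetical values).
tweets = [
    {"City": "London", "Province": "Ontario", "Country": "Canada", "Sentiment": -1},
    {"City": "Toronto", "Province": "Ontario", "Country": "Canada", "Sentiment": -2},
    {"City": "Toronto", "Province": "Ontario", "Country": "Canada", "Sentiment": 1},
    {"City": "Vancouver", "Province": "British Columbia", "Country": "Canada", "Sentiment": -1},
]

links = negative_flow_links(tweets)
for (source, target), value in links.items():
    print(source, "->", target, value)
```

The value on each province to country link equals the sum of the city to province links feeding into it, which is the property a flow diagram makes visible.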
For each visualization, include a title and appropriate labels. If a legend is required or
appropriate for the visualization type you pick, include that as well.
Hint: The RAW Site might be a useful tool for creating some of these visualizations and
deciding which visualization type to use.
Task 2.3: Infographic
Note: You will not be graded on your graphic skills per se but on how well you
communicate the results and take advantage of the Gestalt Principles.
Using the data in tweetdata.xlsx and the data you have processed, create a unique visualization
(distinct from the visualizations from the previous tasks) that shows aspects of the data
you find interesting. You may use parts of the data we have not yet dealt with like followers,
friends, verified, etc. Show any work you do for processing the data for this visualization in
a new sheet named “MyVis”.
Using at least 3 of the visual representations you have created in Task 2.1 or 2.2 and your
unique visualization, create an infographic using Paint, Adobe Photoshop (available in some
GenLabs) or other software available to you. The following web-based tools may also be of
use:
Piktochart
Venngage
Canva
easelly
For tips on how to create infographics, start with the article 19 Warning Signs Your Infographic
Stinks and search the web for good examples.
Your infographic should:
Explain the data set, and the images you included from Task 2.1 or 2.2.
Explain your unique visualization.
Have at least 3 facts about the data.
Have a title and at least 2 subsections.
Take advantage of at least some of the Gestalt Principles to help communicate your
analysis.
Activity 2 Deliverables
For this activity you must submit:
The tweetdata.xlsx file renamed to userid act2.xlsx or userid act2.xlsm (if you used
VBA) where userid is your UWO user id. This file should contain all of your calculations,
data processing and VBA code (if used) for Tasks 2.1 to 2.3.
A PDF named userid act2.pdf, where userid is your UWO user id, that contains any
notes for Tasks 2.1 to 2.3, your visualizations for each task and your infographic for Task
2.3. The visualizations must be legible, not overly pixelated or cut off/cropped in any
way that hides part of the data. This PDF should be easy for the TA to read
and understand what answers are for what Task and visualization.
A short paragraph explaining what Gestalt Principles you used in your visualizations
and/or infographic. This should be included at the end of the above-mentioned PDF
file named userid act2.pdf where userid is your UWO user id.
You must submit these deliverables via OWL with the deliverables from Activity 1.