ELEC0033 – 2020/2021
5 Data Analytics Task – Climate Data Analysis using Python
5.1 General Overview
The assignment comprises individual code writing, data analysis and inference. You are
allowed to discuss ideas with peers, but your code, experiments and report must be
based solely on your own work.
The assignment leverages elements covered in class (data analytics lecture). You will
work with two meteorological datasets, and you will be required to process the data,
clean the datasets and infer hidden patterns. Specifically, you will be asked to solve
three tasks.
The goals of the assignment are the following:
• To further develop your programming skills
• To further develop your skills and understanding of the principles of data analytics and
machine learning
• To acquire experience in dealing with real-world data
5.2 Assignment description
- Dataset description
You will find two pickle files named weather-denmark-resampled.pkl and df_perth.pkl,
respectively.
For TASKS 1 and 2, which cover the main aspects of preliminary data analysis, missing
data and outlier detection, you must use the first dataset.
For TASK 3, which covers correlation and pattern inference, you will use the second,
smaller dataset to find correlations and infer patterns.
- Tasks to be solved
Read the three task descriptions carefully and address them using the pre-compiled
Jupyter notebook named Coursework_weather_data.ipynb.
TASK 1 – PRELIMINARY ANALYSIS
In this first task, you will explore the dataset. Follow the instructions below:
a. Import the weather-denmark-resampled.pkl dataset provided in the folder and
explore the dataset by answering the following questions.
i. How many cities are there in the dataset?
ii. How many observations and features are there in this dataset?
iii. What are the names of the different features?
b. Now that you are familiar with the dataset, evaluate whether it contains any
missing values. If so, remove them using the pandas built-in function.
c. Extract the general statistical properties summarising the minimum, maximum,
median, mean and standard deviation values for all the features in the dataset. Spot
any anomalies in these properties and clearly explain why you classify them as
anomalies.
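The steps in Task 1 can be sketched as follows. This is a minimal illustration on synthetic data, not the coursework solution: it assumes the columns form a two-level MultiIndex of (city, feature), and the city names, feature names and level names used here are hypothetical. The real data would instead be loaded with pd.read_pickle("weather-denmark-resampled.pkl"), and its layout may differ.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumed layout: columns are a
# two-level MultiIndex of city and feature; all names are illustrative).
index = pd.date_range("2006-05-01", periods=6, freq=pd.Timedelta(hours=1))
columns = pd.MultiIndex.from_product(
    [["Aalborg", "Odense"], ["Temp", "Pressure"]], names=["city", "feature"]
)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(10.0, 2.0, size=(6, 4)), index=index, columns=columns)
df.iloc[2, 0] = np.nan  # inject one missing value for the demonstration

# (a) Cities, observations and features.
cities = df.columns.get_level_values("city").unique().tolist()
features = df.columns.get_level_values("feature").unique().tolist()
n_observations = len(df)

# (b) Count missing values, then drop every row that contains at least one.
n_missing = int(df.isna().sum().sum())
df_clean = df.dropna()

# (c) Minimum, maximum, median, mean and standard deviation per column.
summary = df_clean.agg(["min", "max", "median", "mean", "std"])
print(cities, features, n_observations, n_missing)
print(summary)
```

Note that dropna() removes whole rows by default, so a single missing value in one city discards that timestamp for every city. Whether this preserves the statistics of the dataset is exactly the kind of point the report is expected to discuss.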
TASK 2 – OUTLIERS
The second task focuses on spotting and handling outliers. Follow the instructions
below:
d. Store the temperature measurements in May 2006 for the city of Odense. Then
produce a simple plot of the temperature versus time.
HINT: In this dataset, the cities are vertically stacked. Therefore, we have a
multi-column dataset, which basically works as a nested dictionary.
e. Find the outliers in this set of measurements (if any) and replace them using linear
interpolation.
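The two steps of Task 2 can be sketched as follows. This is a hedged example on synthetic data: the ("Odense", "Temp") column key and the MultiIndex layout are assumptions, the two extreme readings are injected by hand, and the 1.5 × IQR rule is only one of several reasonable outlier criteria, not a method prescribed by the brief.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures around May 2006 (assumed layout and names).
index = pd.date_range("2006-04-28", "2006-06-03", freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(1)
hours = np.arange(len(index))
temp = 12.0 + 4.0 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.5, len(index))
columns = pd.MultiIndex.from_product([["Odense"], ["Temp"]], names=["city", "feature"])
df = pd.DataFrame(temp.reshape(-1, 1), index=index, columns=columns)

# Inject two implausible readings so that there is something to detect.
df.loc[pd.Timestamp("2006-05-10 12:00"), ("Odense", "Temp")] = 60.0
df.loc[pd.Timestamp("2006-05-20 03:00"), ("Odense", "Temp")] = -40.0

# (d) Nested-dictionary style access: city first, then feature, then month.
odense_may = df["Odense"]["Temp"].loc["2006-05"].copy()

# (e) Flag points outside 1.5 * IQR of the quartiles, then interpolate linearly.
q1, q3 = odense_may.quantile(0.25), odense_may.quantile(0.75)
iqr = q3 - q1
is_outlier = (odense_may < q1 - 1.5 * iqr) | (odense_may > q3 + 1.5 * iqr)
cleaned = odense_may.mask(is_outlier).interpolate(method="linear")


def plot_temperature(raw, fixed):
    """Plot raw and cleaned temperature versus time (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    raw.plot(ax=ax, label="raw")
    fixed.plot(ax=ax, label="outliers interpolated")
    ax.set_xlabel("time")
    ax.set_ylabel("temperature")
    ax.legend()
    return fig
```

Calling plot_temperature(odense_may, cleaned) draws both series on one axis, which makes it easy to see where the interpolation replaced a reading.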
TASK 3 – CORRELATION AND INFERENCE
In this last task, you will look for correlations between features of the data and infer
hidden patterns. For this task, you will be working with a smaller dataset. Follow the
instructions below:
3.1 – CORRELATION
f. We now take a new dataset (df_perth.pkl), which collects climate data of a city
in Australia. Here we have just one year of measurements, but more features.
g. Find any significant correlations between features.
HINT: you might find it useful to look for trends and recurrent patterns within the
data.
h. We now focus on the correlation between precipitation and cloud cover. We
want to infer the probability of having moderate to heavy rain (> 1 mm/h) as a
function of the cloud cover index.
HINT: you might find it useful to create a new column where you have 0 if
precipitation < 1 mm/h and 1 otherwise.
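The correlation steps above can be sketched as follows. The example uses synthetic data because the contents of df_perth.pkl are not described here: the column names (cloud_cover, precipitation, temperature), the 0 to 8 range of the cloud cover index and the relationships between the variables are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_perth.pkl (hypothetical columns and relationships).
rng = np.random.default_rng(2)
n = 2000
cloud = rng.integers(0, 9, size=n)           # assumed cloud cover index 0..8
raining = rng.random(n) < cloud / 10.0       # rain becomes likelier with cloud
precip = np.where(raining, rng.uniform(1.0, 5.0, n), rng.uniform(0.0, 0.9, n))
temperature = 25.0 - 1.0 * cloud + rng.normal(0, 2, n)
df = pd.DataFrame(
    {"cloud_cover": cloud, "precipitation": precip, "temperature": temperature}
)

# (g) Pairwise Pearson correlations between all features.
corr = df.corr()

# (h) Binary indicator following the hint: 0 if precipitation < 1 mm/h, else 1.
df["heavy_rain"] = (df["precipitation"] >= 1.0).astype(int)

# The mean of a 0/1 column within each group is the empirical probability
# of moderate to heavy rain for that cloud cover value.
rain_prob = df.groupby("cloud_cover")["heavy_rain"].mean()
print(corr.round(2))
print(rain_prob.round(2))
```

Grouping by the cloud cover index and averaging the indicator gives a simple frequency estimate. With few observations per index value the estimate is noisy, so it is worth reporting the group sizes alongside the probabilities.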
3.2 – INFERENCE
i. Let’s now assume that we want to predict the photovoltaic production (PV
production) using multiple linear regression. Explain which features are
statistically significant in modelling the target variable.
j. Create a multivariate model using the predictors chosen in the previous
question.
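A multiple linear regression with a basic significance check can be sketched as follows. This is an illustration under stated assumptions, not the expected solution: the predictors (radiation, temperature, and a deliberately unrelated noise_feature) and the way the PV target is generated are invented, and the fit uses plain NumPy least squares. The threshold |t| > 1.96 is the usual large-sample approximation to a 5% two-sided test.

```python
import numpy as np

# Synthetic data: hypothetical predictors and an invented PV production target.
rng = np.random.default_rng(3)
n = 500
radiation = rng.uniform(0, 1000, n)
temperature = rng.normal(20, 5, n)
noise_feature = rng.normal(0, 1, n)  # unrelated to the target by construction
pv = 0.05 * radiation - 0.2 * temperature + rng.normal(0, 2, n)

# Design matrix with an intercept column.
names = ["intercept", "radiation", "temperature", "noise_feature"]
X = np.column_stack([np.ones(n), radiation, temperature, noise_feature])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, pv, rcond=None)
residuals = pv - X @ beta

# Standard errors from the residual variance, then t-statistics.
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err

# Large-sample rule of thumb: |t| > 1.96 is significant at roughly the 5% level.
significant = {name: bool(abs(t) > 1.96) for name, t in zip(names, t_stats)}
r_squared = 1.0 - (residuals @ residuals) / np.sum((pv - pv.mean()) ** 2)

for name, b, t in zip(names, beta, t_stats):
    print(f"{name:>14s}  coef={b: .4f}  t={t: .2f}")
print(f"R^2 = {r_squared:.3f}")
```

Predictors whose t-statistic stays below the threshold are candidates for removal before building the final multivariate model. A statistics package that reports exact p-values would give the same ranking with a more precise cut-off.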
5.3 Deliverable
Report
The report should be written in the form of an academic paper using the ICML format (see footnote 1).
The report should be at most 10 pages long excluding references and appendices. The
report must include the following sections:
● Abstract. This section should be a short paragraph (4-5 sentences) that provides a
brief overview of the methodology and results presented in the report.
● Preliminary Analysis. This section describes your study carried out during task 1
and should be organized in the following subsections:
○ Data Understanding. This subsection should detail the data that was used
for this study, clearly describing the content, size and format of the data,
how many cities are described in the dataset, how many observations and
how many (and which) features are considered. Further information can
be provided.
○ Data Cleaning. This subsection should describe the missing data
processing. It is important to describe the methodology that you used to
search for missing data and how you addressed it in the best
way (for example, how you ensured that the dataset preserves the same
statistics/properties). Clearly motivate your answers.
○ Data Statistics. This subsection should describe the general statistical
properties of the dataset with numerical or graphical visualization. Provide
reflections on anomalies (with clear motivation/supporting evidence
for anomalies).
● Outliers. This section should describe all the steps that were applied to the data
to find and handle outliers. A justification for each step should also
be provided. If little or no pre-processing was done, this section should
clearly justify why.
● Data inference. This section should describe the exploratory and inference
process. The following subsections should be provided:
○ Data Correlation. This subsection should describe the different feature
correlations that you have investigated in the current dataset. Even if you
discover few patterns, it is important that you clearly explain and justify
the methodologies that you adopted. Clearly show results that support
your statements.
○ Data Inference. This subsection should describe the final step of data
inference. Again, clearly motivate your solutions, approaches and
conclusions/results.
Footnote 1: https://icml.cc/Conferences/2…
● Conclusion. This last section summarises the findings, highlights any challenges or
limitations that were encountered during the study and provides directions for
potential improvements.
Please make sure you complement your discussion in each section with relevant
equations, diagrams, or figures as you see fit. Most importantly, be sure that all your
answers and solutions are well motivated.
Marking Criteria
The marking criteria are as follows:
Criteria | Description | Mark weight
Abstract/Conclusions | The purpose of the executive summary is to outline the data analytics project, its inputs and envisioned outputs, as well as key findings. | 5%
Task 1 – Preliminary Analysis | Dataset Understanding. Provide a clear description of the dataset answering the following questions: i) How many cities are there in the dataset? ii) How many observations and features are there in this dataset? iii) What are the names of the different features? | 10%
Task 1 – Preliminary Analysis | Data Cleaning – Missing data. Provide a clear description of the results from your missing data analysis and key outcomes. | 15%
Task 1 – Preliminary Analysis | Data Statistics. Describe the general statistical properties of the dataset with numerical or graphical visualization. Provide reflections on anomalies (with clear motivation/supporting evidence for anomalies). | 10%
Task 2 – Outliers | Show the visualization of the temperature measurements, together with some comments on the behaviour depicted in the plots. Provide summaries of the outliers, in terms of the number of outliers detected as well as the techniques adopted to replace them (motivate your answers). | 20%
Task 3 – Inference | Data Correlation. Comment on the significant correlations you found between features and assess rain probability as a function of cloud cover index. Support the text with visualization of results and key insights on the considered approach. | 15%
Task 3 – Inference | Data Inference. Good understanding of data inference. Comment on the multivariate model using the predictors chosen in the previous question. | 20%
Report Style | The report needs a clean and clear structure and layout. The quality of images, tables, citations and references will also be taken into account. | 5%