Full text link: http://tecdat.cn/?p=30944
Suzy Moat and Tobias Preis
Data Science Lab, Behavioural Science, Warwick Business School, The University of Warwick
http://www.wbs.ac.uk/about/pe…
http://www.wbs.ac.uk/about/pe…
Goal of your investigation
Humans around the world are uploading increasing amounts of information to social media services such as Twitter and Flickr. To what extent can we exploit this information during catastrophic events such as natural disasters, to gather data about changes to our world at a time when good decisions must be reached quickly and effectively?
The subject of your current investigation is Hurricane Sandy, a hurricane that devastated portions of the Caribbean and the Mid-Atlantic and Northeastern United States during late October 2012. As a hurricane approaches, air pressure drops sharply. Your goal is to determine whether a relationship exists between the progression of Hurricane Sandy, as measured by air pressure, and user behaviour on the photo-sharing site Flickr.
If you can find a simple relationship between changes in air pressure and changes in photos taken and then uploaded to Flickr, then perhaps further investigation of these social media data would give insight into problems resulting from a hurricane that are harder to measure using environmental sensors alone. This might include the existence of burst pipes, fires, collapsed trees or damaged property. Such information could be of interest both to policy makers charged with emergency crisis management and to insurance companies.
Part 1: Acquiring the Flickr data (2%)
Hurricane Sandy, classified as the eighteenth named storm and tenth hurricane of the 2012 Atlantic hurricane season, made landfall near Atlantic City, New Jersey at 00:00 Coordinated Universal Time (UTC) on 30 October 2012.
You have decided to look at how Flickr users behaved around this date, from 20 October 2012 to 10 November 2012. In particular, you are going to look at data on photos uploaded to Flickr with the text “hurricane sandy”. When were photos with these tags taken?
TASK 1A (1%):
For the period 20 October 2012 00:00 to 10 November 2012 23:59, download hourly counts from Flickr of the number of photos taken and uploaded to Flickr with labels which include the text “hurricane sandy”.
Create a data frame containing the hourly counts. Each row of the data frame will need to specify the date and hour the count relates to, and the number of photos found. (You may wish to use one column or two for the date and time – either is fine.)
For assessment, submit the code you wrote to obtain this data, and a CSV file of the data frame containing the hourly counts.
To solve this exercise, you need to edit the code from the previous lab to:
- create a function to create the URL you need to acquire one JSON page of data on photos tagged with “hurricane sandy” for a given hour
- write code to download this JSON page of data and extract the information on this page which tells you how many photos were taken in that hour (see Hint 1 below)
- write some code to get this count for all the hours in the period above
- put each count in a row of a data frame, with the relevant date and time in another column
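The first two steps in the list above can be sketched as follows. This is a minimal sketch, not the lab's own buildFlickrURL: the names buildHourlyURL and extractTotal are hypothetical, "YOUR_API_KEY" is a placeholder, and the query assumes the flickr.photos.search method of the Flickr REST API with JSON output. extractTotal expects the list produced by a JSON parser such as fromJSON from RJSONIO.

```r
# Hypothetical helper: build a flickr.photos.search URL covering one hour.
# Passing text = NULL drops the text filter, so every photo is counted.
buildHourlyURL <- function(hourBegin, text = "hurricane sandy",
                           apiKey = "YOUR_API_KEY") {
  hourEnd <- hourBegin + 3599  # last second of the same hour
  stamp <- function(t) format(t, "%Y-%m-%d %H:%M:%S")
  params <- c(method = "flickr.photos.search",
              api_key = apiKey,
              text = text,
              min_taken_date = stamp(hourBegin),
              max_taken_date = stamp(hourEnd),
              per_page = "1",        # "total" is on page 1, so keep the page small
              format = "json",
              nojsoncallback = "1")
  encoded <- vapply(params, URLencode, character(1), reserved = TRUE)
  paste0("https://api.flickr.com/services/rest/?",
         paste(names(params), encoded, sep = "=", collapse = "&"))
}

# Hypothetical helper: read the "total" field from an already parsed JSON page.
extractTotal <- function(parsed) as.numeric(parsed$photos$total)

# Typical use (needs network access, a real API key, RCurl and RJSONIO):
#   parsed <- fromJSON(getURL(buildHourlyURL(hourBegin)))
#   extractTotal(parsed)
```

Keeping the URL builder separate from the download makes it easy to print and inspect a URL before any request is sent.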
Hint 1: There is more than one way to get hourly counts of the number of photos. Importantly – you do not need to parse all the information about all the individual photos. Instead, the data which is returned from Flickr has some key information in the variable labelled “total” on the first page of the results. You should use this or your code will take a very long time to run! On a reasonable broadband connection, the download for Task 1A should take under 15 minutes. Note that because “total” is on the first page, you do not need to download all of the pages.
Hint 2: In downloading data, you only need to be concerned about min_taken_date and max_taken_date – you can ignore min_upload_date and max_upload_date.
Hint 3: For the purposes of this exercise, don’t worry about using a “bbox” when downloading this information.
Hint 4: The “time taken” on a photo is in the photographer’s local time. For the purposes of this exercise, don’t worry about time zones – just use the times which Flickr specifies.
Hint 5: You can save a data frame to a CSV file using write.csv. See Cookbook for R for more guidance.
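The remaining steps of Task 1A, looping over every hour and saving the counts, might look like the sketch below. It is an illustration under stated assumptions: getHourlyCounts and fakeFetch are hypothetical names, and fakeFetch is an offline stand-in that always returns 0, so it must be replaced by a function that downloads the first JSON page and reads “total”. The column names Date and Freq are chosen to match the data frames used later in this article.

```r
# Every hour in the study period (time zones are ignored, as in Hint 4).
hours <- seq(as.POSIXct("2012-10-20 00:00:00", tz = "UTC"),
             as.POSIXct("2012-11-10 23:00:00", tz = "UTC"),
             by = "1 hour")

# Hypothetical helper: fetchTotal is any function mapping one hour to a count.
getHourlyCounts <- function(hours, fetchTotal) {
  counts <- vapply(seq_along(hours),
                   function(i) as.numeric(fetchTotal(hours[i])),
                   numeric(1))
  data.frame(Date = format(hours, "%Y-%m-%d %H:%M:%S"), Freq = counts)
}

# Stand-in fetcher so the sketch runs offline; swap in a real download.
fakeFetch <- function(hour) 0
sandyCounts <- getHourlyCounts(hours, fakeFetch)

# Save the data frame; row.names = FALSE avoids an extra index column.
write.csv(sandyCounts, file.path(tempdir(), "sandyFlickrCounts.csv"),
          row.names = FALSE)
```

Passing the fetcher in as an argument means the same loop can serve Task 1B by supplying a fetcher without the text filter.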
TASK 1B (1%):
The hurricane might not be the only influence on the number of photos people take. Perhaps people take more photos at the weekend or at certain times of day, for example.
We should account for this by finding out how many photos were taken in total during each hour.
For the period 20 October 2012 00:00 to 10 November 2012 23:59, download hourly counts from Flickr of the TOTAL number of photos taken and uploaded to Flickr.
Create a data frame containing the hourly counts, of the same format as the data frame you created for the last task.
For assessment, submit the code you wrote to obtain this data, and a CSV file of the data frame containing the hourly counts.
The solution to this exercise is extremely similar to the solution to Task 1A. However, here you do not want to only count photos with the text “hurricane sandy” attached.
Hint 1: Again, the download for this exercise should take around 15 minutes on a good broadband connection. If your download is taking too long, make sure you are not trying to count the photographs by downloading all the pages.
Hint 2: We have recently seen the Flickr database giving counts of 0 between 5am and 6am. Don’t worry if this happens to you too. We will clean up the data in the next step.
A:
install.packages("lubridate")
library(lubridate)
install.packages("Rcpp")
library(RCurl)
source('E:/davidvictoria/getFlickrData.r')
source('E:/davidvictoria/buildFlickrURL.r')
buildFlickrURL(hourBegin=as.POSIXct("2012-10-20 00:00:00"), page=1)  # find photos taken within a specific hour
flickrURL <- buildFlickrURL(hourBegin=as.POSIXct("2012-10-20 00:00:00"), page=1)
flickrData <- getURL(flickrURL, ssl.verifypeer = FALSE)  # download the Flickr data
flickrData
install.packages("RJSONIO")
library(RJSONIO)
flickrParsed <- fromJSON(flickrData)  # parse the JSON into an R object
flickrParsed
str(flickrParsed, max.level=2)  # inspect the converted data
flickrParsed$photos$photo
library(plyr)
flickrDF <- ldply(flickrParsed$photos$photo, data.frame)  # convert to a data frame
head(flickrDF)
=1, last=15))  # extract the first 15 characters, up to the hour
head(sandyFlickrData$Date)  # view the date data
sandyFlickrTS <- xtabs(~Date,            # count entries per hour...
                       sandyFlickrData)  # ... in sandyFlickrData
head(sandyFlickrTS)  # photo counts per hour
sandyFlickrTS <- as.data.frame(sandyFlickrTS)  # convert the photo counts to a data frame
head(sandyFlickrTS)
str(sandyFlickrTS)
B:
source('E:/davidvictoria/getFlickrData1.r')
source('E:/davidvictoria/buildFlickrURL1.r')
flickrURL <- buildFlickrURL1(hourBegin=as.POSIXct("2012-10-20 00:00:00"
15 characters, up to the hour
head(FlickrData$Date)
FlickrTS <- xtabs(~Date,       # count entries per hour...
                  FlickrData)  # ... in FlickrData
head(FlickrTS)  # photo counts per hour
FlickrTS <- as.data.frame(FlickrTS)  # convert to a data frame and inspect it
head(FlickrTS)
str(FlickrTS)
Part 2: Processing the Flickr data (2%)
TASK 2A (2%):
You now want to use the data you downloaded on the total number of photos taken to normalise the data you have on the number of Hurricane Sandy photos taken.
First, clean up your total hourly counts data. Change any entries where Flickr has given you counts of 0 total photos (very unlikely given the distribution of the rest of the data) to NA values. This is one line of code. If you’re not sure how to replace 0s with NAs, try Googling “r replace with na” for some hints.
Second, merge the total hourly counts data into your Hurricane Sandy count data frame, so that each row has an entry for the hourly count of Hurricane Sandy photos, and the total hourly count of photos. This is one line of code. You will find the command “merge” useful.
Finally, create a new column which contains the normalised count of Hurricane Sandy photos. Your new column should contain the result of dividing the hourly count of Hurricane Sandy photos by the total hourly count of photos. This is one line of code. You will find the command “transform” useful.
For assessment, submit the code you wrote to process this data, and a CSV file of the data frame containing the Hurricane Sandy hourly counts, the total hourly counts with the 0s replaced, and the normalised Hurricane Sandy hourly counts.
allhours <- seq(as.POSIXct("2012-10-20 00:00:00"),
                as.POSIXct("2012-11-10 23:59:00"),
                by="1 hour")  # time vector of every hour in the Hurricane Sandy period
allhours <- data.frame(Date=allhours)
head(allhours)
sandyFlickrTS <- merge(sandyFlickrTS,  # merge the photo counts ...
                       allhours,       # ... with the list of hours
                       by="Date",      # matching rows based on "Date"
                       all=TRUE)       # keeping all entries, not only those
                                       # which exist in both data frames
)# rename the column to the photo count
head(sandyFlickrTS)
# Normalise: Hurricane Sandy photos divided by total photos.
# This assumes FlickrTS has one row per hour, in the same order as sandyFlickrTS.
sandyFlickrTS <- transform(sandyFlickrTS,
                           ncount = Freq / FlickrTS$Freq)
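The three one-line steps of Task 2A (replace zeros with NA, merge, normalise) can be illustrated on toy data. This is a sketch under assumptions: the counts below are made up, and the suffixes ".sandy" and ".total" are simply one way to keep the two Freq columns apart after the merge.

```r
# Toy stand-ins for the two hourly count data frames (values are made up).
sandyFlickrTS <- data.frame(Date = c("2012-10-29 21", "2012-10-29 22", "2012-10-29 23"),
                            Freq = c(12, 30, 45))
FlickrTS <- data.frame(Date = c("2012-10-29 21", "2012-10-29 22", "2012-10-29 23"),
                       Freq = c(4000, 0, 5000))

# 1. Treat implausible zero totals as missing.
FlickrTS$Freq[FlickrTS$Freq == 0] <- NA

# 2. Merge on Date; the suffixes label which count is which.
merged <- merge(sandyFlickrTS, FlickrTS, by = "Date",
                suffixes = c(".sandy", ".total"))

# 3. Normalised count = Hurricane Sandy photos / all photos.
merged <- transform(merged, ncount = Freq.sandy / Freq.total)
```

Merging before dividing pairs each Hurricane Sandy count with the total for the same hour, so the result does not depend on the two data frames sharing a row order.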
Part 3: Acquiring and processing the environmental data (2%)
As a hurricane approaches an area, atmospheric pressure falls. We can therefore use data on
atmospheric pressure as a measure of the hurricane’s progress.
TASK 3A (1%):
Hurricane Sandy made landfall very close to Atlantic City in New Jersey.
We can retrieve atmospheric pressure readings from Atlantic City from the following website:
http://www.ncdc.noaa.gov/data…
Click on User Interface Page. Select the Advanced Options. Agree to the terms.
You now want data for the United States.
On the next page, you want data for New Jersey, where you will retrieve data for selected stations.
On the next page, select the first entry for “Atlantic City”.
On the page after that, select Atmospheric Pressure Observation.
Pick the period we want the data for (from 20 October 2012 to 10 November 2012, the same as the
Flickr data) via “Use Date Range”. Do not select “Select Only Obs. on the hour”.
Output the data with comma delimiters, including the station name.
Continue, and on the next page enter your email address. The data will be sent to you shortly.
For assessment, submit the text file you can download from the link in the email you have received
labelled “Data File”.
TASK 3B (1%):
Once the data has arrived, read in the file. Leave out the first line, and the headers too, as there are no commas in the headers, making them trickier to parse. (Remember how you left out lines in the second
lab exercise when loading Google data.) This is one line of code.
The information you require is the date on which each reading was taken, the time at which it was taken, and the atmospheric pressure measurement.
Identify which columns contain the data you require using information in the other files which NOAA sent you. Create a data frame which contains only these three columns. This is one line of code. You will find the subset command useful for this.
Label the columns “Date”, “Time” and “AtmosPressure”. This is one line of code. Look at how you have
renamed columns in the labs!
For assessment, submit the code you wrote to process this data, and a CSV file of the data frame with three columns which you just created.
Hint 1: If you can’t work out which columns you need, look at the “format documentation” NOAA sent you.
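The three steps of Task 3B (read, subset, rename) could look like the sketch below. Everything about the file is an assumption for illustration: the sample text, the station identifier, the pressure values and the column positions V3, V4 and V5 are made up, and the real positions have to be taken from the format documentation.

```r
# Made-up sample standing in for the NOAA text file: one title line,
# one header line without commas, then comma-delimited readings.
rawText <- c("Sample station report",
             "Name USAF Date HrMn Slp",
             "ATLANTIC CITY,999999,20121029,2154,0963.2",
             "ATLANTIC CITY,999999,20121029,2254,0952.1")
noaaFile <- tempfile(fileext = ".txt")
writeLines(rawText, noaaFile)

# Skip the title and header lines; columns get default names V1, V2, ...
# Reading everything as text keeps leading zeros in the time column.
noaa <- read.csv(noaaFile, skip = 2, header = FALSE,
                 colClasses = "character", strip.white = TRUE)

# Keep only the three columns of interest (positions are assumptions).
pressure <- subset(noaa, select = c(V3, V4, V5))

# Rename them as the task asks, then make the pressure numeric.
names(pressure) <- c("Date", "Time", "AtmosPressure")
pressure$AtmosPressure <- as.numeric(pressure$AtmosPressure)
```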
Part 4: Combining the Flickr and environmental data (2%)
Now you have the Flickr data and the environmental data.
To work out how these data sets relate, you need to merge them.
For each hour from the beginning of 20 October 2012 to 10 November 2012, you have both a normalised count of the number of Hurricane Sandy Flickr photos taken, and a measurement of atmospheric pressure in Atlantic City.
However, the atmospheric pressure data uses a different format than the Flickr data for specifying the date and time.
You need to work out how to change the format of the atmospheric pressure date and time, so that it matches the format used in the Flickr data. This might need about three lines of code, but there are lots of different solutions.
You then need to merge the two datasets to create one data frame, where each row represents one hour, and contains a measurement of the atmospheric pressure and the normalised count of Hurricane Sandy Flickr photos. This is one line of code. The merge function will be useful here.
For assessment, submit the code you wrote to process this data, and a CSV file of the data frame with the atmospheric pressure data and Flickr counts you created.
Hint 1: There are many different ways to change the date format! You might want to look at as.POSIXct(), which does something similar to as.Date(), but can represent times as well as dates.
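One possible way to align the two date formats and merge is sketched below on toy data. The pressure layout ("YYYYMMDD" dates and "HHMM" times) and the Flickr layout ("YYYY-MM-DD HH:00:00") are assumptions, the values are made up, and averaging the readings within an hour is just one option (taking the reading closest to the hour is another).

```r
# Toy inputs (made-up values; the date and time layouts are assumptions).
pressure <- data.frame(Date = c("20121029", "20121029", "20121029"),
                       Time = c("2154", "2254", "2230"),
                       AtmosPressure = c(963.2, 952.1, 955.0))
flickr <- data.frame(Date = c("2012-10-29 21:00:00", "2012-10-29 22:00:00"),
                     ncount = c(0.003, 0.009))

# Parse date and time together, then rewrite each reading as the start
# of its hour, in the same text format as the Flickr Date column.
stamp <- as.POSIXct(paste(pressure$Date, pressure$Time),
                    format = "%Y%m%d %H%M", tz = "UTC")
pressure$Date <- format(stamp, "%Y-%m-%d %H:00:00")

# Several readings can fall within one hour, so average them per hour.
hourlyPressure <- aggregate(AtmosPressure ~ Date, data = pressure, FUN = mean)

# One row per hour with both measurements.
combined <- merge(flickr, hourlyPressure, by = "Date")
```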
Part 5: Visualising and analysing the data (2%)
Now you have your data organised, you can plot some graphs to take a look at your data, and begin to analyse the relationship between these two time series.
TASK 5A (1%):
First, use ggplot to create a line graph of the normalised Hurricane Sandy Flickr photos time series, so we can see how this count changed across time.
Second, use ggplot to create a line graph of the atmospheric pressure in Atlantic City, so that we can see how atmospheric pressure changed across time.
Make the plots look as nice as you can in ggplot. Include these two plots as two panels of the same figure in your PDF answer sheet. Below your figure, write a caption describing what your figure shows.
For assessment, submit the code you wrote to create these figures, and your figures and caption in your PDF answer sheet as described above.
Hint 1: If you’re not sure what a figure caption should look like, look at some of the papers we have provided as further reading for examples.
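A two-panel figure of this kind can be sketched with ggplot2 alone by stacking both series into one long data frame and faceting. The data here are synthetic stand-ins, not real measurements, and the column names Date, ncount and AtmosPressure are assumed to match the merged data frame from Part 4.

```r
library(ggplot2)

# Synthetic stand-in data (not real measurements): 48 hourly values.
hours <- seq(as.POSIXct("2012-10-29 00:00:00", tz = "UTC"),
             by = "1 hour", length.out = 48)
bump <- exp(-((seq_along(hours) - 24) / 8)^2)
combined <- data.frame(Date = hours,
                       ncount = 0.01 * bump,
                       AtmosPressure = 1010 - 50 * bump)

# Long format: one row per hour and series, so facets act as panels.
long <- rbind(
  data.frame(Date = combined$Date,
             series = "Normalised Hurricane Sandy photos",
             value = combined$ncount),
  data.frame(Date = combined$Date,
             series = "Atmospheric pressure, Atlantic City",
             value = combined$AtmosPressure))

# One panel per series, stacked vertically, each with its own y scale.
sandyPlot <- ggplot(long, aes(x = Date, y = value)) +
  geom_line() +
  facet_wrap(~ series, ncol = 1, scales = "free_y") +
  labs(x = "Date and hour", y = NULL) +
  theme_bw()
```

Printing sandyPlot draws the figure. Stacking the panels on a shared time axis makes it easier to compare when each series changes.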
TASK 5B (1%):
Finally, in your answer sheet PDF, explain how you would carry out a correlational analysis to determine whether there is a relationship between these two time series.
Would your analysis make any assumptions about the distribution of the data? In R, create any graphs and run any tests you need to run to check these assumptions.
Now carry out a correlational analysis. In your answer sheet PDF, write a short description of the results you have found. Keep this under 150 words.
For assessment, submit your answer sheet PDF, describing your analysis method and your results as specified above. Submit any code you wrote to check assumptions for your analysis and carry your analysis out. Include any graphs you generated in the PDF with a short caption to explain what they show.
Hint 1: There are various ways of analysing such relationships. For the purposes of this assessment, we will restrict the analysis to a correlational analysis.
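One way to approach Task 5B is sketched below: check normality with a Shapiro-Wilk test, then fall back on a rank-based (Spearman) correlation if normality is doubtful. The two series are synthetic stand-ins generated for illustration, not the real data, and the choice of Spearman is one reasonable option, not the required method.

```r
set.seed(1)

# Synthetic stand-in series (not real data): pressure dips while the
# share of Hurricane Sandy photos rises.
n <- 100
storm <- exp(-((seq_len(n) - 50) / 25)^2)
AtmosPressure <- 1010 - 50 * storm + rnorm(n, sd = 1)
ncount <- 0.01 * storm + rnorm(n, sd = 0.0005)

# Pearson's correlation assumes roughly normal data; test each series.
normPressure <- shapiro.test(AtmosPressure)
normCount <- shapiro.test(ncount)

# Rank-based correlation, which does not assume normality.
result <- cor.test(ncount, AtmosPressure, method = "spearman", exact = FALSE)
result$estimate  # Spearman's rho
result$p.value
```

Histograms or Q-Q plots (hist, qqnorm) are a useful visual complement to the tests. Hourly series are autocorrelated, so the p-value should be read with some caution.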