Big Data Basics INFS 5095
Supp. Assessment: Working with HDFS and MapReduce
Submission
Part 1: Submit a Word document containing all your answers and screenshots. The font size should
be 11 or 12, with single line spacing. Images should be clearly visible and readable at normal
reading settings.
Part 2: Question-and-answer session via Zoom on Dec 22nd, 2021 or Dec 23rd, 2021
Assessment Aims
This assessment aims to demonstrate your understanding of how big data are stored in HDFS and
processed using MapReduce. This knowledge will assist you as a graduate, as distributed file
systems running on clusters of commodity machines are often used to process large amounts of data.
A popular framework for processing such distributed data is MapReduce. The results of this type of
analysis will also need to be communicated effectively to clients.
The assessment addresses the following course objectives:
[CO3]. Understand and apply standard processes and industry standard tools to acquire, store and
prepare big data sets.
[CO4]. Select an appropriate big data analytics tool and apply it to a problem involving big data.
[CO5]. Communicate appropriately with professional colleagues through visualisation and report.
Assessment Description
In this assessment, you will
• Store document files in Hadoop Distributed File System (HDFS).
• Write a MapReduce program in Python to solve a typical application in text analysis.
• Write a report for professional data analysts and submit a document file containing all answers
and screenshots via the submission point. – Max 500 words
Assessment Details
Text analytics is the process of deriving high-quality information from documents. Text analysis parses
the contents of a document and creates structured data from its free-text contents. A typical
application in text analytics is counting the words in a set of documents to identify the trending
words, i.e. the most frequently mentioned or most referenced words.
You will write a MapReduce program that will read a document and compute the top K most frequent
words in the document, where K can be any positive integer. The output of the program will be a text
file with one word and its count per line, separated by a tab.
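The required mapper/reducer/top-K pipeline can be sketched as plain Python functions (the function names here are illustrative, not a required interface; a real Hadoop Streaming job would read lines from standard input and print tab-separated pairs instead):

```python
from collections import Counter


def map_words(line):
    """Mapper step: emit a (word, 1) pair for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reducer step: sum the counts for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


def top_k(counts, k):
    """Keep the K most frequent words, formatted as word<TAB>count lines."""
    return ["%s\t%d" % (word, count) for word, count in counts.most_common(k)]


lines = ["the cat sat", "the cat ran"]
pairs = [p for line in lines for p in map_words(line)]
print("\n".join(top_k(reduce_counts(pairs), 2)))
```

The same three steps map directly onto a streaming job: the mapper script emits the pairs, Hadoop's shuffle groups them by word, and the reducer aggregates and selects the top K.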
Your program should be able to handle any document file. It should also perform preprocessing,
such as removing extra spaces and symbols such as punctuation marks (e.g. . ; , ! ?), and it should
not count articles (a, an, the) or other non-significant words such as prepositions and conjunctions.
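A minimal sketch of this preprocessing step, assuming a small hand-picked stopword set (a real submission would need a fuller list covering the articles, prepositions, and conjunctions present in the data):

```python
import string

# Illustrative stopword set only; extend as needed for the actual dataset.
STOPWORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to"}


def preprocess(line):
    """Lowercase a line, strip punctuation, and drop stopwords."""
    cleaned = line.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOPWORDS]


print(preprocess("The cat, and the dog!"))  # -> ['cat', 'dog']
```

Calling this at the start of the mapper ensures that only significant words ever reach the reducer.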
You will use a modified version of a tweets dataset as the input data for this practical. The dataset
contains 5000 social media messages. The text document (tweets.txt) is given in the
Practical2_Resources folder on the course website.
Marking Scheme
MapReduce Program – 5 Marks
Output File – 2 Marks
Report File – 3 Marks