Backend: Big-Data-Basics-INFS-5095

Big Data Basics INFS 5095
Supp. Assessment: Working with HDFS and MapReduce
Submission
Part 1: Submit a Word document containing all your answers and screenshots. Use a font size of
11 or 12 with single line spacing. Images should be clearly visible and readable at normal
reading size.
Part 2: Question-and-answer session on Zoom on Dec 22nd, 2021 or Dec 23rd, 2021
Assessment Aims
This assessment aims to demonstrate your understanding of how big data are stored in HDFS and
processed using MapReduce. This knowledge will assist you as a graduate, because distributed file
systems that run on community software are often used to process large amounts of data. MapReduce
is a popular framework for processing such distributed data. The results of this type of analysis
will also need to be communicated effectively to clients.
The assessment addresses the following course objectives:
[CO3]. Understand and apply standard processes and industry standard tools to acquire, store and
prepare big data sets.
[CO4]. Select an appropriate big data analytics tool and apply it to a problem involving big data.
[CO5]. Communicate appropriately with professional colleagues through visualisation and report.
Assessment Description
In this assessment, you will
• Store document files in Hadoop Distributed File System (HDFS).
• Write a MapReduce program in Python to solve a typical application in text analysis.
• Write a report (max 500 words) for a professional data analyst and submit a document file
containing all answers and screenshots via the submission point.
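The first step above, storing a file in HDFS, can be sketched in Python by wrapping the standard `hdfs dfs` shell commands. This is a minimal sketch only: the function names and the example target directory `/user/student/input` are hypothetical, and it assumes the `hdfs` client is installed and on the PATH of the machine where it runs.

```python
import subprocess


def hdfs_upload_commands(local_path, hdfs_dir):
    """Build the `hdfs dfs` commands that create a directory, upload a file, and list it.

    Both arguments are illustrative; substitute the paths used in your own environment.
    """
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],           # create the target directory
        ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir],  # upload, overwriting if present
        ["hdfs", "dfs", "-ls", hdfs_dir],                     # list the directory to confirm
    ]


def store_in_hdfs(local_path, hdfs_dir):
    """Run the commands in order; raises CalledProcessError if any of them fails."""
    for cmd in hdfs_upload_commands(local_path, hdfs_dir):
        subprocess.run(cmd, check=True)
```

Calling `store_in_hdfs("tweets.txt", "/user/student/input")` on a configured Hadoop node would run the three commands in sequence; the output of the final `-ls` is the kind of evidence a screenshot could capture.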
Assessment Details
Text analytics is the process of deriving high-quality information from documents. Text analysis parses
the contents of a document and creates structured data out of its free-text contents. A
typical application in text analytics is counting the words in a set of documents and identifying the
trending words, i.e. the most talked-about or most frequently referenced words.
You will write a MapReduce program that will read a document and compute the top K most frequent
words in the document, where K will be any integer value. The output of the program will be a text
file with one word and count per line, the word and count separated by a tab.
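The top-K computation described above can be illustrated with a small in-memory simulation of the map, shuffle-and-sort, and reduce phases. This is a sketch under stated assumptions rather than a reference solution: the function names are hypothetical, tokens are split on whitespace only, ties are broken alphabetically (the brief does not specify a tie rule), and a real Hadoop job would run the mapper and reducer as separate scripts that read from standard input.

```python
import heapq
from itertools import groupby


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts per word. Input must be sorted by word."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def top_k(counts, k):
    """Select the k most frequent words; ties are broken alphabetically (an assumption)."""
    return heapq.nsmallest(k, counts, key=lambda kv: (-kv[1], kv[0]))


def run(lines, k):
    """Simulate the whole job and format each result as 'word<TAB>count'."""
    shuffled = sorted(mapper(lines))  # stands in for Hadoop's shuffle-and-sort step
    return ["%s\t%d" % (word, count) for word, count in top_k(reducer(shuffled), k)]
```

For instance, `run(["big data big", "data big hdfs"], 2)` yields the two lines `big<TAB>3` and `data<TAB>2`, matching the required output format of one word and count per line.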
Your program should be able to handle any document file. The program should also perform
preprocessing, such as removing extra spaces and punctuation marks (e.g. the full stop, semicolon,
comma, exclamation mark and question mark), and it should ignore the articles (a, an, the) and
other non-significant words such as prepositions and conjunctions.
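The preprocessing described above might look like the following sketch. The stop-word list is a short illustrative sample and not a list supplied by the course; the `preprocess` name and the choice to keep apostrophes inside words are likewise assumptions.

```python
import re

# Illustrative stop words (an assumption): the articles plus a few common
# prepositions and conjunctions. Extend this set for real use.
STOP_WORDS = frozenset({
    "a", "an", "the",
    "in", "on", "at", "of", "to", "for", "with", "by", "from",
    "and", "or", "but", "nor", "so", "yet",
})

# A token is a run of lowercase letters, digits, or apostrophes; everything
# else (spaces, punctuation, symbols) acts as a separator.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def preprocess(text):
    """Lowercase the text, drop punctuation and extra spaces, and remove stop words."""
    tokens = (token.strip("'") for token in TOKEN_RE.findall(text.lower()))
    return [token for token in tokens if token and token not in STOP_WORDS]
```

With this sample list, `preprocess("The cat, an owl; and a dog!")` returns `["cat", "owl", "dog"]`. A function like this would typically be called inside the map step, before each word is emitted.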

You will use a modified version of the tweets dataset as the input data for this practical. The dataset
contains 5000 social media messages. The text document (tweets.txt) is provided in the
Practical2_Resources folder on the course website.
Marking Scheme
MapReduce Program – 5 Marks
Output File – 2 Marks
Report File – 3 Marks
