811312A Data Structures and Algorithms 2020-2021 Assignment, Instructions
This document describes both general and specific requirements for the assignment of the course
Data Structures and Algorithms.
- General Instructions
Everyone shall return one assignment implemented in C. Return the solution to course’s Moodle
workspace. The returned work will be graded as approved/disproved. If not approvable, a returned
solution can be revised one time, according to inspector’s advice.
The assignment is an individual work. You are not allowed to copy code from others.
However, the course materials, for instance exercise solutions and example programs can be used
freely. When utilizing already made solutions, referencing practices need to be followed and
explanation of how it has been modified. The solution must be provided with the standard features
of C language. The use of extra code libraries is not allowed.
The solution shall contain - program code and
- a report
in one zipped file.
The report must be written in English and it shall contain - description of the solution and
- analysis of the solution program.
Analysis contains both analysis of the solution algorithm and measurements of program
running times. Based on this, estimates for the performance of the solution algorithm should
be provided. The code should be commented to improve its readability, but no other
documentation is required. The following description of the assignment describes the contents of
the report in more detail.
The assignment for this instance of the course shall be returned at latest 9.4.2021 23.59 - Description of the Assignment
In this assignment, you shall find the 100 most frequently occurring words from a large text
file. The program shall be implemented in C language. A continuous string of characters a..z and
A..Z, with possible apostrophes’, is considered a word. Words with uppercase and lowercase
letters are considered equal. For example, in the text
Herman Melville’s book Moby Dick starts, as we all know, with the sentence”Call me Ishmael”.
the words are herman, melville’s, book, moby, dick, starts, as, we, all, know, with, the, sentence,
call, me, and ishmael
The name of the input text file shall be given as an input from the user. The program shall print the - most frequently occurring words and their frequencies. The words are printed in descending
order according to their frequencies.
Because the file can be very large, you need a suitable data structure to store words and their
frequencies. This can be, for instance, a hash table or a binary search tree. You can find the most
frequent words by sorting the structure with some fast algorithm.
There are examples of input files and corresponding outputs in Moodle. All the files are texts
written in English. You should use these to test your program. The size of the input is
measured as the number of words in the text file. Measure the running times of your program
with given test input files and make an estimate, how many words there can be in a file that you can
process in 15 seconds. Based on this, estimate the maximum size (in megabytes) of files that your
program can process in the mentioned time, when we know that the average word length in English
is 5.1 letters.
In addition to program code, you shall return a report that describes - the solution, and
- analysis of program’s performance as mentioned above.
Analysis contains thus the measurements of program’s running times and the mentioned estimates
for number of words and file size. The code shall be commented to improve readability, but no
other documentation is required. The report shall naturally contain student’s name and student
number. If you wish, you can also give feedback in your report.
Remember to zip the files before returning.