关于算法:CSE30记录读写算法

Assignment 2: Record Reader
CSE30: Computer Organization and Systems, Summer Session 1
2022
Instructors: John Eldon and Nishant Bhaskar
Due: Friday July 15th, 2022 @ 11:59PM
Please read over the entire assignment before starting to get a sense of what you will need to
get done in the next week. REMEMBER: Everyone procrastinates but it is important to know
that you are procrastinating and still leave yourself enough time to finish. Start early, start often.
You MUST run the assignment on the pi-cluster. You HAVE to SSH: You will not be able to
compile or run the assignment otherwise. The assignments are getting longer as this course
progresses.
ACADEMIC INTEGRITY REMINDER: you should do this assignment entirely on
your own with help only from course staff. Consulting with other students (past
or present) who are not in the course may result in an academic integrity violation
which can have serious consequences.
Need help or instructions? See CSE 30 FAQs
Table of Contents

Learning Goals
Assignment Overview
Getting Started
a. Developing Your Program
b. Running Your Program
How the Program Works
Submission and Grading
a. Submitting
b. Grading Breakdown [50 pts]
Learning Goals
● Parsing input
● Using pointers in C
● Using arrays of pointers
● Allocating memory at runtime with malloc()
Assignment Overview
A common format for moving data between systems is called a“Delimiter Separated Values”
(DSV) file. A DSV file uses a delimiter (‘,’, ‘#’, ‘ ‘, ‘\t’, etc) to separate values on a line
of data1. Each line of the file, or“data record”, ends with a newline2 (‘\n’). Each record
consists of one or more data fields (columns) separated by the delimiter. Fields are numbered
from left to right starting with column 1. A DSV file stores tabular data (numbers and text) in
plain text (ASCII strings). In a proper DSV file, each line will always have the same number of
fields (columns).
In the example above we have a sample of a Call Detail Record (CDR) in DSV format with the
delimiter as a comma. A CDR file describes a cell phone voice call from one phone to another
phone. Each record has 10 fields or columns. In this example, the first record of the file is a
label for that column (field). Each column can be empty, and the last column is never followed
by a ‘,’. It always ends with a ‘\n’ for every record.
The data that you will deal with in this assignment is sent in a DSV file where the delimiter is
whitespace.You must pick out the right columns from these DSVs to analyze.
In Unix systems, lines generally end with ‘\n’ which is ASCII 0x0a (linefeed). On Windows systems,
the end of line is sometimes 2 characters “\r\n” ASCII 0x0d (carriage return) followed by ASCII 0x0a
(linefeed). The Wikipedia page describes this awkward situation. Windows’use of CRLF dates back to
MS-DOS (1981). Unix, of course, is older. TOPS-10 (c 1967) used CRLF so it predates Unix. The terms
Linefeed and Carriage Return refer to old teletypes: A linefeed would scroll the paper to the next line and
a carriage return would move the print head to the left side of the paper. The Teletype Model 33 is a
famous one. (Professor Chin used this in elementary school!)
You may have noticed this option when saving a spreadsheet (Save As CSV file).
Getting Started
Developing Your Program
For this class, you MUST compile and run your programs on the pi-cluster. To access the
server, you will need your cs30wi22xx student account, and know how to SSH into the cluster.
We’ve provided you with the starter code at the following link:
https://github.com/cse30-su122/A2-starter.git
Download the files in the repository.
a. You can either use
git clone https://github.com/cse30-su122/A2-starter.git
directly from pi-cluster, or download the repo locally and scp the folder to
pi-cluster if that doesn’t work.
Fill out the fields in the README before turning in.
Running Your Program
We’ve provided you with a Makefile so compiling your program should be easy!
Additionally, we’ve placed the reference solution binary at:
/home/linux/ieng6/cs30wi22/public/bin/reader-a2-ref
You can use reader-a2-ref as a command as follows:
/home/linux/ieng6/cs30s122/public/bin/reader-a2-ref -c num_cols
< input-file
Makefile: The Makefile provided will create a reader executable from the source files
provided. Compile it by typing make into the terminal. By default, it will run with warnings turned
on. To compile without warnings, run make WARN= instead. Run make clean to remove files
generated by make.
How the Program Works
You will be given the number of columns in each record of the file and a list of column indices.
There may also be a flag, -c. Read input from stdin, parse it as a DSV, and print the
specified columns in the given order. If the -c flag was provided, you also need to print
statistics about the number of records in the input and the longest field (in terms of number of
characters).
Executable called with stdin from the command-line:
./reader [-c] num_cols
The angle brackets around“list of cols …”indicates it is a mandatory list.
It may look like the program hangs. It is waiting for you to type input and enter it on the
command-line. Type what you want and press enter to send that line to stdin. If you want a
more streamlined way to enter text, redirect stdin as shown below.
Executable called with stdin redirected from a file:
./reader [-c] num_cols < input_filename
“< input_filename”is not part of the command-line arguments, and you will find neither
“<”nor“input_filename”in argv! The“<”is the indirection operator. It is not part of C,
and is in fact part of the bash shell. It sends the contents of input_filename to your
program’s stdin. You can read more about redirection operators on GNU.org.
Delimiters
In our DSVs, fields will be delimited by whitespace of any length. For example, 2 spaces
(‘ ‘) is a single delimiter, as is 1 space (‘ ‘), 1 tab (‘\t’), 3 tabs (‘\t\t\t’), or 1 space
and 1 tab (‘ \t’), or similar combinations. Therefore, the following two DSVs would,
functionally, have the same records.
It means no worries
For the\trest of your days
It’s our problem-free philosophy
Hakuna\t\t Matata!
It means no worries
For the rest of your days
It’s our problem-free\tphilosophy
Hakuna \t Matata!
Once you have read one record (line) from stdin, your job is to delimit it and print out the
columns corresponding to the indices that are passed in.
Program Implementation
You will be given the following as command-line arguments:
● The number of columns in a record
● A list of column indices
○ This list of indices always comes after the number of columns.
● Optionally: The -c flag. This flag can only appear as the first argument.
Note: You should only choose from the following library functions: malloc(), free(),
atoi(), printf(), fprintf(), getline(), strcmp(), strlen(), and exit() in your
code. You may not use any other function such as strtok().
Input guarantees
You do not have to handle degenerate input – input that violates any of these guarantees.
● All records will have at least one column.
● All the records in the file will have the same number of columns.
● No record will begin with whitespace.
● There will not be any empty columns. (In fact, it is impossible to create one: Our delimiter
is any amount of whitespace, and the only remaining possible location, the beginning, is
ruled out because no record can begin with whitespace.)
● The -c flag only appears as the first argument. If it appears, it will be as argv[1].
● The value provided for“number of columns”on the command line will be positive.
● The indices are valid decimal integers.
● The indices provided will be in-bounds, i.e. if num_col is the number of columns in a
record, then the indices will always be in the range [-num_col, num_col – 1]
inclusive.
● The only whitespace characters will be space (‘ ‘), tab (‘\t’), and newline (‘\n’).
○ Only ‘ ‘ and ‘\t’ can be delimiters, since ‘\n’ indicates the end of a line.
Implementing Option Processing
First we will implement option processing in order to parse the optional -c flag.
The -c flag can only be used as the first argument. This means that, if it appears, it will always
be as argv[1]. You can compare two strings by using strcmp(). Its function signature
appears below. strcmp returns 0 if str1 and str2 are identical.
int strcmp(const char str1, const char str2)
You should set up a variable to indicate whether the -c flag appeared as argv[1]. You will use
it when you are collecting the data for statistics to print at the end.
Parsing the command-line arguments
The first argument that is not the -c flag will always be the number of columns in each record,
num_cols.
./reader 10 4 -1 9 4 ./reader -c 25 -10 11
In the leftmost example above, this would be the string“10”. In the rightmost example, this
would be the string“25”. As usual, you should convert this string to a number that you can use.
We will come back to num_cols in the next section.
The arguments following num_cols will be a list of columns that we want to print out, in the
order we want to print them. Indices can be negative as well as positive! We are following
Python’s indexing convention for the column indices, where negative indices count backwards
from the end of the array. For instance, if we have the array
int array[] = {10, 20, 30, 40};
then array[-1] would be 40, and array[-4] would be 10.
There can be duplicate numbers listed, and this list can be of any length. We won’t know until
runtime. To accommodate this, we need to use malloc() to allocate a dynamic amount of
memory for our storage (see example below).
Create an array of ints to store this list of indices. The number of elements in the array is equal
to the number of columns that will be written to stdout. This array’s size is determined by the
number of arguments passed on the command line after num_cols. Each entry will contain the
column index to be output, and the order of the indices within this array is the order to be
written. Let’s assume there were 4 output column arguments: 4, -1, 9, 4. Then the output buffer,
named *obuf here, would look like this.
Think about how you could calculate the number of indices, to allocate memory for int *obuf,
without needing to iterate over argv. argc may be useful here. However, you will need to
iterate over argv at least once to fill out the contents of obuf.
Parsing the record from stdin
To prepare for parsing the record from stdin, use malloc() to allocate an array of pointers to
chars. This array is named **buf in the picture below. The number of array entries is the
same as the number of columns in a record, num_cols, which we found earlier. The purpose of
this array is to store the beginning of each column of a record. Since the record is an array of
chars, a location within it will be the location of a char, and therefore be of type char *.
As an example, say we are processing records that have 10 columns. The input processing
array will look like this.
If we had the following record while processing records with 5 columns:
I never thought hyenas essential
Then buf would contain the following pointers:
Pointer to
“I”in“I”
Pointer to“n”in
“never”
Pointer to“t”in
“thought”
Pointer to“h”in
“hyenas”
Pointer to“e”in
“essential”
I never thought hyenas essential
Here is how you can allocate an array of 20 int pointers and an array of 15 chars (char*).
.Processing the input from stdin
Read one line from stdin into an input buffer, using getline(). The function prototype for
getline is as follows:
ssize_t getline(char* lineptr, size_t n, FILE*stream);
The last parameter to getline is a FILE *stream. In the previous assignment, this was
returned from fopen(). In C, there are 3 FILE pointers that are automatically opened for you :
stdin (standard input), stdout (standard output), and stderr (standard error). Input from
the terminal or from a redirected file is associated with stdin. Output to the terminal or a
redirected file are associated with stdout and stderr.
getline takes in a pointer to an input buffer (somewhere to store the characters it read), a
pointer to a variable to store the number of characters it read, and the source to read from. In
this example, let’s call this pointer readbuffer. You will need to set this input buffer
readbuffer as NULL so that getline can use it. getline internally does a malloc, which
means there is no need to malloc space for readbuffer. However, it does require that
readbuffer is initialized to NULL the first time getline is called with readbuffer3.
In our case, the source will be stdin. The characters read into the input buffer will always be
terminated with a ‘\0’. Try man getline for more information.
Once you’ve read the line, fill out the input array (char **buf). No record will start with
whitespace, so the first character of the record is the beginning of the first column. Set the first
element of the input array (buf[0]) to point to the beginning of the input buffer readbuffer.
Walk through the rest of readbuffer to continue filling out buf. As you walk through, search
for whitespace, and stop once you see the ‘\0’ that indicates the end of the buffer. For each
whitespace character located directly after a non-whitespace character, or the newline character
after the last column, replace it with a ‘\0’ to terminate that substring. Continue walking until
you see another non-whitespace character. This next non-whitespace character is the first
character of the next field, so store a pointer to it in the next entry of buf.
At the end of processing the input line, you should have an array buf filled with pointers to the
start of each column in the record, and each column is a properly ‘\0’-terminated C string.
You should not need to zero out the array before each pass of getline; getline will
overwrite the contents anyway.
Something to think about, why does this variable need to be char * instead of char ?
Before processing a record
Say we have an input buffer readbuffer that getline reads into, and it looks like this
pre-processing. Empty cells represent the space character.
T h e P r i d e \t \t L a n d s \n \0
After processing a record
This is what readbuffer would look like post-processing. Red cells are those that were
replaced with ‘\0’s, while blue cells are the whitespace characters that remained unchanged.
T h e \0 P r i d e \0 \t L a n d s \0 \0
The input array buf would have these contents. Notice that buf[2] does not point to the blue
cell in readbuffer. You should not assume that the next character after a whitespace is going
to be the start of the next record.
Pointer to“T”in“The”Pointer to“P”in“Pride”Pointer to“L”in“Lands”
T h e \0 P r i d e \0 \t L a n d s \0 \0
Printing
Once you have filled out buf, print the elements in buf corresponding to the indices you stored
in obuf. Loop over obuf and for each number ind in it, print buf[ind] to stdout using
printf. Why does this work? You’ve null-terminated each of the individual columns in the
record, which means each of these substrings is now a proper C string.
Place a space between every word you print out. However, do not print a space at the end of
the line.
Loop back to the call to getline and repeat the process until you read EOF, indicating that
there is no more input.
Statistics
If the -c flag is set, collect some statistics about the input while you are parsing it.
● Number of records in the input
● Longest field in the input (in terms of number of characters)
After you are done processing every record in the input, print out these statistics to stdout.
Formatting strings for these statistics are provided in the starter code, and you need to calculate
the numbers as stated.
Examples
Contents of circle-of-life.txt
In the circle of life
It’s the wheel of fortune
It’s the leap of faith
It’s the band of hope
‘Til we find our place
Command-line
./reader 5 0 3 < circle-of-life.txt
Output
In of
It’s of
It’s of
It’s of
‘Til our
Contents of circle-of-life.txt
In the circle of life
It’s the wheel of fortune
It’s the leap of faith
It’s the band of hope
‘Til we find our place
Command-line
./reader 5 2 -1 < circle-of-life.txt
Output
circle life
wheel fortune
leap faith
band hope
find place
Contents of be-prepared.txt
I know that your powers of retention
Are as wet as a warthog’s backside
But thick as you are, pay attention
My words are a matter of pride
Command-line
./reader 7 -7 1 -5 3 -3 5 -1 < be-prepared.txt
Output
I know that your powers of retention
Are as wet as a warthog’s backside
But thick as you are, pay attention
My words are a matter of pride
Contents of be-prepared.txt
I know that your powers of retention
Are as wet as a warthog’s backside
But thick as you are, pay attention
My words are a matter of pride
Command-line
./reader -c 7 -7 1 -5 3 -3 5 -1 < be-prepared.txt
Output
I know that your powers of retention
Are as wet as a warthog’s backside
But thick as you are, pay attention
My words are a matter of pride
Number of lines: 4
Longest field: 9 characters
Midpoint
This part of the assignment is due earlier than the full assignment, on
Sunday 7/10 at 11:59 pm PST. There are no late submissions on the
Midpoint.
Complete the Gradescope assignment“Midpoint: Programming Assignment 2”, an Online
Assignment that is done entirely through Gradescope. This assignment consists of a few quiz
questions and a free-response question where you will document your pseudocode or the
algorithm in plain English. This is a planning document and does not need to reflect your final
implementation, although you are encouraged to keep it up to date.
Describe the complete algorithm for your program. Cover all the steps required to parse the
input arguments, get a line of input from stdin, identify where each column begins, print the
specified columns, and collect statistics.
Submission and Grading
Submitting
Submit your files to Gradescope under the assignment titled“Programming Assignment
2”.
You will submit the following files:
reader.c
README.md
You can upload multiple files to Gradescope by holding CTRL (? on a Mac) while you
are clicking the files. You can also hold SHIFT to select all files between a start point
and an endpoint.
Alternatively, you can place all files in a folder and upload the folder to the assignment.
Gradescope will upload all files in the folder. You can also zip all of the files and upload
the .zip to the assignment. Ensure that the files you submit are not in a nested folder.
After submitting, the autograder will run a few tests:
a. Check that all required files were submitted.
b. Check that reader compiles.
c. Runs some sanity tests on the resulting reader executable.
Grading Breakdown [50 pts]
Make sure to check the autograder output after submitting! We will be running
additional tests after the deadline passes to determine your final grade. Also, throughout this
course, make sure to write your own test cases. It is bad practice to rely on the minimal
public autograder tests as this is an insufficient test of your program.
To encourage you to write your own tests, we are not providing any public tests that have
not already been detailed in this writeup.
The assignment will be graded out of 50 points and will be allocated as follows:
● Midpoint writeup: 5 points. This part of the assignment is due earlier than the full
assignment, on Sunday 7/10 at 11:59 pm PST. Complete the Gradescope
assignment“Midpoint: Programming Assignment 2”.
● Code compiles without warnings: 1 point.
● No Memory Leaks : 4 points
● Public tests with the provided examples : 16 points
● Private tests with hidden test cases : 24 points
NOTE: The tests expect an EXACT output match with the reference binary. There will be
NO partial credit for any differences in the output. Test your code – do NOT rely on the
autograder for program validation (read this sentence again).
Make sure your assignment compiles correctly through the provided Makefile on the
pi-cluster without warnings. Any assignment that does not compile will receive 0 credit.
Note: You should only choose from the following library functions: malloc(), free(),
atoi(), printf(), fprintf(), getline(), strcmp(), strlen(), and exit() in your
code. You may not use any other function such as strtok().
Checking For Exact Output Match
A common technique is to redirect the outputs to files and compare against the reference
solution4:
./your-program args > output; our-reference args > ref
diff -s output ref
This command will output lines that differ with a < or > in front to tell which file the line came
from.

关于算法:CSE30记录读写算法

Contents of circle-of-life.txt

Command-line

Output

Contents of circle-of-life.txt

Command-line

Output

Contents of be-prepared.txt

Command-line

Output

Contents of be-prepared.txt

Command-line

Output

Just My Socks（注册教程 内含优惠码）

Just My Socks（注册教程内含优惠码）