关于算法:CSE30记录读写算法

61次阅读

共计 16981 个字符,预计需要花费 43 分钟才能阅读完成。

Assignment 2: Record Reader
CSE30: Computer Organization and Systems, Summer Session 1
2022
Instructors: John Eldon and Nishant Bhaskar
Due: Friday July 15th, 2022 @ 11:59PM
Please read over the entire assignment before starting to get a sense of what you will need to
get done in the next week. REMEMBER: Everyone procrastinates but it is important to know
that you are procrastinating and still leave yourself enough time to finish. Start early, start often.
You MUST run the assignment on the pi-cluster. You HAVE to SSH: You will not be able to
compile or run the assignment otherwise. The assignments are getting longer as this course
progresses.
ACADEMIC INTEGRITY REMINDER: you should do this assignment entirely on
your own with help only from course staff. Consulting with other students (past
or present) who are not in the course may result in an academic integrity violation
which can have serious consequences.
Need help or instructions? See CSE 30 FAQs
Table of Contents

  1. Learning Goals
  2. Assignment Overview
  3. Getting Started
    a. Developing Your Program
    b. Running Your Program
  4. How the Program Works
  5. Submission and Grading
    a. Submitting
    b. Grading Breakdown [50 pts]
    Learning Goals
    ● Parsing input
    ● Using pointers in C
    ● Using arrays of pointers
    ● Allocating memory at runtime with malloc()
    Assignment Overview
    A common format for moving data between systems is called a“Delimiter Separated Values”
    (DSV) file. A DSV file uses a delimiter (‘,’, ‘#’, ‘ ‘, ‘\t’, etc) to separate values on a line
    of data1. Each line of the file, or“data record”, ends with a newline2 (‘\n’). Each record
    consists of one or more data fields (columns) separated by the delimiter. Fields are numbered
    from left to right starting with column 1. A DSV file stores tabular data (numbers and text) in
    plain text (ASCII strings). In a proper DSV file, each line will always have the same number of
    fields (columns).
    In the example above we have a sample of a Call Detail Record (CDR) in DSV format with the
    delimiter as a comma. A CDR file describes a cell phone voice call from one phone to another
    phone. Each record has 10 fields or columns. In this example, the first record of the file is a
    label for that column (field). Each column can be empty, and the last column is never followed
    by a ‘,’. It always ends with a ‘\n’ for every record.
    The data that you will deal with in this assignment is sent in a DSV file where the delimiter is
    whitespace.You must pick out the right columns from these DSVs to analyze.
  6. In Unix systems, lines generally end with ‘\n’ which is ASCII 0x0a (linefeed). On Windows systems,
    the end of line is sometimes 2 characters “\r\n” ASCII 0x0d (carriage return) followed by ASCII 0x0a
    (linefeed). The Wikipedia page describes this awkward situation. Windows’use of CRLF dates back to
    MS-DOS (1981). Unix, of course, is older. TOPS-10 (c 1967) used CRLF so it predates Unix. The terms
    Linefeed and Carriage Return refer to old teletypes: A linefeed would scroll the paper to the next line and
    a carriage return would move the print head to the left side of the paper. The Teletype Model 33 is a
    famous one. (Professor Chin used this in elementary school!)
  7. You may have noticed this option when saving a spreadsheet (Save As CSV file).
    Getting Started
    Developing Your Program
    For this class, you MUST compile and run your programs on the pi-cluster. To access the
    server, you will need your cs30wi22xx student account, and know how to SSH into the cluster.
    We’ve provided you with the starter code at the following link:
    https://github.com/cse30-su122/A2-starter.git
  8. Download the files in the repository.
    a. You can either use
    git clone https://github.com/cse30-su122/A2-starter.git
    directly from pi-cluster, or download the repo locally and scp the folder to
    pi-cluster if that doesn’t work.
  9. Fill out the fields in the README before turning in.
    Running Your Program
    We’ve provided you with a Makefile so compiling your program should be easy!
    Additionally, we’ve placed the reference solution binary at:
    /home/linux/ieng6/cs30wi22/public/bin/reader-a2-ref
    You can use reader-a2-ref as a command as follows:
    /home/linux/ieng6/cs30s122/public/bin/reader-a2-ref -c num_cols
    < input-file
    Makefile: The Makefile provided will create a reader executable from the source files
    provided. Compile it by typing make into the terminal. By default, it will run with warnings turned
    on. To compile without warnings, run make WARN= instead. Run make clean to remove files
    generated by make.
    How the Program Works
    You will be given the number of columns in each record of the file and a list of column indices.
    There may also be a flag, -c. Read input from stdin, parse it as a DSV, and print the
    specified columns in the given order. If the -c flag was provided, you also need to print
    statistics about the number of records in the input and the longest field (in terms of number of
    characters).
    Executable called with stdin from the command-line:
    ./reader [-c] num_cols
    The angle brackets around“list of cols …”indicates it is a mandatory list.
    It may look like the program hangs. It is waiting for you to type input and enter it on the
    command-line. Type what you want and press enter to send that line to stdin. If you want a
    more streamlined way to enter text, redirect stdin as shown below.
    Executable called with stdin redirected from a file:
    ./reader [-c] num_cols < input_filename
    “< input_filename”is not part of the command-line arguments, and you will find neither
    “<”nor“input_filename”in argv! The“<”is the indirection operator. It is not part of C,
    and is in fact part of the bash shell. It sends the contents of input_filename to your
    program’s stdin. You can read more about redirection operators on GNU.org.
    Delimiters
    In our DSVs, fields will be delimited by whitespace of any length. For example, 2 spaces
    (‘ ‘) is a single delimiter, as is 1 space (‘ ‘), 1 tab (‘\t’), 3 tabs (‘\t\t\t’), or 1 space
    and 1 tab (‘ \t’), or similar combinations. Therefore, the following two DSVs would,
    functionally, have the same records.
    It means no worries
    For the\trest of your days
    It’s our problem-free philosophy
    Hakuna\t\t Matata!
    It means no worries
    For the rest of your days
    It’s our problem-free\tphilosophy
    Hakuna \t Matata!
    Once you have read one record (line) from stdin, your job is to delimit it and print out the
    columns corresponding to the indices that are passed in.
    Program Implementation
    You will be given the following as command-line arguments:
    ● The number of columns in a record
    ● A list of column indices
    ○ This list of indices always comes after the number of columns.
    ● Optionally: The -c flag. This flag can only appear as the first argument.
    Note: You should only choose from the following library functions: malloc(), free(),
    atoi(), printf(), fprintf(), getline(), strcmp(), strlen(), and exit() in your
    code. You may not use any other function such as strtok().
    Input guarantees
    You do not have to handle degenerate input – input that violates any of these guarantees.
    ● All records will have at least one column.
    ● All the records in the file will have the same number of columns.
    ● No record will begin with whitespace.
    ● There will not be any empty columns. (In fact, it is impossible to create one: Our delimiter
    is any amount of whitespace, and the only remaining possible location, the beginning, is
    ruled out because no record can begin with whitespace.)
    ● The -c flag only appears as the first argument. If it appears, it will be as argv[1].
    ● The value provided for“number of columns”on the command line will be positive.
    ● The indices are valid decimal integers.
    ● The indices provided will be in-bounds, i.e. if num_col is the number of columns in a
    record, then the indices will always be in the range [-num_col, num_col – 1]
    inclusive.
    ● The only whitespace characters will be space (‘ ‘), tab (‘\t’), and newline (‘\n’).
    ○ Only ‘ ‘ and ‘\t’ can be delimiters, since ‘\n’ indicates the end of a line.
    Implementing Option Processing
    First we will implement option processing in order to parse the optional -c flag.
    The -c flag can only be used as the first argument. This means that, if it appears, it will always
    be as argv[1]. You can compare two strings by using strcmp(). Its function signature
    appears below. strcmp returns 0 if str1 and str2 are identical.
    int strcmp(const char str1, const char str2)
    You should set up a variable to indicate whether the -c flag appeared as argv[1]. You will use
    it when you are collecting the data for statistics to print at the end.
    Parsing the command-line arguments
    The first argument that is not the -c flag will always be the number of columns in each record,
    num_cols.
    ./reader 10 4 -1 9 4 ./reader -c 25 -10 11
    In the leftmost example above, this would be the string“10”. In the rightmost example, this
    would be the string“25”. As usual, you should convert this string to a number that you can use.
    We will come back to num_cols in the next section.
    The arguments following num_cols will be a list of columns that we want to print out, in the
    order we want to print them. Indices can be negative as well as positive! We are following
    Python’s indexing convention for the column indices, where negative indices count backwards
    from the end of the array. For instance, if we have the array
    int array[] = {10, 20, 30, 40};
    then array[-1] would be 40, and array[-4] would be 10.
    There can be duplicate numbers listed, and this list can be of any length. We won’t know until
    runtime. To accommodate this, we need to use malloc() to allocate a dynamic amount of
    memory for our storage (see example below).
    Create an array of ints to store this list of indices. The number of elements in the array is equal
    to the number of columns that will be written to stdout. This array’s size is determined by the
    number of arguments passed on the command line after num_cols. Each entry will contain the
    column index to be output, and the order of the indices within this array is the order to be
    written. Let’s assume there were 4 output column arguments: 4, -1, 9, 4. Then the output buffer,
    named *obuf here, would look like this.
    Think about how you could calculate the number of indices, to allocate memory for int *obuf,
    without needing to iterate over argv. argc may be useful here. However, you will need to
    iterate over argv at least once to fill out the contents of obuf.
    Parsing the record from stdin
    To prepare for parsing the record from stdin, use malloc() to allocate an array of pointers to
    chars. This array is named **buf in the picture below. The number of array entries is the
    same as the number of columns in a record, num_cols, which we found earlier. The purpose of
    this array is to store the beginning of each column of a record. Since the record is an array of
    chars, a location within it will be the location of a char, and therefore be of type char *.
    As an example, say we are processing records that have 10 columns. The input processing
    array will look like this.
    If we had the following record while processing records with 5 columns:
    I never thought hyenas essential
    Then buf would contain the following pointers:
    Pointer to
    “I”in“I”
    Pointer to“n”in
    “never”
    Pointer to“t”in
    “thought”
    Pointer to“h”in
    “hyenas”
    Pointer to“e”in
    “essential”
    I never thought hyenas essential
    Here is how you can allocate an array of 20 int pointers and an array of 15 chars (char*).
    .Processing the input from stdin
    Read one line from stdin into an input buffer, using getline(). The function prototype for
    getline is as follows:
    ssize_t getline(char* lineptr, size_t n, FILE*stream);
    The last parameter to getline is a FILE *stream. In the previous assignment, this was
    returned from fopen(). In C, there are 3 FILE pointers that are automatically opened for you :
    stdin (standard input), stdout (standard output), and stderr (standard error). Input from
    the terminal or from a redirected file is associated with stdin. Output to the terminal or a
    redirected file are associated with stdout and stderr.
    getline takes in a pointer to an input buffer (somewhere to store the characters it read), a
    pointer to a variable to store the number of characters it read, and the source to read from. In
    this example, let’s call this pointer readbuffer. You will need to set this input buffer
    readbuffer as NULL so that getline can use it. getline internally does a malloc, which
    means there is no need to malloc space for readbuffer. However, it does require that
    readbuffer is initialized to NULL the first time getline is called with readbuffer3.
    In our case, the source will be stdin. The characters read into the input buffer will always be
    terminated with a ‘\0’. Try man getline for more information.
    Once you’ve read the line, fill out the input array (char **buf). No record will start with
    whitespace, so the first character of the record is the beginning of the first column. Set the first
    element of the input array (buf[0]) to point to the beginning of the input buffer readbuffer.
    Walk through the rest of readbuffer to continue filling out buf. As you walk through, search
    for whitespace, and stop once you see the ‘\0’ that indicates the end of the buffer. For each
    whitespace character located directly after a non-whitespace character, or the newline character
    after the last column, replace it with a ‘\0’ to terminate that substring. Continue walking until
    you see another non-whitespace character. This next non-whitespace character is the first
    character of the next field, so store a pointer to it in the next entry of buf.
    At the end of processing the input line, you should have an array buf filled with pointers to the
    start of each column in the record, and each column is a properly ‘\0’-terminated C string.
    You should not need to zero out the array before each pass of getline; getline will
    overwrite the contents anyway.
  10. Something to think about, why does this variable need to be char * instead of char ?
    Before processing a record
    Say we have an input buffer readbuffer that getline reads into, and it looks like this
    pre-processing. Empty cells represent the space character.
    T h e P r i d e \t \t L a n d s \n \0
    After processing a record
    This is what readbuffer would look like post-processing. Red cells are those that were
    replaced with ‘\0’s, while blue cells are the whitespace characters that remained unchanged.
    T h e \0 P r i d e \0 \t L a n d s \0 \0
    The input array buf would have these contents. Notice that buf[2] does not point to the blue
    cell in readbuffer. You should not assume that the next character after a whitespace is going
    to be the start of the next record.
    Pointer to“T”in“The”Pointer to“P”in“Pride”Pointer to“L”in“Lands”
    T h e \0 P r i d e \0 \t L a n d s \0 \0
    Printing
    Once you have filled out buf, print the elements in buf corresponding to the indices you stored
    in obuf. Loop over obuf and for each number ind in it, print buf[ind] to stdout using
    printf. Why does this work? You’ve null-terminated each of the individual columns in the
    record, which means each of these substrings is now a proper C string.
    Place a space between every word you print out. However, do not print a space at the end of
    the line.
    Loop back to the call to getline and repeat the process until you read EOF, indicating that
    there is no more input.
    Statistics
    If the -c flag is set, collect some statistics about the input while you are parsing it.
    ● Number of records in the input
    ● Longest field in the input (in terms of number of characters)
    After you are done processing every record in the input, print out these statistics to stdout.
    Formatting strings for these statistics are provided in the starter code, and you need to calculate
    the numbers as stated.
    Examples

    Contents of circle-of-life.txt

    In the circle of life
    It’s the wheel of fortune
    It’s the leap of faith
    It’s the band of hope
    ‘Til we find our place

    Command-line

    ./reader 5 0 3 < circle-of-life.txt

    Output

    In of
    It’s of
    It’s of
    It’s of
    ‘Til our

    Contents of circle-of-life.txt

    In the circle of life
    It’s the wheel of fortune
    It’s the leap of faith
    It’s the band of hope
    ‘Til we find our place

    Command-line

    ./reader 5 2 -1 < circle-of-life.txt

    Output

    circle life
    wheel fortune
    leap faith
    band hope
    find place

    Contents of be-prepared.txt

    I know that your powers of retention
    Are as wet as a warthog’s backside
    But thick as you are, pay attention
    My words are a matter of pride

    Command-line

    ./reader 7 -7 1 -5 3 -3 5 -1 < be-prepared.txt

    Output

    I know that your powers of retention
    Are as wet as a warthog’s backside
    But thick as you are, pay attention
    My words are a matter of pride

    Contents of be-prepared.txt

    I know that your powers of retention
    Are as wet as a warthog’s backside
    But thick as you are, pay attention
    My words are a matter of pride

    Command-line

    ./reader -c 7 -7 1 -5 3 -3 5 -1 < be-prepared.txt

    Output

    I know that your powers of retention
    Are as wet as a warthog’s backside
    But thick as you are, pay attention
    My words are a matter of pride
    Number of lines: 4
    Longest field: 9 characters
    Midpoint
    This part of the assignment is due earlier than the full assignment, on
    Sunday 7/10 at 11:59 pm PST. There are no late submissions on the
    Midpoint.
    Complete the Gradescope assignment“Midpoint: Programming Assignment 2”, an Online
    Assignment that is done entirely through Gradescope. This assignment consists of a few quiz
    questions and a free-response question where you will document your pseudocode or the
    algorithm in plain English. This is a planning document and does not need to reflect your final
    implementation, although you are encouraged to keep it up to date.
    Describe the complete algorithm for your program. Cover all the steps required to parse the
    input arguments, get a line of input from stdin, identify where each column begins, print the
    specified columns, and collect statistics.
    Submission and Grading
    Submitting

  11. Submit your files to Gradescope under the assignment titled“Programming Assignment
    2”.
    You will submit the following files:
    reader.c
    README.md
    You can upload multiple files to Gradescope by holding CTRL (? on a Mac) while you
    are clicking the files. You can also hold SHIFT to select all files between a start point
    and an endpoint.
    Alternatively, you can place all files in a folder and upload the folder to the assignment.
    Gradescope will upload all files in the folder. You can also zip all of the files and upload
    the .zip to the assignment. Ensure that the files you submit are not in a nested folder.
  12. After submitting, the autograder will run a few tests:
    a. Check that all required files were submitted.
    b. Check that reader compiles.
    c. Runs some sanity tests on the resulting reader executable.
    Grading Breakdown [50 pts]
    Make sure to check the autograder output after submitting! We will be running
    additional tests after the deadline passes to determine your final grade. Also, throughout this
    course, make sure to write your own test cases. It is bad practice to rely on the minimal
    public autograder tests as this is an insufficient test of your program.
    To encourage you to write your own tests, we are not providing any public tests that have
    not already been detailed in this writeup.
    The assignment will be graded out of 50 points and will be allocated as follows:
    ● Midpoint writeup: 5 points. This part of the assignment is due earlier than the full
    assignment, on Sunday 7/10 at 11:59 pm PST. Complete the Gradescope
    assignment“Midpoint: Programming Assignment 2”.
    ● Code compiles without warnings: 1 point.
    ● No Memory Leaks : 4 points
    ● Public tests with the provided examples : 16 points
    ● Private tests with hidden test cases : 24 points
    NOTE: The tests expect an EXACT output match with the reference binary. There will be
    NO partial credit for any differences in the output. Test your code – do NOT rely on the
    autograder for program validation (read this sentence again).
    Make sure your assignment compiles correctly through the provided Makefile on the
    pi-cluster without warnings. Any assignment that does not compile will receive 0 credit.
    Note: You should only choose from the following library functions: malloc(), free(),
    atoi(), printf(), fprintf(), getline(), strcmp(), strlen(), and exit() in your
    code. You may not use any other function such as strtok().
    Checking For Exact Output Match
    A common technique is to redirect the outputs to files and compare against the reference
    solution4:
    ./your-program args > output; our-reference args > ref
    diff -s output ref
    This command will output lines that differ with a < or > in front to tell which file the line came
    from.
正文完
 0