CSci 4203/EE4367, Spring 2021
Homework Assignment III (Issued March 30, 2021)
Instructions:
1. You can type your solutions into this MS Word file after downloading it, or write your solutions by hand, scan them, and upload the PDF file.
2. Label your assignment with your name and UMN email address.
3. Submit assignments via the Canvas course web page. Note: No late submissions are accepted.
4. Due date: 11:59 pm Thursday 4/22/2021 (Central Standard Time)
5. For each problem, please show all of your work to get full credit. Without shown work, up to half of the total credit may be deducted.
6. Due to TA resource limitations, we will be able to grade only a subset of the assigned problems (the same subset for everyone).
7. Homework must be done individually. Please pay specific attention to the “Student Academic Integrity and Scholastic Dishonesty” policy noted in the syllabus. Failing the class because of cheating on a homework is not worth it.
Problem 1 From Homework #2 Problem 5 (Multiple-Issue Processors, Chapter 4)
This exercise compares the performance of a 1-issue and a 2-issue statically-scheduled processor with a design as in Figure 4.69 (in textbook) shown below. Note that neither processor has a hazard detection unit or a data forwarding unit, so it is the responsibility of the programmer or compiler to insert nops (if needed) and to statically schedule the instructions for better performance (i.e. to reduce the number of nops). In the 2-issue processor, one instruction can be an ALU or branch instruction, and the other can be a load or store instruction. Note that a nop must be filled in whenever no instruction is available for a particular slot. Assume we have the following C program.
for (j=0; j < m; j++)
y[j] = x[j] + z[j-1];
When translating into MIPS code, assume that the variables are kept in registers as shown in the following table, and that all other registers (i.e. r6 – r15) are free and can be used for register renaming in the unrolled loop.
Variable:  j   m   x   y   z
Register:  r4  r5  r1  r2  r3
A. Translate this C code directly into MIPS instructions, i.e. without considering hazards.
B. Insert nops into the code from (A) for the 1-issue processor described above (i.e. no hazard detection unit and no data forwarding unit). Try to rearrange the instructions to achieve better performance, i.e. to minimize the number of nops needed. How many cycles are needed to execute one iteration of the loop using your code?
C. Unroll the loop two times, and schedule the instructions on the 2-issue statically-scheduled processor described above (as in Figure 4.69). How many cycles are needed to execute one iteration of the unrolled loop (i.e. two iterations of the original code) using your code? Please present your code schedule in a form similar to that in Figure 4.71 (in textbook) as shown below, in which the empty slots are nops.
D. Assume both the 1-issue and 2-issue processors have perfect branch prediction. Also, assume the loop has 1000 iterations (i.e. m = 1000). What speedup can we achieve by going from a 1-issue to a 2-issue processor? Note that there will be only 500 iterations of the unrolled loop (since we unroll the loop two times).
(Figure 4.71 in Textbook)
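The speedup asked for in part D reduces to a ratio of total cycle counts. The sketch below is one way to set up that arithmetic. The function name `loop_speedup` and the cycle counts in the usage line are hypothetical placeholders, not values from this assignment, and pipeline fill and drain cycles are assumed to be negligible.

```python
def loop_speedup(cycles_1issue, cycles_2issue_unrolled, m=1000, unroll=2):
    """Speedup of the 2-issue schedule over the 1-issue schedule.

    cycles_1issue: cycles per original iteration on the 1-issue processor.
    cycles_2issue_unrolled: cycles per unrolled iteration on the 2-issue one.
    Assumption: pipeline fill/drain cycles are ignored.
    """
    if m % unroll != 0:
        raise ValueError("m must be a multiple of the unroll factor")
    total_1issue = cycles_1issue * m
    total_2issue = cycles_2issue_unrolled * (m // unroll)
    return total_1issue / total_2issue


# Placeholder cycle counts, for illustration only.
print(loop_speedup(cycles_1issue=10, cycles_2issue_unrolled=12))
```

Substituting the cycle counts obtained in parts B and C gives the speedup for this problem.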
Problem 2 (Cache Memory Organizations, Chapter 5)
Below is a trace of 8-bit memory references. Each memory reference in the trace is given as a word address. Assume the memory is word-addressable, not byte-addressable, in this problem.
Memory reference trace (decimal): 60, 146, 92, 186, 60, 146, 186, 92
Problem 2.a
Assume we have a direct-mapped cache with 8 one-word blocks. For each of those references, identify its memory address in binary form, its tag, and its index to the direct-mapped cache. Also, indicate if each reference is a hit or a miss, assuming the cache is initially empty. Show your results in the following table.
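The address split for a direct-mapped cache with one-word blocks can be sketched as follows: the low-order bits of the word address select the cache index and the remaining bits form the tag. The function name `direct_mapped_trace` and the short trace in the usage line are made up for illustration and are not the trace from this problem.

```python
def direct_mapped_trace(trace, num_blocks=8, addr_bits=8):
    """Return (binary address, tag, index, hit/miss) for each word address.

    Assumes one-word blocks, a power-of-two block count, and an
    initially empty cache.
    """
    index_bits = num_blocks.bit_length() - 1   # log2 of a power of two
    tag_bits = addr_bits - index_bits
    cache = [None] * num_blocks                # tag stored per block
    rows = []
    for addr in trace:
        index = addr % num_blocks              # low-order bits
        tag = addr // num_blocks               # remaining high-order bits
        hit = cache[index] == tag
        cache[index] = tag                     # fill or replace on a miss
        rows.append((
            format(addr, f"0{addr_bits}b"),
            format(tag, f"0{tag_bits}b"),
            format(index, f"0{index_bits}b"),
            "hit" if hit else "miss",
        ))
    return rows


# Illustrative trace only: 3 and 11 map to the same index and conflict.
for row in direct_mapped_trace([3, 11, 3, 3]):
    print(row)
```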
Problem 2.b
Assume we have a 2-way set-associative cache with 8 one-word blocks. For each of those references in the same trace, identify its binary address, its tag, and its index. Also, list if each reference is a hit or a miss, assuming the cache is initially empty. Show your results in the following table.
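For the set-associative case, the index selects a set rather than a single block, so 8 blocks organized 2-way give 4 sets. The sketch below assumes LRU replacement within each set; the problem statement does not name a replacement policy, so that choice is an assumption. The function name and the short trace are hypothetical.

```python
def set_associative_trace(trace, num_blocks=8, ways=2):
    """Return (tag, index, hit/miss) per word address.

    Assumptions: one-word blocks, initially empty cache, LRU replacement.
    """
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]  # each list ordered LRU -> MRU
    results = []
    for addr in trace:
        index = addr % num_sets
        tag = addr // num_sets
        lines = sets[index]
        if tag in lines:
            lines.remove(tag)             # re-appended below as MRU
            outcome = "hit"
        else:
            if len(lines) == ways:
                lines.pop(0)              # evict the least recently used tag
            outcome = "miss"
        lines.append(tag)
        results.append((tag, index, outcome))
    return results


# Illustrative trace only: all four addresses map to set 3.
for row in set_associative_trace([3, 11, 3, 19, 11]):
    print(row)
```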
Problem 2.c
We are trying to decide the best cache design for the given address trace between two possible cache design options, both with a total of 8 words of data. Case-1 is a direct-mapped cache design with 1-word blocks as in Problem 2.a. Case-2 is a 2-way set-associative cache design with 1-word blocks as in Problem 2.b. Given the address trace above, which cache design has the lower miss rate? Assume the miss penalty is 20 cycles. Case-1 has an access time of 1 cycle. Case-2 has an access time of 2 cycles. Which cache design has better performance in terms of AMAT (Average Memory Access Time)?
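AMAT is the hit (access) time plus the miss rate times the miss penalty. A minimal sketch of that formula follows. The miss rates in the usage lines are placeholders chosen for easy arithmetic, not the miss rates of the trace in this problem.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty


# Placeholder miss rates, for illustration only.
print(amat(hit_time=1, miss_rate=0.5, miss_penalty=20))   # 11.0
print(amat(hit_time=2, miss_rate=0.25, miss_penalty=20))  # 7.0
```

The comparison shows why a design with a longer access time can still win: a lower miss rate may outweigh the extra hit-time cycle.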
Problem 2.d
There are many design parameters that are important to a cache’s overall performance. Assume we have the following parameters for a direct-mapped cache: (1) cache data size: 16 KB; (2) cache block size: 4 words. Calculate the total number of bits required for the direct-mapped cache, including the data, tag, and valid bits, assuming it is in a machine with a 32-bit address. Given that total size, find the total size of the closest 4-way set-associative cache with 2-word blocks that is of equal size or greater.
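The bit count follows from the address breakdown: each block stores its data bits, a tag, and one valid bit, and the tag width is whatever remains of the address after the index and block-offset bits. The helper below is a sketch under the assumptions of 4-byte words, byte addressing, and power-of-two sizes; the name `cache_total_bits` is hypothetical, and the usage line uses a small configuration that differs from the one in this problem.

```python
def cache_total_bits(data_bytes, block_words, ways=1, addr_bits=32, word_bytes=4):
    """Total storage bits (data + tag + valid) for a cache.

    Assumes byte addressing and power-of-two sizes; ways=1 is direct-mapped.
    """
    block_bytes = block_words * word_bytes
    num_blocks = data_bytes // block_bytes
    num_sets = num_blocks // ways
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    offset_bits = block_bytes.bit_length() - 1  # word offset + byte offset
    tag_bits = addr_bits - index_bits - offset_bits
    bits_per_block = block_bytes * 8 + tag_bits + 1  # data + tag + valid
    return num_blocks * bits_per_block


# Illustrative configuration: 1 KB of data, 1-word blocks, direct-mapped.
# 256 blocks x (32 data + 22 tag + 1 valid) bits
print(cache_total_bits(data_bytes=1024, block_words=1))  # 14080
```

Increasing `ways` shrinks the index and widens the tag, which is why associativity raises the overhead per block.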
Problem 3. TLB and Address Mapping
Virtual memory uses a page table to track the mapping of (i.e. to translate) virtual addresses to physical addresses in main memory. This exercise shows how the page table must be updated as memory references are processed. Assume we have a 32-bit virtual address and a page size of 4 KB. The system also has a 4-entry fully-associative TLB with an LRU (least recently used) replacement scheme. When a page is brought into main memory due to a page fault, it is placed in the page following the page with the largest page number, i.e. if the current largest physical page number is 12 as shown in the table below, it will be placed in page number 13. The following list shows the stream of virtual addresses during the program execution.
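With 4 KB pages, the low 12 bits of a virtual address are the page offset and the upper 20 bits are the virtual page number. The sketch below shows that split together with a small LRU TLB and the "next page after the largest" placement rule described above. All names (`split_virtual_address`, `LruTlb`, `access`) and the sample page table are hypothetical, and the page table is simplified to a dictionary of resident pages with no valid bits or disk entries.

```python
from collections import OrderedDict

PAGE_OFFSET_BITS = 12  # 4 KB pages


def split_virtual_address(va):
    """Return (virtual page number, page offset)."""
    return va >> PAGE_OFFSET_BITS, va & ((1 << PAGE_OFFSET_BITS) - 1)


class LruTlb:
    """Fully-associative TLB with LRU replacement."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # vpn -> ppn, least recently used first

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # mark as most recently used
            return self.entries[vpn]
        return None

    def insert(self, vpn, ppn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[vpn] = ppn


def access(va, tlb, page_table):
    """Classify one reference and update the TLB and page table."""
    vpn, _ = split_virtual_address(va)
    if tlb.lookup(vpn) is not None:
        return "TLB hit"
    if vpn in page_table:
        tlb.insert(vpn, page_table[vpn])
        return "TLB miss, page table hit"
    # Page fault: place the page after the largest physical page number.
    new_ppn = max(page_table.values(), default=-1) + 1
    page_table[vpn] = new_ppn
    tlb.insert(vpn, new_ppn)
    return "page fault"


# Made-up initial state and addresses, for illustration only.
tlb, page_table = LruTlb(), {0: 5, 1: 12}
for va in (0x0123, 0x0FFF, 0x5000):
    print(hex(va), access(va, tlb, page_table))
```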
Problem 3.e
Here are several parameters that can impact the overall size of the page table: (1) the virtual address size; (2) the page size; and (3) the size of each page table entry. Assume the virtual address size is 32 bits, the page size is 8 KB, and the size of each page table entry is 4 bytes (assume it includes all of the information, e.g. the valid bit, physical page number, etc.). Please calculate the total page table size (in bytes) for a system running 6 applications. Each application has its own page table, i.e. there are 6 page tables active at the same time in the system.
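For a flat, single-level page table, the number of entries is the number of virtual pages, i.e. 2 raised to the number of virtual-page-number bits. The sketch below captures that relationship. The function name is hypothetical, the single-level layout is an assumption, and the usage line uses parameters that differ from the ones in this problem.

```python
def page_table_bytes(va_bits, page_bytes, pte_bytes, num_tables=1):
    """Total size in bytes of flat, single-level page tables.

    Assumes the page size is a power of two.
    """
    offset_bits = page_bytes.bit_length() - 1  # log2(page size)
    entries = 1 << (va_bits - offset_bits)     # one entry per virtual page
    return entries * pte_bytes * num_tables


# Illustrative parameters: 32-bit addresses, 4 KB pages, one table.
print(page_table_bytes(va_bits=32, page_bytes=4096, pte_bytes=4))  # 4194304
```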
Problem 4 Cache Coherence, False Sharing
Cache coherence concerns the views of multiple processors on a given cache block. The following table shows two processors and their read/write operations on two different words X[0] and X[1] in an 8-byte cache block (initially, X[0] = X[1] = 0). Assume the size of integers is 4 bytes (i.e. one word). Hence, X[0] and X[1] are in the same cache block.
Processor 1          Processor 2
X[0]++; X[1] = 3;    X[0] = 6; X[1] += 2;
Problem 4.a
List all possible values of the given cache block assuming we have a correct coherence protocol implementation. Note that there can be 6 possible orderings in the execution of the two statements on each processor. Two possible orderings are shown in the two tables below. You need to show the rest of the orderings, and the results in each ordering. List at least one more possible value of the block if the 2-processor system does not ensure cache coherency, i.e. a processor is not aware of the value having been changed in the other processor.
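The 6 orderings arise from choosing which 2 of the 4 execution slots go to Processor 1 while keeping each processor's statements in program order. The sketch below enumerates them and applies each ordering to a single coherent copy of the block. It assumes each statement executes atomically; the function names are hypothetical.

```python
from itertools import combinations


def p1_inc_x0(x): x[0] += 1   # Processor 1: X[0]++
def p1_set_x1(x): x[1] = 3    # Processor 1: X[1] = 3
def p2_set_x0(x): x[0] = 6    # Processor 2: X[0] = 6
def p2_add_x1(x): x[1] += 2   # Processor 2: X[1] += 2


def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that preserves program order."""
    n = len(p1) + len(p2)
    for p1_slots in combinations(range(n), len(p1)):
        it1, it2 = iter(p1), iter(p2)
        yield [next(it1) if i in p1_slots else next(it2) for i in range(n)]


def final_blocks(p1, p2):
    """Map each ordering to the final (X[0], X[1]) under coherent memory."""
    outcomes = []
    for order in interleavings(p1, p2):
        x = [0, 0]                      # initially X[0] = X[1] = 0
        for stmt in order:
            stmt(x)
        outcomes.append(([s.__name__ for s in order], tuple(x)))
    return outcomes


for names, block in final_blocks([p1_inc_x0, p1_set_x1], [p2_set_x0, p2_add_x1]):
    print(" -> ".join(names), block)
```

Modeling the non-coherent case would require giving each processor its own private copy of the block, which this sketch does not attempt.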
Problem 4.b
Assuming we have a snooping cache coherence protocol, list the operation sequence (in a format similar to Figure 5.42) of the read/write operations on each processor/cache for Order 2 shown in the above table. Remember that X[0] and X[1] are in the same cache block.
Processor activity | Bus activity | Contents of P1’s cache | Contents of P2’s cache | Contents of memory
P2 reads X[0] | Cache miss for X[0] | Empty | X[0] = 0; X[1] = 0 | X[0] = 0; X[1] = 0
P2 writes X[0] | Invalidation for X[0] | Empty | X[0] = 6; X[1] = 0 | X[0] = 0; X[1] = 0
P1 reads X[0] | | | |
P1 writes X[0] | | | |
P2 reads X[1] | | | |
P2 writes X[1] | | | |
P1 reads X[1] | | | |
P1 writes X[1] | | | |
Problem 4.c
Among the 6 possible orderings in Problem 4.a, what are the best-case and the worst-case numbers of cache misses needed to execute the listed read/write instructions?
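One way to reason about miss counts is a simplified write-invalidate model. Since X[0] and X[1] share one block and every statement here writes to it, each statement invalidates the other processor's copy, and a processor misses whenever it runs a statement without a valid copy. The sketch below encodes only that simplified model; it is an assumption-laden illustration, not a full protocol simulation, and `count_misses` is a hypothetical name.

```python
def count_misses(order):
    """Count cache misses for a statement ordering.

    order: processor ids in execution order, one per statement,
           e.g. (1, 1, 2, 2) means P1 runs both statements first.
    Assumptions: a single shared block, both caches initially empty,
    every statement writes the block, write-invalidate snooping, and
    each statement runs atomically (at most one miss per statement).
    """
    valid = {1: False, 2: False}
    misses = 0
    for proc in order:
        if not valid[proc]:
            misses += 1            # block must be fetched
            valid[proc] = True
        other = 2 if proc == 1 else 1
        valid[other] = False       # the write invalidates the other copy
    return misses


print(count_misses((1, 1, 2, 2)))  # same-processor statements back to back
print(count_misses((1, 2, 1, 2)))  # strictly alternating processors
```

In this model the alternating pattern is the signature of false sharing: the processors touch different words, yet the block bounces between the caches on every statement.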