关于后端:COMP528

11次阅读

共计 5437 个字符,预计需要花费 14 分钟才能阅读完成。

PAPER CODE NO. EXAMINER: Fabio Papacchini
COMP528-JAN21 DEPARTMENT: Computer Science
FIRST SEMESTER EXAMINATIONS 2020/21
Multi-Core and Multi-Processor Programming
EXPECTED TIME: 3 hours
INSTRUCTIONS TO CANDIDATES
The exam consists of FOUR questions. You must answer ALL questions
The expected writing time for the exam is 3 hours
You may write your answers using a word processor (please export the document to PDF
before submitting), or you may write your answers by hand and either scan them, or take
photos of them. If you write answers by hand, then both the handwriting and the scanned
copy must be legible in order to be accepted.
The exam will be released at 9am on Friday 30 April 2021. You will then have 24 hours in
which to prepare your answers, and the final deadline for submissions is 9am on Saturday
01 May 2021. All times and dates are BST (GMT +1).
All answers should be combined into a single PDF that should be uploaded to the relevant
CANVAS area for the COMP528-JAN21 course 2020-2021.
Late submissions will not be accepted.
PAPER CODE COMP528-JAN21 page 1 of 5 Continued
Question 1

  1. What are the reasons for the fast development and wide use of multi-core processors? What
    hardware approaches have been developed to overcome the limitations imposed by a single
    core processor?
    [7 marks]
  2. Consider two small HPC clusters C1 and C2 where
    C1 is composed of 4 nodes with 8 processors each, and each processor has 15 cores
    running at a fixed frequency of 2.5 GHz; and
    C2 is composed of 2 nodes with 12 processors each, and each processor has 20 cores
    running at a fixed frequency of 2 GHz.
    How you would evaluate their peak performances? Assuming 2 floating point operation per
    CPU core cycle, which of the two small clusters has a better peak performance?
    [5 marks]
  3. In your own words, explain and comment on Amdahl’s Law and Gustafson’s observation. For
    the former (i.e., Amdahl’s law), provide its equation, and define each used variable. Explain
    what is meant by“speed up”and express the theoretical maximum speed-up as a function of
    the serial proportion of a code.
    [8 marks]
  4. A serial code takes 300 minutes to run. Timing different parts of the code shows that 294 of
  5. minutes are spent on a parallelisable portion of the code. Using Amdahl’s Law, how long
    would it take to run a parallel version of the code on 3 cores? Assuming Amdahl’s Law is
    correct, what would be the parallel efficiency on 3 cores? And what is the maximum speed-
    up possible for this code on this architecture?
    [5 marks]
    Question 2
  6. Describe the MPI Scatter and MPI Gather collectives, explaining what buffers are required
    to be allocated and initialised on which process. NOTE: the description should discuss both
    the high-level idea behind them, and the more practical aspects in an implementation.
    [8 marks]
  7. Explain what a deadlock is, provide an example of an MPI program which might result is a
    deadlock, and provide a possible solution to your example (HINT: you do not have to provide
    PAPER CODE COMP528-JAN21 page 2 of 5 Continued
    the whole code, but just the relevant lines causing the deadlock and making clear which pro-
    cesses are executing them)
    [8 marks]
  8. Explain what affects the time to send messages in an HPC cluster, and how a reduction pat-
    tern can have a positive impact on the total wall clock time.
    [5 marks]
  9. Explain what is wrong in the following C code, and propose a possible solution. You can
    assume that each process has the integer array x initialised, NUM is an integer provided as
    input, and the number of processes is less than NUM.
    MPI_comm_size(MPI_COMM_WORLD, &size);
    MPI_comm_rank(MPI_COMM_WORLD, &myRank);
    int stepSize = NUM/size;
    int start = myRank*stepSize;
    int finish = start + stepSize;
    int global_sum=0;
    int local_sum = 0;
    for(i=start; ilocal_sum += x[i]
    }
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if(rank==0)
    printf(“Sum: %f”, global_sum);
    [4 marks]
    Question 3
  10. Consider the following serial C code where a, b and res are vectors (of N floats) and N is a
    positive integer previously read in by the program.
    res = 0.0;
    for (i=N – 1; i > 0; i–) {
    res = res * (b[i]-a[i]);
    }
    PAPER CODE COMP528-JAN21 page 3 of 5 Continued
    Explain how you would parallelise this using OpenMP, giving the full and exact code needed
    to appropriately parallelise this loop, bearing in mind good programming practices
    [5 marks]
  11. Explain, via the use of an example, what is wrong in the following code, and how you would
    fix it. You can assume that all variables have been initialised before the provided code (you
    might want to state what their values are for your example).

    pragma omp parallel default(none) shared(NUM,A,B,x,y) private(i,j,k)

    {

    pragma omp for nowait

    for(i=0;ix[i] = A * x[i];
    }

    pragma omp for

    for(j=0;jk = NUM-j-1;
    y[j] = B * x[k];
    }
    }
    [7 marks]

  12. When writing OpenMP programs, it is fundamental to understand the“scope”of variables
    within a parallel region. For a given variable x, describe in words, and potentially via an
    example and/or diagram, what happens in each of the following cases: (1) x is defined as
    private(x), (2) x is defined as shared(x), and (3) x is defined within a reduction directive
    (e.g., reduction(+:x)). [6 marks]
  13. Different approaches can be taken when implementing a parallel program (e.g., shared mem-
    ory vs distributed memory), each of which has its advantages and disadvantages. Provide at
    least four statements about differences and/or similarities between the two main approaches
    discussed in the lectures.
    [7 marks]
    Question 4
  14. Describe, potentially with the use of diagrams, what task parallelism and data parallelism
    are. Which one of these two kind of parallelism is better suited for implementation on GPUs?
    PAPER CODE COMP528-JAN21 page 4 of 5 Continued
    Explain you answer.
    [6 marks]
  15. Explain the steps you would take to implement the serial code fragment below as a parallel
    operation on a GPU as a 1D CUDA kernel. You should include how to change the code to
    become a CUDA kernel, including portability for invocation on any number of threads larger
    than num (where there are num elements of x, y and z).
    void saxpy(float z, float A, float x, float *y, int num)
    int i;
    for (i=0; i{
    z[i] = x[i] – A * y[i];
    }
    [8 marks]
  16. Explain the differences of using openMP or openACC to offload part of a parallel computation
    to a GPU.
    [5 marks]
  17. Choose one between the two“emerging technology”FPGA and Quantum Computing. De-
    scribe the chosen technology and how it differs from traditional CPU computing.
    [6 marks]
    PAPER CODE COMP528-JAN21 page 5 of 5 End
正文完
 0