关于后端:DFT-Parallelization-in-OpenMP

89次阅读

共计 2228 个字符,预计需要花费 6 分钟才能阅读完成。

DFT Parallelization in OpenMP
The FFT is the most famous library to solve a Fourier Transform (FT) using
the Fast Fourier Transform algorithm. The goal of this exercise is to develop a
fast DFT (Discrete Fourier Transform, a different algorithm to calculate FT) by
parallelizing the serial C code that is provided as DFT_seq.c.
The code calculates direct and inverse DFTs, timing the total time taken for the
computation of the two DFTs. The computational core of the program is the
DFT subroutine that takes idft argument as 1 for the direct DFT and -1 for the
inverse DFT.
The code also prints the value of the first element of the DFT computation. This
is equal to the input size (N). You can use this number to quickly check the
correctness of your implementation.
You will focus on parallelizing the DFT function:
int DFT(int idft, double xr, double xi, double Xr_o, double Xi_o, int N)
{
// Region 1: main DFT function
for (int k = 0; k < N; k++)
{
for (int n = 0; n < N; n++)
{
// Real part of X[k]
Xr_o[k] += xr[n] cos(n k PI2 / N) + idft xi[n] sin(nk*PI2/N);
// Imaginary part of X[k]
Xi_o[k] += -idft xr[n] sin(n k PI2 / N) + xi[n] cos(nk*PI2/N);
}
}
// Region 2: Normalizing IDFT
if (idft == -1)
{
for (int n = 0; n < N; n++)
{
Xr_o[n] /= N;
Xi_o[n] /= N;
}
}
return 1;
}
In this exercise, N=10000. Please provide a comprehensive report to answer the
following questions. You should plot figures to show your testing results in the
report.
Questions:
1

  1. (1 point) Using OpenMP to parallelize the given sequential code in
    DFT_seq.c . The goal is to achieve the best possible performance for DFT
    with the correct result and no false sharing. Note that the scheduling policy
    and where to parallelize will affect your final code performance. Your plot
    should reflect the execution time for the runs on 1, 2, 4, 8 and 16 threads.
    The performance measurement should be an average of 5 repeated runs.
    Note that you need to parallelize all the code regions that are necessary
    for performance improvement.
  2. (1 point) Zoom into Region 1, which is the main DFT function nested
    loops, please write the two OpenMP parallelization versions for outer and
    inner loops in the report and discuss which design choice is better for
    performance and why. After this, using the one design above that has
    the better performance as the baseline, please discuss how the scheduling
    policy choices (default, static and dynamic) may impact its final execution
    time. Plot them out in a figure to reflect the runs on 1, 2, 4, 8 and 16
    threads. Note that please design your scheduling policy that avoids false
    sharing. Also, it is possible that they don’t present significant performance
    differences. You have to summarize your findings.
正文完
 0