Compiling your parallel code
The point of using palmetto is to speed up your computing, and the way to do that is to split your code so that it runs across the palmetto nodes and CPUs. Once you have compiled and tested your code on one processor, it's time to look into running it in parallel.
Several parallel programming methods are in use today. The table below summarizes three of them, using their Wikipedia descriptions.
| Method | Description |
|---|---|
| OpenMP | OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C/C++ and Fortran on many architectures, including Unix and Microsoft Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer. (See http://en.wikipedia.org/wiki/OpenMP.) |
| MPI | Message Passing Interface (MPI) is both a computer specification and its implementation that allows many computers to communicate with one another. It is used in computer clusters. ... MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation." (See http://en.wikipedia.org/wiki/Message_Passing_Interface and http://en.wikipedia.org/wiki/MPICH.) We use MPICH, a freely available and portable version of MPI, on palmetto. |
| TLP | Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. ... By running many threads at once, these applications are able to tolerate the high amounts of I/O and memory system latency their workloads can incur - while one thread is delayed waiting for a memory or disk access, other threads can do useful work. (See http://en.wikipedia.org/wiki/Thread-level_parallelism.) |
Parallelization using the Intel compilers
The Intel compilers include sophisticated tools for analyzing your code in various ways. This section briefly describes the options that instruct the compiler to analyze your code and set up your executable file for parallel execution automatically. For additional explanation and options, see man ifort and man icc, or the additional resources listed in the Finding software and documentation section.
| Option | Description |
|---|---|
| -openmp -openmp-report | Enables the parallelizer to generate multithreaded code based on OpenMP directives. The code can be executed in parallel on both uniprocessor and multiprocessor systems. The -openmp-reportn option controls the level of diagnostic messages of the OpenMP parallelizer, where n sets the reporting level. |
| -parallel -par-report -par-schedule | Tells the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. You must also specify -O2 or -O3. The -par-reportn option controls the diagnostic information reported by the auto-parallelizer, where n sets the reporting level. The -par-schedule<keyword>=n option specifies a scheduling algorithm for DO loop iterations, where <keyword> names the scheduling algorithm. |
| -threads | Specifies that multithreaded libraries should be linked. This means that any routines you call from these libraries will be executed in parallel. The default is -nothreads. |
| -vec -vec-report | Takes advantage of Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3) vectorization. (Default: -novec prior to Intel 10, -vec in Intel 10.) The -vec-reportn option directs the compiler to generate a vectorization report, where n denotes which level of diagnostic messages to report. |
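To make the -parallel and -vec descriptions more concrete, the sketch below contrasts a loop whose iterations are independent (a candidate for auto-parallelization and vectorization) with a loop that carries a dependence from one iteration to the next. The function names scale_add and prefix_sum are hypothetical and chosen only for illustration. Whether a given compiler version actually transforms either loop is something the -par-report and -vec-report output would have to confirm.

```c
#include <stddef.h>

/* Independent iterations: c[i] depends only on a[i] and b[i], so the
 * iterations can run in any order. The restrict qualifiers (C99) promise
 * the compiler that the three arrays do not overlap. */
void scale_add(size_t n, double alpha,
               const double *restrict a,
               const double *restrict b,
               double *restrict c)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = alpha * a[i] + b[i];
    }
}

/* Loop-carried dependence: x[i] needs the x[i-1] produced by the previous
 * iteration, so the iterations cannot simply be split across threads. */
void prefix_sum(size_t n, double *x)
{
    for (size_t i = 1; i < n; i++) {
        x[i] += x[i - 1];
    }
}
```

A loop shaped like scale_add is the kind the option table calls "safely executed in parallel"; a loop shaped like prefix_sum generally needs a different algorithm before it can be parallelized.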
OpenMP Example
A simple program that shows the use of OpenMP directives is this example from the Lawrence Livermore tutorial:
      PROGRAM WORKSHARE1
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +  OMP_GET_THREAD_NUM, N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), C(N)
! Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(I,TID)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads =', NTHREADS
      END IF
      PRINT *, 'Thread',TID,' starting...'
!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
      DO I = 1, N
        C(I) = A(I) + B(I)
        WRITE(*,100) TID,I,C(I)
 100    FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
      ENDDO
!$OMP END DO NOWAIT
      PRINT *, 'Thread',TID,' done.'
!$OMP END PARALLEL
      END
(Note: The example, in C, can be found at https://computing.llnl.gov/tutorials/openMP/samples/C/omp_workshare1.c)
We would compile it with
ifort -o workshare -openmp workshare.f
and receive the response
[myid@user001 workshare]$ ifort -o workshare -openmp workshare.f
workshare.f(25): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
workshare.f(16): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
workshare.f(10): (col. 7) remark: LOOP WAS VECTORIZED.
[myid@user001 workshare]$
We will show how to run this program in the next section.
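For readers working in C, the same work-sharing pattern can be sketched as follows. This is not the Lawrence Livermore sample. It is a minimal illustration that assumes the compiler is given the -openmp option described above, and the function name vector_add is hypothetical. Without that option the pragma is ignored and the loop runs serially with the same result.

```c
#define CHUNKSIZE 10

/* Adds a and b element by element into c.
 * With OpenMP enabled, loop iterations are handed to the threads in
 * chunks of CHUNKSIZE, mirroring SCHEDULE(DYNAMIC,CHUNK) in the Fortran
 * example. The loop index i is private to each thread automatically;
 * the arrays and n are shared by default. */
void vector_add(int n, const float *a, const float *b, float *c)
{
    int i;
#pragma omp parallel for schedule(dynamic, CHUNKSIZE)
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Because each iteration writes a different element of c, no synchronization is needed inside the loop.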
MPI Example
This simple example shows an array decomposition in C, implemented with MPI library calls.
/******************************************************************************
* FILE: mpi_array.c
* DESCRIPTION:
*   MPI Example - Array Assignment - C Version
*   This program demonstrates a simple data decomposition. The master task
*   first initializes an array and then distributes an equal portion of that
*   array to the other tasks. After the other tasks receive their portion
*   of the array, they perform an addition operation to each array element.
*   They also maintain a sum for their portion of the array. The master task
*   does likewise with its portion of the array. As each of the non-master
*   tasks finish, they send their updated portion of the array to the master.
*   An MPI collective communication call is used to collect the sums
*   maintained by each task. Finally, the master task displays selected
*   parts of the final array and the global sum of all array elements.
*   NOTE: the number of MPI tasks must be evenly divisible by 4.
* AUTHOR: Blaise Barney
* LAST REVISED: 04/13/05
****************************************************************************/
#include "mpi.h"
#include
(Note: The example, in Fortran, can be found at https://computing.llnl.gov/tutorials/mpi/samples/Fortran/mpi_array.f)
We would compile it with
mpicc -o array-decomp mpi_array.c
and receive the response
[myid@user001 workshare]$ mpicc -o array-decomp mpi_array.c
array.c(54): (col. 3) remark: LOOP WAS VECTORIZED.
array.c(71): (col. 11) remark: LOOP WAS VECTORIZED.
array.c(129): (col. 3) remark: LOOP WAS VECTORIZED.
[myid@user001 workshare]$
We will show how to run this program in the next section.
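The data decomposition described in the header comment comes down to index arithmetic: each task owns one equal slice of the array and keeps a sum for it. The sketch below shows only that arithmetic in plain C, with the tasks simulated by a loop. It contains no MPI calls and is not the Lawrence Livermore program. The names chunk_bounds, update_chunk, and simulate_tasks, and the choice of adding each element's index to the element, are assumptions made for illustration.

```c
#include <stddef.h>

/* Computes the slice owned by task `rank` out of `ntasks` when an array
 * of n elements is split into equal portions.
 * Returns 0 on success, -1 if the arguments are invalid or the array
 * cannot be split evenly. */
int chunk_bounds(size_t n, int ntasks, int rank,
                 size_t *offset, size_t *count)
{
    if (ntasks <= 0 || rank < 0 || rank >= ntasks ||
        n % (size_t)ntasks != 0) {
        return -1;
    }
    *count = n / (size_t)ntasks;
    *offset = (size_t)rank * (*count);
    return 0;
}

/* Updates one slice in place and returns the sum of its new values.
 * Adding the index is an assumed stand-in for the "addition operation". */
double update_chunk(double *data, size_t offset, size_t count)
{
    double sum = 0.0;
    for (size_t i = offset; i < offset + count; i++) {
        data[i] += (double)i;
        sum += data[i];
    }
    return sum;
}

/* Simulates every task in turn and accumulates the partial sums. In an
 * MPI program each task would run update_chunk on its own slice, and a
 * collective call such as MPI_Reduce would combine the partial sums.
 * Returns 0 on success, -1 if the decomposition is not possible. */
int simulate_tasks(double *data, size_t n, int ntasks, double *global_sum)
{
    *global_sum = 0.0;
    for (int rank = 0; rank < ntasks; rank++) {
        size_t offset, count;
        if (chunk_bounds(n, ntasks, rank, &offset, &count) != 0) {
            return -1;
        }
        *global_sum += update_chunk(data, offset, count);
    }
    return 0;
}
```

Keeping the slice computation in one small function makes it easy to check by hand that the slices cover the array exactly once before any message passing is added.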
TLP Example
We conclude with a simple thread-level example that decomposes array processing by distributing loop iterations.
/******************************************************************************
* FILE: arrayloops.c
* DESCRIPTION:
*   Example code demonstrating decomposition of array processing by
*   distributing loop iterations. A global sum is maintained by a mutex
*   variable.
* AUTHOR: Blaise Barney
* LAST REVISED: 04/05/05
******************************************************************************/
#include
We would compile it with mpicc, as in
mpicc -o loop loop.c
and receive the response
[myid@user001 loop]$ mpicc -o loop loop.c
loops.c(68): (col. 3) remark: LOOP WAS VECTORIZED.
loops.c(32): (col. 3) remark: LOOP WAS VECTORIZED.
[myid@user001 loop]$
Try running it yourself from the command line on palmetto to see the results.
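The approach named in the header comment, distributing loop iterations across threads and protecting a global sum with a mutex, can be sketched with POSIX threads as follows. This is an illustrative reconstruction, not the original arrayloops.c. The names threaded_sum, worker, NTHREADS, and the slice structure are assumptions, and some systems may require a threading option at link time.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

/* Shared state: the running total and the mutex that guards it. */
static double global_sum;
static pthread_mutex_t sum_mutex = PTHREAD_MUTEX_INITIALIZER;

/* One thread's share of the loop iterations. */
struct slice {
    const double *data;
    size_t start;
    size_t end;   /* one past the last index */
};

/* Sums the slice locally, then adds the result to the global sum while
 * holding the mutex so that concurrent updates cannot be lost. */
static void *worker(void *arg)
{
    const struct slice *s = arg;
    double local = 0.0;
    for (size_t i = s->start; i < s->end; i++) {
        local += s->data[i];
    }
    pthread_mutex_lock(&sum_mutex);
    global_sum += local;
    pthread_mutex_unlock(&sum_mutex);
    return NULL;
}

/* Splits n elements across NTHREADS threads and stores their total in
 * *result. Returns 0 on success, -1 if a thread could not be created. */
int threaded_sum(const double *data, size_t n, double *result)
{
    pthread_t threads[NTHREADS];
    struct slice slices[NTHREADS];
    size_t per_thread = n / NTHREADS;
    int created = 0;
    int status = 0;

    global_sum = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        slices[t].data = data;
        slices[t].start = (size_t)t * per_thread;
        /* The last thread also takes any leftover elements. */
        slices[t].end = (t == NTHREADS - 1) ? n
                                            : slices[t].start + per_thread;
        if (pthread_create(&threads[t], NULL, worker, &slices[t]) != 0) {
            status = -1;
            break;
        }
        created++;
    }
    /* Wait for every thread that was started before reading the total. */
    for (int t = 0; t < created; t++) {
        pthread_join(threads[t], NULL);
    }
    *result = global_sum;
    return status;
}
```

Each thread takes the lock once, after finishing its local loop, rather than once per element; this keeps contention on the mutex low.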
Resources
Tutorials and Resources, OpenMP home page
Lawrence Livermore OpenMP Tutorial
Lawrence Livermore MPI Tutorial
Lawrence Livermore POSIX Threads Programming Tutorial
Introduction to OpenMP by Ruud van der Pas, Sun Microsystems
