
Compiling your parallel code

The point of using palmetto is to speed up your computing, and the way to do that is to split your code so that it runs across the palmetto nodes and CPUs. Once you have compiled and tested your code on one processor, it's time to look into running it in parallel.

Several parallel programming methods are in use today. We summarize three of them here, using their Wikipedia descriptions.

Method Description
OpenMP The OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C/C++ and Fortran on many architectures, including Unix and Microsoft Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer. (See http://en.wikipedia.org/wiki/OpenMP.)
MPI Message Passing Interface (MPI) is both a computer specification and its implementation that allows many computers to communicate with one another. It is used in computer clusters. ... MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation." (See http://en.wikipedia.org/wiki/Message_Passing_Interface and http://en.wikipedia.org/wiki/MPICH.) We use MPICH, a freely available and portable version of MPI, on palmetto.
TLP Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. ... By running many threads at once, these applications are able to tolerate the high amounts of I/O and memory system latency their workloads can incur - while one thread is delayed waiting for a memory or disk access, other threads can do useful work. (See http://en.wikipedia.org/wiki/Thread-level_parallelism.)

Parallelization using the Intel compilers

The Intel compilers include sophisticated tools for analyzing your code in various ways. This section briefly describes the options that instruct the compiler to analyze your code and set up your executable file for parallel execution automatically. For additional explanation and options, see the man pages (man ifort and man icc) or the additional resources listed in the Finding software and documentation section.

Option Description
-openmp
-openmp-report
Enables the parallelizer to generate multithreaded code based on OpenMP directives. The code can be executed in parallel on both uniprocessor and multiprocessor systems. The -openmp-reportn option controls the level of diagnostic messages from the OpenMP parallelizer, where n can be:

Level Description
0 Displays no diagnostic information.
1 Displays diagnostics indicating loops, regions, and sections successfully parallelized. This is the default.
2 Displays the diagnostics specified by -openmp-report1 plus diagnostics indicating successful handling of MASTER constructs, SINGLE constructs, CRITICAL constructs, ORDERED constructs, ATOMIC directives, etc.
-parallel
-par-report
-par-schedule
Tells the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. You must also specify -O2 or -O3. The -par-reportn option controls the diagnostic information reported by the auto-parallelizer, where n can be:

Level Description
0 Tells the auto-parallelizer to report no diagnostic information.
1 Tells the auto-parallelizer to report diagnostic messages for loops successfully auto-parallelized. This is the default. Issues a "LOOP AUTO-PARALLELIZED" message for parallel loops.
2 Tells the auto-parallelizer to report diagnostic messages for loops successfully auto-parallelized, as well as unsuccessful loops.
3 Tells the auto-parallelizer to report the same diagnostic messages as -par-report2, plus additional information about any proven or assumed dependencies inhibiting auto-parallelization (reasons for not parallelizing).

The -par-schedule<keyword>=n option specifies a scheduling algorithm for DO loop iterations. <keyword> specifies the scheduling algorithm and can be any of:

Algorithm Description
static Divides iterations into contiguous pieces (chunks) of size n. The chunks are statically assigned to threads in the team in a round-robin fashion in the order of the thread number. If no n is specified, the iteration space is divided into chunks that are approximately equal in size, and each thread is assigned at most one chunk.
dynamic Assigns iterations to threads in chunks as the threads request them. The thread executes the chunk of iterations, then requests another chunk, until no chunks remain to be assigned. If no n is specified, the default is 1.
guided Assigns iterations to threads in chunks as the threads request them. The thread executes the chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. If no n is specified, the default is 1.
runtime Defers the scheduling decision until run time. The scheduling algorithm and chunk size are then taken from the setting of environment variable OMP_SCHEDULE. You cannot specify n with this keyword.
-threads Specifies that multithreaded libraries should be linked. This means that any routines you call from these libraries will be executed in parallel. The default is -nothreads.
-vec
-vec-report
Takes advantage of Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3) vectorization. (Default: -novec prior to Intel 10, -vec in Intel 10.) The -vec-reportn option directs the compiler to generate a vectorization report, where n denotes which level of diagnostic messages to report. Possible values are:

Level Description
0 Tells the vectorizer to report no diagnostic information.
1 Tells the vectorizer to report on vectorized loops.
2 Tells the vectorizer to report on vectorized and non-vectorized loops.
3 Tells the vectorizer to report on vectorized and non-vectorized loops and any proven or assumed data dependences.
4 Tells the vectorizer to report on non-vectorized loops.
5 Tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized.

OpenMP Example

A simple program that shows the use of OpenMP directives is this example from the Lawrence Livermore tutorial:

      PROGRAM WORKSHARE1

      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +  OMP_GET_THREAD_NUM, N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10) 
      REAL A(N), B(N), C(N)

!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE

!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(I,TID)

      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads =', NTHREADS
      END IF
      PRINT *, 'Thread',TID,' starting...'

!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
      DO I = 1, N
        C(I) = A(I) + B(I)
        WRITE(*,100) TID,I,C(I)
 100    FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
      ENDDO
!$OMP END DO NOWAIT

      PRINT *, 'Thread',TID,' done.'

!$OMP END PARALLEL

      END

(Note: The example, in C, can be found at https://computing.llnl.gov/tutorials/openMP/samples/C/omp_workshare1.c)

We would compile it with

ifort -o workshare -openmp workshare.f

to which we receive the response

[myid@user001 workshare]$ ifort -o workshare -openmp workshare.f
workshare.f(25): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
workshare.f(16): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
workshare.f(10): (col. 7) remark: LOOP WAS VECTORIZED.
[myid@user001 workshare]$

We will show how to run this program in the next section.

MPI Example

This simple example shows an array decomposition in C, implemented with MPI library calls.

/******************************************************************************
* FILE: mpi_array.c
* DESCRIPTION: 
*   MPI Example - Array Assignment - C Version
*   This program demonstrates a simple data decomposition. The master task
*   first initializes an array and then distributes an equal portion that
*   array to the other tasks. After the other tasks receive their portion
*   of the array, they perform an addition operation to each array element.
*   They also maintain a sum for their portion of the array. The master task 
*   does likewise with its portion of the array. As each of the non-master
*   tasks finish, they send their updated portion of the array to the master.
*   An MPI collective communication call is used to collect the sums 
*   maintained by each task.  Finally, the master task displays selected 
*   parts of the final array and the global sum of all array elements. 
*   NOTE: the number of MPI tasks must be evenly divisible by 4.
* AUTHOR: Blaise Barney
* LAST REVISED: 04/13/05
****************************************************************************/
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define  ARRAYSIZE	16000000
#define  MASTER		0

float  data[ARRAYSIZE];

int main (int argc, char *argv[])
{
int   numtasks, taskid, rc, dest, offset, i, j, tag1,
      tag2, source, chunksize; 
float mysum, sum;
float update(int myoffset, int chunk, int myid);
MPI_Status status;

/***** Initializations *****/
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks % 4 != 0) {
   printf("Quitting. Number of MPI tasks must be divisible by 4.\n");
   MPI_Abort(MPI_COMM_WORLD, 1);   /* nonzero error code */
   exit(0);
   }
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
printf ("MPI task %d has started...\n", taskid);
chunksize = (ARRAYSIZE / numtasks);
tag2 = 1;
tag1 = 2;

/***** Master task only ******/
if (taskid == MASTER){

  /* Initialize the array */
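  sum = 0;
  for(i=0; i<ARRAYSIZE; i++) {
    data[i] = i * 1.0;
    sum = sum + data[i];
    }
  printf("Initialized array sum = %e\n",sum);

  /* Send each task its portion of the array - master keeps 1st part */
  offset = chunksize;
  for (dest=1; dest<numtasks; dest++) {
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
    offset = offset + chunksize;
    }

  /* Master does its part of the work */
  offset = 0;
  mysum = update(offset, chunksize, taskid);

  /* Wait to receive results from each task */
  for (i=1; i<numtasks; i++) {
    source = i;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2,
      MPI_COMM_WORLD, &status);
    }

  /* Get final sum and print sample results */
  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
  printf("Sample results: \n");
  offset = 0;
  for (i=0; i<numtasks; i++) {
    for (j=0; j<5; j++) 
      printf("  %e",data[offset+j]);
    printf("\n");
    offset = offset + chunksize;
    }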
  printf("*** Final sum= %e ***\n",sum);

  }  /* end of master section */



/***** Non-master tasks only *****/

if (taskid > MASTER) {

  /* Receive my portion of array from the master task */
  source = MASTER;
  MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
  MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, 
    MPI_COMM_WORLD, &status);

  mysum = update(offset, chunksize, taskid);

  /* Send my results back to the master task */
  dest = MASTER;
  MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
  MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);

  MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);

  } /* end of non-master */


MPI_Finalize();

}   /* end of main */


float update(int myoffset, int chunk, int myid) {
  int i; 
  float mysum;
  /* Perform addition to each of my array elements and keep my sum */
  mysum = 0;
  for(i=myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
    }
  printf("Task %d mysum = %e\n",myid,mysum);
  return(mysum);
  }

(Note: The example, in Fortran, can be found at https://computing.llnl.gov/tutorials/mpi/samples/Fortran/mpi_array.f)

We would compile it with

mpicc -o array-decomp mpi_array.c

to which we receive the response

[myid@user001 workshare]$ mpicc -o array-decomp mpi_array.c
array.c(54): (col. 3) remark: LOOP WAS VECTORIZED.
array.c(71): (col. 11) remark: LOOP WAS VECTORIZED.
array.c(129): (col. 3) remark: LOOP WAS VECTORIZED.
[myid@user001 workshare]$

We will show how to run this program in the next section.

TLP Example

We conclude with a simple thread-level example that handles array decomposition via loop distribution.

/******************************************************************************
* FILE: arrayloops.c
* DESCRIPTION:
*   Example code demonstrating decomposition of array processing by
*   distributing loop iterations.  A global sum is maintained by a mutex
*   variable.  
* AUTHOR: Blaise Barney
* LAST REVISED: 04/05/05
******************************************************************************/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS      4
#define ARRAYSIZE   1000000
#define ITERATIONS   (ARRAYSIZE / NTHREADS)

double  sum=0.0, a[ARRAYSIZE];
pthread_mutex_t sum_mutex;


void *do_work(void *tid) 
{
  int i, start, *mytid, end;
  double mysum=0.0;

  /* Initialize my part of the global array and keep local sum */
  mytid = (int *) tid;
  start = (*mytid * ITERATIONS);
  end = start + ITERATIONS;
  printf ("Thread %d doing iterations %d to %d\n",*mytid,start,end-1); 
  for (i=start; i < end ; i++) {
    a[i] = i * 1.0;
    mysum = mysum + a[i];
    }

  /* Lock the mutex and update the global sum, then exit */
  pthread_mutex_lock (&sum_mutex);
  sum = sum + mysum;
  pthread_mutex_unlock (&sum_mutex);
  pthread_exit(NULL);
}


int main(int argc, char *argv[])
{
  int i, start, tids[NTHREADS];
  pthread_t threads[NTHREADS];
  pthread_attr_t attr;

  /* Pthreads setup: initialize mutex and explicitly create threads in a
     joinable state (for portability).  Pass each thread its loop offset */
  pthread_mutex_init(&sum_mutex, NULL);
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
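  for (i=0; i<NTHREADS; i++) {
    tids[i] = i;
    pthread_create(&threads[i], &attr, do_work, (void *) &tids[i]);
    }

  /* Wait for all threads to complete, then print the global sum */
  for (i=0; i<NTHREADS; i++) {
    pthread_join(threads[i], NULL);
    }
  printf ("Done. Sum= %e \n", sum);

  /* Clean up and exit */
  pthread_attr_destroy(&attr);
  pthread_mutex_destroy(&sum_mutex);
  pthread_exit (NULL);
}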

We would compile it with mpicc, as in

mpicc -o loop loop.c

to which we receive the response

[myid@user001 loop]$ mpicc -o loop loop.c
loops.c(68): (col. 3) remark: LOOP WAS VECTORIZED.
loops.c(32): (col. 3) remark: LOOP WAS VECTORIZED.
[myid@user001 loop]$

Try running it yourself from the command line on palmetto to see the results.

Resources

Tutorials and Resources, OpenMP home page
Lawrence Livermore OpenMP Tutorial
Lawrence Livermore MPI Tutorial
Lawrence Livermore POSIX Threads Programming Tutorial
Introduction to OpenMP by Ruud van der Pas, Sun Microsystems



Maintained by CITI web services                    Copyright ©2008 Clemson University, Clemson, S.C. 29634, (864) 656-331