Dauger Research Inc. *
Home About Press Releases Contact Info
Line LeftPooch IconLineLinePooch Application
Line 
Line Right
What is Pooch?
Purchase
Setting Up
Running Parallel
Writing Parallel
Sw Development Kit
Compiling the MPIs

Tutorials  >
Parallelization
Parallel Zoology
Parallel Knock
Parallel Adder
Pascal's Triangle
Parallel Circle Pi
Parallel Life
MacMPI Visualization

FAQ

Documentation
Support


Site Map

Pooch Home > Writing Parallel > Parallel Adder

Parallelization - Parallel Adder

The following is a description of a basic single-processor code and its transition into a parallel code. We hope this example of a simple parallelization process serves to demonstrate what steps and calls are needed for codes whose work can be performed independently of one another. A simple "divide and conquer" approach is appropriate here.
Sample Code
adder.c adder.f90
paralleladder.c paralleladder.f90
  

To the experienced parallel programmer, this example is a "trivially parallelizable" code because the entire work of the code can be partitioned between the processors very simply. That is the point of this discussion: we are demonstrating how one might go about converting this kind of single-processor code into a parallel code.

This technique could apply to your situation in at least three possible ways. First, this technique is applicable to to problems such as minmax searches for the best move in a game such as chess, "monte carlo" techniques, or computationally-intensive searches through a large parameter space. If you are fortunate enough to have a code of this category, then you need to look little further than this technique. Second, selected components of a code could also fall under this category, so this technique could be useful to apply to those portions of a single-processor code. Finally, if this is your first look at parallel code, this example serves as an excellent start to learn how to go about writing and thinking about parallel codes.

Before: Single-Processor adder

We begin with a code utilizing single-processor execution. This example, adder, sums the square of each integer from 1 to 10000, inclusive. It is organized into three routines:
1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100 + 121 + .... + 99960004 + 99980001 + 100000000

  

  • kernelroutine - performs the elemental work of squaring each integer

  • computeloop - allocates memory, loops over the integers, calls kernelroutine, and saves and sums their results

  • main - performs any required initialization, calls computeloop, saves its results into a file, printf's the sum, and deallocates memory

The code is structured in such a way to clearly separate the pieces that perform the work that is secondary yet essential to solving the problem. These include allocating the appropriate memory and saving data to an output file. Besides identifying where these functions are being performed, the structure makes it obvious that a much more complicated problem can substitute for the kernelroutine shown. We also intend this explicit structure to represent corresponding portions of a much larger code. The C source is shown in Listing 1.


Listing 1 - adder.c

See also adder.f90


#include <stdio.h> /* standard I/O routines */

#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */

/* A routine performing elemental work.
This can be replaced with a much larger routine. */
double kernelroutine(double input);
double kernelroutine(double input)
{
return (input+1)*(input+1);
}

/* HighestIndex specifies how highest index to sample */
#define HighestIndex 10000L

void computeloop(double *theSum, double **theArray);
void computeloop(double *theSum, double **theArray)
{
/* local copies of the data to be output */
double *myArray, mySum=0;
/* limit of the loop */
long loopEnd=HighestIndex;

/* allocate an array to hold all the results */
myArray=malloc(sizeof(double)*loopEnd);

if (myArray) {
long index;

/* loop over indicies */
for(index=0; index<loopEnd; index++) {
/* call the kernel routine for each index, and save into the array */
myArray[index]=kernelroutine(index);
/* sum as we go */
mySum+=myArray[index];

}

}

/* return the sum and the array */
*theSum=mySum;
*theArray=myArray;
}

int main(int argc, char *argv[])
{
/* main copies of the sum and the array */
double theSum, *theArray=NULL;

printf("Beginning computation...\n");

computeloop(&theSum, &theArray);

if (theArray) {/* error checking */
FILE *fp;

/* save the array into a data file */
fp=fopen("output", "w");
if (fp) {
printf("writing array...\n");

fwrite(theArray, sizeof(double), HighestIndex, fp);

fclose(fp);
}

printf("the sum is %f\n", theSum);

/* clean up after myself */
free(theArray);

}
else
printf("memory allocation failure\n");

return 0;
}

When this code is run with its HighestIndex constant set to 10000, the code states that the sum is 333383335000 and produces a 80000 character long binary file containing the squares of integers 1 through 10000.

After: paralleladder

A key to parallelizing an application is choosing the appropriate partitioning of the problem. The obvious choice here is to divide the number of integers by the number of processors. However, a few details must be worked out: What if the number of processors doesn't divide evenly into the number of integers? How does each processor know which integers to work on so they don't do the same integer twice or miss one? Once they have the answers, how do the sum numbers on one processor with those on another, and how do these processors save the data into one file?

1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100 + 121 + ... 11108889 + 11115556 + 11122225 + 11128896 + ... 44435556 + 44448889 + 44462224 + 44475561 + ...



+ + =
  

The parallelization process here is primarily a matter of managing and delegating data and computation. Changes to main are limited to "boilerplate" code that prepare and tear down the parallel environment and pass the sufficient information to computeloop so that it can organize the work and a if test so only processor 0 creates an output file. The most important modifications are in computeloop to perform the coordination and calculate the partitioning between processors. Given the identification number of the processor it is running on and the number of processors in the system, computeloop calculates how to partition the problem among the processors. To minimize bottlenecks, the executable running on each processor calculates its assignment itself, without a central authority delegating assignments. Once it finishes its work, it collects the data back to processor 0 so main can write the output in one convenient place for the user.

The detailed answer is in the code. Listing 2 shows the C source of paralleladder.c, with the changes relative to adder.c underlined.


Listing 2 - paralleladder.c, with changes compared to adder.c underlined

See also paralleladder.f90


#include <stdio.h> /* standard I/O routines */
#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */

#include "mpi.h" /* MPI library */

/* A routine performing elemental work.
This can be replaced with a much larger routine. */
double kernelroutine(double input); 
double kernelroutine(double input) 
	{
	return (input+1)*(input+1); 
	}

/* HighestIndex specifies how highest index to sample */
#define HighestIndex			10000L

void computeloop(double *theSum, double **theArray, int idproc, int nproc); 
void computeloop(double *theSum, double **theArray, int idproc, int nproc)
	{
	/* local copies of the data to be output */
	double *myArray, mySum=0; 
	/* limit of the loop */
	long loopEnd=(HighestIndex+nproc-1)/nproc;
	/* just this proc's piece */
	
	/* this processor's index offset */
	long offset=idproc*loopEnd;
	
	/* allocate an array to hold all the results */
	myArray=malloc(sizeof(double)*loopEnd); 
	
	if (myArray) {
		long index; 
		
		/* loop over indicies */
		for(index=0; index<loopEnd; index++) {
/* call the kernel routine for each index, and save into the array */
			myArray[index]=kernelroutine(index+offset); 
			/* sum as we go */
			if (index+offset<HighestIndex)
			/* limit to the desired indicies */
				mySum+=myArray[index]; 
			
			}
		
		/* proc 0 needs to hold the entire array */
		{double *bigArray=idproc?NULL:malloc(sizeof(double)*loopEnd*nproc);
	
		/* gathers the data from the other arrays ... */
		MPI_Gather(myArray, loopEnd, MPI_DOUBLE, 
			bigArray, loopEnd, MPI_DOUBLE, 
			/* ... to proc 0 */ 0, MPI_COMM_WORLD); 
		
		if (!idproc) {
			free(myArray); 
			myArray=bigArray; 
			}
		}
		
		/* performs a parallel sum across processors 
			and saves the result ... */
		MPI_Reduce(&mySum, &mySum, 1, MPI_DOUBLE, MPI_SUM, 
			/* ... at proc 0 */ 0, MPI_COMM_WORLD); 
		
		}
	
	/* return the sum and the array */
	*theSum=mySum; 
	*theArray=myArray; 
	}
	
int ppinit(int argc, char *argv[], int *idproc, int *nproc); 
void ppexit(void);

int main(int argc, char *argv[])
	{
	/* main copies of the sum and the array */
	double theSum, *theArray=NULL; 
	int idproc, nproc, ierr; 
	
	/* initialize parallel processing */
	ierr = ppinit(argc, argv, &idproc, &nproc); 
	
	if (ierr) return ierr; /* stop right there if there's a problem */
	
	printf("I'm processor #%d in a %d-processor cluster.\n", idproc, nproc); 
	
	printf("Beginning computation...\n"); 
	
	computeloop(&theSum, &theArray, idproc, nproc); 
	
	if (theArray) {/* error checking */
		
		if (!idproc) {/* only processor 0 */
			FILE *fp; 
		
		/* save the array into a data file */
			fp=fopen("output", "w"); 
			if (fp) {
				printf("writing array...\n"); 
				
				fwrite(theArray, sizeof(double), HighestIndex, fp); 
				
				fclose(fp); 
				}
			
			printf("the sum is %f\n", theSum); 
			
			}
		
		/* clean up after myself */
		free(theArray); 
		
		}
	else 
		printf("memory allocation failure\n"); 
	
	/* only proc 0 pauses for user exit */
	if (!idproc) {
		printf("press return to continue\n"); 
		getc(stdin); 
		}
	
	ppexit(); 
	
	return 0; 
	}
	
#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
#include <SIOUX.h>
#endif 

int ppinit(int argc, char *argv[], int *idproc, int *nproc)
	{
/* this subroutine initializes parallel processing
   idproc = processor id
   nproc = number of real or virtual processors obtained */
   int ierr;
   *nproc=0; 

/* initialize the MPI execution environment */

   ierr = MPI_Init(&argc, &argv);
   if (!ierr) {

/* determine the rank of the calling process in the communicator */
		ierr = MPI_Comm_rank(MPI_COMM_WORLD, idproc);

/* determine the size of the group associated with the communicator */
		ierr = MPI_Comm_size(MPI_COMM_WORLD, nproc);

#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
		SIOUXSettings.asktosaveonclose=0; 
		SIOUXSettings.autocloseonquit=1;
#endif 
		
		}
	
	return ierr;
	}

void ppexit(void)
	{
/* this subroutine terminates parallel processing */
	int ierr;

/* terminate MPI execution environment */
	ierr = MPI_Finalize();
	
	}

The changes from adder to paralleladder:

  • mpi.h - the header file for the MPI library, required to access information about the parallel system and perform communication

  • idproc, nproc - nproc describes how many processors are currently running this job and idproc identifies the designation, labeled from 0 to nproc - 1, of this processor. This information is sufficient to identify exactly which part of the problem this instance of the executable should work on. In this case, these variables are supplied to computeloop by main.

  • loopEnd=(HighestIndex+nproc-1)/nproc; - loopEnd describes how many integers this processor should perform kernelroutine as it was in adder. Here, we choose to make have each processor perform the same amount of work, yet the total amount of work is at least that necessary to complete the problem. Depending on HighestIndex and nproc, the last processor might do a little too much, but that processor would end up waiting for the others to catch up if it did skip the excess work and it's only a small waste if HighestIndex is sufficiently large.

  • offset=idproc*loopEnd; - By shifting the start of the sampling for this particular processor, offset is how each processor knows not to overlap work with other processors, without skipping the needed work. With this variable and loopEnd, this is sufficient information to specify the partition of work assigned to this processor.

  • the malloc - performs the memory allocation for each processor. Except for processor 0, the memory allocation is the amount sufficient to hold the results of this processor's partition of work. In the case of processor 0, it must create an array big enough hold the results of its own work and that of every other processor to save the data later.

  • if (index+offset<HighestIndex) - limit the summation to only the indicies we want in the sum

  • MPI_COMM_WORLD - MPI defines communicator worlds or communicators that define a set of processors that can communicate with each other. At initialization, one communicator, MPI_COMM_WORLD, covers all the processors in the system. Other MPI calls can define arbitrary subsets of MPI_COMM_WORLD, making it possible to confine a code to a particular processor subset just by passing it the appropriate communicator. In simpler cases such as this, using MPI_COMM_WORLD is sufficient.

  • MPI_Gather - fills bigArray of processor 0 with the data in myArray from the other processors in preparation for saving into one data file. myArray, loopEnd, and MPI_DOUBLE specify the first element, size, and data type of the array. 0 specifies where the data is to be collected. This call can be considered a "convenience" routine that simplifies the process of "gathering" regularly organized data from other processors.

  • MPI_Reduce - computes the sum, specified by MPI_SUM, of each mySum variable of each processor and collects the result in processor 0. This call can sum an entire array of values, so mySum is passed as an array of length one. When supplied MPI_MAX or MPI_MIN, it can also perform other operations such as comparing values across processors and retaining the maximum or minimum values.

  • ppinit - performs initialization of the parallel computing environment. Part of the boilerplate for MPI codes, this routine calls:
    • MPI_Init - performs the actual initialization of MPI. It returns the error code; zero means no error.
    • MPI_Comm_size - accesses the processor count of the parallel system
    • MPI_Comm_rank - accesses the identification number of this particular processor

  • ppexit - by calling MPI_Finalize, terminates the parallel computing environment. Also part of the boilerplate of MPI codes, a call to MPI_Finalize is needed to properly cleanup and close the connections between codes running on other processors and release control.

  • __MWERKS__ - these #ifdef blocks merely compensate for a pecularity about the Metrowerks CodeWarrior compiler. These blocks allows the executable on other processors terminate automatically. If you are using another compiler, you don't have to worry about this code.

Note that kernelroutine did not need to be altered at all. That reflects the independence of the work performed by this routine. The code that organizes and coordinates the work between processors is in computeloop. main's changes merely perform the boilerplate work of setting up and tearing down the parallel environment, plus an if statement so only processor 0 writes the output file.

Something to note is that this code does not explicitly call MPI_Send or MPI_Recv or any other elemental communication calls. Instead it calls MPI_Gather and MPI_Reduce. There isn't anything wrong with using elemental versus collective calls, if either will do the job. In fact, the MPI_Gather could just as easily be replaced with the following:


Listing 3 - an alternative to the above MPI_Gather call

if (idproc) 
	MPI_Send(myArray, loopEnd, MPI_DOUBLE, 0, idproc, MPI_COMM_WORLD); 
else {
	long otherProc; 
	MPI_Status status; 
	for(index=0; index<loopEnd; index++) bigArray[index]=myArray[index]; 
	for(otherProc=1; otherProc<nproc; otherProc++)
		MPI_Recv(&bigArray[otherProc*loopEnd], loopEnd, MPI_DOUBLE, 
			otherProc, otherProc, MPI_COMM_WORLD, &status); 
	}

We suggest replacing MPI_Reduce with elemental calls as an exercise for the reader.

Also note that it is not necessary to partition the work into contiguous blocks of regions. One could interleave the work: on a 4-processor system, processor 0 does 1, 5, 9, 13, etc., processor 1 does 2, 6, 10, 14, etc., processor 2 does 3, 7, 11, 15, etc., and processor 3 does 4, 8, 12, 16, etc. It may be appropriate or more efficient to interleave partitions for load balancing or other reasons. The AltiVec Fractal Carbon demo and Fresnel Diffraction Explorer take this approach. We suggest rearranging paralleladder's partitioning to interleaved parallelization, rather than block partitioning, as an exercise for the reader.

The structure of both adder and paralleladder are organized into a few separate routines. This practice is usually considered good coding style, but there is nothing stopping a writer from combining these routines all into one large routine. The same could be done with both the single-processor and parallel code. However, the choice was made for clarity rather than compactness.

The purpose of this discussion was to most clearly highlight what changes are needed to parallelize a code. Although your code may or may not be structured this way, you may be able to recognize what pieces of your code correspond to the routines in adder, which could suggest how to parallelize it. In addition, a more complicated, more sophisticated code that could benefit from this technique may already be organized like adder for the sake of code readibility and maintainability. Again, by comparing the above codes, we hope it may become more easily apparent how to parallelize your code.

References

Using MPI by William Gropp, Ewing Lusk and Anthony Skjellum is a good reference for the MPI library. In addition, many of the techniques and style of the above parallel code were adopted from Dr. Viktor K. Decyk. He expresses these ideas in a few references, including:

  • V. K. Decyk, "How to Write (Nearly) Portable Fortran Programs for Parallel Computers", Computers In Physics, 7, p. 418 (1993).
  • V. K. Decyk, "Skeleton PIC Codes for Parallel Computers", Computer Physics Communications, 87, p. 87 (1995).

You're welcome to read more about parallel computing via the Tutorials page.



Pooch is a trademark of Dauger Research

Pooch  Altivec Fractal Carbon  Atom in a Box  Solar System Simulator  Fresnel Diffraction Explorer 

Copyright 2001-2005 Dauger Research, Inc. All rights reserved.
Dauger Research, Inc. · P.O. Box 3074 Huntington Beach, CA 92605