Parallelization - Parallel Life
We describe how to convert a single-processor code that calculates a two-dimensional, propagation-style problem into a parallel code.
This example demonstrates how one might handle a system whose partitions are not independent of one another but exhibit localized dependencies.
An approach that simply cuts the problem into isolated pieces is insufficient because of the internal interactions inherent in the system.
Nonetheless, the work of the code can be partitioned among the processors and their dependencies handled efficiently.
This technique applies to a variety of possible codes. Problems that can be translated into
finite-difference equations, including the diffusion equation, wave propagation, fluid dynamics, and
other problems described by partial differential equations, have an almost direct correspondence to the
components of this code.
In two-dimensional graphics, filters such as Gaussian blur, sharpen, and other convolution-style
algorithms can also be thought of as
a propagation of information from one pixel to its neighbors, that is, a local interaction.
Portions of a code could also resemble these problems, making this technique applicable.
Other, more complex codes, such as particle-based simulations, including the particle push of most plasma PIC codes,
molecular dynamics codes,
and astronomical or other N-body simulations, could benefit from the parallelization approach demonstrated here.
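For instance, an explicit finite-difference update for the diffusion equation has exactly this neighbor-only locality. The following is a minimal sketch, not part of the life code below; the grid size N and the coefficient alpha are illustrative assumptions.
#define N 64 /* illustrative grid size */
/* One explicit diffusion time step on the interior of an N x N grid.
   Each new value depends only on a cell and its four nearest neighbors,
   the same locality exploited by the Life code below. */
void diffusestep(double newGrid[N][N], double oldGrid[N][N], double alpha)
{
    long i, j;
    for (i = 1; i < N-1; i++)
        for (j = 1; j < N-1; j++)
            newGrid[i][j] = oldGrid[i][j]
                + alpha*(oldGrid[i-1][j] + oldGrid[i+1][j]
                       + oldGrid[i][j-1] + oldGrid[i][j+1]
                       - 4.0*oldGrid[i][j]);
}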
This example is like the previous
Parallel Pascal's Triangle,
except that this problem is two-dimensional and the interprocessor communication is bidirectional at
each time step.
Finally, if you are learning about parallel code,
the following builds on how to combine partitioning with elemental message passing.
Before: Single-Processor Game of Life
The Game of Life is an example of a cellular automaton introduced by John Conway, Professor of Mathematics at Princeton.
The space is divided into a square grid, and each grid square holds a "cell" that is either alive or dead.
The life and death of these cells evolves from one time step to the next
based on the previous state of that cell and its neighbors.
The result is a completely deterministic, dynamic system of cells that, at least
superficially, mimics the appearance of cellular life in a petri dish.
The rules of this "Game" are very simple. Each cell has eight neighbors, four of them one
square away (like a rook move)
and four more one diagonal
step away (like a bishop move).
If three of these neighbors are alive, then
the cell in the center will come to life in the next time step regardless of its state in the previous one.
If two neighbors are alive, life in the center cell will survive only if it was already alive.
In all other cases, the cell will die of either overpopulation (four or more live neighbors) or
loneliness (two or fewer live neighbors).
These rules, chosen by Professor Conway,
result in an interesting evolution.
For example: 5 live neighbors => the cell dies.
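In code, these rules reduce to a simple test on the neighbor count. Here is a minimal sketch; the helper name nextstate is illustrative and is not part of the listing below.
/* Conway's rules as a pure function: given a cell's current state (0 or 1)
   and its count of live neighbors, return its state in the next time step. */
int nextstate(int alive, int neighborCount)
{
    if (neighborCount == 3) return 1;      /* birth, or survival */
    if (neighborCount == 2) return alive;  /* survives only if already alive */
    return 0;                              /* dies of loneliness or overpopulation */
}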
The single-processor version of this Game of Life code, life,
performs a direct computation of Conway's Life rules on an allocated two-dimensional array.
Its algorithm is as follows: Begin with two equal-sized arrays, one for the current time step and one for the next.
For each cell, count how many of its eight neighbors are alive, then apply the Life rules and save the result in the
array for the next time step.
Next, maintain any boundary conditions as necessary (in this case periodic), and then swap the arrays.
Continue for the desired number of time steps.
In fact, this code allocates arrays one cell larger in each of the four directions
out from the problem space. This makes the routine that counts neighboring live cells easier to write because
it does not have to check whether a neighbor is outside the array bounds.
The extra cells are maintained by the boundary condition routine, which, in this case,
maintains periodic boundary conditions by duplicating the first and last rows and columns into the
extra parts of the array.
Factoring the boundary condition code is a good practice because it makes it easy to
adjust the boundary conditions to, say, a vacuum (all dead cells) or
perhaps a driving function.
Separating the tasks of the code into distinct purposes also lends itself to
easy experimentation
with different initial conditions,
or different cellular rules,
or mixed boundary conditions.
In fact, this code detects if the life array has become stagnant and introduces
patterns that can cause interesting cellular reactions.
This code is organized into three primary routines plus main:
- propagatelife - counts the life in neighboring cells and applies the Life rules,
generating output in a second array. Swapping references to a pair of arrays on alternating calls to
this routine obviates potentially costly memory moves or copies (see the sketch after this list).
- maintainboundaryconditions - maintains the boundary conditions of the
sample of life by writing the extra space in the life array based on the state of
the internal parts of the array. In this implementation, it performs a simple
copy of the last row to the array just before the first row and a copy of the first row
to the array just after the last row. It copies the first and last columns similarly.
- performcalculation - allocates memory, initializes and seeds the system,
calls maintainboundaryconditions and propagatelife, and
displays the ongoing state of the system.
If its test for a stagnant situation returns true, it introduces new material.
Once the system has finished propagating, it deallocates memory and returns.
- main - initializes the random number generator and calls performcalculation
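As mentioned above, the pair of arrays is swapped by exchanging references rather than copying cell data. Inside the time-step loop the idiom amounts to the following sketch; it relies on the types and routines of Listing 1 below, and the names current, next, and temp are illustrative.
/* Double buffering: compute from "current" into "next", then swap the
   pointers so the result becomes the input of the following time step. */
LifeArrayPtr current = arrayA, next = arrayB, temp;
maintainboundaryconditions(current);
propagatelife(next, current);
temp = current;
current = next;
next = temp;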
While the above routines perform work primary to the problem,
additional routines perform functions secondary to the problem.
These include
allocating and formatting the appropriate memory structures (NewLifeArray)
and printing the state of the system (fprintresult).
As a side effect, fprintresult
returns a count of the total number of active cells in the array,
which is used to determine if new life should be triggered
using addmaterial.
A side effect of propagatelife
is that it increments an integer in each live cell of the array to
indicate how long that cell has been alive.
This auxiliary data is used in fprintresult
for its display and its active cell calculation.
fractionalsleep, like sleep, will
have the process give up time for the specified number of seconds, but
fractionalsleep accepts fractional seconds, allowing us to
slow down the calculation to several frames per second.
The C source is shown in Listing 1.
Listing 1 - life.c
#include <stdio.h> /* standard I/O routines */
#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */
#include <time.h> /* time libraries */
#include <unistd.h> /* for sleep() */
void fractionalsleep(double sleepRequest);
void fractionalsleep(double sleepRequest)
{/* like sleep(), except that it accepts fractions of a second */
if (sleepRequest>0) {
struct timespec request;
double intSleep;
request.tv_nsec=(1e9*modf(sleepRequest, &intSleep));
request.tv_sec=(long)intSleep;
nanosleep(&request, NULL);
}
}
#ifndef Byte
#define Byte unsigned char
#endif
/* Defines array structure and access */
typedef struct LifeArrayStruct {
Byte *array;
long rowBytes; /* access with array[col+rowBytes*row] */
long columnSize; /* <= rowBytes-2 */
long rowSize;
Byte buffer[4];
/* allocation must be at least sizeof(Byte)*(rowSize+2)*(columnSize+2) */
} LifeArrayStruct, *LifeArrayPtr, **LifeArrayHandle;
#include <string.h>
LifeArrayPtr NewLifeArray(long columnSize, long rowSize);
LifeArrayPtr NewLifeArray(long columnSize, long rowSize)
{LifeArrayPtr out=NULL;
if ((columnSize>0)&&(rowSize>0)) {
long rowBytes=columnSize+2;
long allocation=sizeof(LifeArrayStruct)+rowBytes*(rowSize+2)*sizeof(Byte);
out=malloc(allocation);
if (out) {
memset(out, 0, allocation);
out->rowBytes=rowBytes;
out->columnSize=columnSize;
out->rowSize=rowSize;
out->array=&out->buffer[1+out->rowBytes];
}
}
return out;
}
void propagatelife(LifeArrayPtr out, LifeArrayPtr in); /* assumed to be equal in size */
void propagatelife(LifeArrayPtr out, LifeArrayPtr in)
{/* Propagates life according to rules by J. Conway in 1970 */
if (out&&in) {
long row=in->rowSize;
while (row--) {
long column=in->columnSize;
Byte *outRow=&out->array[row*out->rowBytes];
Byte *inRow=&in->array[row*in->rowBytes];
/* Hey you! Vectorize this! */
while (column--) {
long neighborCount=(inRow[column-1]?1:0)
+(inRow[column+1]?1:0);
neighborCount+=(inRow[column+in->rowBytes-1]?1:0)
+(inRow[column+in->rowBytes]?1:0)
+(inRow[column+in->rowBytes+1]?1:0);
neighborCount+=(inRow[column-1-in->rowBytes]?1:0)
+(inRow[column-in->rowBytes]?1:0)
+(inRow[column+1-in->rowBytes]?1:0);
switch (neighborCount) {
default:
case 0: case 1:
/* not enough */
case 4: case 5: case 6: case 7: case 8: case 9:
/* too much */
outRow[column]=0; /* die */
break;
case 2: if (!(inRow[column])) {
outRow[column]=0; /* stay dead */
break;
}
case 3: /* just right */
if (!(outRow[column]=inRow[column]+1))
outRow[column]=0xff; /* live */
break;
}
}
}
}
}
void maintainboundaryconditions(LifeArrayPtr in);
void maintainboundaryconditions(LifeArrayPtr in)
{/* Maintains periodic boundary conditions */
if (in) {
long row=in->rowSize;
/* copy ends of rows */
while (row--) {
Byte *rowP=&in->array[row*in->rowBytes];
rowP[-1]=rowP[in->columnSize-1];
rowP[in->columnSize]=rowP[0];
}
/* copy first and last rows */
{long column=in->columnSize+2;
Byte *firstRow=&in->array[-1];
Byte *lastRow=&in->array[(in->rowSize-1)*in->rowBytes-1];
while (column--) {
firstRow[column-in->rowBytes]=lastRow[column];
lastRow[column+in->rowBytes]=firstRow[column];
}
}
}
}
long fprintresult(FILE *stream, LifeArrayPtr in);
long fprintresult(FILE *stream, LifeArrayPtr in)
{long out=0;
if (in) {/* fprintf the array */
long column;
for (column=0; column<in->columnSize; column++) {
long row;
for (row=0; row<in->rowSize; row++) {
char c=in->array[in->rowBytes*row+column];
switch (c) {
case 0: c=' '; break;
case 2: case 3: case 4: case 5:
case 6: case 7: case 8: case 9: case 10:
out++;
case 1:
out++;
c+='0'-1; break;
default: c='*'; break;
}
if (stream) fprintf(stream, "%c", c);
}
if (stream) fprintf(stream, "\n");
}
}
return out>>1;
}
void saveresult(char *filename, LifeArrayPtr in);
void saveresult(char *filename, LifeArrayPtr in)
{/* save the array into a file */
if (filename) {
FILE *fp=fopen(filename, "a+");
if (fp) {
fprintresult(fp, in);
fclose(fp);
}
}
}
long genArray[]={0x0247, 0x07d9, 0x09be, 0x0fb9, 0x17d9, 0x1f4e, 0x1fdc, 0x27df, 0x8957};
void addmaterial(LifeArrayPtr in);
void addmaterial(LifeArrayPtr in)
{/* when called add material to the life array */
if (in) {
long rowstart=(rand()%(in->rowSize-6));
long colstart=(rand()%(in->columnSize-6));
if (rowstart<0) rowstart=-rowstart;
if (colstart<0) colstart=-colstart;
rowstart++;
colstart++;
{
long j=4, s=(rand()&3)?0:(rand()%(sizeof(genArray)/sizeof(long)));
if (s<0) s=0;
s=genArray[s];
printf("adding material %x\n", s); sleep(1); /**/
if (rand()&1) {/* flip */
s=((s&0xf000L)>>12)|
((s&0x0f00L)>>4)|
((s&0x0f0L)<<4)|
((s&0x0fL)<<12);
}
if (rand()&1) {/* flip */
s=((s&0x8888L)>>3)|
((s&0x4444L)>>1)|
((s&0x2222L)<<1)|
((s&0x1111L)<<3);
}
while (j--) {
long i=4;
while (i--) {
in->array[i+colstart+(rowstart+j)*in->rowBytes]|=
(s&1)<<2;
s>>=1;
}
}
}
}
}
#define RowDimension 80
#define ColumnDimension 22
void performcalculation();
void performcalculation()
{
/* allocate */
LifeArrayPtr arrayA, arrayB;
arrayA=NewLifeArray(ColumnDimension, RowDimension);
arrayB=NewLifeArray(ColumnDimension, RowDimension);
if (arrayA&&arrayB) {
long frameCount=0;
long countdown=8;
/* set initial conditions */
{long row=RowDimension;
while (row--) {
Byte *rowArray=&arrayA->array[row*arrayA->rowBytes];
long column=ColumnDimension;
while (column--) {
rowArray[column]=rand()&1;
}
}
}
fprintresult(stdout, arrayA);
while (frameCount<5000) {
LifeArrayPtr lastArray=frameCount&1?arrayB:arrayA;
LifeArrayPtr nextArray=frameCount&1?arrayA:arrayB;
long activecellcount;
maintainboundaryconditions(lastArray);
propagatelife(nextArray, lastArray);
system("clear");
printf("\fFrame %d \n", frameCount);
activecellcount=fprintresult(stdout, nextArray);
fractionalsleep(0.125);
if (activecellcount*100<(ColumnDimension*RowDimension)) {
if (!countdown) {
addmaterial(nextArray);
countdown=16;
}
else
countdown--;
}
else
countdown=32;
frameCount++;
}
}
/* deallocate */
if (arrayA) free(arrayA);
if (arrayB) free(arrayB);
}
int main(void)
{
{
long throwoutcount=time(NULL)&0x3ffL;
printf(
"Initializing random number generator by throwing out %d random numbers...\n",
throwoutcount);
while (throwoutcount--) rand();
}
printf("Beginning calculation... (first random is %d)\n", rand());
performcalculation();
return 0;
}
When this code is run with its default array size of 80 by 22, the code produces a series of
frames displaying the evolution of life-like behavior.
For example:
Frame 148 000 1 0
0 2
030 0 10 **
01 1 32 * *
0001 2 1 1 *
11 0 25
03 13
3 0
010 0 21 2
01 001 2 30 1 **
012 0001 10 ** **
*0 10240 * *
0 01 * *
21 04 1 * *
0 0 * *
0 0 00 0 1 * * *
010 1111 1 1 0 * * **
0 0 500005 1 * * *
** * * 60 01 0 **
* 0 0 07 4 0 21 *
* 1 4 6 4 0 0*0 *
* 0*0 23 421 1 1 0
Each cell is represented as a printed character. The life calculation depends only on whether or not a cell has life.
The numbers, while having no effect on the life calculation, indicate how long ago that cell came to life.
The asterisks are cells that have persisted for more than nine time steps.
By design, the extra cells in the array are not displayed, but their effect is evident in how some structures wrap
around the space to the other side.
Note the shapes the cells form. The shape of a structure indicates both its history and its future, but
interactions between structures occur only when one structure is a cell away from another. The study and
evolution of these structures is intriguing work but is outside the scope of this tutorial.
After: parallellife
Choosing the appropriate partitioning of the problem is key to parallelizing the application.
The decision made here is to divide the problem space evenly into one block per processor.
An important detail is that the trailing edge of one block will influence and be
influenced by the leading edge of the next,
a result of the nature of the interdependence in this system: interactions are localized.
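In this code each processor allocates its own block of the same size, and the global space is simply the blocks laid side by side in processor order. If one needed to convert a local cell coordinate into a global one, the arithmetic would look something like the following sketch; it assumes, as the boundary exchange in Listing 2 implies, that the blocks adjoin along the array's row index, and globalrow is an illustrative helper that is not part of parallellife.
/* Hypothetical mapping from a local row index on processor idproc to a
   global row index, assuming nproc equal blocks of RowDimension rows
   stacked in order of processor id. */
long globalrow(long localRow, int idproc)
{
    return (long)idproc * RowDimension + localRow;
}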
After addressing "boilerplate" issues such as processor count and identification data,
the parallelization process focuses on how and where to propagate interactions between processors.
In fact the total changes to this code are remarkably few.
Changes to performcalculation are limited to "boilerplate" code that start up and shut down the parallel environment
and
pass sufficient information only to maintainboundaryconditions so that it can
sustain the overall calculation.
In fact, maintainboundaryconditions is the only routine that has message passing code!
It calculates the identification numbers of its immediate neighbors, then exchanges data with those neighboring
processors. The message passing is embodied in the
boundary condition routine because that's exactly how we are using message passing: the
boundary between processors is maintained using messages from the neighboring space.
Before, this routine maintained periodic boundary conditions; now
it's maintaining the boundaries of this processors space using data from its neighbors.
This is the message passing necessary to perform the calculation
The executable running on each processor calculates how it will communicate on its own,
automatically minimizing bottlenecks.
There is no central
authority delegating assignments or directing data transfer.
Adding message passing was convenient in part because we already had arrays one cell larger in each direction.
Before, we were already manipulating the edges of the problem space to maintain boundary conditions.
Now we are using those extra cells to propagate data between
the edges of the problem space assigned to each processor.
maintainboundaryconditions, executed almost simultaneously on each processor,
supplies the processors with the information they need from one another to perform this interaction.
These extra cells are also called "guard cells".
The guard cells allow the computation code,
propagatelife, to remain unchanged by acting on a slightly
larger array.
A guard cell is the extra room for a duplicate of the closest cells in the neighboring processor's partition.
Interprocessor communication between the primary computation calls
reduces to maintaining these duplicate elements as needed.
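In terms of Listing 1's memory layout, the guard cells are the row just before row 0, the row just after row rowSize-1, and one extra byte at each end of every row; NewLifeArray arranges this by pointing array one row plus one byte into the allocation. The following sketch shows how the guard rows are addressed using the listing's own fields; the pointer names are illustrative.
/* Guard rows sit just outside the interior rows 0..rowSize-1 of a
   LifeArrayPtr named in, as allocated by Listing 1's NewLifeArray: */
Byte *guardBeforeFirst = &in->array[-in->rowBytes];              /* "row -1"      */
Byte *guardAfterLast   = &in->array[in->rowSize * in->rowBytes]; /* "row rowSize" */
/* In the serial code, maintainboundaryconditions fills these from the opposite
   edges of the same array (periodic boundaries); in parallellife it fills them
   with rows received from the neighboring processors. */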
At each time step, each processor displays its portion of the calculation.
The detailed answer is in the code.
Listing 2 shows the C source of parallellife. Most of the changes relative to life.c fall in the section bracketed by the #pragma mark comments.
Listing 2 - parallellife.c
#include <stdio.h> /* standard I/O routines */
#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */
#include <time.h> /* time libraries */
#include <unistd.h> /* for sleep() */
void fractionalsleep(double sleepRequest);
void fractionalsleep(double sleepRequest)
{/* like sleep(), except that it accepts fractions of a second */
if (sleepRequest>0) {
struct timespec request;
double intSleep;
request.tv_nsec=(1e9*modf(sleepRequest, &intSleep));
request.tv_sec=(long)intSleep;
nanosleep(&request, NULL);
}
}
#ifndef Byte
#define Byte unsigned char
#endif
/* Defines array structure and access */
typedef struct LifeArrayStruct {
Byte *array;
long rowBytes; /* access with array[col+rowBytes*row] */
long columnSize; /* <= rowBytes-2 */
long rowSize;
Byte buffer[4];
/* allocation must be at least sizeof(Byte)*(rowSize+2)*(columnSize+2) */
} LifeArrayStruct, *LifeArrayPtr, **LifeArrayHandle;
#include <string.h>
LifeArrayPtr NewLifeArray(long columnSize, long rowSize);
LifeArrayPtr NewLifeArray(long columnSize, long rowSize)
{LifeArrayPtr out=NULL;
if ((columnSize>0)&&(rowSize>0)) {
long rowBytes=columnSize+2;
long allocation=sizeof(LifeArrayStruct)+rowBytes*(rowSize+2)*sizeof(Byte);
out=malloc(allocation);
if (out) {
memset(out, 0, allocation);
out->rowBytes=rowBytes;
out->columnSize=columnSize;
out->rowSize=rowSize;
out->array=&out->buffer[1+out->rowBytes];
}
}
return out;
}
void propagatelife(LifeArrayPtr out, LifeArrayPtr in); /* assumed to be equal in size */
void propagatelife(LifeArrayPtr out, LifeArrayPtr in)
{/* Propagates life according to rules by J. Conway in 1970 */
if (out&&in) {
long row=in->rowSize;
while (row--) {
long column=in->columnSize;
Byte *outRow=&out->array[row*out->rowBytes];
Byte *inRow=&in->array[row*in->rowBytes];
/* Hey you! Vectorize this! */
while (column--) {
long neighborCount=(inRow[column-1]?1:0)
+(inRow[column+1]?1:0);
neighborCount+=(inRow[column+in->rowBytes-1]?1:0)
+(inRow[column+in->rowBytes]?1:0)
+(inRow[column+in->rowBytes+1]?1:0);
neighborCount+=(inRow[column-1-in->rowBytes]?1:0)
+(inRow[column-in->rowBytes]?1:0)
+(inRow[column+1-in->rowBytes]?1:0);
switch (neighborCount) {
default:
case 0: case 1:
/* not enough */
case 4: case 5: case 6: case 7: case 8: case 9:
/* too much */
outRow[column]=0; /* die */
break;
case 2: if (!(inRow[column])) {
outRow[column]=0; /* stay dead */
break;
}
case 3: /* just right */
if (!(outRow[column]=inRow[column]+1))
outRow[column]=0xff; /* live */
break;
}
}
}
}
}
#pragma mark - Parallel code follows
#include "mpi.h" /* MPI library */
int ppinit(int *idproc, int *nproc);
void ppexit(void);
void maintainboundaryconditions(LifeArrayPtr in, int idproc, int nproc);
void maintainboundaryconditions(LifeArrayPtr in, int idproc, int nproc)
{/* Maintains periodic boundary conditions */
if (in) {
long row=in->rowSize;
/* copy ends of rows */
while (row--) {
Byte *rowP=&in->array[row*in->rowBytes];
rowP[-1]=rowP[in->columnSize-1];
rowP[in->columnSize]=rowP[0];
}
/* copy first and last rows */
{long column=in->columnSize+2;
Byte *firstRow=&in->array[-1];
Byte *lastRow=&in->array[(in->rowSize-1)*in->rowBytes-1];
if (nproc) {int ierr; MPI_Status status;
MPI_Request leftReq, rightReq;
int rightIDProc=idproc+1;
int leftIDProc=idproc-1;
if (rightIDProc>=nproc) rightIDProc=0;
if (leftIDProc<0) leftIDProc=nproc-1;
ierr=MPI_Irecv(&firstRow[-in->rowBytes], sizeof(Byte)*column,
MPI_BYTE, leftIDProc, leftIDProc,
MPI_COMM_WORLD, &leftReq);
ierr=MPI_Irecv(&lastRow[+in->rowBytes], sizeof(Byte)*column,
MPI_BYTE, rightIDProc, rightIDProc,
MPI_COMM_WORLD, &rightReq);
ierr=MPI_Send(lastRow, sizeof(Byte)*column, MPI_BYTE,
rightIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Send(firstRow, sizeof(Byte)*column, MPI_BYTE,
leftIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Wait(&leftReq, &status);
ierr=MPI_Wait(&rightReq, &status);
}
else
while (column--) {
firstRow[column-in->rowBytes]=lastRow[column];
lastRow[column+in->rowBytes]=firstRow[column];
}
}
}
}
#pragma mark - most of the parallel code is above
long fprintresult(FILE *stream, LifeArrayPtr in);
long fprintresult(FILE *stream, LifeArrayPtr in)
{long out=0;
if (in) {/* fprintf the array */
long column;
for (column=0; column<in->columnSize; column++) {
long row;
for (row=0; row<in->rowSize; row++) {
char c=in->array[in->rowBytes*row+column];
switch (c) {
case 0: c=' '; break;
case 2: case 3: case 4: case 5:
case 6: case 7: case 8: case 9: case 10:
out++;
case 1:
out++;
c+='0'-1; break;
default: c='*'; break;
}
if (stream) fprintf(stream, "%c", c);
}
if (stream) fprintf(stream, "\n");
}
}
return out>>1;
}
void saveresult(char *filename, LifeArrayPtr in);
void saveresult(char *filename, LifeArrayPtr in)
{/* save the array into a file */
if (filename) {
FILE *fp=fopen(filename, "a+");
if (fp) {
fprintresult(fp, in);
fclose(fp);
}
}
}
long genArray[]={0x0247, 0x07d9, 0x09be, 0x0fb9, 0x17d9, 0x1f4e, 0x1fdc, 0x27df, 0x8957};
void addmaterial(LifeArrayPtr in);
void addmaterial(LifeArrayPtr in)
{/* when called add material to the life array */
if (in) {
long rowstart=(rand()%(in->rowSize-6));
long colstart=(rand()%(in->columnSize-6));
if (rowstart<0) rowstart=-rowstart;
if (colstart<0) colstart=-colstart;
rowstart++;
colstart++;
{
long j=4, s=(rand()&3)?0:(rand()%(sizeof(genArray)/sizeof(long)));
if (s<0) s=0;
s=genArray[s];
printf("adding material %x\n", s);
sleep(1);
if (rand()&1) {/* flip */
s=((s&0xf000L)>>12)|
((s&0x0f00L)>>4)|
((s&0x0f0L)<<4)|
((s&0x0fL)<<12);
}
if (rand()&1) {/* flip */
s=((s&0x8888L)>>3)|
((s&0x4444L)>>1)|
((s&0x2222L)<<1)|
((s&0x1111L)<<3);
}
while (j--) {
long i=4;
while (i--) {
in->array[i+colstart+(rowstart+j)*in->rowBytes]|=
(s&1)<<2;
s>>=1;
}
}
}
}
}
#define RowDimension 80
#define ColumnDimension 22
void performcalculation();
void performcalculation()
{
/* allocate */
LifeArrayPtr arrayA, arrayB;
arrayA=NewLifeArray(ColumnDimension, RowDimension);
arrayB=NewLifeArray(ColumnDimension, RowDimension);
if (arrayA&&arrayB) {
long frameCount=0;
long countdown=8;
int nproc=0, idproc=0, ierr;
/* set initial conditions */
{long row=RowDimension;
while (row--) {
Byte *rowArray=&arrayA->array[row*arrayA->rowBytes];
long column=ColumnDimension;
while (column--) {
rowArray[column]=rand()&1;
}
}
}
fprintresult(stdout, arrayA);
ierr=ppinit(&idproc, &nproc);
if (ierr) {
printf(
"MPI_Init has returned error %d. Assuming single-processor... \n", ierr);
nproc=idproc=0;
}
while (frameCount<5000) {
LifeArrayPtr lastArray=frameCount&1?arrayB:arrayA;
LifeArrayPtr nextArray=frameCount&1?arrayA:arrayB;
long activecellcount;
maintainboundaryconditions(lastArray, idproc, nproc);
propagatelife(nextArray, lastArray);
system("clear");
printf("\fFrame %d idproc=%d\n", frameCount, idproc);
activecellcount=fprintresult(stdout, nextArray);
fractionalsleep(0.125);
if (activecellcount*100<(ColumnDimension*RowDimension)) {
if (!countdown) {
addmaterial(nextArray);
countdown=16;
}
else
countdown--;
}
else
countdown=32;
frameCount++;
}
if (nproc)
ppexit();
}
/* deallocate */
if (arrayA) free(arrayA);
if (arrayB) free(arrayB);
}
int main(void)
{
{
long throwoutcount=time(NULL)&0x3ffL;
printf(
"Initializing random number generator by throwing out %d random numbers...\n",
throwoutcount);
while (throwoutcount--) rand();
}
printf("Beginning calculation... (first random is %d)\n", rand());
performcalculation();
return 0;
}
#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
#include <SIOUX.h>
#endif
int ppinit(int *idproc, int *nproc)
{
/* this subroutine initializes parallel processing
idproc = processor id
nproc = number of real or virtual processors obtained */
int ierr;
*nproc=0;
/* initialize the MPI execution environment */
ierr = MPI_Init(NULL, NULL); /* NULL arguments are permitted since MPI-2 */
if (!ierr) {
/* determine the rank of the calling process in the communicator */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, idproc);
/* determine the size of the group associated with the communicator */
ierr = MPI_Comm_size(MPI_COMM_WORLD, nproc);
#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
SIOUXSettings.asktosaveonclose=0;
SIOUXSettings.autocloseonquit=1;
#endif
}
return ierr;
}
void ppexit(void)
{
/* this subroutine terminates parallel processing */
int ierr;
/* waits until everybody gets here */
ierr = MPI_Barrier(MPI_COMM_WORLD);
/* terminate MPI execution environment */
ierr = MPI_Finalize();
}
The changes from life to parallellife:
- mpi.h - the header file for the MPI library, required to access information about the parallel system and perform communication
- idproc, nproc -
nproc describes how many processors are currently running this job and
idproc identifies the designation, labeled from 0 to nproc - 1, of this processor.
This information is sufficient to identify exactly which part of the problem this instance of the executable should work on.
In this case, these variables are supplied to maintainboundaryconditions by performcalculation.
- leftIDProc and rightIDProc -
these variables hold the processor IDs of
this processor's left and right neighbors, used for communication inside the iteration loop.
- MPI_Irecv, MPI_Send, and MPI_Wait -
These elemental MPI calls perform the communication linking the trailing edge of one processor's data
with the leading edge of the next processor's data.
leftIDProc and rightIDProc identify to and
from which processors the messages are sent and received.
The arrays are one cell larger in each direction
than the space for which this processor is actually responsible.
These extra cells are used as "guard" cells.
The guard cells hold a duplicate of the closest cells in the neighboring partition, making
it possible to reuse propagatelife without modification.
maintainboundaryconditions needs to perform data swaps with both of its neighbors.
It first calls MPI_Irecv on guard cells at both edges of the partition.
Then it sends the edges of its space to its neighbors using MPI_Send,
once for each edge.
That fills in the neighbors' guard cells, and the neighbors do the same, filling this processor's
guard cells.
MPI_Irecv specifies from which processor to expect the message and
where the message is to be written when it arrives, but the message pass
is (most likely) not yet complete. Control is returned to the caller immediately (hence the "I"), allowing the caller
to do other work while the message is being handled. This behavior is called asynchronous message passing.
If we had called MPI_Recv here, this code would deadlock, because every processor would be waiting on
a receive while no processor had yet sent a message.
We choose MPI_Irecv instead of MPI_Recv because we want to send messages to other
processors at the same time as we are receiving messages from them.
Because we know that every other processor is doing the same thing with its neighbors,
it is wise to make it possible for our neighbor to complete its send while this processor is itself
sending a message. This processor then calls MPI_Wait, passing the corresponding MPI_Request structures
produced by the earlier MPI_Irecv calls, to wait for the receives to complete.
One should always confirm that asynchronous messages have completed using MPI_Wait or its relative, MPI_Test.
- MPI_COMM_WORLD -
MPI defines communicators that specify a set of processors that can communicate with each other.
At initialization, one communicator, MPI_COMM_WORLD, covers all the processors in the system. Other MPI calls
can define arbitrary subsets of MPI_COMM_WORLD, making it possible to confine a code to a particular processor subset
just by passing it the appropriate communicator (a brief sketch follows this list).
In simpler cases such as this, using MPI_COMM_WORLD is sufficient.
- ppinit -
performs initialization of the parallel computing environment.
Part of the boilerplate for MPI codes, this routine calls:
- MPI_Init - performs the actual initialization of MPI. It returns the error code; zero means no error.
- MPI_Comm_size - accesses the processor count of the parallel system
- MPI_Comm_rank - accesses the identification number of this particular processor
- ppexit - calling MPI_Finalize terminates
the parallel computing environment. Also part of the boilerplate of MPI codes, a call to
MPI_Finalize is needed to properly clean up and close the connections between codes running on the other
processors and release control. The call to
MPI_Barrier is optional; it is included here so that all the processors synchronize before shutting down
the parallel environment.
- __MWERKS__ -
these #ifdef blocks merely compensate for a peculiarity of the Metrowerks CodeWarrior compiler.
These blocks allow the executables on the other processors to terminate automatically.
If you are using another compiler, you don't have to worry about this code.
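As noted in the MPI_COMM_WORLD item above, other MPI calls can carve subsets out of MPI_COMM_WORLD. parallellife does not need that, but a minimal sketch of how one might do it with MPI_Comm_split, purely for illustration and not part of this code, looks like this:
/* Hypothetical example, not used by parallellife: split MPI_COMM_WORLD into
   two halves and find this processor's rank and size within its half. */
MPI_Comm half;
int halfid, halfproc;
int color = (idproc < nproc/2) ? 0 : 1;  /* which half this processor joins */
MPI_Comm_split(MPI_COMM_WORLD, color, idproc, &half);
MPI_Comm_rank(half, &halfid);
MPI_Comm_size(half, &halfproc);
/* ... communication that passes the communicator "half" stays within that half ... */
MPI_Comm_free(&half);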
Note that the routine that performs the actual propagation computation, propagatelife,
was left unaltered.
That was possible because
each processor was supplied just enough of the problem space to fulfill its responsibility, while
maintainboundaryconditions maintained the interaction between processors.
performcalculation's changes merely perform the boilerplate work of setting up and tearing down the
parallel environment.
When this code, with its ColumnDimension constant set to 27,
is run in parallel on three processors, the code produces output like this:
Processor 0 output:
Frame 1199 idproc=0
0100
** 4032
** 0
10
011
1 0 43 300
2 05 1
10 0 0
1 15 240
02 0 0 0
10 01 1 2 62
202 14 1 0
48 1 1
79 0
0
3
Processor 1 output:
Frame 1199 idproc=1
000
11210
1
310 0
0 2
3
57
44
0
3 050 0
20 121
010 0 4 1
010 14 20
70 02 31
0 1 3
0 121
000
24 31
2 0
01 1
0
Processor 2 output:
Frame 1199 idproc=2
020 1 **
10 030 **
01
1
0
* 0
01 1 0 * 0
02 1 12
00 1 1 0
1 00 1 2
200 0 0
2
01 30
0
030
160 1
We can see that the space is sliced into three parts, a consequence of the partitioning of the problem space
between these three processors.
By design, cells at the edge of one partition and those at the neighboring edge of the next
influence each other as if they were part of the same problem space.
It is very important to realize that no one processor ever has a complete picture of the problem:
each processor has information only about its own space.
At each time step, all processors display the data they have.
Because this is a tutorial, the message-passing calls in the
maintainboundaryconditions
routine were written to be as readable as possible, rather than
fully optimized.
You might notice that the order of the message passing will generally have every processor
send to its right and receive from its left, then send to its left and receive from its right.
This message-passing pattern isn't necessarily optimal for typical networking hardware.
Because many Ethernet switches support full duplex, meaning that they can send and receive simultaneously
between pairs of machines,
it would be better if the even-numbered processors swapped data with their odd-numbered neighbor on one
side, then the other.
We can do this by replacing the message-passing code with the code in Listing 3.
Listing 3 - an optimized alternative to the message passing code in the maintainboundaryconditions routine
{long i;
for(i=2; i--; )
if (i^(idproc&1)) {
ierr=MPI_Irecv(&firstRow[-in->rowBytes], sizeof(Byte)*column,
MPI_BYTE, leftIDProc, leftIDProc,
MPI_COMM_WORLD, &leftReq);
ierr=MPI_Send(firstRow, sizeof(Byte)*column, MPI_BYTE,
leftIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Wait(&leftReq, &status);
}
else {
ierr=MPI_Irecv(&lastRow[+in->rowBytes], sizeof(Byte)*column,
MPI_BYTE, rightIDProc, rightIDProc,
MPI_COMM_WORLD, &rightReq);
ierr=MPI_Send(lastRow, sizeof(Byte)*column, MPI_BYTE,
rightIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Wait(&rightReq, &status);
}
}
The loop executes the if statement twice; on the first pass the even-numbered processors take the top half
while the odd-numbered processors take the bottom half, and on the second pass they switch.
This version is not necessary for code correctness,
but it takes advantage of typical features of networking hardware.
This is an optimization that applies when processors are simultaneously and bidirectionally swapping data.
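Another way to express this kind of paired, bidirectional exchange is MPI_Sendrecv, which hands the matching send and receive to MPI as a single call and lets the implementation schedule them together. The following sketch is a variation on the exchange in maintainboundaryconditions, not the code used in Listing 2 or Listing 3; it reuses the variables already declared in that routine.
/* Send this partition's last row to the right neighbor while receiving the
   left neighbor's last row into the guard row before our first row. */
ierr=MPI_Sendrecv(lastRow, sizeof(Byte)*column, MPI_BYTE, rightIDProc, idproc,
                  &firstRow[-in->rowBytes], sizeof(Byte)*column, MPI_BYTE,
                  leftIDProc, leftIDProc, MPI_COMM_WORLD, &status);
/* Send this partition's first row to the left neighbor while receiving the
   right neighbor's first row into the guard row after our last row. */
ierr=MPI_Sendrecv(firstRow, sizeof(Byte)*column, MPI_BYTE, leftIDProc, idproc,
                  &lastRow[+in->rowBytes], sizeof(Byte)*column, MPI_BYTE,
                  rightIDProc, rightIDProc, MPI_COMM_WORLD, &status);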
The purpose of this discussion was to highlight as clearly as possible what changes are needed
to parallelize an application whose problem has inherent local interactions.
We chose a simple two-dimensional problem of this type so attention could be drawn to the parallelization and message passing.
Components of your code may
correspond to the routines in the life code,
so this parallelization example could suggest how to parallelize your code.
This example can be a starting point
for experimenting with parallel computing.
For example, one could modify the above parallel code with
different initial conditions, or introduce live cells at a later point in the computation.
One could display time steps in the calculation using something more elaborate than text, up to a
complete GUI. One could even enlarge the system and represent cells as individual pixels, which
would inherently make the problem scale better because the computation time would increase relative to the
communication time.
Different propagation rules could be applied, including ones that would require more guard cells between partitions.
One could even make this code interactive, allowing a user or users to click on locations where new seeds would
dynamically be introduced.
These fancier features are not shown here because they would distract from the purpose of this discussion,
but the reader is welcome to explore the possibilities.
References
Using MPI by William Gropp, Ewing Lusk and Anthony Skjellum
is a good reference for the MPI library. Many of the techniques and
style of the above parallel code were adopted from Dr. Viktor K. Decyk.
He expresses these ideas in a few references, including:
- V. K. Decyk, "How to Write (Nearly) Portable Fortran Programs for Parallel Computers", Computers In Physics,
7, p. 418 (1993).
- V. K. Decyk, "Skeleton PIC Codes for Parallel Computers", Computer Physics Communications,
87, p. 87 (1995).
Finally, the Game of Life has been extensively researched since it was first created in 1970 by John Horton Conway,
and many references about it are available.
You're welcome to read more about parallel computing
via the Tutorials page.