Parallelization - Parallel Life
We describe how we would convert a single-processor code calculating a
two-dimensional, propagation-style problem
into a parallel code.
This example demonstrates how one might handle a system whose partitions are not independent of
one another, but exhibit localized dependencies.
An approach that simply cuts up the problem into isolated pieces is
insufficient because of the internal interactions
inherent in the system.
Nonetheless, the work of the code can be partitioned between the processors with their
dependencies
handled efficiently.
This technique applies to a variety of possible codes. Problems that can be translated into
finite difference equations, which include the diffusion equation, wave propagation, fluid dynamics, or
other problems described by partial differential equations, have an almost direct correspondence to the
components of this code.
In the two-dimensional graphics space, filters such as Gaussian blur, sharpen, or other convolution-style
algorithms also can be thought of as
a propagation of information from one pixel to its neighbors, that is, a local interaction.
Even when only portions of a code resemble these problems, this technique can still apply.
Other more complex codes such as particle-based simulations, including the particle push of most plasma PIC codes,
molecular dynamics codes,
or
astronomical or other N-body simulations, could benefit from the parallelization approach demonstrated here.
This example is like the previous
Parallel Pascal's Triangle
except that this problem is two dimensional and the interprocess communication is bidirectional at
each time step.
Finally, if you are learning about parallel code,
the following builds on the previous example by showing how to combine partitioning with elemental message passing.
Before: Single-Processor Game of Life
The Game of Life is an example of a cellular automaton introduced by John Conway, Professor of Mathematics at Princeton.
The space is divided into a square grid, and each grid square holds a "cell" that is either alive or dead.
The life and death of these cells evolves from one time step to the next
based on the previous state of that cell and its neighbors.
The result is a completely deterministic, dynamic system of cells that, at least
superficially, mimics the appearance of cellular life in a petri dish.
The rules of this "Game" are very simple. Each cell has eight neighbors, four of them one
square away (like a rook move)
and four more one diagonal
step away (like a bishop move).
If three of these neighbors are alive, then
the cell in the center will come to life in the next time step regardless of its state in the previous one.
If two neighbors are alive, life in the center cell will survive only if it was already alive.
In all other cases, the cell will be dead in the next step, whether from overpopulation (four or more live neighbors),
loneliness (fewer than two live neighbors), or simply because a dead cell with two live neighbors stays dead.
These rules, chosen by Professor Conway,
result in an interesting evolution.
For example: 5 live neighbors => the cell dies (overpopulation).
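The rules reduce to a small decision per cell. As a minimal sketch, using a hypothetical helper, nextstate, that is not part of the life code itself:
/* Hypothetical helper, not in life.c: Conway's rules for a single cell.
   alive is nonzero if the cell is currently alive; neighborCount is the number
   of live cells among its eight neighbors.  Returns nonzero if the cell is
   alive in the next time step. */
int nextstate(int alive, int neighborCount)
{
	if (neighborCount==3) return 1;     /* birth, or survival */
	if (neighborCount==2) return alive; /* survives only if already alive */
	return 0;                           /* loneliness or overpopulation */
}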
The single-processor version of this Game of Life code, life,
performs a direct computation of Conway's Life rules on an allocated two-dimensional array.
Its algorithm is as follows: Begin with two equal-sized arrays, one for the current time step and one for the next.
For each cell, count how many of its eight neighbors are alive, then apply the Life rules and save the result in the
array for the next time step.
Next, maintain any boundary conditions as necessary (in this case periodic), and then swap the arrays.
Continue for the desired number of time steps.
In fact, this code allocates its arrays one cell larger in each of the four directions.
This makes the routine that counts neighboring live cells easier to write because
it does not have to check whether a neighbor is outside the array bounds.
The extra cells are maintained by the boundary condition routine, which, in this case,
maintains periodic boundary conditions by duplicating the first and last rows and columns into the
extra parts of the array.
Factoring the boundary condition code is a good practice because it makes it easy to
adjust the boundary conditions to, say, a vacuum (all dead cells) or
perhaps a driving function.
Separating the tasks of the code into distinct purposes also lends itself to
easy experimentation
with different initial conditions,
or different cellular rules,
or mixed boundary conditions.
In fact, this code detects if the life array has become stagnant and introduces
patterns that can cause interesting cellular reactions.
This code is organized into three primary routines plus main:
- propagatelife - counts the life in neighboring cells and applies the Life rules,
generating output in a second array. Swapping references to a pair of arrays on alternating calls to
this routine obviates potentially costly memory moves or copies (see the sketch after this list).
- maintainboundaryconditions - maintains the boundary conditions of the
sample of life by writing the extra space in the life array based on the state of
the internal parts of the array. In this implementation, it performs a simple
copy of the last row to the array just before the first row and a copy of the first row
to the array just after the last row. It copies the first and last columns similarly.
- performcalculation - allocates memory, initializes and seeds the system,
calls maintainboundaryconditions and propagatelife, and
displays the ongoing state of the system.
If its test for a stagnant situation returns true, it introduces new material.
Once the system has finished propagating, it deallocates memory and returns.
- main - initializes the random number generator and calls performcalculation
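The swap mentioned in the propagatelife item is nothing more than a pair of pointer selections in performcalculation; the relevant lines from Listing 1 below, with comments added, are:
/* on even frames arrayA holds the current state and arrayB receives the next;
   on odd frames the roles reverse, so no cell data is ever copied */
LifeArrayPtr lastArray=frameCount&1?arrayB:arrayA;
LifeArrayPtr nextArray=frameCount&1?arrayA:arrayB;
maintainboundaryconditions(lastArray);
propogatelife(nextArray, lastArray); /* writes the next time step into nextArray */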
While the above routines perform the primary work of the problem,
additional routines handle secondary functions.
These include
allocating and formatting the appropriate memory structures (NewLifeArray)
and printing the state of the system (fprintresult).
As a side effect, fprintresult
returns a count of the total number of active cells in the array,
which is used to determine if new life should be triggered
using addmaterial.
A side effect of propagatelife
is that it increments an integer in each live cell of the array to
indicate how long that cell has been alive.
This auxiliary data is used by fprintresult
for its display and its active cell calculation.
fractionalsleep, like sleep, will
have the process give up time for the specified number of seconds, but
fractionalsleep accepts fractional seconds, allowing us to
slow down the calculation to several frames per second.
The C source is shown in Listing 1.
Listing 1 - life.c
#include <stdio.h> /* standard I/O routines */
#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */
#include <time.h> /* time libraries */
#include <unistd.h> /* sleep() */
void fractionalsleep(double sleepRequest);
void fractionalsleep(double sleepRequest)
{/* like sleep(), except that it accepts fractions of a second */
if (sleepRequest>0) {
struct timespec request;
double intSleep;
request.tv_nsec=(1e9*modf(sleepRequest, &intSleep));
request.tv_sec=(long)intSleep;
nanosleep(&request, NULL);
}
}
#ifndef Byte
#define Byte unsigned char
#endif
/* Defines array structure and access */
typedef struct LifeArrayStruct {
Byte *array;
long rowBytes; /* access with array[col+rowBytes*row] */
long columnSize; /* <= rowBytes-2 */
long rowSize;
Byte buffer[4];
/* allocation must be at least sizeof(Byte)*(rowSize+2)*(columnSize+2) */
} LifeArrayStruct, *LifeArrayPtr, **LifeArrayHandle;
#include <string.h>
LifeArrayPtr NewLifeArray(long columnSize, long rowSize);
LifeArrayPtr NewLifeArray(long columnSize, long rowSize)
{LifeArrayPtr out=NULL;
if ((columnSize>0)&&(rowSize>0)) {
long rowBytes=columnSize+2;
long allocation=sizeof(LifeArrayStruct)+rowBytes*(rowSize+2)*sizeof(Byte);
out=malloc(allocation);
if (out) {
memset(out, 0, allocation);
out->rowBytes=rowBytes;
out->columnSize=columnSize;
out->rowSize=rowSize;
out->array=&out->buffer[1+out->rowBytes];
}
}
return out;
}
void propogatelife(LifeArrayPtr out, LifeArrayPtr in); /* assumed to be equal in size */
void propogatelife(LifeArrayPtr out, LifeArrayPtr in)
{/* Propogates life according to rules by J. Conway in 1970 */
if (out&&in) {
long row=in->rowSize;
while (row--) {
long column=in->columnSize;
Byte *outRow=&out->array[row*out->rowBytes];
Byte *inRow=&in->array[row*in->rowBytes];
/* Hey you! Vectorize this! */
while (column--) {
long neighborCount=(inRow[column-1]?1:0)
+(inRow[column+1]?1:0);
neighborCount+=(inRow[column+in->rowBytes-1]?1:0)
+(inRow[column+in->rowBytes]?1:0)
+(inRow[column+in->rowBytes+1]?1:0);
neighborCount+=(inRow[column-1-in->rowBytes]?1:0)
+(inRow[column-in->rowBytes]?1:0)
+(inRow[column+1-in->rowBytes]?1:0);
switch (neighborCount) {
default:
case 0: case 1:
/* not enough */
case 4: case 5: case 6: case 7: case 8: case 9:
/* too much */
outRow[column]=0; /* die */
break;
case 2: if (!(inRow[column])) {
outRow[column]=0; /* stay dead */
break;
}
case 3: /* just right */
if (!(outRow[column]=inRow[column]+1))
outRow[column]=0xff; /* live */
break;
}
}
}
}
}
void maintainboundaryconditions(LifeArrayPtr in);
void maintainboundaryconditions(LifeArrayPtr in)
{/* Maintains periodic boundary conditions */
if (in) {
long row=in->rowSize;
/* copy ends of rows */
while (row--) {
Byte *rowP=&in->array[row*in->rowBytes];
rowP[-1]=rowP[in->columnSize-1];
rowP[in->columnSize]=rowP[0];
}
/* copy first and last rows */
{long column=in->columnSize+2;
Byte *firstRow=&in->array[-1];
Byte *lastRow=&in->array[(in->rowSize-1)*in->rowBytes-1];
while (column--) {
firstRow[column-in->rowBytes]=lastRow[column];
lastRow[column+in->rowBytes]=firstRow[column];
}
}
}
}
long fprintresult(FILE *stream, LifeArrayPtr in);
long fprintresult(FILE *stream, LifeArrayPtr in)
{long out=0;
if (in) {/* fprintf the array */
long column;
for (column=0; column<in->columnSize; column++) {
long row;
for (row=0; row<in->rowSize; row++) {
char c=in->array[in->rowBytes*row+column];
switch (c) {
case 0: c=' '; break;
case 2: case 3: case 4: case 5:
case 6: case 7: case 8: case 9: case 10:
out++;
case 1:
out++;
c+='0'-1; break;
default: c='*'; break;
}
if (stream) fprintf(stream, "%c", c);
}
if (stream) fprintf(stream, "\n");
}
}
return out>>1;
}
void saveresult(char *filename, LifeArrayPtr in);
void saveresult(char *filename, LifeArrayPtr in)
{/* save the array into a file */
if (filename) {
FILE *fp=fopen(filename, "a+");
if (fp) {
fprintresult(fp, in);
fclose(fp);
}
}
}
long genArray[]={0x0247, 0x07d9, 0x09be, 0x0fb9, 0x17d9, 0x1f4e, 0x1fdc, 0x27df, 0x8957};
void addmaterial(LifeArrayPtr in);
void addmaterial(LifeArrayPtr in)
{/* when called add material to the life array */
if (in) {
long rowstart=(rand()%(in->rowSize-6));
long colstart=(rand()%(in->columnSize-6));
if (rowstart<0) rowstart=-rowstart;
if (colstart<0) colstart=-colstart;
rowstart++;
colstart++;
{
long j=4, s=(rand()&3)?0:(rand()%(sizeof(genArray)/sizeof(long)));
if (s<0) s=0;
s=genArray[s];
printf("adding material %x\n", s); sleep(1); /**/
if (rand()&1) {/* flip */
s=((s&0xf000L)>>12)|
((s&0x0f00L)>>4)|
((s&0x0f0L)<<4)|
((s&0x0fL)<<12);
}
if (rand()&1) {/* flip */
s=((s&0x8888L)>>3)|
((s&0x4444L)>>1)|
((s&0x2222L)<<1)|
((s&0x1111L)<<3);
}
while (j--) {
long i=4;
while (i--) {
in->array[i+colstart+(rowstart+j)*in->rowBytes]|=
(s&1)<<2;
s>>=1;
}
}
}
}
}
#define RowDimension 80
#define ColumnDimension 22
void performcalculation();
void performcalculation()
{
/* allocate */
LifeArrayPtr arrayA, arrayB;
arrayA=NewLifeArray(ColumnDimension, RowDimension);
arrayB=NewLifeArray(ColumnDimension, RowDimension);
if (arrayA&&arrayB) {
long frameCount=0;
long countdown=8;
/* set initial conditions */
{long row=RowDimension;
while (row--) {
Byte *rowArray=&arrayA->array[row*arrayA->rowBytes];
long column=ColumnDimension;
while (column--) {
rowArray[column]=rand()&1;
}
}
}
fprintresult(stdout, arrayA);
while (frameCount<5000) {
LifeArrayPtr lastArray=frameCount&1?arrayB:arrayA;
LifeArrayPtr nextArray=frameCount&1?arrayA:arrayB;
long activecellcount;
maintainboundaryconditions(lastArray);
propogatelife(nextArray, lastArray);
system("clear");
printf("\fFrame %d \n", frameCount);
activecellcount=fprintresult(stdout, nextArray);
fractionalsleep(0.125);
if (activecellcount*100<(ColumnDimension*RowDimension)) {
if (!countdown) {
addmaterial(nextArray);
countdown=16;
}
else
countdown--;
}
else
countdown=32;
frameCount++;
}
}
/* deallocate */
if (arrayA) free(arrayA);
if (arrayB) free(arrayB);
}
int main(void)
{
{
long throwoutcount=time(NULL)&0x3ffL;
printf(
"Initializing random number generator by throwing out %ld random numbers...\n",
throwoutcount);
while (throwoutcount--) rand();
}
printf("Beginning calculation... (first random is %d)\n", rand());
performcalculation();
return 0;
}
When this code is run with its default array size of 80 by 22, the code produces a series of
frames displaying the evolution of life-like behavior.
For example:
Frame 148 000 1 0
0 2
030 0 10 **
01 1 32 * *
0001 2 1 1 *
11 0 25
03 13
3 0
010 0 21 2
01 001 2 30 1 **
012 0001 10 ** **
*0 10240 * *
0 01 * *
21 04 1 * *
0 0 * *
0 0 00 0 1 * * *
010 1111 1 1 0 * * **
0 0 500005 1 * * *
** * * 60 01 0 **
* 0 0 07 4 0 21 *
* 1 4 6 4 0 0*0 *
* 0*0 23 421 1 1 0
Each cell is represented as a printed character. The life calculation depends only on whether or not a cell has life.
The numbers, while having no effect on the life calculation, indicate how long ago that cell came to life.
The asterisks are cells that have persisted for more than nine time steps.
By design, the extra cells in the array are not displayed, but their effect is evident in how some structures wrap
around the space to the other side.
Note the shapes the cells form. The shape of a structure indicates both its history and its future, but
interactions between structures occur only when one structure is a cell away from another. The study and
evolution of these structures is intriguing work but is outside the scope of this tutorial.
After: parallellife
Choosing the appropriate partitioning of the problem is key to parallelizing the application.
The decision made here is to evenly divide the problem space into one block per processor.
An important detail is that the trailing edge of one block will influence and be
influenced by the leading edge of the next,
a result of the nature of the interdependence in this system: interactions are localized.
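In this code the division is implicit: every processor allocates and evolves a block of the same size, and only the cells along the edges where blocks adjoin are shared. Because it is the first and last rows that are exchanged, the blocks adjoin along the row direction; if one ever needed a block's global coordinates, a hypothetical helper (not part of parallellife.c) could compute them from the processor id alone:
/* Hypothetical helper, not in parallellife.c: the global index of the first
   row owned by processor idproc when the space is divided evenly into one
   block of rowsPerProcessor rows per processor. */
long firstglobalrow(int idproc, long rowsPerProcessor)
{
	return (long)idproc*rowsPerProcessor;
}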
After addressing "boilerplate" issues such as processor count and identification data,
the parallelization process focuses on how and where to propagate interactions between processors.
In fact the total changes to this code are remarkably few.
Changes to performcalculation are limited to "boilerplate" code that starts up and shuts down the parallel environment
and passes just enough information to maintainboundaryconditions for it to
sustain the overall calculation.
In fact, maintainboundaryconditions is the only routine that has message passing code!
It calculates the identification numbers of its immediate neighbors, then exchanges data with those neighboring
processors. The message passing is embodied in the
boundary condition routine because that's exactly how we are using message passing: the
boundary between processors is maintained using messages from the neighboring space.
Before, this routine maintained periodic boundary conditions; now
it maintains the boundaries of this processor's space using data from its neighbors.
This is the message passing necessary to perform the calculation.
The executable running on each processor calculates how it will communicate on its own,
automatically minimizing bottlenecks.
There is no central
authority delegating assignments or directing data transfer.
Adding message passing was convenient in part because we already had arrays one cell larger in each direction.
Before, we were already manipulating the edges of the problem space to maintain boundary conditions.
Now we are using those extra cells to propagate data between
edges of the problem space assigned to each processor.
maintainboundaryconditions, executed almost simultaneously on every processor,
exchanges the information the processors need from one another to perform this interaction.
These extra cells are also called "guard cells".
The guard cells are there to allow the computation code,
propagatelife, to remain unchanged by acting on a slightly
larger array.
Each guard cell is extra room for a duplicate of the closest cell in the neighboring processor's partition.
Interprocessor communication between the primary computation calls
reduces to maintaining these duplicate elements as needed.
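Concretely, NewLifeArray lays out each block so that interior cell (row, col) is at array[col + row*rowBytes], with rowBytes = columnSize + 2 and one spare cell in every direction; the guard cells are simply row -1, row rowSize, column -1, and column columnSize. Hypothetical helpers, not part of the listings, spell out that addressing:
/* Hypothetical helpers, not in the listings: where the guard cells sit,
   given the layout produced by NewLifeArray. */
Byte *topguardrow(LifeArrayPtr in) {return &in->array[-in->rowBytes];} /* the row above row 0 */
Byte *bottomguardrow(LifeArrayPtr in) {return &in->array[in->rowSize*in->rowBytes];} /* the row below the last row */
Byte *leftguardcell(LifeArrayPtr in, long row) {return &in->array[row*in->rowBytes-1];} /* just left of column 0 */
Byte *rightguardcell(LifeArrayPtr in, long row) {return &in->array[row*in->rowBytes+in->columnSize];} /* just right of the last column */
In life.c these locations are filled with copies from the opposite edges of the same array to make the space periodic; in parallellife.c the top and bottom guard rows are instead filled by messages from the neighboring processors.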
At each time step, each processor displays its own portion of the calculation.
The details are in the code.
Listing 2 shows the C source of parallellife, with the changes relative to life.c underlined.
Listing 2 - parallellife.c, with changes compared to life.c underlined
#include <stdio.h> /* standard I/O routines */
#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */
#include <time.h> /* time libraries */
#include <unistd.h> /* sleep() */
void fractionalsleep(double sleepRequest);
void fractionalsleep(double sleepRequest)
{/* like sleep(), except that it accepts fractions of a second */
if (sleepRequest>0) {
struct timespec request;
double intSleep;
request.tv_nsec=(1e9*modf(sleepRequest, &intSleep));
request.tv_sec=(long)intSleep;
nanosleep(&request, NULL);
}
}
#ifndef Byte
#define Byte unsigned char
#endif
/* Defines array structure and access */
typedef struct LifeArrayStruct {
Byte *array;
long rowBytes; /* access with array[col+rowBytes*row] */
long columnSize; /* <= rowBytes-2 */
long rowSize;
Byte buffer[4];
/* allocation must be at least sizeof(Byte)*(rowSize+2)*(columnSize+2) */
} LifeArrayStruct, *LifeArrayPtr, **LifeArrayHandle;
#include <string.h>
LifeArrayPtr NewLifeArray(long columnSize, long rowSize);
LifeArrayPtr NewLifeArray(long columnSize, long rowSize)
{LifeArrayPtr out=NULL;
if ((columnSize>0)&&(rowSize>0)) {
long rowBytes=columnSize+2;
long allocation=sizeof(LifeArrayStruct)+rowBytes*(rowSize+2)*sizeof(Byte);
out=malloc(allocation);
if (out) {
memset(out, 0, allocation);
out->rowBytes=rowBytes;
out->columnSize=columnSize;
out->rowSize=rowSize;
out->array=&out->buffer[1+out->rowBytes];
}
}
return out;
}
void propogatelife(LifeArrayPtr out, LifeArrayPtr in); /* assumed to be equal in size */
void propogatelife(LifeArrayPtr out, LifeArrayPtr in)
{/* Propogates life according to rules by J. Conway in 1970 */
if (out&&in) {
long row=in->rowSize;
while (row--) {
long column=in->columnSize;
Byte *outRow=&out->array[row*out->rowBytes];
Byte *inRow=&in->array[row*in->rowBytes];
/* Hey you! Vectorize this! */
while (column--) {
long neighborCount=(inRow[column-1]?1:0)
+(inRow[column+1]?1:0);
neighborCount+=(inRow[column+in->rowBytes-1]?1:0)
+(inRow[column+in->rowBytes]?1:0)
+(inRow[column+in->rowBytes+1]?1:0);
neighborCount+=(inRow[column-1-in->rowBytes]?1:0)
+(inRow[column-in->rowBytes]?1:0)
+(inRow[column+1-in->rowBytes]?1:0);
switch (neighborCount) {
default:
case 0: case 1:
/* not enough */
case 4: case 5: case 6: case 7: case 8: case 9:
/* too much */
outRow[column]=0; /* die */
break;
case 2: if (!(inRow[column])) {
outRow[column]=0; /* stay dead */
break;
}
case 3: /* just right */
if (!(outRow[column]=inRow[column]+1))
outRow[column]=0xff; /* live */
break;
}
}
}
}
}
#pragma mark - Parallel code follows
#include "mpi.h" /* MPI library */
int ppinit(int *idproc, int *nproc);
void ppexit(void);
void maintainboundaryconditions(LifeArrayPtr in, int idproc, int nproc);
void maintainboundaryconditions(LifeArrayPtr in, int idproc, int nproc)
{/* Maintains periodic boundary conditions */
if (in) {
long row=in->rowSize;
/* copy ends of rows */
while (row--) {
Byte *rowP=&in->array[row*in->rowBytes];
rowP[-1]=rowP[in->columnSize-1];
rowP[in->columnSize]=rowP[0];
}
/* copy first and last rows */
{long column=in->columnSize+2;
Byte *firstRow=&in->array[-1];
Byte *lastRow=&in->array[(in->rowSize-1)*in->rowBytes-1];
if (nproc) {int ierr; MPI_Status status;
MPI_Request leftReq, rightReq;
int rightIDProc=idproc+1;
int leftIDProc=idproc-1;
if (rightIDProc>=nproc) rightIDProc=0;
if (leftIDProc<0) leftIDProc=nproc-1;
ierr=MPI_Irecv(&firstRow[-in->rowBytes], sizeof(Byte)*column,
MPI_BYTE, leftIDProc, leftIDProc,
MPI_COMM_WORLD, &leftReq);
ierr=MPI_Irecv(&lastRow[+in->rowBytes], sizeof(Byte)*column,
MPI_BYTE, rightIDProc, rightIDProc,
MPI_COMM_WORLD, &rightReq);
ierr=MPI_Send(lastRow, sizeof(Byte)*column, MPI_BYTE,
rightIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Send(firstRow, sizeof(Byte)*column, MPI_BYTE,
leftIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Wait(&leftReq, &status);
ierr=MPI_Wait(&rightReq, &status);
}
else
while (column--) {
firstRow[column-in->rowBytes]=lastRow[column];
lastRow[column+in->rowBytes]=firstRow[column];
}
}
}
}
#pragma mark - most of the parallel code is above
long fprintresult(FILE *stream, LifeArrayPtr in);
long fprintresult(FILE *stream, LifeArrayPtr in)
{long out=0;
if (in) {/* fprintf the array */
long column;
for (column=0; column<in->columnSize; column++) {
long row;
for (row=0; row<in->rowSize; row++) {
char c=in->array[in->rowBytes*row+column];
switch (c) {
case 0: c=' '; break;
case 2: case 3: case 4: case 5:
case 6: case 7: case 8: case 9: case 10:
out++;
case 1:
out++;
c+='0'-1; break;
default: c='*'; break;
}
if (stream) fprintf(stream, "%c", c);
}
if (stream) fprintf(stream, "\n");
}
}
return out>>1;
}
void saveresult(char *filename, LifeArrayPtr in);
void saveresult(char *filename, LifeArrayPtr in)
{/* save the array into a file */
if (filename) {
FILE *fp=fopen(filename, "a+");
if (fp) {
fprintresult(fp, in);
fclose(fp);
}
}
}
long genArray[]={0x0247, 0x07d9, 0x09be, 0x0fb9, 0x17d9, 0x1f4e, 0x1fdc, 0x27df, 0x8957};
void addmaterial(LifeArrayPtr in);
void addmaterial(LifeArrayPtr in)
{/* when called add material to the life array */
if (in) {
long rowstart=(rand()%(in->rowSize-6));
long colstart=(rand()%(in->columnSize-6));
if (rowstart<0) rowstart=-rowstart;
if (colstart<0) colstart=-colstart;
rowstart++;
colstart++;
{
long j=4, s=(rand()&3)?0:(rand()%(sizeof(genArray)/sizeof(long)));
if (s<0) s=0;
s=genArray[s];
printf("adding material %x\n", s);
sleep(1);
if (rand()&1) {/* flip */
s=((s&0xf000L)>>12)|
((s&0x0f00L)>>4)|
((s&0x0f0L)<<4)|
((s&0x0fL)<<12);
}
if (rand()&1) {/* flip */
s=((s&0x8888L)>>3)|
((s&0x4444L)>>1)|
((s&0x2222L)<<1)|
((s&0x1111L)<<3);
}
while (j--) {
long i=4;
while (i--) {
in->array[i+colstart+(rowstart+j)*in->rowBytes]|=
(s&1)<<2;
s>>=1;
}
}
}
}
}
#define RowDimension 80
#define ColumnDimension 22
void performcalculation();
void performcalculation()
{
/* allocate */
LifeArrayPtr arrayA, arrayB;
arrayA=NewLifeArray(ColumnDimension, RowDimension);
arrayB=NewLifeArray(ColumnDimension, RowDimension);
if (arrayA&&arrayB) {
long frameCount=0;
long countdown=8;
int nproc=0, idproc=0, ierr;
/* set initial conditions */
{long row=RowDimension;
while (row--) {
Byte *rowArray=&arrayA->array[row*arrayA->rowBytes];
long column=ColumnDimension;
while (column--) {
rowArray[column]=rand()&1;
}
}
}
fprintresult(stdout, arrayA);
ierr=ppinit(&idproc, &nproc);
if (ierr) {
printf(
"MPI_Init has returned error %d. Assuming single-processor... \n", ierr);
nproc=idproc=0;
}
while (frameCount<5000) {
LifeArrayPtr lastArray=frameCount&1?arrayB:arrayA;
LifeArrayPtr nextArray=frameCount&1?arrayA:arrayB;
long activecellcount;
maintainboundaryconditions(lastArray, idproc, nproc);
propogatelife(nextArray, lastArray);
system("clear");
printf("\fFrame %d idproc=%d\n", frameCount, idproc);
activecellcount=fprintresult(stdout, nextArray);
fractionalsleep(0.125);
if (activecellcount*100<(ColumnDimension*RowDimension)) {
if (!countdown) {
addmaterial(nextArray);
countdown=16;
}
else
countdown--;
}
else
countdown=32;
frameCount++;
}
if (nproc)
ppexit();
}
/* deallocate */
if (arrayA) free(arrayA);
if (arrayB) free(arrayB);
}
int main(void)
{
{
long throwoutcount=time(NULL)&0x3ffL;
printf(
"Initializing random number generator by throwing out %ld random numbers...\n",
throwoutcount);
while (throwoutcount--) rand();
}
printf("Beginning calculation... (first random is %d)\n", rand());
performcalculation();
return 0;
}
#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
#include <SIOUX.h>
#endif
int ppinit(int *idproc, int *nproc)
{
/* this subroutine initializes parallel processing
idproc = processor id
nproc = number of real or virtual processors obtained */
int ierr;
*nproc=0;
/* initialize the MPI execution environment; MPI-2 and later allow NULL arguments here */
ierr = MPI_Init(NULL, NULL);
if (!ierr) {
/* determine the rank of the calling process in the communicator */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, idproc);
/* determine the size of the group associated with the communicator */
ierr = MPI_Comm_size(MPI_COMM_WORLD, nproc);
#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
SIOUXSettings.asktosaveonclose=0;
SIOUXSettings.autocloseonquit=1;
#endif
}
return ierr;
}
void ppexit(void)
{
/* this subroutine terminates parallel processing */
int ierr;
/* waits until everybody gets here */
ierr = MPI_Barrier(MPI_COMM_WORLD);
/* terminate MPI execution environment */
ierr = MPI_Finalize();
}
The changes from life to parallellife:
- mpi.h - the header file for the MPI library, required to access information about the parallel system and perform communication
- idproc, nproc -
nproc describes how many processors are currently running this job and
idproc identifies the designation, labeled from 0 to nproc - 1, of this processor.
This information is sufficient to identify exactly which part of the problem this instance of the executable should work on.
In this case, these variables are supplied to maintainboundaryconditions by performcalculation.
- leftIDProc and rightIDProc -
these variables hold the processor IDs of
this processor's left and right neighbors for communication inside the iteration loop.
- MPI_Irecv, MPI_Send, and MPI_Wait -
These elemental MPI calls perform the communication linking the trailing edge of one processor's data
with the leading edge of the next processor's data.
leftIDProc and rightIDProc identify to and
from which processors the messages are sent and received.
The arrays are one cell larger in each direction
than the space for which this processor is actually responsible.
These extra cells are used as "guard" cells.
The guard cells hold a duplicate of the closest cells in the neighboring partition, making
it possible to reuse propagatelife without modification.
maintainboundaryconditions needs to perform data swaps with both of its neighbors.
It first calls MPI_Irecv on guard cells at both edges of the partition.
Then it sends the edges of its space to its neighbors using MPI_Send,
once for each edge.
That fills in the neighbors' guard cells, and the neighbors do the same filling this processor's
guard cells.
MPI_Irecv specifies from what processor to expect the message and
where the message is to be written when it arrives from that other processor, but the message pass
is (most likely) not yet complete. Control is returned to the caller immediately (hence the "I"), allowing the caller
to do other work while the message is being handled. This behavior is called asynchronous message passing.
If we had called MPI_Recv here, this code would deadlock: every processor would wait on
a receive while no processor had yet sent a message.
We choose MPI_Irecv instead of MPI_Recv because we want to send messages to other
processors at the same time as we are receiving messages from them.
Because we know that every other processor is doing the same thing with its neighbors,
it's wise to make it possible for our neighbor to complete its send of the message while this processor is itself
sending a message. This processor then calls MPI_Wait, passing the corresponding MPI_Request structures
produced by the earlier MPI_Irecv's, to wait for each receive to complete.
One should always confirm that asynchronous messages complete using MPI_Wait or its relative, MPI_Test (see the sketch after this list).
- MPI_COMM_WORLD -
MPI defines communicator worlds or communicators that define a set of processors that can communicate with each other.
At initialization, one communicator, MPI_COMM_WORLD, covers all the processors in the system. Other MPI calls
can define arbitrary subsets of MPI_COMM_WORLD, making it possible to confine a code to a particular processor subset
just by passing it the appropriate communicator.
In simpler cases such as this, using MPI_COMM_WORLD is sufficient.
- ppinit -
performs initialization of the parallel computing environment.
Part of the boilerplate for MPI codes, this routine calls:
- MPI_Init - performs the actual initialization of MPI. It returns the error code; zero means no error.
- MPI_Comm_size - accesses the processor count of the parallel system
- MPI_Comm_rank - accesses the identification number of this particular processor
- ppexit - calling MPI_Finalize terminates
the parallel computing environment. Also part of the boilerplate of MPI codes, a call to
MPI_Finalize is needed to properly clean up and close the connections between the codes running on the other
processors and release control. The call to
MPI_Barrier is optional; it is included here so that all the processors wait for one another
before the parallel environment is shut down.
- __MWERKS__ -
these #ifdef blocks merely compensate for a peculiarity of the Metrowerks CodeWarrior compiler.
They allow the executables running on the other processors to terminate automatically.
If you are using another compiler, you don't have to worry about this code.
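As a sketch of the MPI_Test alternative mentioned in the MPI_Irecv item above (and not used in the listings): instead of blocking in MPI_Wait, a code can poll the pending request and do other useful work until the receive completes. The variables here are assumed to be those of maintainboundaryconditions in Listing 2.
/* Hedged sketch, not in parallellife.c: poll an asynchronous receive with
   MPI_Test instead of blocking in MPI_Wait. */
int complete=0;
while (!complete) {
	ierr=MPI_Test(&leftReq, &complete, &status);
	if (!complete) {
		/* the message has not arrived yet; other useful work could go here */
	}
}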
Note that the routine that performs the actual propagation computation, propagatelife,
was left unaltered.
That was possible because
each processor was supplied just enough of the problem space to fulfill its responsibility,
while maintainboundaryconditions maintained the interaction between processors.
performcalculation's changes merely perform the boilerplate work of setting up and tearing down the
parallel environment.
When this code, with its ColumnDimension constant set to 27,
is run in parallel on three processors, each of the three processors (0, 1, and 2) produces output like this:
Frame 1199 idproc=0
0100
** 4032
** 0
10
011
1 0 43 300
2 05 1
10 0 0
1 15 240
02 0 0 0
10 01 1 2 62
202 14 1 0
48 1 1
79 0
0
3
Frame 1199 idproc=1
000
11210
1
310 0
0 2
3
57
44
0
3 050 0
20 121
010 0 4 1
010 14 20
70 02 31
0 1 3
0 121
000
24 31
2 0
01 1
0
Frame 1199 idproc=2
020 1 **
10 030 **
01
1
0
* 0
01 1 0 * 0
02 1 12
00 1 1 0
1 00 1 2
200 0 0
2
01 30
0
030
160 1
We can see that the space is sliced into three parts, a consequence of the partitioning of the problem space
between these three processors.
By design, cells at the edge of one partition and those at the neighboring edge of the next
influence each other as if they were part of the same problem space.
It is very important to realize that no one processor ever has a complete picture of the problem:
each processor has information only about its own space.
At each time step, all processors display the data they have.
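If a combined picture were ever needed, say for a single display on one machine, one could gather every block onto one processor. A hedged sketch, not part of parallellife.c, assuming the variables of maintainboundaryconditions and a hypothetical allBlocks buffer on processor 0:
/* Hedged sketch, not in parallellife.c: collect every processor's block
   (guard cells included) onto processor 0. */
long blockBytes=(in->rowSize+2)*in->rowBytes;
Byte *allBlocks=(idproc==0)?malloc(nproc*blockBytes):NULL;
ierr=MPI_Gather(&in->array[-1-in->rowBytes], blockBytes, MPI_BYTE,
	allBlocks, blockBytes, MPI_BYTE, 0, MPI_COMM_WORLD);
/* processor 0 could now assemble and draw the full space, then free(allBlocks) */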
Because this is a tutorial, the message-passing calls in the
maintainboundaryconditions
routine were written to be most readable, rather than
fully optimized.
You might notice that the message passing will generally have every processor
send to its right and receive from its left, then send to its left and receive from its right.
This message-passing pattern isn't necessarily optimal for typical networking hardware.
Because many Ethernet switches support full duplex between pairs of machines, meaning that a pair can send and receive simultaneously,
it would be better if each even processor swapped data with its odd-numbered neighbor on one
side, then the other.
We can do this by replacing the message passing code with the code in Listing 3.
Listing 3 - an optimized alternative to the message passing code in the maintainboundaryconditions routine
{long i;
for(i=2; i--; )
if (i^idproc&1) {
ierr=MPI_Irecv(&firstRow[-in->rowBytes], sizeof(Byte)*column,
MPI_BYTE, leftIDProc, leftIDProc,
MPI_COMM_WORLD, &leftReq);
ierr=MPI_Send(firstRow, sizeof(Byte)*column, MPI_BYTE,
leftIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Wait(&leftReq, &status);
}
else {
ierr=MPI_Irecv(&lastRow[+in->rowBytes], sizeof(Byte)*column,
MPI_BYTE, rightIDProc, rightIDProc,
MPI_COMM_WORLD, &rightReq);
ierr=MPI_Send(lastRow, sizeof(Byte)*column, MPI_BYTE,
rightIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Wait(&rightReq, &status);
}
}
The loop executes the if statement twice: on the first pass the even processors take the top branch while the odd
processors take the bottom branch, then they switch. This version is not necessary for code correctness,
but it takes advantage of typical features of networking hardware.
This is an optimization that applies when processors are simultaneously and bidirectionally swapping data.
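Another common way to express such a simultaneous, bidirectional swap, not used in the listings, is MPI_Sendrecv, which pairs a send and a receive in one call and lets the MPI library schedule the exchange. A sketch, assuming the variables of maintainboundaryconditions in Listing 2:
/* Hedged alternative to the Irecv/Send/Wait sequence: one MPI_Sendrecv per
   direction; tags and buffers match those used in Listing 2. */
ierr=MPI_Sendrecv(lastRow, sizeof(Byte)*column, MPI_BYTE, rightIDProc, idproc,
	&firstRow[-in->rowBytes], sizeof(Byte)*column, MPI_BYTE, leftIDProc, leftIDProc,
	MPI_COMM_WORLD, &status);
ierr=MPI_Sendrecv(firstRow, sizeof(Byte)*column, MPI_BYTE, leftIDProc, idproc,
	&lastRow[+in->rowBytes], sizeof(Byte)*column, MPI_BYTE, rightIDProc, rightIDProc,
	MPI_COMM_WORLD, &status);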
The purpose of this discussion was to highlight as clearly as possible what changes are needed
to parallelize an application whose problem has inherent local interactions.
We chose a simple two-dimensional problem of this type so attention could be drawn to the parallelization and message passing.
Components of your code may
correspond to the routines in the life code,
so this parallelization example could suggest how to parallelize your code.
This example can be a starting point
to experiment with parallel computing.
For example, one could try modifying the above parallel code with
different initial conditions, or introduce live cells at a later point in the computation.
One could display time steps in the calculation using a display more complex than text, including a
complete GUI. One could even enlarge the system and represent cells as individual pixels, which
would inherently make the problem scale better because the computation time, which grows with the area of each
processor's block, would increase relative to the communication time, which grows only with the length of the block's shared edges.
Different propagation rules could be applied, including ones that would require more guard cells between partitions.
One could even make this code interactive, allowing a user or users to click on where new seeds would dynamically be
introduced.
These fancier features are not shown here because they would distract from the purpose of this discussion,
but the reader is welcome to explore the possibilities.
References
Using MPI by William Gropp, Ewing Lusk and Anthony Skjellum
is a good reference for the MPI library. Many of the techniques and
style of the above parallel code were adopted from Dr. Viktor K. Decyk.
He expresses these ideas in a few references, including:
- V. K. Decyk, "How to Write (Nearly) Portable Fortran Programs for Parallel Computers", Computers In Physics,
7, p. 418 (1993).
- V. K. Decyk, "Skeleton PIC Codes for Parallel Computers", Computer Physics Communications,
87, p. 87 (1995).
Finally, the Game of Life has been extensively researched since it was first created in 1970 by John Horton Conway,
and many references on it are available.
You're welcome to read more about parallel computing
via the Tutorials page.
Happy Computing!