Visualizing Message Passing with MacMPI
A unique feature of MacMPI is its monitor window, designed to aid in
visualizing the state and history of the message passing occurring in a running parallel code.
We exhibit some examples of how the monitor window is useful by describing how
it responds to various kinds of parallel code and what its display can tell us
about the behavior and performance of that code.
Among MPI implementations, such a built-in visualization utility is the exception rather than the rule,
and it is supported when running via Pooch.
As we have seen with the source code tutorials on this web site, different problems require different
message-passing patterns.
The Parallel Adder tutorial demonstrated all-to-one message passing, gathering partial results to one task,
while Parallel Circle Pi extended that pattern with one-to-all message passing to balance the load on the processors.
The Parallel Pascal's Triangle tutorial demonstrated
nearest-neighbor, or round-robin, message passing.
Parallel Life demonstrated a similar idea of message swapping between neighbors and
extended it to two-dimensional problems.
Many MPI codes fit within those contexts, and still more extend into even more complicated message-passing patterns.
But how do we see whether our code is exhibiting the correct message-passing behavior?
MacMPI's monitor window is designed to help with that by displaying what it can about the ongoing state of the
parallel code in a window.
If the code isn't running as we had in mind, the monitor window can help there too by displaying what's really happening.
It will freeze on the culprit message state if a deadlock occurs, and it will
reveal when the code is sending too many small messages.
If you are learning about parallel code,
the following gives an idea of how to
use this aid while developing message-passing codes.
Deconstructing the Monitor Window
MacMPI's monitor window has three major sections.
The square lights at the top of the window display the
message-passing pattern of the parallel computing job at that moment.
Below that is a histogram of the number of messages sent and received as a function of
message size, in a log-log format.
At the bottom are two dials, one displaying the percentage of time
spent communicating and the other a measure of bandwidth.
Each dial displays instantaneous and average estimated quantities.
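MacMPI gathers these statistics internally. Purely as an illustration of what the communication-percentage dial measures, and not MacMPI's actual implementation, a sketch of such bookkeeping could wrap MPI calls with the standard MPI_Wtime() timer (all names here are hypothetical):

#include <mpi.h>

static double comm_time = 0.0;    /* accumulated time spent inside MPI calls */
static double start_time = 0.0;   /* wall-clock time when timing began       */

void timing_init(void) {
   start_time = MPI_Wtime();
}

/* A timed wrapper around MPI_Send; similar wrappers would surround
   the other communication calls. */
int timed_send(void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm) {
   double t0 = MPI_Wtime();
   int ierror = MPI_Send(buf, count, type, dest, tag, comm);
   comm_time += MPI_Wtime() - t0;
   return ierror;
}

/* Average fraction of elapsed time spent communicating so far;
   an "instantaneous" value would use only the most recent interval. */
double comm_fraction(void) {
   double elapsed = MPI_Wtime() - start_time;
   return (elapsed > 0.0) ? comm_time / elapsed : 0.0;
}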
Basic Message Passing
Let us consider what happens when we run a simple message-passing code.
Running the Parallel Knock code example as two tasks,
we perform a simple message pass from task 0 to task 1, then back again.
MacMPI displays this message passing sequence in its monitor window
as follows.
Task 0:
Task 0 begins by sending a message to task 1.
The window from task 0 reports this act by displaying a green box in light #1.

ierror = MPI_Send(&sendmsg,len,MPI_INT,
   idproc+1,tag,MPI_COMM_WORLD);

Task 1:
Meanwhile, task 1 calls MPI_Recv, preparing to receive a message from task 0.
Its MacMPI window indicates this activity with red in light #0.

ierror = MPI_Recv(&recvmsg,len,MPI_INT,
   idproc-1,tag,MPI_COMM_WORLD,&status);

[MacMPI monitor window screenshots for task 0 and task 1]
After the MPI calls return, the knock code has each task print the message, then proceeds to the reply.
Task 0:
Task 0 calls MPI_Recv to receive the reply from task 1.
A red box appears in light #1.

ierror = MPI_Recv(&recvmsg,len,MPI_INT,
   idproc+1,tag+1,MPI_COMM_WORLD,&status);

Task 1:
Meanwhile, task 1 calls MPI_Send to send its reply to task 0.
MacMPI displays green in light #0.

ierror = MPI_Send(&replymsg,len,MPI_INT,
   idproc-1,tag+1,MPI_COMM_WORLD);

[MacMPI monitor window screenshots for task 0 and task 1]
Knock sends from task 0 to task 1 and back again, and MacMPI's monitor window displays this
pattern in color as the communication occurs.
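For reference, the two halves above can be combined into a single, self-contained program. The following is a minimal sketch of such an exchange, not the actual knock source; the message contents and length are placeholders:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
   int idproc, nproc, ierror;
   int sendmsg = 42, recvmsg = 0, replymsg = 43;   /* placeholder data */
   int len = 1, tag = 0;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &idproc);
   MPI_Comm_size(MPI_COMM_WORLD, &nproc);

   if (nproc >= 2 && idproc < 2) {
      if (idproc == 0) {          /* task 0: send, then receive the reply */
         ierror = MPI_Send(&sendmsg, len, MPI_INT,
                           idproc+1, tag, MPI_COMM_WORLD);
         ierror = MPI_Recv(&recvmsg, len, MPI_INT,
                           idproc+1, tag+1, MPI_COMM_WORLD, &status);
      } else {                    /* task 1: receive, then send the reply */
         ierror = MPI_Recv(&recvmsg, len, MPI_INT,
                           idproc-1, tag, MPI_COMM_WORLD, &status);
         ierror = MPI_Send(&replymsg, len, MPI_INT,
                           idproc-1, tag+1, MPI_COMM_WORLD);
      }
      printf("task %d received %d\n", idproc, recvmsg);
   }

   MPI_Finalize();
   return 0;
}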
Deadlocks
A common error for novices in MPI programming is a deadlock condition, where
the code hangs because MPI cannot complete the message passing as written.
Consider an extension of the Parallel Knock example above:
what if we want the messages to be exchanged simultaneously, rather than one way then the other?
The novice, thinking that the receives and the sends should be performed at the same time,
might rewrite the code like this:
Don't:
Task 0:

ierror = MPI_Recv(&recvmsg,len,MPI_INT,
   idproc+1,tag+1,MPI_COMM_WORLD,&status);
ierror = MPI_Send(&sendmsg,len,MPI_INT,
   idproc+1,tag,MPI_COMM_WORLD);

Task 1:

ierror = MPI_Recv(&recvmsg,len,MPI_INT,
   idproc-1,tag,MPI_COMM_WORLD,&status);
ierror = MPI_Send(&replymsg,len,MPI_INT,
   idproc-1,tag+1,MPI_COMM_WORLD);
MacMPI's monitor windows will therefore report this:
[MacMPI monitor window screenshots for task 0 and task 1]
and the code will hang indefinitely.
Each task is waiting for a message from the other, but MPI_Recv won't return
until it has finished receiving that message. Since neither task has yet reached its MPI_Send,
neither receive can complete, and the code stops here.
Depending on the message length and the network infrastructure,
a similar deadlock can occur if both tasks call MPI_Send first, because MPI_Send may not return
until the receiving side has begun accepting the message:
Don't:
Task 0:

ierror = MPI_Send(&sendmsg,len,MPI_INT,
   idproc+1,tag,MPI_COMM_WORLD);
ierror = MPI_Recv(&recvmsg,len,MPI_INT,
   idproc+1,tag+1,MPI_COMM_WORLD,&status);

Task 1:

ierror = MPI_Send(&replymsg,len,MPI_INT,
   idproc-1,tag+1,MPI_COMM_WORLD);
ierror = MPI_Recv(&recvmsg,len,MPI_INT,
   idproc-1,tag,MPI_COMM_WORLD,&status);
[MacMPI monitor window screenshots for task 0 and task 1]
The fix is to use a nonblocking (asynchronous) receive call, MPI_Irecv, and
follow the MPI_Send call with an MPI_Wait,
like this:
Do:
Task 0:

MPI_Request req;
ierror = MPI_Irecv(&recvmsg,len,MPI_INT,
   idproc+1,tag+1,MPI_COMM_WORLD,&req);
ierror = MPI_Send(&sendmsg,len,MPI_INT,
   idproc+1,tag,MPI_COMM_WORLD);
ierror = MPI_Wait(&req, &status);

Task 1:

MPI_Request req;
ierror = MPI_Irecv(&recvmsg,len,MPI_INT,
   idproc-1,tag,MPI_COMM_WORLD,&req);
ierror = MPI_Send(&replymsg,len,MPI_INT,
   idproc-1,tag+1,MPI_COMM_WORLD);
ierror = MPI_Wait(&req, &status);
Each task sends and receives at the same time, indicated with yellow.
[MacMPI monitor window screenshots for task 0 and task 1]
Deadlocks involving more than two tasks can be much more complicated than those above,
but the principle is the same.
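As an aside, the MPI standard also provides MPI_Sendrecv, which performs a send and a matching receive in a single call and avoids this kind of deadlock. It is not used in the knock example, but a minimal sketch of the same exchange with it might look like the following, where partner would be idproc+1 on task 0 and idproc-1 on task 1, and both directions use the same tag:

ierror = MPI_Sendrecv(&sendmsg, len, MPI_INT, partner, tag,
                      &recvmsg, len, MPI_INT, partner, tag,
                      MPI_COMM_WORLD, &status);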
Optimizing Parallel Code
Especially for IP-based Ethernet networks,
two benchmark numbers almost entirely
characterize network performance: bandwidth and latency.
Bandwidth is the maximum amount of data that the network can send in a given time.
However, latency, the time it takes for a one-byte message to arrive at its destination after leaving its source,
is what most often limits how many processors can be used on a given problem.
Anyone who has waited for a web browser to start loading a new page after
clicking its link has experienced network latency first-hand:
High bandwidth makes the page load fast once it starts;
latency is how long it takes the first byte to arrive.
Sooner or later, these latency issues will limit a parallel code's performance.
The trick is to optimize your code to minimize latency's effects.
A very effective way to do that is to aggregate smaller messages into fewer, larger ones
whenever possible.
Your code will then take the best possible advantage of the bandwidth the network
can provide, rather than be limited by its latency.
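To make the trade-off concrete, a simple cost model for a single message is time ≈ latency + size/bandwidth. The short program below compares sending many one-byte messages with sending one aggregated message under that model; the latency and bandwidth figures are made-up illustrative values, not measurements of any particular network:

#include <stdio.h>

/* Simple cost model: time = latency + bytes / bandwidth.
   The numbers below are illustrative assumptions, not measurements. */
int main(void) {
   double latency   = 100e-6;   /* 100 microseconds per message */
   double bandwidth = 10e6;     /* 10 MB/s sustained            */
   long   nbytes    = 10000;    /* total data to move           */
   long   nmsgs     = 10000;    /* case 1: one byte per message */

   double many_small = nmsgs * (latency + (nbytes / (double)nmsgs) / bandwidth);
   double one_large  = latency + nbytes / bandwidth;

   printf("10,000 one-byte messages: %.4f s\n", many_small);  /* about 1.0 s    */
   printf("one 10,000-byte message:  %.4f s\n", one_large);   /* about 0.0011 s */
   return 0;
}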
Optimizing your code on a Mac cluster with a common network like 100BaseT Ethernet reveals latency's effects sooner;
if you optimize your code there, it will perform that much better when you do
get the opportunity to use a large system with a much better network.
MacMPI's monitor window can show, in its histogram section, when a parallel code is sending a large number of
short messages.
As an example, let's consider a correctly executing but poorly optimized version of the message passing in
the Parallel Life tutorial.
Correct, but slow:
All tasks:
for(i=0; i<column; i++) {
ierr=MPI_Irecv(&firstRow[i-in->rowBytes], sizeof(Byte), MPI_BYTE,
leftIDProc, leftIDProc, MPI_COMM_WORLD, &leftReq);
ierr=MPI_Irecv(&lastRow[i+in->rowBytes], sizeof(Byte), MPI_BYTE,
rightIDProc, rightIDProc, MPI_COMM_WORLD, &rightReq);
ierr=MPI_Send(&lastRow[i], sizeof(Byte), MPI_BYTE,
rightIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Send(&firstRow[i], sizeof(Byte), MPI_BYTE,
leftIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Wait(&leftReq, &status);
ierr=MPI_Wait(&rightReq, &status);
}
This code has each process send messages to its neighbors, one byte at a time.
When run on a four-processor system, MacMPI displays the following:
[MacMPI monitor window screenshots for task 0 and task 1]
MacMPI is reporting a large number of one-byte messages accumulated on the left edge of
its histogram.
In addition, the blue dial labeled "Communication %" reports that the code is spending a relatively large
fraction of its time communicating:
on average, almost 90% of the time is spent communicating.
This is a poorly optimized code.
We can improve the performance by having the code
organize the appropriate data sequentially so that it can be sent as one large message per neighbor.
Much better:
All tasks:
ierr=MPI_Irecv(&firstRow[-in->rowBytes], sizeof(Byte)*column, MPI_BYTE,
leftIDProc, leftIDProc, MPI_COMM_WORLD, &leftReq);
ierr=MPI_Irecv(&lastRow[+in->rowBytes], sizeof(Byte)*column, MPI_BYTE,
rightIDProc, rightIDProc, MPI_COMM_WORLD, &rightReq);
ierr=MPI_Send(lastRow, sizeof(Byte)*column, MPI_BYTE,
rightIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Send(firstRow, sizeof(Byte)*column, MPI_BYTE,
leftIDProc, idproc, MPI_COMM_WORLD);
ierr=MPI_Wait(&leftReq, &status);
ierr=MPI_Wait(&rightReq, &status);
|
The best thing to do is to have the code collect the data as large messages
and have pairs of processors swap data with each other.
Best:
All tasks:
{long i;
   /* Two passes: on each pass, half of the tasks exchange with their left
      neighbor while the other half exchange with their right neighbor,
      so every pair of neighbors swaps simultaneously. */
   for(i=2; i--; )
      if (i^idproc&1) {   /* i.e., i ^ (idproc & 1): alternate by task parity */
         ierr=MPI_Irecv(&firstRow[-in->rowBytes], sizeof(Byte)*column,
            MPI_BYTE, leftIDProc, leftIDProc,
            MPI_COMM_WORLD, &leftReq);
         ierr=MPI_Send(firstRow, sizeof(Byte)*column, MPI_BYTE,
            leftIDProc, idproc, MPI_COMM_WORLD);
         ierr=MPI_Wait(&leftReq, &status);
      }
      else {
         ierr=MPI_Irecv(&lastRow[+in->rowBytes], sizeof(Byte)*column,
            MPI_BYTE, rightIDProc, rightIDProc,
            MPI_COMM_WORLD, &rightReq);
         ierr=MPI_Send(lastRow, sizeof(Byte)*column, MPI_BYTE,
            rightIDProc, idproc, MPI_COMM_WORLD);
         ierr=MPI_Wait(&rightReq, &status);
      }
}
This code pairs up neighboring processes so that each pair swaps its edge data simultaneously,
alternating between left and right neighbors on successive passes.
On a four-processor system, MacMPI displays the following:
[MacMPI monitor window screenshots for task 0 and task 1]
In reality, the lights would be mostly white, interrupted by brief, non-simultaneous flashes of yellow.
We can see in the histogram that the bulk of the message passing is at a much larger message size (24-byte messages in this case)
and far fewer in number (hundreds of messages, down from tens of thousands). These 24-byte messages are as large as we can
efficiently make them because that is all the data this particular code needs to transmit at each time step.
The performance improvement is also seen in the "Communication %" dial, which is reporting
at least a 60% drop in time spent communicating.
Those witnessing the code run can see a substantial performance improvement, corroborating
these measurements.
The purpose of this discussion was to show how to
use MacMPI's visualization features to help debug and optimize parallel code.
In practice, its users leave this window on while the
code is running. Only when the user is certain the code
is highly optimized and
wishes to maximize overall performance is this monitor
window turned off, to minimize its overhead.
These windows make experimenting and working with parallel computing much easier and more interactive.
We have colleagues who were able to discover useful insights into their code in minutes using
MacMPI's monitor window that would have taken weeks
to understand without it.
Besides the ease with which it can be compiled and linked with typical codes,
this message-passing visualization tool is one of the primary reasons we recommend using MacMPI.
The reader is welcome to explore the possibilities.
You're welcome to read more about parallel computing
via the Tutorials page.