Parallelization
There are two distinct issues one encounters when using parallel computers to solve computational problems.
The first is how to construct, use, and maintain the parallel computer and launch jobs on it.
The second issue is how to go about organizing your computational problem
so that it can be solved on a parallel computer.
In our experience, a Mac cluster is a type of parallel computer that solves the former issue.
The latter issue is specific to the application and often requires
application-specific attention.
Although volumes have been written on that second issue, this page is meant to provide
an introduction to those parallelization issues: writing parallel code or parallelizing existing code.
Automatic parallelization of code has been an active research topic in scientific computing for the last two decades,
but, except in very special cases, automatically parallelized codes typically achieve little of the potential benefit of
parallel computing.
So far, the human contribution has been essential to produce
the best performing parallel codes.
Candidates for Parallelization
Any operations that take longer than a minute to complete are good candidates for parallelization.
Simpler operations, such as drawing to the screen, writing email, or drawing
simple graphics primitives, would probably not be appropriate
to parallelize. However, large CMYK-to-RGB image conversions,
large scientific simulations, analyses of large datasets,
rendering complex 3D graphics, video compositing and editing,
or large scale search or structure traversing problems
would be appropriate to parallelize because the computation is large enough
to benefit meaningfully from parallel computing.
Partitioning the Problem
A key to parallelizing an application is choosing the appropriate partitioning of the problem.
The choice must be made so as to minimize the amount of time spent communicating relative to
the time spent computing.
One can think of the partitioning being in a direction, or on an "axis" or degree of freedom
of the problem.
The pieces should not be so small that the code experiences high communication overhead,
yet they should be small enough that the problem can be divided into at least as many pieces as there are processors.
Simplest Parallelization - divide and conquer
Suppose an application is required to compute a two-dimensional image whose
pixel values can be computed completely independently and in any order.
The simplest way to parallelize such a problem would be to cut the image into
strips and assign each node a strip to compute until the problem is finished.
Many applications can be partitioned into pieces like these and implemented on
a parallel computer.
The older fractal demonstration codes partitioned the problem along the y-axis of the image.
Hence, each processor computed a horizontal strip of the image and sent its results
back to node 0, which then assembled the pieces to complete the solution for the user.
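As a concrete illustration, here is a minimal sketch of that kind of strip decomposition written with MPI in C. The image dimensions, the compute_pixel() routine, and the use of MPI_Gather are our own illustrative assumptions, not code taken from the fractal demonstrations.

    /* Strip-decomposition sketch: each task computes a contiguous band of
       rows, and node 0 gathers the bands into the full image.  WIDTH,
       HEIGHT, and compute_pixel() are hypothetical stand-ins. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define WIDTH  640
    #define HEIGHT 480

    static float compute_pixel(int x, int y)    /* placeholder for real work */
    {
        return (float)(x * y);
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int rowsPerNode = HEIGHT / size;     /* assumes HEIGHT divides evenly */
        int firstRow    = rank * rowsPerNode;

        float *strip = malloc((size_t)rowsPerNode * WIDTH * sizeof(float));
        for (int y = 0; y < rowsPerNode; y++)
            for (int x = 0; x < WIDTH; x++)
                strip[y * WIDTH + x] = compute_pixel(x, firstRow + y);

        /* node 0 collects the strips, in rank order, into the complete image */
        float *image = (rank == 0) ? malloc((size_t)HEIGHT * WIDTH * sizeof(float))
                                   : NULL;
        MPI_Gather(strip, rowsPerNode * WIDTH, MPI_FLOAT,
                   image, rowsPerNode * WIDTH, MPI_FLOAT, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("image of %d x %d pixels assembled on node 0\n", WIDTH, HEIGHT);
        free(strip);
        free(image);
        MPI_Finalize();
        return 0;
    }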
Later it was seen that the fractal problem was sometimes cut into pieces of
uneven difficulty for identical processors.
In the original AltiVec Fractal Carbon app,
the load-balancing solution used to solve that issue was to interlace the image:
each of the N nodes would compute every Nth line. For example, if there were only two
processors, one would compute all the even lines while the other would compute the odd lines.
The advantages of this technique are that it very effectively load balances a potentially uneven problem for
identical processors, and it allows all parallel tasks to figure out for themselves
(without the consultation of a central authority) what
part of the problem to work on.
However, it does not address how to load balance for heterogeneous
processors.
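A hedged sketch of the interlaced assignment just described might look like the following; the compute_row() routine and the image dimensions are hypothetical stand-ins, and this is not the AltiVec Fractal Carbon source.

    /* Interlacing sketch: task r computes rows r, r+N, r+2N, ... of the image,
       where N is the number of tasks, and sends each finished row to node 0
       tagged with its row index.  Each task derives its work from its own
       rank; no central authority hands out assignments. */
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    #define WIDTH  640
    #define HEIGHT 480

    static void compute_row(int y, float *row)   /* placeholder for real work */
    {
        for (int x = 0; x < WIDTH; x++)
            row[x] = (float)(x + y);
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        float *image = (rank == 0) ? malloc((size_t)HEIGHT * WIDTH * sizeof(float))
                                   : NULL;
        float *row = malloc(WIDTH * sizeof(float));

        for (int y = rank; y < HEIGHT; y += size) {   /* every size-th row */
            compute_row(y, row);
            if (rank == 0)
                memcpy(&image[y * WIDTH], row, WIDTH * sizeof(float));
            else
                MPI_Send(row, WIDTH, MPI_FLOAT, 0, y, MPI_COMM_WORLD);
        }

        if (rank == 0) {                    /* collect the other tasks' rows */
            for (int y = 0; y < HEIGHT; y++)
                if (y % size != 0)
                    MPI_Recv(&image[y * WIDTH], WIDTH, MPI_FLOAT,
                             y % size, y, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            free(image);
        }
        free(row);
        MPI_Finalize();
        return 0;
    }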
The Fresnel Diffraction Explorer takes
load balancing a step further. Like the Fractal codes, this parallel application
partitions the problem along the vertical axis of the diffraction pattern.
However, node 0 of the application
assigns arbitrarily-sized pieces to each node and dynamically
adjusts the workload for each node depending on the round-trip time
from assignment to receipt of the completed results of that node's portion of the image.
It regulates the size of the work assignments so that each one
completes in about two seconds, an interval chosen to be as long as possible
while maintaining a responsive user interface.
This load-balancing algorithm handles irregularities in
the problem, the processors, and the network using one algorithm.
In addition, the Fresnel Diffraction Explorer assigns the
lines in an order such that the user can see the image evolve
into progressively finer detail as it is being computed.
The user is better able to see an approximation of the diffraction
pattern well before it is complete.
The goal is to be user-friendly even while running in parallel.
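The following sketch illustrates one way a master/worker scheme of this general kind could be written with MPI. The two-second target, the chunk-rescaling rule, the tags, and the compute_rows() routine are our own illustrative assumptions, not code from the Fresnel Diffraction Explorer.

    /* Master/worker sketch with round-trip timing: node 0 assigns row ranges,
       times each assignment with MPI_Wtime(), and rescales the next chunk so
       it takes roughly TARGET_SECONDS.  Assumes size <= MAXNODES. */
    #include <stdlib.h>
    #include <mpi.h>

    #define WIDTH          640
    #define HEIGHT         480
    #define MAXNODES       64
    #define TARGET_SECONDS 2.0
    #define TAG_WORK       1
    #define TAG_RESULT     2
    #define TAG_STOP       3

    static void compute_rows(int first, int count, float *out)  /* placeholder */
    {
        for (int i = 0; i < count * WIDTH; i++)
            out[i] = (float)(first + i);
    }

    static void master(int size)
    {
        int nextRow = 0, active = 0, dummy, chunk[MAXNODES];
        double sent[MAXNODES];
        for (int w = 1; w < size; w++)
            chunk[w] = 8;                           /* initial guess: 8 rows */

        for (int w = 1; w < size && nextRow < HEIGHT; w++) {  /* prime workers */
            int assign[2] = { nextRow, chunk[w] };
            if (assign[1] > HEIGHT - nextRow) assign[1] = HEIGHT - nextRow;
            MPI_Send(assign, 2, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            sent[w] = MPI_Wtime();
            nextRow += assign[1];
            active++;
        }
        while (active > 0) {
            MPI_Status st;
            MPI_Recv(&dummy, 0, MPI_BYTE, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &st);          /* pixel data omitted here */
            int w = st.MPI_SOURCE;
            double elapsed = MPI_Wtime() - sent[w];
            if (elapsed < 1e-6) elapsed = 1e-6;
            double scaled = chunk[w] * TARGET_SECONDS / elapsed;  /* rescale */
            chunk[w] = (scaled > HEIGHT) ? HEIGHT : (int)scaled;
            if (chunk[w] < 1) chunk[w] = 1;
            if (nextRow < HEIGHT) {
                int assign[2] = { nextRow, chunk[w] };
                if (assign[1] > HEIGHT - nextRow) assign[1] = HEIGHT - nextRow;
                MPI_Send(assign, 2, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                sent[w] = MPI_Wtime();
                nextRow += assign[1];
            } else {
                MPI_Send(&dummy, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
    }

    static void worker(void)
    {
        int done = 0;
        float *rows = malloc((size_t)HEIGHT * WIDTH * sizeof(float));
        for (;;) {
            int assign[2];
            MPI_Status st;
            MPI_Recv(assign, 2, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            compute_rows(assign[0], assign[1], rows);
            MPI_Send(&done, 0, MPI_BYTE, 0, TAG_RESULT, MPI_COMM_WORLD);
        }
        free(rows);
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0 && size > 1) master(size);
        else if (rank != 0)        worker();
        MPI_Finalize();
        return 0;
    }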
The approach of dividing the problem into independent sections is appropriate for a variety of
other interesting computational problems. Rendering a single image of a
three-dimensional scene would also be appropriate to parallelize this way.
But choosing to partition along the vertical axis is not the only way.
If such an application instead had to render many 3D scenes in a time sequence,
it could be more appropriate to divide the problem along the time axis or timeline.
Similarly, a video editing application could partition the final high-resolution render
along its timeline, and perhaps choose the partitioning at the edit points in the
video rather than choosing N pieces for N nodes.
Many searching or structure traversing problems could be parallelized by
partitioning sections of the data set at points appropriate to the nature of the
data set.
Searching a very large document for a word could be accomplished by assigning each node its own piece of the document
and collecting the results of each node's search of its section.
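For instance, a hedged sketch of such a word search might look like the following, where each node counts occurrences in its own slice and MPI_Reduce sums the counts on node 0. The count_word() routine and the per-rank slices are hypothetical; a real code would also handle a word that straddles the boundary between two slices.

    /* Word-search sketch: each node counts occurrences of the target word in
       its own slice of the document; node 0 sums the counts. */
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    static int count_word(const char *text, const char *word)
    {
        int n = 0;
        for (const char *s = strstr(text, word); s != NULL;
             s = strstr(s + strlen(word), word))
            n++;
        return n;
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* in a real application each node would read its own section of the
           document; a short per-rank string stands in for that here */
        const char *mySlice = (rank % 2) ? "particles interact through fields"
                                         : "particles drift between partitions";
        int local = count_word(mySlice, "particles");
        int total = 0;
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("\"particles\" appears %d times across all slices\n", total);

        MPI_Finalize();
        return 0;
    }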
Chess programs and other game codes often
search a "game tree", a representation of the consequences of one player's move, then
the next move, and so forth, for desirable outcomes to determine the "best" next move.
One could parallelize such a code by assigning nodes to begin at branches not far from
the base of the tree, but far enough that there will be enough branches to assign enough work
to keep all the nodes busy. Node 0 then combines the results of those searches to
produce its final answer.
The above problems are often collected into a class that is sometimes called "trivially parallelizable".
That is, these problems are relatively straightforward to parallelize because the pieces of work
performed by the nodes do not depend on one another, or at least have very little interdependence.
An extreme example of a "trivially parallelizable" problem is
SETI@Home, where the computational nodes could
be computing pieces of the problem for hours before sending brief replies back to a server.
The extreme nature of the SETI@Home problem makes it appropriate for implementation
over the Internet on a wide scale.
As noble as the goal of SETI@Home is, however, its implementation
is practical only for the few types of computational problems whose
communication demands are so low.
Parallelizing Tightly-Coupled Scientific Code - more difficult, but achievable
A computational problem that
is impossible to parallelize, given enough time and clever ingenuity, has yet to be conceived,
but scientific computational problems can nonetheless be among the most difficult to
parallelize. Effective and efficient parallel codes have been created to
solve problems in materials engineering, chemistry, biology, quantum mechanics, high-energy physics,
galaxy evolution, mathematics, economics, game theory, and many more.
Below we provide a taste of the more advanced parallelization
algorithms that are needed.
Particle-based plasma simulations are an excellent example of a scientific
problem that was successfully parallelized but was not trivial to implement.
A plasma is a collection of charged particles interacting through
mutual electromagnetic fields.
These particles are not merely influenced by their nearest neighbors,
but are usually influenced by all the other particles.
(Another important part of the code is the field solve, which can
require some form of a parallelized FFT. That's a chapter in itself,
so we'll focus only on the particles here.)
Typically, the partitioning is chosen so that each node is responsible
for the physics in its assigned region of space. The
partitioning can be performed in one, two, or all three dimensions.
|
|
The bulk of the computation is accomplished when each node computes
the trajectories of each of the particles in its partition.
However, at each time step some of the particles will drift far enough to leave their node's partition
of space and
cross into the neighboring partitions. That node then assembles messages containing
all the particles that now belong to its neighbors and sends those messages to the appropriate nodes.
Likewise, that node receives messages from its neighbors about particles that are its responsibility.
What makes parallelization of this part of the plasma code nontrivial is writing
the correct bookkeeping code needed to organize the problem and keep
track of whose particles belong to which node.
But the technique is very effective because the computation time is proportional to the
volume of each partition while the communication time is proportional to the surface area between partitions.
As long as there are enough particles in each partition, the code achieves excellent (80-90%) parallelism.
(Recently, Dr. Viktor K. Decyk has successfully run a plasma simulation containing
over 8 billion particles on a 4096-node parallel computer.)
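To give a flavor of that bookkeeping, here is a greatly simplified sketch of a one-dimensional particle exchange. The Particle type, the toy push, the reflecting walls, and the buffer sizes are our own simplifications, not code from the plasma codes cited below.

    /* 1D domain-decomposition sketch: after pushing its particles, each node
       packs the ones that left its slab into buffers for its left and right
       neighbors and trades them with MPI_Sendrecv.  A Particle is two
       doubles, so it is sent here as pairs of MPI_DOUBLE. */
    #include <stdlib.h>
    #include <mpi.h>

    #define NP     10000      /* particles initially on each node */
    #define MAXBUF 4096       /* max particles exchanged per step */

    typedef struct { double x, vx; } Particle;

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each node owns the slab [rank, rank+1) of the domain [0, size) */
        double lo = rank, hi = rank + 1.0;
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        Particle *p = malloc((NP + 2 * MAXBUF) * sizeof(Particle));
        int np = NP;
        for (int i = 0; i < np; i++) {            /* toy initial conditions */
            p[i].x  = lo + (i + 0.5) / np;
            p[i].vx = (i % 2) ? 0.01 : -0.01;
        }

        Particle sendL[MAXBUF], sendR[MAXBUF], recvBuf[MAXBUF];
        for (int step = 0; step < 10; step++) {
            int nL = 0, nR = 0;
            for (int i = 0; i < np; ) {
                p[i].x += p[i].vx;                /* stand-in for the push */
                if (p[i].x < 0.0)   { p[i].x = -p[i].x;             p[i].vx = -p[i].vx; }
                if (p[i].x >= size) { p[i].x = 2.0 * size - p[i].x; p[i].vx = -p[i].vx; }
                if      (p[i].x < lo)  { sendL[nL++] = p[i]; p[i] = p[--np]; }
                else if (p[i].x >= hi) { sendR[nR++] = p[i]; p[i] = p[--np]; }
                else i++;
            }
            /* trade particles with the neighbors; the number received is
               recovered from the receive status with MPI_Get_count */
            MPI_Status st;
            int nIn;
            MPI_Sendrecv(sendR, 2 * nR, MPI_DOUBLE, right, 0,
                         recvBuf, 2 * MAXBUF, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, &st);
            MPI_Get_count(&st, MPI_DOUBLE, &nIn);
            for (int i = 0; i < nIn / 2; i++) p[np++] = recvBuf[i];

            MPI_Sendrecv(sendL, 2 * nL, MPI_DOUBLE, left, 1,
                         recvBuf, 2 * MAXBUF, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, &st);
            MPI_Get_count(&st, MPI_DOUBLE, &nIn);
            for (int i = 0; i < nIn / 2; i++) p[np++] = recvBuf[i];
        }

        free(p);
        MPI_Finalize();
        return 0;
    }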
What also makes this problem nontrivial is its communication demands.
On a cluster, nodes running plasma codes often require a few megabytes per second
of continuous bandwidth and communication many times per second to coordinate their work.
Distributing this work over the Internet,
with potentially long delays and limited bandwidth, is unlikely to be practical,
while local 100BaseT networking
usually satisfies these demands.
A Mac cluster becomes an ideal choice for these kinds of tightly-coupled parallel codes.
Getting to Know MPI
To get started writing code, you are welcome to study the source code
examples in the Pooch Software Development Kit. If this is your
first time writing parallel code, you should probably start by
looking over and running the knock.c or knock.f examples.
Those examples demonstrate how to use the most fundamental MPI calls
that start up (MPI_Init) and tear down (MPI_Finalize) the MPI environment,
determine how many tasks are in the cluster (MPI_Comm_size),
identify your task (MPI_Comm_rank),
send a message (MPI_Send), and receive a message (MPI_Recv).
With only those six calls, it is possible to write
very complex and interesting parallel codes.
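As a taste of what such a program looks like, here is a minimal sketch, distinct from the knock.c example itself, that uses only those six calls: every nonzero task sends a greeting to task 0, which receives and prints each one.

    /* Minimal six-call MPI sketch: workers send a greeting to task 0. */
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        char msg[64];

        MPI_Init(&argc, &argv);                   /* start up MPI           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many tasks in all  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which task am I        */

        if (rank != 0) {
            sprintf(msg, "knock knock from task %d of %d", rank, size);
            MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            printf("task 0 of %d is listening\n", size);
            for (int src = 1; src < size; src++) {
                MPI_Status status;
                MPI_Recv(msg, sizeof(msg), MPI_CHAR, src, 0,
                         MPI_COMM_WORLD, &status);
                printf("%s\n", msg);
            }
        }

        MPI_Finalize();                           /* tear down MPI          */
        return 0;
    }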
After you have understood those routines,
you can explore the routines for asynchronous message passing (MPI_Irecv, MPI_Isend, MPI_Test, MPI_Wait).
Those calls allow your code to send and receive messages and compute simultaneously.
Proper utilization of those calls will make it possible for your code to
take optimal advantage of parallel computing.
The pingpong.c and pingpong.f examples in the SDK demonstrate how to use those calls.
Those 10 calls are the most important parts of MPI and could very well be all that you need.
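Below is a hedged sketch, separate from the pingpong examples, of the overlap idea: each task posts MPI_Irecv and MPI_Isend for a ring exchange, computes on local data while the messages are in flight, and then calls MPI_Wait on each request. The buffer size and do_local_work() are hypothetical.

    /* Overlap sketch: nonblocking ring exchange overlapped with computation. */
    #include <stdio.h>
    #include <mpi.h>

    #define N 100000

    static double do_local_work(void)     /* stand-in for useful computation */
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += (double)i * 1e-6;
        return s;
    }

    int main(int argc, char *argv[])
    {
        static double outbuf[N], inbuf[N];
        int rank, size;
        MPI_Request recvReq, sendReq;
        MPI_Status  status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;            /* pass data around a ring */
        int left  = (rank - 1 + size) % size;
        for (int i = 0; i < N; i++)
            outbuf[i] = rank + i * 1e-9;

        MPI_Irecv(inbuf, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &recvReq);
        MPI_Isend(outbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &sendReq);

        double local = do_local_work();           /* overlaps the transfers */

        MPI_Wait(&sendReq, &status);              /* outbuf may now be reused */
        MPI_Wait(&recvReq, &status);              /* inbuf is now valid       */

        if (rank == 0)
            printf("local result %g; first value from the left neighbor %g\n",
                   local, inbuf[0]);

        MPI_Finalize();
        return 0;
    }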
After that, MPI provides other
calls that build on the above 10 and can be thought of as convenience routines for common patterns in
parallel computing, such as collective calls that perform broadcasts or matrix-style
manipulation or collect and rearrange data.
We also provide six tutorials on writing and developing parallel code:
Parallel Knock,
Parallel Adder,
Parallel Pascal's Triangle,
Parallel Circle Pi,
Parallel Life, and
Visualizing Message Passing.
The first tutorial shows how to perform basic message passing, and the last
exhibits parallel code alongside a visualization of the resulting message passing.
The remaining four show conversion of single-processor code into
parallel code. Together, these documents demonstrate many fundamental
issues encountered when writing for parallel computers.
Further Reading
As we have stated, many volumes exist on how one would go about writing parallel code,
so we can list only a handful of them here.
In Search of Clusters by Gregory F. Pfister is a good book
that compares and contrasts various types of parallel computers and
explores the issues involved with coding for such machines.
Pfister's discussion in Sections 9.3 through 9.5 comparing the application of
shared-memory versus message-passing programming to
a simple solution of Laplace's equation is
very insightful.
Using MPI by William Gropp, Ewing Lusk and Anthony Skjellum
is a good reference for the MPI library.
The following references provide a detailed description of many of the parallelization techniques used in the plasma code:
- V. K. Decyk, "How to Write (Nearly) Portable Fortran Programs for Parallel Computers", Computers in Physics,
7, p. 418 (1993).
- V. K. Decyk, "Skeleton PIC Codes for Parallel Computers", Computer Physics Communications,
87, p. 87 (1995).
- J. M. Dawson, "Computer Modeling of Plasma: Past, Present, and Future", Phys. Plasmas, 2, p. 6 (1995).
- P. M. Lyster, P.C. Liewer, R. D. Ferraro, and V. K. Decyk,
"Implementation and Characterization of Three-Dimensional
Particle-in-Cell Codes on Multiple-Instruction-Multiple-Data
Parallel Supercomputers", Computers in Physics, 9, no. 4, p. 420 (1995).
You're welcome to read more about parallel computing
via the Tutorials page.