Frequently Asked Questions
- Does your software work with the new 8-Core Mac Pros?
Yes, Pooch launches multiple tasks per node to take advantage of multiple processing cores.
The number of tasks allowed per node is equal to
the total number of licensed nodes.
For example, a 4-node Pooch will run 4 tasks per node, which is appropriate for 4-core Macs.
Similarly, an 8-node Pooch will run 8 tasks per node, which is appropriate for 8-core Macs.
Currently, 16-node and larger Pooch licenses are limited to 16 tasks per node.
- Does your software work with the newer Intel-based Macs?
Yes, version 1.7 of Pooch is the first Universal Application cluster solution
that operates natively on both the new Intel-based Macs and PowerPC Macs.
This is the first and only parallel computing solution to support Universal Clustering:
launching other Universal Applications in parallel on a cluster.
- How do I get started using parallel computing, clusters, and Pooch?
You may start with the materials on this web site, but where you begin
depends on your background.
The Getting Started page
lists starting points depending on your
knowledge, experience, and interest.
The links from that page may serve as your introduction.
- Do you support Message-Passing Interface (MPI)?
Yes, Pooch has supported parallel computing using the distributed-memory MPI model from day one.
Pooch currently supports seven MPIs.
Information on using and compiling these MPIs can be found in
the Cluster Software Development Kit and
on the Compiling the MPIs page.
We commonly recompile Fortran and C distributed-memory MPI codes,
already portable across platforms such as Cray, SGI, IBM SP, and Linux clusters,
on the Mac platform without modification.
Making that possible for our plasma physics simulations
was a design requirement from the first day of the Mac cluster at UCLA Physics.
We highly recommend that users start with
MacMPI_X if possible because of its long history of stability and reliability,
its flexibility in compiling under different programming environments, and its extremely helpful
visualization tools.
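For reference, here is a minimal MPI program in C. This is our own sketch using only standard MPI calls, so it should compile against MacMPI_X or any of the other supported MPIs:

    /* hello_mpi.c -- minimal MPI program; uses only standard MPI calls,
       so it is not specific to any one MPI implementation. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start up MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this task's ID */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total task count */

        printf("Hello from task %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut down MPI */
        return 0;
    }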
- Where can I get started learning about MPI and writing parallel code?
We provide several links to help you with that.
The Parallelization page
provides an overview on the issues involved with designing and writing parallel code.
You can then view the Parallel Knock,
Parallel Adder, and
Parallel Pascal's Triangle
tutorials to get an idea of how to write parallel code or convert
existing single-processor code into parallel code.
While you are viewing the tutorials, you can use components of the
Pooch Software Development Kit
or follow the instructions on the
Compiling MPI page
to compile and write parallel code yourself.
We also link to other references in those tutorial pages.
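To give the flavor of those tutorials, here is our own hedged sketch (not the tutorials' actual code) of the Parallel Adder pattern: each task sums its share of the terms, then a single MPI_Reduce combines the partial sums:

    /* adder_sketch.c -- each task sums a strided share of the terms,
       then MPI_Reduce collects the partial sums onto task 0. */
    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000L   /* illustrative number of terms */

    int main(int argc, char *argv[])
    {
        int rank, size;
        long i;
        double partial = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each task takes every size-th term, starting at its rank */
        for (i = rank; i < N; i += size)
            partial += 1.0 / (double)(i + 1);

        /* combine every task's partial sum into 'total' on task 0 */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", total);

        MPI_Finalize();
        return 0;
    }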
- What is the difference between Pooch and Pooch Pro?
Pooch is the shortest path to practical parallel computing, while
Pooch Pro addresses the needs of those administering larger clusters
for a large number of users.
Pooch Pro was created because we recognized the need to manage
a cluster's compute time for many users.
At the same time, we saw that such features are unnecessary complications
for someone who just wants to get their cluster up, running, and working.
This sort of bifurcation is not unusual in the industry.
It is similar to the difference between, say,
Final Cut Express and
Final Cut Pro.
Pooch Pro has features that some users of Pooch simply don't need.
If you are a single user of a cluster
or one of a small number of users who can share their cluster, then Pooch
is for you.
If you administer a cluster for many users, then you should consider Pooch Pro.
- Can I take advantage of Macs with multiple processors using Pooch?
Yes.
Using Pooch, you can launch a parallel job
taking advantage of all processors in your system using just MPI.
Pooch will launch as many instances of
the executable as there are processors on the included nodes and supply the appropriate information to the MPI library.
This behavior, the default setting of the current Pooch, can be overridden using the Tasks per Computer menu in the
Options... pop-up in the Job Window.
From a programming point of view, you can simply use the task count that MPI
reports to your code at run time.
It just so happens that two MPI tasks are running on each dual-processor node.
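As a sketch of that point of view (our own illustration, with a hypothetical problem size N), the code asks MPI for the task count and partitions accordingly, without caring how many tasks share a node:

    /* decomp_sketch.c -- block decomposition driven only by the task
       count MPI reports; the code never needs to know that two tasks
       happen to share each dual-processor node. */
    #include <stdio.h>
    #include <mpi.h>

    #define N 1024L   /* hypothetical problem size */

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long chunk = (N + size - 1) / size;   /* ceiling division */
        long start = rank * chunk;
        long end   = (start + chunk < N) ? start + chunk : N;

        printf("task %d of %d handles items %ld..%ld\n",
               rank, size, start, end - 1);

        MPI_Finalize();
        return 0;
    }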
- Can I take advantage of Macs with different hardware or unequal processing speeds?
Yes, the machines need not be identical to run Pooch or parallel codes.
Each machine must meet whatever minimum requirements (RAM and so forth)
the particular parallel app needs.
For best overall performance, however, the parallel app would need to be able to
adjust to differences in the processing performance of individual nodes.
This behavior is sometimes called "load balancing".
It is not always easy to implement, so not all parallel apps are written to
perform the additional overhead to balance their work.
The Power Fractal app, for example, does not make any
adjustments for different processor speeds, so it performs best when the
nodes are identical. The Fresnel Diffraction Explorer, however, does adjust
its workload depending on individual node performance.
Pooch supports both
categories of applications, but the parallel application has the last word in
most efficiently utilizing the hardware.
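One common way to implement load balancing is self-scheduling: a master task deals out work units on demand, so faster nodes simply ask for more. The sketch below is a generic example of that pattern, not the Fresnel Diffraction Explorer's actual code:

    /* balance_sketch.c -- self-scheduling load balance: task 0 hands
       out work units on request, so faster workers get more units. */
    #include <mpi.h>

    #define NUNITS   100   /* illustrative number of work units */
    #define TAG_WORK 1
    #define TAG_DONE 2

    static void do_unit(int u) { (void)u; /* stand-in for real work */ }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {               /* master: deal out the units */
            int next = 0, active = size - 1, msg;
            MPI_Status st;
            while (active > 0) {
                MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE,
                         MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (next < NUNITS) {   /* more work: send the next unit */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE,
                             TAG_WORK, MPI_COMM_WORLD);
                    next++;
                } else {               /* no more: tell this worker to stop */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE,
                             TAG_DONE, MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                       /* worker: ask, compute, repeat */
            int unit = -1;
            MPI_Status st;
            for (;;) {
                MPI_Send(&unit, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
                MPI_Recv(&unit, 1, MPI_INT, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_DONE) break;
                do_unit(unit);
            }
        }
        MPI_Finalize();
        return 0;
    }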
- Can I combine shared-memory multiprocessing and vector processing (e.g., AltiVec or Velocity Engine) with Pooch?
Yes, those can be combined.
One can think of shared-memory multiprocessing (MP), vector processing,
and distributed-memory MPI as three "orthogonal axes" of parallelization.
Vectorization can be accomplished directly using a compiler that supports the AltiVec macro instructions
(such as Metrowerks CodeWarrior or gcc)
or indirectly through a library (like Absoft's BLAS implementation).
Apple's Multiprocessing libraries are available to distribute work between processors inside a box.
One combines the three by partitioning work at the highest level using distributed-memory MPI,
then subdividing the work within each MPI task between the processors using shared-memory MP.
Going further, one vectorizes the inner loops within those routines.
A demonstration of all three is present in the Power Fractal app.
By default, a single instance of that Fractal code will subdivide the work between the processors using Apple's MP.
(Toggled using "Turn Multiprocessing Code On/Off" under the File menu.)
When it's launched via Pooch, the Fractal code uses MPI to coordinate the work between physical nodes,
but for each partition of the fractal, one copy of the code will distribute the work between the processors using MP.
Within those subroutines, the work to compute four pixels is mapped to operations on a
four-element floating-point vector.
Dr. Viktor Decyk accomplished an earlier combination of MPI and MP using the plasma physics codes at UCLA.
The plasma code was a distributed-memory MPI code that ran on the supercomputers and other parallel computing platforms.
The code divides the plasma particles among the MPI tasks.
To take advantage of MP, he had each MPI task subdivide its loop over its portion of the particles
among the available processors. To assist with the work, he wrote MacMP, available from the
AppleSeed Multiprocessing Development page.
MacMP uses Apple's MP to take a subroutine and push its work onto another CPU.
In many cases, it's easier to do this using MacMP rather than calling Apple's MP library directly.
The only other caveat about shared-memory MP is that one has to be careful that the routine is "thread safe", meaning
that the subroutine you run won't step on memory that other
threads might need while it runs.
For a code example demonstrating how vector and parallel processing can be combined, see the
Parallel Pascal's Triangle tutorial.
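As a structural sketch of how the three axes nest (our own illustration, not the Power Fractal source; the thread layer is shown with POSIX threads rather than Apple's MP library or MacMP), consider adding two arrays:

    /* three_axes.c -- nesting the three "orthogonal axes": MPI across
       nodes, one extra thread for the second CPU, AltiVec four-wide
       in the inner loop.  Compile with AltiVec enabled (e.g. -faltivec). */
    #include <pthread.h>
    #include <mpi.h>
    #include <altivec.h>

    typedef struct { vector float *a, *b, *c; long nvec; } Slice;

    static void *add_slice(void *p)      /* innermost: 4 floats per op */
    {
        Slice *s = (Slice *)p;
        for (long i = 0; i < s->nvec; i++)
            s->c[i] = vec_add(s->a[i], s->b[i]);
        return NULL;
    }

    /* middle axis: split this node's partition between two CPUs */
    static void add_on_node(vector float *a, vector float *b,
                            vector float *c, long nvec)
    {
        pthread_t t;
        long half = nvec / 2;
        Slice lo = { a, b, c, half };
        Slice hi = { a + half, b + half, c + half, nvec - half };
        pthread_create(&t, NULL, add_slice, &hi);
        add_slice(&lo);                  /* this CPU does the other half */
        pthread_join(t, NULL);
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);          /* outer axis: MPI across nodes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* ...each MPI task would call add_on_node() on its own
           rank-determined partition of the arrays... */
        MPI_Finalize();
        return 0;
    }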
- Can I combine Grand Central Dispatch with Pooch?
Grand Central Dispatch (GCD) is an easier way to do multithreading
than OpenMP, Apple's previous multithreading APIs, or POSIX threads.
The benefit is that, provided a software writer can break their code
down into separate, smaller, data-independent tasks, GCD supplies the
intelligence to get those tasks done on many cores.
GCD is complementary to Pooch. Think of it as an "orthogonal
axis" of parallelization (see above). Pooch and MPI can be used to parallelize
across boxes, like an outer loop, while GCD could handle tasks within
a box, like the inner loops. This is just like how the Fractal app
works by parallelizing the inner loops using vectorization but across
boxes using MPI.
You could also use just Pooch and MPI to parallelize across nodes
and cores just fine and only worry about one API for parallel code.
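A minimal sketch of that division of labor (our own illustration of typical usage, not a Pooch requirement): MPI splits the outer loop across boxes, and dispatch_apply fans each box's share out across its cores:

    /* gcd_mpi.c -- MPI across boxes, GCD across the cores within each
       box.  Requires a blocks-capable compiler on Mac OS X 10.6 or later. */
    #include <stdio.h>
    #include <dispatch/dispatch.h>
    #include <mpi.h>

    #define N 1024L   /* hypothetical number of data-independent items */

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long chunk = (N + size - 1) / size;      /* this box's share */
        long start = rank * chunk;
        long count = (start < N) ? ((start + chunk <= N) ? chunk
                                                         : N - start) : 0;

        /* inner loops: GCD spreads the iterations across this box's cores */
        dispatch_apply((size_t)count,
                       dispatch_get_global_queue(
                           DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                       ^(size_t i) {
                           long item = start + (long)i;
                           (void)item;   /* ...work on 'item' here... */
                       });

        MPI_Finalize();
        return 0;
    }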
- Can I run the ABC application made by XYZ corporation in parallel?
If you know an application you'd like to see parallelized,
we encourage you to suggest the idea to the developers of that app.
In almost all cases, it is technically possible to parallelize a
code to take advantage of cluster computing.
We would consider the problem no more difficult than coding for dual processors within a box, but
the benefits can be much greater than a 2x boost.
Actually, the most difficult problem is convincing the developers to parallelize their source code.
We encourage you to contact your app's developers to convince them that there is a demand for such capabilities.
We are willing to help with the parallelization process.
You may certainly refer them to this web site for information and inspiration.
- Why not run applications in parallel at a "kernel level"?
We understand how desirable such a solution would be, but,
for the foreseeable future, the answer is: no, it is not practical.
There are two ways to explain why:
1. The high-level answer: Getting a typical code that has not been parallelized
to run in parallel, and run well, is very difficult.
It has been tried in scientific computing for over a decade.
After considerable work, special "autoparallelizing" compilers have taken
non-parallel Fortran and C source,
attempted to recognize independent work, and run the result on parallel computers,
all while getting
the same answer as the original non-parallel code.
(In principle the technology could be applied to PowerPC assembly,
but that is probably even more difficult.)
Such codes did run, and they produced the right answer (very important in
science, but not so easy to do), but they did not run well. A typical code
achieved only 10-20% parallelism; that is, if you double the number of
processors, it found the answer only 10-20% faster, which is nowhere near double performance.
That low performance makes
such a solution impractical. For comparison, when we hand-parallelize a
code, we see 80-90% parallelism.
2. The low-level answer: Suppose you did somehow write a low-level code or
kernel extension or some tool that, while the application was running,
watched the raw instructions of an
application and recognized pieces that could be partitioned off into
independent sections.
And let's suppose it could recognize which pieces of
memory those instructions needed and where they output their results.
And let's suppose that it could
somehow figure out how to reassemble everything in memory afterward.
And let's
assume that this process of recognition, partitioning, and reassembly took
zero time. What would happen?
Typical loops and sections of independent code in a typical app are on the
order of 10s to 1000s of instructions long. The modern PowerPC tries to
push them through at a rate of once per cycle. Assuming a 1 GHz clock rate,
this piece of code might take less than a microsecond to complete. And it
may be pushing data in and out of memory at, in round numbers, 100 MB/sec.
In 1 microsecond,
that's about 100 bytes.
Each PowerPC instruction is four bytes long, so the size of the instructions
plus data is 4 * 1000 + 100 = 4100 bytes, which is about 4 kB.
You'd have to
send this little parcel of instructions and data out to another computer
over Ethernet, run it, then receive it back. On Gigabit Ethernet, we're
seeing over 40 MB/sec throughput, so a 4 kB message would take about 100
microseconds to send. But at that small size, latency, the additional
overhead time it takes to send any message at all, would dominate, which
we've seen is about 300 microseconds. So send time is 100 + 300 = 400
microseconds, then the compute time is 1 microsecond, then the time to send
the output back is again dominated by latency, about 300 microseconds.
So the total time to send out this parcel of instructions and data, compute
it, and send it back is: 400 + 1 + 300 = about 701 microseconds. This is a
piece of code that, if computed locally, would take about 1 microsecond.
The point is that chopping up a code at such a low level would be dominated
by communications time, effectively slowing down the overall computation.
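That estimate is simple enough to reproduce; here is a tiny sketch using the same round numbers as above:

    /* parcel_cost.c -- the back-of-envelope estimate above, in code. */
    #include <stdio.h>

    int main(void)
    {
        double parcel    = 1000.0 * 4 + 100;  /* ~1000 instructions x 4 B
                                                 plus ~100 B of data */
        double bandwidth = 40e6;              /* ~40 MB/sec on Gigabit Ethernet */
        double latency   = 300e-6;            /* ~300 us per message */
        double compute   = 1e-6;              /* ~1 us to run it locally */

        double send  = parcel / bandwidth + latency;   /* ~400 us out */
        double total = send + compute + latency;       /* return trip is
                                                          latency-dominated too */
        printf("remote: ~%.0f us   local: ~%.0f us\n",
               total * 1e6, compute * 1e6);   /* ~700 us vs ~1 us */
        return 0;
    }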
(Well, perhaps you could design a 700-node system that would compute each
piece, one after the other, and after about 700 microseconds, all the pieces
would be done. But that too is impractical, because: 1. you would probably need
hundreds of network cables from the one machine to the others to prevent network congestion;
and 2. based on
experience at Sandia National Laboratories, Lawrence Livermore National Laboratory,
and the Ohio Supercomputing Center, that
many nodes easily get out of step with each other, besides the fact
that depending on 1000s of nodes quickly becomes unreliable (MTBF is around
8 hours) without extraordinarily clever management.)
We would have to see a tremendous breakthrough in communications
performance to make this approach practical.
The required level would be messages reliably getting
from one node to another in less than a microsecond. Those kinds of speeds are seen
within computers, moving pieces of data from RAM, to the bus, to the processor, but
not between typical computers.
Only the most advanced Cray T3-series
parallel computers reached this regime, but their "network" cost $20,000 per node.
Again, that price is impractical for most users.
And neither Ethernet nor
FireWire nor any other network technology seems close to
reliably delivering that
level (a more than 100x improvement in both bandwidth and latency)
until maybe a decade from now at the earliest.
Remember that processor speeds will probably increase in the meantime,
raising the bar further.
So, we find we must conclude that such a low-level parallelizer,
while technically not impossible, would be impractical because of how the
communications time would dominate everything else. One could build such a
tool, but few would actually use it because it wouldn't make the typical
application faster. A fundamental
and dramatic shift in computer technology would be
required to change that conclusion.
When we design a parallel code, we try to parallelize at a much higher
level. We get the best performance by having our code compute on the order of a second
at a time in between sending possibly megabytes of data at a time, working around these latency
problems, although such parameters can vary widely.
In any case, this approach to parallelization requires a degree of intelligence to
recognize these high-level pieces, or to form high-level pieces out of many
low-level ones, and to organize them,
but the important thing is: it works.
- What is the nature of Pooch security?
Pooch is its own lock and key. You should keep track of your Pooch like you keep
track of your keys.
Before Pooch will accept commands from another Pooch, it must receive a
passcode that matches its own. Then, all subsequent commands use a 512-bit
encryption key that rotates for each message in a pseudo-random manner. Only those
two Pooches can predict the next encryption and decryption keys. If a mistake in the
passcode or commands is made at any time, Pooch will reject the connection. Since
Pooch waits a second or two before it accepts another connection, an exhaustive search
for the correct encryption keys (trying 2^512 possibilities at one per second would take over 10^145
years) is extraordinarily unlikely to succeed.
The first passcode and the start of the rotating key are 512-bit pseudo-random
numbers derived from the registration name of that Pooch (which is set at compile
time). Therefore, only Pooches of the same registration will be able to communicate
with one another. Because the registration name is unique for each Pooch customer, a
copy of Pooch registered to, say, MIT, will not be able to communicate with a Pooch
registered to UCSD. (For cross-registered Pooches or other customized configurations
or encryption methods or implementations, please email.)
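As a toy illustration of the general rotating-key idea only (emphatically not Pooch's actual algorithm, key size, or cipher), both ends can advance a pseudo-random chain from a shared seed, so each predicts the other's next key while an outsider cannot:

    /* rotkey_toy.c -- toy rotating-key chain from a shared seed.
       NOT Pooch's actual algorithm; a 64-bit xorshift stands in for
       a 512-bit pseudo-random rotation purely for illustration. */
    #include <stdio.h>
    #include <stdint.h>

    static uint64_t next_key(uint64_t k)   /* toy mixing step (xorshift64) */
    {
        k ^= k << 13;  k ^= k >> 7;  k ^= k << 17;
        return k;
    }

    int main(void)
    {
        /* hypothetical shared seed, e.g. derived from a registration name */
        uint64_t seed = 0x9E3779B97F4A7C15ULL;
        uint64_t alice = seed, bob = seed;

        for (int msg = 0; msg < 3; msg++) {
            alice = next_key(alice);       /* sender rotates its key...   */
            bob   = next_key(bob);         /* ...receiver rotates in step */
            printf("message %d: keys %s\n", msg,
                   alice == bob ? "match" : "mismatch -> reject");
        }
        return 0;
    }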
Security for your cluster then becomes dependent on the security of your Pooch
registered with your registration name. Your Pooch can be installed on the Macs of your
cluster, and, if no additional copies of Pooch exist, no one can get in. But if you make a
copy of that Pooch and bring it home to access the cluster, the security of that cluster
depends on how securely you keep that extra copy of Pooch.
This approach is also known as an "administrative domain".
The nature of this security is analogous to having the ability to copy a key to a
locked office. It is not uncommon to entrust a group of people with the keys to a shared
resource, such as office equipment. The security of the equipment is shared by those
who have copies of the key. These people understand the responsibility that comes with
the privilege for that access. Access to Pooch can be shared in a similar way.
If you are using the downloadable demonstration version of Pooch, you should be
aware that the same version can be downloaded by anyone else on the web. So, if they
have the IP address of your Mac, they could access your Mac, to the extent that Pooch
allows, over the Internet. Although it is unlikely that someone would guess your Mac's
IP address, a uniquely registered Pooch makes for much better security.
- How do I uninstall Pooch completely?
You may remove Pooch and all its components by allowing Pooch to run normally, then
holding down the Option key after launching the Pooch Installer. In the Pooch Installer,
a dialog should appear that allows you to select either to upgrade or uninstall Pooch.
Clicking on uninstall will delete the running Pooch and the components that allow it to
run at logout. The latter process may require administrative authorization.
Note:
If your download version of Pooch expired, first download a current version and
reinstall it to overwrite the expired copy;
the uninstall function needs to detect a running Pooch to know where its components are.
Then you can uninstall it with the above procedure.
- Does your software work with the Power Mac G5 and Panther?
Yes, we are using our software with the Xserve G5, Power Mac G5, Mac OS X 10.3, and their predecessors.
Pooch has combined Power Macs, PowerBooks, and Xserves, from the 604e to the G5, running OS 9 through 10.3.2.
We have seen no problems using the new hardware and the new OS to run Pooch, MacMPI, and the other software on our site.
- What is the basic difference between Pooch and Xgrid?
A user has said, "Pooch is Xgrid on steroids!" We couldn't agree more.
With Pooch you can do what Xgrid can do, and much more.
Pooch handles all major types of computing involving clusters.
The kind of parallel computing that Xgrid focuses on is only a subset of what Pooch can address.
The difference is that Xgrid handles problems requiring little communication and where centralized
coordination is adequate.
Pooch can handle that type of parallel computing plus the more demanding,
tightly coupled problems clusters are good for.
Pooch handles cluster jobs, including grid-style jobs, while Xgrid is suitable for grid-style jobs only.
See more details at the Parallel Zoology page on the differences between these types of jobs.
Clusters using Pooch are supercomputer-compatible.
Pooch supports compliance with parallel computing standards.
Pooch handles jobs that use MPI, the dominant programming interface used on
clusters and supercomputers worldwide, while Xgrid does not.
Pooch builds on the lessons already learned in scientific computing coinciding with MPI's wide adoption.
Using Xgrid requires Xgrid-specific code to operate, making the code written for Xgrid non-portable.
MPI code written on a Mac cluster need not be Pooch-specific.
An MPI code run on a 4000-processor supercomputer runs via Pooch on Mac clusters with only a recompile.
We do that with MPI code all the time.
Pooch is more flexible with application code.
Xgrid uses a plug-in architecture to accept application code.
With Pooch, applications can stand alone or can choose to tap Pooch for cluster resources at run time.
This feature is demonstrated in the Parallel menu of the Fresnel Diffraction Explorer.
In addition, while Xgrid requires Cocoa-based code, Pooch accepts all the executable types Mac OS X and Mac OS 9 have to offer:
Cocoa, Carbon, Mach-O, Classic, AppleScript, and Unix script.
And some of these can be compiled using a variety of Fortran and C compilers.
See the Pooch SDK for details.
- What is the basic difference between Pooch and mpirun/mpich?
MPI (Message-Passing Interface) is an industry-standard programming interface, not a program,
meant for all forms of parallel computing.
Several different implementations of MPI exist.
mpich is one of them; it includes a launching utility, called mpirun, that can launch jobs.
But mpirun assumes that numerous non-intuitive settings, connections, and files,
such as NFS, rsh or ssh, and machine lists, are all correctly configured, organized, and operating.
Pooch requires only
the simplest settings of a modern computer, such as those needed to run a web browser.
But Pooch's capabilities extend far beyond those of mpirun.
Pooch serves as a queuing system, scheduler, cluster management utility,
graphical front end, and scripting interface, among many other functions.
These functions are far beyond mpirun's scope and would otherwise require
the user to integrate a host of easily incompatible command-line utilities.
Pooch provides all that functionality in one convenient, reliable package.
- What is the basic difference between Pooch and Beowulf?
To say that Pooch is an incremental improvement on Beowulf is like saying
the original 1984 Macintosh was an incremental improvement on the IBM PC.
Like the first computers to use graphical user interfaces,
Pooch resulted from a complete rethinking of how to build and operate a parallel computer.
We mean one that is designed, from the start, to be convenient, reliable, flexible, easy to use, and friendly,
and, therefore, powerful.
With Pooch, we reinvented the cluster computer.