Dauger Research, Inc., 128 Xserve G5 Cluster Running the AltiVec Fractal Benchmark Page

128 Xserve G5 Cluster Running the AltiVec Fractal Benchmark

Xserve cluster achieves 1.21 TeraFlop using 256 2-GHz G5s and demonstrates excellent potential scalability

[Figure: Dawson Xserve G5 Cluster Running the AltiVec Fractal Benchmark]

Hardware:

Nodes:	128 Dual-Processor Xserve G5/2GHz's
Network:	Cisco 6500 Series Gigabit Switch

Software:

Application:	AltiVec Fractal Carbon
Communications/Message-Passing Library:	MacMPI_X.c
Cluster Support and Management:	Pooch Pro
Operating System:	Mac OS X Server 10.3

Name:

The Dawson Cluster

Prime User and Purchaser:

UCLA Plasma Physics Group

Location and Management:

UCLA Academic Technology Services Group

Date:

November 8-14, 2004

The above figure illustrates the potential performance and scalability of clusters using Apple's new Xserve G5. The UCLA Plasma Physics Group recently acquired and assembled a cluster using 128 Xserve G5s. Using Pooch Pro, the group ran the AltiVec Fractal Carbon benchmark and achieved 1.21 trillion floating-point operations per second on this Xserve cluster. (This result is similar to the Linpack result that placed the Dawson cluster on the most recent Top 500 Supercomputer List.)

The Plasma Physics Group has extensive experience using other parallel computer and cluster types. This cluster, named in honor of the late Professor John M. Dawson, is in active use for plasma physics projects at UCLA and other projects in collaboration with the Stanford Linear Accelerator Center (SLAC). UCLA's Academic Technology Services manages, houses, and provides facilities for the Dawson cluster. Both MacMPI and LAM/MPI are used on this cluster. As of this writing, this is the largest, most powerful Xserve cluster built for physical science known to exist in academia.

Applications

The Dawson cluster is in active use for plasma physics research. Its primary applications are particle-based "Particle-In-Cell" (PIC) plasma codes:

QuickPIC - a 3-D parallel quasi-static PIC code, originally developed for plasma wakefield accelerator research, used by Warren Mori and his team to study beam-plasma interactions
osiris - a three-dimensional electromagnetic PIC code used to model plasma beam instabilities in the Stanford Linear Accelerator
P³arsec - a three-dimensional fully electromagnetic PIC code written by John Tonge as part of his doctoral work, used to investigate plasma confinement, Alfvén waves (see George Morales), and other topics
Quantum PIC - a code that utilizes classical paths to time-evolve interacting quantum wavefunctions based on Dean Dauger's doctoral work
Numerical Tokamak project - a suite of three-dimensional gyrokinetic PIC codes used to model large plasma confinement devices for fusion science by Jean-Noel Leboeuf and Rick Sydora
UCLA Parallel PIC - a framework of object-oriented components for the rapid construction of new parallel PIC codes, written by Viktor Decyk

These are parallel Fortran and Fortran 90 codes using MPI for portable message-passing.

About the Benchmark

The different colored lines indicate the fractal benchmark code operating on different problem sizes. As expected on any parallel computer running a particular problem type, larger problems scale better. The AltiVec Fractal Carbon demo uses fractal computations that are iterative in nature. For a portion of the fractal image, these iterations may continue ad infinitum; therefore, a maximum iteration count is imposed. In the AltiVec Fractal Carbon demo, this limit is specified using the Maximum Count setting. Increasing the Maximum Count setting to 16384, then 65536, and so on, increases the problem size. It was clear that, given sufficient problem size, the Xserve G5 cluster was able to achieve over a TeraFlop (1 TF = 1000 GF = one trillion floating-point calculations per second).
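To make the role of the Maximum Count setting concrete, here is a minimal sketch of an escape-time fractal iteration. It assumes a Mandelbrot-style recurrence (z <- z*z + c) with an escape radius of 2, which this page does not specify; the function name escape_count is illustrative and is not taken from the AltiVec Fractal Carbon source.

```python
def escape_count(c: complex, max_count: int) -> int:
    """Count iterations of z <- z*z + c before |z| exceeds 2.

    Assumption: a Mandelbrot-style recurrence. Points that never escape
    would iterate forever, so the loop is capped at max_count (the
    analogue of the Maximum Count setting).
    """
    z = 0j
    for n in range(max_count):
        if abs(z) > 2.0:
            return n          # escaped after n iterations
        z = z * z + c
    return max_count          # never escaped: cost grows with max_count


# A point that never escapes costs max_count iterations, so raising the
# cap raises the work; a point far outside escapes almost immediately.
inside = escape_count(0j, 16384)
outside = escape_count(2 + 2j, 16384)
```

In this sketch, only the pixels that never escape get more expensive as the cap rises, which is why raising the Maximum Count setting enlarges the problem without changing the image dimensions.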

The performance is determined by the total number of floating-point calculations performed that contribute to the answer and the time it takes to construct the answer. This time includes not only the time it takes to complete the computation, but also the time it takes to communicate the results to the screen on node 0 for the user to see. Also note that we quote the actual achieved performance, a practical measure of true performance while solving a problem, rather than the theoretical peak performance.
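The measurement described above can be sketched as a single ratio: useful floating-point operations divided by the total time to construct the answer. The function name and the sample numbers below are hypothetical and chosen only for illustration; they are not measurements from the Dawson cluster.

```python
def achieved_gflops(useful_flops: float, compute_s: float, comm_s: float) -> float:
    """Achieved rate in GF: useful floating-point operations divided by
    the total time to construct the answer (computation plus the time to
    communicate results to node 0)."""
    total_s = compute_s + comm_s
    if total_s <= 0:
        raise ValueError("total time must be positive")
    return useful_flops / total_s / 1e9


# Hypothetical numbers, for illustration only (not measurements):
rate = achieved_gflops(useful_flops=2.0e12, compute_s=1.5, comm_s=0.5)
compute_only = achieved_gflops(useful_flops=2.0e12, compute_s=1.5, comm_s=0.0)
```

Because communication time sits in the denominator, the achieved rate is always at or below the rate computed from the computation time alone.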

The time it takes to compute most of these fractals is roughly proportional to the Maximum Count setting, yet, since the number of pixels is the same, the communications time remains constant. For the smallest problem sizes on a large number of nodes, it was clear that communications time became greater than the computation time. By increasing the problem size significantly, the computation time was once again much greater than the communications time.
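The trade-off in the paragraph above can be illustrated with a toy timing model: computation time proportional to the Maximum Count setting and divided across nodes, plus a fixed communications time. The constants t_per_count and t_comm are invented placeholders, not values measured on the cluster, and the model ignores load imbalance.

```python
def parallel_efficiency(max_count: int, nodes: int,
                        t_per_count: float = 1.0e-3,
                        t_comm: float = 0.5) -> float:
    """Efficiency under a toy model (all constants are assumptions).

    Compute time scales with max_count and is split across nodes;
    communication time is fixed because the pixel count does not change.
    """
    t_one_node = t_per_count * max_count        # one node, no communication
    t_cluster = t_one_node / nodes + t_comm     # split work + fixed comm cost
    speedup = t_one_node / t_cluster
    return speedup / nodes


# Efficiency on 128 nodes for successively larger problem sizes.
for count in (4096, 16384, 65536, 262144):
    print(count, round(parallel_efficiency(count, nodes=128), 3))
```

In this model the efficiency climbs toward 1 as the Maximum Count grows, because the fixed communications cost becomes a smaller share of the total time.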

The dark "Ideal" line is an extrapolation multiplying the node count by the performance of one node alone. As shown in the graph, the cluster's performance while solving the larger problems closely approaches that "Ideal" extrapolation. From that observation, we find no evidence of an intrinsic limit to the size of a Mac cluster.
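The "Ideal" extrapolation and the comparison against it amount to simple arithmetic, sketched below. The 1.21 TF total and the counts of 128 nodes and 256 processors come from this page; the single-node rate of 10 GF is a hypothetical placeholder, since the page does not state the measured single-node value.

```python
def ideal_gflops(single_node_gflops: float, nodes: int) -> float:
    """The 'Ideal' line: one node's rate multiplied by the node count."""
    return single_node_gflops * nodes


def fraction_of_ideal(measured_gflops: float,
                      single_node_gflops: float, nodes: int) -> float:
    """How close a measured cluster rate comes to the 'Ideal' line."""
    return measured_gflops / ideal_gflops(single_node_gflops, nodes)


# From figures quoted on this page: 1.21 TF on 128 dual-processor nodes.
measured = 1210.0                 # GF
per_node = measured / 128         # average GF per node
per_cpu = measured / 256          # average GF per processor

# Hypothetical single-node rate (placeholder, not a measurement).
closeness = fraction_of_ideal(measured, single_node_gflops=10.0, nodes=128)
```

The per-node and per-processor averages follow directly from the quoted totals; the closeness value is only as meaningful as the placeholder single-node rate fed into it.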

Conclusion

After running a series of numerically-intensive trials on a 128-node Xserve G5 cluster, we were able to achieve over a TeraFlop on certain problems. These results were repeatable. No evidence of an intrinsic limit to the size of a Macintosh-based cluster could be found. Building on previous results using 76 Dual-Processor Power Macs at USC and using 33 Dual-Processor Xserve G4s at NASA's JPL, this finding confirms that Macintosh-based clusters are capable of excellent scalability in performance.

Acknowledgements

The above could not have been accomplished without the involvement of many people. Many thanks go to UCLA's Plasma Physics Group and Academic Technology Services. We also thank Tim Parker and Skip Cicchetti from Apple Computer, Inc., for facilitating the purchase of and providing direct assistance with the cluster.