The group is currently assessing two Microsoft Windows Compute Cluster Server 2003 clusters, both of which have been in testing for several months. One test bed is a six-node cluster of HP ProLiant dual-core, dual-processor Opteron-based servers; the other is a 12-node mix of Opteron- and Woodcrest-based servers from Supermicro. Tools built on the Windows Workflow Foundation and Windows Communication Foundation components of Microsoft .Net 3.0 have enabled BAE engineers to create an efficient workflow environment in which they can collaborate during the design process and access relevant parts of the systems from their own customized views, with tools suited to their tasks.
If there’s anything that BAE has learned from its testing, it’s that little changes can have big performance implications.
“We’re running our clusters with a whole variety of interconnects, including Gigabit Ethernet, Quadrics, and a Voltaire InfiniBand switch,” Appa says. “We’ve also been running both Microsoft and HP versions of MPI [Message Passing Interface]. We’ve found that all these elements have different sweet spots and behave differently depending on the application.” In the long run, this testing will enable the technology and engineering services group to provide other BAE business units looking to implement HPC with their own tailored HPC “shopping lists.”
As for interfaces, “depending on the application, the size of your cluster (preferably small), and the types of switches you use, Gigabit Ethernet really isn’t that bad,” Appa says. His group has been using Gigabit switches from HP, which “for our purposes, are very good.”
Appa has also tested several compilers, and he cautions not to skimp on these tools: “A $100 compiler might make your code run 20 percent slower than a top-end compiler, so you end up having to pay for a machine that is 20 percent larger. Which is more expensive?”
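Appa’s compiler arithmetic can be made concrete with a short sketch. It takes his simplification at face value (code that runs 20 percent slower needs a machine that costs 20 percent more) and compares total outlay for the two choices. Only the $100 compiler price and the 20 percent figure come from his quote; the base machine price and the top-end compiler price below are hypothetical placeholders, not BAE figures.

```python
# Rough cost comparison following Appa's rule of thumb: a compiler that makes
# code N percent slower forces you to buy a machine that is N percent larger.

def total_cost(compiler_price, machine_price, slowdown_pct):
    """Compiler price plus a machine enlarged by slowdown_pct percent."""
    return compiler_price + machine_price * (100 + slowdown_pct) / 100


def break_even_premium(machine_price, slowdown_pct):
    """Largest extra compiler spend that still beats buying a bigger machine."""
    return machine_price * slowdown_pct / 100


BASE_MACHINE = 100_000   # hypothetical cluster price, for illustration only
CHEAP_COMPILER = 100     # the $100 compiler from the quote
TOP_COMPILER = 2_000     # hypothetical price for a top-end compiler
SLOWDOWN_PCT = 20        # the 20 percent slowdown from the quote

cheap_total = total_cost(CHEAP_COMPILER, BASE_MACHINE, SLOWDOWN_PCT)
top_total = total_cost(TOP_COMPILER, BASE_MACHINE, 0)

print(f"Cheap compiler, larger machine: ${cheap_total:,.0f}")
print(f"Top-end compiler, base machine: ${top_total:,.0f}")
print(f"Break-even compiler premium:    ${break_even_premium(BASE_MACHINE, SLOWDOWN_PCT):,.0f}")
```

Under these assumed prices, the better compiler pays for itself as long as it costs less than 20 percent of the machine, which is the point of Appa’s question.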
Each of Appa’s configurations sits on three networks: one for message passing, one for accessing the file system, and one for management and submitting jobs. To access networked storage, Appa uses iSCSI over Gigabit Ethernet rather than FC (Fibre Channel), and he runs a high-performance parallel file system built on open source Lustre object storage technology. Why? “As clusters get larger and you have more cores running processes that are all reading one file on your file system, your file system really needs to scale or you’ll be in trouble,” Appa explains.
Meanwhile, Windows Compute Cluster has simplified both cluster management and user training, which has the added benefit of freeing up staff for the more vital task of optimizing BAE apps. Although BAE’s software is already set up for HPC, Appa believes the whole process of parallelizing existing apps is reaching a turning point. “Our algorithms date back to the ’80s and do not make best use of multicore technologies,” he says. “We’re all going to have to reconsider how we write our algorithms or we’ll all suffer.”
Although each endeavor to bring HPC in-house will differ based on an enterprise’s clustering needs, BAE’s Appa has some sage advice for anyone considering the journey.
“You can’t assume that somebody will come along with a magic wand and give you the perfect HPC solution,” Appa says. “You really need to try everything out, especially if you have in-house code. There’s so much variation and change in HPC technology, and so much is code-dependent. You really have to understand the interaction between the hardware and software.”