To do this, Merlin built its own highly parallel analysis tools, which it runs on a high-performance Oracle RAC (Real Application Clusters) installation on a rack of Dell PowerEdge 1850 and 2850 dual-core Xeon servers. Data storage is provided by EMC CLARiiON 2Gbps and 4Gbps FC storage towers. Running on top of Oracle are Merlin’s HPC task-scheduling software, also created in-house, and an Oracle data mart that serves as a temporary holding area for frequently used data subsets, much like a cache. Most of the high-speed calculations run directly on the Oracle RAC, which is fronted by a series of BEA WebLogic app servers. These take in requests from a set of redundant load balancers sitting behind the company’s customer-facing Apache Web servers. Redundant firewalls sit in front of each of the three layers.
Cluster performance is key to running complex calculations in real time, but for Merlin, performance could never come at the expense of enterprise-level reliability, scalability, and 24/7 uptime, requirements that led to several crucial design decisions.
First, tightly coupled parallel processing via message passing was simply out of the question. Instead, Merlin’s architects and programmers put tremendous effort into dividing processes in an “embarrassingly parallel” fashion, without any interdependencies at all. This benefits both scalability and reliability: the high-speed, low-latency links that interprocess communication requires create scalability bottlenecks, and they depend on cutting-edge interconnects such as Myrinet and InfiniBand, which don’t have the reliability track record of Gigabit Ethernet.
“We didn’t want some new interconnect driver crashing the system,” Mohamed says, adding that straight Gigabit has also helped Merlin achieve considerable cost savings.
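The “embarrassingly parallel” approach described above can be illustrated with a minimal Java sketch. It is not Merlin’s code: the class name, the `meanPrice` calculation, and the sample data are all hypothetical, chosen only to show work units that read their own input, share no state, and never exchange messages with one another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IndependentTasks {

    // Hypothetical unit of work: the mean of one price series.
    // It reads only its own input and shares no state with other tasks.
    static double meanPrice(double[] prices) {
        double sum = 0.0;
        for (double p : prices) {
            sum += p;
        }
        return prices.length == 0 ? 0.0 : sum / prices.length;
    }

    // Runs one task per series on a fixed-size thread pool.
    // invokeAll returns futures in the same order as the submitted tasks,
    // so results line up with the input list.
    static List<Double> runAll(List<double[]> seriesList, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Double>> tasks = new ArrayList<>();
            for (double[] series : seriesList) {
                tasks.add(() -> meanPrice(series));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Illustrative sample data only.
        List<double[]> input = new ArrayList<>();
        input.add(new double[] {10.0, 12.0, 14.0});
        input.add(new double[] {100.0, 102.0});
        System.out.println(runAll(input, 2));
    }
}
```

Because no task waits on another, adding workers (or, in a cluster, nodes) raises throughput without requiring a low-latency interconnect between them, and a failed task can simply be rerun on its own.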
Reliability and enterprise-grade support fueled Merlin’s decision to stick with an Oracle RAC, which has high-quality fault-tolerant fail-over features; dual-processor Dell PowerEdge servers; high-end EMC CLARiiON FC storage; and F5 load balancers.
“There are lots of funky platforms for HPC out there and high-bandwidth data storage solutions that can pump data at amazing rates,” Mettke says. “The problem is that you end up dealing with lots of different vendors, some of whom can’t deliver the 24/7 enterprise-level support you need. That adds another element of risk.”
Finally, all code was written using Java, C++, and SQL.
“I’ve been on the other end running code written in Assembler on thousands of nodes,” Mettke says. “We want the speed, but not at the expense of system crashes in the middle of a trading day. You can claim you have the best cluster out there, but it doesn’t matter if there’s no show when it’s showtime.”
Mettke adds that the architecture of Merlin’s HPC infrastructure is constantly evolving to accommodate new data and applications.