Sonic’s ESB takes new approach to fail-over

Bill Cullen

If the SOA movement had an official flag, on that flag would be a diagram of an ESB (enterprise service bus) — an open and distributed integration platform that provides interfaces to a wide variety of systems and applications and ensures reliable messaging among them. And if you dotted the flag with the logos of leading SOA vendors, Sonic Software’s would surely have to stand out from the rest.

As vice president of engineering at Sonic Software, Bill Cullen has led the development of the Sonic ESB and its CAA (Continuous Availability Architecture), a fail-over mechanism that not only guarantees message delivery, but reduces recovery time to mere seconds, all on inexpensive hardware.

As with other fail-over schemes, the CAA relies on writing messages to disk. After a hardware, software, or network failure, it determines where things left off, retrieves the undelivered messages, and sends them on their way. The difference is how CAA goes about the recovery process. Here Cullen and his team took their cue from Sonic’s database-style log file, which records everything in one file as a series of events.
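The log-replay idea can be pictured with a small sketch. This is not Sonic's code: the JSON-lines format, the event types `enqueued` and `delivered`, and the function names are all assumptions made for illustration. It only shows how a single append-only event log is enough to work out which messages were still undelivered at the time of a failure.

```python
import json
import os
import tempfile


def append_event(log_path, event):
    """Append one event as a JSON line and force it to disk (hypothetical format)."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")
        log.flush()
        os.fsync(log.fileno())


def replay(log_path):
    """Rebuild the set of undelivered messages by replaying the log in order."""
    pending = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            event = json.loads(line)
            if event["type"] == "enqueued":
                pending[event["id"]] = event["body"]
            elif event["type"] == "delivered":
                pending.pop(event["id"], None)
    return pending


def demo():
    """Write three events, then recover state from the log alone."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "broker.log")
        append_event(log_path, {"type": "enqueued", "id": 1, "body": "order A"})
        append_event(log_path, {"type": "enqueued", "id": 2, "body": "order B"})
        append_event(log_path, {"type": "delivered", "id": 1})
        # Simulated crash: in-memory state is lost, only the log survives.
        return replay(log_path)
```

In this sketch, calling `demo()` leaves only message 2 pending, because message 1 was logged as delivered before the simulated crash.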

“I think the eureka moment was realizing that if you can [replay the logs to recover state] 10 minutes after a failure, then you can be doing it constantly, while you’re running, on another machine,” Cullen remembers. “So what we do is we have a backup system, in this case a messaging broker. It listens constantly to that event stream and keeps in sync with the messaging system, even though it’s not doing the messaging itself.”
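What Cullen describes amounts to a standby that applies the primary's events as they arrive, rather than replaying them after a crash. The sketch below is a toy, in-process illustration under stated assumptions: the classes `Primary` and `Standby`, the event types, and the `deque` standing in for a network event stream are all hypothetical and do not reflect Sonic's actual design.

```python
from collections import deque


class Broker:
    """Holds the messages that have been accepted but not yet delivered."""

    def __init__(self):
        self.pending = {}

    def apply(self, event):
        if event["type"] == "enqueued":
            self.pending[event["id"]] = event["body"]
        elif event["type"] == "delivered":
            self.pending.pop(event["id"], None)


class Primary(Broker):
    """Does the messaging and publishes every state change to the stream."""

    def __init__(self, stream):
        super().__init__()
        self.stream = stream

    def record(self, event):
        self.apply(event)
        self.stream.append(event)


class Standby(Broker):
    """Does no messaging itself; it only consumes the primary's event stream."""

    def __init__(self, stream):
        super().__init__()
        self.stream = stream

    def catch_up(self):
        while self.stream:
            self.apply(self.stream.popleft())


def demo_failover():
    stream = deque()  # stand-in for a network link between the two machines
    primary = Primary(stream)
    standby = Standby(stream)
    primary.record({"type": "enqueued", "id": 1, "body": "order A"})
    standby.catch_up()
    primary.record({"type": "enqueued", "id": 2, "body": "order B"})
    primary.record({"type": "delivered", "id": 1})
    standby.catch_up()
    # The primary fails here; the standby already holds the current state.
    return standby.pending
```

Because the standby has been applying events all along, taking over in this sketch requires no log replay: its `pending` map already matches the primary's at the moment of failure.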

And so it is that Sonic can provide a fast fail-over, invisible to users and applications, which doesn’t require specialized hardware or costly clustering software. “That was the goal: high-speed fail-over without having to create this massive hardware infrastructure to support it,” Cullen says. “We can do this on a cheap-o Linux box or something like that. We don’t even have to have the same hardware on either side.”

Sonic has extended the CAA’s protections to other parts of the system, bringing replication and fail-over to the directory service, document translation, XML processing, and other places where customers may require resiliency.

CAA is both lightweight and flexible: it can replicate everything, to support datacenter-to-datacenter fail-over, or apply its protections more granularly to individual parts of the system. You might call it a service-oriented approach to fail-over.