Here’s a collection of highlights, selected totally subjectively, from this week’s enterprise HPC news stream as reported at insideHPC.com.
Link and run
Evergrid launches new job management tools, partners with Platform
Evergrid provides transparent fault tolerance using an abstraction layer that loads between the operating system (OS) and the application. Without modifying either the application or the OS, Evergrid's CAMS/AvS software periodically captures the collective state of the application across the entire infrastructure while the application continues processing. Because it records the application's state together with the OS and system state, Evergrid can checkpoint an application and resume it after a failure or interruption quickly and with minimal overhead. Even the failure of multiple servers or software systems does not prevent an application from resuming from a checkpoint.
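For readers who haven't worked with checkpointing, here is a minimal sketch of the basic checkpoint-and-resume idea in Python. It is an application-level toy, not Evergrid's mechanism: the function names, the pickle file format, and the fail_at knob are all assumptions made up for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # a crash mid-write never corrupts the old checkpoint


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at simulates a crash at that step (hypothetical knob for the demo).
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If a run dies at step 25, the next call to run_job picks up from the checkpoint written at step 20 and redoes only the last few steps. Here the application has to be written to save and restore its own state; the appeal of the transparent approach described above is that it needs no such changes to the application.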
Checkpointing like this is obviously good for long-running user applications, but it also gives data center operators something very powerful: preemptive scheduling. Because jobs can be suspended and returned to execution on command, centers can be a lot more creative with batch scheduling policies without sacrificing high utilization numbers.
This is great news for enterprise users who want bigger machines, especially as virtualization continues to consolidate services onto single physical boxes, but who are plagued by the instabilities that come with those machines.
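To make the preemptive scheduling point concrete, here is a toy single-node simulation in Python. It is only a sketch under stated assumptions: the Job class, the schedule function, and the priority rule are hypothetical and are not drawn from Evergrid's or Platform's products. The done counter stands in for a checkpoint, so a suspended job loses no work.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int   # higher runs first
    work: int       # total ticks of work needed
    done: int = 0   # progress kept across suspensions (stands in for a checkpoint)


def schedule(arrivals, ticks):
    """Simulate one node. arrivals maps tick -> list of Jobs arriving at that tick.

    Returns the name of the job run at each tick (None when the node is idle).
    """
    waiting, running, timeline = [], None, []
    for t in range(ticks):
        waiting.extend(arrivals.get(t, []))
        best = max(waiting, key=lambda j: j.priority, default=None)
        if best is not None and (running is None or best.priority > running.priority):
            if running is not None:
                waiting.append(running)  # suspend: progress stays in running.done
            waiting.remove(best)
            running = best
        if running is None:
            timeline.append(None)
            continue
        running.done += 1
        timeline.append(running.name)
        if running.done == running.work:
            running = None  # job finished, node is free again
    return timeline


if __name__ == "__main__":
    arrivals = {0: [Job("batch", 1, 4)], 2: [Job("urgent", 5, 2)]}
    print(schedule(arrivals, 7))
    # ['batch', 'batch', 'urgent', 'urgent', 'batch', 'batch', None]
```

The urgent job takes over the node as soon as it arrives, and the batch job later resumes from tick 2 of its work rather than starting over, so the node stays busy the whole time there is work to do.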