With an eye toward helping tomorrow's data-deluged organizations, IBM researchers have created a super-fast storage system capable of scanning 10 billion files in 43 minutes.
This system handily bested their previous system, demonstrated at Supercomputing 2007, which scanned 1 billion files in three hours.
Key to the increased performance was the use of speedy flash memory to store the metadata that the storage system uses to locate requested information. Traditionally, metadata repositories reside on disk, and disk access slows these operations.
"If we have that data on very fast storage, then we can do those operations much more quickly," said Bruce Hillsberg, director of storage systems at IBM Research Almaden, where the cluster was built. "Being able to use solid-state storage for metadata operations really allows us to do some of these management tasks more quickly than we could ever do if it was all on disk."
IBM foresees that its customers will be grappling with a lot more information in the years to come.
"As customers have to store and process large amounts of data for large periods of time, they will need efficient ways of managing that data," Hillsberg said.
For the new demonstration, IBM built a cluster of 10 eight-core servers equipped with a total of 6.8 terabytes of solid-state memory, using four Violin Memory 3205 solid-state storage systems. The resulting system was able to read files at a rate of almost 5 GB/s (gigabytes per second).
The system used a tuned version of IBM's GPFS (General Parallel File System), version 3.4. Originally developed for high-performance computing systems, GPFS is becoming increasingly relevant for other data-heavy enterprise workloads as well, Hillsberg said. GPFS allows all the processor cores to write to and from disks in parallel, which can significantly improve storage system responsiveness.
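The parallel access pattern described above, with every core scanning its own slice of the namespace, can be sketched in a few lines. This is an illustration using Python threads against a local filesystem, not GPFS itself; the chunking scheme and the nonzero-size filter are hypothetical choices for the example.

```python
import concurrent.futures
import os

def scan_chunk(paths):
    """Worker: stat each path in its chunk and return nonempty files.
    (A stand-in for a per-core metadata scan; GPFS does this against
    shared parallel storage rather than the local filesystem.)"""
    results = []
    for p in paths:
        try:
            if os.stat(p).st_size > 0:
                results.append(p)
        except OSError:
            pass  # skip files that vanish mid-scan
    return results

def parallel_scan(all_paths, workers=8):
    """Split the namespace into one chunk per worker so every worker
    scans independently, mirroring the parallel access idea."""
    chunks = [all_paths[i::workers] for i in range(workers)]
    matched = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as ex:
        for part in ex.map(scan_chunk, chunks):
            matched.extend(part)
    return matched
```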
Today's file systems are not well-suited for managing data across multiple storage systems as a single namespace, Hillsberg explained. The 2007 demonstration showed how a parallel file system such as GPFS could be used as the basis for highly scalable storage systems. The new work demonstrates how such a system could be improved even more with the addition of solid-state disks.
The researchers posted a white paper that describes the architecture in a level of detail that could help other parties recreate similar systems.
IBM may also fold the ideas into its own products, Hillsberg said. Earlier IBM research work in building experimental solid-state systems has led to the creation of new software, such as IBM Easy Tier, which helps systems automatically balance data between solid-state disks and regular disks.
"I think you will see some really interesting things come out of this research," Hillsberg said of the demonstration.
IBM is not alone in its enthusiasm for using solid-state storage as a way to speed operations. In this month's Association for Computing Machinery "Communications" publication, a group of researchers from Carnegie Mellon University and Intel Labs described a server architecture that combines low-power processors and flash memory, a design that could significantly speed operations for transaction-heavy large websites.
Like IBM's setup, the researchers' FAWN (Fast Array of Wimpy Nodes) architecture requires only a relatively small amount of flash memory, on which the most frequently consulted data can be stored. They noted that while solid-state storage can cost 10 times as much as traditional disks, it can deliver a 100 percent performance boost.
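The idea of keeping only the most frequently consulted data on a small flash tier can be sketched as a two-tier key-value store. The class, tier sizes, and eviction policy below are illustrative assumptions, not the FAWN implementation.

```python
from collections import OrderedDict

class TieredStore:
    """Toy two-tier store: a small 'flash' tier holds the most
    recently used items; everything else falls back to a slow
    'disk' tier. Loosely inspired by the hot-data-on-flash idea."""

    def __init__(self, disk: dict, flash_capacity: int):
        self.disk = disk            # stand-in for slow bulk storage
        self.flash = OrderedDict()  # stand-in for the flash tier
        self.capacity = flash_capacity
        self.flash_hits = 0
        self.disk_reads = 0

    def get(self, key):
        if key in self.flash:
            self.flash_hits += 1
            self.flash.move_to_end(key)  # mark as recently used
            return self.flash[key]
        self.disk_reads += 1
        value = self.disk[key]          # slow path
        self.flash[key] = value         # promote hot data to flash
        if len(self.flash) > self.capacity:
            self.flash.popitem(last=False)  # evict least recently used
        return value
```

With a skewed access pattern, most reads hit the small flash tier, which is why a modest amount of expensive flash can pay for itself.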
The idea of building flash-memory-assisted servers "is not that far out. The technology already exists," said Luiz André Barroso, a Google distinguished engineer who was not involved in FAWN.