IBM opened the doors to its Almaden Research Center this week to show what its scientists are working on, including some advanced technologies for storage and data analysis.
Located at the southern tip of Silicon Valley, Almaden claims to be the birthplace of the distributed relational database and the first data mining algorithms. Fiddling with bits and bytes to improve how they're stored and analyzed continues to be a focus, although the lab also works on areas like nanotechnology, spin physics, and human-computer interaction.
The projects on show this week included Panache, a file system for use across wide area networks; Sage, a tool for moving data to different storage tiers automatically; and Cobra, which helps companies figure out what people are saying about them in online forums.
Panache is a clustered file system that provides applications with high-speed access to a large, central pool of data even if the applications are far away, in data centers in different parts of the country or on different continents, for example.
"Customers are asking us to give them a way, when data is created at one site, to make it available in other geographically distributed locations, so that users at those locations can access the data as if it were local," said Bruce Hillsberg, director of the Storage Systems research group.
The file system uses advanced caching techniques to make sure the data at each location is kept consistent. It has push and pull characteristics that replicate changes efficiently across multiple nodes in a wide area network, so that conflicts don't arise when changes are made to the data caches at individual nodes.
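The push/pull behavior described above can be illustrated with a toy cache model. To be clear, this is a simplified sketch of write-back ("push") and revalidating ("pull") caching in general, not Panache's actual design or API; all class and method names are invented:

```python
# Toy model of the push/pull cache-consistency idea described above.
# All names here are hypothetical illustrations, not Panache's API.

class HomeSite:
    """The central data pool that every cache node synchronizes with."""
    def __init__(self):
        self.data = {}
        self.version = {}

    def write(self, key, value):
        self.data[key] = value
        self.version[key] = self.version.get(key, 0) + 1

class CacheNode:
    """A remote site that caches data locally and syncs lazily."""
    def __init__(self, home):
        self.home = home
        self.local = {}      # key -> cached value
        self.versions = {}   # key -> home version at last pull
        self.dirty = {}      # local writes not yet pushed home

    def read(self, key):
        if key in self.dirty:               # local writes win until pushed
            return self.dirty[key]
        # Pull: revalidate against the home site's version before serving.
        if self.versions.get(key) != self.home.version.get(key):
            self.local[key] = self.home.data.get(key)
            self.versions[key] = self.home.version.get(key, 0)
        return self.local[key]

    def write(self, key, value):
        # Writes land locally at local-disk speed and are pushed later.
        self.dirty[key] = value

    def push(self):
        # Push: replicate the queued changes back to the home site.
        for key, value in self.dirty.items():
            self.home.write(key, value)
        self.dirty.clear()
```

A node in this model can keep reading and writing its local cache while disconnected, then call `push()` when the link returns; other nodes pick up the change the next time their reads revalidate.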
IBM says it could have several uses. Engineers working on a project in different countries can access the same set of data and make changes to it locally without worrying about the cached versions getting out of sync.
It could also reduce the time it takes to replicate virtual machines between data centers, researchers here said. Applications running inside a virtual machine access data from a virtual LUN, typically stored as a file in the data center. When a new virtual machine is configured or restarted after a failure, the OS image and its virtual LUN have to be transferred between sites, causing delays before the application is ready for use.
Panache can maintain a cache of the OS and its virtual LUN at the remote site, so it's there when needed. IBM researchers say this would greatly reduce the time and complexity of configuring new virtual machines and moving them across a wide area network. It could also help companies to reduce data center costs. Instead of hosting 20,000 virtual machines in one large data center, the faster migration capabilities would provide the option of hosting the VMs across 20 smaller data centers.
Some large cluster file systems already exist, like IBM's GPFS (General Parallel File System) and Sun's Lustre, now maintained by Oracle. According to IBM, Panache stands out for its high degree of parallelism, which allows multiple nodes to read and write to their local data caches even when they are temporarily disconnected from the central pool.
"Panache is the first file system cache to exploit parallelism in every aspect of its design -- parallel applications can access and update the cache from multiple nodes while data and metadata is pulled into and pushed out of the cache in parallel," according to a paper describing the technology (PDF).
Panache builds on top of GPFS and also uses a proposed standard called pNFS (Parallel NFS), an extension of the widely used NFS (Network File System) protocol. Because it relies on standards, the nodes in a storage cluster can be based on other vendors' storage gear, though it seems likely IBM will sell a product that ties it all together.
Researchers didn't say when Panache will appear as a product, but judging from how they described it, the technology seems fairly complete.
Another storage management technology called Sage is being used internally by IBM Global Services and should be in product form fairly soon.
Sage is a tool for calculating the value of data over time and moving it to the appropriate tier of storage based on its value. The idea is to help companies get data onto the appropriate storage tier more quickly and easily, and thus reduce storage costs. A company might want to put frequently used data on high-performance Fibre Channel drives, for example, and less critical data on lower-cost SATA drives. Some data might need to be moved from one type of drive to another after a set period of time.
Once policies are applied by an administrator, Sage moves the data around automatically. It also lets IT staff run "what if" scenarios, to see what would happen to their storage environment if they set policies in a certain way. And the policies can take into account legal and compliance issues, such as not moving personal data across country borders in Europe.
The policies can be applied by storage volume -- choosing a volume associated with a particular application, for example -- or on the basis of individual files, by choosing all data created by a particular user.
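A policy engine of the kind described might, in spirit, look like the following sketch. The policy fields, tier names, and matching rules here are invented for illustration and are not Sage's actual implementation:

```python
from datetime import datetime, timedelta

# Hypothetical tiering policies in the spirit of the tool described above.
# Each policy maps a predicate over file metadata to a target storage tier;
# every field and tier name here is invented for illustration.
POLICIES = [
    # Compliance rule first: personal data is pinned in place regardless
    # of age or activity (e.g., it must not cross country borders).
    {"match": lambda f: f["personal_data"], "tier": "local_only"},
    # Frequently accessed data belongs on high-performance drives.
    {"match": lambda f: f["accesses_per_day"] > 100, "tier": "fibre_channel"},
    # Data untouched for 90 days moves to lower-cost SATA.
    {"match": lambda f: datetime.now() - f["last_access"] > timedelta(days=90),
     "tier": "sata"},
]

def place(file_meta, default="sata"):
    """Return the tier of the first matching policy, evaluated in order."""
    for policy in POLICIES:
        if policy["match"](file_meta):
            return policy["tier"]
    return default
```

Policy order encodes precedence here, which is one simple way a compliance constraint like the European data-residency rule mentioned above could override ordinary performance-based placement.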
Another technology on show was Cobra, or Corporate Brand and Reputation Analysis, a tool to help companies find and analyze what's being said about them by end users and commentators in discussion forums, blogs and other sites on the Web.
It's offered today through IBM's services division but will become part of its Cognos analytics product line in the future, said Scott Spangler, a senior technical staff member with IBM's Service-Oriented Technology group.
Cobra uses a service such as Boardreader to search message boards and forums and collect posts that include references to keywords, such as a brand or product name, and store them in a data warehouse.
That data is analyzed using models built for each customer, which identify patterns in what's being said by looking at things like text clustering, sentiment analysis and how frequently certain terms are used. Those patterns are then analyzed in concert to identify posts that can be useful to a company.
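The pipeline described, collecting posts by keyword and then combining term-frequency and sentiment signals, could be sketched roughly as follows. Real systems rely on trained per-customer models; the brand name and word lists below are invented placeholders, not Cobra's actual vocabulary or scoring:

```python
import re
from collections import Counter

# Toy sketch of the collect-then-analyze pipeline described above.
# The keyword and sentiment lists are hypothetical placeholders.
BRAND_TERMS = {"acmebar"}                   # invented brand keyword
NEGATIVE = {"complaint", "awful", "avoid"}  # crude sentiment lexicons
POSITIVE = {"love", "great", "tasty"}

def _tokens(text):
    """Lowercase word tokens, stripped of punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def collect(posts, keywords=BRAND_TERMS):
    """Keep only posts that mention one of the tracked keywords."""
    return [p for p in posts if keywords & _tokens(p)]

def analyze(posts):
    """Return overall term frequencies and a sentiment score per post."""
    terms = Counter(w for p in posts for w in _tokens(p))
    scored = [(len(_tokens(p) & POSITIVE) - len(_tokens(p) & NEGATIVE), p)
              for p in posts]
    return terms, scored

def flag_negative(scored):
    """Surface the posts a brand team would likely want to review."""
    return [p for score, p in scored if score < 0]
```

Running `flag_negative` over the scored posts surfaces negative mentions of the tracked brand, a simplified version of the kind of signal combination the paragraph above describes.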
IBM's services division has deployed the tool at a big chocolate-bar company, which used Cobra to find out that vegetarians were complaining about one of the ingredients in its products, Spangler said.
There are already plenty of tools on the market that do similar things, but Spangler claims Cobra is more advanced because analysts can use it to build highly complex models that can be adapted quickly when needs change.