In an effort to secure a spot as the de facto virtualization provider for all things Hadoop, VMware today announced an open source project dubbed Serengeti that lets companies easily deploy and manage Hadoop distributions in virtual and cloud environments. The new project, along with new code contributions to Hadoop itself and enhancements to the Spring for Apache Hadoop development platform, also speaks to VMware's vision of providing the virtual glue between the cloud and big data.
The big picture here: Beyond making vSphere the must-have virtualization platform for Hadoop, VMware is aspiring to cash in on big data, a point made clear when the company acquired big data analytics startup Cetas earlier this year. VMware wants to make it easier for companies to use Hadoop as a big data platform across the board: Spring makes it easier for developers to create big data applications, IT can more easily deploy those applications via Serengeti onto a distributed, virtualized cloud infrastructure, and more business users can in turn take advantage of those big data capabilities.
For all its big data promise, one of Hadoop's most significant shortcomings thus far is its lack of support for virtualization, according to Fausto Ibarra, senior director of product management for vFabric at VMware. "Hadoop is designed for a physical infrastructure, not a virtual one," he told InfoWorld. "If you want to deploy a cluster of 20 nodes, you need to procure 20 physical servers. Then you basically have to install it on each server and configure it. Hadoop represents how data centers were run 10 years ago before virtualization was mainstream."
With Serengeti, he said, an IT admin can easily create, configure, deploy, and manage Hadoop in a virtual environment using what he describes as a simple yet powerful command-line interface. "You just issue a handful of commands to specify the size of the cluster, how much memory per node, storage, networking configurations. Then Serengeti deploys it," said Ibarra.
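The workflow Ibarra describes might look something like the session below. This is a hypothetical sketch: the command names, flags, and values are illustrative of the "handful of commands" he mentions, not verified Serengeti syntax.

```
# Hypothetical Serengeti CLI session -- command names and flags are
# illustrative only, not confirmed Serengeti syntax.

# Define and deploy a 20-node cluster with per-node memory, storage,
# and networking settings.
cluster create --name finance-cluster \
    --nodeCount 20 \
    --memPerNode 8192 \
    --storagePerNode 100 \
    --networkName defaultNetwork

# Check the status of deployed clusters.
cluster list
```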
VMware's case for decoupling Hadoop nodes from physical infrastructures in favor of virtual infrastructure -- specifically one based on vSphere -- is that organizations can benefit from faster deployment, higher availability, superior resource utilization, higher elasticity, and more secure multitenancy.
According to Ibarra, Serengeti works with any and all distributions of Hadoop from vendors like Cloudera, IBM, and Greenplum. As such, he said that Serengeti will make it easier for organizations to run different flavors of Hadoop at once: "Companies will be able to mix and match distributions, which is a common scenario today. The finance department wants to run a certain workload on Distribution A. Marketing wants to standardize on Distribution B. With Serengeti, you can run them on the same platform and share resources among them."
In addition to Serengeti, VMware announced it's contributing new code to Hadoop -- specifically the HDFS (Hadoop Distributed File System) and Hadoop MapReduce projects -- to make them "virtualization-aware," meaning data and compute jobs can be optimally distributed across a virtual infrastructure. "[Users] will be able to distribute data and assign jobs to individual nodes in an optimal way by understanding how nodes are deployed on a distributed infrastructure," said Ibarra.
Finally, VMware announced updates to Spring for Apache Hadoop, a development platform launched in February for enterprise developers to build distributed-processing solutions for Hadoop. With the update, Spring developers can build applications that integrate with HBase, the Cascading library, and Hadoop security.
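For a sense of what the HBase integration looks like from a developer's seat, the fragment below sketches a Spring application context wiring Hadoop and HBase together declaratively via Spring for Apache Hadoop's XML namespace. The hostnames, ports, and property values are placeholders invented for illustration.

```xml
<!-- Sketch of a Spring context using the Spring for Apache Hadoop
     namespace; all hostnames and values are placeholders. -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans
         http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/hadoop
         http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

  <!-- Core Hadoop configuration shared by the application -->
  <hdp:configuration>
    fs.defaultFS=hdfs://namenode.example.com:8020
  </hdp:configuration>

  <!-- HBase configuration pointing at a ZooKeeper quorum -->
  <hdp:hbase-configuration zk-quorum="zk1.example.com" zk-port="2181"/>

</beans>
```

Application code can then use the resulting HBase configuration bean (for example, through Spring's template-style helpers) rather than hand-building connection plumbing.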
Serengeti and Spring are both available as free downloads under the Apache 2.0 license.
This story, "VMware bets on big data as it brings Hadoop to virtual world," was originally published at InfoWorld.com.