Earlier this month, Red Hat announced it had acquired Gluster, developer of the GlusterFS open source file system and the Gluster Storage Platform software stack. In so doing, Red Hat set itself up as a one-stop shop for those looking to deploy big data solutions such as Apache Hadoop. But it also bought a file system that has serious potential for cloud-based deployments. If you haven't heard of Gluster yet, here's a quick look at what makes it different from most other scale-out NAS solutions.
A quick tour of Gluster
In Gluster's own words, GlusterFS is "a scalable open source clustered file system that offers a global namespace, distributed front-end, and scales to hundreds of petabytes without difficulty." That's a big claim, but GlusterFS is built to solve big problems -- really big problems. In fact, Gluster's maximum capacity is somewhere in the neighborhood of 72 brontobytes (yeah, that's a real word).
Perhaps the most important detail to know right off the bat about GlusterFS is that it accomplishes absolutely massive scale-out NAS without one thing that pretty much everyone in the big data space uses: a centralized metadata store. Metadata, in this sense, is the data that describes where a given file or block is located in a distributed file system; it's also the Achilles' heel of most scale-out NAS solutions.
In some cases, such as Hadoop's native HDFS, metadata constitutes a dangerous single point of failure. In others, it's a barrier to truly linear performance scalability, because all nodes must continuously stay in contact with the server(s) that hold the metadata for the entire cluster -- which almost always results in additional latency and storage hardware that sits idle waiting for metadata requests to be fulfilled.
Gluster works around this problem through the use of its Elastic Hash Algorithm. Using this algorithm, every node in a Gluster cluster can compute the location of a given file without needing to contact any other node in the cluster -- essentially doing away with the need to track and exchange metadata. That gives GlusterFS a huge leg up over its competition and allows it to actually deliver on the promise of linear performance scalability.
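To make the idea concrete, here is a minimal Python sketch of hash-based placement. It is not Gluster's Elastic Hash Algorithm: the locate function, the node names, and the use of SHA-256 with a simple modulo are all assumptions chosen for illustration. The only point it demonstrates is that any party holding the same list of nodes can work out where a file lives from its path alone, with no lookup against a central server.

```python
import hashlib


def locate(path, nodes):
    """Return the node responsible for `path`, computed from the path alone.

    Hypothetical helper for illustration; not part of any Gluster API.
    """
    # Sort so every caller derives the same answer regardless of list order.
    ordered = sorted(nodes)
    # Hash the path and fold the first 8 bytes of the digest into an integer.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[bucket]


# Illustrative node names (assumed, not from the article).
cluster = ["node-a", "node-b", "node-c", "node-d"]

# Two independent callers, holding the node list in different orders,
# arrive at the same location without exchanging any metadata.
client_view = locate("/projects/report.txt", cluster)
server_view = locate("/projects/report.txt", list(reversed(cluster)))
print(client_view == server_view)
```

A plain modulo scheme like this one would relocate most files whenever a node is added or removed, which a production system has to avoid; the sketch deliberately ignores that problem.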
Back-end deployment
GlusterFS is a user-space file system that can be deployed on just about any Linux distribution (commonly RHEL or CentOS). It is also entirely hardware-independent and consequently very portable. In on-premises or private cloud implementations, GlusterFS can be built on top of commodity server hardware with JBOD, DAS, or SAN storage -- leaving the choice of what hardware to use entirely up to the end user. In public cloud environments, GlusterFS can be installed on top of existing product offerings to enable better scalability or survivability (both Amazon and RightScale offer this right now). It is also distributed in an increasingly wide variety of virtual appliances, which allows Gluster nodes to be implemented on top of a hypervisor -- either on-premises or in the cloud.
In terms of how data is stored within a cluster of GlusterFS nodes, Gluster can be deployed in several different models with different performance and availability characteristics. The simplest is a distribute-only mode that essentially emulates a file-level RAID0 distribution. In this model, each file is stored on only one Gluster node, so the loss of a single node would result in the loss of every file stored on it. Not surprisingly, this mode also offers the highest level of performance and makes the most efficient use of storage, since there's no file duplication.
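The trade-off in distribute-only mode can be sketched with a small Python simulation. Everything here is assumed for illustration (the place, distribute, and lost_files helpers, the file and node names, and the SHA-256 modulo placement, which stands in for Gluster's real algorithm). It shows that each file lands on exactly one node, so no capacity is spent on copies, and that a failed node takes its share of the files with it.

```python
import hashlib
from collections import defaultdict


def place(filename, nodes):
    """Pick exactly one node for a file (hypothetical stand-in for real placement)."""
    ordered = sorted(nodes)
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


def distribute(filenames, nodes):
    """Map each node to the files it holds; every file is stored once."""
    layout = defaultdict(list)
    for name in filenames:
        layout[place(name, nodes)].append(name)
    return dict(layout)


def lost_files(layout, failed_node):
    """Files that become unavailable if `failed_node` goes down."""
    return list(layout.get(failed_node, []))


# Illustrative inputs (assumed, not from the article).
files = ["file-%03d.dat" % i for i in range(100)]
nodes = ["node-a", "node-b", "node-c", "node-d"]

layout = distribute(files, nodes)
stored_copies = sum(len(held) for held in layout.values())
lost = lost_files(layout, "node-a")

print("copies stored:", stored_copies, "for", len(files), "files")
print("files lost if node-a fails:", len(lost))
```

Because there is exactly one stored copy per file, usable capacity equals raw capacity, but nothing in the layout can stand in for a node that disappears.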