As I work with larger enterprise clients, a few Hadoop themes have emerged. A common one is that most companies seem to be trying to avoid the pain they experienced in the heyday of JavaEE, SOA, and .Net -- as well as that terrible time when every department had to have its own portal.
To this end, they're trying to centralize Hadoop, in the way that many companies attempt to do with RDBMS or storage. Although you wouldn't use Hadoop for the same stuff you'd use an RDBMS for, Hadoop has many advantages over the RDBMS in terms of manageability. The row-store RDBMS paradigm (that is, Oracle) has inherent scalability limits, so when you attempt to create one big instance or RAC cluster to serve all, you end up serving none. With Hadoop, you have more ability to pool compute resources and dish them out.
Unfortunately, Hadoop management and deployment tools are still early-stage at best. As awful as Oracle's reputation may be, I could install it by hand in minutes. Installing a Hadoop cluster that does more than "hello world" will take hours at least. And once you start handling hundreds or thousands of nodes, you'll find the tooling a bit lacking.
Companies are using devops tools like Chef, Puppet, and Salt to create manageable Hadoop solutions. They face many challenges on the way to centralizing Hadoop:
- Hadoop isn't a thing: Hadoop is a word we use to mean "that big data stuff" like Spark, MapReduce, Hive, HBase, and so on. There are a lot of pieces.
- Diverse workloads: Not only do you potentially need to balance a Hive-on-Tez workload against a Spark workload, but some workloads are more constant and sustained than others.
- Partitioning: YARN is pretty much a clusterwide version of the process scheduler and queuing system that you take for granted in the operating system of the computer, phone, or tablet you're using right now. You ask it to do stuff, and it balances it against the other stuff it's doing, then distributes the work accordingly. Obviously, this is essential. But there's a pecking order -- and who you are often determines how many resources you get. Also, streaming jobs and batch jobs may need different levels of service. You may have no choice but to deploy two or more Hadoop clusters, which you need to manage separately. Worse, what happens when workloads are cyclical?
- Priorities: Just because your organization wants to provision a 1,000-node Spark cluster doesn't mean you have the right to provision 1,000 nodes. Can you really get the resources you need?
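To make the partitioning problem concrete, here is a toy sketch of how a shared cluster might split nodes among queues: each queue gets a guaranteed slice, and idle capacity is lent to whichever queue still has unmet demand. This is not YARN's actual scheduler code. The function name `allocate`, the queue names, and every number are assumptions made up for illustration.

```python
def allocate(total_nodes, queues):
    """Split nodes among queues.

    queues maps a queue name to (share_pct, demand), where share_pct is the
    queue's guaranteed percentage of the cluster (assumed > 0) and demand is
    how many nodes it is asking for right now.
    """
    alloc = {}
    # Pass 1: every queue gets what it asks for, up to its guaranteed slice.
    for name, (share_pct, demand) in queues.items():
        guaranteed = total_nodes * share_pct // 100
        alloc[name] = min(demand, guaranteed)

    # Pass 2: lend idle nodes, one at a time, to the hungry queue that is
    # currently furthest below its share.
    spare = total_nodes - sum(alloc.values())
    while spare > 0:
        hungry = [n for n, (_, demand) in queues.items() if alloc[n] < demand]
        if not hungry:
            break
        neediest = min(hungry, key=lambda n: alloc[n] / queues[n][0])
        alloc[neediest] += 1
        spare -= 1
    return alloc


# Hypothetical daytime and nighttime demand on a 100-node cluster.
daytime = {"streaming": (50, 30), "batch": (30, 80), "adhoc": (20, 10)}
nighttime = {"streaming": (50, 50), "batch": (30, 80), "adhoc": (20, 0)}
print(allocate(100, daytime))
print(allocate(100, nighttime))
```

Even in this simplified form, the batch queue's allocation swings with what the other tenants happen to be doing, which is why cyclical workloads and guaranteed service levels are hard to reconcile on one cluster.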
On one hand, many organizations have deployed Hadoop successfully. On the other, if this smells like building your own PaaS with devops tools, your nose is working correctly. You don't have a lot of choice yet. Solutions are coming, but none of them really solves the problems of deploying and maintaining Hadoop in a large organization:
- Ambari: This Apache project is a marvel and an amazing thing when it works. Each version gets better and each version manages more nodes. But Ambari isn't for provisioning more VMs, and it is better at initial provisioning than at reprovisioning or reconfiguring. Ambari probably isn't a long-term solution for provisioning large multitenant environments with diverse workloads.
- Slider: Slider enables non-YARN applications to be managed by YARN. Many Hadoop projects at Apache are really controlled or sponsored by one of the major vendors. In this case, the sponsor is Hortonworks, so it pays to look at Hortonworks' road map for Slider. One of the more interesting developments is the ability to deploy Dockerized apps via YARN based on your workload. I haven't seen this in production yet, but it's very promising.
- Kubernetes: I admit to being biased against Kubernetes because I can't spell it. Kubernetes is a way to pool compute resources Google-style. It brings us one step closer to a PaaS-like feel for Hadoop. I can see a potential future when you use OpenShift, Kubernetes, Slider, YARN, and Docker together to manage a diverse cluster of resources. Cloudera hired a Google exec with that on his resume.
- Mesos: Mesos has some overlap with Kubernetes but competes directly with YARN or, more accurately, YARN/Slider. The best way to understand the difference is that YARN is more like traditional task scheduling: A process gets scheduled against resources that YARN has available to it on the cluster. With Mesos, an app makes a request, Mesos makes an offer, and the app can "reject" that offer and wait for a better one, sort of like dating. If you really want to understand this in detail, MapR has a good walkthrough (though possibly the conclusions are a bit biased). Finally, there's a YARN/Mesos hybrid called Myriad. The hype cycle has burned a bit quickly for Mesos.
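The YARN-versus-Mesos distinction in the last bullet can be sketched in a few lines. This is a deliberately simplified model, not either project's real API: `Node`, `Framework`, `request_based`, and `offer_based` are hypothetical names, and the "preferred nodes" rule merely stands in for whatever reason an app might have to turn down an offer (data locality, for example).

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: int


def request_based(nodes, cpus_needed):
    """YARN-like: the scheduler decides where the work lands."""
    for node in nodes:
        if node.free_cpus >= cpus_needed:
            node.free_cpus -= cpus_needed
            return node.name
    return None  # nothing fits right now


@dataclass
class Framework:
    cpus_needed: int
    preferred: frozenset  # node names this app is willing to run on

    def accepts(self, offer):
        # The app, not the scheduler, gets the final say on each offer.
        return offer.free_cpus >= self.cpus_needed and offer.name in self.preferred


def offer_based(nodes, framework):
    """Mesos-like: resources are offered, and the app may decline and wait."""
    for offer in nodes:
        if framework.accepts(offer):
            offer.free_cpus -= framework.cpus_needed
            return offer.name
    return None  # every offer was declined


def cluster():
    return [Node("a", 4), Node("b", 8)]


picky_app = Framework(cpus_needed=4, preferred=frozenset({"b"}))
print(request_based(cluster(), 4))       # scheduler takes the first node that fits
print(offer_based(cluster(), picky_app))  # app declines "a" and holds out for "b"
```

The same four-CPU job lands on different nodes depending on who makes the placement decision, which is the essence of the difference described above.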
What about opting for a Hadoop provider in the public cloud? Well, there are a few answers to that question. First, at a certain scale you begin to stop believing claims that Amazon is cheaper than having your own internal IT team maintaining things. Second, many companies have (real or imagined) beliefs around data security and regulation that prevent them from going to the cloud. Third, uploading larger data sets may not be practical, given the amount of bandwidth you can buy and how soon you need the data uploaded and processed. Finally, many of the same challenges (especially around diverse workloads) persist in the cloud.
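The bandwidth point is easy to check with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function name `upload_days`, the 100TB data set, the 1Gbps link, and the 80 percent utilization figure are all assumptions, not numbers from any particular deployment.

```python
def upload_days(dataset_tb, link_mbps, utilization=0.8):
    """Rough days needed to push a data set over a network link.

    dataset_tb  -- data set size in decimal terabytes (assumed)
    link_mbps   -- nominal link speed in megabits per second (assumed)
    utilization -- fraction of the link you can actually sustain (assumed)
    """
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86400  # seconds per day


# Hypothetical case: 100TB over a 1Gbps link at 80 percent utilization.
print(round(upload_days(100, 1000), 1))
```

Under those assumptions the transfer takes more than a week and a half before any processing starts, which is the kind of delay that can rule out the cloud for large, time-sensitive data sets.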
After the vendor wars subside and the shrill pitch of multiple solutions in the marketplace fades, we'll eventually have a turnkey solution for dealing with multiple workloads, diverse services, and different use cases in a way that provisions both the infrastructure and service components on demand.
For now, expect a lot of custom scripting and recipes. Organizations that make large-scale use of this technology simply can't wait to start centralizing. The cost of building and maintaining disparate clusters outweighs the cost of custom-building or deploying immature technology.