For large-scale analytics, a distributed file system matters. Even if you’re using Spark, you need to pull a lot of data into memory very quickly, so a file system that supports high burst rates -- up to network saturation -- is a good thing. However, Hadoop’s eponymous file system, the Hadoop Distributed File System (HDFS), may not be all it's cracked up to be.
What is a distributed file system? Think of your normal file system, which stores files in blocks. It has some way of noting where on the physical disk a block starts and how that block matches to a file. (One implementation is a file allocation table or FAT of sorts.) In a distributed file system, the blocks are “distributed” among disks attached to multiple computers. Additionally, like RAID or most SAN systems, the blocks are duplicated so that if a node is lost from the network then no data is lost.
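To make the idea concrete, here is a toy sketch of block placement with replication. It is not how HDFS or any real system places blocks. The block size, the replication factor, the node names, and the round-robin policy are all assumptions picked for illustration.

```python
BLOCK_SIZE = 4      # bytes per block; absurdly small, for illustration only
REPLICATION = 3     # assumed number of copies kept of each block


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a file's contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str],
                 replication: int = REPLICATION) -> dict[int, list[str]]:
    """Map each block index to `replication` distinct nodes, round-robin.

    The returned table plays the role of the "file allocation table":
    it records where every copy of every block lives.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_loss(placement: dict[int, list[str]], lost_node: str) -> bool:
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != lost_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical cluster of four nodes storing one small "file".
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(b"hello distributed fs")
placement = place_blocks(len(blocks), nodes)
```

With three replicas, `readable_after_loss(placement, "node-a")` is true for this layout, because every block has copies on other nodes. With `replication=1`, losing a single node loses data.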
What’s wrong with HDFS?
In HDFS, the role of the “file allocation table” is taken by the namenode. You can have more than one namenode (for redundancy), but the namenode is still essentially both a point of failure and a bottleneck. A namenode can fail over, but that takes time. Redundancy also means keeping the namenodes in sync, which introduces more latency. On top of that, HDFS has threading and locking overhead, and it is written in garbage-collected Java. Garbage collection -- especially Java garbage collection -- requires a lot of memory (generally at least 10x to be as efficient as native memory management).
Moreover, in developing applications for distributed computing we often figure that whatever inefficiency we inject through language choice will be outweighed by I/O. In other words, so what if it took 1,000 operations to open a file and hand you some data, when the I/O operation itself took 10 times as long? Simplistically speaking, the higher level the language, the more operations or “work” are executed per line of code.
That said, the lower level the component, the more we pay for this inefficiency. Higher up the stack, it is rational to trade raw performance for shorter time to develop and deploy, lower maintenance costs, and better security (buffer overruns and underflows are next to impossible in a high-level, garbage-collected language like Java). However, as you go lower level, the inefficiency catches up with you. This is why most operating systems are written in C and assembly as opposed to, say, Java (despite several attempts). One could argue that a file system sits at that lower level.
The tooling around HDFS is poor compared to just about any file system or distributed store you’ve ever dealt with. You're asking your IT operations people to administer a Java-based file system that, at best, implements bastardized versions of their “favorite” POSIX tools. Yes, you can mount HDFS with NFS, but have you actually tried that? How well did that work out for you, really? The other tools for mounting HDFS are also pretty poor. Instead, you deal with weird REST bridge tools and a command-line client that doesn’t even accept most options for ls, let alone anything else.
There are of course other particular aspects of HDFS that are inefficient or just problematic. Most derive from the fact that HDFS is a file system written in Java, of all things.
What about HDFS could be fixed? HDFS has native code extensions to make it more efficient. Meanwhile, the community has improved the namenode substantially. However, on higher-end systems handling a lot of operations you still hit a namenode bottleneck that shows up in your favorite monitoring and diagnostic tools. Moreover, restarting the namenode to clear up ghost problems is something that occasionally has to happen. Overall, a more mature distributed file system written in C or C++, with mature bindings to common operating systems, is often a better option.
Spark and cloud demand change
Many of the early corporate Hadoop deployments were done on-premises, and the sales compensation plan for the early vendors was based on this assumption. With the rise of Spark and cloud deployments, it isn’t uncommon to see Amazon S3 used as a data source.
The Hadoop vendors all had a vision of a more unified Hadoop platform where HDFS would integrate with security components (Cloudera and Hortonworks, of course, building their own separate and incompatible security systems). However, with MapReduce giving way to Spark, and Spark being far more agnostic about the type of file system underneath it, “tight” integration with the file system seems less critical, let alone feasible.
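As a rough illustration of why Spark cares less about the file system: the storage backend is chosen by the URI scheme of the path, so the read call stays the same whether the data sits in HDFS, S3, or a locally mounted file system. The sketch below makes several assumptions. The host, bucket, and mount paths are hypothetical, `count_lines` and `run_demo` are illustrative helpers, and reading `s3a://` paths requires the S3A connector (hadoop-aws) to be available to Spark.

```python
from urllib.parse import urlsplit

# Hypothetical locations; none of these come from a real deployment.
SOURCES = [
    "hdfs://namenode.example:8020/data/events",   # HDFS
    "s3a://example-bucket/data/events",           # Amazon S3 via the S3A connector
    "file:///mnt/gluster/data/events",            # e.g. a mounted Gluster volume
]


def storage_scheme(uri: str) -> str:
    """Return the URI scheme, which is what selects the storage backend."""
    return urlsplit(uri).scheme


def count_lines(spark, uri: str) -> int:
    """Count lines of text under `uri`; the call is identical for every backend."""
    return spark.read.text(uri).count()


def run_demo() -> None:
    """Run the same read against each source (needs pyspark and real data)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fs-agnostic-sketch").getOrCreate()
    for uri in SOURCES:
        print(storage_scheme(uri), count_lines(spark, uri))
    spark.stop()
```

Nothing in `count_lines` mentions HDFS. Swapping the storage layer is a matter of changing the path and the cluster configuration, not the application code.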
Meanwhile, alternative file systems such as MapR’s MapR FS are gaining interest as companies discover the joys of dealing with HDFS in an operational setting. The design of MapR FS doesn’t include namenodes; it uses a more standard and familiar cluster master election scheme. MapR’s partitioning design should also do a better job of avoiding bottlenecks.
There are other options like Ceph, or an object store like S3, or a more standard distributed file system like Gluster. I/O performance tests tend to yield very positive results for Gluster. If you’re mainly looking to store files that you’re going to suck into Spark, there are rational reasons to pick something that is more operationally familiar to your IT operations folks and yet is, well, faster.
I’m not saying that large HDFS installations are going to migrate overnight, but that as we see more Spark and cloud projects we’re likely to see less and less HDFS over time. That will leave YARN as the remaining piece of Hadoop that people will still use. Maybe.