Apache Spark is the word. OK, technically that's two, but it's clear that in the last year the big data processing platform has come into its own, with heavyweights like Cloudera and IBM throwing their weight and resources behind the project as we gradually say farewell to MapReduce.
We've all seen the Spark demonstrations where people write word count applications in fewer than 10 lines of code. But if you've actually dived into Spark with abandon, you might have discovered that once you start working on something larger than toy problems, some of the sheen comes off.
Yes, Spark is amazing, but it's not quite as simple as writing a few lines of Scala and walking away. Here are five of the biggest bugbears when using Spark in production:
1. Memory issues
No, I'm not talking about the perennial issue of Spark running out of heap space in the middle of processing a large amount of data. (Project Tungsten, one of Databricks' main areas of focus in Spark 1.5 and the upcoming 1.6, does a lot here to finally relieve us from the scourge of garbage collection.) I'm talking about the myriad other memory issues you'll come across when working at scale.
It might simply be the whiplash you get when switching from using Spark in Standalone cluster mode for months, then moving to YARN and Mesos -- and discovering that all the defaults change. For example, instead of grabbing all available memory and cores automatically in Standalone, the other deployment options give you terrifyingly tiny defaults for your executors and driver. It's easy to fix, but you'll forget at least once when spinning up your job, I'll bet.
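As a sketch, here's roughly what that looks like when submitting to YARN; the jar name and all of the resource numbers here are illustrative, not recommendations:

```shell
# Unlike Standalone mode, YARN won't grab all available memory and cores
# for you -- size the driver and executors explicitly, or you'll get the
# tiny defaults. The numbers below are illustrative; tune for your cluster.
spark-submit \
  --master yarn-cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  my-job.jar
```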
When you move beyond demos and into large data sets, you'll end up blowing up your Spark job because the reduceByKey operation you do on the 1.8TB set exceeds the default in spark.driver.maxResultSize (1GB, if you were wondering). Or maybe you're running enough parallel tasks that you run into the 128MB limit in spark.akka.frameSize.
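Both limits can be raised through plain configuration; the values here are illustrative starting points, and note that spark.akka.frameSize is specified in megabytes:

```properties
# conf/spark-defaults.conf -- illustrative values, not recommendations
# Cap on the total size of serialized results collected back to the driver
spark.driver.maxResultSize  4g
# Maximum message size between driver and executors, in MB
spark.akka.frameSize        256
```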
These are fixable by altering the configuration -- and Spark does a lot better these days about pointing them out in the logs -- but it means the "smartphone of data" (as Denny Lee of Databricks described Spark earlier this week) requires lots of trial and error (problematic for potentially long-running batch jobs). Spark also demands arcane knowledge of configuration options. That's great for consultants, not so much for everybody else.
2. The small files problem ... again
If you've done any work with Hadoop, you've probably heard people complaining about the small-files problem, which refers to the way HDFS prefers to devour a limited number of large files rather than a large number of small files. If you use Spark with HDFS, you'll run into this issue. But there's another modern pattern where this is lurking, and you might not realize it until it hits you:
Yeah, so we store all the data gzipped in S3.
This is a great pattern! Except when it's lots of small gzipped files. In that case, not only does Spark have to pull those files over the network, it also has to uncompress them. Because gzip is not a splittable compression format, each file has to be decompressed in its entirety by a single task, so your executors are going to spend a lot of time simply burning their cores unzipping files in sequence.
To make matters worse, each file then becomes one partition in the resulting RDD, meaning you can easily end up with an RDD with more than a million tiny partitions. (RDD stands for "resilient distributed dataset," the basic abstraction in Spark.) In order to not destroy your processing efficiency, you'll need to repartition that RDD into something more manageable, which will require lots of expensive shuffling over the network.
There's not a lot Spark can do here. The best fix is to get the data compressed in a different, splittable format (for example, LZO) and/or to investigate if you can increase the size and reduce the number of files in S3 somehow.
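As a rough sketch of the repartitioning workaround in Scala, assuming a live SparkContext called `sc` and an illustrative bucket path (this won't run outside a Spark deployment, and the partition count is something you'd tune, not a magic number):

```scala
// Sketch only: `sc` is an existing SparkContext; the path and the
// partition count are illustrative.
val raw = sc.textFile("s3n://my-bucket/events/*.gz")

// Each gzipped file becomes exactly one partition, so many small files
// means many tiny partitions. Collapse them before doing real work --
// this forces a shuffle, but you pay that cost once, up front, instead
// of dragging a million-partition RDD through the rest of the job.
val events = raw.repartition(512)
```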
3. Spark Streaming
Ah, Spark Streaming, the infamous extension to the Spark API. It's the whale that turns many a developer into Ahab, forever doomed to wander the corridors muttering "if only I can work out the optimal blockInterval, then my pipeline will stay up!" to themselves with a faded glint in their eye.
Now, it's incredibly easy to stand up a streaming solution with Spark. We've all seen the demos. However, getting a resilient pipeline that can operate at scale 24/7 can be a very different matter, often leading you down into some very deep debugging wells. Again, to Spark's credit, each release is making this easier, with more information made available at the SparkUI level, direct receivers, ways of dealing with back-pressure, and so on. But it's still not quite as simple as all those conference presentations would make it look.
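For example, the back-pressure support added in Spark 1.5 and the ever-troublesome block interval are both plain configuration switches; these values are illustrative starting points, not a cure:

```properties
# conf/spark-defaults.conf -- illustrative streaming settings
# Let receivers adapt their ingestion rate to processing speed (Spark 1.5+)
spark.streaming.backpressure.enabled  true
# How often received data is chunked into blocks (and thus into tasks)
spark.streaming.blockInterval         200ms
```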
If you're looking for some help debugging your Spark Streaming pipeline -- or deciding when you should consider switching to Apache Storm instead -- check out these two talks I recently gave: An Introduction to Apache Spark and Spark & Storm: When & Where?
4. Python
Before all you Python fans get out your pitchforks -- I like Python! I don't mean to start a programming language war. Honest! But unless there's a pressing need to use Python, I normally recommend that people write their Spark applications in Scala.
There are two main reasons for this. First, if you follow Spark development, you'll soon see that every release brings new features to the Scala/Java side of things first, with the Python APIs only catching up later to expose what was previously missing (this is true to an even greater extent with the SparkR bindings). In Python you will always be at least a step or two behind what is possible on the platform. Second, if you're writing your application with the pure RDD API, Python is almost always going to be slower than a Java or Scala equivalent. Use Scala! Embrace the type safety!
But if you need things in numpy or scikit-learn that simply aren't in Spark, then yes, Python definitely becomes a viable option again -- as long as you don't mind being a little behind the Spark API curve. Hey, back off with that pitchfork.
5. Random crazy errors
What sort of crazy errors? For instance, on a recent engagement, I had a Spark job that had been working fine for over a week. Then, out of nowhere, it stopped. The executor logs were full of entries that pointed to compression/decompression errors during the shuffle stages.
There's an open ticket in Spark's JIRA that blames this on the Snappy compression codec used during the shuffles. Oh, and the ticket points out that the problem is intermittent.
I flipped to a different codec and all was fine -- until the next morning, whereupon I got similar errors. I then spent that day flipping between different shuffle codecs and even turning off compression entirely, but to no avail. Eventually, I tracked the issue down to the interaction between Spark's network transport system (Netty), the Xen hypervisor, and the version of the Linux kernel we were using on our AWS instances (say that three times fast). The fix ended up being to set a flag on the Xen network drivers, after which everything magically worked as though nothing had ever been wrong. It was a very frustrating experience, but at least it had a happy ending.
The moral of these tales? Although Spark makes it easy to write and run complicated data processing tasks at a very large scale, you still need experience and knowledge of everything from the implementation language down to the kernel when you start operating at scale and things go awry. I know, because a significant part of my business is devoted to helping people out of these kinds of jams.