5 reasons to turn to Spark for big data analytics

Smoothing the way to advanced and real-time analytics on Hadoop, Apache Spark is fast becoming the next big thing in big data

Over the past couple of years, as Hadoop has become the dominant paradigm for big data processing, several facts have become clear. First, the Hadoop Distributed File System is the right storage platform for big data. Second, YARN is the resource allocation and management framework of choice for big data environments. Third, and maybe most important, there is no single processing framework that will solve every problem. Although MapReduce is an amazing technology, it doesn’t address every situation. 

Businesses that rely on Hadoop need a variety of analytical infrastructures and processes to find the answers to their critical questions. They need data preparation, descriptive analysis, search, predictive analysis, and more advanced capabilities like machine learning and graph processing. Also, businesses need a tool set that meets them where they are, allowing them to leverage the skill sets and other resources they already have. Until now, no single processing framework has met all of those criteria. Filling that gap is the fundamental advantage of Spark.

Though Spark is a relatively young data project, it has met all of the above requirements and more. Here are five reasons to believe that we have entered the age of Spark. 

1. Spark makes advanced analytics a reality

A majority of large and innovative companies are looking to expand their advanced analytics capability. Yet at a recent big data analytics event in New York, only 20 percent of the participants reported that they are currently deploying advanced analytics across the enterprise. The other 80 percent said their hands are full simply preparing data and providing basic analytics. The few data scientists at these companies spend most of their time implementing and managing descriptive analytics.

Spark provides a framework for advanced analytics right out of the box. This framework includes a tool for accelerated queries (Spark SQL), a machine learning library (MLlib), a graph processing engine (GraphX), and a streaming analytics engine (Spark Streaming). Implementing these analytics via MapReduce can be nearly impossible, even with hard-to-find data scientists on hand; Spark instead provides prebuilt libraries that are easier and faster to use. This also frees the data scientists to take on tasks beyond data preparation and quality control. With Spark, they can even ensure correct interpretation of the analysis results.

2. Spark makes everything easier

A longtime criticism of Hadoop is that it is hard to use and that people who can use it are even harder to find. Although Hadoop has become simpler and more powerful with every new version, this critique has persisted into the present day. Instead of requiring users to master Java and MapReduce programming patterns, Spark is made to be accessible to anyone with an understanding of databases and some scripting skills (in Python or Scala). That makes it easier for businesses to find people who understand both their data and the tools to process it. And it allows vendors to develop analytics solutions faster and bring innovations to their customers sooner.

3. Spark speaks more than one language

At this point, it may be fair to ask: If SQL didn’t already exist, would we invent SQL today to address the challenges of big data analytics? Probably not -- at least not SQL alone. We would want more flexibility in getting at the answers we need, more options for organizing and retrieving data, and faster ways of moving the data into an analytics framework. Spark leaves the SQL-only mind-set behind, opening the data to the quickest and most elegant way of initiating analysis, whatever that might be for the data and business challenge at hand.

4. Spark accelerates results 

As the pace of business continues to accelerate, the need for real-time results continues to grow. Spark provides parallel in-memory processing that returns results many times faster than approaches that require disk access. Instant results eliminate delays that can significantly slow incremental analytics and the business processes that rely on them. As vendors begin to leverage Spark to build applications, dramatic improvements to the analyst workflow will follow. Accelerating the turnaround time for answers means that analysts can work iteratively, homing in on more precise, and more complete, answers. Spark lets analysts do what they are supposed to do: find better answers faster.
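The benefit of keeping data in memory for iterative work can be sketched with a toy model in plain Python. This is not Spark code and does not use any Spark API: `LazyDataset`, `load_sales`, and `run_queries` are hypothetical names invented for the illustration, and the counter simply records how often the (pretend) slow source is read.

```python
class LazyDataset:
    """Toy stand-in for a dataset whose source is expensive to read."""

    def __init__(self, load, keep_in_memory=False):
        self._load = load              # function that reads the source
        self._keep = keep_in_memory    # whether to retain rows after a read
        self._rows = None
        self.source_reads = 0          # how many times the source was read

    def rows(self):
        if self._rows is not None:
            return self._rows          # served from memory, no source read
        self.source_reads += 1
        rows = list(self._load())
        if self._keep:
            self._rows = rows
        return rows


def load_sales():
    # Hypothetical data; imagine this as a slow scan of files on disk.
    return [("east", 120.0), ("west", 75.5), ("east", 30.0)]


def run_queries(dataset):
    # Three successive questions an analyst might ask of the same data.
    total = sum(amount for _, amount in dataset.rows())
    east = sum(amount for region, amount in dataset.rows() if region == "east")
    largest = max(amount for _, amount in dataset.rows())
    return total, east, largest


uncached = LazyDataset(load_sales)
cached = LazyDataset(load_sales, keep_in_memory=True)

uncached_answers = run_queries(uncached)
cached_answers = run_queries(cached)

print(uncached.source_reads, cached.source_reads)
```

The answers are identical either way; what changes is that the in-memory version touches the source once instead of once per question, which is the effect that matters when analysts refine a query over and over.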

5. Spark doesn’t care which Hadoop vendor you use 

All of the major Hadoop distributions now support Spark, with good reason. Spark is a vendor-neutral solution, meaning that implementation doesn’t tie the user to any one provider. Because Spark is open source, businesses are free to create a Spark-based analytics infrastructure without having to worry about whether they might change Hadoop vendors at some point down the road. If they change, they can bring their analytics with them.

The momentum Spark has achieved in a very short time is testament to how closely it matches the requirements of businesses using big data analytics. Right now we are only at the beginning of the “age of Spark.” As businesses begin to truly leverage Spark’s potential, we can expect to see Spark solidify its position as one of the core technologies for any big data analytics environment, and the Spark ecosystem continue to grow accordingly. Businesses looking to bring advanced and real-time analytics to big data sets should be looking into Spark now. 

Peter Schlampp is vice president of product at Platfora. Platfora is a big data analytics platform built natively on Hadoop and Spark, allowing business users and data scientists to visually interact with petabyte-scale data in seconds.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2015 IDG Communications, Inc.