Apache Spark has come to represent the next generation of big data processing tools. By drawing on open source algorithms and distributing the processing across clusters of compute nodes, the Spark and Hadoop generation has easily outdone traditional frameworks both in the types of analytics it can execute on a single platform and in the speed at which it can execute them. Spark uses memory for its data processing, making it much faster (up to 100x on some workloads) than disk-based Hadoop.
But Spark can run even faster with a little help. By combining Spark with Redis, the popular in-memory data structure store, you can provide another huge boost to the performance of analytics jobs. This is due to Redis’ optimized data structures and its ability to execute operations in a way that minimizes complexity and overhead. By accessing the Redis data structures and API through a connector, Spark gains even more speed.
How big is this boost? When Redis and Spark are used together, data processing (for the analysis of time series data described below) proved 45 times faster than Spark alone using either process memory or an off-heap cache to store the data -- not 45 percent faster, but 45 times faster!
Why does this matter? Increasingly, companies need analytics on their transactions at the same speed as the business transactions themselves. More and more decisions are becoming automated, and the analysis needed to drive these decisions should be available in real time. Apache Spark is a great general data processing framework, and while it is not 100 percent real time, it’s still a big step toward putting your data to work in a timelier manner.
Spark uses Resilient Distributed Datasets (RDDs), which can be stored in volatile memory or in persistent storage like HDFS. RDDs are immutable and distributed across all nodes in a Spark cluster, and they can be transformed to create other RDDs.
RDDs are an important abstraction in Spark. They represent a fault-tolerant method to present data to iterative processes, with high efficiency. Because the processing happens in memory, this represents an orders-of-magnitude improvement in processing times compared to using HDFS and MapReduce.
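The properties described above (immutability, transformations that produce new datasets, and recovery by recomputation) can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's API: the class `ToyRDD` and its methods are hypothetical names chosen for illustration, and the data lives in a single process rather than across a cluster.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: immutable, lazily evaluated, rebuilt from lineage."""

    def __init__(self, source: Iterable = (),
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[Iterable], Iterable]] = None) -> None:
        self._source = tuple(source)  # frozen copy: never modified after creation
        self._parent = parent         # lineage: the dataset this one was derived from
        self._fn = fn                 # transformation applied to the parent's records

    def map(self, f: Callable) -> "ToyRDD":
        # Returns a NEW dataset; the current one is left untouched.
        return ToyRDD(parent=self, fn=lambda rows: (f(r) for r in rows))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, fn=lambda rows: (r for r in rows if pred(r)))

    def collect(self) -> List:
        # Nothing is computed until an action such as collect() is called.
        # Results are rebuilt from the lineage each time, which is the idea
        # behind recovering lost data by recomputation.
        if self._parent is None:
            return list(self._source)
        return list(self._fn(self._parent.collect()))


prices = ToyRDD([101.5, 99.0, 103.25, 98.75])
above_100 = prices.filter(lambda p: p > 100)  # derived dataset
whole = above_100.map(int)                    # derived from the derived dataset
print(whole.collect())
print(prices.collect())                       # the original is unchanged
```

Each transformation records only how to derive its records from its parent, so any dataset in the chain can be recomputed from the original source when needed.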
Redis is purpose-built for high performance. Its submillisecond latencies are fueled by optimized data structures that boost efficiency by allowing operations to be executed right next to where the data is stored. These data structures not only make efficient use of memory and reduce application complexity, but also lower network overhead, bandwidth consumption, and processing times. Redis data structures include strings, sets, sorted sets, hashes, bitmaps, hyperloglogs, and geospatial indexes. Redis data structures are used like Lego building blocks by developers -- simple conduits to deliver complex functionality.
As an illustration of how these data structures reduce application processing time and complexity, consider the Sorted Set. A Sorted Set is a set of unique members, each with an associated score, kept in order by that score.
You can store many types of data in a sorted set, and the members are automatically ordered by score. Common examples include items by price, article names by count, time series data such as stock prices, and sensor readings by time stamp.
The beauty of sorted sets lies in Redis’ built-in operations that allow range queries, intersections of multiple sorted sets, retrieval by member ranks and score, and more to be executed simply, with unmatched speed, and at scale. Built-in operations not only reduce the amount of code that needs to be written; because they execute in memory, next to the data, they also save on network latency and bandwidth and enable high throughput at submillisecond latencies. Using sorted sets for time series data analysis typically results in orders-of-magnitude performance gains compared to other in-memory key/value stores or disk-based databases.
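To make the behavior of a sorted set concrete, here is a minimal in-process imitation in Python. It is a sketch for illustration, not how Redis implements sorted sets: the class `ToySortedSet` is hypothetical, and its method names merely echo the Redis commands ZADD, ZRANGEBYSCORE, and ZRANK. The item names and prices are made up.

```python
import bisect
from typing import Dict, List, Tuple

# Sentinel that sorts after any ordinary member string (assumption: members
# do not start with this code point).
_MAX_MEMBER = chr(0x10FFFF)


class ToySortedSet:
    """In-process imitation of a sorted set: unique members ordered by score."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}          # member -> score
        self._ordered: List[Tuple[float, str]] = []  # (score, member), kept sorted

    def zadd(self, member: str, score: float) -> None:
        # Members are unique, so re-adding one replaces its old score.
        if member in self._scores:
            self._ordered.remove((self._scores[member], member))
        self._scores[member] = score
        bisect.insort(self._ordered, (score, member))

    def zrangebyscore(self, lo: float, hi: float) -> List[str]:
        # Binary search for both ends of the inclusive score range.
        start = bisect.bisect_left(self._ordered, (lo, ""))
        end = bisect.bisect_right(self._ordered, (hi, _MAX_MEMBER))
        return [member for _, member in self._ordered[start:end]]

    def zrank(self, member: str) -> int:
        # 0-based position of the member in ascending score order.
        return bisect.bisect_left(self._ordered, (self._scores[member], member))


prices = ToySortedSet()
for item, price in [("mouse", 19.99), ("monitor", 249.0),
                    ("keyboard", 49.5), ("cable", 7.25)]:
    prices.zadd(item, price)

under_50 = prices.zrangebyscore(0, 50)
print(under_50)                  # items priced between 0 and 50, cheapest first
print(prices.zrank("keyboard"))  # rank of "keyboard" by price
```

Because the members are kept in score order, a range query needs only two binary searches and a slice, with no scan over the whole collection.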
With the goal of boosting Spark’s analytic capabilities, the Redis team created the Spark-Redis connector. This package allows Spark to use Redis as one of its data sources. The connector exposes Redis’ data structures to Spark, providing a huge performance boost to all types of analyses.
To showcase the benefits to Spark, the team decided to benchmark time series analysis in Spark by executing time slice (range) queries in a few different scenarios: with Spark storing everything in on-heap memory, with Spark using Tachyon as an off-heap cache, with Spark using HDFS, and with the combination of Spark and Redis.
Using Cloudera’s Spark time series package, the Redis team created a Spark-Redis time series package that uses Redis sorted sets to accelerate time series analysis. In addition to providing Spark with access to all of Redis’ data structures, the package does two more things:
- It aligns Redis nodes with the Spark cluster automatically to make sure each Spark node uses its local Redis data, thus optimizing latency.
- It integrates with the Spark data-frame and data-source APIs that allow automatic translation of Spark SQL queries to the most efficient retrieval mechanisms for the data in Redis.
In plain English, this means the user doesn't have to worry about operational alignment between Spark and Redis and can continue to use Spark SQL for analysis, while gaining a tremendous boost in query performance.
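As a rough picture of what such a translation could look like, the sketch below folds comparison filters on a date column into a single ZRANGEBYSCORE command. This is not the connector's actual code: the function `filters_to_zrangebyscore`, the tuple-based filter format, and the choice of a `date` column stored as a `YYYYMMDD` integer score are all assumptions made for illustration. The `(` prefix for an exclusive bound and the `-inf`/`+inf` tokens are standard ZRANGEBYSCORE syntax.

```python
from typing import Iterable, List, Tuple

# Hypothetical filter format: (operator, column, value),
# e.g. (">=", "date", 19890101)
Filter = Tuple[str, str, int]


def _bound(value: float, is_open: bool) -> str:
    # Render one end of the range in ZRANGEBYSCORE syntax.
    if value == float("-inf"):
        return "-inf"
    if value == float("inf"):
        return "+inf"
    return f"({value}" if is_open else str(value)


def filters_to_zrangebyscore(key: str, filters: Iterable[Filter],
                             score_column: str = "date") -> List[str]:
    """Fold comparison filters on the score column into one range command."""
    lo, lo_open = float("-inf"), False
    hi, hi_open = float("inf"), False
    for op, column, value in filters:
        if column != score_column:
            continue  # other predicates would still be evaluated in Spark
        if op in (">", ">="):
            # Keep the tightest lower bound seen so far.
            if value > lo or (value == lo and op == ">"):
                lo, lo_open = value, op == ">"
        elif op in ("<", "<="):
            # Keep the tightest upper bound seen so far.
            if value < hi or (value == hi and op == "<"):
                hi, hi_open = value, op == "<"
    return ["ZRANGEBYSCORE", key, _bound(lo, lo_open), _bound(hi, hi_open)]


# Roughly: SELECT * FROM AAPL
#          WHERE date >= 19890101 AND date < 19900101 AND volume > 1000
command = filters_to_zrangebyscore("AAPL", [
    (">=", "date", 19890101),
    ("<", "date", 19900101),
    (">", "volume", 1000),
])
print(" ".join(command))
```

The point of the design is that the range restriction is answered by the data store itself, so only the rows inside the requested time slice travel back to Spark.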
The time series data used for this benchmark consisted of randomly generated financial data for 1,024 stocks by day, over a range of 32 years. Each stock is represented by its own Sorted Set, with the scores being the date and the members including the open, high, low, close, volume, and adjusted close values. The figure below depicts how the data is represented in Redis sorted sets for Spark analysis:
In the example above, for the sorted set AAPL, each score represents a day (1989-01-01), and that day's values are stored as a single associated row. Pulling all the values for a particular time slice is done with a simple ZRANGEBYSCORE command in Redis, which gets all of the stock prices for a specified range of days. Redis executes this type of query up to 100 times faster than other key/value stores.
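The layout just described can be sketched in a few lines of Python. The encodings here are assumptions for illustration, not the benchmark's actual format: the date is turned into a `YYYYMMDD` integer score, the day's row is stored as one comma-separated member string (prefixed with the date so that members stay unique), and the sample values are made up. The `time_slice` function imitates, on an ordinary sorted list, what ZRANGEBYSCORE returns for an inclusive range of scores.

```python
import bisect
from datetime import date
from typing import List, NamedTuple, Tuple


class Bar(NamedTuple):
    open: float
    high: float
    low: float
    close: float
    volume: int
    adj_close: float


def to_score(day: date) -> int:
    # Assumed encoding: 1989-01-03 -> 19890103, so numeric order is date order.
    return day.year * 10000 + day.month * 100 + day.day


def to_member(day: date, bar: Bar) -> str:
    # Assumed encoding: the whole row for one day as a single delimited string.
    return ",".join([day.isoformat()] + [str(v) for v in bar])


def time_slice(series: List[Tuple[int, str]], start: date, end: date) -> List[str]:
    # Members whose score lies in the inclusive range, in score order.
    lo = bisect.bisect_left(series, (to_score(start),))
    hi = bisect.bisect_left(series, (to_score(end) + 1,))
    return [member for _, member in series[lo:hi]]


# Made-up sample rows for one stock, kept sorted by score.
rows = [
    (date(1989, 1, 3), Bar(40.5, 41.0, 40.0, 40.75, 120000, 1.25)),
    (date(1989, 1, 4), Bar(40.75, 42.0, 40.5, 41.5, 150000, 1.5)),
    (date(1989, 1, 5), Bar(41.5, 41.75, 40.25, 40.5, 90000, 1.25)),
    (date(1989, 1, 6), Bar(40.5, 41.25, 40.0, 41.0, 110000, 1.5)),
]
aapl = sorted((to_score(day), to_member(day, bar)) for day, bar in rows)

selected = time_slice(aapl, date(1989, 1, 4), date(1989, 1, 5))
for member in selected:
    print(member)
```

Keeping one member per day means a time slice comes back as a contiguous run of rows that is already in date order.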
The benchmarking bore out the performance boost. Spark using Redis turned out to execute time-slice queries 135 times faster than using Spark with HDFS and 45 times faster than either Spark using on-heap (process) memory or Spark using Tachyon as an off-heap cache. The graph below shows average execution times compared across the different scenarios.
To try this out for yourself, follow the downloadable step-by-step guide, “Getting Started with Spark and Redis.” The guide walks you through installing a typical Spark cluster and the Spark-Redis package. It also illustrates how Spark and Redis can be used together with a simple word-count example. After you've gotten your feet wet with Spark and the Spark-Redis package, you can explore many more scenarios utilizing other Redis data structures.
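For readers unfamiliar with the word-count pattern, the following plain-Python sketch shows its shape. It is not taken from the guide and uses neither Spark nor Redis: the functions `word_count` and `top_words` are hypothetical, and the ranking step only mimics the way a sorted set could hold each word as a member with its count as the score.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # "Map" each line to its words, then "reduce" by summing per word.
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return dict(counts)


def top_words(counts: Dict[str, int], n: int) -> List[Tuple[str, int]]:
    # Treat (word, count) like a sorted-set member and score: highest score
    # first, ties broken alphabetically so the result is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


lines = ["spark and redis", "redis sorted sets", "spark with redis"]
counts = word_count(lines)
print(top_words(counts, 2))
```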
While sorted sets are great for time series data, other Redis data structures -- such as sets, lists, and geospatial indexes -- can enrich Spark analyses even further. Imagine a Spark process trying to extract the best geographical locations for introducing a new product based on demographic preferences as well as proximity to urban centers. Now imagine that this process can be dramatically accelerated by data structures that come with built-in analytics such as geospatial indexes and sets. The possibilities of the Spark-Redis combination are infinite.
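The location scenario above boils down to two operations: a radius query around an urban center and an intersection with a set of locations that match a demographic profile. The sketch below works through that logic in plain Python, as a picture of the kind of query that geospatial indexes and sets answer natively. The function names are hypothetical, the city coordinates are approximate, and the demographic set is made up.

```python
import math
from typing import Dict, Set, Tuple

EARTH_RADIUS_KM = 6371.0
Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_radius(points: Dict[str, Point], center: Point,
                  radius_km: float) -> Set[str]:
    # The result a geospatial radius query would return: names of all
    # points no farther than radius_km from the center.
    return {name for name, pos in points.items()
            if haversine_km(pos, center) <= radius_km}


urban_center: Point = (37.7749, -122.4194)  # San Francisco (approximate)
candidates: Dict[str, Point] = {
    "oakland": (37.8044, -122.2712),
    "san_jose": (37.3382, -121.8863),
    "sacramento": (38.5816, -121.4944),
}
# Made-up set of locations whose demographics match the product.
matching_demographics = {"san_jose", "sacramento"}

nearby = within_radius(candidates, urban_center, radius_km=100)
shortlist = nearby & matching_demographics  # set intersection
print(sorted(shortlist))
```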
Spark supports a wide variety of analyses, including SQL, machine learning, graph computations, and streaming data. Using Spark’s in-memory processing capabilities gets you to a certain scale. However, adding Redis takes you even further: You not only gain a performance boost by drawing on Redis’ data structures, but you can scale Spark more elegantly to handle millions and billions of records by leveraging the shared distributed in-memory data store provided by Redis.
The time series example is only the beginning. Using Redis data structures for machine learning and graph analyses could bring disruptively fast execution times to these workloads as well.
Yiftach Shoolman is co-founder and CTO of Redis Labs.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to email@example.com.