Fire up big data processing with Apache Ignite

Apache Ignite brings RDBMS, NoSQL, and Hadoop data sets into memory to deliver huge performance gains


Apache Ignite is an in-memory computing platform that can be inserted seamlessly between a user’s application layer and data layer. Apache Ignite loads data from the existing disk-based storage layer into RAM, improving performance by as much as six orders of magnitude (1 million-fold).

The in-memory data capacity can be easily scaled to handle petabytes of data simply by adding more nodes to the cluster. Further, both ACID transactions and SQL queries are supported. Ignite delivers performance, scale, and comprehensive capabilities far above and beyond what traditional in-memory databases, in-memory data grids, and other in-memory-based point solutions can offer by themselves.

Apache Ignite does not require users to rip and replace their existing databases. It works with RDBMS, NoSQL, and Hadoop data stores. Apache Ignite enables high-performance transactions, real-time streaming, and fast analytics in a single, comprehensive data access and processing layer. It uses a distributed, massively parallel architecture on affordable, commodity hardware to power existing or new applications. Apache Ignite can run on premises, on cloud platforms such as AWS and Microsoft Azure, or in a hybrid environment.

[Figure: Apache Ignite architecture]

The Apache Ignite unified API supports SQL, C++, .NET, Java, Scala, Groovy, PHP, and Node.js. The unified API connects cloud-scale applications with multiple data stores containing structured, semistructured, and unstructured data. It offers a high-performance data environment that allows companies to process full ACID transactions and generate valuable insights from real-time, interactive, and batch queries.

Users can keep their existing RDBMS in place and deploy Apache Ignite as a layer between it and the application layer. Apache Ignite automatically integrates with Oracle, MySQL, Postgres, DB2, Microsoft SQL Server, and other RDBMSes. The system automatically generates the application domain model based on the schema definition of the underlying database, then loads the data. In-memory databases typically provide only a SQL interface, whereas Ignite supports a wider group of access and processing paradigms in addition to ANSI SQL. Apache Ignite supports key/value stores, SQL access, MapReduce, HPC/MPP processing, streaming/CEP processing, clustering, and Hadoop acceleration in a single integrated in-memory computing platform.
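The layering described above can be pictured with a minimal read-through, write-through sketch in plain Java. This is not Ignite's API: `ReadThroughCache`, `loader`, and `writer` are hypothetical names chosen for illustration, and a `HashMap` stands in for the existing RDBMS.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Illustrative sketch only: hypothetical names, not Ignite's API.
class ReadThroughCacheSketch {

    /** In-memory layer between the application and a slower backing store. */
    static final class ReadThroughCache<K, V> {
        private final Map<K, V> memory = new ConcurrentHashMap<>();
        private final Function<K, V> loader;   // reads one value from the backing store
        private final BiConsumer<K, V> writer; // writes one value to the backing store

        ReadThroughCache(Function<K, V> loader, BiConsumer<K, V> writer) {
            this.loader = loader;
            this.writer = writer;
        }

        /** Serves from memory; on a miss, loads from the store and keeps the value in memory. */
        V get(K key) {
            return memory.computeIfAbsent(key, loader);
        }

        /** Write-through: the backing store is updated before the in-memory copy. */
        void put(K key, V value) {
            writer.accept(key, value);
            memory.put(key, value);
        }
    }

    /** Runs a small scenario and returns how many reads reached the backing store. */
    static int demoStoreReads() {
        Map<Integer, String> database = new HashMap<>(); // stands in for the RDBMS
        database.put(1, "Ada");
        int[] storeReads = {0};

        ReadThroughCache<Integer, String> cache = new ReadThroughCache<>(
                id -> { storeReads[0]++; return database.get(id); },
                database::put);

        cache.get(1);          // miss: loaded from the database
        cache.get(1);          // hit: served from memory
        cache.put(2, "Grace"); // written to the database and to memory
        cache.get(2);          // hit: no database read
        return storeReads[0];
    }

    public static void main(String[] args) {
        System.out.println("Reads that reached the backing store: " + demoStoreReads());
    }
}
```

In this sketch the application talks only to the cache, and the database is touched once per key on reads and once per write, which is the basic reason an in-memory layer can sit in front of an existing store without replacing it.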

GridGain Systems donated the original code for Apache Ignite to the Apache Software Foundation in the second half of 2014. Apache Ignite was rapidly promoted from an incubating project to a top-level Apache project in 2015. In the second quarter of 2016, Apache Ignite was downloaded nearly 200,000 times. It is used by organizations around the world.

Architecture

Apache Ignite is JVM-based distributed middleware built on a homogeneous cluster topology that does not require separate server and client nodes. All nodes in an Ignite cluster are equal, and they can play any logical role that the application requires at runtime.

A service provider interface (SPI) design is at the core of Apache Ignite. The SPI-based design makes every internal component of Ignite fully customizable and pluggable. This enables tremendous configurability of the system, with adaptability to any existing or future server infrastructure.

Apache Ignite also provides direct support for parallelization of distributed computations based on fork-join, MapReduce, or MPP-style processing. Ignite uses distributed parallel computations extensively, and they are fully exposed at the API level for user-defined functionality.

Key features

In-memory data grid. Apache Ignite includes an in-memory data grid that handles distributed in-memory data management, including ACID transactions, failover, advanced load balancing, and extensive SQL support. The Ignite data grid is a distributed, object-based, ACID transactional, in-memory key-value store. In contrast to traditional database management systems, which utilize disk as their primary storage mechanism, Ignite stores data in memory. By utilizing memory rather than disk, Apache Ignite is up to 1 million times faster than traditional databases.

[Figure: Apache Ignite data grid]

SQL support. Apache Ignite supports free-form ANSI SQL-99 compliant queries with virtually no limitations. Ignite can use any SQL function, aggregation, or grouping, and it supports distributed, noncolocated SQL joins and cross-cache joins. Ignite also supports the concept of field queries to help minimize network and serialization overhead.

In-memory compute grid. Apache Ignite includes a compute grid that enables parallel, in-memory processing of CPU-intensive or other resource-intensive tasks such as traditional HPC, MPP, fork-join, and MapReduce processing. Support is also provided for standard Java ExecutorService asynchronous processing.
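To make the `ExecutorService` style of processing concrete, here is a minimal sketch using only the standard `java.util.concurrent` API. A local thread pool stands in for cluster nodes, so this shows the programming model (split work into independent jobs, then combine the results) rather than Ignite's own distributed executor; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: a local thread pool stands in for cluster nodes.
class ExecutorMapReduceSketch {

    /** Counts characters by giving each word its own job, then summing the results. */
    static int countChars(List<String> words, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // "Map" step: one independent job per word.
            List<Callable<Integer>> jobs = new ArrayList<>();
            for (String word : words) {
                jobs.add(word::length);
            }
            // "Reduce" step: wait for every job and add up the partial results.
            int total = 0;
            for (Future<Integer> result : pool.invokeAll(jobs)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> words = List.of("Count", "characters", "using", "callable");
        System.out.println("Total characters: " + countChars(words, 2));
    }
}
```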

[Figure: Apache Ignite compute grid]

In-memory service grid. The Apache Ignite service grid provides complete control over services deployed on the cluster. Users can control how many service instances should be deployed on each cluster node, ensuring proper deployment and fault tolerance. The service grid guarantees continuous availability of all deployed services in case of node failures. It also supports automatic deployment of multiple instances of a service, of a service as a singleton, and of services on node startup.

In-memory streaming. In-memory stream processing addresses a large family of applications for which traditional processing methods and disk-based storage, such as disk-based databases or file systems, are inadequate. These applications are extending the limits of traditional data processing infrastructures.

[Figure: Apache Ignite streaming]

Streaming support allows users to query rolling windows of incoming data. This enables users to answer questions such as “What are the 10 most popular products over the last hour?” or “What is the average price in a certain product category for the past 12 hours?”
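A rolling-window query such as "the most popular products over the last hour" can be sketched in plain Java as follows. This is a single-JVM illustration under stated assumptions, not Ignite's streaming API: the `Window` class is hypothetical, and events are assumed to arrive with non-decreasing timestamps.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only: hypothetical names, single JVM, not Ignite's streaming API.
class RollingWindowSketch {

    private static final class Event {
        final String product;
        final long timestampMillis;

        Event(String product, long timestampMillis) {
            this.product = product;
            this.timestampMillis = timestampMillis;
        }
    }

    static final class Window {
        private final long windowMillis;
        private final Deque<Event> events = new ArrayDeque<>();      // oldest first
        private final Map<String, Integer> counts = new HashMap<>(); // product -> count in window

        Window(long windowMillis) {
            this.windowMillis = windowMillis;
        }

        /** Records one event; timestamps are assumed to arrive in non-decreasing order. */
        void add(String product, long timestampMillis) {
            events.addLast(new Event(product, timestampMillis));
            counts.merge(product, 1, Integer::sum);
            evictUpTo(timestampMillis - windowMillis);
        }

        /** Returns up to n products with the most events in the window ending at nowMillis. */
        List<String> top(int n, long nowMillis) {
            evictUpTo(nowMillis - windowMillis);
            return counts.entrySet().stream()
                    .sorted((a, b) -> {
                        int byCount = Integer.compare(b.getValue(), a.getValue());
                        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                    })
                    .limit(n)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }

        /** Drops events at or before the cutoff and decrements their counts. */
        private void evictUpTo(long cutoffMillis) {
            while (!events.isEmpty() && events.peekFirst().timestampMillis <= cutoffMillis) {
                Event old = events.removeFirst();
                counts.computeIfPresent(old.product,
                        (product, count) -> count > 1 ? Integer.valueOf(count - 1) : null);
            }
        }
    }

    public static void main(String[] args) {
        Window lastHour = new Window(60L * 60L * 1000L);
        lastHour.add("tea", 0L);
        lastHour.add("tea", 1_000L);
        lastHour.add("mug", 2_000L);
        System.out.println("Top products: " + lastHour.top(10, 2_000L));
    }
}
```

The window keeps only the events it still needs, so the answer is maintained incrementally as data arrives instead of being recomputed from stored history.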

Another common stream processing use case is pipelining a distributed events workflow. As events are coming into the system at high rates, the processing of events is split into multiple stages, each of which has to be properly routed within a cluster for processing. These customizable event workflows support complex event processing (CEP) applications.

In-memory Hadoop acceleration. The Apache Ignite Accelerator for Hadoop enables fast data processing in existing Hadoop environments via the tools and technology an organization is already using.

[Figure: Apache Ignite Hadoop acceleration]

Ignite in-memory Hadoop acceleration is based on two components: the first dual-mode, high-performance in-memory file system that is 100 percent compatible with Hadoop HDFS, and an in-memory optimized MapReduce implementation. In-memory HDFS and in-memory MapReduce provide easy-to-use extensions to disk-based HDFS and traditional MapReduce. This plug-and-play feature requires minimal to no integration. It works with any open source or commercial version of Hadoop 1.x or Hadoop 2.x, including Cloudera, Hortonworks, MapR, Apache, Intel, and AWS. The result is up to 100-fold faster performance for MapReduce and Hive jobs.

Distributed in-memory file system. A unique feature of Apache Ignite is the Ignite File System (IGFS), which is a file system interface to in-memory data. IGFS delivers similar functionality to Hadoop HDFS. It includes the ability to create a fully functional file system in memory. IGFS is at the core of the Apache Ignite In-Memory Accelerator for Hadoop.

The data from each file is split into separate data blocks and stored in cache. Data in each file can be accessed with a standard Java streaming API. For each part of the file, a developer can calculate an affinity and process the file’s content on corresponding nodes to avoid unnecessary networking.
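The idea of splitting a file into blocks and computing an affinity for each block can be sketched as below. Everything here is an assumption made for illustration: the class and method names are hypothetical, and a simple modulo hash stands in for whatever affinity function a real cluster would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Illustrative sketch only: hypothetical names; a modulo hash stands in for a real affinity function.
class BlockAffinitySketch {

    /** Splits file content into fixed-size blocks; the last block may be shorter. */
    static List<byte[]> split(byte[] content, int blockSize) {
        List<byte[]> blocks = new ArrayList<>();
        for (int from = 0; from < content.length; from += blockSize) {
            int to = Math.min(from + blockSize, content.length);
            blocks.add(Arrays.copyOfRange(content, from, to));
        }
        return blocks;
    }

    /** Maps a block to a node index deterministically from the file path and block number. */
    static int nodeFor(String path, int blockIndex, int nodeCount) {
        return Math.floorMod(Objects.hash(path, blockIndex), nodeCount);
    }

    /** Groups block indexes by owning node, so each node can process its own blocks locally. */
    static Map<Integer, List<Integer>> blocksByNode(String path, int blockCount, int nodeCount) {
        Map<Integer, List<Integer>> plan = new TreeMap<>();
        for (int i = 0; i < blockCount; i++) {
            plan.computeIfAbsent(nodeFor(path, i, nodeCount), node -> new ArrayList<>()).add(i);
        }
        return plan;
    }

    public static void main(String[] args) {
        byte[] content = new byte[10];
        int blockCount = split(content, 4).size();
        System.out.println("Blocks per node: " + blocksByNode("/logs/a.txt", blockCount, 3));
    }
}
```

Because the mapping is deterministic, any node can work out where a block lives and send the computation there, rather than pulling the block across the network.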

Unified API. The Apache Ignite unified API supports a wide variety of common protocols for the application layer to access data. Supported protocols include SQL, Java, C++, .NET, PHP, MapReduce, Scala, Groovy, and Node.js. Ignite supports several protocols for client connectivity to Ignite clusters, including Ignite Native Clients, REST/HTTP, SSL/TLS, and Memcached.

Advanced clustering. Apache Ignite provides one of the most sophisticated clustering technologies on JVMs. Ignite nodes can automatically discover each other, which helps scale the cluster when needed without having to restart the entire cluster. Developers can also take advantage of Ignite’s hybrid cloud support, which allows users to establish connections between private clouds and public clouds such as AWS or Microsoft Azure.

Additional features. Apache Ignite provides high-performance, clusterwide messaging functionality. It allows users to exchange data via publish-subscribe and direct point-to-point communication models.

The distributed events functionality in Ignite allows applications to receive notifications about cache events occurring in a distributed grid environment. Developers can use this functionality to be notified about the execution of remote tasks or any cache data changes within the cluster. Event notifications can be grouped and sent in batches, at timed intervals, or both. Batching notifications helps attain high cache performance and low latency.
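Batching of notifications can be illustrated with a small plain-Java sketch. The `Batcher` class and its parameters (`batchSize`, `maxDelayMillis`) are hypothetical names, not Ignite's event API; the sketch only shows the general technique of delivering events when either a size limit or a time limit is reached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only: hypothetical names, not Ignite's event API.
class EventBatcherSketch {

    /** Buffers events and hands them to the listener in groups. */
    static final class Batcher<E> {
        private final int batchSize;
        private final long maxDelayMillis;
        private final Consumer<List<E>> listener;
        private final List<E> buffer = new ArrayList<>();
        private long firstEventAtMillis;

        Batcher(int batchSize, long maxDelayMillis, Consumer<List<E>> listener) {
            this.batchSize = batchSize;
            this.maxDelayMillis = maxDelayMillis;
            this.listener = listener;
        }

        /** Adds an event; flushes when the batch is full or the oldest buffered event is too old. */
        void onEvent(E event, long nowMillis) {
            if (buffer.isEmpty()) {
                firstEventAtMillis = nowMillis;
            }
            buffer.add(event);
            if (buffer.size() >= batchSize || nowMillis - firstEventAtMillis >= maxDelayMillis) {
                flush();
            }
        }

        /** Delivers whatever is buffered as one notification. */
        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            listener.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        Batcher<String> batcher = new Batcher<>(3, 1_000L,
                batch -> System.out.println("Notified with " + batch));
        batcher.onEvent("put:k1", 0L);
        batcher.onEvent("put:k2", 10L);
        batcher.onEvent("remove:k1", 20L);
        batcher.flush();
    }
}
```

One notification per group means fewer messages on the wire, at the cost of a bounded delay before a listener sees any single event.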

Ignite allows for most of the data structures from the java.util.concurrent framework to be used in a distributed fashion. For example, you could add to a double-ended queue (java.util.concurrent.BlockingDeque) on one node and poll it from another node. Or you could have a distributed primary key generator, which would guarantee uniqueness on all nodes.

Ignite's distributed data structures include a concurrent map, distributed queues and sets, AtomicLong, AtomicSequence, AtomicReference, and CountDownLatch, most of which mirror standard Java APIs.
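One common way to build a primary key generator that stays unique across nodes is to let each node reserve a block of IDs from a shared counter and then hand them out locally. The sketch below illustrates that technique in plain Java; it is an assumption about how such a generator can work, not a description of Ignite's implementation. `NodeSequence` and `reserveSize` are hypothetical names, and a local `AtomicLong` stands in for cluster-wide shared state.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: a local AtomicLong stands in for cluster-wide shared state.
class SequenceSketch {

    /** Per-node ID generator that reserves ranges from a shared counter. */
    static final class NodeSequence {
        private final AtomicLong clusterCounter;
        private final int reserveSize;
        private long next;  // next ID this node will hand out
        private long limit; // end (exclusive) of the range this node has reserved

        NodeSequence(AtomicLong clusterCounter, int reserveSize) {
            this.clusterCounter = clusterCounter;
            this.reserveSize = reserveSize;
        }

        /** Returns a fresh ID, reserving a new range only when the current one is used up. */
        synchronized long nextId() {
            if (next == limit) {
                next = clusterCounter.getAndAdd(reserveSize);
                limit = next + reserveSize;
            }
            return next++;
        }
    }

    /** Interleaves two "nodes" and returns how many distinct IDs they produced. */
    static int demoUniqueCount() {
        AtomicLong shared = new AtomicLong();
        NodeSequence nodeA = new NodeSequence(shared, 10);
        NodeSequence nodeB = new NodeSequence(shared, 10);
        Set<Long> ids = new HashSet<>();
        for (int i = 0; i < 25; i++) {
            ids.add(nodeA.nextId());
            ids.add(nodeB.nextId());
        }
        return ids.size();
    }

    public static void main(String[] args) {
        System.out.println("Distinct IDs generated: " + demoUniqueCount());
    }
}
```

The trade-off in this design is that IDs are unique but not globally ordered, and unused IDs in a reserved range are simply skipped.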

Key integrations

Apache Spark. Apache Spark is a fast, general-purpose engine for large-scale data processing. Ignite and Spark are complementary in-memory computing solutions. They can be used together in many instances to achieve superior performance and functionality.

Apache Spark and Apache Ignite address somewhat different use cases and rarely compete for the same task. The table below outlines some of the key differences.

Apache Spark doesn’t provide shared storage, so data from HDFS or other disk storage must be loaded into Spark for processing. State can be passed from Spark job to job only by saving the processed data back into external storage. Ignite can share Spark state directly in memory, without storing the state to disk.

One of the main integrations for Ignite and Spark is the Apache Ignite Shared RDD API. Ignite RDDs are essentially wrappers around Ignite caches that can be deployed directly inside of executing Spark jobs. Ignite RDDs can also be used with the cache-aside pattern, where Ignite clusters are deployed separately from Spark, but still in-memory. The data is still accessed using Spark RDD APIs.

Spark supports a fairly rich SQL syntax, but it doesn’t support data indexing, so it must do full scans all the time. Spark queries may take minutes even on moderately small data sets. Ignite supports SQL indexes, resulting in much faster queries, so using Spark with Ignite can accelerate Spark SQL more than 1,000-fold. The result set returned by Ignite Shared RDDs also conforms to the Spark Dataframe API, so it can be further analyzed using standard Spark dataframes. Both Spark and Ignite natively integrate with Apache YARN and Apache Mesos, so it's easier to use them together.
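The difference between a full scan and an indexed lookup can be shown with a deliberately small plain-Java sketch. It uses a hash map as the index, whereas real SQL indexes are typically sorted structures and, in a cluster, distributed; the `Person` class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a hash map stands in for a real SQL index.
class IndexVsScanSketch {

    static final class Person {
        final String name;
        final String city;

        Person(String name, String city) {
            this.name = name;
            this.city = city;
        }
    }

    /** Full scan: every row is inspected on every query. */
    static List<String> namesByCityScan(List<Person> rows, String city) {
        List<String> names = new ArrayList<>();
        for (Person row : rows) {
            if (row.city.equals(city)) {
                names.add(row.name);
            }
        }
        return names;
    }

    /** Builds an index on city once, so later lookups skip unrelated rows. */
    static Map<String, List<String>> indexByCity(List<Person> rows) {
        Map<String, List<String>> index = new HashMap<>();
        for (Person row : rows) {
            index.computeIfAbsent(row.city, city -> new ArrayList<>()).add(row.name);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Person> rows = List.of(
                new Person("Ada", "London"),
                new Person("Grace", "New York"),
                new Person("Alan", "London"));
        Map<String, List<String>> index = indexByCity(rows);
        System.out.println("Scan:  " + namesByCityScan(rows, "London"));
        System.out.println("Index: " + index.getOrDefault("London", List.of()));
    }
}
```

Both paths return the same rows; the scan does work proportional to the table size on every query, while the index pays that cost once and answers later queries with a direct lookup.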

When working with files instead of RDDs, it's still possible to share state between Spark jobs and applications using the Ignite In-Memory File System (IGFS). IGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, exactly like HDFS. Ignite plugs in natively to any Hadoop or Spark environment. IGFS can be used with zero code changes in plug-and-play fashion.

Apache Cassandra. Apache Cassandra can serve as a high-performance solution for structured queries. But the data in Cassandra should be modeled such that each predefined query results in one row retrieval. Thus, you must know what queries will be required before modeling the data.
