Scale out and conquer: Architectural decisions behind distributed in-memory systems

Open source solutions hold the key to a cost-effective, unified architecture for leveraging in-memory computing

Large-scale digital transformation and omnichannel customer engagement applications require an unprecedented level of performance, including the ability to ingest and analyze data from multiple sources in real time. Developing the infrastructure to support this capability starts with a distributed in-memory computing (IMC) solution. However, when designing an IMC architecture, it’s important to consider all the application requirements and create a unified architecture that will ensure a simple and cost-effective environment for developing, deploying and managing the application.

Many open source projects, or the companies that build commercial solutions on those open source projects, ensure simple integration between their software and other open source-based software solutions. By doing so, they enable cost-effective complementary solutions that enterprises can quickly deploy to accelerate the time to value for in-memory computing. For example, many enterprises are taking advantage of open APIs in the following solutions to create an integrated, nearly seamless distributed IMC architecture.

Apache Ignite

Apache Ignite, an open source in-memory computing platform, is the foundation for the architecture. It features an in-memory data grid (IMDG) that runs on a distributed server cluster, which can be deployed on-premises, in private or public clouds, or in a hybrid environment. The IMDG can be inserted between the data and application layers of existing applications without ripping and replacing the existing database. All memory in the IMC cluster is available for processing, and the cluster can be scaled out simply by adding nodes.
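To make the scale-out idea concrete, here is a minimal sketch of a partitioned in-memory grid. It is a toy model, not Apache Ignite code: the `ToyGrid` class, the node names, and the choice of rendezvous hashing to assign keys to nodes are all assumptions made for illustration.

```python
import hashlib


def owner(key, nodes):
    """Pick the node that owns a key using rendezvous (highest-random-weight) hashing."""
    def weight(node):
        digest = hashlib.sha256(f"{node}:{key}".encode()).hexdigest()
        return int(digest, 16)
    return max(nodes, key=weight)


class ToyGrid:
    """A toy in-memory grid: each 'node' is just a dict held in this process."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}

    def put(self, key, value):
        self.nodes[owner(key, list(self.nodes))][key] = value

    def get(self, key):
        return self.nodes[owner(key, list(self.nodes))].get(key)

    def add_node(self, name):
        """Scale out: add a node and move only the keys it now owns."""
        self.nodes[name] = {}
        all_nodes = list(self.nodes)
        moved = 0
        for node_name, data in self.nodes.items():
            if node_name == name:
                continue
            # Collect the keys first so the dict is not mutated while iterating.
            for key in [k for k in data if owner(k, all_nodes) == name]:
                self.nodes[name][key] = data.pop(key)
                moved += 1
        return moved


grid = ToyGrid(["node-1", "node-2", "node-3"])
for i in range(1000):
    grid.put(f"order-{i}", {"id": i})

moved = grid.add_node("node-4")
print("keys moved to the new node:", moved)
print("keys per node:", {name: len(data) for name, data in grid.nodes.items()})
```

In this sketch, adding a fourth node relocates only the keys that the new node now owns; every other key stays where it was, which is the property that keeps scale-out cheap.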

Apache Ignite also features a persistent store capability that lets an organization balance infrastructure costs and application performance by enabling active data sets that are larger than the available memory. This allows the full operational data set to be kept on disk while keeping only a user-defined subset of data in memory. This architecture, often referred to as a “memory-centric architecture,” can be built using a distributed, ACID-compliant and ANSI SQL-99-compliant disk store deployed on spinning disks, solid-state drives (SSDs), flash, 3D XPoint or other storage-class memory technologies. This architecture also enables immediate data processing following a reboot without waiting for all the data to reload into memory. Thanks to the persistent store capability, Apache Ignite can also function as a distributed SQL in-memory database (IMDB).
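The memory-centric idea can be sketched in a few lines. The example below is a toy model under stated assumptions, not Ignite's persistence layer: the `MemoryCentricStore` class is hypothetical, a plain dict stands in for the disk store, and least-recently-used eviction is an assumed policy for choosing which subset stays in memory.

```python
from collections import OrderedDict


class MemoryCentricStore:
    """Full data set in a 'disk' dict; a bounded hot subset in memory."""

    def __init__(self, memory_capacity):
        self.disk = {}               # stand-in for the persistent store
        self.memory = OrderedDict()  # hot subset, least recently used first
        self.memory_capacity = memory_capacity

    def put(self, key, value):
        self.disk[key] = value       # every write reaches the persistent store
        self._cache(key, value)

    def get(self, key):
        if key in self.memory:       # memory hit
            self.memory.move_to_end(key)
            return self.memory[key]
        value = self.disk[key]       # memory miss: fall back to disk
        self._cache(key, value)
        return value

    def _cache(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_capacity:
            self.memory.popitem(last=False)  # evict the least recently used entry

    def restart(self):
        """Simulate a reboot: memory is lost, the disk copy survives."""
        self.memory.clear()


store = MemoryCentricStore(memory_capacity=3)
for i in range(10):
    store.put(f"k{i}", i)

print(len(store.disk), len(store.memory))  # 10 3
store.restart()
print(store.get("k0"))                     # 0
```

After the simulated restart, reads are served from the disk copy right away and the in-memory subset warms up as keys are touched, which mirrors the "no waiting for a full reload" behavior described above.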

Finally, Apache Ignite features integrated, distributed machine learning and deep learning libraries that have been optimized for massively parallel processing. Each machine learning or deep learning algorithm can run locally against the data residing in memory on each node of the IMC cluster, which allows for the continuous updating of machine learning or deep learning models without impacting system performance, even at petabyte scale.
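As a rough illustration of running an algorithm where the data lives, the sketch below fits the slope of a one-variable linear model (y = w * x) from per-node summaries. It does not use Ignite's machine learning libraries; the `partitions` dict, the node names, and the sample data are hypothetical.

```python
# Each node holds its own rows of (x, y) pairs in memory.
partitions = {
    "node-1": [(1.0, 2.0), (2.0, 4.0)],
    "node-2": [(3.0, 6.0)],
    "node-3": [(4.0, 8.0), (5.0, 10.0)],
}


def local_stats(rows):
    """Runs on one node: reduce local rows to two small numbers."""
    sum_xy = sum(x * y for x, y in rows)
    sum_xx = sum(x * x for x, _ in rows)
    return sum_xy, sum_xx


def fit_slope(partitions):
    """Combine the per-node summaries; raw rows never leave their node."""
    stats = [local_stats(rows) for rows in partitions.values()]
    total_xy = sum(s[0] for s in stats)
    total_xx = sum(s[1] for s in stats)
    return total_xy / total_xx  # least-squares slope for y = w * x


slope = fit_slope(partitions)
print("slope:", slope)

# New data arrives on one node; refitting only needs fresh summaries.
partitions["node-2"].append((6.0, 12.0))
updated_slope = fit_slope(partitions)
print("updated slope:", updated_slope)
```

Only two numbers per node cross the network in this sketch, so the cost of refreshing the model grows with the number of nodes rather than with the number of rows.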

Apache Kafka

Apache Kafka is a streaming platform for publishing and subscribing to streams of records, storing the streams of records in a durable way, and processing streams of records as they occur. Apache Kafka is typically used to build real-time streaming data pipelines that reliably move data between systems or applications and to build real-time streaming applications that transform or react to the streams of data.

Consider an IoT platform that must ingest and analyze streams of sensor data. Apache Kafka could facilitate the movement and ingestion of this data. Kafka runs as a cluster on one or more servers that can span multiple datacenters. Users of GridGain (an in-memory computing platform built on Apache Ignite), for instance, can use the GridGain certified Kafka connector to integrate Kafka into an IMC architecture that can ingest and process massive streams of incoming data.
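The publish, store, and subscribe model can be sketched as an append-only log with a read position per consumer. This is a single-process toy, not the Kafka client API: the `ToyTopic` class, the consumer names, and the sensor records are hypothetical, and a real topic would also be partitioned and replicated across brokers.

```python
class ToyTopic:
    """An append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self.log = []      # records are kept after they are read
        self.offsets = {}  # consumer name -> next offset to read

    def publish(self, record):
        self.log.append(record)
        return len(self.log) - 1  # offset assigned to this record

    def poll(self, consumer, max_records=10):
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch


topic = ToyTopic()
for reading in [21.5, 21.7, 22.1]:
    topic.publish({"sensor": "s-1", "temp_c": reading})

# Two independent consumers each see the full stream.
analytics_batch = topic.poll("analytics")
archive_batch = topic.poll("archive", max_records=2)
print(len(analytics_batch), len(archive_batch))

# A later poll resumes from where that consumer left off.
topic.publish({"sensor": "s-1", "temp_c": 22.4})
print(topic.poll("archive"))
```

Because records stay in the log after being read, a slow or newly added consumer can still process the stream from its own offset without affecting the others.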

Apache Spark

Apache Spark is a unified, in-memory analytics engine for large-scale online analytical processing (OLAP). It is commonly used to derive insights from data stored in Hadoop. However, Spark does not provide shared storage, so an extract, transform, load (ETL) process must be used to load the data from Hadoop or other storage into Spark for processing. The Spark data sets must also be saved to a disk or memory-based storage medium in order to pass state between Spark jobs.

If the Spark RDDs or DataFrames are stored in memory or on disk, it is possible to add data to the data set after Spark jobs run, making the Spark data sets mutable. Open APIs enable Apache Spark and Apache Ignite to work together, allowing the systems to share data directly in memory, without having to store it to disk. The combination of Apache Spark and Apache Ignite also accelerates Spark SQL queries by as much as 1,000x because Apache Ignite supports SQL indexes, which Apache Spark does not.
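Why an index helps can be shown without either product. The sketch below compares a full scan with a lookup through a prebuilt index and counts the rows each approach touches. The table, the column names, and the use of a hash-style dict as the index are assumptions for illustration; the sketch says nothing about the size of the speedup in a real deployment.

```python
from collections import defaultdict

# A hypothetical table of 10,000 rows spread over 100 cities.
rows = [{"id": i, "city": f"city-{i % 100}"} for i in range(10_000)]


def scan(rows, city):
    """No index: every row must be examined."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row["city"] == city:
            matches.append(row)
    return matches, examined


def build_index(rows, column):
    """Build once: map each column value to the rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row)
    return index


def lookup(index, value):
    """With an index: only the matching rows are touched."""
    matches = index.get(value, [])
    return matches, len(matches)


city_index = build_index(rows, "city")
scan_matches, scan_examined = scan(rows, "city-7")
index_matches, index_examined = lookup(city_index, "city-7")

print("rows examined by scan:", scan_examined)
print("rows examined via index:", index_examined)
print("same result:", scan_matches == index_matches)
```

Both paths return the same rows, but the scan reads the whole table while the indexed lookup reads only the matches, and that gap widens as the table grows.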

Kubernetes

Kubernetes automates the deployment, scaling and management of containerized applications across a server cluster. It groups containers that make up an application into logical units for easy management and discovery, and it can take advantage of on-premises, hybrid, or public cloud infrastructure. Open APIs support an integration of Apache Ignite with Kubernetes that simplifies the deployment of an Ignite cluster in containers managed by Kubernetes. The integration allows Kubernetes to manage resources and scale the Ignite cluster. For example, if a user specifies that a containerized Apache Ignite cluster should maintain a minimum of five nodes, Kubernetes will automatically ensure this requirement is always met.
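The five-node example boils down to a reconciliation loop: compare the desired state with the observed state and act on the difference. The sketch below simulates that loop in plain Python. It does not call the Kubernetes API, and the `reconcile` and `start_node` functions and the `ignite-N` node names are hypothetical.

```python
import itertools


def reconcile(running_nodes, min_nodes, start_node):
    """Start nodes until at least min_nodes are running; return the names started."""
    started = []
    while len(running_nodes) < min_nodes:
        name = start_node()
        running_nodes.add(name)
        started.append(name)
    return started


_ids = itertools.count(1)


def start_node():
    """Stand-in for launching a container; returns a unique node name."""
    return f"ignite-{next(_ids)}"


running = set()
reconcile(running, 5, start_node)  # initial rollout to five nodes
print(sorted(running))

# Simulate two node failures, then let the loop repair the cluster.
running.discard("ignite-2")
running.discard("ignite-4")
restarted = reconcile(running, 5, start_node)
print("restarted:", restarted)
print("running now:", len(running))
```

The operator declares only the target of five nodes; the loop works out what needs to be started each time it observes a shortfall.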

It is now a given that organizations must move to in-memory computing to support the speed and scale required for their digital transformation and omnichannel customer engagement initiatives. By choosing the right combination of in-memory computing platform and related open source solutions, these organizations can ensure a simpler and more cost-effective environment for developing, deploying, and managing their applications.

This article is published as part of the IDG Contributor Network.