Big data in the application-centric cloud

Big data scales well in the application-centric cloud


Analytics and big data services built on Hadoop, Storm, and Spark are major building blocks of micro-services that run in either a VM or a container. These services need networking, storage, and compute to compose an application, and many of those applications must be extensible and scalable. Docker's vendor-neutral plugins and VM platforms expose APIs for provisioning and attaching storage, and VM instances can mount different shares and manage access through security controls.

When launching a micro-service, a VM typically mounts an existing share, while containers use new storage pools created by the container plugin. In both models, security is centralized through a trust model, and data is published or consumed through an object store or file system. The key is to use high-performance clusters with storage tiered by performance characteristics.

A common choice is Apache Spark, an in-memory data analysis engine; this contrasts with Hadoop, which uses a disk-based MapReduce paradigm. Spark's in-memory primitives deliver better performance than Hadoop's, but the choice between them depends on whether the user wants real-time or near-real-time analytics. Containers and VMs load data into memory, and each container or VM issues queries. You can also use Apache Storm, which operates on continuous streams of data.
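The performance difference between disk-based MapReduce and in-memory analysis can be sketched with a toy example. This is plain Python, not the Hadoop or Spark APIs: the "disk-based" function re-reads its input from disk on every query, while the "in-memory" version loads the data once and answers repeated queries from RAM, which is the essential reason Spark's cached datasets outperform repeated MapReduce jobs.

```python
import os
import tempfile
from collections import Counter

def disk_based_count(path):
    # Re-read the file from disk on every call, analogous to a
    # MapReduce job that writes and reads intermediate results on disk.
    with open(path) as f:
        return Counter(word for line in f for word in line.split())

def make_in_memory_counter(path):
    # Load the dataset into memory once, then serve every query from
    # RAM, analogous to querying a cached in-memory dataset.
    with open(path) as f:
        data = [word for line in f for word in line.split()]
    def query():
        return Counter(data)
    return query

# Usage: build a small dataset, then query it both ways.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("spark storm hadoop spark\n")
    path = f.name

print(disk_based_count(path)["spark"])   # 2 (hits disk each time)
in_memory = make_in_memory_counter(path)
print(in_memory()["spark"])              # 2 (served from memory)
os.unlink(path)
```

Real-time versus near-time then becomes a question of how often queries repeat over the same data: the more they do, the more the one-time load cost amortizes.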

Cluster management of data is isolated from the container, and it can take many forms: the Hadoop Distributed File System (HDFS), a NoSQL database, an object storage system, or Amazon S3. Apache recommends running Spark on a Mesos cluster. Multiple threads are available in each model, and Spark schedules one task per core. Spark can use a master/slave relationship to spread work across the cluster: start a master in a single container or VM, then deploy slaves on all the other nodes. If you deploy on Mesos, it manages the master and slave relationships for you.
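The master/slave pattern and the one-task-per-core rule can be sketched in miniature with Python's standard multiprocessing module. This is a hypothetical illustration, not Spark's actual API: a "master" function partitions a dataset and hands one partition to each "slave" worker, with one worker per CPU core by default.

```python
import multiprocessing as mp

def count_partition(words):
    # A slave/worker task: process its own partition of the data.
    return len(words)

def run_master(dataset, num_workers=None):
    # Default to one worker per core, mirroring one task per core.
    num_workers = num_workers or mp.cpu_count()
    # The master splits the dataset into one partition per worker.
    partitions = [dataset[i::num_workers] for i in range(num_workers)]
    with mp.Pool(num_workers) as pool:
        # Distribute partitions to workers and aggregate the results,
        # as a cluster master schedules tasks and collects output.
        return sum(pool.map(count_partition, partitions))

if __name__ == "__main__":
    data = ["sku-%d" % i for i in range(1000)]
    print(run_master(data))  # 1000
```

A resource manager such as Mesos plays the role that `mp.Pool` plays here, deciding where workers run and restarting them on failure, so the application never manages nodes directly.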

Apache Storm is a real-time computation system that makes it easy to reliably process unbounded streams of data. Storm can process over a million tuples per second per node, and it is both scalable and fault tolerant. A combination of Spark and Storm thus gives the user the ability to perform analytics at an extremely fast rate.
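Storm's processing model pairs spouts, which emit tuples from a source, with bolts, which transform or aggregate them one tuple at a time. The generator pipeline below is a toy sketch of that model in plain Python (not the Storm API): results update continuously as each tuple arrives, rather than after a batch completes.

```python
from collections import Counter

def spout(events):
    # In real Storm, a spout would pull from a queue or message bus;
    # here it just emits one tuple at a time from a list.
    for event in events:
        yield event

def split_bolt(stream):
    # A stateless bolt: split each sentence tuple into word tuples.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # A stateful bolt: emit a running count for each word as it flows
    # through, so totals are available at any moment in the stream.
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

# Usage: wire the topology together and consume the stream.
events = ["checkout started", "checkout completed", "checkout started"]
for word, running_total in count_bolt(split_bolt(spout(events))):
    print(word, running_total)
```

Because the stream is unbounded in real deployments, the stateful bolt never "finishes"; consumers simply read the latest running totals, which is what distinguishes this style from a batch MapReduce job.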

Running large-scale multi-tenant Storm, Spark, or Hadoop clusters requires a multi-level security strategy. Every container must carry security metadata that grants access to data based on the policy established for each field within the data context. For streaming data, security is more difficult because the stream is unbounded, and many companies are working on security for this type of data.
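Field-level access based on security metadata can be sketched as a simple redaction step. This is a hypothetical illustration, not any particular product's security model: the clearance levels, policy names, and `redact` helper are all made up for the example. Each record carries a per-field policy, and a container's clearance determines which fields it may read.

```python
# Illustrative clearance levels; real systems define their own.
CLEARANCE = {"public": 0, "internal": 1, "restricted": 2}

def redact(record, field_policy, container_clearance):
    # Return only the fields that the caller's clearance permits;
    # fields with no declared policy default to "restricted".
    allowed = CLEARANCE[container_clearance]
    return {
        field: value
        for field, value in record.items()
        if CLEARANCE[field_policy.get(field, "restricted")] <= allowed
    }

record = {"sku": "A-100", "price": 9.99, "cost": 4.10}
policy = {"sku": "public", "price": "internal", "cost": "restricted"}

print(redact(record, policy, "internal"))
# {'sku': 'A-100', 'price': 9.99}
```

In a multi-tenant cluster, a check like this would run wherever a container reads shared data, so each tenant sees only the fields its security metadata allows. Applying the same idea to an unbounded stream means evaluating the policy per tuple, which is part of why streaming security is harder.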

An example is a major retailer that uses Storm to analyze and process data simultaneously, writing results into HDFS for further analysis by other programs. This runs in containers in an application-centric cloud. The retailer serves multiple users with these technologies and has gained what it believes is a competitive edge. As volume increases, the retailer spins up more containers running these micro-services. Customers now have a choice of flexible, automated, and elastic virtual environments for big data, and the application-centric cloud provides a platform that integrates into the modern data center.

This article is published as part of the IDG Contributor Network.