How to run Cassandra and Kubernetes together

Kubernetes runs distributed applications, and Apache Cassandra provides a distributed database environment. Here’s how you can run them together.

Containers have become increasingly popular for developers who want to deploy applications in the cloud. To manage these new applications, Kubernetes has become a de facto standard for container orchestration. Kubernetes enables developers to build distributed applications that scale elastically, depending on demand.

Kubernetes was developed to effortlessly deploy, scale, and manage stateless application workloads in production. When it comes to stateful, cloud-native data, there has been a need for the same ease of deployment and scale.

Among distributed databases, Cassandra appeals to developers who know they will have to scale out their data. It provides a fault-tolerant database and data management approach that can run the same way across multiple locations and cloud services. Because all nodes in Cassandra are equal, and each node is capable of handling read and write requests, there is no single point of failure in the Cassandra model. Data is automatically replicated between failure zones so that the loss of a single instance does not affect the application.

Connecting Cassandra to Kubernetes

The logical next step is to use Cassandra and Kubernetes together. After all, getting a distributed database to run along with a distributed application environment makes it easier to have data and application operations take place close to each other. Not only does this reduce latency, it can help improve performance at scale.

Doing this, however, means understanding which system is in charge. Cassandra already has the kind of fault tolerance and node placement that Kubernetes can deliver, so it must be clear which system makes those decisions. This is achieved by using a Kubernetes operator.

Operators automate the process of deploying and managing more complex applications that require domain-specific information and need to interact with external systems. Until operators were developed, stateful application components like database instances led to extra responsibilities for devops teams, as they had to undertake manual work to get their instances prepared and run in a stateful way.

There are multiple operators for Cassandra that have been developed by the Cassandra community. For this example, we’ll use cass-operator, which was put together and open-sourced by DataStax. It supports open-source Kubernetes, Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), and Pivotal Container Service (PKS), so you can use the Kubernetes service that best suits your environment.

Installing cass-operator on your own Kubernetes cluster is a simple process if you have basic knowledge of running a Kubernetes cluster. Once you have authenticated to your Kubernetes cluster (whether open-source Kubernetes, GKE, EKS, or PKS) using kubectl, the Kubernetes command-line tool, and connected it to your local machine, you can start applying cass-operator configuration YAML files to your cluster.
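Before applying any YAML, it can help to confirm that kubectl is pointed at the cluster you intend to use. The following is a minimal sketch, not part of cass-operator itself; the helper name check_cluster is hypothetical, and it only wraps standard kubectl subcommands.

```shell
# Sanity-check the kubectl connection before applying cass-operator YAML.
# check_cluster is a hypothetical helper name used for illustration.
check_cluster() {
    # Show which kubeconfig context kubectl will talk to.
    kubectl config current-context || return 1
    # Confirm that the Kubernetes API server is reachable.
    kubectl cluster-info || return 1
    # List the worker nodes available to schedule Cassandra pods on.
    kubectl get nodes
}
```

Running check_cluster should print the active context, the API server address, and the node list. If any step fails, the later kubectl commands in this article will fail too.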

Setting up your cass-operator definitions

The next stage is applying the definitions for the cass-operator manifest, storage class, and data center to the Kubernetes cluster.

A quick note on the data center definition: it is based on the definitions used in Cassandra rather than referring to a physical data center.

The hierarchy for this is as follows:

  • A node refers to a computer system running an instance of Cassandra. A node can be a physical host, a machine instance in the cloud, or even a Docker container.
  • A rack refers to a set of Cassandra nodes near one another. A rack can be a physical rack containing nodes connected to a common network switch. In cloud deployments, however, a rack often refers to a collection of machine instances running in the same availability zone.
  • A data center refers to a collection of logical racks, generally residing in the same building and connected by a reliable network. In cloud deployments, data centers generally map to a cloud region.
  • A cluster refers to a collection of data centers that support the same application. Cassandra clusters can run in a single cloud environment or physical data center, or be distributed across multiple locations for greater resiliency and reduced latency.
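To see how this hierarchy shows up in Kubernetes, the sketch below builds the pod name for a single Cassandra node from its cluster, data center, and rack. The naming pattern and the rack name "default" are assumptions based on how cass-operator appears to name its StatefulSet pods; verify them against your own cluster. The function name cassandra_pod_name is hypothetical.

```shell
# Build the pod name for one Cassandra node from its place in the hierarchy.
# Assumed pattern (not confirmed by this article):
#   <cluster>-<datacenter>-<rack>-sts-<ordinal>
cassandra_pod_name() {
    cluster=$1; datacenter=$2; rack=$3; ordinal=$4
    printf '%s-%s-%s-sts-%s\n' "$cluster" "$datacenter" "$rack" "$ordinal"
}

# First node of data center dc1 in cluster1, on an assumed rack named "default".
cassandra_pod_name cluster1 dc1 default 0   # prints cluster1-dc1-default-sts-0
```

Under this assumption, each rack maps to one StatefulSet, and the ordinal identifies a node within that rack.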

Now that we have confirmed our naming conventions, it’s time to set up definitions. Our example uses GKE, but the process is similar for other Kubernetes engines. There are three steps.

Step 1

First, we need to run a kubectl command which references a YAML config file. This applies the cass-operator manifest’s definitions to the connected Kubernetes cluster. Manifests are API object descriptions, which describe the desired state of the object, in this case, your Cassandra operator. For a complete set of version-specific manifests, see this GitHub page.

Here’s an example kubectl command for GKE cloud running Kubernetes 1.16:

kubectl create -f https://raw.githubusercontent.com/datastax/cass-operator/v1.3.0/docs/user/cass-operator-manifests-v1.16.yaml
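Once the manifest has been created, you may want to check that the operator actually started. This is a sketch under two assumptions that the article does not state: that the manifest creates a namespace called cass-operator, and that it registers a custom resource definition named cassandradatacenters.cassandra.datastax.com. The helper name operator_status is hypothetical.

```shell
# Namespace assumed to be created by the cass-operator manifest.
OPERATOR_NAMESPACE="cass-operator"

# operator_status is a hypothetical helper name used for illustration.
operator_status() {
    # The operator pod should eventually report a Running status.
    kubectl -n "$OPERATOR_NAMESPACE" get pods || return 1
    # The CassandraDatacenter resource type must exist before Step 3
    # (CRD name is an assumption).
    kubectl get crd cassandradatacenters.cassandra.datastax.com
}
```

If operator_status lists no pods, the namespace name is the first thing to check; kubectl get namespaces shows what the manifest created.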

Step 2

The next kubectl command applies a YAML configuration that defines the storage settings to use for Cassandra nodes in a cluster. Kubernetes uses the StorageClass resource as an abstraction layer between pods needing persistent storage and the physical storage resources that a specific Kubernetes cluster can provide. The example uses SSD as the storage type. For more options, see this GitHub page. The YAML applied in the storage configuration is shown below:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: server-storage
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: none
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
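One way to apply this definition is to save the YAML above to a local file and hand it to kubectl. The sketch below assumes a hypothetical file name, storage.yaml, and a hypothetical helper name, apply_storage_class. It uses kubectl apply rather than kubectl create so that it can be re-run safely.

```shell
# Apply the StorageClass and confirm it exists.
# storage.yaml is a hypothetical local file holding the YAML shown above.
apply_storage_class() {
    file=${1:-storage.yaml}
    kubectl apply -f "$file" || return 1
    # StorageClass objects are cluster-scoped, so no namespace flag is needed.
    kubectl get storageclass server-storage
}
```

Calling apply_storage_class with no argument uses storage.yaml in the current directory; pass a different path as the first argument if you saved the file elsewhere.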

Step 3

Finally, using kubectl again, we apply YAML that defines our Cassandra Datacenter.

# Sized to work on 3 k8s workers nodes with 1 core / 4 GB RAM
# See neighboring example-cassdc-full.yaml for docs for each parameter
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: "3.11.6"
  managementApiAuth:
    insecure: {}
  size: 3
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: server-storage
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
  config:
    cassandra-yaml:
      authenticator: org.apache.cassandra.auth.PasswordAuthenticator
      authorizer: org.apache.cassandra.auth.CassandraAuthorizer
      role_manager: org.apache.cassandra.auth.CassandraRoleManager
    jvm-options:
      initial_heap_size: "800M"
      max_heap_size: "800M"

This example YAML is for an open-source Apache Cassandra 3.11.6 image, with three nodes on one rack, in the Kubernetes cluster. Here’s the direct link. There is a complete set of database-specific datacenter configurations on this GitHub page.
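The article does not show the kubectl command for this step, so here is a minimal sketch. It assumes the YAML above has been saved to a hypothetical local file, cassdc.yaml, and that the datacenter should live in the cass-operator namespace, which is itself an assumption about what the operator manifest creates. The helper name apply_datacenter is hypothetical.

```shell
# Apply the CassandraDatacenter definition.
# cassdc.yaml is a hypothetical local file holding the YAML shown above;
# the cass-operator namespace is an assumption.
apply_datacenter() {
    file=${1:-cassdc.yaml}
    kubectl -n cass-operator apply -f "$file"
}
```

After apply_datacenter returns, the operator should begin creating the three Cassandra pods requested by size: 3. This can take several minutes.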

At this point, you will be able to look at the resources that you’ve created. These will be visible in your cloud console. In the Google Cloud Console, for example, you can click on the Clusters tab to see what is running and look at the workloads, which are deployable computing units that can be created and managed in the Kubernetes cluster.
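The same resources can also be inspected from the command line. The sketch below makes two assumptions: that everything was created in a namespace called cass-operator, and that the CassandraDatacenter resource exposes a status field named cassandraOperatorProgress. Both helper names are hypothetical.

```shell
# List the Kubernetes objects backing the Cassandra datacenter.
# The cass-operator namespace is an assumption.
list_cassandra_resources() {
    kubectl -n cass-operator get pods,statefulsets,services,persistentvolumeclaims
}

# Print the operator's progress for datacenter dc1.
# The status field name is an assumption; inspect the full object with
# "kubectl -n cass-operator get cassandradatacenter dc1 -o yaml" if it is empty.
datacenter_progress() {
    kubectl -n cass-operator get cassandradatacenter dc1 \
        -o "jsonpath={.status.cassandraOperatorProgress}"
}
```

If the field exists as assumed, datacenter_progress should print a short status word that changes once all three nodes have started.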

To connect to the deployed Cassandra database itself, you can use cqlsh, the Cassandra command-line shell, and query Cassandra using CQL from within your Kubernetes cluster. Once authenticated, you will be able to submit DDL commands to create or alter tables, and manipulate data with DML instructions such as insert and update.
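As a rough sketch of what that connection might look like, the helpers below read superuser credentials from a Kubernetes secret and run one CQL statement with cqlsh inside a Cassandra pod. Several names here are assumptions rather than facts from this article: the secret cluster1-superuser, the pod cluster1-dc1-default-sts-0, the container name cassandra, and the cass-operator namespace. The helper names are hypothetical.

```shell
# Read one field (username or password) from the superuser secret.
# The secret name and namespace are assumptions.
superuser_field() {
    kubectl -n cass-operator get secret cluster1-superuser \
        -o "jsonpath={.data.$1}" | base64 --decode
}

# Run a single CQL statement with cqlsh inside the first Cassandra pod.
# The pod and container names are assumptions.
run_cql() {
    user=$(superuser_field username) || return 1
    pass=$(superuser_field password) || return 1
    kubectl -n cass-operator exec cluster1-dc1-default-sts-0 -c cassandra -- \
        cqlsh -u "$user" -p "$pass" -e "$1"
}
```

For example, run_cql "SELECT release_version FROM system.local;" should print the Cassandra version if the connection works. DDL and DML statements such as CREATE TABLE or INSERT can be passed the same way.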

What’s next for Cassandra and Kubernetes?

While there are several operators available for Apache Cassandra, there has been a need for a common operator. Companies involved in the Cassandra community, such as Sky, Orange, DataStax, and Instaclustr, are collaborating to establish a common operator for Apache Cassandra on Kubernetes. This collaboration effort goes alongside the existing open-source operators, and the aim is to provide enterprises and users with a consistent scale-out stack for compute and data.

Over time, the move to cloud-native applications will have to be supported with cloud-native data as well. This will rely on more automation, driven by tools like Kubernetes. By using Kubernetes and Cassandra together, you can make your approach to data cloud-native.

To learn more about Cassandra and Kubernetes, please visit https://www.datastax.com/dev/kubernetes. For more information on running Cassandra in the cloud, check out DataStax Astra.

Patrick McFadin is the VP of developer relations at DataStax, where he leads a team devoted to making users of Apache Cassandra successful. He has also worked as chief evangelist for Apache Cassandra and consultant for DataStax, where he helped build some of the largest and most exciting deployments in production. Prior to DataStax, he was chief architect at Hobsons and an Oracle DBA/developer for over 15 years.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2020 IDG Communications, Inc.
