How to run stateful applications on Kubernetes

Take advantage of Portworx PX-Enterprise to simplify management of data-rich workloads on Kubernetes


Kubernetes has many core abstractions, sometimes called primitives, that make the experience of deploying and managing applications so much better than what came before. Understanding these abstractions helps you take full advantage of Kubernetes and also avoid complexity—especially when running stateful applications like databases, data analytics, big data applications, streaming engines, machine learning, and AI apps.

In this article, I’ll review some of the fundamental abstractions in Kubernetes storage, and walk through how Portworx PX-Enterprise helps solve important challenges that arise with the need for persistent storage in Kubernetes.

Kubernetes abstractions and Kubernetes storage

The Pod is a great example of a core Kubernetes abstraction. It’s actually the first example—the starting point.

Back in 2015, other container orchestration systems started with a single container as the fundamental abstraction; Kubernetes started with Pods. A Pod is a group of one or more containers that need to run together to be useful. One simple analogy is that a Pod is like an outfit of clothing. It’s great to have a shirt and socks on, but let’s not walk out the door without pants!

Pods are like that: they let us focus on what’s needed to be useful (a running outfit) and not overload us with bookkeeping minutiae (a shoelace, one sock). Don’t get me wrong, the minutiae are still being tracked by the scheduler and the kubelet (the Kubernetes node agent). But it’s this abstraction that allows the ecosystem to build on Kubernetes and administrators to automate their infrastructure. And today, we see that most other schedulers have adopted the Pod concept, a sure sign of its usefulness.
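
To make the "outfit" idea concrete, here is a minimal sketch of a Pod with two containers that only make sense together. Everything in it is an assumption for illustration (the Pod name, the images, the log path); it is not taken from a Portworx or PostgreSQL setup.

```yaml
# Hypothetical two-container Pod; names and images are illustrative
apiVersion: v1
kind: Pod
metadata:
  name: web-with-log-shipper
spec:
  containers:
  - name: web
    image: nginx:1.25
    volumeMounts:
    - name: logs
      mountPath: /var/log/nginx
  - name: log-shipper
    image: busybox:1.36
    # Follow the access log written by the web container
    command: ["sh", "-c", "tail -F /var/log/nginx/access.log"]
    volumeMounts:
    - name: logs
      mountPath: /var/log/nginx
  volumes:
  # Scratch volume shared by both containers; it lives only as long as the Pod
  - name: logs
    emptyDir: {}
```

Both containers are scheduled onto the same node and share the scratch volume, which is the kind of bookkeeping the Pod abstraction hides.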

The world of storage in Kubernetes has its primitives too, some of which may sound complex at first glance. These abstractions come together to take a complex problem—how to schedule efficiently when application demand is unpredictable—and provide a reliable solution. At the end of the day, you wouldn’t want to run in production without these abstractions.

Here are the Kubernetes abstractions that describe and control storage:

  • PersistentVolume (PV) – the representation for where data is held. Your infrastructure provider or storage vendor implements this. PVs are what you protect through standard means like backup, replication, and encryption.
  • PersistentVolumeClaim (PVC) – how a Pod requests a PersistentVolume, including describing the size of PV needed. After the request, the PVC becomes the reference between a Pod and its PersistentVolume.
    • Now, you might ask, why not skip PVCs and have Pods directly use PersistentVolume? Without a PVC concept, applications would be less portable, which we’ll explain later.
  • StorageClass (SC) – describes the types of storage that your infrastructure offers. For example, your provider may offer two flavors: fast SSD with encryption and slow HDD without encryption.
    • As with PVCs, you might ask why this is needed. Again, these abstractions help with portability and let administrators prevent abuse from sloppy applications. We’ll also explain this point below.

Here are the Kubernetes abstractions that describe and control applications:

  • Pod – one or more containers that run on the same server, work together, and together form a basic unit of work.
  • Deployment – a controller that ensures that the desired number of application Pods are running and that manages the Pods’ lifecycle. A lifecycle event might be adding more Pods or updating the version. A Pod definition is included within the written specification of a Deployment.
    • A common question for customers is when to use a Deployment versus a StatefulSet. This is a good question that we’ll expand upon.
  • StatefulSet – manages the entirety of the database, instead of individual Pods and their PVCs.
    • It’s important to remember that a horizontally scaling database, like Cassandra, runs with multiple Pods that work together. With a StatefulSet, you don’t have to think about how each Cassandra node (instance) relates to another node. Kubernetes does that for you.
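
As a rough illustration of the StatefulSet bullet above, a Cassandra StatefulSet might look like the sketch below. The image tag, replica count, volume size, and the StorageClass name example-storage-class are all assumptions, and a headless Service named cassandra (not shown) is assumed to exist. It is written against the stable apps/v1 API.

```yaml
# Hypothetical sketch; names, image tag, and sizes are illustrative
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  # Headless Service (not shown) that gives each Pod a stable DNS name
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3.11
        ports:
        - containerPort: 9042
          name: cql
        volumeMounts:
        - name: cassandra-data
          mountPath: /var/lib/cassandra
  # Kubernetes creates one PVC per Pod from this template
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: example-storage-class  # hypothetical StorageClass
      resources:
        requests:
          storage: 10Gi
```

The Pods get stable names (cassandra-0, cassandra-1, cassandra-2), and each keeps its own PVC across restarts, which is why the per-node wiring does not have to be written by hand.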

These are the fundamental Kubernetes primitives that enable portability and scalability. There is a subtle and powerful beauty to how these abstractions work together. Since StatefulSets wrap a lot of the underlying primitives, let’s start with a more basic example using PostgreSQL. Then we will directly touch on some of the primitives, starting with a PVC, and build upwards.

Deploying PostgreSQL on Kubernetes

Customers love how Kubernetes manages applications on their infrastructure. As we walk through an example with PostgreSQL, we see that the Kubernetes primitives were designed to be as portable as possible. Even before we try to equate portability with multi-cloud, we see that portability means that the proper primitives enable apps to run, re-run, and re-re-run across servers.

Portability and robustness are thus two sides of the same coin, which makes sense if we think about it. Apps have to be portable across servers if apps are to survive failures.

Back to our PostgreSQL example. The Pod identifies the container image and a PVC. Here, we will run the PostgreSQL database, so the container image is for version 10.1 of PostgreSQL. The Pod is written as a section within a Deployment specification in this example. Had we chosen Cassandra, we would have written our Pod as part of a StatefulSet.

The Deployment not only holds a Pod definition, but also allows us to make updates to that Pod as it runs. Let’s look at all of this within the Deployment specification. I’ve added comments for explanation.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
 name: postgres
spec:
 template:
   # Pod definition portion of this deployment specification
   metadata:
     labels:
       app: postgres
   spec:
     # Container to use the application image PostgreSQL 10.1
     containers:
     - image: "postgres:10.1"
       name: postgres
       envFrom:
       - configMapRef:
           name: example-config
       ports:
       - containerPort: 5432
         name: postgres
       volumeMounts:
       # Mount the volume named postgres-data, defined under volumes below
       - name: postgres-data
         # Container sees itself as writing to the directory below
         mountPath: /var/lib/postgresql/data
     volumes:
     # Volume backed by the PVC; available to any container in this Pod spec
     - name: postgres-data
       persistentVolumeClaim:
         claimName: postgres-data-claim

In the above example, the Pod knows about its PVC but does not—and need not—know about the PersistentVolume. This part may feel a little roundabout, so bear with me. The PVC requests an amount of storage capacity and the type of storage to use. The PVC looks like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # Name that the Deployment references in claimName
  name: postgres-data-claim
  # Create a Persistent Volume using this Storage Class definition
  annotations:
    volume.beta.kubernetes.io/storage-class: px-postgres-sc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    # Create a Persistent Volume with 5 GiB of storage capacity
    requests:
      storage: 5Gi
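
The annotation used in the claim is the older, beta way of naming a StorageClass. On clusters that support the storageClassName field (Kubernetes 1.6 and later), the same request can be sketched as follows. The claim name and class name are the ones used elsewhere in this article; treat the rest as an illustrative equivalent rather than the author's original file.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-claim
spec:
  # Replaces the volume.beta.kubernetes.io/storage-class annotation
  storageClassName: px-postgres-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```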

Up to this point, the application owner has been describing their app and requirements. Now, the infrastructure administrator gets involved by defining the types of storage available by publishing StorageClasses.

It’s important to separate the concerns that the application is addressing from those of the infrastructure: the infrastructure admin needs to define what is sustainable in a shared cluster. Without such primitives, applications could trash each other—a problem that some Kubernetes alternatives are susceptible to.

In the StorageClass specification below, all PVCs that specify this StorageClass will have replication and encryption and will be configured for database I/O workloads. StorageClasses are storage provider specific. Under the covers, the vendor who provides the PersistentVolume implements these features.

apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: px-postgres-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  # Replicate three copies using Portworx
  repl: "3"
  # Tune the I/O for the volume for databases
  io_profile: "db"
  # Encrypt the data using a key from Key Management System
  secure: "true"

Running PostgreSQL on Kubernetes

Now to install all of the above primitives, the administrator starts with the StorageClass. Typically, administrators will design several StorageClasses, allowing for tradeoffs between what different apps require and what the infrastructure can support.
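
As an example of such a tradeoff, an administrator might publish a second, cheaper class next to the database class. The sketch below is hypothetical: the name px-scratch-sc and the choice of a single replica without encryption are assumptions for illustration, using only the repl parameter that appears in the database class. It is written against the stable storage.k8s.io/v1 API.

```yaml
# Hypothetical second class for non-critical workloads; name and values are illustrative
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-scratch-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  # Keep a single copy; cheaper, but the volume does not survive the loss of its node
  repl: "1"
```

An application owner would then pick one class or the other in the PVC, without needing to know how either is implemented.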

To publish the first StorageClass, the administrator runs the following command with the corresponding YAML file:

$ kubectl create -f px-storage-class.yaml
storageclass.storage.k8s.io "px-postgres-sc" created

The application owner can now use storage as defined by the StorageClass. An application Pod will use a PVC to request storage. Since this is a new application, a new PersistentVolume will be created that satisfies the PVC. To create the PVC, we run the following command with our PVC file:

$ kubectl create -f pvc.yaml
persistentvolumeclaim "postgres-data-claim" created
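
One more object has to exist before the Pod can start: the Deployment reads its environment from a ConfigMap named example-config, which is not shown in this article. A minimal sketch might look like the following. The variable names are the ones the official postgres image reads; the values are placeholders, and a real password belongs in a Secret rather than a ConfigMap.

```yaml
# Hypothetical ConfigMap; the article does not show its contents
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config
data:
  POSTGRES_DB: exampledb
  POSTGRES_USER: exampleuser
  # Illustration only: a real password belongs in a Secret, not a ConfigMap
  POSTGRES_PASSWORD: change-me
  # Keep the data files in a subdirectory of the mounted volume
  PGDATA: /var/lib/postgresql/data/pgdata
```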

Now we are ready to deploy the PostgreSQL database. Our database will run as a Pod with a PostgreSQL container inside it. Since the Pod was defined within a Deployment specification, we simply create all of this by running the command on the Deployment YAML file:

$ kubectl create -f postgres-deployment.yaml
deployment.extensions "postgres" created

We can look back to see what was created. First, we ask for all PVCs by running the following command. Below, we see that the PVC we created is in a Bound status, meaning that it has been matched to a PersistentVolume. In other words, the storage primitives are ready for use.

$ kubectl get pvc
NAME          STATUS   VOLUME     CAPACITY   STORAGE CLASS   AGE
postgres...   Bound    pvc-3...   5Gi        px-...          17s

Next, we can look at the Pod that is running our PostgreSQL container. Below, we see that our PostgreSQL Pod is ready to serve requests.

$ kubectl get pods
NAME                     READY     STATUS    RESTARTS   AGE
postgres-dff54d66d...    1/1       Running   0          6s

Taking a step back, we see that we created all of the primitives needed to run a database. More than that, we standardized on the storage our infrastructure offers to applications, which improves the experience for all subsequent Pods. And we can now control the database Pod using our Deployment object, automating parts of the upgrade process.
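
As a sketch of what such an upgrade could look like, the fragment below changes two fields of the postgres Deployment. The newer image tag is a hypothetical example, and the Recreate strategy is an assumption on my part: with a single ReadWriteOnce volume, it seems safer to stop the old Pod before the new one starts than to run both side by side during a rolling update.

```yaml
# Fragment of the postgres Deployment spec; only the changed fields are shown
spec:
  # Assumption: stop the old Pod before starting the new one, so that two
  # PostgreSQL processes never open the same data directory
  strategy:
    type: Recreate
  template:
    spec:
      containers:
      - name: postgres
        # Hypothetical newer patch release
        image: "postgres:10.2"
```

After the updated specification is submitted, the Deployment controller replaces the Pod, and the new Pod mounts the same PVC, so the data carries over.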

The entire stack and set of primitives are shown in the figure below.

[Figure: The Portworx, Kubernetes, and PostgreSQL stack. Credit: Portworx]

The net result is that we have the language for expressing our desired state in production. We have Kubernetes that manages the applications to meet that intent. And we have a way to share our storage infrastructure with other applications.

How Portworx addresses Kubernetes storage challenges

We just walked through the deployment of a stateful service using Kubernetes. Handling data-rich workloads in production raises further requirements. How do we resize a PersistentVolume? How do we encrypt microservices data while keeping the portability benefits? Let’s look at the topics that matter in production.

The Kubernetes primitives are powerful because each application can now scale out (handle more requests) easily and independently of other applications. Moreover, you can update particular components with similar fine-grained control. But as with any application platform, you need an infrastructure that supports this flexibility. For stateful workloads that seek the benefits of Kubernetes, there are a number of common (and vendor neutral) gaps in the storage infrastructure that present challenges.

Infrastructure limitations that impact stateful workloads on Kubernetes:

  • Application discrimination — isolate and tune I/O behavior based on applications, especially as servers are now shared among apps. Example: control over when Elasticsearch deletes or Cassandra compacts.
  • Clustered operations — allow for data access as Pods scale out across servers and across availability zones (when in public clouds). Oftentimes, the slowness of accessing storage becomes the concern.
  • Monitoring and visibility — understanding performance as applications share infrastructure. Here, we can benefit from how labels can be used to tag across Pods and then down to disks.
  • Protection — ensuring that backup, snapshots, and data protection mechanisms handle applications that are now dozens of Pods instead of a few large VMs.
  • Portability — helping teams move their data as they move their compute either for development clusters getting promoted to test environments or for multi-cloud workloads.

There are many ways storage and infrastructure solutions can make Kubernetes the best way to run stateful applications. At Portworx, we have been working on making stateful workloads as easy and resilient as stateless workloads with Kubernetes.

Below are some of the ways we have been investing in that.

Microservices first

Unlike past enterprise storage systems, PX-Enterprise is designed from the ground up for microservice applications. Much of our work has gone into extending the experience for Kubernetes users, and Portworx itself can be installed, extended, and controlled using Kubernetes.
