Cassandra and DataStax reunited — and just in time

Past the ‘breakup,’ DataStax once again has an important role to play in the Apache Cassandra project


Apache Cassandra is one of the world’s most popular databases, but for years it was plagued by a somewhat fractured community. DataStax, long a driving force within the Cassandra world, at one time seemed to abdicate its leadership role, apparently leaving the project in disarray.

Except that it didn’t. Didn’t leave, that is, and the project wasn’t in disarray. Not really.

Even as DataStax pulled back a bit in response to criticism from the Apache Software Foundation (ASF), companies that depend on Cassandra, like Apple and Netflix, stepped up to take on more leadership within the Cassandra community. Today, as we near the Cassandra 4.0 release, there’s an argument to be made that the Cassandra code and community are in better shape than they have ever been, with DataStax once again filling an important role for Cassandra.

A house divided

While single-vendor open source projects are somewhat common, they’re verboten for ASF projects. This became an issue for Cassandra, given that years ago DataStax may have contributed as much as 85 percent of the Cassandra code, by one estimate, while also running a community content forum (Planet Cassandra), Cassandra events, and more. This led to ASF accusations that DataStax exercised (or had the potential to exercise) undue influence over Cassandra. In response, DataStax pulled back, leaving the Cassandra community to fend for itself.

This didn’t dissuade companies from continuing to bet big on Cassandra. Apple, for example, had long embraced the highly scalable, high-performance distributed database, as I wrote in 2015. While Apple is famously cagey about sharing how it uses technology, we do know that the company runs 150,000 Cassandra instances, processes tens of millions of queries per second, and stores hundreds of petabytes. With such a big investment in Cassandra, Apple couldn’t afford to let it fail, so Apple worked hard to ensure that stability dramatically improved from the Cassandra 3.11 release to today’s Cassandra 4.0 release.

In this, Apple didn’t act alone.

According to Aaron Morton in 2018 (when Morton was CEO and co-founder of The Last Pickle, a Cassandra consultancy that was recently acquired by DataStax), the heavy emphasis on stabilizing Cassandra prompted more users of the project to step up and pitch in:

“No doubt DataStax taking a lower profile was challenging. Ultimately though it resulted in a more diverse community as others stepped in to fill the gaps. Nate McCall, my co-founder from The Last Pickle, was elected the PMC chair [replacing DataStax’s Jonathan Ellis] and with a lot of help from the PMC worked to expand the list of committers and encouraged companies that rely on Cassandra to contribute more. In addition we are still getting important contributions from large companies such as Netflix, Uber, and Instagram.”

Even as different companies and individuals joined in, they weren’t always “rowing” in the same direction. For example, instead of one generalized Kubernetes operator for Cassandra that a variety of companies contribute to and improve, there are multiple such operators (from Sky, Orange, Instaclustr, and others). Other companies, like Instagram, forked Cassandra (“Rocksandra”). None of this activity is “bad,” per se, but it tended to blur the definition of “what Cassandra is” and spread innovation energy in diverse directions.

Which brings us back to DataStax.

Returning to the fold

Today there’s a big need for someone to help rally the Cassandra contributors around common goals. Cassandra leadership and core maintainers like Nate McCall have done a fabulous job of moving mountains to ensure the Cassandra 4.0 release (currently in beta and expected to be officially released in the second quarter of 2020) delivers on stability promises made years ago. There are other needs now, and perhaps DataStax is well-positioned to fill those needs, particularly in light of new leadership that has emphasized a renewed focus on contributions to Cassandra.

For example, while there have been good reasons for Cassandra forks to emerge, no company really wants to maintain a fork. (It’s a drain on resources even as the main branch of an open source project continues on.) Greater emphasis on building pluggability into Cassandra would remove the need for such forks. With a full-time focus on Cassandra, DataStax, working with others, can help to modularize the Cassandra code to make its architecture more pluggable. A pluggable storage engine (instead of a forkable one) would be a big advance for Cassandra. This is a non-trivial task, and not something that any single developer could do in their spare time.

In like manner, Cassandra needs a generalized Kubernetes operator (to make it easier to deploy Cassandra clusters with Kubernetes). Again, this is non-trivial work, but it’s also important because it would serve to align diverse perspectives into one project rather than diffusing them throughout several. This would be a good opportunity for DataStax, complementing the work it’s already doing to improve Cassandra documentation, test the 4.0 release, and so on.

This isn’t to suggest the Cassandra community needs DataStax to assume the role of hegemonic contributor. No, if the past few years have taught us anything, it’s that many companies are capable of contributing real value to Cassandra. With that said, more work is required in focusing such efforts on common needs, concentrating rather than diffusing resources. This feels like a great way for DataStax to resume its leadership role within Cassandra.

Copyright © 2020 IDG Communications, Inc.