What is open data, and why does it matter?

Despite a lack of consensus on what open data means, organizations and open source projects are tackling it as we move into the future of computing.

If “open source” was the rallying cry of the past two decades, “open data” may be the call to arms for the next two. Or it would be, if only we could figure out what it means.

I recently raised that banner and was met by thunderous applause. Hurray, right? Well, despite the dopamine hit (you like me, you really like me!), everyone seemed to be cheering for different things. Love it or hate it, open source has come to mean something relatively standard due to the efforts of the Open Source Initiative. No such organization exists for open data.

It strikes me that someone needs to help set that standard for open data, and that open data, more than open source, will define the next era of computing. But what does “open data” mean? And will we, as Professor Dirk Riehle posits, still be asking this question 20 years from now?

Source and standards

As I recently argued, it’s convenient but wrong to assume that open source has lost its salience in the cloud era when managed services, not software/source, are what enterprises want. One reason is that open source helps to foster standards, like OpenTelemetry in the observability space or PostgreSQL in databases. I don’t mean OpenTelemetry is a standard in the sense that some standards body has spent years defining rules for accessibility and such. Instead, I mean a project that a variety of vendors accept as a common starting point for their own distributions or value-added software/services.

Software doesn’t need to be open source (under the Open Source Definition) to achieve this status, though it helps. SQL, for example, has given rise to many kind-of, sort-of, mostly compatible implementations from different vendors, and it seems to work. Or take purely proprietary software like Microsoft Windows, which I can get from a variety of vendors. In fact, in 2020 when I worked at AWS, I wrote a post on why Windows runs best on AWS and not Microsoft Azure. Another example of this would be the (admittedly hopeful) suggestion that we “make AWS’s permissions checker a universal standard down to the fine grain of what resources a program can use. With universal permissions, cloud vendors just compete on price—no nasty software lock-in.”

Good luck with that!

And good luck trying to get PostgreSQL running in your data center to map apples-to-apples with Amazon Aurora for PostgreSQL or Google Cloud SQL for PostgreSQL. They’re all PostgreSQL, right? Sure. But also, not exactly. Different vendors add different things to meet diverse customer needs. So, is PostgreSQL a standard? Yes, in the sense that I mentioned above, but not in the sense of “write once, run anywhere.”

Similarly, open data quickly devolves into a bevy of conflicting opinions on what it actually means or how to make it matter. Like open source and standards, your mileage may vary, sometimes considerably.

You keep using that word…

Part of the problem comes down to vendor priorities. Some, like Nick Heudecker, former Gartner analyst and current senior director of market strategy at Cribl, argue, “From AWS to Oracle, Snowflake and Splunk, data lock-in is how traditional vendors protect and grow revenue. The idea of open data is promising for users, but no vendor will give up that lock-in.”

Well, that stinks.

Except, those same vendors also see the value in opening on-ramps to their own products. It’s hard to completely lock down data egress while simultaneously opening up ingress. On a similar theme, Crunchy Data executive Craig Kerstiens says, speaking of how SQL enables data movement, “SQL helps on the app side, but data gravity is the hard part.” Even a vendor dead set on lock-in has to let the bridge down at times to cross the moat. It seems, therefore, that everyone has an interest in open data. But again, what exactly does this mean?

For Doug Cutting, founder of a variety of Apache projects (Lucene, Nutch, Hadoop, and Avro), open data is somewhat particular in nature and refers to data that can be shared between people or systems: “Some data should be open (e.g. civic finance), but much should not (e.g. cam footage), and some should be selectively shared by trusted parties (e.g. medical records). There’s no one-size-fits-all policy, rather a complex tapestry of practices, carefully codified and modified.”

Following that data portability theme, AWS Vice President Matt Wilson likens enterprise data portability to telephone number portability. In North America, requiring carriers to let customers move their phone numbers to rivals increased competition (if “marginally,” as Wilson rightly highlights).

Then there are other ways of thinking about open data. For example, Florian Wolf, founder and CEO of Mergeflow, calls PubMed “one of the biggest success stories of open data.” PubMed is “a free resource supporting the search and retrieval of biomedical and life sciences literature.” It’s a database, in other words, or a search engine that makes it easier to find scientific publications that may be stored behind a proprietary paywall. That is open discovery of data but perhaps not open access to that data (not without paying, anyway).

See the problem? Open data means very different things to different people.

Defying data gravity and bridging data silos

Then there’s the question of how we want data to move. When I say “open data,” I’m guessing that most readers assume that I’m talking about moving data somewhere else, such as from AWS to Azure. That might sometimes be the case, though egress pricing, quite apart from any inherent data format lock-in, inhibits the movement of data. However, enterprises often struggle to move data even within the four walls of their own data center or cloud.

Subbu Allamaraju, an IT leader who built Expedia’s Search & Discovery team, argues that data is messy and fragmented for reasons inherent to organizations (“fragmented ownership and accountability across organizational boundaries”) and to the data itself (“glue tech that you need to shovel and transform data around to power analytics use cases, including machine learning”). The data may well have open standards or formats, but the organizations tasked with moving data from system A to system B may be even more fragmented than their data.

This is not to say all is lost. We have great organizations such as the Open Data Institute working on this and related problems, as well as open source projects such as Apache Arrow (a cross-language development platform for in-memory analytics). Companies such as Airbyte (open source data integration) and Databricks (which open sourced Delta Lake OSS to create an open source storage layer that brings ACID transactions to Apache Spark) are also tackling this.

It still feels like something more is needed. Figuring out what that “more” should be, however, will be as important as any particular implementation.

Copyright © 2022 IDG Communications, Inc.