Rethinking data architectures for a cloud world

An open, services-oriented approach has clear advantages for building modular and scalable applications. We should take the same approach to our data.

Data analytics solutions are continuing to emerge at a fast and furious rate. Data teams are at the center of the storm because they have to balance all the demands for access, data integrity, security, and proper governance, which entails compliance with policies and regulations. The businesses they serve need information as quickly as possible and have little patience for that precarious balancing act. The data teams have to move fast and smart.

They also have to be fortune tellers because they need to build not just the systems for today, but also the platforms for tomorrow. The first key question the data team must consider is whether to adopt an open or a closed data architecture.

Open vs. closed data architecture

Let’s start with the phrase “data architectures.” If I were to show you an architecture diagram from any enterprise over the last 50 years, odds are that the labels for data would in fact be labels for databases: not the data itself, but the engines that act upon the data. The names are familiar, both old and new: Oracle, DB2, SQL Server, Teradata, Exadata, Snowflake, etc. These are all databases into which you load your datasets for either operational or analytical purposes, and they are the foundations of the “data architecture.”

By definition, those databases are what we would call “closed data architectures.” That’s not a value statement; it’s a descriptive one. It means that the data itself is closed off from other applications and must be accessed through the database engine. This is true even for moving data around with ETL jobs because at some point, to do the export or the import, you need to go through the database, whether that’s the optimal way to achieve what you want to do or not. The data is “closed” off from the rest of the architecture in this important sense.

In contrast, an “open data architecture” is one that stores the data in its own independent tier within the architecture, which allows different best-of-breed engines to be used for an organization’s variety of analytic needs. That’s important because there’s never been a silver bullet when it comes to analytic processing needs, and there likely never will be. An open architecture puts you in an ideal position to be able to use whatever best-of-breed services exist today or in the future.

To summarize: A closed data architecture brings the data to a database engine, and an open data architecture brings the database engine to the data.

An easy way to test if you’re dealing with an open architecture is to consider how hard it would be in the future to adopt a new engine. Will you be able to run the new engine side by side with an existing one (on the same data), or will a wholesale (and likely impractical) migration be required?
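The side-by-side test can be sketched in a few lines. This is only a toy illustration under stated assumptions: a CSV file in a temporary directory stands in for an open format such as Apache Parquet on cloud storage, and two plain Python functions stand in for independent engines. The dataset, column names, and function names are all hypothetical.

```python
import csv
import statistics
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical sample data; the "orders" dataset and its columns are made up.
ROWS = [
    {"region": "east", "amount": "120.0"},
    {"region": "west", "amount": "80.0"},
    {"region": "east", "amount": "30.0"},
    {"region": "west", "amount": "50.0"},
]


def write_dataset(directory: Path) -> Path:
    """Store the data once, in an open format, in its own storage tier."""
    path = directory / "orders.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["region", "amount"])
        writer.writeheader()
        writer.writerows(ROWS)
    return path


def engine_a_total_by_region(path: Path) -> dict[str, float]:
    """Stand-in for the existing engine: sums amounts per region."""
    totals: dict[str, float] = defaultdict(float)
    with path.open(newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["amount"])
    return dict(totals)


def engine_b_median_amount(path: Path) -> float:
    """Stand-in for a newly adopted engine: a different workload on the
    same file, with no export, import, or copy of the data."""
    with path.open(newline="") as f:
        amounts = [float(row["amount"]) for row in csv.DictReader(f)]
    return statistics.median(amounts)


def run_side_by_side() -> tuple[dict[str, float], float]:
    """Run both stand-in engines against the single shared dataset."""
    with tempfile.TemporaryDirectory() as tmp:
        path = write_dataset(Path(tmp))
        return engine_a_total_by_region(path), engine_b_median_amount(path)


if __name__ == "__main__":
    totals, median = run_side_by_side()
    print(totals, median)
```

The point of the sketch is that adding the second function required no change to the first and no migration of the data. In a closed architecture, the equivalent step would begin with an export through the existing database engine.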

Note that at this point we’ve touched on a critical aspect of “open” that has nothing to do with open source. Step one is deciding that you want your data open and available to any service that wishes to take advantage of it. That brings us to what open means in a cloud world.

Open, services-oriented data architecture

When applications moved from client-server to web, the fundamental architecture changed. We went from monolithic applications that ran in one process, to services-oriented applications that were broken into smaller, more specialized software services. Eventually, these became known as “microservices” and they remain the dominant design for web and mobile applications. The microservices approach held many advantages that were realized due to the nature of cloud infrastructure. In a scale-out system with on-demand resource models and numerous teams working on pieces of functionality, the “application” became nothing more than a facade for dozens or hundreds of microservices.

Everyone agrees that this approach has many advantages for building modular and scalable applications. For some reason, we’re expected to believe that this paradigm isn’t nearly as effective for data. At Dremio, we believe that’s inaccurate. We believe the logic of looking at our data in the same open, services-oriented manner as our applications is intuitively obvious and desirable. On a practical and strategic level, an open, services-oriented data architecture just makes sense.

That’s why, for us, the issue of open source software is secondary. The “open” that matters most is the first step: deciding that an open data architecture is more desirable than a closed one. Once that decision is made, a cascade of benefits follows. Open file and table formats (Apache Parquet, Apache Iceberg, etc.) are critical because they allow for industry-wide innovation. That innovation gets delivered in the form of services that act upon the independent data tier. Messy, costly, fragile, and compliance-undermining copying of data is greatly reduced or even eliminated. The data team gets to choose from best-of-breed services to act upon that data, slotting them into the architecture the same way we have been doing with application services for more than a decade. It’s time for data architectures to catch up.

There is one legitimate claim leveled by those disputing the value of open data architectures: They’re too complicated. Complication comes with any major technological shift. Midrange computers were initially more complicated to manage than established mainframes. Then Intel-based servers were initially more complicated to manage than established midrange systems. Managing PCs was initially more complicated than managing established dumb terminals. You see the point. Each time a technology shift happens, it goes through the normal adoption curve into the mainstream. The early days are always more complicated from a management perspective, but with time, new tools and approaches reduce that complexity, and the benefits come to far outweigh the initial complexity cost. That’s why we have innovation.

Dremio was created to make an open, services-oriented data architecture much, much easier and more powerful. With Dremio, running SQL against a lakehouse is easy because of the way we put all the pieces together. And we’ve created industry-changing open source projects along the way, such as Nessie, Apache Arrow, and Arrow Flight. These are open source projects because open source technology encourages adoption and interoperability, which are critical for service integration layers in an organization’s data architecture. Everyone wins. Customers win because they get a collective industry working on and innovating key pieces of technology to better serve them. Open source enthusiasts win because they get access to the code to better understand it, and even improve it. And we win because we use those innovations to make SQL on lakehouses fast and easy.

To put a fine point on this discussion, the reality is that no matter how “open” a vendor claims to be, no matter how much they talk about supporting open formats and open standards, even if that vendor were open source at its core, if the data architecture is closed, it is closed. Period.

One key point that Snowflake has made in recent articles is that you need to be closed in areas like the data format and storage ownership in order to meet business requirements. While this may have been true 20 years ago, recent advancements such as cloud storage and transactional table formats now enable open architectures to meet these requirements. And if a company can meet its requirements with an open architecture and all the benefits that come with it, why would it choose a closed architecture? We suspect this might be why Snowflake is spending so much time arguing that open doesn’t matter.

Data as a first-class citizen

At Dremio we’re advocating for a world where the data itself becomes a first-class citizen in the architecture. We’re making that easier and easier to realize for companies that want the benefits of an open architecture, such as: (1) flexibility to use best-of-breed engines best suited for different jobs; (2) avoiding being locked into going through a proprietary engine in order to access their data; (3) setting themselves up to take advantage of tomorrow’s innovations; and (4) eliminating the complexity that endless copying and moving of data into and out of data warehouses has created.

We’re not only committed to open standards and open source, important as they may be—we’re first and foremost committed to open data architectures. We believe that as they become easier and easier to implement and use, the advantages are overwhelming when compared to a closed data architecture. We’re also committed to equipping and educating people on this journey with initiatives like our Subsurface industry conference, which attracted over 10,000 attendees in our first-ever events last year. The momentum is building and the destination is a future with open data architectures at its core.

Tomer Shiran is co-founder and chief product officer at Dremio.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2021 IDG Communications, Inc.
