Doing data warehousing the wrong way

If data pipelines and streams are the future, why are we still thinking of data as static?

It’s felt obvious for some time that, as an industry, we’ve been trying to shove square data warehousing tools into round, data-driven application holes. But it wasn’t until I read Decodable CEO Eric Sammer’s excellent post “We’re Abusing the Data Warehouse: RETL, ELT, and Other Weird Stuff” that I understood why and what damage we were doing in the process. As Sammer writes, “Putting high-priced analytical database systems in the hot path introduces pants-on-head anti-patterns to supportability and ops.”

In case you’re wondering, “pants-on-head anti-patterns” is not a compliment. It’s an anguished cry of “someone please stop this insanity!”

Compounding the data problem

Ask enterprises how they feel about their data warehouses, and a high percentage (83% in one survey) express dissatisfaction. They struggle to load data. They have unstructured data that the data warehouse can’t handle. These aren’t necessarily problems with the data warehouse, however. I’d hazard a guess that the dissatisfaction usually arises from trying to force the data warehouse (or analytical database, if you prefer) to do something for which it’s not well suited.

Here’s one way the error starts, according to Sammer:

By now, everyone has seen the rETL (reverse ETL) trend: You want to use data from app #1 (say, Salesforce) to enrich data in app #2 (Marketo, for example). Because most shops are already sending data from app #1 to the data warehouse with an ELT tool like Fivetran, many people took what they thought was a shortcut, doing the transformation in the data warehouse and then using an rETL tool to move the data out of the warehouse and into app #2.

The high-priced data warehouses and data lakes, ELT, and rETL companies were happy to help users deploy what seemed like a pragmatic way to bring applications together, even at serious cost and complexity.

And why wouldn’t they? “The cloud data warehouse is probably the most expensive CPU cycle available,” says Sammer. Data warehouse topology ends up multiplying data (creating governance issues, among other problems), but it does have the advantage of being convenient: Data warehouses are well understood, plenty of people are trained to use them, and they are already in use.

Right problem, wrong solution

Sammer makes a compelling “pants-on-head” point that data warehouses are exactly the wrong way, and the most costly way, to build data-driven applications. Why? Because “putting a data warehouse between two Tier 1 apps is a [bad] idea.” Companies tend not to treat their analytical systems “like Tier 1, business-critical components.” Hence, “companies don’t replicate analytic tools and data for high availability across availability zones; they don’t (usually) carry pagers; they don’t duplicate processes.” The result is “enormous cost and risk.” Or, as he concludes, “We’ve accidentally designed our customer experience to rely on slow batch ELT processes.”

So what’s the alternative?

For Sammer, it’s all about streaming data, not ELT (or rETL). It’s about real-time data pipelines based on the Kappa architecture. Kappa architecture means you have an “append-only immutable log. From the log, data is streamed through a computational system and fed into auxiliary stores for serving,” explains Milinda Pathirage, a specialist in big data engineering at KPMG. Or as Confluent CEO Jay Kreps wrote in 2014 when he was an engineering lead at LinkedIn, arguing against just the sort of “duct-taped” Lambda architecture approach that Sammer dismisses, “why can’t the stream processing system just be improved to handle the full problem set in its target domain?”
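
To make Pathirage's description concrete, here is a minimal Python sketch of the idea, not anyone's production design. Everything in it is an assumption made for illustration: the in-memory AppendOnlyLog stands in for a durable log such as a Kafka topic, and the Event shape, materialize, total_points, and best_score are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Event:
    """An immutable record in the log (hypothetical shape)."""
    user: str
    points: int


class AppendOnlyLog:
    """In-memory stand-in for a durable log; events are only ever appended."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset: int = 0) -> Iterator[Tuple[int, Event]]:
        for i in range(offset, len(self._events)):
            yield i, self._events[i]


def materialize(log: AppendOnlyLog,
                apply: Callable[[Dict[str, int], Event], None]) -> Dict[str, int]:
    """Stream the whole log through `apply` to build a serving view."""
    view: Dict[str, int] = {}
    for _offset, event in log.read_from(0):
        apply(view, event)
    return view


def total_points(view: Dict[str, int], event: Event) -> None:
    """First version of the processing logic: running total per user."""
    view[event.user] = view.get(event.user, 0) + event.points


def best_score(view: Dict[str, int], event: Event) -> None:
    """Revised logic: keep each user's single best score instead."""
    view[event.user] = max(view.get(event.user, 0), event.points)


log = AppendOnlyLog()
for e in (Event("ana", 10), Event("bo", 5), Event("ana", 7)):
    log.append(e)

totals = materialize(log, total_points)
# Changing the logic does not need a separate batch system:
# replay the same log through the new function to rebuild the view.
best = materialize(log, best_score)
```

The log stays the single source of truth, and the serving views (totals, best) are disposable: they can be rebuilt at any time by replaying the log from the first offset.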

That is, make the stream the center of the data universe.

The benefits, suggests Sammer, are several: “It costs far less, provides Tier 1–quality SLAs without the cost of duplicating data, allows for fractional rather than total failures, and gets your data warehouse out of the critical path. Put simply, Kappa means to pull data from app #1 as it occurs, send it in bite-sized chunks to a data gateway, transform/enrich as necessary, and then deliver to all the places it’s needed in parallel.”
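
The flow Sammer describes (take an event as it occurs, transform or enrich it, deliver it to every destination in parallel) might look roughly like the following Python sketch. It is an illustration under stated assumptions rather than Decodable's implementation: the record shape, the enrich rule, and the sink names (crm, warehouse, marketing) are all hypothetical, and plain Python callables stand in for real connectors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Record = Dict[str, str]
Sink = Callable[[Record], None]


def enrich(record: Record) -> Record:
    """Transform step (hypothetical rule): normalize the email, tag the source."""
    return {**record, "email": record["email"].strip().lower(), "source": "app1"}


def deliver(record: Record, sinks: Dict[str, Sink]) -> Dict[str, str]:
    """Enrich one record, send it to every sink in parallel, report per-sink status."""
    enriched = enrich(record)
    with ThreadPoolExecutor(max_workers=len(sinks)) as pool:
        futures = {name: pool.submit(sink, enriched) for name, sink in sinks.items()}
    # Leaving the `with` block waits for all deliveries to finish.
    status: Dict[str, str] = {}
    for name, future in futures.items():
        error = future.exception()
        status[name] = "ok" if error is None else f"failed: {error}"
    return status


# Hypothetical destinations: two that work and one that is down.
crm_rows: List[Record] = []
warehouse_rows: List[Record] = []


def marketing_sink(record: Record) -> None:
    raise ConnectionError("marketing API unavailable")


status = deliver(
    {"id": "42", "email": " Ada@Example.com "},
    {"crm": crm_rows.append, "warehouse": warehouse_rows.append, "marketing": marketing_sink},
)
```

Here the marketing sink fails while the CRM and the warehouse still receive the record, which is one way to read “fractional rather than total failures.” The warehouse is just another destination, not a hop that the other applications depend on.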

Even if you don’t think the stream supersedes the database (I personally don’t), it’s easy to get on board the idea that data pipelines/streams are more the future than trying to shove data back and forth between analytical databases. Although Kappa and similar approaches deliver real-time data, it’s not really about that. According to Sammer, it’s about ending the “massive pile of tech debt that rETL is accumulating. It’s about unraveling the nest of critical dependencies on analytics tools and making support sustainable.”

The next time you’re building an app that has a leaderboard or needs to be informed by data (and that’s a swelling percentage of apps), ask yourself why your default data assumption is stasis: data sitting in storage that you then have to propel from app to app, from system to system. We don’t live in that batch-oriented world anymore. Sammer seems to be correct: Streams should be the default, not an exception.

Copyright © 2022 IDG Communications, Inc.