Ye olde ETL shoppe

New data integration vendors are promising to ETL your data to its destination in minutes. Is the old ETL process, with hundreds of complex stages, completely defunct?

I recently worked on a content project for Alooma, an ETL and data pipeline service on the cloud. ETL stands for “extract, transform, load,” a time-tested process for loading mission-critical events into a central data store while making sure the data is correct.
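
For a concrete feel of those three letters, here is a deliberately tiny sketch in Python. Everything in it, from the file names to the cleanup rule, is a hypothetical stand-in for illustration, not any vendor’s actual API:

```python
# A minimal, hypothetical ETL sketch. It assumes a source table
# events(user_id, event, amount) already exists; all names are illustrative.
import sqlite3

source = sqlite3.connect("app_events.db")    # operational source system
warehouse = sqlite3.connect("warehouse.db")  # central data store

# Extract: pull the raw events out of the source.
rows = source.execute("SELECT user_id, event, amount FROM events").fetchall()

# Transform: normalize values and drop records that fail a sanity check,
# so only correct data reaches the warehouse.
clean = [(uid, name.strip().lower(), amt)
         for (uid, name, amt) in rows
         if amt is not None and amt >= 0]

# Load: write the cleaned records into the central store.
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS fact_events (user_id INT, event TEXT, amount REAL)"
)
warehouse.executemany("INSERT INTO fact_events VALUES (?, ?, ?)", clean)
warehouse.commit()
```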

A new take on ETL

Alooma and its competitors like XPlenty and Stitch Data have a new take on ETL. They promise to move huge volumes of data into data warehouses or other data stores effortlessly, with all the integration and data plumbing taken care of as a managed service.

If you ask them, you no longer need an ETL tool or an ETL process per se, because the managed data pipeline comes with its own industrial-strength ETL capabilities, which are much easier to use than the ETL tools of old.

Can ETL really be that simple? I didn’t take their word for it.

Ask your elders

I read up on what was traditionally called “ETL,” and found a lot of literature, much of it written ten or more years ago, about ETL processes and methodologies that appeared frighteningly complex.

To illustrate, the Udemy course “ETL Testing: From Beginner to Expert” includes no fewer than 248 lectures: seven lectures on basic concepts, nine on data warehouse architecture, 29 on dimensional modeling, eight on ETL concepts, and 22 on possible defects in the ETL process. Should I go on? The unavoidable TutorialsPoint on ETL Testing is an endless 20-part tutorial. Almost all the resources I found online are like this!

I mapped out these endless ETL processes and came back to Alooma. I talked to a few of the experts who built their enterprise-grade data pipeline. They weren’t even aware of many of the concepts and stages in this convoluted ETL process.

A member of my team, a former IT leader who managed ETL and data projects when “Amazon” was a jungle and “Azure” was a choice of tablecloth for a wedding, told me:

I get it, they are the new kids who didn’t have a black-and-white TV.

Could it be that elaborate ETL processes with hundreds of stages, still being studied by aspiring ETL experts, and practiced at many large organizations, are simply not relevant anymore in the new world of cloud-based data pipelines?

Is the old process an “olde ETL shoppe” where the data engineers of yesteryear sit around small tables sipping tea from porcelain cups?

Or perhaps some parts of that process are still relevant in new architectures. And maybe some important ETL stages have even been overlooked by the new vendors?

Mapping new tools to the old process

The primary question for me was: what happened to all those hundreds of stages in the data engineering handbooks?

“ETL PostgreSQL to Redshift in minutes,” as Alooma proclaims, sounds great on paper, but has the vendor really taken care of all the complexity in Udemy’s 248 ETL Testing lectures, obviating the need for this type of training?
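
For context on what is being compressed into those minutes, here is roughly what the hand-rolled version of that single hop has long looked like: dump to CSV, stage it on S3, then issue a Redshift COPY. This is a sketch only; the hosts, bucket, and IAM role are hypothetical placeholders, and a real pipeline would add retries, schema evolution, and incremental loads on top:

```python
# A hedged sketch of the classic hand-rolled PostgreSQL-to-Redshift hop that
# managed pipelines compress. Connection strings, the bucket, and the IAM role
# are hypothetical placeholders.
import boto3
import psycopg2

# 1. Extract from PostgreSQL into a local CSV file.
with psycopg2.connect("dbname=app host=pg.example.com user=etl") as pg:
    with pg.cursor() as cur, open("orders.csv", "w") as f:
        cur.copy_expert("COPY (SELECT * FROM orders) TO STDOUT WITH CSV", f)

# 2. Stage the file on S3, Redshift's preferred bulk-load path.
boto3.client("s3").upload_file("orders.csv", "my-staging-bucket", "orders.csv")

# 3. Load into Redshift, which speaks the PostgreSQL wire protocol.
with psycopg2.connect("dbname=dw host=dw.example.com user=etl port=5439") as rs:
    with rs.cursor() as cur:
        cur.execute("""
            COPY analytics.orders
            FROM 's3://my-staging-bucket/orders.csv'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
            CSV;
        """)
```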

I took the trouble of distilling those endless tutorials and lectures into discrete process stages, to identify “what really happens” in the ETL preparation and testing process.

I sat down with Alooma to see which of these stages are simply not relevant in a cloud-based ETL architecture, and which are actually relevant, but made easier by managed solutions.

It turns out that of 32 discrete stages or issues in the old ETL process:

  • 17 stages are not relevant at all in a cloud architecture
  • 15 stages are relevant but made easier in a cloud architecture; many of these are handled transparently by the data pipeline platform, with little or no user intervention

(This is based on my analysis of the Alooma platform; give or take, it should be similar for competing vendors as well.)

The full details of all those ETL stages and how they are translated into the new architecture are beyond the scope of this post, but for those of nerdy inclination, watch for a detailed writeup I’ll be releasing soon.

The bottom line is that, yes, a new cloud-based data pipeline can get rid of over half of the stages in the old ETL process, and because it dramatically simplifies the remaining stages, you can actually set up ETL, if not in minutes as advertised, then within hours or days, whereas old enterprise ETL projects could easily take years.

From manual bookkeeping to cash register

I thought of comparing it to bookkeeping for a physical store. Many years ago, stores had clerks who would meticulously copy each transaction into a day book, and then aggregate those transactions manually into debit and credit columns in a general ledger.

Imagine how many manual operations are required to record daily transactions for even a small store, how many mistakes are possible, and the extent of verification, testing and auditing that would be required to run this process accurately at large scale.

It is my impression that the ETL tools of old are like a calculator or a spreadsheet that can help organize and streamline the manual bookkeeping process. They can definitely make things much easier. But they leave the process as is.

The new “ETL-inside” data pipelines are like a digital cash register. Imagine what a huge difference it makes for a store with manual bookkeeping to implement a cash register: a machine that scans the barcode on each item and automatically generates the store’s books. So many manual steps are eliminated in one fell swoop, and more importantly, the remaining steps are abstracted away into a seamless operating environment.

So yes, the advertising is true: Ye olde ETL shoppe is now a 7-Eleven.
