Visual ETL considered harmful (to your skills)

Why hasn't the ETL process evolved with the times?

ETL (Extract Transform and Load) is the common name for the processes through which data gets pulled from multiple repositories, combined, filtered and aggregated in various ways, and then "loaded" where it is intended to be used for a specific purpose, like a target data mart. Many ETL tools are predicated on the bedrock belief that visual representations of the transformations are easier to create, manage, and understand than corresponding textual descriptions.

The idea goes that if users can simply drag and drop blocks representing data onto a visual design surface and connect the blocks via arrows representing transformation, then anyone can do it. Such visual tools have existed for more than 20 years (probably longer), yet the ETL process has not gotten radically more efficient or transparent. Why is that?

Much of it comes down to the fact that big diagrams of boxes and arrows are hard to understand, maintain, search, and reason about. Text is generally much better for those things. 

In principle, you would think that having a visual representation of an ETL pipeline would be the easiest form to reason about. For example, you might trace back from a single box to all the other boxes that pour data into said box. In fact, real ETL pipelines fan out very quickly, and you immediately find yourself considering very large numbers of tables and columns in any such walk. This quickly means that you need software to help you make sense of it. I think the right HCI metaphors have not yet been invented to really help with this.

What's more, commercial incentives are generally to lock the user into a specific UI or visual language. By expressing the underlying computation in a proprietary format that requires continued use of the UI, the vendor guarantees hard migration and continued revenue. Once a computation is embedded in one of the proprietary UIs, you really need a specialist in that UI to go in, understand it, and modify it. Furthermore, it's very hard to figure out if the modifications are correct by simple inspection. It may not feel that way in the beginning when there are only a handful of boxes and arrows, but in enterprise environments, it usually quickly gets to a point where data flow involves dozes of boxes. 

Are there other options? The traditional approach of writing straight line code in low-level imperative languages like Java or C++ has certainly been tried. Gluing together lots of different utilities with complex scripts using Bash or Python has been tried. In fact, ETL is so necessary and ubiquitous in data processing, that you're likely to find all of the possible approaches. Involving software development at every step is the slowest and most expensive way to go, generally. Whenever there's a specify-build-test-try loop involving specialists and scheduling, things bog down. 

If you've read my previous blog, you know I'm a big fan of SQL, despite its age and its many flaws. SQL is English-like and most people can read and make sense of a statement like:

SELECT customer, count(items) num_items
FROM PurchasesTable
WHERE date >= DATE ‘2014­01­01’
GROUP BY customer
HAVING num_items > 10

ETL pipelines written in SQL are easier to read than complex diagrams that required software to even render on a screen. Being able to read and interpret is often an important characteristic in order to understand why data is coming out of the pipeline in a certain way. Otherwise, the pipeline is just a black box that can only be interpreted by its owners rather than being open and understandable. SQL is often easier to write than using a custom design surface tool, especially when the concepts get a little bit more complex and interrelated. That's the concept of literacy I was referring to in my previous post. The most common case in large environments is that one is starting from a library of existing statements that need to be modified, rather than starting from a completely blank sheet of paper. Also, once written, the results from one step are much more easily composed with the next step in a computation. 

Enterprises are under pressure to generate results faster and many more people are working with data directly than ever before. Of course, dedicated UI has helped tremendously in areas like visual analysis of data, for example with Tableau. However, the greedy approach of following what seems easiest often leads to long term maintainability issues, lack of agility and higher total cost of ownership. ETL especially is an area where data quality, data governance and traceability of computation are of paramount importance and being able to read SQL is a kind of literacy in this area. Enterprises that invest in this type of literacy create long-term empowerment in their people and their own ability to understand data in disparate systems. 

I'd love to get your feedback on these thoughts. Please feel free to email me directly or post comments below!

In the 1960s, Edsger Dijkstra wrote the heartfelt "Go To Statement Considered Harmful" that inspired many other "XYZ Considered Harmful." The title was intended to challenge orthodox views on a topic. This is a lighthearted series of posts in that vein.

Copyright © 2015 IDG Communications, Inc.