How to get your mainframe's data for Hadoop analytics

IT's mainframe managers don't want to give you access but do want the mainframe's data used. Here's how to square that circle

Many so-called big data -- really, Hadoop -- projects have patterns. Many are merely enterprise integration patterns that have been refactored and rebranded. Of those, the most common is the mainframe pattern.

Because most organizations run the mainframe and its software as a giant single point of failure, the mainframe team hates everyone. Its members hate change, and they don’t want to give you access to anything. However, there is a lot of data on that mainframe and, if it can be done gently, the mainframe team is interested in people learning to use the system rather than starting from scratch. After all, the company has only begun to scratch the surface of what the mainframe and the existing system have available.

Many otherwise great data integration techniques can’t be used in an environment where new software installs are highly discouraged, as is the case in the mainframe pattern. However, rest assured that there are plenty of techniques for working within these limitations.

Sometimes the goal of mainframe-Hadoop or mainframe-Spark projects is just to look at the current state of the world. More frequently, however, the goal is to do trend analysis and track changes in a way that the existing system doesn’t. This requires change data capture (CDC) techniques.

Technique 1: Log replication

Database log replication is the gold standard, and plenty of tools do it. They require an install on the mainframe side and a receiver either on Hadoop or nearby.

All the companies that produce this software tell you that there is no impact on the mainframe. Do not repeat any of the nonsense the salesperson says to your mainframe team, as they will begin to regard you with a very special kind of disdain and stop taking your calls. After all, it is software, running on the mainframe, so it consumes resources and there is an impact.

The way log replication works is simple: DB2 (or your favorite database) writes redo logs as it writes to a table; the log-replication software reads and deciphers those logs, then sends a message (such as a JMS, Kafka, MQSeries, or Tibco-style message) to a receiver on the other end, which writes it to Hadoop (or wherever) in the appropriate format. Frequently, you can tune this anywhere from one write per change to batches of writes.
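To make the single-write-versus-batch knob concrete, here is a minimal sketch of the receiver side in Python. Everything in it is an assumption for illustration: the `ChangeEvent` shape, the `BatchingReceiver` class, and the `sink` callback are hypothetical names, not the API of any log-replication product, and a real receiver would be fed by whatever messaging client your product uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ChangeEvent:
    """Hypothetical shape of one decoded log record; real products define their own."""
    table: str
    op: str                 # "I" (insert), "U" (update), or "D" (delete)
    key: str
    row: Dict[str, str]


@dataclass
class BatchingReceiver:
    """Buffers change events and hands them to a sink in batches.

    batch_size=1 means one write per change; larger values trade
    latency for fewer, bigger writes on the Hadoop side.
    """
    sink: Callable[[List[ChangeEvent]], None]
    batch_size: int = 1
    _buffer: List[ChangeEvent] = field(default_factory=list)

    def on_message(self, event: ChangeEvent) -> None:
        # Called once per message pulled off the queue or topic.
        self._buffer.append(event)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Write whatever is buffered; also call this on a timer or at shutdown
        # so a partial batch is not left behind.
        if self._buffer:
            self.sink(list(self._buffer))
            self._buffer.clear()
```

In practice you would probably pair the size threshold with a time-based flush, so that a quiet table does not hold changes back indefinitely.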

The advantage is that this technique gives you a lot of control over how much data gets written and when. It doesn’t lock records or tables, but you get good consistency. You can also control the impact on the mainframe.

The disadvantage is that it is another software install. This usually takes a lot of time to negotiate with the mainframe team. Additionally, these products are almost always expensive and priced in part on a sliding scale (companies with more revenue get charged more even if their IT budget isn’t big).

Technique 2: ODBC/JDBC

No mainframe team has ever let me do this in production, but you can connect with ODBC or JDBC directly to DB2 on the mainframe. This might work well for an analyze-in-place strategy (especially with a distributed cache in between). Basically, you have a mostly normal database.

One challenge is that, due to how memory works on the mainframe, you are unlikely to get multiversion concurrency (which is relatively new to DB2 anyhow) or even row-level locking. So watch for those locking issues! (Don’t worry -- the mainframe team is highly unlikely to let you do this anyway.)

Technique 3: Flat-file dumps

On some interval, usually at night, you dump the tables to big flat files on the mainframe. Then you transmit them to a destination (usually via FTP). Ideally, once a file is fully written you rename it, so that it is clear the file is complete rather than still in transmission. Sometimes the transfer is a push and sometimes it is a pull.
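The rename-when-done convention is easy to get subtly wrong, so here is a minimal sketch of it in Python as it might look on the landing server. The `.part` suffix and the `publish_dump` and `ready_dumps` helpers are assumptions chosen for illustration; on the mainframe side the equivalent would be a rename step at the end of the dump job.

```python
import os
from pathlib import Path


def publish_dump(lines, final_path: Path) -> Path:
    """Write a dump under a temporary name, then rename it once complete."""
    partial = final_path.with_name(final_path.name + ".part")
    with open(partial, "w", encoding="utf-8", newline="\n") as out:
        for line in lines:
            out.write(line + "\n")
        out.flush()
        os.fsync(out.fileno())      # make sure the bytes are on disk first
    # The rename is the "done" signal: readers never see a half-written file
    # under the final name.
    os.replace(partial, final_path)
    return final_path


def ready_dumps(directory: Path):
    """List only finished dumps, skipping anything still named *.part."""
    return sorted(p for p in directory.iterdir() if p.suffix != ".part")
```

A loader that only ever picks up files returned by something like `ready_dumps` never has to guess whether a transfer is still running.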

On the Hadoop side, you use Pig or Spark, or sometimes just Hive, to parse the usually delimited files and load them into tables. In an ideal world, these are incremental dumps but frequently they are full-table dumps. I’ve written SQL to diff a table against another to look for changes more times than I like to admit.
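When all you get is full-table dumps, the change detection comes down to comparing yesterday's dump with today's by primary key. The sketch below does that comparison in plain Python rather than SQL, purely to show the logic. It assumes pipe-delimited files with a header row and a single key column, and the `load_dump` and `diff_dumps` names are hypothetical. At real volumes you would express the same three-way split as a full outer join in Hive or Spark.

```python
import csv
import io


def load_dump(text, key_field, delimiter="|"):
    """Index a delimited dump (header row assumed) by its key column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row[key_field]: row for row in reader}


def diff_dumps(previous, current):
    """Return (inserted, updated, deleted) rows between two keyed dumps."""
    prev_keys, curr_keys = previous.keys(), current.keys()
    # Keys only in today's dump are inserts.
    inserted = [current[k] for k in sorted(curr_keys - prev_keys)]
    # Keys only in yesterday's dump are deletes.
    deleted = [previous[k] for k in sorted(prev_keys - curr_keys)]
    # Keys in both dumps whose rows differ are updates.
    updated = [current[k] for k in sorted(curr_keys & prev_keys)
               if current[k] != previous[k]]
    return inserted, updated, deleted


yesterday = "ID|NAME|BAL\n1|ANN|10\n2|BOB|20\n3|CAT|30\n"
today = "ID|NAME|BAL\n1|ANN|10\n2|BOB|25\n4|DAN|40\n"
inserted, updated, deleted = diff_dumps(load_dump(yesterday, "ID"),
                                        load_dump(today, "ID"))
```

With the sample data above, row 4 is an insert, row 2 is an update, and row 3 is a delete.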

The advantage of this technique is that there is usually no software install, and you can schedule it at whatever interval you prefer. It is also somewhat recoverable because you can dump a partition and reload a file whenever you like.

The disadvantage is that this technique is fairly brittle and the impact on the mainframe is bigger than is usually realized. One thing I found surprising is that the tool to do this is an option for DB2 on the mainframe, though it costs a considerable amount of money.

Technique 4: VSAM copybook files

Although I haven’t seen the latest "Independence Day" movie (having never gotten over the "uploading the Mac virus to aliens" thing from the first one), I can only assume the giant plot hole was that the aliens easily integrated with defense mainframes and traversed encoding formats with ease.

Sometimes the mainframe team is already generating VSAM/copybook file dumps on the mainframe in the somewhat native EBCDIC encoding. So, this technique has most of the same drawbacks as the flat-file dumps, with the extra burden of having to translate them as well. There are traditional tools like Syncsort, but with some finagling the open source tool Legstar also works. However, a word of caution: If you want commercial support from Legsem (Legstar's maker), I found it doesn't respond to email or answer its phones. That said, the code is mostly straightforward.
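To give a feel for what "translate them" involves, here is a minimal Python sketch that decodes one fixed-width EBCDIC record by hand. It is not how Legstar or Syncsort work internally. The record layout (a 10-byte `PIC X(10)` name followed by a 3-byte `PIC S9(3)V99 COMP-3` balance) is a made-up example, and code page `cp037` is an assumption; your mainframe may use a different EBCDIC code page, and the real layout comes from the copybook.

```python
from decimal import Decimal


def decode_text(raw: bytes, codepage: str = "cp037") -> str:
    """Decode an EBCDIC character field and drop the space padding."""
    return raw.decode(codepage).rstrip()


def decode_packed(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a COMP-3 (packed decimal) field.

    Each byte holds two decimal digits, except the last, whose low
    nibble is the sign (0xD or 0xB negative, otherwise positive).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    value = int("".join(str(d) for d in digits))
    if raw[-1] & 0x0F in (0x0B, 0x0D):
        value = -value
    # scale is the number of implied decimal places (the V99 part).
    return Decimal(value).scaleb(-scale)


def parse_record(record: bytes):
    """Parse the hypothetical 13-byte layout described above."""
    name = decode_text(record[0:10])
    balance = decode_packed(record[10:13], scale=2)
    return name, balance


# Build a sample record: "ANN" padded to 10 bytes, then packed -123.45.
sample = "ANN".ljust(10).encode("cp037") + bytes([0x12, 0x34, 0x5D])
name, balance = parse_record(sample)
```

The important design point is that character and packed fields must be decoded separately: running the whole record through a character-set conversion, as an ASCII-mode FTP transfer does, corrupts the packed bytes.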

Orchestration and more

Virtually any of these techniques will require some kind of orchestration, which I’ve covered before. I’ve had more than one client require me to write that tool in shell scripts or, worse, Oozie (which is Hadoop’s worst-written piece of software and all copies should be taken out to the desert and turned into a Burning Man statue). Seriously, though, use an orchestration tool rather than writing your own or leaving it implicit.

Just because there are patterns doesn’t mean you should write this from scratch. There are certainly ETL tools that do some or most of this. To be fair, the configuration and mapping they require frequently make you wish you had written it from scratch after all. Still, anything from Talend to Zaloni might work better than rolling your own.

The bottom line is that you can use mainframe data with Hadoop or Spark. There is no obstacle that you can’t overcome, whether with no-install, middle-of-the-night, or EBCDIC techniques. As a result, you don’t have to replace the mainframe just because you’ve decided to do more advanced analytics, from enterprise data hubs to analyze-in-place. The mainframe team should like that.