How GDPR impacts your software testing

Agile development teams can’t ignore the changes that the General Data Protection Regulation (GDPR) brings to software testing


Concepts like compliance and regulations generally don’t resonate very well in the agile community. However, the General Data Protection Regulation (GDPR) is one regulation that agile development teams can’t afford to ignore—and with fewer than 52 days until it goes into effect, we’re fast approaching the “now or never” point.

What is GDPR? The General Data Protection Regulation regulates how companies protect EU residents’ personal data. It goes into effect on May 25, 2018. For every breach after that point, companies could incur fines of up to 4 percent of the company’s annual global turnover or €20 million (whichever is greater).

A few more key points:

  1. It’s a regulation (legally binding), not a directive.
  2. It applies to a lot of personal data (any information related to a data subject and also online identifiers such as IP addresses and unique device identifiers).
  3. It has extraterritoriality (applies to organizations within and outside the EU).
  4. Data breaches will require notification to a supervisory authority within 72 hours.
  5. Businesses cannot export data outside the EU if there are insufficient data protections.

Before you write off “GDPR compliance” as another department’s problem, consider this:

  • Someone is (hopefully) testing the applications that your development teams are building.
  • You can’t execute realistic tests without appropriate test data (e.g., names, addresses, billing details).
  • If that test data happens to include any personal data of any EU residents, it must be collected, processed, stored, and declared in accordance with GDPR—or you might be responsible for the company incurring some sizable fines.

Even before GDPR, obtaining and applying appropriate test data has always been challenging. It’s especially tricky when you’re testing complex scenarios—for example, when an account must be in a certain state before you can exercise some core functionality, or when order status changes multiple times throughout the course of a single transaction. And the more frequently you run tests (think testing integrated into CI), the more difficult it becomes to ensure that the tests have access to the necessary range of fresh, unexpired test data.

How do you meet these demands—and at the same time ensure that your test data complies with GDPR? There are two main ways: masking production data and using synthetic test data.

Most organizations get their test data from production data because (1) it’s available and (2) it’s known to be realistic. However, GDPR means that production data can no longer be used as is if it contains any private data from any EU residents. Now, that data must be masked irreversibly and deterministically (i.e., the same way across all instances).
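To make “irreversibly and deterministically” concrete, here is a minimal sketch of one common way to achieve both properties at once: masking each field with a keyed hash (HMAC). The key name and record fields are hypothetical; the point is that the same input always masks to the same output (so joins across databases still line up), while the original value can’t be recovered without the key.

```python
import hmac
import hashlib

# Hypothetical secret key for illustration; in practice it would be
# generated and stored outside the test environment so testers
# cannot reverse the masking.
MASKING_KEY = b"rotate-me-and-keep-out-of-source-control"

def mask(value: str) -> str:
    """Deterministically and irreversibly mask a personal-data field.

    Deterministic: identical inputs always produce identical outputs,
    so referential integrity across databases is preserved.
    Irreversible: without MASKING_KEY, the HMAC cannot be inverted.
    """
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncate to a field-friendly length

# The same name masks identically wherever it is stored,
# while distinct values stay distinct.
masked_a = mask("Alice Example")
masked_b = mask("Alice Example")
masked_c = mask("Bob Example")
```

Real masking tools additionally preserve format (a masked IBAN still looks like an IBAN), but the determinism property shown here is what lets masked records stay consistent across every component’s database.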

From a testing perspective, the “extract and mask” approach presents some significant challenges. Most modern business applications involve multiple distributed components that are interconnected, but each stores data in its own database.

Without getting into the low-level details, let’s just say that orchestrating all the appropriate extraction and masking is tricky due to the redundancies involved (core data like name and address is most likely stored across multiple databases).

Moreover, reinserting the masked data into all the appropriate system components in a coordinated way is even more daunting. Usually, the enterprise integration layer handles all the synchronization of data across components. If you want to extract, mask, and reinsert test data, you essentially need to re-create the logic of that enterprise integration layer.

In most agile iterations, teams need to enter “crunch mode” just to complete the expected work before the sprint review. There’s little time or enthusiasm for performing additional work focused solely on coordinating test data extraction/masking/insertion.  

Another option is to synthetically generate the test data that you need. The fact that it’s completely fake means that GDPR compliance becomes a nonissue. Another benefit: Synthetic test data can be inserted at the API layer, where the application interacts with the enterprise integration layer. This way, the application handles all the necessary data synchronization.
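As a rough illustration of why synthetic data sidesteps GDPR entirely, here is a minimal generator sketch. The field pools and record shape are invented for this example; a real generator would model the application’s actual domain, but the key property holds regardless: no field is derived from any real person, so nothing in the output is personal data.

```python
import random
import string

# Hypothetical value pools for illustration only; a production
# generator would use domain-specific models of the application's data.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev", "Elif"]
CITIES = ["Lisbon", "Graz", "Tampere", "Cork"]

def synthetic_customer(rng: random.Random) -> dict:
    """Generate one fully synthetic customer record.

    Every field is fabricated, so no GDPR obligations attach to it.
    """
    return {
        "name": rng.choice(FIRST_NAMES),
        "city": rng.choice(CITIES),
        # Fake account number in an IBAN-like shape (not a valid IBAN).
        "iban": "DE" + "".join(rng.choices(string.digits, k=20)),
        "opted_in": rng.random() < 0.5,
    }

# Seeding makes the generated data reproducible across CI runs.
rng = random.Random(42)
customers = [synthetic_customer(rng) for _ in range(3)]
```

Seeding the generator is the detail worth noting: it gives every CI run the same “fresh” data set, which addresses the stale-test-data problem discussed earlier without ever touching production.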

However, fake data can only get you so far. You can typically achieve high (though not perfect) risk coverage using synthetic test data alone. Synthetic test data generation sometimes fails when data objects with a long history are required for testing. For example, it might be difficult or even impossible to provide a 40-year life insurance contract that was signed 25 years ago. This type of legacy data typically needs to be extracted from production because it’s not easily generated.

Fortunately, this limitation is narrow in scope. For example, my company's studies have found that in retail, synthetic test data can usually achieve 98 percent risk coverage. The coverage for telecoms is also high: 96 percent. With insurance and banks, it’s a little lower, but still higher than 90 percent.

My recommendation is to use synthetically generated test data as much as possible, then fill in the gaps with masked production data. You’ll dramatically reduce the amount of test data that falls under the scope of GDPR. This lightens the burden of complying with this extremely stringent and far-reaching regulation—from a software testing perspective, at least. 

Copyright © 2018 IDG Communications, Inc.