Keeping data confidential

Prying eyes can cost companies big time, especially when medical records are exposed. These techniques for obscuring database records can help

See correction at end of article

Many companies that were doing business in a relatively carefree fashion a year ago now find their customers insisting on high levels of accountability. Of course, health-care organizations have been under the watchful eye of the Joint Commission on Accreditation of Healthcare Organizations (JCAHO) and the Health Insurance Portability and Accountability Act (HIPAA) for quite some time. But public companies outside the health-care industry have woken up to find themselves facing Sarbanes-Oxley and other mandates.

Each of these regulations requires enterprises to protect and control the flow of information. Strong, identity-based access-control systems accomplish this for applications and resources, but when it comes to databases -- against which employees may have become accustomed to performing ad-hoc queries -- the situation can be more difficult. In those cases, one way to protect data from prying eyes is to obscure it.

Obscuring data for compliance -- at its most basic level -- just means keeping people from seeing what they're not supposed to see. There are two parts to obscuring data: security and privacy. Threats range from analysts clicking on the wrong database table and accidentally seeing something they shouldn't, to a disgruntled employee looking to get back at the company. So, whether you lock everyone out of your database or just arrange the data so that no personal information can be viewed by nonessential personnel, there are time-tested techniques for protecting your company and your clients against privacy invasion.

There are many ways to obscure data from wandering eyes -- and sometimes DBAs just have to get creative if they are going to achieve their goals. The methods used range in difficulty based not only on the level of sensitivity of the data but also on the skill of the DBA, the company's level of commitment, and the amount of time afforded for the project.

Basic Techniques

Stored procedures are the most flexible tool DBAs have for obscuring data. Stored procedures are snippets of SQL code that have been compiled and saved in the database itself, resulting in better performance than that provided by noncompiled SQL. Stored procedures can also call other procedures, views, and functions and can perform any type of math available to the SQL language. Using well-designed procedures, a DBA can alter data in almost any way necessary, including encrypting it.

Views, another way to obscure data, are a logical representation of data that can join several tables at a time while maintaining a good level of security. Sensitive columns can be filtered out -- or even modified -- using simple conditional statements. Because they are less flexible than stored procedures and can't make major modifications to data, however, views are only useful for protecting against minor threats. Nonetheless, they are still useful for obscuring data as well as the underlying schema.

UDFs (user defined functions) are the next level down in data obscurity. UDFs are similar to stored procedures that can be applied to individual columns. UDFs are implemented at the database level but can be called directly from queries -- unlike stored procedures, which replace the query. Because they are more granular than stored procedures, UDFs can actually be more flexible.

But these techniques must be deployed properly for maximum effect. Here, it's assumed that you're designing these solutions for internal protection. (External solutions are a different story completely.) Typically, you would apply these methods when supplying analysts with data via the Web, for example, rather than giving them free access to execute any query they want to write.

Of course, you can bypass each of these methods and just encrypt that data in the database itself. This comes at a cost -- not only does it take extra time to write the record, because it has to pass through the encryption routine, but you also have to buy an encryption routine. These typically aren't cheap, and unless you have a cryptologist on staff, you may find yourself in the middle of a very expensive solution. Using encryption has one major advantage, however: With all the other methods, anyone who knows the database schema can bypass the mechanisms you've put in place and query the data directly. This isn't possible with database-level encryption.

Use Your Imagination

When it comes to obscuring data from users, you can seldom be too creative. Remember that data obscurity is a security measure and that good security never has a single point of failure. You would never see a network where a firewall or a router or a simple password policy was the sole security measure. So, just as your network makes use of many levels of routers, firewalls, access policies, and passwords, your database should have multiple levels of security.

When it comes to securing data, I live by a simple philosophy: Treat everyone as a potential criminal. Threats come from both sides of the firewall. You never know when a disgruntled employee might do malicious damage to the network or walk off with a copy of a database backup. These are real possibilities that must be considered -- not only for your company's own needs but also to protect your customers from the fallout of having their sensitive data exposed.

With internal threats in mind, another common method of obscuring data is to create a separate reporting database, possibly even on a separate server. An ETL (extraction, transformation, and loading) tool can be used to publish data to this separate database and transform it along the way. This may be the most secure of all the methods discussed so far, because users need not be given access to the production database. Furthermore, the DBA can take steps to ensure that the reporting database doesn't contain sensitive data.

Another crafty technique is to create routines that insert extra data as a smoke screen to confuse anyone who gains access to your database. By keeping a pool of fictitious names, phone numbers, and so on, you can generate random false records, making it impossible for even a disgruntled insider to determine the accuracy of any given record.

I implemented this solution for a major health-care organization, and it worked very well. Even the biostatisticians couldn't spot the fake data. I also created three false records for every real one, which greatly diminishes the likelihood of a thief finding a real record. Separating the genuine records from the bogus ones simply becomes too much trouble.

So how can the system administrators tell fake data from real data? Well, obviously you have to plan ahead for that one. There are two basic methods for separating the data back out. One way is to create an extra column that provides some kind of control. This column can be encrypted, or it can be a bit flag with an ambiguous name.

However, I prefer to not have anything in the database that might allude to the fact that all the data isn't real. Rather, I prefer to create records in another system and then use a join to pull the real records back out again. This way, everything in the production system is real to everyone who sees it, no matter how closely they look at it. The key is kept in a different location.

Click for larger view.

No Single Solution

By now it should be clear that the best plans for securing data will combine several of these methods.

For example, you can use stored procedures or UDFs that call encryption routines, but UDFs by themselves aren't effective because users might not wrap the column inside the function. You should give users a view to query and write the view so that it wraps the columns inside the UDF. This way, users can still perform ad-hoc queries on the data, while administrators maintain control over what the users see.

If you are adding dummy data, views can be used to conceal that fact from your users. By creating the view with a join against a foreign table that filters out the extra data, you can create an additional level of obscurity. This same technique could be deployed as a stored procedure as well. If you employ this method, don't forget to run your fake data through the same encryption routines as your real data. It completes the illusion. And of course, if you don't publish your schema, your users won't be able to bypass your measures, because they won't know what tables to query.

When I was brought in to lead the compliance effort for a major hospital chain, I entered a culture in which all the system administrators and analysts had enjoyed complete access to all the clinical databases for years. They were very familiar with both the schema and the data itself. It forced me to build another schema that summarized the data for the analysts and loaded it every night from the production systems. This schema resided on a different server, in another building, under the control of an outsourced datacenter. The administrators and analysts were then locked out of the production systems, and the backups were encrypted to protect against theft from outside. In this case, it wasn't necessary to encrypt anything because sensitive data simply wasn't being brought over to the reporting system.

Monitoring the Results

After putting your plan in place, you still need to measure your success. Auditing is the best way to achieve this. I'm not going to lie to you; a complete, enterprise-level auditing solution is not cheap, and depending on the database you have, you may never find a perfect solution. That doesn't mean you shouldn't try -- auditing not only gives you a solid measure of your success but it also provides you with an added layer of security.

Unlike some security measures, you need not keep this process under wraps. In fact, if users are aware that their database activity is being audited and that there are severe penalties for tampering, they're far less likely to try to circumvent your plan. Oftentimes this is all it takes to raise the integrity level of your staff. You can even keep your auditing costs down by using what I like to call the dog-training approach.

If you've ever bought those electrified mats that shock your dog when he walks across them, you know what I'm talking about. The mats have a pattern on them, which the dog learns to associate with the shock. When you buy the mats, they sell you dummy mats as well. The theory is that the dog will learn to not walk across anything with the mat's pattern.

Auditing works the same way. Audit your major systems, but tell your employees that all the systems are under full audit. This should give you the best of all worlds: Your company is fully compliant, because the major systems are audited; you didn't have to actually pay to audit all the systems; and you have the most honest employees around.

A previous version of this article contained an error introduced in the editing process that misstated the scope of HIPAA.