Tomorrow I’ll be giving a webcast over at ZDNet (sponsored by Oracle) on the Top 5 Database Security Resolutions for 2008. The resolutions have changed a bit since I first posted about them over here, and I decided to swap in data masking for the last one. I almost pulled it back out after I found out my sponsor (Oracle) just released a data masking product (I try to avoid being too promotional in my webinars), but it’s something I’ve been talking about for a while and it’s too important to pull just because a few people might think I was being biased.

We’re up to nearly 600 people registered for the event, making it one of the largest webcasts I’ve done.

But enough self-promotion; it’s time to talk about data masking.

Data masking started popping up as an issue about 3 years ago. At the time I was covering database security, but client calls were bouncing around between me on the security team and someone over in application development. It’s one of these annoying security issues that crosses organizational boundaries and ends up the responsibility of those with little security experience. It’s an issue that grew organically, first popping up in some audits related to GLBA (a financial services regulation), and now something we see required for PCI and a few other regulations.

Data masking is really a bad term for what we’re talking about. We can technically mask data anywhere, but when we use the term data masking we usually mean “test data generation” or “analytical data generation”. It’s the conversion of production data into either test and development data or data for a data warehouse (OLAP). For this post we’ll focus on test data generation, but the same techniques can be used for an OLAP where you want data that represents production data, but still protects the sensitive stuff.

And that’s our goal: to take sensitive data from a production system and convert it into non-sensitive data suitable for testing or analysis. We can do this through substitution, transposition, obfuscation, de-coupling, scrambling, hashing, or even encryption.

I’m going to quickly eliminate hashing and encryption from the discussion. Those techniques are very effective at protecting data, but the result breaks the second law of data masking: the data must still be representative of the source, without being sensitive.
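To make that distinction concrete, here is a minimal Python sketch (my own illustration, not any vendor's implementation). The names `FAKE_NAMES`, `mask_by_substitution`, and `mask_by_hash` are hypothetical, and the replacement pool is invented for the example. It contrasts substitution, which yields a readable stand-in, with hashing, which yields something that no longer looks like a name.

```python
import hashlib
import random

# Hypothetical pool of replacement values; a real pool would be far larger.
FAKE_NAMES = ["Alex Morgan", "Sam Rivera", "Jordan Lee", "Casey Brooks"]

def mask_by_substitution(name, rng):
    """Swap a real name for a fake one; the input is deliberately ignored."""
    return rng.choice(FAKE_NAMES)

def mask_by_hash(name):
    """Hash the name; the output no longer resembles a name at all."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

rng = random.Random()  # unseeded, so the pick is not derived from the input
substituted = mask_by_substitution("Jane Doe", rng)
hashed = mask_by_hash("Jane Doe")

print(substituted)  # a readable fake name from the pool
print(hashed)       # a 64-character hex string
```

A tester can still read and sort the substituted value like any other name, while the hex digest would fail most field-length and format checks in a test system.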

Organizations are increasingly finding that data masking is mandated for regulatory compliance. It’s also an extremely effective way to reduce enterprise risk. Development and test environments are rarely as secure as production, and there’s little reason developers should have access to sensitive data. Analytical systems are often accessed by a wide variety of users, most of whom shouldn’t see sensitive data, and they typically have only a fraction of the access and other security controls found in transactional systems.

With that, and since I get way more hits if I have the “x laws” in the title, here are the Five Laws of Data Masking:

1. Masking must not be reversible. However you mask your data, it should never be possible to use it to retrieve the original sensitive data.
2. The results must be representative of the source data. The reason to mask data instead of just generating random data is that masking protects sensitive information while producing data that still resembles production data for development and testing purposes. This could include geographic distributions, credit card distributions (e.g., leaving the first four digits unchanged, but scrambling the rest), or maintaining human readability of (fake) names and addresses.
3. Referential integrity must be maintained. If a credit card number is a primary key and is scrambled as part of masking, then all instances of that number linked through key pairs must be scrambled identically.
4. Only mask non-sensitive data if it can be used to recreate sensitive data. It isn’t necessary to mask everything in your database, just those parts that you deem sensitive. But remember, some non-sensitive data can be used to either recreate or tie back to sensitive data. For example, if you scramble a medical ID but the treatment codes for a record could only map back to the original record, you also need to scramble those codes. This is called inference analysis, and your masking should protect against it.
5. Masking must be a repeatable process. One-off masking is not only nearly impossible to maintain, but it’s also fairly ineffective. Development/test data needs to represent constantly changing production data as closely as possible. Analytical data may need to be generated daily, or even hourly. If masking isn’t an automated process, it’s inefficient, expensive, and ineffective. I know of some organizations that centralize masking and offer it as an internal service to the enterprise.
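
As a rough illustration of how the first three laws can fit together, here is a small Python sketch. It is only one possible approach under stated assumptions, not a recommended product design: `CardMasker` and its fields are hypothetical names, the tables are made-up lists of dictionaries, and the card numbers are dummy values. It keeps the first four digits, replaces the rest with random digits, and reuses the same substitute wherever the same original appears.

```python
import secrets

class CardMasker:
    """Hypothetical substitution-based masker for card numbers held as strings."""

    def __init__(self, keep_prefix=4):
        self.keep_prefix = keep_prefix
        self._mapping = {}  # original -> masked; discard once the run is done
        self._used = set()  # masked values already handed out

    def mask(self, card_number):
        # Reuse the earlier substitute so every table gets the same value.
        if card_number in self._mapping:
            return self._mapping[card_number]
        prefix = card_number[:self.keep_prefix]
        tail_len = len(card_number) - self.keep_prefix
        while True:
            # Random digits carry no information about the original tail.
            tail = "".join(secrets.choice("0123456789") for _ in range(tail_len))
            candidate = prefix + tail
            # Skip duplicates so the masked column can still serve as a key.
            if candidate not in self._used and candidate != card_number:
                break
        self._used.add(candidate)
        self._mapping[card_number] = candidate
        return candidate

masker = CardMasker()
customers = [{"card": "4111111111111111", "name": "A"}]
orders = [
    {"card": "4111111111111111", "total": 25},
    {"card": "5500000000000004", "total": 40},
]

# Mask both tables with the same masker so the join on "card" still works.
masked_customers = [{**row, "card": masker.mask(row["card"])} for row in customers]
masked_orders = [{**row, "card": masker.mask(row["card"])} for row in orders]
```

Because the substitutes are random rather than derived from the originals, nothing in the masked output leads back to the source once the in-memory mapping is thrown away. Note that this sketch does not preserve card checksum digits and does nothing about inference through other columns (the fourth law); both would need extra work in a real process.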

These “laws” are just to start the discussion on masking. In future posts I’ll discuss my recommended data masking process and what features to look for in tools.

And if you absolutely can’t wait until I get around to a follow-on post, join me for the webinar on Friday where I’ll dig in a little deeper.
