The Five Laws Of Data Masking

By Rich

Tomorrow I’ll be giving a webcast over at ZDNet (sponsored by Oracle) on the Top 5 Database Security Resolutions for 2008. The resolutions have changed a bit since I first posted about them over here, and I decided to swap in data masking for the last one. I almost pulled it back out after I found out my sponsor (Oracle) just released a data masking product (I try to avoid being too promotional in my webinars), but it’s something I’ve been talking about for a while and it’s too important to pull just because a few people might think I was being biased.

We’re up to nearly 600 people registered for the event, making it one of the largest webcasts I’ve done.

But enough self-promotion; it’s time to talk about data masking.

Data masking started popping up as an issue about 3 years ago. At the time I was covering database security, but client calls were bouncing around between me on the security team and someone over in application development. It’s one of those annoying security issues that crosses organizational boundaries and ends up the responsibility of those with little security experience. It’s an issue that grew organically, first popping up in some audits related to GLBA (a financial services regulation), and now something we see required for PCI and a few other regulations.

Data masking is really a bad term for what we’re talking about. We can technically mask data anywhere, but when we use the term data masking we usually mean “test data generation” or “analytical data generation”. It’s the conversion of production data into either test and development data or data for a data warehouse (OLAP). For this post we’ll focus on test data generation, but the same techniques can be used for an OLAP where you want data that represents production data, but still protects the sensitive stuff.

And that’s our goal: to take sensitive data from a production system and convert it into non-sensitive data suitable for testing or analysis. We can do this through substitution, transposition, obfuscation, de-coupling, scrambling, hashing, or even encryption.
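To make two of those techniques concrete, here is a minimal sketch of substitution and scrambling (shuffling a column across rows). The row layout, the `FAKE_NAMES` list, and both function names are assumptions made up for illustration; they don't come from any particular masking product.

```python
import random

# Hypothetical list of replacement values; a real tool would use a much larger one.
FAKE_NAMES = ["Alex Morgan", "Sam Lee", "Jordan Diaz", "Casey Kim"]

def substitute_names(rows, rng):
    """Substitution: swap each real name for a value drawn from a fake list."""
    return [{**row, "name": rng.choice(FAKE_NAMES)} for row in rows]

def shuffle_column(rows, column, rng):
    """Scrambling: shuffle one column's values across rows.

    The overall distribution of values is preserved, but each value is
    no longer tied to its original record.
    """
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

production = [
    {"name": "Real Person A", "zip": "85001"},
    {"name": "Real Person B", "zip": "80301"},
    {"name": "Real Person C", "zip": "10001"},
]
rng = random.Random()
test_data = shuffle_column(substitute_names(production, rng), "zip", rng)
```

Note that a plain shuffle can leave some values in their original rows, especially in small tables, so on its own it is a weak control for highly sensitive columns.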

I’m going to quickly eliminate hashing and encryption from the discussion. Those techniques are very effective at protecting data, but the result breaks the second law of data masking: that the data is still representative of the source, without being sensitive.
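A quick illustration of why a raw hash isn't representative: the sketch below hashes a card number with SHA-256 and runs the result through the kind of format check a test application might apply. The `looks_like_card_number` check and the sample number are assumptions for the example, not a real validation rule.

```python
import hashlib

def looks_like_card_number(value: str) -> bool:
    """Rough format check a test application might apply: exactly 16 digits."""
    return len(value) == 16 and value.isdigit()

def hash_card_number(card_number: str) -> str:
    """Protect the value with a plain SHA-256 hash (hex encoded)."""
    return hashlib.sha256(card_number.encode("ascii")).hexdigest()

original = "4111111111111111"  # a commonly used test number, not real data
hashed = hash_card_number(original)

print(looks_like_card_number(original))  # the original passes the format check
print(len(hashed))                       # the hash is 64 hex characters long
print(looks_like_card_number(hashed))    # so it fails the same check
```

The hashed value is well protected, but it no longer fits the column width or passes validation, so it is of little use to the application under test.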

Organizations are increasingly finding that data masking is mandated for regulatory compliance. It’s also an extremely effective way to reduce enterprise risk. Development and test environments are rarely as secure as production, and there’s little reason developers should have access to sensitive data. Analytical systems are often accessed by a wide variety of users, most of whom shouldn’t see sensitive data, with only a fraction of the access and other security controls in transactional systems.

With that, and since I get way more hits if I have the “x laws” in the title, here are the Five Laws of Data Masking:

  1. Masking must not be reversible. However you mask your data, it should never be possible to use it to retrieve the original sensitive data.
  2. The results must be representative of the source data. The reason to mask data instead of just generating random data is that masking allows you to protect sensitive information that still resembles production data for development and testing purposes. This could include geographic distributions, credit card distributions (e.g., leaving the first 4 numbers unchanged, but scrambling the rest), or maintaining human readability of (fake) names and addresses.
  3. Referential integrity must be maintained. Your masking solution should maintain referential integrity: if a credit card number is a primary key, and scrambled as part of masking, then all instances of that number linked through key pairs must be scrambled identically.
  4. Only mask non-sensitive data if it can be used to recreate sensitive data. It isn’t necessary to mask everything in your database, just those parts that you deem sensitive. But remember, some non-sensitive data can be used to either recreate or tie back to sensitive data. For example, if you scramble a medical ID but the treatment codes for a record could only map back to the original record, you also need to scramble those codes. This is called inference analysis, and your masking should protect against it.
  5. Masking must be a repeatable process. One-off masking is not only nearly impossible to maintain, but it’s fairly ineffective. Development/test data needs to represent constantly changing production data as closely as possible. Analytical data may need to be generated daily, or even hourly. If masking isn’t an automated process it’s inefficient, expensive, and ineffective. I know of some organizations that centralize masking and offer it as an internal service to the enterprise.
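As a rough sketch of how laws 2 and 3 can work together, the example below keeps the first 4 digits of a card number and replaces the rest with digits derived from a keyed HMAC. The same input always produces the same output, so a number that appears in two tables is scrambled identically in both. The function name, table layout, and key handling are all assumptions for illustration, not how any specific product does it.

```python
import hashlib
import hmac

def mask_card_number(card_number: str, secret_key: bytes, keep: int = 4) -> str:
    """Keep the first `keep` digits and deterministically scramble the rest."""
    prefix, rest = card_number[:keep], card_number[keep:]
    if not rest:
        return prefix
    digest = hmac.new(secret_key, card_number.encode("ascii"), hashlib.sha256).digest()
    # Reduce the digest to as many decimal digits as we are replacing.
    scrambled = str(int.from_bytes(digest, "big") % (10 ** len(rest)))
    return prefix + scrambled.zfill(len(rest))

# Hypothetical key: it must stay out of the test environment, since anyone
# holding it could brute-force the original numbers.
key = b"example-key-kept-outside-test-env"

customers = [{"card": "4111111111111111", "name": "Fake Name"}]
orders = [{"card": "4111111111111111", "total": 42}]

masked_customers = [{**c, "card": mask_card_number(c["card"], key)} for c in customers]
masked_orders = [{**o, "card": mask_card_number(o["card"], key)} for o in orders]

# The join between the two tables still works after masking.
assert masked_customers[0]["card"] == masked_orders[0]["card"]
```

Two caveats on this sketch. First, two different card numbers could in principle collide on the same masked value, which matters if the column is a primary key, so a real process would need to detect and resolve collisions. Second, irreversibility (law 1) depends entirely on the key: discard it or keep it well away from the masked data.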

These “laws” are just to start the discussion on masking. In future posts I’ll discuss my recommended data masking process and what features to look for in tools.

And if you absolutely can’t wait until I get around to a follow-on post, join me for the webinar on Friday where I’ll dig in a little deeper.

I don’t work for, but have worked with, software supported in Eclipse that follows these rules. IRI FieldShield for source data masking and IRI RowGen for test data creation both preserve referential integrity with de-identified, irreversible field-level masks in databases and flat files.

By Urvashi Saxena

If a company is guilty of one of these the data should be leaked:

By Rbhill

So, why are you still using data masking?

Most of us have no idea when it comes to figuring out ways to acquire the right kind of data we need for any type of test or development project.  We

By AFarber

There is another alternative to data masking: the use of synthetically generated, realistic, even longitudinal data sets, created as a service by companies such as ExactData and used to avoid the risk of compromising sensitive data. Thanks for the information.

**Editorial addition: the person submitting this comment is a member of the recommended company. We ask that vendors please identify themselves as such when discussing their own products and services in comments**

By Matteson

I really like your rules. We have built our product to cover these rules and more. We are also stepping into dynamic and related unstructured data.

But as I see it, I don’t think 41% of organizations mask their data. Can you tell me the source?

By Manmeet

It’s really very good and informative.
Recently I took up a data masking initiative to implement on Sybase ASE and IQ databases.
I am interested in learning about the data masking process and the features to look for in tools.
As I have just started looking into this, can you please recommend which processes and features I should look at?
If you could help me with any notes, white papers, or comparisons between tools, it would be greatly appreciated.

By Anil

[...] This year I was invited to speak on a panel on data masking/test data generation. As usual, it’s something we’ve talked about before, and it’s clearly a warming topic thanks to PCI and HIPAA. I’ve covered data masking for years, and was even involved in a real project long before joining Gartner, but it’s only VERY recently that interest really seems to be accelerating. You can read this post for my Five Laws of Data Masking. [...]

By On Oracle World and Inference Attacks | securosis.

No problem, just letting people know…

By rmogull

I did not mean to criticize but to clarify what is going on in the data masking and security space. I have worked with companies on many aspects of data security over the past 15 years. I apologize if I was misunderstood.

By J Doherty


Full disclosure, I believe you are a vendor in this space. You can still post, but when criticizing other approaches it’s important to disclose your position since it comes with bias (for better or worse). It means you have experience in the area, but you also have a stake in the game.

By rmogull
