May 23, 2012 - Securosis

Understanding and Selecting Data Masking: Defining Data Masking

Adrian Lane May 23, 2012 1 Comment

Before I start today’s post, thank you for all the letters saying that people are looking forward to this series. We have put a lot of work into this research to ensure we capture the state of currently available technology, and we are eager to address this under-served market. As always, we encourage blog comments because they help readers understand other viewpoints that we may not reflect in the posts proper. And for the record, I’m not knocking Twitter debates – they are useful as well, but they’re more ephemeral and less accessible to folks outside the Twitter cliques – not everybody wants to follow security geeks like me. And I also apologize for our slow start since initial launch – between meeting with vendors, some medical issues, and client off-site meetings, I’m a bit behind. But I have collected all the data I think is needed to do justice to this subject, so let’s get rolling! In today’s post I will define masking and show the basics of how it works. First a couple basic terms with their traditional definitions: Mask: Similar to the traditional definition, of a facade or a method of concealment, a data mask is a function that transforms data into something similar but new. It may or may not be reversible. Obfuscation: Hiding the original value of data. Data Masking Definition Data masking platforms at minimum replace sensitive data elements in a data repository with similar values, and optionally move masked data to another location. Masking effectively creates proxy data which retains part of the value of the original. The point is to provide data that looks and acts like the original data, but which lacks sensitivity and doesn’t pose a risk of exposure, enabling use of reduced security controls for masked data repositories. This in turn reduces the scope and complexity of IT security efforts. The mask should make it impossible or impractical to reverse engineer masked values back to the original data without special additional information. We will cover additional deployment models and options later in this series, but the following graphic provides an overview: Keep in mind that ‘masking’ is a generic term, and it encompasses several possible data masking processes. In a broader sense data masking – or just ‘masking’ for the remainder of this series – encompasses collection of data, obfuscation of data, storage of data, and possibly movement of the masked information. But ‘mask’ is also used in reference to the masking operation itself – how we change the original data into something else. There are many different ways to obfuscate data depending on the type of data being stored, each embodied by a different function, and each meeting suitable for different security and data use cases. It might be helpful to think of masking in terms of Halloween masks: the level of complexity and degree of concealment both vary, depending upon the effect desired by the wearer. The following is a list of common data masks used to obfuscate data, and how their functionalities differ: Substitution: Substitution is simply replacing one value with another. For example, the mask might substitute a person’s first and last names with names from a some random phone book entry. The resulting data still constitutes a name, but has no logical relationship with the original real name unless you have access to the original substitution table. Redaction/Nulling: This is a form of substitution where we simply replace sensitive data with a generic value, such as ‘X’. For example, we could replace a phone number with “(XXX)XXX-XXXX”, or a Social Security Number (SSN) with XXX-XX-XXXX. This is the simplest and fastest form of masking, but provides very little (arguably no information) from the original. Shuffling: Shuffling is a method of randomizing existing values vertically across a data set. For example, shuffling individual values in a salary column from a table of employee data would make the table useless for learning what any particular each employee earns. But it would not change aggregate or average values for the table. Shuffling is a common randomization technique for disassociating sensitive data relationships (e.g., Bob makes $X per year) while retaining aggregate values. Transposition: This means to swap one value with another, or a portion of one string with another. Transposition can be as complex as an encryption function (see below) or a simple as swapping swapping the first four digits of a credit card number with the last four. There are many variations, but transposition usually refers to a mathematical function which moves existing data around in a consistent pattern. Averaging: Averaging is an obfuscation technique where individual numeric values are replaced by a value derived by averaging some portion of the individual number values. In our salary example above, we could substitute individual salaries with the average across a group or corporate division to hide individual salary values while retaining an aggregate relationship to the real data. De-identification: A generic term that applies to any process that strips identifying information, such as who produced the data set, or personal identities within the data set. De-identification is an important topic when dealing with complex, multi-column data sets that provide ample means for someone to reverse engineer masked data back into individual identities. Tokenization: Tokenization is substitution of data elements with random placeholder values, although vendors overuse the term ‘tokenization’ for a variety of other techniques. Tokens are non-reversible because the token bears no logical relationship with the original value. Format Preserving Encryption: Encryption is the process of transforming data into an unreadable state. For any given value the process consistently produces the same result, and it can only be reversed with special knowledge (the key). While most encryption algorithms produce strings of arbitrary length, format preserving encryption transforms the data into an unreadable state while retaining the format (overall appearance) of the original values. Each of these mask types excels in some use cases, and also of course incurs a certain amount of overhead due to its

Read Post

Research

Understanding and Selecting Data Masking: Defining Data Masking

Research

Firestarter: Multicloud Deployment Structures and Blast Radius

Firestarter: So you want to multicloud?

Firestarter: 2019: Insert Winter is Coming Meme Here

Firestarter: re:Invent Security Review

Firestarter: Hardware Hacks and Lift and Pray

Sign Up for Our Newsletter

Contact

About

Quick Links