Before I start today’s post, thank you for all the letters saying that people are looking forward to this series. We have put a lot of work into this research to ensure we capture the state of currently available technology, and we are eager to address this under-served market. As always, we encourage blog comments because they help readers understand other viewpoints that we may not reflect in the posts proper. And for the record, I’m not knocking Twitter debates – they are useful as well, but they’re more ephemeral and less accessible to folks outside the Twitter cliques – not everybody wants to follow security geeks like me. And I also apologize for our slow start since initial launch – between meeting with vendors, some medical issues, and client off-site meetings, I’m a bit behind. But I have collected all the data I think is needed to do justice to this subject, so let’s get rolling!
In today’s post I will define masking and show the basics of how it works. First, a couple of basic terms with their traditional definitions:
- Mask: Similar to the traditional definition (a facade or a method of concealment), a data mask is a function that transforms data into something similar but new. It may or may not be reversible.
- Obfuscation: Hiding the original value of data.
Data Masking Definition
Data masking platforms at minimum replace sensitive data elements in a data repository with similar values, and optionally move masked data to another location. Masking effectively creates proxy data which retains part of the value of the original. The point is to provide data that looks and acts like the original data, but which lacks sensitivity and doesn’t pose a risk of exposure, enabling use of reduced security controls for masked data repositories. This in turn reduces the scope and complexity of IT security efforts. The mask should make it impossible or impractical to reverse engineer masked values back to the original data without special additional information.
We will cover additional deployment models and options later in this series, but the following graphic provides an overview:
Keep in mind that ‘masking’ is a generic term, and it encompasses several possible data masking processes. In a broader sense data masking – or just ‘masking’ for the remainder of this series – encompasses collection of data, obfuscation of data, storage of data, and possibly movement of the masked information. But ‘mask’ is also used in reference to the masking operation itself – how we change the original data into something else. There are many different ways to obfuscate data depending on the type of data being stored, each embodied by a different function, and each suited to different security and data use cases. It might be helpful to think of masking in terms of Halloween masks: the level of complexity and degree of concealment both vary, depending upon the effect desired by the wearer. The following is a list of common data masks used to obfuscate data and how their functionalities differ (a short code sketch of a few of them follows the list):
- Substitution: Substitution is simply replacing one value with another. For example, the mask might substitute a person’s first and last names with names from a random phone book entry. The resulting data still constitutes a name, but has no logical relationship with the original real name unless you have access to the original substitution table.
- Redaction/Nulling: This is a form of substitution where we simply replace sensitive data with a generic value, such as ‘X’. For example, we could replace a phone number with “(XXX)XXX-XXXX”, or a Social Security Number (SSN) with XXX-XX-XXXX. This is the simplest and fastest form of masking, but the result retains very little (arguably no) information from the original.
- Shuffling: Shuffling is a method of randomizing existing values vertically across a data set. For example, shuffling individual values in a salary column from a table of employee data would make the table useless for learning what any particular employee earns. But it would not change aggregate or average values for the table. Shuffling is a common randomization technique for disassociating sensitive data relationships (e.g., Bob makes $X per year) while retaining aggregate values.
- Transposition: This means to swap one value with another, or a portion of one string with another. Transposition can be as complex as an encryption function (see below) or as simple as swapping the first four digits of a credit card number with the last four. There are many variations, but transposition usually refers to a mathematical function which moves existing data around in a consistent pattern.
- Averaging: Averaging is an obfuscation technique where individual numeric values are replaced by a value derived by averaging some portion of the individual values. In our salary example above, we could substitute individual salaries with the average across a group or corporate division to hide individual salary values while retaining an aggregate relationship to the real data.
- De-identification: A generic term that applies to any process that strips identifying information, such as who produced the data set, or personal identities within the data set. De-identification is an important topic when dealing with complex, multi-column data sets that provide ample means for someone to reverse engineer masked data back into individual identities.
- Tokenization: Tokenization is substitution of data elements with random placeholder values, although vendors overuse the term ‘tokenization’ for a variety of other techniques. Tokens are non-reversible because the token bears no logical relationship with the original value.
- Format Preserving Encryption: Encryption is the process of transforming data into an unreadable state. For any given value the process consistently produces the same result, and it can only be reversed with special knowledge (the key). While most encryption algorithms produce strings of arbitrary length, format preserving encryption transforms the data into an unreadable state while retaining the format (overall appearance) of the original values.
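To make the differences concrete, here is a minimal Python sketch of three of the simpler masks described above: substitution, redaction/nulling, and shuffling. The substitution list, field names, and sample records are hypothetical placeholders, and a real masking product adds a great deal more (data type handling, integrity checks, secure randomization), but the sketch illustrates the basic transformations.

```python
import random

# Hypothetical substitution table -- a real product would draw from a large,
# managed lookup (e.g., phone book entries), not a four-name list.
SUBSTITUTE_NAMES = ["Alice Smith", "Bob Jones", "Carol Lee", "Dan Brown"]

def substitute_name(original_name):
    """Substitution: replace a real name with an unrelated surrogate name."""
    return random.choice(SUBSTITUTE_NAMES)

def redact_ssn(ssn):
    """Redaction/nulling: replace an SSN with a generic value in the same format."""
    return "XXX-XX-XXXX"

def shuffle_column(values):
    """Shuffling: randomize a column vertically. Individual rows lose meaning,
    but aggregate values (sum, average) for the column are unchanged."""
    shuffled = list(values)
    random.shuffle(shuffled)
    return shuffled

# Hypothetical employee table to be masked
employees = [
    {"name": "Jane Doe", "ssn": "123-45-6789", "salary": 72000},
    {"name": "John Roe", "ssn": "987-65-4321", "salary": 65000},
    {"name": "Pat Poe",  "ssn": "555-12-3456", "salary": 83000},
]

masked_salaries = shuffle_column([e["salary"] for e in employees])
for employee, salary in zip(employees, masked_salaries):
    print(substitute_name(employee["name"]), redact_ssn(employee["ssn"]), salary)
```

Note that the shuffled salary column still sums and averages to the original totals, while the substituted names and redacted SSNs retain no link to the original records.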
Each of these mask types excels in some use cases, and also of course incurs a certain amount of overhead due to its complexity. For example, it’s much easier to replace a phone number with a series of ‘X’s than it is to encrypt the phone number. In general, applying masks (running data through a masking function) is straightforward. However, various constraints make generating quality data masks much more complicated than it might seem at first glance. The requirement to retain some original value makes things much more difficult, and requires considerable additional intelligence from masking products. The data often originates from, or is inserted into, a relational database; so care is required with format and data types, and to ensure that the masked information satisfies integrity checks. Typical customer constraints include:
- Format preservation: The mask must produce data with the same structure as the original data. This means that if the original data is between 2 and 30 characters long, the mask should produce data between 2 and 30 characters long. A common example is a date value, which must maintain the sizes of the day, month, and year sections and their relative positions, as “31.03.2012” might not be interpreted as equivalent to “March 31, 2012” or even “03.31.2012”.
- Data type preservation: With relational data storage it is essential to maintain data types when masking data from one database to another. Relational databases require formal definition of data columns and do not tolerate text in number or date fields. In most cases format-preserving masks implicitly preserve data type, but that is not always the case. In certain cases data can be ‘cast’ from a specific data type into a generic data type (e.g., ‘varchar’), but it is essential to verify consistency.
- Semantic integrity: Databases often place additional constraints on the data they contain, such as a LUHN check for credit card numbers, or a maximum value on employee salaries.
- Referential integrity: An attribute in one table or file may refer to another element in a different table or file, in which case the reference must be consistently maintained. The reference augments or modifies the meaning of each element, and is in fact part of the value of the data. As implied by the name, relational databases optimize data storage by allowing one set of data elements to ‘relate’, or refer, to another. Shuffling or substituting key data values can destroy these references (more accurately, relationships). Masking technologies must maintain referential integrity when data is moved between relational databases. This ensures that loading the new data works without errors, and avoids breaking applications which rely on these relationships.
- Aggregate value: The total and average values of a masked column of data should be retained, or at least remain very close to the original values.
- Frequency distribution: In some cases users require random frequency distribution, whereas in others the logical groupings of values must be maintained or the masked data is not usable. For example, if the original data describes geographic locations of cancer patients by zip code, random zip codes would discard valuable geographical information. The ability to mask data while maintaining certain types of frequency patterns is critical for preserving the value of masked data for analytics.
- Uniqueness: Masked values must be unique. For example, duplicate SSNs are not allowed if uniqueness is a required integrity constraint. This is a critical aspect of referential integrity, as the columns used to link tables must have unique values. A small sketch of validating constraints like these follows this list.
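As a rough illustration of how some of these constraints can be verified, the following Python sketch checks a set of masked credit card numbers for format preservation, semantic integrity (the LUHN check mentioned above), and uniqueness. The function names and sample values are my own illustration under assumed requirements, not any particular product’s API.

```python
def luhn_valid(card_number):
    """Semantic integrity: verify the number passes the LUHN checksum."""
    digits = [int(d) for d in card_number if d.isdigit()]
    checksum = 0
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        checksum += digit
    return checksum % 10 == 0

def format_preserved(original, masked):
    """Format preservation: same length, and digit positions stay digits."""
    return len(original) == len(masked) and all(
        o.isdigit() == m.isdigit() for o, m in zip(original, masked)
    )

def all_unique(values):
    """Uniqueness: no duplicate masked values (required for key columns)."""
    return len(values) == len(set(values))

# Hypothetical (original, masked) pairs produced by some masking function
pairs = [
    ("4111111111111111", "4539578763621486"),
    ("5500005555555559", "5555341244441115"),
]

masked_values = [masked for _, masked in pairs]
assert all(luhn_valid(m) for m in masked_values)
assert all(format_preserved(o, m) for o, m in pairs)
assert all_unique(masked_values)
print("Masked values satisfy format, LUHN, and uniqueness constraints")
```

The point of the sketch is simply that integrity validation is a distinct step from the masking function itself; masking platforms handle these checks as part of the masking job.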
There are many other constraints, but those are the main integrity considerations when masking data. Most vendors include masking functions for all the types described above in the base product, with the ability to specify different data integrity options for each use of a mask. And most products include built-in bundles of appropriate masks for regulatory requirements to streamline compliance efforts. We will go into more detail when we get to use cases and compliance.
Next we will dig into the technical architectures of masking solutions, go into more detail about how data is moved and transformed, and compare advantages of different models.