Securosis

Research

Understanding and Selecting Data Masking: Defining Data Masking

Before I start today’s post, thank you for all the letters saying that people are looking forward to this series. We have put a lot of work into this research to ensure we capture the state of currently available technology, and we are eager to address this under-served market. As always, we encourage blog comments because they help readers understand other viewpoints that we may not reflect in the posts proper. And for the record, I’m not knocking Twitter debates – they are useful as well, but they’re more ephemeral and less accessible to folks outside the Twitter cliques – not everybody wants to follow security geeks like me. And I also apologize for our slow start since initial launch – between meeting with vendors, some medical issues, and client off-site meetings, I’m a bit behind. But I have collected all the data I think is needed to do justice to this subject, so let’s get rolling! In today’s post I will define masking and show the basics of how it works. First a couple basic terms with their traditional definitions: Mask: Similar to the traditional definition, of a facade or a method of concealment, a data mask is a function that transforms data into something similar but new. It may or may not be reversible. Obfuscation: Hiding the original value of data. Data Masking Definition Data masking platforms at minimum replace sensitive data elements in a data repository with similar values, and optionally move masked data to another location. Masking effectively creates proxy data which retains part of the value of the original. The point is to provide data that looks and acts like the original data, but which lacks sensitivity and doesn’t pose a risk of exposure, enabling use of reduced security controls for masked data repositories. This in turn reduces the scope and complexity of IT security efforts. The mask should make it impossible or impractical to reverse engineer masked values back to the original data without special additional information. We will cover additional deployment models and options later in this series, but the following graphic provides an overview: Keep in mind that ‘masking’ is a generic term, and it encompasses several possible data masking processes. In a broader sense data masking – or just ‘masking’ for the remainder of this series – encompasses collection of data, obfuscation of data, storage of data, and possibly movement of the masked information. But ‘mask’ is also used in reference to the masking operation itself – how we change the original data into something else. There are many different ways to obfuscate data depending on the type of data being stored, each embodied by a different function, and each meeting suitable for different security and data use cases. It might be helpful to think of masking in terms of Halloween masks: the level of complexity and degree of concealment both vary, depending upon the effect desired by the wearer. The following is a list of common data masks used to obfuscate data, and how their functionalities differ: Substitution: Substitution is simply replacing one value with another. For example, the mask might substitute a person’s first and last names with names from a some random phone book entry. The resulting data still constitutes a name, but has no logical relationship with the original real name unless you have access to the original substitution table. Redaction/Nulling: This is a form of substitution where we simply replace sensitive data with a generic value, such as ‘X’. For example, we could replace a phone number with “(XXX)XXX-XXXX”, or a Social Security Number (SSN) with XXX-XX-XXXX. This is the simplest and fastest form of masking, but provides very little (arguably no information) from the original. Shuffling: Shuffling is a method of randomizing existing values vertically across a data set. For example, shuffling individual values in a salary column from a table of employee data would make the table useless for learning what any particular each employee earns. But it would not change aggregate or average values for the table. Shuffling is a common randomization technique for disassociating sensitive data relationships (e.g., Bob makes $X per year) while retaining aggregate values. Transposition: This means to swap one value with another, or a portion of one string with another. Transposition can be as complex as an encryption function (see below) or a simple as swapping swapping the first four digits of a credit card number with the last four. There are many variations, but transposition usually refers to a mathematical function which moves existing data around in a consistent pattern. Averaging: Averaging is an obfuscation technique where individual numeric values are replaced by a value derived by averaging some portion of the individual number values. In our salary example above, we could substitute individual salaries with the average across a group or corporate division to hide individual salary values while retaining an aggregate relationship to the real data. De-identification: A generic term that applies to any process that strips identifying information, such as who produced the data set, or personal identities within the data set. De-identification is an important topic when dealing with complex, multi-column data sets that provide ample means for someone to reverse engineer masked data back into individual identities. Tokenization: Tokenization is substitution of data elements with random placeholder values, although vendors overuse the term ‘tokenization’ for a variety of other techniques. Tokens are non-reversible because the token bears no logical relationship with the original value. Format Preserving Encryption: Encryption is the process of transforming data into an unreadable state. For any given value the process consistently produces the same result, and it can only be reversed with special knowledge (the key). While most encryption algorithms produce strings of arbitrary length, format preserving encryption transforms the data into an unreadable state while retaining the format (overall appearance) of the original values. Each of these mask types excels in some use cases, and also of course incurs a certain amount of overhead due to its

Share:
Read Post
dinosaur-sidebar

Totally Transparent Research is the embodiment of how we work at Securosis. It’s our core operating philosophy, our research policy, and a specific process. We initially developed it to help maintain objectivity while producing licensed research, but its benefits extend to all aspects of our business.

Going beyond Open Source Research, and a far cry from the traditional syndicated research model, we think it’s the best way to produce independent, objective, quality research.

Here’s how it works:

  • Content is developed ‘live’ on the blog. Primary research is generally released in pieces, as a series of posts, so we can digest and integrate feedback, making the end results much stronger than traditional “ivory tower” research.
  • Comments are enabled for posts. All comments are kept except for spam, personal insults of a clearly inflammatory nature, and completely off-topic content that distracts from the discussion. We welcome comments critical of the work, even if somewhat insulting to the authors. Really.
  • Anyone can comment, and no registration is required. Vendors or consultants with a relevant product or offering must properly identify themselves. While their comments won’t be deleted, the writer/moderator will “call out”, identify, and possibly ridicule vendors who fail to do so.
  • Vendors considering licensing the content are welcome to provide feedback, but it must be posted in the comments - just like everyone else. There is no back channel influence on the research findings or posts.
    Analysts must reply to comments and defend the research position, or agree to modify the content.
  • At the end of the post series, the analyst compiles the posts into a paper, presentation, or other delivery vehicle. Public comments/input factors into the research, where appropriate.
  • If the research is distributed as a paper, significant commenters/contributors are acknowledged in the opening of the report. If they did not post their real names, handles used for comments are listed. Commenters do not retain any rights to the report, but their contributions will be recognized.
  • All primary research will be released under a Creative Commons license. The current license is Non-Commercial, Attribution. The analyst, at their discretion, may add a Derivative Works or Share Alike condition.
  • Securosis primary research does not discuss specific vendors or specific products/offerings, unless used to provide context, contrast or to make a point (which is very very rare).
    Although quotes from published primary research (and published primary research only) may be used in press releases, said quotes may never mention a specific vendor, even if the vendor is mentioned in the source report. Securosis must approve any quote to appear in any vendor marketing collateral.
  • Final primary research will be posted on the blog with open comments.
  • Research will be updated periodically to reflect market realities, based on the discretion of the primary analyst. Updated research will be dated and given a version number.
    For research that cannot be developed using this model, such as complex principles or models that are unsuited for a series of blog posts, the content will be chunked up and posted at or before release of the paper to solicit public feedback, and provide an open venue for comments and criticisms.
  • In rare cases Securosis may write papers outside of the primary research agenda, but only if the end result can be non-biased and valuable to the user community to supplement industry-wide efforts or advances. A “Radically Transparent Research” process will be followed in developing these papers, where absolutely all materials are public at all stages of development, including communications (email, call notes).
    Only the free primary research released on our site can be licensed. We will not accept licensing fees on research we charge users to access.
  • All licensed research will be clearly labeled with the licensees. No licensed research will be released without indicating the sources of licensing fees. Again, there will be no back channel influence. We’re open and transparent about our revenue sources.

In essence, we develop all of our research out in the open, and not only seek public comments, but keep those comments indefinitely as a record of the research creation process. If you believe we are biased or not doing our homework, you can call us out on it and it will be there in the record. Our philosophy involves cracking open the research process, and using our readers to eliminate bias and enhance the quality of the work.

On the back end, here’s how we handle this approach with licensees:

  • Licensees may propose paper topics. The topic may be accepted if it is consistent with the Securosis research agenda and goals, but only if it can be covered without bias and will be valuable to the end user community.
  • Analysts produce research according to their own research agendas, and may offer licensing under the same objectivity requirements.
  • The potential licensee will be provided an outline of our research positions and the potential research product so they can determine if it is likely to meet their objectives.
  • Once the licensee agrees, development of the primary research content begins, following the Totally Transparent Research process as outlined above. At this point, there is no money exchanged.
  • Upon completion of the paper, the licensee will receive a release candidate to determine whether the final result still meets their needs.
  • If the content does not meet their needs, the licensee is not required to pay, and the research will be released without licensing or with alternate licensees.
  • Licensees may host and reuse the content for the length of the license (typically one year). This includes placing the content behind a registration process, posting on white paper networks, or translation into other languages. The research will always be hosted at Securosis for free without registration.

Here is the language we currently place in our research project agreements:

Content will be created independently of LICENSEE with no obligations for payment. Once content is complete, LICENSEE will have a 3 day review period to determine if the content meets corporate objectives. If the content is unsuitable, LICENSEE will not be obligated for any payment and Securosis is free to distribute the whitepaper without branding or with alternate licensees, and will not complete any associated webcasts for the declining LICENSEE. Content licensing, webcasts and payment are contingent on the content being acceptable to LICENSEE. This maintains objectivity while limiting the risk to LICENSEE. Securosis maintains all rights to the content and to include Securosis branding in addition to any licensee branding.

Even this process itself is open to criticism. If you have questions or comments, you can email us or comment on the blog.