
Understanding and Selecting Data Masking: Series Introduction

By Adrian Lane

Data masking has been around a long time. I have been masking since the early ’90s to create test data from production copies of customer insurance records, as well as to alter database columns before sending database exports out for “data cleansing”. At the time masking was little more than UNIX shell scripts or home grown Perl scripts to alter particular columns in .csv files. A few years later I was giddy with excitement to have my first masking ‘program’, running on a paleolithic version of Windows, which actually had a ‘wizard’ for walking through the process. No, it did not help with extraction of information from a database, but it identified the columns to be altered, provided a list of masks to apply, and dumped an error file when it ran into trouble. That saved a lot of tweaking scripts and manually reviewing dump files. And all this was several years before I heard anyone mention ‘ETL’ (Extract, Transform, Load) because ODBC and JDBC drivers to connect to databases were just arriving on the scene, and nobody had automated bulk loads back into another database. That was still science fiction.

Masking products don’t look like that any longer – now they are full-blown data security and management platforms. It feels a bit nostalgic to review data masking technologies, and somewhat surprising to find how far they have evolved into full production-quality enterprise platforms. I have been following data masking for almost two decades, and seen more evolution in the last couple years than over the first dozen. These advancements have come in two forms. First, evolution of the technology in recent years, building the capability to handle just about any type of database or data source, full automation, workflow integration, and a dozen or so data obfuscation techniques. Second, in response to substantial market demand from IT security and compliance departments, the way these tools are used has changed. Increased demands from new buying centers have forced changes in workflow, user interface, and how core capabilities are packaged. It only took a couple public breaches, where production data was easily exfiltrated from unsecured test databases, to drive masking into companies’ production data flows. Compliance requirements such as PCI-DSS cemented the need and are now a principal driver for adoption. The upshot is that most of these tools have seen significant advancement, and now include multiple robust user interfaces to support both technical and non-technical users, as well as pre-packaged solutions for different compliance mandates. Somewhere along the way, masking grew up!

I started following this vertical again because we received a number of customer questions, specifically around compliance. We have been seeing steady growth in adoption of masking over the last four years – perhaps 20% YoY – as more customers use masking to reduce information risk. In some ways it’s a more elegant solution than encryption; and for several deployment models masking is cheaper and easier than surrounding sensitive data with layers of security controls such as user rights management, encryption, database security, and various firewall technologies. When you think about securing Big Data, data analytics systems, HIPAA compliance, and using public cloud computing resources, there is plenty of reason to believe masking’s rapid adoption will continue. I have written a lot about masking on the blog, but never a focused research paper; it seems to be time for a thorough explanation of what masking does and how it helps security.

So I am excited to launch a new series: Understanding and Selecting Data Masking Solutions. I have designed this series to help would-be buyers understand what to look for in a product, and show existing customers how to leverage their investments to solve emerging problems. I’ll delve into the technology, deployment models, data flow, and management capabilities. I will discuss the four principal use cases and how the technology solves certain compliance and security issues, and close out with a brief buyers’ guide on what features to look for based upon your criteria. The outline follows:

  • Core Features: We’ll define masking, introduce the basic technology, and discuss how it’s applied to data. We will also define the major masking options (shuffling, averaging, substitution, field nulling/redaction, and mathematical transposition) and de-identification methods. And we’ll explain the need for data type & format preservation, uniqueness, and semantic & referential integrity.
  • How It Works: We will examine how masking works, focusing on how data flows through it and how information is secured. We’ll describe different options for sources, destinations, extraction methods, loading options, and where & how masking is performed. We will contrast masking against encryption and tokenization to frame advantages of particular techniques for specific use cases later.
  • Technical Architecture: Deployment models (ETL, in-place, and the various options for dynamic masking), issues, and concerns with each. We will discuss support for files and databases, and how masking integrates with these platforms. We’ll include diagrams to compare and contrast the models.
  • Advanced Features: We’ll cover current trends in data discovery, risk & criticality assessment, and mask validation. We will talk about centralized policy management, data set management, and secure data transfer. We’ll discuss integration with other systems such as trouble ticketing, encryption, tokenization, and DLP for automated workflow.
  • Use Cases: We will outline both traditional and new use cases, bringing together the evolving requirements with ongoing changes to masking technologies, along with how these use cases prompt new deployment models. This section will focus on specific customer requirements that have come up in our research; we’ll also evaluate specific masking alternatives to meet security and compliance mandates. We will cover automated workflows and scripting, as well as use of pre-defined templates for defining masks. We’ll discuss compliance masks and pre-built regulatory options, as well as control reporting.
  • Evaluate Your Needs: We’ll wrap up by mapping out evaluation criteria and a process to guide a customer’s buying decision. We will distinguish between “must-have” and “nice-to-have” requirements, compliance, integration, setup, and management.
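
To make the masking options named under Core Features a bit more concrete before the series digs in, here is a minimal Python sketch of four of them: redaction, shuffling, averaging, and substitution that preserves format. The function names and the seeding scheme are my own assumptions for illustration, not how any particular product works, and mathematical transposition is left out. Treat it as a toy under those assumptions rather than a recommended implementation.

```python
import random
from statistics import mean


def redact(values):
    """Field nulling/redaction: replace every value with None."""
    return [None for _ in values]


def shuffle_column(values, seed=0):
    """Shuffling: reorder values within one column so they no longer
    line up with their original rows. The set of values is unchanged."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def average_column(values):
    """Averaging: replace each numeric value with the column mean,
    which hides individual values but preserves the mean."""
    avg = mean(values)
    return [avg for _ in values]


def substitute_digits(value, seed=0):
    """Substitution with format preservation: swap each digit for a
    pseudo-random digit and keep punctuation and length as-is.

    Seeding on the input value (an assumption of this sketch) means the
    same input always maps to the same output, which helps keep joins
    consistent. It does NOT guarantee uniqueness and is not
    cryptographically secure.
    """
    rng = random.Random(f"{seed}:{value}")
    return "".join(
        str(rng.randrange(10)) if ch.isdigit() else ch for ch in value
    )


if __name__ == "__main__":
    ssns = ["123-45-6789", "987-65-4321"]
    salaries = [50000, 60000, 70000]
    print(redact(ssns))
    print(shuffle_column(salaries, seed=42))
    print(average_column(salaries))
    print([substitute_digits(s) for s in ssns])
```

Even in a toy like this the trade-offs show up: shuffling and averaging keep aggregate statistics usable for testing, while format-preserving substitution keeps applications that validate field layouts from breaking.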

As with all Securosis research projects, we are focused on end-user education. So I strongly encourage readers to share their views and experiences, and comment when your views differ from what we describe. Community involvement helps make these research papers better, and the blog provides an open forum to discuss what’s good and bad about various masking solutions. We are interested in how end users employ these products, and always eager to hear about both successes and failures. We also encourage vendors to comment publicly on this series and add your viewpoints, but we ask that you identify yourself as a technology vendor in your comment.

Next: Core Features.

Comments

Hey, your site is very interesting. Plus it was something I’m definitely able to relate to. I’ll keep stopping by your blog, so I hope you continue making fun and interesting posts like this one.

By Appsian


Adrian, hello,

I am with IRI, a data management and protection ISV with standalone and embedded field-level data masking solutions for multiple data sources that can be applied in a range of applications in our or external environments (including ETL, VLDB reorgs and loads, migrations, reports, Hadoop data lakes, etc.).

Multiple masking solutions are in this suite:
http://www.iri.com/products/iri-data-protector
and ‘FieldShield’ is also included in the IRI Voracity data management (discovery, integration, migration, governance, and analytic) platform.

4GL scripts describe layouts and masks, but an Eclipse GUI around them creates, modifies, runs, and manages them in a modern, automated way.

By David Friedland


@Mike - In subsequent posts I point out that some platforms are really data management with masking bolted on. As you point out that’s not a bad thing per se, but strengths mirror initial designs. I’ll cover this in the buyers guide section as well.

On your final point, the reduction of risk combined with the preservation of value is one key advantage. The other is the ability to create data sets for complex data types—for example, being able to secure health care data is really difficult. I’ll get more into this topic with the upcoming use cases.

Thanks for the comment.

-Adrian

By Adrian Lane


First, I agree with your first point that the masking tools have grown up. Scripts were fine for 1, 2, or 3 databases, but when you need to do hundreds you need a method to the madness.

- While it is true that masking tools have come a long way with the recent customer interest in data masking, not all tools were created to do the same thing. Some were designed from the ground up to mask data; others were originally designed to do something different (archiving, subsetting, ETL) and then added masking functionality later. This is not necessarily good or bad, but it does influence how the tool works and the skillset needed to use it.

- I agree that masking is a more elegant solution if you are trying to reduce risk by limiting exposure of your sensitive data. Other technologies work as well, but you need to make sure you understand what situations they address. Offshore developers are not going to get too far using encrypted data, while it should not impact a DBA’s work.

By Mike Logan

