Understanding and Selecting Data Masking: How It Works
In this post I want to show how masking works, focusing on how masking platforms move and manipulate data. I originally intended to start with architectures and mechanics of masking systems; but it should be more helpful to start by describing the different masking models, how data flows through different systems, and the advantages and disadvantages of each. I will comment on common data sources and destinations, and the issues to consider when considering masking technology. There are many different types of data repositories and services which can be masked, so I will go into detail on these choices. For now we will stick to relational databases, to keep things simple. Let’s jump right in and discuss how the technology works. ETL When most people think about masking, they think about ETL. ‘ETL’ is short for Extraction-Transformation-Load – a concise description of the classic (and still most common) masking process. Sometimes referred to as ‘static’ masking, ETL works against a fixed export from the source repository. Each phase of ETL is typically performed on separate servers: A source data repository, a masking server that orchestrates the transformation, and a destination database. The masking server connects to the source, retrieves a copy of the data, applies the mask to specified columns of data, and then loads the result onto the target server. This process may be partially manual, fully driven by an administrator, or fully automated. Let’s examine the steps in greater detail: Extract: The first step is to ‘extract’ the data from some storage repository – most often a relational database. The extracted data is often formatted to make it easier for the mask to be applied. For example, extraction can performed with a simple SELECT query issued against a database, filtering out unwanted rows and formatting columns in the query. Results may be streamed directly to the masking application for processing or dumped into a file – such as a comma-separated .csv or tab-separated .tsv file. The extracted data is then securely transferred, as an encrypted file or over an encrypted SSL connection, to the masking platform. Transform: The second step is to apply the data mask, transforming sensitive production data into a safe approximation of the original content. See Defining Masking for available transformations. Masks are almost always applied to what database geeks call “columnar data” – which simply means data of the same type is grouped together. For example, a database may contain a ‘customer’ table, where each customer entry includes a social security (SSN). These values are grouped together into a single column, in files and databases alike, making it easier for the masking application to identify which data to mask. The masking application parses through the data, and for each column of data to be masked, it replaces each entry in the column with a masked value. Load: In the last step masked data is loaded into a destination database. The masked data is copied to one or more destination databases, where it is loaded back into tables. The destination database does not contain sensitive data, so it is not subject to the same security and audit requirements as the original database with the unmasked data. ETL is the most generic and most flexible of masking approaches. The logical ETL process flow implemented in dedicated masking platforms, data management tools with integrated masking and encryption libraries, embedded database tools – all the way down to home-grown scripts. I see all these used in production environments, with the level of skill and labor required increasing as you progress down the chain. While many masking platforms replicate the full process – performing extraction, masking, and loading on separate systems – that is not always the case. Here are some alternative masking models and processes. In-place Masking In some cases you need to create a masked copy within the source database – perhaps before moving it to another less sensitive database. In other cases the production data is moved unchanged (securely!) into another system, and then masked at the destination. When production data is discovered on a test system, the data may be masked without being moved at all. All these variations are called “in-place masking” because they skip both movement steps. The masks are applied as before, but inside the database – which raises its own security and performance considerations. There are very good reasons to mask in place. The first is to take advantage of databases’ facility with management and and manipulation of data. They are incredibly adept at data transformation, and offer very high masking performance. Leveraging built-in functions and stored procedures can speed up the masking process because the database has already parsed the data. Masking data in place – replacing data rather than creating a new copy – protects database archives and data files from snooping, should someone access backup tapes or raw disk files. If the security of data after it leaves the production database is your principal concern, then ETL and in-place masking prior to moving data to another location should satisfy security and audit requirements. Many test environments have poor security, which may require masking prior to export or use of a secure ETL exchange, to ensure sensitive data is never exposed on the network or in destination data repository. That said, among enterprise customers we have interviewed, masking data at the source (in the production database) is not a popular option. The additional computational overhead of the masking operation, in addition to the overhead required to read and write the data being transformed, may have an unacceptable impact on database performance. In many organization legacy databases struggle to keep up with day-to-day operation, and cannot absorb the additional load. Masking in the target database (after the data has been moved) is not very popular either – masking solutions are generally purchased to avoid putting sensitive data on insecure test systems, and such customers prefer to avoid loading data into untrusted test systems prior to masking. In-place masking is typically