Understanding and Selecting Data Masking: How It WorksBy Adrian Lane
In this post I want to show how masking works, focusing on how masking platforms move and manipulate data. I originally intended to start with architectures and mechanics of masking systems; but it should be more helpful to start by describing the different masking models, how data flows through different systems, and the advantages and disadvantages of each. I will comment on common data sources and destinations, and the issues to consider when considering masking technology. There are many different types of data repositories and services which can be masked, so I will go into detail on these choices. For now we will stick to relational databases, to keep things simple. Let’s jump right in and discuss how the technology works.
When most people think about masking, they think about ETL. ‘ETL’ is short for Extraction-Transformation-Load – a concise description of the classic (and still most common) masking process. Sometimes referred to as ‘static’ masking, ETL works against a fixed export from the source repository. Each phase of ETL is typically performed on separate servers: A source data repository, a masking server that orchestrates the transformation, and a destination database. The masking server connects to the source, retrieves a copy of the data, applies the mask to specified columns of data, and then loads the result onto the target server. This process may be partially manual, fully driven by an administrator, or fully automated.
Let’s examine the steps in greater detail:
- Extract: The first step is to ‘extract’ the data from some storage repository – most often a relational database. The extracted data is often formatted to make it easier for the mask to be applied. For example, extraction can performed with a simple
SELECTquery issued against a database, filtering out unwanted rows and formatting columns in the query. Results may be streamed directly to the masking application for processing or dumped into a file – such as a comma-separated
.tsvfile. The extracted data is then securely transferred, as an encrypted file or over an encrypted SSL connection, to the masking platform.
- Transform: The second step is to apply the data mask, transforming sensitive production data into a safe approximation of the original content. See Defining Masking for available transformations. Masks are almost always applied to what database geeks call “columnar data” – which simply means data of the same type is grouped together. For example, a database may contain a ‘customer’ table, where each customer entry includes a social security (SSN). These values are grouped together into a single column, in files and databases alike, making it easier for the masking application to identify which data to mask. The masking application parses through the data, and for each column of data to be masked, it replaces each entry in the column with a masked value.
- Load: In the last step masked data is loaded into a destination database. The masked data is copied to one or more destination databases, where it is loaded back into tables. The destination database does not contain sensitive data, so it is not subject to the same security and audit requirements as the original database with the unmasked data.
ETL is the most generic and most flexible of masking approaches. The logical ETL process flow implemented in dedicated masking platforms, data management tools with integrated masking and encryption libraries, embedded database tools – all the way down to home-grown scripts. I see all these used in production environments, with the level of skill and labor required increasing as you progress down the chain. While many masking platforms replicate the full process – performing extraction, masking, and loading on separate systems – that is not always the case. Here are some alternative masking models and processes.
In some cases you need to create a masked copy within the source database – perhaps before moving it to another less sensitive database. In other cases the production data is moved unchanged (securely!) into another system, and then masked at the destination. When production data is discovered on a test system, the data may be masked without being moved at all. All these variations are called “in-place masking” because they skip both movement steps. The masks are applied as before, but inside the database – which raises its own security and performance considerations.
There are very good reasons to mask in place. The first is to take advantage of databases’ facility with management and and manipulation of data. They are incredibly adept at data transformation, and offer very high masking performance. Leveraging built-in functions and stored procedures can speed up the masking process because the database has already parsed the data. Masking data in place – replacing data rather than creating a new copy – protects database archives and data files from snooping, should someone access backup tapes or raw disk files.
If the security of data after it leaves the production database is your principal concern, then ETL and in-place masking prior to moving data to another location should satisfy security and audit requirements. Many test environments have poor security, which may require masking prior to export or use of a secure ETL exchange, to ensure sensitive data is never exposed on the network or in destination data repository.
That said, among enterprise customers we have interviewed, masking data at the source (in the production database) is not a popular option. The additional computational overhead of the masking operation, in addition to the overhead required to read and write the data being transformed, may have an unacceptable impact on database performance. In many organization legacy databases struggle to keep up with day-to-day operation, and cannot absorb the additional load. Masking in the target database (after the data has been moved) is not very popular either – masking solutions are generally purchased to avoid putting sensitive data on insecure test systems, and such customers prefer to avoid loading data into untrusted test systems prior to masking. In-place masking is typically used as part of a hybrid approach, performed by local software agents prior to database insertion. We will cover specific architectural variations in our next post.
Be aware that in-place masking raises additional security concerns – ‘fingerprints’ from the original data are often left behind in logs, transaction files, and temporary files. Whether on the source or target database, this residue may provide new avenues of attacker. Wherever masking occurs, you need to verify that no unwanted copies or traces of sensitive data remain.
Dynamic masking refers to storage of both production data and the masked version in the same database. We call it dynamic masking not because the mask is generated dynamically, but because the mask is triggered dynamically. When a user requests data, their credentials and session information are examined by the masking platform. Authorized users receive unmasked production data. Users without authorization for sensitive information, or using unapproved applications, or who trigger some other security filter which marks the connection as suspect, receive masked data. The determination of which data to provide is made dynamically, in real time. The criteria for presenting masked data must be specified in advance, and alerts can be transmitted when the masked data is triggered.
Dynamic masking is typically implemented with a) database code to inspect the inbound query and select either original or masked data to return, and b) a database ‘view’, which is nothing more than a temporary virtual table. While the view is structurally similar to a real table, it contains only masked data. Dynamic masking requires a logical plug-in, trigger, or code, to divert suspect queries from the unmasked data. This type of masking is excellent for protecting data in web application environments where users cannot be perfectly authenticated, without costly recoding of applications. Additionally, dynamic masking enables testing of production applications in a production environment, by only exposing masked data to less-privileged test users. For firms with very large production databases, which are not worried about some extra overhead on production database servers, this model provides an excellent approximation of a production environment for testing.
However there are drawbacks. Most customers, especially those running on legacy systems, don’t want this burden on the production database server. Additionally, dynamic masking does not prevent data leakage – from logs, data files, or archives. This is no worse than a normal production environment, but it is important to remember that all these other types of secondary data are targets for attackers. Unfortunately, dynamic masking is currently only available for relational databases – not file systems or other platforms. Finally, should the view fail in some way, unauthorized queries might exposing sensitive data or fail completely. It is important to test behavior under adverse conditions.
Proxy Data Masking
A newer masking model is called “Proxy data masking”. The results of a user query are intercepted and masked on the fly, then returned to the user. For example, if a user queries too many credit card numbers the returned data might be redacted; similarly if the query originates from an unapproved location. This differs from dynamic masking in that it occurs outside the data repository, and is available for non-relational systems as well. The proxy may be an agent or appliance deployed ‘in-line’, between users and the data, to force all requests through the proxy.
Another recent advance on the proxy masking model is query substitution. In this variation the masking platform is smart enough to recognize a sensitive query. The query is intercepted and re-written to select information from a different (masked) column. For example, a query to select credit card numbers might be modified to request tokenized values of credit card numbers instead. The interception model is very flexible; the
SELECT statement can be directed to an alternative ‘view’ as with dynamic masking, it can fetch data from a file of pre-generated masked values, or even from a ‘join’ against another database.
One downside of proxy masking is the need to create policies which account for all the different cases which should return alternative data. Even with automated discovery tools and pre-built policies, expect to invest considerable time in creating and updating policies. As with any other proxy service it is important to consider failover, performance, and scalability prior to production deployment.
Technically, the source of data for extraction – just like the destination to load with masked data – could be just about anything. It can be a single file or multiple files, a quasi-relational database, indexed/ISAM files, NoSQL databases, document management systems, optically transcribed (OCR) images, or even ephemeral data streams such as digitized telephone conversations. As long as the masking platform is able to identify which data to mask, just about anything is possible.
- Databases: The most common data sources are relational database platforms. These databases require strict data definitions, (‘meta-data’) for every column of every table; this rigorous structure is extremely useful for locating and masking sensitive data. Insertion and extraction are commonly handled through SQL queries and database export & load functions. SQL queries offer data in human-readable format, and easily accommodate filtering and formatting of information. On the other hand, queries may place unacceptable load on production databases, or take an unacceptable amount of time to complete. Exports and imports are much less flexible, but masking providers support native export/import formats to save time and reduce server impact. Masking systems connect either though native ODBC/JDBC connections, or by issuing queries directly to the database.
- Non-relational/Indexed Databases: Mainframe databases, data warehouses, business intelligence, CRM solutions, and other large analytic systems are increasingly run on non-relational platforms to improve scalability and performance. These platforms behave much like relational databases, but lack some of their formal data structures; this makes discovery less accurate and masking more difficult. With the growing popularity of these platforms, and the difficulty of monitoring and encryption these environments, masking provides a preventative security control – removing sensitive data while preserving analytic and data-mining capabilities. Again, masking systems connect though native ODBC/JDBC connections, or by issuing queries directly to the database.
- Files: Files are as frequently sources of sensitive information as databases. Corporate IP, financial spreadsheets, medical documents, and the like are stored on file servers; they all can contain sensitive data which must be protected from hackers and disgruntled employees. Some organization choose Data Loss Prevention or File Activity Monitoring solutions to monitor and protect files, but masking provides a simple, low-cost mechanism to proactively detect and remove sensitive data from unauthorized locations. Files can be masked in-place as data is discovered in unauthorized locations, or masked files can be generated periodically to provide non-sensitive datasets to multiple audiences. Files are typically accessed through software agents installed at the OS or file system layer, and which can intelligently offer either direct or masked data.
- NoSQL Databases: Hadoop, Mongo, Cassandra, and other NoSQL database platforms are known for their scalability, and their powerful search and analysis capabilities. NoSQL platforms lack the formalized data integrity checking of relational databases, which makes data insertion much faster. Unfortunately they also lack fine-grained security controls. Their lack of formal data type definitions makes discovery of sensitive data much more difficult, and also complicates masking of such data once it is found. It is becoming increasingly common to substitute valuable information for masked data before it is loaded into such repositories, so compromise of these platforms would not be catastrophic.
- Document Management: Collaboration and document management software are key productivity applications in just about every enterprise, and they contain large amounts of sensitive data. Masking capabilities integrated with the collaboration software provide the same benefits as in relational databases, for unstructured document and messaging systems. Integrated masking for collaboration systems is still new, and commercial offerings are not currently full-featured, but they can help discover and catalog sensitive data, and in some case can replace documents with masked variants.
In the next post I will dig into the architecture of masking systems, and how they are deployed and managed. Really – this time I mean it!