Understanding and Selecting Data Masking: Management and Advanced Features
By Adrian Lane
In this post we will examine many of the features and functions of masking that go beyond the basics of data collection and transformation. The first, and most important, is the management interface for the masking product. Central management is the core addition that transforms masking from a simple tool into an enterprise data security platform. Central management is not new, but its capabilities, maturity, and integration are evolving rapidly. In the second part of today’s post we will discuss the advanced masking functions we are beginning to see, to give you an idea of where these products are heading. Sure, all these products provide management of the basic functions, but the basics don’t fully encompass today’s principal use cases – the advanced feature set and management interfaces differentiate the various products, and are likely to drive your choice of product.
The central management interface is the proverbial “single pane of glass” for management of data, policies, data repositories, and task automation. The user interface is how you interact with data systems and control the flow of information. A good UI can simplify your job, but a bad one will make you want to never use the product! Management interfaces have evolved to accommodate both IT management and non-technical stakeholders, allowing them to set policy, define workflows, understand risk, and manage where data goes. Some products even provide the capability to manage endpoint agents. Keep in mind that each masking platform has its own internal database to store policies, masks, reports, user credentials, and other pertinent information; and some offer visualization technologies and dashboards to help you see exactly what is going on with your data. The following is a list of management features to consider when evaluating the suitability of a masking platform:
- Policy Management: A policy is nothing more than a rule on how sensitive data is to be treated. Policies usually consist of a data mask – the thing that transforms data – and a data source the mask is applied to. Every masking platform comes with several predefined masks, as well as an interface to customize masks to your needs. But the policy interfaces go one step further, associating a mask with a data source. Some platforms take this one step further – allowing a policy to be automatically applied to specific data types, such as credit card numbers, regardless of source or destination. Policy management is typically simplified with predefined policy sets, as we will discuss below.
- Discovery: For most customers discovery has become a must-have feature – not least because it is essential for regulatory compliance. Data discovery is an active scan to first find data repositories, and then scan them for sensitive data. The discovery process works by scanning files and databases, matching content to known patterns (such as 9-digit Social Security numbers) or metadata (data that describes data structure) definitions. As sensitive data is discovered, the discovery tool creates a report containing both the location and a list of the sensitive data types found. Once data is discovered there are many options for what to do next. The report can be sent to interested parties, archived for compliance, or even fed back into the masking product for automatic policy enforcement. The discovery results can be used to build a catalog of metadata, physically map locations within a data center, and even present a risk score based on location and data type. Discovery can be tuned to look in specific locations, refined to look for as few or as many data types as the user is interested in, and automated to find preselected patterns on a regular schedule.
- Credential Management: Selection, extraction, and discovery of information from different data sources all require credentialed access (typically a user name and password) to the file or database in question. The goal is to automate masking as much as possible, so it would be infeasible to expect users to provide a user name and password to begin every masking task. The masking platform needs to either securely store credentials or use credentials from an access management system like LDAP or Active Directory, and seamlessly supply them as needed.
- Data Set Management: For managing test data sets, as well as for compliance, you need to track which data you mask and where you send it. This information is used to orchestrate moving data around the organization – managing which systems get which masked data, tracking when the last update was performed, and so on. As an example, think about the propagation of medical records: an insurance company, a doctor’s office, a clinical trial organization, and the federal government, all receive different subsets of the data, with different masks applied depending on which information each needs. This is the core function of data management tools, many of which have added masking capabilities. Similarly, masking vendors have added data management capabilities in response to customer demand for complex data orchestration. The formalization of how data sets are managed is also key for both automation and visualization, two topics we will discuss below.
- Data Subsetting: For large enterprises, masking is often applied across hundreds or thousands of databases. In these cases it’s incredibly important to be as efficient as possible to avoid overtaxing databases or saturating networks with traffic. People who manage data define the smallest data subset possible that still satisfies application testers’ needs for production quality masked data. This involves cutting down the number of rows exported/viewed, and possibly reducing the number of columns. Defining a common set of columns also helps clone a single masked data set for multiple environments, reducing the computational burden of creating masked clones.
- Automation: Automation of masking, data collection, and distribution tasks is a core function of every masking platform. The automated application of masking policies, and integration with third party systems that rely on masked data, drastically reduce workload. Some systems offer very rudimentary automation capabilities, such as UNIX cron jobs, while others have very complex features to manage remote jobs and work with agents to perform remote tasks. A thorough evaluation of vendor automation is very important because high-quality automation is key to reducing management time; secondary benefits of automation include segregation of duties and distributed management.
- Reporting: As with most security platforms, especially those which support compliance regulations, reporting is a key feature. Some platforms offer basic reporting with the ability to integrate with third-party tools, while others build more elaborate capabilities into the product. Most supply pre-built compliance reports out of the box, used to demonstrate that controls are in place and performing as specified. These can be as simple as reporting when masking tasks are successful – or have failed – but often include specific information on the location and type of data, where test data sets were sent, and when the masks were applied.
- Advanced Masking: As masking is applied to large data sets, especially those with complex relationships between different data elements, it is essential that these relationships remain after the data has been masked. For example, in patient health data – where location, treatment, and diseases have a critical multi-column relationship with each other – the masking process must preserve this n-way relationship. Randomizing dates within a timeline is another example, where each date must be randomized while preserving the order of events relative to each other. The ability to customize application of data masks to maintain complex relationships is becoming more common, and you can tell which verticals a masking vendor serves by the industries for which it includes complex masks.
- Pre-built Masks: Prepackaged masks specifically built to satisfy regulatory requirements are now common. For example, masks for PCI-DSS, FINRA, and HIPAA are offered to cover the data sets governed by those requirements. Prepackaged metadata catalogs, data locations, and masks for commercial off-the-shelf applications are starting to appear as well. These pre-built masks make it easier to deploy the products and help ensure the appropriate type of protection is applied.
- Visualization: Dashboards, network diagrams that illustrate the presence of sensitive data on servers, and flow maps of how data is being moved are an excellent way to understand the risks to information. These types of visualization tools are just making their way into masking products. Understanding where data is stored, what applications access it, and the risks it’s exposed to along the way, are very helpful for setting up your data security program. Visualization maps are typically populated from discovery scans, and may be cross-referenced with risk scores to highlight critical systems to help IT management and security practitioners make better decisions on how to mask or apply other security controls to protect data.
- Diff & Deduplication: For efficiency, it makes sense to only mask the portion of a data set that has not already been masked. For example, when creating test data, you mask and export only new rows of a database. This is sometimes called Information Lifecycle Management. Partial data set masking is becoming a key feature of some very large systems, such as Teradata and NoSQL database deployments, where efficiency and speed are critical to keeping pace with the flow of data. This can be accomplished in a variety of ways, from database queries that select only recently modified rows – for both ETL and in-place masking – to more complex features that store previous data sets and mask new additions.
- Workflow Integration: Some masking platforms integrate with workflow systems, reporting both new discoveries of sensitive data and creation of masked data sets. This lets compliance groups know their reports have been generated, lets application testers know their data has been updated, and alerts security teams to sensitive data in new (and possibly unapproved) locations.
- Encryption & Key Management: Some systems provide the capability to encrypt data within masked data sets, providing the same format preservation capabilities as masking while allowing recovery of the original content when needed. This option is provided through Format Preserving Encryption, usually integrated from a third-party service provider. Encryption does not provide full flexibility in application of masks, and great care must be taken in management of encryption keys, but the additional capabilities of FPE can be extremely useful.
- Validation: Validation is the verification that data has been effectively masked. Some data sets contain corrupt data which cannot be successfully masked, and other issues can interfere with normal operation, so some masking providers offer validation capabilities to confirm success and detect failure. Validation options verify that the dataset is actually masked, and can differentiate real from masked data. This feature is valuable for ensuring critically sensitive data is not exported without appropriate masking, and helps satisfy compliance mandates such as the ‘distinguishability’ requirements of PCI-DSS.
- Cloud Compatibility: Cloud computing poses several challenges to masking platforms – including network addressing, identity management, and discovery. But cloud environments present a large new opportunity for masking vendors, as customers push increasing quantities of data into the cloud. The cloud is especially attractive for Big Data analytics, fast provisioning of test systems, and hybrid data center deployments. And masking is a natural fit for firms that need to protect sensitive data before moving it into multi-tenant environments. Currently, compatibility with cloud service models (including SaaS, PaaS, and IaaS) is rare, but we are seeing the first features for NoSQL data management and integration with SaaS.
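To make two of the features above more concrete, here is a minimal Python sketch of pattern-based discovery and of order-preserving date randomization. It is an illustration under stated assumptions, not how any particular masking product works: the function names (`discover`, `mask_dates`), the two regular expressions, and the `max_gap_days` parameter are all hypothetical. Commercial discovery engines add checksums, metadata rules, and context, and real date masks may also need to preserve the intervals between events, which this sketch deliberately does not.

```python
import random
import re
from datetime import date, timedelta

# Hypothetical patterns for illustration only; real products ship far
# more thorough detectors (checksums, surrounding context, metadata).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}


def discover(rows):
    """Scan rows (dicts) and return {column: sorted data types found}."""
    found = {}
    for row in rows:
        for column, value in row.items():
            for data_type, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    found.setdefault(column, set()).add(data_type)
    return {column: sorted(types) for column, types in found.items()}


def mask_dates(dates, max_gap_days=30, seed=None):
    """Randomize dates while preserving the relative order of events.

    Equal inputs map to equal outputs, and a strictly later input always
    maps to a strictly later output. The real gaps between events are
    replaced with random gaps, so interval lengths are NOT preserved.
    """
    rng = random.Random(seed)
    mapping = {}
    previous = None
    for original in sorted(set(dates)):
        if previous is None:
            # Shift the earliest event by a random offset in either direction.
            offset = rng.randint(-max_gap_days, max_gap_days)
            masked = original + timedelta(days=offset)
        else:
            # Every later event lands at least one day after the prior one.
            masked = previous + timedelta(days=rng.randint(1, max_gap_days))
        mapping[original] = masked
        previous = masked
    return [mapping[d] for d in dates]


if __name__ == "__main__":
    rows = [{"name": "Ann", "tax_id": "123-45-6789"}]
    print(discover(rows))
    visits = [date(2020, 3, 1), date(2020, 1, 1), date(2020, 2, 1)]
    print(mask_dates(visits, seed=1))
```

The same `discover` function can double as a crude validation step: running it over an exported data set and expecting an empty report is one simple way to check that no recognizable sensitive values slipped through unmasked.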
In the next post, we will reach the meat of this series: use cases.