Understanding and Selecting Data Masking: Technical ArchitectureBy Adrian Lane
Today we will discuss platform architectures and deployment models. Before I jump into the architectural models, it’s worth mentioning that these architectures are designed in response to how enterprises use data. Data is valuable because we use it to support business functions. Data has value in use. The more places we can leverage data to make decisions, the more valuable it is. However, as we have seen over the last decade, data propagation carries many risks. Masking architectures are designed to fit within existing data management frameworks and mitigate risks to information without sacrificing usefulness. In essence we are inserting controls into existing processes, using masking as a guardian, to identify risks and protect data as it migrates through the enterprise applications that automate business processes.
As I mentioned in the introduction, we have come a long way from masking as nothing more than a set of scripts run by an admin or database administrator. Back then you connected directly to a database, or ran scripts from the console, and manually moved files around. Today’s platforms proactively discover sensitive data and manage policies centrally, handling security and data distribution across dozens of different types of information management systems, automatically generating masked data as needed for different audiences. Masking products can stand alone, serving disparate data management systems simultaneously, or be embedded as a core function of a dedicated data management service.
- Single Server/Appliance: A single appliance or software installation that performs static ‘ETL’ data masking services. The server is wholly self-contained – performing all extraction, masking, and loading from a single location. This model is typically used in small and mid-sized enterprises. It can scale geographically, with independent servers in regional offices to handle masking functions, usually in response to specific regional regulatory requirements.
- Distributed: This option consists of a central management server with remote agents/plug-ins/appliances that perform discovery and masking functions. The central server distributes masking rules, directs endpoint functionality, catalogs locations and nature of sensitive data, and tracks masked data sets. Remote agents periodically receive updates with new masking rules from the central server, and report back sensitive data that has been discovered, along with the results of masking jobs. Scaling is by pushing processing load out to the endpoints.
- Centralized Architecture: Multiple masking servers, centrally located and managed by a single management server, are used primarily for production and management of masked data for multiple test and analytics systems.
- Proxy/Bridge Cluster: One or more appliances or agents that dynamically mask streamed content, typically deployed in front of relational databases, to provide proxy-based data masking. This model is used for real-time masking of non-static data, such as database queries or loading into NoSQL databases. Multiple appliances provide scalability and failover capabilities. This may or may not be used in a two-tier architecture.
Appliances, software, and virtual appliance options are all available. But unlike most security products, where appliances dominate the market, masking vendors generally deliver their products as software. Windows, Linux, and UNIX support is all common, as is support for many types of files and relational databases. Support for virtual appliance deployment is common among the larger vendors but not universal, so inquire about availability if that is key to your IT service model.
A key masking evolution is the ability to apply masking policies across different data management systems (file management, databases, document management, etc.) regardless of platform type (Windows vs. Linux vs. …). Modern masking platforms are essentially data management systems, with policies set at a central location and applied to multiple systems through direct connection or remote agent software. As data is collected and moved from point A to point B, one or more data masks are applied to one or more ‘columns’ of the data.
Deployment and Endpoint Options
While masking architecture is conceptually simple, there are many different deployment options, each particularly suited to protecting one or more data management systems. And given masking technologies must work on static data copies, live database repositories, and dynamically generated data (streaming data feeds, application generated content, ad hoc data queries, etc.), a wide variety of deployment options are available to accommodate the different data management environments. Most companies deploy centralized masking servers to produce safe test and analytics data, but vendors offer the flexibility to embed masking directly into other applications and environments where large-footprint masking installations or appliances are unsuitable. The following is a sample of the common deployments used for remote data collection and processing.
Agents: Agents are software components installed on a server, usually the same server that hosts the data management application. Agents have the option of being as simple or advanced as the masking vendor cares to make them. They can be nothing more than a data collector, sending data back to a remote masking server for processing, or might provide masking as data is collected. In the latter case, the agent masks data as it is received, either completely in memory or from a temporary file. Agents can be managed remotely by a masking server or directly by the data management application, effectively extending data management and collaboration system capabilities (e.g., MS SharePoint, SAP). One of the advantages of using agents at the endpoint rather than in-database stored procedures – which we will describe in a moment – is that all traces of unmasked data can be destroyed. Either by masking in ‘ephemeral’ memory, or by ensuring temporary files are overwritten, sensitive data is not leaked through temporary storage. Agents do consume local processor, memory, and storage – a significant issue for legacy platforms – but only a minor consideration for virtual machines and cloud deployments.
Web Server Plug-ins: Technically a form of agent, these plug-ins are installed as web application services, as part of an Apache/web application stack used to support the local application which manages data. Plug-ins are an efficient way to transparently implement masking within existing application environments, acting on the data stream before it reaches the application or extending the application’s functionality through Application Programming Interface (API) calls. This means plug-ins can be managed by a remote masking server through web API calls, or directly by the larger data management system. Deployment in the web server environment gives plug-ins the flexibility to perform ETL, in-place, and proxy masking. Proxy agents are most commonly used to decipher XML data streams, and are increasingly common for discovering and masking data streams prior to loading into “Big Data” analytic databases.
Stored Procedures/Triggers: A stored procedure is basically a small script that resides and runs inside a database. A trigger is a small function that manipulates data as records are inserted and updated in a database. Both these database features have been used to implement data masking. Stored procedures and triggers take advantage of the database’s natural capabilities to apply masking functions to data, seamlessly, inside the database. They can be used to mask ‘in-place’, replacing sensitive data, or periodically to create a masked ‘view’ of the original data.
ODBC/JDBC Connectors: ODBC stands for Open Database Connector, and JDBC for Java Database Connectivity Connector; each offerings a generic programatic interface to any type of relational database. These interfaces are used by applications that need to work directly with the database to query and update information. You can think about it like a telephone call between the database and the masking platform: The masking server establishes its identity to the database, asks for information, and then hangs up when the data has been transferred. As the masking platform can issue any database query, it’s ideal for pulling incremental changes from a database and filtering out unneeded data. This is critical for reducing the size of the data set, reducing the total amount of processing power required to mask, and making extraction and load operations as efficient as possible. These interfaces are most commonly used to retrieve information for ETL style deployments, but are also leveraged in conjunction with stored procedures to implement in-place masking.
That’s plenty to contemplate for today. Next I will talk in detail about management capabilities – including discovery, policies, data set management, and advanced data masking features.