So far we have talked about the need for data centric security, what that means, and which tools fit the model. Now it is time to paint a more specific picture of how to implement and deploy data centric security, so here are some concrete examples of how the tools are deployed to support a data centric model.


A gateway is typically an appliance that sits in-line with traffic and applies security as data passes. Data packets are inspected at near line speed, and sensitive data is replaced or obfuscated before packets are passed on.

Gateways are commonly used by enterprises before data is moved off-premise, such as up to the cloud or to another third-party service provider. The gateway sits inside the corporate firewall, at the ‘edge’ of the infrastructure, discovering and filtering out sensitive data. For example, some firms encrypt data before it is moved into cloud storage for backups. Others filter web-based transactions inline, replacing credit card data with tokens without disrupting the web server or commerce applications. Gateways offer high-performance substitution for data in motion, but they must be able to parse the data stream to encrypt, tokenize, or mask sensitive data.
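To make the inline tokenization case concrete, here is a minimal sketch of the substitution step a gateway might perform on a text payload. Everything named here (`TokenVault`, `filter_payload`, the `tok_` prefix) is hypothetical and chosen for illustration; a real gateway would parse the actual wire protocol and call a hardened, persistent token vault rather than an in-memory dictionary.

```python
import re
import secrets

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # Double every second digit, counting from the right.
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


class TokenVault:
    """Hypothetical in-memory vault mapping card numbers to random tokens."""

    def __init__(self) -> None:
        self._token_by_card = {}

    def tokenize(self, card_number: str) -> str:
        # Reuse the token for a repeated card so downstream joins still work.
        if card_number not in self._token_by_card:
            self._token_by_card[card_number] = "tok_" + secrets.token_hex(8)
        return self._token_by_card[card_number]


def filter_payload(payload: str, vault: TokenVault) -> str:
    """Replace Luhn-valid card numbers in the payload with tokens."""

    def substitute(match):
        digits = re.sub(r"[ -]", "", match.group(0))
        if not luhn_valid(digits):
            return match.group(0)  # Leave other long digit runs untouched.
        return vault.tokenize(digits)

    return CARD_PATTERN.sub(substitute, payload)


if __name__ == "__main__":
    vault = TokenVault()
    print(filter_payload("card=4111 1111 1111 1111&order=1001", vault))
```

The Luhn check is there to cut false positives: order numbers and timestamps can look like card numbers to a regular expression, and replacing them would break the downstream application.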

Another gateway deployment model puts appliances in front of “big data” (NoSQL platforms such as Hadoop), replacing data before insertion into the cluster. But support for high “input velocity” is a key advantage of big data platforms. To avoid crippling performance at the security bottleneck, gateways must be able to perform data replacement while keeping up with the big data platform’s ingestion rate. It is not uncommon to see a cluster of appliances feeding a single NoSQL repository, or even hundreds of cloud servers spun up on demand, to mask or tokenize data. These services must secure data very quickly, so they don’t provide deep analysis. Gateways may even need to be told the location of sensitive data within the stream to support substitution.
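When the gateway is told where sensitive data sits, substitution becomes a cheap positional operation with no content inspection at all. The sketch below assumes a pipe-delimited feed and a configured set of field positions; the field positions, the key, and the function names are all illustrative assumptions, and the keyed hash stands in for whatever masking or tokenization method the deployment actually uses.

```python
import hashlib
import hmac

# Assumption: the operator has told the gateway that fields 1 and 3
# (zero-based) of each pipe-delimited record carry sensitive values.
SENSITIVE_FIELDS = frozenset({1, 3})
SECRET_KEY = b"demo-key-only"  # Hypothetical; load from a key manager in practice.


def pseudonymize(value: str) -> str:
    """Keyed hash: equal inputs map to equal outputs without exposing the value."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_record(line: str, delimiter: str = "|") -> str:
    """Replace only the configured field positions; all other fields pass through."""
    fields = line.split(delimiter)
    for position in SENSITIVE_FIELDS:
        if position < len(fields):
            fields[position] = pseudonymize(fields[position])
    return delimiter.join(fields)


if __name__ == "__main__":
    print(mask_record("1001|Alice Smith|2024-01-05|123-45-6789|42.50"))
```

Because each record is handled independently and no state is shared, work like this can be spread across as many gateway instances as the ingestion rate demands.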

Hub and Spoke

ETL (Extract, Transform, and Load) has been around almost as long as relational databases. It describes a process for extracting data from one database, masking it to remove sensitive data, then loading the desensitized data into another database. Over the last several years we have seen a huge resurgence of ETL, as firms look to populate test databases with non-sensitive data that still provides a reliable testbed for quality assurance efforts. A masking or tokenization ‘hub’ orchestrates data movement and implements security. Modeled on test data management systems, modern systems alter health care data and PII (Personally Identifiable Information) to support use in multiple locations with inconsistent or inadequate security. The hub-and-spoke model is typically used to create multiple data sets rather than to secure streams of data; encryption and tokenization are the most common methods of protection in this model. Encryption enables trusted users to decrypt the data as needed, and masking supports analytics without providing the real (sensitive) data.

The graphic above shows ETL in its most basic form, but the old platforms have evolved into much more sophisticated data management systems. They can now discover data stored in files and databases, morph together multiple sources to create new data sets, apply different masks for different audiences, and relocate the results – as files, as JSON streams, or even inserted into a data repository. It is a form of data orchestration, moving information automatically according to policy. Plummeting compute and storage costs have made it feasible to produce and propagate multiple data sets to various audiences.
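As a rough illustration of "different masks for different audiences", the sketch below has a hub derive two data sets from one source using per-audience policies. The audiences, field names, and mask functions are invented for the example, and passing unlisted fields through unchanged is a simplification; a production hub would more likely deny by default.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Mask = Callable[[str], str]


def redact(value: str) -> str:
    return "REDACTED"


def keep_last_four(value: str) -> str:
    return "*" * max(len(value) - 4, 0) + value[-4:]


def year_only(value: str) -> str:
    # Assumes ISO dates (YYYY-MM-DD).
    return value[:4]


def passthrough(value: str) -> str:
    return value


# Hypothetical audience policies: field name -> mask function.
POLICIES: Dict[str, Dict[str, Mask]] = {
    "qa": {"name": redact, "ssn": keep_last_four, "birth_date": year_only},
    "analytics": {"name": redact, "ssn": redact, "birth_date": year_only},
}


def build_data_set(records: List[Record], audience: str) -> List[Record]:
    """Return a masked copy of the records for one audience (one 'spoke')."""
    policy = POLICIES[audience]
    return [
        {field: policy.get(field, passthrough)(value) for field, value in record.items()}
        for record in records
    ]


if __name__ == "__main__":
    source = [{"name": "Alice Smith", "ssn": "123-45-6789",
               "birth_date": "1980-06-15", "zip": "85001"}]
    for audience in POLICIES:
        print(audience, build_data_set(source, audience))
```

The source records are never modified; each audience receives its own derived copy, which is what makes it practical to ship different data sets to locations with different levels of security.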

Reverse Proxy

As with the gateways described above, in the reverse-proxy model an appliance – whether virtual or physical – is inserted inline into the data flow. But reverse proxies are used specifically between users and a database. Offering much more than simple positional substitution, proxies can alter what they return to users based on the recipient and the specifics of their request. They work by intercepting query results and masking them on the fly, transparently substituting masked results for the originals. For example, if a user queries too many credit card numbers, or if a query originates from an unapproved location, the returned data might be redacted. In effect, the proxy masks data dynamically and intelligently. The proxy may be an application running on the database, or an appliance deployed inline between users and data to force all communications through the proxy. The huge advantage of proxies is that they enable data protection without needing to alter the database; they avoid additional programming and quality assurance validation processes.

This model is appropriate for PII/PHI data, when data can be managed from a central location but external users may need access. Some firms have implemented tokenization this way, but masking and redaction are more common. The principal use case is to protect data dynamically, based on user identity and the request itself.
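The decision a reverse proxy makes on each result set can be sketched as a small policy function. The example below is only an illustration of the two triggers mentioned above (too many card numbers, or an unapproved location); the threshold, location names, column name, and `RequestContext` type are assumptions, and a real proxy would sit on the database wire protocol rather than operate on Python dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical policy values, chosen only for illustration.
APPROVED_LOCATIONS = frozenset({"hq", "vpn"})
MAX_CARD_ROWS = 10
CARD_COLUMN = "card_number"


@dataclass(frozen=True)
class RequestContext:
    user: str
    location: str


def mask_card(value: str) -> str:
    """Keep the last four characters and star out the rest."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def should_mask(context: RequestContext, rows: List[Row]) -> bool:
    """Mask when the request comes from an unapproved location or asks for too many cards."""
    card_rows = sum(1 for row in rows if CARD_COLUMN in row)
    return context.location not in APPROVED_LOCATIONS or card_rows > MAX_CARD_ROWS


def filter_result(context: RequestContext, rows: List[Row]) -> List[Row]:
    """Return the rows unchanged, or a masked copy, depending on policy."""
    if not should_mask(context, rows):
        return rows
    return [
        {col: mask_card(val) if col == CARD_COLUMN else val for col, val in row.items()}
        for row in rows
    ]


if __name__ == "__main__":
    rows = [{"id": "1", "card_number": "4111111111111111"}]
    print(filter_result(RequestContext("ann", "hq"), rows))
    print(filter_result(RequestContext("ann", "cafe"), rows))
```

Note that the same query returns different results for different contexts, while the database and the application issuing the query remain untouched.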

Other Options

Many of you have used data centric security before, and use it today, so it is worth mentioning two widely deployed security platforms which don’t quite fit our use cases. Data Loss Prevention (DLP) systems and Digital Rights Management (DRM) are forms of DCS which have each been in use for over a decade. Data Loss Prevention systems are designed to detect sensitive data and ensure data usage complies with security policy – on the network, on the desktop, and in storage repositories. Digital Rights Management embeds ownership and usage rules into the data, with security policy (primarily read and write access) enforced by the applications that use the data. DLP protects at the infrastructure layer, and DRM at the application layer.

Both use encryption to protect data. Both allow users to view and edit data depending on security policies. DLP can be effectively deployed in existing IT environments, helping organizations gain control over data that is already in use. DRM typically needs to be built into applications, with security controls (e.g., encryption and ownership rights) applied to data as it is created. These platforms are designed to expose data (making it available to users) on demand. This means that to leverage these security controls you need to deploy these platforms everywhere you want to use data. These data centric models work for migrating on-premise systems to IaaS (Infrastructure as a Service), but do not lend themselves to the emerging use cases which this series is focused on. DLP is not easy to extend beyond corporate IT boundaries into the cloud, and both models tend to compromise the performance and scalability of NoSQL clusters. In other words, we will continue to see DLP and DRM hybrids, but they do not address the use cases that prompted this research.


So where do you start? Data centric security requires a change in mindset and approach. It is difficult to move away from a network or firewall centric security model when the vast majority of tools are geared to these models. You need to bake DCS into your data management model.

It is helpful to consider the data lifecycle – where data is, where it moves, and how it is used – and then figure out the tools and deployment model you need to secure it. Once you know what you do with your data, you have a much better perspective on how to protect it. That entails learning where data is moving and what sensitive information you have. Once you understand how data is used, select technology and a deployment model that fits your goals. A super-simple example is storing data in the cloud: encryption is effective protection for data at rest. If you are sharing sensitive data across multiple parties with different levels of security access, a combination of redaction and masking might be the right approach.

For each stage in this lifecycle, what is the best way to provide security?

The threat of a breach may or may not provide impetus for firms to re-examine security. And even when companies do re-examine their security approach, they do not necessarily adopt a data centric security model. Focusing on data is logical, but it is an unusual way for firms to look at security. More often they ask “How do attackers get in, and how can I stop them?” If the threat du jour is phishing and malware, customers tend to respond with “So let’s stop phishing and malware”. The same holds when the threat is SQL injection, cross-site scripting, or weak passwords: security deployments follow the threats. It takes broader awareness, and some bravery, to focus on the data at the heart of the data center rather than on the perimeter. Securing data first, to provide “security from the inside out”, requires a different mindset than the ever-popular, never-ending threat/patch ping-pong. For the use cases in this series, firms are realizing that traditional approaches won’t work, and are searching for data protection options which work regardless of environment.