Trends in Data Centric Security: Deployment Models
So far we have talked about the need for data centric security, what that means, and which tools fit the model. Now it is time to paint a more specific picture of how to implement and deploy data centric security, so here are some concrete examples of how the tools are deployed to support a data centric model.

Gateways

A gateway is typically an appliance that sits in-line with traffic and applies security as data passes through. Data packets are inspected at near line speed, and sensitive data is replaced or obfuscated before the packets are passed on. Gateways are commonly used by enterprises before data is moved off-premise, such as up to the cloud or to another third-party service provider. The gateway sits inside the corporate firewall, at the ‘edge’ of the infrastructure, discovering and filtering out sensitive data. For example, some firms encrypt data before it is moved into cloud storage for backups. Others filter web-based transactions inline, replacing credit card data with tokens without disrupting the web server or commerce applications. Gateways offer high-performance substitution for data in motion, but they must be able to parse the data stream in order to encrypt, tokenize, or mask sensitive data (a simple sketch of this kind of inline substitution appears below).

Another gateway deployment model puts appliances in front of “big data” repositories (NoSQL databases such as Hadoop), replacing data before insertion into the cluster. Support for high “input velocity” is a key advantage of big data platforms, so to avoid crippling performance at this security bottleneck, gateways must perform data replacement while keeping up with the platform’s ingestion rate. It is not uncommon to see a cluster of appliances feeding a single NoSQL repository, or even hundreds of cloud servers spun up on demand, to mask or tokenize data. These services must secure data very quickly, so they do not provide deep analysis, and they may even need to be told the location of sensitive data within the stream in order to perform substitution.

Hub and Spoke

ETL (Extract, Transform, and Load) has been around almost as long as relational databases. It describes a process for extracting data from one database, masking it to remove sensitive elements, then loading the desensitized data into another database. Over the last several years we have seen a huge resurgence of ETL as firms look to populate test databases with non-sensitive data that still provides a reliable testbed for quality assurance efforts. A masking or tokenization ‘hub’ orchestrates data movement and implements security. Modeled on test data management systems, modern hubs alter health care data and PII (Personally Identifiable Information) to support use in multiple locations with inconsistent or inadequate security. The hub-and-spoke model is typically used to create multiple data sets rather than to secure streams of data, and encryption and tokenization are the most common methods of protection. Encryption enables trusted users to decrypt the data as needed, while masking supports analytics without exposing the real (sensitive) data. The graphic above shows ETL in its most basic form, but the old platforms have evolved into much more sophisticated data management systems. They can now discover data stored in files and databases, merge multiple sources to create new data sets, apply different masks for different audiences, and relocate the results – as files, as JSON streams, or even as inserts into a data repository.
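To make the gateway model a bit more concrete, here is a minimal sketch of inline substitution for credit card numbers. It assumes a newline-delimited record stream, a regular-expression match for card numbers, and an HMAC-based token format; all of these are illustrative choices rather than a description of any particular product, and a real gateway would parse the actual wire protocol and typically use vaulted or format-preserving tokenization.

```python
import hashlib
import hmac
import re

# Assumption: records are plain text and card numbers appear as 13-16
# contiguous digits. Real gateways parse specific protocols and formats.
PAN_PATTERN = re.compile(r"\b\d{13,16}\b")

# Demo key for deterministic tokens; a production system would use a token
# vault or format-preserving encryption, not a bare HMAC.
TOKEN_KEY = b"demo-only-secret"

def tokenize_pan(pan: str) -> str:
    """Replace a card number with a deterministic token, keeping the last 4 digits."""
    digest = hmac.new(TOKEN_KEY, pan.encode(), hashlib.sha256).hexdigest()
    return f"TOK-{digest[:12]}-{pan[-4:]}"

def filter_record(record: str) -> str:
    """Substitute tokens for any card numbers found in a single record."""
    return PAN_PATTERN.sub(lambda m: tokenize_pan(m.group()), record)

def gateway(stream):
    """Inspect each record as it passes and forward it with sensitive data replaced."""
    for record in stream:
        yield filter_record(record)

if __name__ == "__main__":
    incoming = [
        "order=1001,card=4111111111111111,amount=29.95",
        "order=1002,card=5500005555555559,amount=103.20",
    ]
    for outgoing in gateway(incoming):
        print(outgoing)
```

The same substitution logic scales out horizontally, which is how clustered appliances keep pace with a big data platform's ingestion rate.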
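The hub-and-spoke flow can be sketched just as simply as an extract-mask-load pass. The SQLite databases, the customers table, and the masking rules below are hypothetical stand-ins for whatever sources, schemas, and policies a real masking hub would discover and apply.

```python
import hashlib
import sqlite3

def mask_name(name: str) -> str:
    """Replace a real name with a consistent pseudonym so joins and tests still work."""
    return "Customer-" + hashlib.sha256(name.encode()).hexdigest()[:8]

def mask_ssn(ssn: str) -> str:
    """Redact all but the last four digits of a Social Security number."""
    return "***-**-" + ssn[-4:]

def extract_mask_load(source_db: str, target_db: str) -> None:
    """Extract rows from the source, desensitize them, and load them into the target."""
    src = sqlite3.connect(source_db)
    dst = sqlite3.connect(target_db)
    dst.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT, ssn TEXT)")
    rows = src.execute("SELECT id, name, ssn FROM customers")
    masked = [(cid, mask_name(name), mask_ssn(ssn)) for cid, name, ssn in rows]
    dst.executemany("INSERT INTO customers VALUES (?, ?, ?)", masked)
    dst.commit()
    src.close()
    dst.close()

if __name__ == "__main__":
    # Seed a tiny demo source so the sketch runs end to end.
    src = sqlite3.connect("production.db")
    src.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT, ssn TEXT)")
    src.execute("INSERT INTO customers VALUES (1, 'Alice Smith', '123-45-6789')")
    src.commit()
    src.close()
    extract_mask_load("production.db", "test_copy.db")
```

A production hub adds the orchestration pieces this sketch omits: data discovery, per-audience masking policies, and delivery of the results as files, streams, or database inserts.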
In effect, hub and spoke is a form of data orchestration: information is moved automatically according to policy, and plummeting compute and storage costs have made it feasible to produce and propagate multiple data sets to different audiences.

Reverse Proxy

As with the gateways described above, in the reverse-proxy model an appliance – virtual or physical – is inserted inline into the data flow. But reverse proxies are used specifically between users and a database. Offering much more than simple positional substitution, a proxy can alter what it returns to users based on the recipient and the specifics of the request. It works by intercepting and masking query results on the fly, transparently substituting masked values before results are returned to the user. For example, if a user queries too many credit card numbers, or if a query originates from an unapproved location, the returned data can be redacted. In effect the proxy performs intelligent, dynamic masking. The proxy may be an application running on the database itself, or an appliance deployed inline between users and data so that all communications are forced through it. The huge advantage of proxies is that they protect data without requiring changes to the database, avoiding additional programming and quality assurance validation cycles. This model is appropriate for PII/PHI data which can be managed from a central location but must remain accessible to external users. Some firms have implemented tokenization this way, but masking and redaction are more common. The principal use case is to protect data dynamically, based on user identity and the request itself.

Other Options

Many of you have used data centric security before, and use it today, so it is worth mentioning two security platforms in wide use which don’t quite fit our use cases. Data Loss Prevention (DLP) and Digital Rights Management (DRM) are forms of DCS which have each been in use for over a decade. Data Loss Prevention systems are designed to detect sensitive data and ensure its usage complies with security policy – on the network, on the desktop, and in storage repositories. Digital Rights Management embeds ownership and usage rules into the data itself, with security policy (primarily read and write access) enforced by the applications that use the data. DLP protects at the infrastructure layer, DRM at the application layer. Both use encryption to protect data. Both allow users to view and edit data depending on security policies. DLP can be deployed effectively in existing IT environments, helping organizations gain control over data already in use. DRM typically needs to be built into applications, with security controls (e.g., encryption and ownership rights) applied to data as it is created. These platforms are designed to expose data (making it available