Welcome to Part 4 of our series on Data Loss Prevention/Content Monitoring and Filtering solutions. If you’re new to the series, you should check out Part 1, Part 2, and Part 3 first. I apologize for getting distracted with some other priorities (especially the Data Security Lifecycle); I just realized it’s been about two weeks since my last DLP post in this series. Time to stick the nose to the grindstone (I grew up in a tough suburb) and crank the rest of this guide out.

Last time we covered the technical architectures for detecting policy violations in data moving across the network in communications traffic, including email, instant messaging, web traffic, and so on. Today we’re going to dig into an often overlooked but just as valuable feature of most major DLP products: Content Discovery.

As I’ve previously discussed, the most important component of a DLP/CMF solution is its content awareness. Once you have a good content analysis engine, the potential applications increase dramatically. While catching leaks on the fly is fairly powerful, it’s only one small part of the problem. Many customers are finding that it’s just as valuable, if not more valuable, to figure out where all that data is stored in the first place. Sure, enterprise search tools might be able to help with this, but they really aren’t tuned well for this specific problem. Enterprise data classification tools can also help, but based on discussions with a number of clients, they don’t tend to work well for finding specific policy violations. Thus we see many clients opting to use the content discovery features of their DLP product.

Author’s Note: It’s the addition of robust content discovery that I consider the dividing line between a Data Loss Prevention solution and a Content Monitoring and Filtering solution. DLP is more network-focused, while CMF begins the expansion to robust content protection.
I use the name DLP extensively since it’s the industry standard, but over time we’ll see this migrate to CMF, and eventually to Content Monitoring and Protection, as I discussed in this post.

The biggest advantage of content discovery in a DLP/CMF tool is that it allows you to take a single policy and apply it across data no matter where it’s stored, how it’s shared, or how it’s used. For example, you can define a policy that requires credit card numbers to be emailed only when encrypted, never be shared via HTTP or HTTPS, be stored only on approved servers, and be stored on workstations/laptops only by employees on the accounting team. All of this is done in a single policy on the DLP/CMF management server.

We can break discovery out into three major modes:

Endpoint Discovery: scanning workstations and laptops for content.
Storage Discovery: scanning mass storage, including file servers, SAN, and NAS.
Server Discovery: application-specific scanning of stored data in email servers, document management systems, and databases (not currently a feature of most DLP products, but beginning to appear in some Database Activity Monitoring products).

These modes perform their analysis using three technologies:

Remote Scanning: a connection is made to the server or device using a file sharing or application protocol, and scanning is performed remotely. This is essentially mounting a remote drive and scanning it from a scanning server that takes policies from, and sends results to, the central policy server. For some vendors this is an appliance, for others it’s a server, and for smaller deployments it’s integrated into the central management server.
Agent-Based Scanning: an agent is installed on the system/server to be scanned, and scanning is performed locally. Agents are platform-specific and use local CPU cycles, but can potentially perform significantly faster than remote scanning, especially for large repositories.
For endpoints, this should be a feature of the same agent used for enforcing Data-In-Use controls.
Temporal-Agent Scanning: rather than deploying a full-time agent, a memory-resident agent is installed, performs a scan, then exits without leaving anything running or stored on the local system. This offers the performance of agent-based scanning in situations where you don’t want a full-time agent running.

Any of these technologies can work for any of the modes, and enterprises will typically deploy a mix depending on policy and infrastructure requirements. We currently see some technology limitations of each approach that affect deployment:

Remote scanning can significantly increase network traffic, and its performance is limited by network bandwidth and by the network performance of both the target and the scanner. Because of these practical limitations, some solutions can scan only gigabytes per day per server (sometimes hundreds of GB, but less than a terabyte per day), which may not be sufficient for very large storage.
Agents, temporal or permanent, are limited by processing power and memory on the target system, which often translates to restrictions on the number of policies that can be enforced and the types of content analysis that can be used. For example, most endpoint agents are not capable of enforcing large data sets of partial document matching or database fingerprinting. This is especially true of endpoint agents, which are more limited.
Agents don’t support all platforms.

Once a policy violation is discovered, the discovery solution can take a variety of actions:

Alert/Report: create an incident in the central management server, just like a network violation.
Warn: notify the user via email that they may be in violation of policy.
Quarantine/Notify: move the file to the central management server and leave a .txt file with instructions on how to request recovery of the file.
Quarantine/Encrypt: encrypt the file in place, usually leaving a plain text file with instructions on how to request decryption.
Quarantine/Access Control: change the access controls to restrict access to the file.
Remove/Delete: either transfer the file to the central server without notification, or just delete it.

The combination of different deployment architectures, discovery techniques, and enforcement options creates a powerful set of options for protecting data-at-rest and supporting compliance initiatives. For example, we’re starting to see increasing deployments of CMF to support PCI compliance, more for the ability to ensure (and