Someone call the Guinness records people: I’m actually posting the next part of this series when I said I would!

Okay, maybe there’s a deadline or something, but still…

In part 1 we discussed the value of DLP content discovery, defined it a little bit, and listed a few use cases to demonstrate its value. Today we’re going to delve into the technology and a few major features you should look for.

First I want to follow up on something from the last post. I reached out to one of the DLP vendors I work with, and they said they are seeing around 60% of their clients purchase discovery in their initial DLP deployment. Anecdotal conversations with other vendors and clients support this assertion. We don’t know exactly how soon they roll it out, but my experience supports the position that somewhere over 50% of clients roll out some form of discovery within the first 12-18 months of their DLP deployment.

Now on to the…


Let’s start with the definition of content discovery. It’s simply the definition of DLP/CMP, minus the in-use and in-motion components:

“Products that, based on central policies, identify, monitor, and protect data at rest through deep content analysis”.

As with the rest of DLP, the key distinguishing characteristic (as opposed to other data at rest tools like content classification and e-discovery) is deep content analysis based on central policies. While covering all content analysis techniques is beyond the scope of this post, examples include partial document matching, database fingerprinting (or exact data matching), rules-based, conceptual, statistical, predefined categories (like PCI compliance), and combinations of the above. These techniques offer far deeper analysis than simple keyword and regular expression matching. Ideally, DLP content discovery should also offer preventative controls, not just policy alerts on violations. How does this work?
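
Before the architecture, a quick illustration of what “deeper than keywords” can mean. This is a minimal sketch of partial document matching, assuming a simple scheme that hashes overlapping runs of words (shingles) from a protected document and looks for them in other files. The function names, the window size of five words, and the use of SHA-256 are my own assumptions for illustration; commercial products use their own (usually more sophisticated) algorithms.

```python
import hashlib
import re


def shingle_hashes(text, size=5):
    """Return the set of hashes of every run of `size` consecutive words."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return set()  # too short to fingerprint with this window size
    return {
        hashlib.sha256(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(words) - size + 1)
    }


def overlap_ratio(protected_text, candidate_text, size=5):
    """Fraction of the protected document's shingles found in the candidate."""
    protected = shingle_hashes(protected_text, size)
    if not protected:
        return 0.0
    candidate = shingle_hashes(candidate_text, size)
    return len(protected & candidate) / len(protected)
```

A policy could then flag any file whose `overlap_ratio` against a protected document exceeds some threshold, which catches a pasted excerpt that a keyword list would miss.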


At the heart is the central policy server, the same system/device that manages the rest of your DLP deployment. The three key features of the central management server are policy creation, deployment management/administration, and incident handling/workflow. In large deployments you may have multiple central servers, but they all interconnect in a hierarchical deployment.

Data at rest is analyzed using one of four techniques/components:

  1. Remote scanning: either the central policy server or a dedicated scanning server connects to storage repositories/hosts via network shares or other administrative access, then scans files for content violations. Connections are often made using administrative credentials, and any content transferred between the two should be encrypted, though this may require reconfiguring the storage repository and isn’t always possible. Most tools allow bandwidth throttling to limit network impact, and placing scanning servers closer to the storage also increases speed and limits impact. Remote scanning supports nearly any storage repository, but even with optimization, performance will be limited by its reliance on the network.
  2. Server agent: a thin agent is installed on the server and scans content locally. Agents can be tuned to limit performance impact, and results are sent securely to the central management server. While scanning performance is higher than remote scanning, it requires platform support and local software installation.
  3. Endpoint agent: while you can scan endpoints/workstations remotely using administrative file shares, this will rapidly eat up network bandwidth. DLP solutions increasingly include endpoint agents with local discovery capabilities. These agents normally include other DLP functions, such as USB monitoring/blocking.
  4. Application integration: direct integration, often using an agent, with document management, content management, or other storage-oriented applications. This integration not only supports visibility into managed content, but also allows the discovery tool to understand local context and possibly enforce actions within the system.
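
To make the first option a little more concrete, here is a rough sketch of a throttled remote scan, assuming the repository is already mounted as a network share at some local path. Everything here is hypothetical: the `scan_repository` name, the `max_bytes_per_sec` parameter, and especially the policy, where a single regular expression for something shaped like a US Social Security number stands in for the real content analysis engine.

```python
import os
import re
import time

# Hypothetical stand-in policy: anything shaped like a US Social Security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan_repository(root, max_bytes_per_sec=1_000_000, sleep=time.sleep):
    """Walk a mounted repository and return paths whose content violates policy.

    After each file, the scan pauses in proportion to the bytes just read,
    so the average read rate stays at or below `max_bytes_per_sec`.
    """
    violations = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            if SSN_PATTERN.search(data.decode("utf-8", errors="ignore")):
                violations.append(path)
            # Throttle: wait as long as these bytes "should" have taken.
            sleep(len(data) / max_bytes_per_sec)
    return violations
```

The `sleep` argument is only there so the pause can be swapped out when experimenting. A real scanner would also stream large files instead of reading them whole, and would report results to the central server rather than return a list.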

A good content discovery tool will understand file context, not just content. For example, the tool can analyze access controls on the files and, using its directory integration, understand which users and groups have what access. Thus the accounting department can access corporate financials, but any files with that content that allow all-user access are identified for remediation. Likewise, engineering teams can see engineering plans, but if engineering content shows up in the wrong repository, the access controls are automatically updated to restrict access by the accounting team.
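
As a simplified sketch of that idea, assume a POSIX file system where the “other” permission bits stand in for all-user access. A real product would consult directory services and full ACLs instead. The function names are made up for illustration, and `is_sensitive` is assumed to be the verdict from whatever content analysis already ran on the file.

```python
import os
import stat


def needs_remediation(path, is_sensitive):
    """True if a sensitive file is readable by every user on the system."""
    mode = os.stat(path).st_mode
    world_readable = bool(mode & stat.S_IROTH)
    return is_sensitive and world_readable


def restrict_access(path):
    """Remove all permissions for 'other' users, leaving owner and group intact."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~stat.S_IRWXO)
```

The first function covers the “identify for remediation” case, and the second is the kind of automatic access control update described above.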

From an architectural perspective you’ll want to look for solutions that support multiple options, with performance that meets your requirements.

That’s it for today. Tomorrow we’ll review enforcement options (which we’ve hinted at), management, workflow, and reporting. I’m not going to repeat everything from the big DLP whitepaper, but concentrate on aspects important to protecting data at rest.

