Understanding and Selecting a DLP Solution: Part 2, Content Awareness

By Rich

Welcome to part 2 of our series on helping you better understand Data Loss Prevention solutions. In Part 1 I gave an overview of DLP, and based on follow-up questions it’s clear that one of the most confusing aspects of DLP is content awareness.

Content awareness is a high level term I use to describe the ability of a product to look into, and understand, content. A product is considered content aware if it uses one, or many, content analysis techniques. Today we’ll look at these different analysis techniques, how effective they may or may not be, and what kinds of data they work best with.

First we need to separate content from context. It’s easiest to think of content as a letter, and context as the envelope and environment around it. Context includes things like source, destination, size, recipients, sender, header information, metadata, time, format, and anything else aside from the content of the letter itself. Context is highly useful and any DLP solution should include contextual analysis as part of the overall solution.

But context alone isn’t sufficient. One early data protection solution could track files based on which server they came from, where they were going, and what actions users attempted on the file. While it could stop a file from a server designated “sensitive” from being emailed out from a machine with the data protection software installed, it would miss untracked versions of the file, movement from systems without the software installed, and a whole host of other routes that weren’t even necessarily malicious. This product lacked content awareness and its utility for protecting data was limited (it has since added content awareness, one reason I won’t name the product).

The advantage of content awareness is that while we use context, we’re not restricted by it. If I want to protect a piece of sensitive data I want to protect it everywhere- not only when it’s in a flagged envelope. I care about protecting the data, not the envelope, so it makes a lot more sense to open the letter, read it, and then decide how to treat it.

Of course that’s a lot harder and more time consuming. That’s why content awareness is the single most important piece of technology in a DLP solution. Opening an envelope and reading a letter is a lot slower than just reading the label- assuming you can even understand the handwriting and language.

The first step in content analysis is capturing the envelope and opening it. I’ll skip the capturing part for now- we’ll talk about that later- and assume we can get the envelope to the content analysis engine. The engine then needs to parse the context (we’ll need that for the analysis) and dig into the content. For a plain text email this is easy, but when you want to look inside binary files it gets a little more complicated.

All DLP solutions solve this using file cracking. File cracking is the technology used to read and understand the file, even if the content is buried multiple levels down. For example, it’s not unusual for the file cracker to read an Excel spreadsheet embedded in a Word file that’s zipped. The product needs to unzip the file, read the Word doc, analyze it, find the Excel data, read that, and analyze it. Other situations get far more complex, like a .pdf embedded in a CAD file.

Many of the products on the market today support around 300 file types, embedded content, multiple languages, double byte character sets (for Asian languages), and can pull plain text from unidentified file types. Quite a few use the Autonomy or Verity content engines to help with file cracking, but all the serious tools have their own set of proprietary capabilities around that. Some tools can support analysis of encrypted data if they have the recovery keys for enterprise encryption, and most can identify standard encryption and use that as a contextual rule to block/quarantine content.
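To make the "buried multiple levels down" idea concrete, here is a minimal sketch of recursive file cracking. It only understands zip containers and assumes everything else can be treated as plain text, which is a big simplification compared to the hundreds of formats a commercial engine handles. The function name `crack` and the `max_depth` limit are illustrative choices, not something taken from any product.

```python
import io
import zipfile


def crack(data: bytes, name: str = "payload", depth: int = 0, max_depth: int = 5):
    """Yield (path, text) pairs for every leaf found inside nested zip archives.

    Assumption: zip is the only container format handled here. Anything that
    is not a zip is decoded as text, dropping bytes that are not valid UTF-8.
    """
    if depth < max_depth and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                if member.endswith("/"):  # skip directory entries
                    continue
                yield from crack(
                    archive.read(member), f"{name}/{member}", depth + 1, max_depth
                )
    else:
        # Fallback: pull whatever plain text we can from an unidentified type.
        yield name, data.decode("utf-8", errors="ignore")
```

Each yielded text would then be handed to the analysis techniques below, along with the path so that the nesting itself can be used as context. The depth limit is there so a maliciously nested archive cannot keep the cracker busy forever.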

Rather than just talking about how hard this is and seeing how far I drag out an analogy, let’s jump in and look at the different content analysis techniques used today:

1. Rules-Based/Regular Expressions: This is the most common analysis technique, available in both DLP products and other tools with DLP-like features. It analyzes the content for specific rules- such as 16-digit numbers that meet credit card checksum requirements, medical billing codes, and other textual analysis. Most DLP solutions enhance basic regular expressions with their own additional analysis rules (e.g. a name in proximity to an address near a credit card number).

What content it’s best for: As a first-pass filter, or for detecting easily identified pieces of structured data like credit card numbers, social security numbers, and healthcare codes/records.

Strengths: Rules process quickly and can easily be configured. Most products ship with initial rules sets. The technology is well understood and easy to incorporate into a variety of products.

Weaknesses: Prone to high false positive rates. Offer very little protection for unstructured content like sensitive intellectual property.
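As an illustration of a rule like "16-digit numbers that meet credit card checksum requirements", here is a short sketch that pairs a regular expression with the Luhn checksum used by payment cards. The pattern and the function names are my own assumptions for the example; real products layer proximity rules and much more on top.

```python
import re

# Sixteen digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return candidate card numbers in text that also pass the checksum."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The checksum cuts down on false positives from order numbers and the like, but roughly one in ten random 16-digit strings still passes it, which is part of why this technique is prone to false alarms.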

2. Database Fingerprinting: Sometimes called Exact Data Matching. This technique takes either a database dump or live data (via ODBC connection) from a database and only looks for exact matches. For example, you could generate a policy to look only for credit card numbers in your customer base, thus ignoring your own employees buying online. More advanced tools look for combinations of information, such as the magic combination of first name or initial, with last name, with credit card or social security number, that triggers a California SB 1386 disclosure. Make sure you understand the performance and security implications of nightly extractions vs. live database connections.

What content it’s best for: Structured data from databases.

Strengths: Very low false positives (close to 0). Allows you to protect customer/sensitive data while ignoring other, similar, data used by employees (like their personal credit cards for online orders).

Weaknesses: Nightly dumps won’t contain transaction data since the last extraction. Live connections can affect database performance. Large databases will affect product performance.
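A rough sketch of the idea, assuming a nightly dump that yields (last name, card number) rows: each value is normalized and stored as a keyed hash so the DLP system never holds the raw data, and outbound text only triggers when both parts of a registered row show up together. The key, the helper names, and the choice of HMAC-SHA-256 are assumptions for illustration, not how any particular vendor does it.

```python
import hashlib
import hmac
import re

SECRET = b"example-key"  # hypothetical key; a real deployment would protect this

CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def fingerprint(value: str) -> str:
    """Keyed hash of a value after lowercasing and stripping punctuation."""
    normalized = re.sub(r"[^0-9a-z]", "", value.lower())
    return hmac.new(SECRET, normalized.encode(), hashlib.sha256).hexdigest()


def build_index(rows) -> set[tuple[str, str]]:
    """rows: iterable of (last_name, card_number) pairs from the database dump."""
    return {(fingerprint(name), fingerprint(card)) for name, card in rows}


def matches(text: str, index: set[tuple[str, str]]) -> bool:
    """True only if a registered name AND its card number both appear in text."""
    words = {fingerprint(w) for w in re.findall(r"[A-Za-z]+", text)}
    cards = {fingerprint(c) for c in CARD_PATTERN.findall(text)}
    return any((w, c) in index for w in words for c in cards)
```

Because only exact registered combinations match, an employee's personal card number sails through untouched, which is where the very low false positive rate comes from.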

3. Exact File Matching: With this technique you take a hash of a file and monitor for any files that match that exact fingerprint. Some consider this to be a contextual analysis technique since the file contents themselves are not analyzed.

What content it’s best for: Media files and other binaries where textual analysis isn’t necessarily possible.

Strengths: Works on any file type, low (effectively no) false positives.

Weaknesses: Trivial to evade. Worthless for content that’s edited, such as standard office documents and edited media files.
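Exact file matching is simple enough to sketch in a few lines. This example uses SHA-256 from the Python standard library as the hash; the function names and the sample bytes are made up for the illustration.

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def build_registry(files) -> set[str]:
    """Fingerprint every protected file (given as raw bytes)."""
    return {file_fingerprint(data) for data in files}


def is_registered(data: bytes, registry: set[str]) -> bool:
    """True if this exact byte sequence was registered as protected."""
    return file_fingerprint(data) in registry


# Stand-in bytes for a protected media file.
original = b"\x00\x01 pretend this is a video file \x02\x03"
registry = build_registry([original])
```

Here `is_registered(original, registry)` is true, but appending a single byte to `original` produces a completely different hash and the match is lost, which is exactly why the technique is trivial to evade.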

4. Partial Document Matching: This technique looks for a complete or partial match on protected content. Thus you could build a policy to protect a sensitive document, and the DLP solution will look for both the complete text of the document, as well as excerpts as small as a few sentences. For example, you could load up a business plan for a new product and the DLP solution would alert if an employee pasted a single paragraph into an Instant Message. Most solutions are based on a technique known as cyclical hashing, where you take a hash of a portion of the content, offset a predetermined number of characters, then take another hash, and keep going until the document is completely loaded as a series of overlapping hash values. Outbound content is run through the same hash technique, and the hash values compared for matches. I’ve simplified this a lot, and the top vendors add a fair bit of other analysis on top of the cyclical hashing, such as removing whitespace, looking at word proximities, and other linguistic analysis that’s over my pay grade.

What content it’s best for: Protecting sensitive documents, or similar content with text such as CAD files (with text labels) and source code. Unstructured content that’s known to be sensitive.

Strengths: Ability to protect unstructured data. Generally low false positives (some vendors will say zero false positives, but any common sentence/text in a protected document can trigger alerts). Doesn’t rely on complete matching of large documents, can find policy violations on even a partial match.

Weaknesses: Performance limitations on the total volume of content that can be protected. Common phrases/verbiage in a protected document may trigger false positives. Must know exactly which documents you want to protect. Trivial to avoid (cannot even handle ROT13 ‘encryption’).
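Here is a toy version of the overlapping-hash idea, with several assumptions of my own: whitespace and case are stripped before hashing, the protected document is hashed every 5 characters with a 30-character window, and outbound content is hashed at every offset so that a pasted excerpt lines up with a registered window no matter where it starts. Vendors use their own window sizes, hash functions, and alignment tricks.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and remove all whitespace so reflowed text still matches."""
    return re.sub(r"\s+", "", text.lower())


def window_hashes(text: str, window: int = 30, step: int = 1) -> set[str]:
    """Hash overlapping windows of the normalized text, advancing by step."""
    clean = normalize(text)
    last_start = max(len(clean) - window + 1, 0)
    return {
        hashlib.sha256(clean[i:i + window].encode()).hexdigest()
        for i in range(0, last_start, step)
    }


def register(document: str, window: int = 30, step: int = 5) -> set[str]:
    """Load a protected document as a set of overlapping hash values."""
    return window_hashes(document, window, step)


def overlap(outbound: str, registered: set[str], window: int = 30) -> int:
    """Count registered hashes that also occur in the outbound content."""
    return len(window_hashes(outbound, window, 1) & registered)
```

With these settings any excerpt of at least 34 normalized characters shares a hash with the registered document. Running the same excerpt through ROT13 first drops the overlap to zero, which illustrates the evasion weakness noted above.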

5. Statistical Analysis: Use of machine learning, Bayesian analysis, and other statistical techniques to analyze a corpus of content and find policy violations on content that resembles the protected content. I’m lumping a bunch of methods into this broad category, so you stat heads and CTOs please don’t get too upset. These are very similar to techniques used to block spam.

What content it’s best for: Unstructured content where a deterministic technique, like partial document matching, will be ineffective. For example, a repository of engineering plans that’s impractical to load for partial document matching due to high volatility of the information, or massive volume.

Strengths: Can work with more nebulous content where you may not be able to isolate exact documents for matching. Can enforce policies such as, “alert on anything outbound that resembles the documents in this directory”.

Weaknesses: Prone to false positives and false negatives. Requires a large corpus of source content, the bigger the better.
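Since these techniques are close cousins of spam filtering, a tiny naive Bayes classifier is a reasonable way to picture them. This is a sketch under simplifying assumptions: two hard-coded labels, word-level tokens, and add-one smoothing. It is one representative of the broad category, not a claim about what any DLP product actually ships.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over two labels, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"sensitive": Counter(), "benign": Counter()}
        self.doc_counts = Counter()

    def train(self, label: str, text: str) -> None:
        self.word_counts[label].update(tokens(text))
        self.doc_counts[label] += 1

    def score(self, label: str, text: str) -> float:
        """Log probability of the label given the text, up to a constant."""
        counts = self.word_counts[label]
        vocab = set(self.word_counts["sensitive"]) | set(self.word_counts["benign"])
        total = sum(counts.values())
        logp = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokens(text):
            logp += math.log((counts[word] + 1) / (total + len(vocab)))
        return logp

    def classify(self, text: str) -> str:
        return max(self.word_counts, key=lambda label: self.score(label, text))
```

You would train it on the protected corpus as "sensitive" and on ordinary traffic as "benign", then classify outbound content. With only a handful of training documents the results are shaky, which is the "bigger the better" corpus requirement in practice.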

6. Conceptual/Lexicon: This technique uses a combination of dictionaries, rules, and other analysis to protect nebulous content that resembles an “idea”. Okay, it’s easier to give an example- a policy that alerts on traffic that resembles insider trading, which uses key phrases, word counts, and positions to find violations. Other examples are sexual harassment, running a private business from a work account, and job hunting.

What content it’s best for: Completely unstructured ideas that defy simple categorization based on matching known documents, databases, or other registered sources.

Strengths: Not all corporate policies or content can be described using specific examples; conceptual analysis can find loosely defined policy violations other techniques don’t even try to monitor for.

Weaknesses: In most cases these are not user-definable and the rule sets must be built by the DLP vendor and take significant effort (= more $$$). Because of the loose nature of the rules, this technique is very prone to false positives and false negatives.
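To give a feel for how a lexicon policy might score traffic, here is a deliberately crude sketch: a dictionary of weighted key phrases, summed by occurrence count and compared to a threshold. The phrases, weights, and threshold are invented for the example, and it ignores word positions entirely, so treat it as a cartoon of what vendor rule sets do.

```python
import re

# Hypothetical phrases and weights for an "insider trading" concept.
LEXICON = {
    "before the announcement": 3,
    "keep this quiet": 3,
    "not public": 2,
    "buy shares": 2,
    "earnings": 1,
}


def concept_score(text: str, lexicon: dict[str, int] = LEXICON) -> int:
    """Sum phrase weights times the number of times each phrase occurs."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return sum(weight * normalized.count(phrase) for phrase, weight in lexicon.items())


def violates(text: str, threshold: int = 5) -> bool:
    """Flag text whose total score reaches the (hypothetical) threshold."""
    return concept_score(text) >= threshold
```

Tuning the weights and threshold is where the effort goes: set them low and every mention of earnings fires an alert, set them high and a carefully worded message slips by.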

7. Categories: Pre-built categories with rules and dictionaries for common types of sensitive data, such as credit card numbers/PCI protection, HIPAA, etc.

What content it’s best for: Anything that neatly fits a provided category. Typically easy-to-describe content related to privacy, regulations, or industry-specific guidelines.

Strengths: Dirt-simple to configure. Saves significant policy generation time. Category policies can form the basis of more advanced, enterprise-specific policies. For many organizations, categories can meet a large percentage of their data protection needs.

Weaknesses: One size fits all might not work. Only good for easily categorized rules/content.

These 7 techniques (well, really 6) form the basis of most of the DLP products on the market. Not all products include all techniques, and there can be significant differences between implementations. Most products can also chain techniques- building complex policies with combinations of content analysis techniques and contextual analysis.
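Chaining can be pictured as composing small checks into a policy. In this sketch a policy fires only when a contextual condition (the message leaves the company) holds together with at least one content condition. The `Message` fields, the `example.com` domain, the project name, and the individual checks are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    sender: str
    destination: str
    body: str


Check = Callable[[Message], bool]


def all_of(*checks: Check) -> Check:
    return lambda message: all(check(message) for check in checks)


def any_of(*checks: Check) -> Check:
    return lambda message: any(check(message) for check in checks)


def leaves_company(message: Message) -> bool:
    """Contextual check: destination is outside the (hypothetical) domain."""
    return not message.destination.endswith("@example.com")


def mentions_project(message: Message) -> bool:
    """Content check: a made-up code name appears in the body."""
    return "project falcon" in message.body.lower()


def has_ssn_pattern(message: Message) -> bool:
    """Content check: something shaped like a US social security number."""
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", message.body) is not None


policy = all_of(leaves_company, any_of(mentions_project, has_ssn_pattern))
```

In a real product each content check would be one of the techniques above rather than a one-line string test, but the way conditions combine is the same.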

When we get to the product selection part of this series we’ll talk about how to compare the effectiveness of the different products. The short answer is I think that even the best engineer in the world can’t predict exactly which product will work best on your live content, and the only way to know for sure is to test.

Hopefully I’ve given you a better idea of how these work, and the different detection options out there. If there’s something I missed, or you have any questions, drop me a line in the comments.

Thanks for challenging me, I was getting worried no one was reading this critically (and BTW, I do plan on doing this series for DAM as well).

Take a look at the Data Security Lifecycle- rather than calling this "context", I’m calling it Logical Controls. I also think it’s the weakest area of data protection and one ripe for innovation. It’s also really freaking hard to figure out those rules.

I’m in full agreement that we need to put business context around data and start making security decisions including that input. Even the best DLP solutions are still stuck with basic descriptive rules that are very limited in what they can find. I actually wrote a model for one way to think about moving towards more dynamic security decisions, but it’s locked in the G archives (it’s called Dynamic Trust).

Logical controls are the key, if we can just figure them out, and in those cases Context is much more important.

By rmogull

I think DLP is a much better characterization of the problem set than ‘Data Protection’, ‘Extrusion Prevention’, ‘Information Security’ or the oft popular ‘Information Risk Management’.  And I really appreciate that you have clearly articulated that data location and data state varies, and with it the security model varies, and this correctly captures the dimensionality of the problems at hand.  And I am philosophically in agreement about this being an effective way to discover business problems.  I do have a couple of nits to pick …

Content Awareness is a tricky thing.  A single data element, let’s use a Credit Card number as an example, has a relatively clear threat model and some fairly straight forward security precautions. If it is in motion, or if it is at rest, the model is probably encryption. In use, it needs to be unencrypted but safe from snooping, with limited access only by those persons or processes required to complete a transaction.  And remember, regardless of state, it is not so difficult to discover a CC#. Other types of data become far more complex to protect but no less valuable.  Pricing information, Cost of Goods Sold, technology acquisition analysis and balance sheet entries do not have simple threat and protection models as the audience for the information, and the number of uses, is far broader.  Understanding an appropriate use of that data is not as simple, and Context becomes very important in determining mis-use. 

The point I am trying to make is that the definition of Context is too narrow, and the variables of source, destination, user and time do not shed light on certain issues.  I think there should be an eighth technique, either Functional or State based analysis.  Some operations may be permissible, but only when certain circumstances or state variables are present.  An example might be only altering one data element (reset a user password) when another data element (work order) exists.  The examples I often use are end of period adjustments, wherein the context for the update is on older data, but within a specific time frame, during a specific window of time, for a specific data value, with specific contra-entries.  How do you know if this is fraudulent or not?  How do you detect if it is in motion without structure or state?  There may not be a deterministic answer, but there can be a very high likelihood, if established checks and balances are followed, by understanding the process, or state, or function that is being followed. Using information about the business process, and not just individual context items (meta-data?), I contend, makes a big difference in detecting security & privacy issues.  The value of content analysis only becomes truly apparent in the context of the operation and security model.  It is implied that the policy engine that drives the data collection, inspection and enforcement will need to percolate this information.

Distillation of content down to basic elements is usually, but not always, the best way to analyze content.  I know it makes the programming easier, but sometimes compound content element analysis, or a combination of content & attribute analysis, or content and context analysis, is far more powerful.  I will leave the discussion of behavior based analysis at the non-network level for some other occasion …

Thanks again for a great series of posts.

By Adrian Lane

