Welcome to Part 2 of our series on helping you better understand Data Loss Prevention solutions. In Part 1 I gave an overview of DLP, and based on follow-up questions it’s clear that one of the most confusing aspects of DLP is content awareness.

Content awareness is a high-level term I use to describe the ability of a product to look into, and understand, content. A product is considered content aware if it uses one or more content analysis techniques. Today we’ll look at these different analysis techniques, how effective they may or may not be, and what kinds of data they work best with.

First we need to separate content from context. It’s easiest to think of content as a letter, and context as the envelope and environment around it. Context includes things like source, destination, size, recipients, sender, header information, metadata, time, format, and anything else aside from the content of the letter itself. Context is highly useful and any DLP solution should include contextual analysis as part of the overall solution.

But context alone isn’t sufficient. One early data protection solution could track files based on which server they came from, where they were going, and what actions users attempted on the file. While it could stop a file from a server designated “sensitive” from being emailed out from a machine with the data protection software installed, it would miss untracked versions of the file, movement from systems without the software installed, and a whole host of other routes that weren’t even necessarily malicious. This product lacked content awareness and its utility for protecting data was limited (it has since added content awareness, one reason I won’t name the product).

The advantage of content awareness is that while we use context, we’re not restricted by it. If I want to protect a piece of sensitive data I want to protect it everywhere- not only when it’s in a flagged envelope. I care about protecting the data, not the envelope, so it makes a lot more sense to open the letter, read it, and then decide how to treat it.

Of course that’s a lot harder and more time-consuming: opening an envelope and reading a letter is a lot slower than just reading the label- assuming you can even understand the handwriting and language. That’s why content awareness is the single most important piece of technology in a DLP solution.

The first step in content analysis is capturing the envelope and opening it. I’ll skip the capturing part for now- we’ll talk about that later- and assume we can get the envelope to the content analysis engine. The engine then needs to parse the context (we’ll need that for the analysis) and dig into the content. For a plain text email this is easy, but when you want to look inside binary files it gets a little more complicated. All DLP solutions solve this using file cracking. File cracking is the technology used to read and understand the file, even if the content is buried multiple levels down. For example, it’s not unusual for the file cracker to read an Excel spreadsheet embedded in a Word file that’s zipped. The product needs to unzip the file, read the Word doc, analyze it, find the Excel data, read that, and analyze it. Other situations get far more complex, like a .pdf embedded in a CAD file.

Many of the products on the market today support around 300 file types, embedded content, multiple languages, and double-byte character sets (for Asian languages), and can pull plain text from unidentified file types. Quite a few use the Autonomy or Verity content engines to help with file cracking, but all the serious tools have their own set of proprietary capabilities around that. Some tools can support analysis of encrypted data if they have the recovery keys for enterprise encryption, and most can identify standard encryption and use that as a contextual rule to block/quarantine content.
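To make the unzip, read, and recurse idea concrete, here’s a minimal sketch in Python. It only understands ZIP containers and UTF-8 text, which is nowhere near what a commercial file cracker handles; the function name `crack` and the `max_depth` limit are my own assumptions for illustration, not anything taken from a real product.

```python
import io
import zipfile


def crack(data: bytes, name: str = "<root>", depth: int = 0, max_depth: int = 5):
    """Yield (path, text) for every readable leaf found inside `data`.

    Hypothetical helper: handles only nested ZIP archives and UTF-8 text.
    """
    if depth > max_depth:
        return  # stop on absurdly deep nesting (e.g. zip bombs)

    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        # A container: open it and crack every member in turn.
        with zipfile.ZipFile(buffer) as archive:
            for member in archive.namelist():
                if member.endswith("/"):
                    continue  # directory entry, nothing to read
                yield from crack(
                    archive.read(member), f"{name}/{member}", depth + 1, max_depth
                )
    else:
        try:
            # A leaf: hand plain text to the analysis engine.
            yield name, data.decode("utf-8")
        except UnicodeDecodeError:
            # A binary format; a real cracker would need a format-specific parser.
            pass
```

Every `(path, text)` pair that comes out would then be passed to whichever analysis techniques the policy calls for, with the path kept around as context.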

Rather than just talking about how hard this is and seeing how far I drag out an analogy, let’s jump in and look at the different content analysis techniques used today:

1. Rules-Based/Regular Expressions: This is the most common analysis technique, available both in DLP products and in other tools with DLP-like features. It analyzes the content for specific rules- such as 16-digit numbers that meet credit card checksum requirements, medical billing codes, and other textual patterns. Most DLP solutions enhance basic regular expressions with their own additional analysis rules (e.g. a name in proximity to an address near a credit card number).

What content it’s best for: As a first-pass filter, or for detecting easily identified pieces of structured data like credit card numbers, social security numbers, and healthcare codes/records.

Strengths: Rules process quickly and can easily be configured. Most products ship with initial rule sets. The technology is well understood and easy to incorporate into a variety of products.

Weaknesses: Prone to high false positive rates. Offers very little protection for unstructured content like sensitive intellectual property.
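Here’s roughly what the credit card example looks like in Python: a regular expression finds 16-digit candidates, and the Luhn checksum (the standard credit card check digit algorithm) weeds out random digit strings. The pattern and the function names are my own simplified assumptions; real rule sets handle more card lengths and formats.

```python
import re

# Candidate pattern (an assumption): four groups of four digits,
# optionally separated by a single space or hyphen.
CARD_PATTERN = re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return the candidates in `text` that also pass the checksum."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


print(find_card_numbers("Card: 4111 1111 1111 1111"))   # ['4111111111111111']
print(find_card_numbers("Order 1234 5678 9012 3456"))   # []
```

The checksum cuts down the false positives, but it can’t tell a customer’s card from any other valid number, which is why this works best as a first pass.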

2. Database Fingerprinting: Sometimes called Exact Data Matching. This technique takes either a database dump or live data (via ODBC connection) from a database and only looks for exact matches. For example, you could generate a policy to look only for credit card numbers in your customer base, thus ignoring your own employees buying online. More advanced tools look for combinations of information, such as the magic combination of first name or initial, with last name, with credit card or social security number, that triggers a California SB 1386 disclosure. Make sure you understand the performance and security implications of nightly extractions vs. live database connections.

What content it’s best for: Structured data from databases.

Strengths: Very low false positives (close to 0). Allows you to protect customer/sensitive data while ignoring other, similar, data used by employees (like their personal credit cards for online orders).

Weaknesses: Nightly dumps won’t contain transactions made since the last extraction. Live connections can affect database performance. Large databases will affect product performance.
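A minimal sketch of the idea in Python, assuming a database dump of (last name, card number) rows. Storing hashes of the values instead of the raw data is my assumption about a sensible design, not a description of any particular product, and all the names here are hypothetical.

```python
import hashlib
import re


def fingerprint(value: str) -> str:
    """Normalize a field (strip punctuation, lowercase) and hash it."""
    normalized = re.sub(r"\W", "", value).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_index(rows):
    """rows: iterable of (last_name, card_number) tuples from the dump."""
    return {(fingerprint(last), fingerprint(card)) for last, card in rows}


def matches(text: str, index) -> bool:
    """True only if a registered name AND its own card number both appear."""
    # Tokens are words, or digit runs that may contain spaces/hyphens.
    raw_tokens = re.findall(r"[A-Za-z]+|\d[\d -]{11,}\d", text)
    tokens = {fingerprint(t) for t in raw_tokens}
    return any(name in tokens and card in tokens for name, card in index)


index = build_index([("Smith", "4111111111111111")])
print(matches("Order for Smith, card 4111-1111-1111-1111", index))  # True
print(matches("Smith paid with 5500 0000 0000 0004", index))        # False
```

The second message contains a card number, but not one from the customer table, so it’s ignored. That’s the whole appeal of the technique.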

3. Exact File Matching: With this technique you take a hash of a file and monitor for any files that match that exact fingerprint. Some consider this to be a contextual analysis technique since the file contents themselves are not analyzed.

What content it’s best for: Media files and other binaries where textual analysis isn’t necessarily possible.

Strengths: Works on any file type, low (effectively no) false positives.

Weaknesses: Trivial to evade. Worthless for content that’s edited, such as standard office documents and edited media files.
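In Python this one is about as simple as it gets. The sketch below uses SHA-256 (my choice; any cryptographic hash would do) and made-up function names, and it also shows why the technique is so easy to evade.

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    """Hash the raw bytes; the content itself is never parsed."""
    return hashlib.sha256(data).hexdigest()


def is_registered(data: bytes, registry: set[str]) -> bool:
    """True if this exact byte sequence was registered as protected."""
    return file_fingerprint(data) in registry


original = b"\x89PNG...pretend this is a product photo..."
registry = {file_fingerprint(original)}

print(is_registered(original, registry))            # True
print(is_registered(original + b"\x00", registry))  # False
```

Appending a single byte produces a completely different hash, so the modified file sails through.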

4. Partial Document Matching: This technique looks for a complete or partial match on protected content. Thus you could build a policy to protect a sensitive document, and the DLP solution will look for both the complete text of the document, as well as excerpts as small as a few sentences. For example, you could load up a business plan for a new product and the DLP solution would alert if an employee pasted a single paragraph into an instant message. Most solutions are based on a technique known as cyclical hashing, where you take a hash of a portion of the content, offset a predetermined number of characters, then take another hash, and keep going until the document is completely loaded as a series of overlapping hash values. Outbound content is run through the same hash technique, and the hash values compared for matches. I’ve simplified this a lot, and the top vendors add a fair bit of other analysis on top of the cyclical hashing, such as removing whitespace, looking at word proximities, and other linguistic analysis that’s over my pay grade.

What content it’s best for: Protecting sensitive documents, or similar content with text such as CAD files (with text labels) and source code. Unstructured content that’s known to be sensitive.

Strengths: Ability to protect unstructured data. Generally low false positives (some vendors will say zero false positives, but any common sentence/text in a protected document can trigger alerts). Doesn’t rely on complete matching of large documents, can find policy violations on even a partial match.

Weaknesses: Performance limitations on the total volume of content that can be protected. Common phrases/verbiage in a protected document may trigger false positives. Must know exactly which documents you want to protect. Trivial to evade (cannot even handle ROT13 ‘encryption’).
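Here’s a stripped-down Python sketch of the overlapping-hash idea. The window and step sizes are arbitrary values I picked for illustration. One simplification of my own: the protected document is hashed every `STEP` characters, while outbound content is hashed at every offset, so an excerpt still lines up with a registered window no matter where it was cut from. Vendor implementations certainly differ.

```python
import hashlib
import re

WINDOW = 40  # characters per hashed chunk (hypothetical value)
STEP = 5     # offset between chunks when registering (hypothetical value)


def normalize(text: str) -> str:
    """Drop all whitespace and lowercase, so reformatting doesn't hide a match."""
    return re.sub(r"\s+", "", text).lower()


def fingerprints(text: str, step: int) -> set[str]:
    """Hash overlapping WINDOW-sized chunks taken every `step` characters."""
    clean = normalize(text)
    return {
        hashlib.sha256(clean[i:i + WINDOW].encode("utf-8")).hexdigest()
        for i in range(0, len(clean) - WINDOW + 1, step)
    }


def register(document: str) -> set[str]:
    """Load a protected document as a set of overlapping hash values."""
    return fingerprints(document, step=STEP)


def scan(outbound: str, registry: set[str]) -> int:
    """Count how many outbound chunks match a registered chunk."""
    return len(fingerprints(outbound, step=1) & registry)
```

With these numbers, any excerpt longer than about 45 characters (after whitespace is stripped) produces at least one hit. Run the same excerpt through ROT13 and `scan` returns 0, which is the evasion weakness in action.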

5. Statistical Analysis: Use of machine learning, Bayesian analysis, and other statistical techniques to analyze a corpus of content and find policy violations on content that resembles the protected content. I’m lumping a bunch of methods into this broad category, so you stat heads and CTOs please don’t get too upset. These are very similar to techniques used to block spam.

What content it’s best for: Unstructured content where a deterministic technique, like partial document matching, will be ineffective. For example, a repository of engineering plans that’s impractical to load for partial document matching due to high volatility of the information, or massive volume.

Strengths: Can work with more nebulous content where you may not be able to isolate exact documents for matching. Can enforce policies such as, “alert on anything outbound that resembles the documents in this directory”.

Weaknesses: Prone to false positives and false negatives. Requires a large corpus of source content, the bigger the better.
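Since these techniques are close cousins of spam filtering, a toy naive Bayes classifier in Python gives the flavor. This is one statistical method among many and not a claim about what any vendor ships; the class, the labels, and the tokenizer are all my own assumptions. It needs at least one training document in each class.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Two-class word-frequency model: 'sensitive' vs. 'benign' (hypothetical labels)."""

    def __init__(self):
        self.word_counts = {"sensitive": Counter(), "benign": Counter()}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def score(self, text: str) -> float:
        """Log-odds that `text` resembles the sensitive corpus (> 0 means yes)."""
        vocab = set(self.word_counts["sensitive"]) | set(self.word_counts["benign"])
        log_odds = math.log(self.doc_counts["sensitive"] / self.doc_counts["benign"])
        for label, sign in (("sensitive", 1), ("benign", -1)):
            total = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Add-one smoothing so unseen words don't zero out the result.
                p = (self.word_counts[label][word] + 1) / (total + len(vocab))
                log_odds += sign * math.log(p)
        return log_odds
```

You’d train the “sensitive” class on the documents in the protected directory and the “benign” class on ordinary traffic, then alert when the score crosses a threshold. Where you set that threshold is exactly the false positive vs. false negative tradeoff.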

6. Conceptual/Lexicon: This technique uses a combination of dictionaries, rules, and other analysis to protect nebulous content that resembles an “idea”. Okay, it’s easier to give an example- a policy that alerts on traffic that resembles insider trading, which uses key phrases, word counts, and positions to find violations. Other examples are sexual harassment, running a private business from a work account, and job hunting.

What content it’s best for: Completely unstructured ideas that defy simple categorization based on matching known documents, databases, or other registered sources.

Strengths: Not all corporate policies or content can be described using specific examples; conceptual analysis can find loosely defined policy violations other techniques don’t even try to monitor for.

Weaknesses: In most cases these are not user-definable; the rule sets must be built by the DLP vendor, which takes significant effort (= more $$$). Because of the loose nature of the rules, this technique is very prone to false positives and false negatives.
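A bare-bones Python sketch of the dictionary part: weighted key phrases are counted and the total is compared to a threshold. The phrases, weights, and threshold are invented for illustration, and I’m ignoring word positions entirely, so treat it as a cartoon of the real thing.

```python
import re

# Hypothetical lexicon for an "insider trading" concept: phrase -> weight.
LEXICON = {
    "insider": 3,
    "before the announcement": 4,
    "buy shares": 3,
    "keep this quiet": 4,
    "earnings": 1,
}
THRESHOLD = 7  # hypothetical alert threshold


def concept_score(text: str, lexicon: dict[str, int] = LEXICON) -> int:
    """Sum weight * occurrences for every lexicon phrase found in the text."""
    # Reduce the text to lowercase words separated by single spaces.
    normalized = " ".join(re.findall(r"[a-z']+", text.lower()))
    score = 0
    for phrase, weight in lexicon.items():
        hits = re.findall(r"\b" + re.escape(phrase) + r"\b", normalized)
        score += weight * len(hits)
    return score


def is_violation(text: str) -> bool:
    return concept_score(text) >= THRESHOLD


print(is_violation("Buy shares before the announcement, and keep this quiet."))  # True
print(is_violation("The quarterly earnings call is on Tuesday."))                # False
```

It’s easy to see where the false positives and negatives come from: the legal team discussing insider trading policy would trip this, and anyone using slightly different wording would not.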

7. Categories: Pre-built categories with rules and dictionaries for common types of sensitive data, such as credit card numbers/PCI protection, HIPAA, etc.

What content it’s best for: Anything that neatly fits a provided category. Typically easy-to-describe content related to privacy, regulations, or industry-specific guidelines.

Strengths: Dirt-simple to configure. Saves significant policy generation time. Category policies can form the basis of more advanced, enterprise-specific policies. For many organizations, categories can meet a large percentage of their data protection needs.

Weaknesses: One size fits all might not work. Only good for easily categorized rules/content.

These 7 techniques (well, really 6) form the basis of most of the DLP products on the market. Not all products include all techniques, and there can be significant differences between implementations. Most products can also chain techniques- building complex policies with combinations of content analysis techniques and contextual analysis.
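To show what chaining might look like, here’s a small Python sketch that combines one contextual check with two content checks. Everything in it is hypothetical: the `Message` fields, the `example.com` domain, the project name, and the check functions are stand-ins I made up, not any product’s policy language.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    sender: str
    destination: str
    body: str


Check = Callable[[Message], bool]


def all_of(*checks: Check) -> Check:
    """Policy that fires only if every check fires."""
    return lambda msg: all(check(msg) for check in checks)


def any_of(*checks: Check) -> Check:
    """Policy that fires if at least one check fires."""
    return lambda msg: any(check(msg) for check in checks)


def leaves_company(msg: Message) -> bool:
    """Contextual check: the destination is outside our (made-up) domain."""
    return not msg.destination.endswith("@example.com")


def mentions_ssn(msg: Message) -> bool:
    """Content check: a rules-based pattern for social security numbers."""
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", msg.body) is not None


def mentions_project(msg: Message) -> bool:
    """Content check: a one-entry dictionary for a made-up project name."""
    return "project falcon" in msg.body.lower()


# Alert on SSNs or project details, but only when they leave the company.
policy = all_of(leaves_company, any_of(mentions_ssn, mentions_project))
```

In practice each content check would be one of the techniques above, and the contextual checks would come from the envelope.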

When we get to the product selection part of this series we’ll talk about how to compare the effectiveness of the different products. The short answer: I don’t think even the best engineer in the world can predict exactly which product will work best on your live content, and the only way to know for sure is to test.

Hopefully I’ve given you a better idea of how these work, and the different detection options out there. If there’s something I missed, or you have any questions, drop me a line in the comments.