Understanding and Selecting a DLP Solution: Part 2, Content Awareness
Welcome to part 2 of our series on helping you better understand Data Loss Prevention solutions. In Part 1 I gave an overview of DLP, and based on follow-up questions it’s clear that one of the most confusing aspects of DLP is content awareness. Content awareness is a high-level term I use to describe the ability of a product to look into, and understand, content. A product is considered content aware if it uses one or more content analysis techniques. Today we’ll look at these different analysis techniques, how effective they may or may not be, and what kinds of data they work best with.

First we need to separate content from context. It’s easiest to think of content as a letter, and context as the envelope and environment around it. Context includes things like source, destination, size, recipients, sender, header information, metadata, time, format, and anything else aside from the content of the letter itself. Context is highly useful, and any DLP solution should include contextual analysis as part of the overall solution. But context alone isn’t sufficient.

One early data protection solution could track files based on which server they came from, where they were going, and what actions users attempted on the file. While it could stop a file from a server designated “sensitive” from being emailed out from a machine with the data protection software installed, it would miss untracked versions of the file, movement from systems without the software installed, and a whole host of other routes that weren’t even necessarily malicious. This product lacked content awareness and its utility for protecting data was limited (it has since added content awareness, one reason I won’t name the product).

The advantage of content awareness is that while we use context, we’re not restricted by it. If I want to protect a piece of sensitive data I want to protect it everywhere- not only when it’s in a flagged envelope. I care about protecting the data, not the envelope, so it makes a lot more sense to open the letter, read it, and then decide how to treat it. Of course that’s a lot harder and more time-consuming. That’s why content awareness is the single most important piece of technology in a DLP solution. Opening an envelope and reading a letter is a lot slower than just reading the label- assuming you can even understand the handwriting and language.

The first step in content analysis is capturing the envelope and opening it. I’ll skip the capturing part for now- we’ll talk about that later- and assume we can get the envelope to the content analysis engine. The engine then needs to parse the context (we’ll need that for the analysis) and dig into the content. For a plain text email this is easy, but when you want to look inside binary files it gets a little more complicated. All DLP solutions solve this using file cracking. File cracking is the technology used to read and understand the file, even if the content is buried multiple levels down. For example, it’s not unusual for the file cracker to read an Excel spreadsheet embedded in a Word file that’s zipped. The product needs to unzip the file, read the Word doc, analyze it, find the Excel data, read that, and analyze it. Other situations get far more complex, like a .pdf embedded in a CAD file. Many of the products on the market today support around 300 file types, embedded content, multiple languages, double-byte character sets (for Asian languages), and can pull plain text from unidentified file types.
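To make file cracking a little more concrete, here is a minimal sketch of the recursive extraction idea, assuming only Python’s standard library. It only walks zip archives, reads plain text entries, and falls back to scraping printable strings from unidentified types; a real DLP engine handles hundreds of formats, but the recursion pattern is the same. The function names and the analyze() call are illustrative, not taken from any particular product.

```python
# A minimal sketch of recursive "file cracking", assuming only Python's
# standard library. Real engines handle Office, PDF, CAD, and so on; this
# toy version only walks zip archives, reads plain text, and falls back to
# scraping printable strings from unknown types.
import io
import re
import zipfile

def crack(name, data, depth=0, max_depth=5):
    """Yield (path, text) pairs for every piece of readable content found."""
    if depth > max_depth:          # guard against deeply nested archives
        return
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for entry in zf.namelist():
                # Recurse: an entry may itself be an archive or document
                yield from crack(f"{name}/{entry}", zf.read(entry), depth + 1)
    elif name.lower().endswith((".txt", ".csv", ".log")):
        yield name, data.decode("utf-8", errors="replace")
    else:
        # Unknown type: pull out runs of printable text, like the
        # "plain text from unidentified file types" fallback described above
        strings = re.findall(rb"[ -~]{8,}", data)
        if strings:
            yield name, b"\n".join(strings).decode("ascii", errors="replace")

# Usage: feed every recovered text chunk to the content analysis engine
# with open("message_attachment.zip", "rb") as f:
#     for path, text in crack("message_attachment.zip", f.read()):
#         analyze(text)   # hypothetical downstream analysis call
```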
Quite a few use the Autonomy or Verity content engines to help with file cracking, but all the serious tools have their own set of proprietary capabilities on top of that. Some tools can support analysis of encrypted data if they have the recovery keys for enterprise encryption, and most can identify standard encryption and use that as a contextual rule to block/quarantine content.

Rather than just talking about how hard this is and seeing how far I can drag out an analogy, let’s jump in and look at the different content analysis techniques used today:

1. Rules-Based/Regular Expressions: This is the most common analysis technique, available both in DLP products and in other tools with DLP-like features. It analyzes the content for specific rules- such as 16-digit numbers that meet credit card checksum requirements, medical billing codes, and other textual analysis. Most DLP solutions enhance basic regular expressions with their own additional analysis rules (e.g. a name in proximity to an address near a credit card number). A minimal sketch of this technique appears after this list.
What content it’s best for: As a first-pass filter, or for easily identified pieces of structured data like credit card numbers, social security numbers, and healthcare codes/records.
Strengths: Rules process quickly and can easily be configured. Most products ship with initial rule sets. The technology is well understood and easy to incorporate into a variety of products.
Weaknesses: Prone to high false positive rates. Offers very little protection for unstructured content like sensitive intellectual property.

2. Database Fingerprinting: Sometimes called Exact Data Matching. This technique takes either a database dump or live data (via ODBC connection) from a database and only looks for exact matches. For example, you could generate a policy to look only for credit card numbers in your customer base, thus ignoring your own employees buying online. More advanced tools look for combinations of information, such as the magic combination of first name or initial, with last name, with credit card or social security number, that triggers a California SB 1386 disclosure. Make sure you understand the performance and security implications of nightly extractions vs. live database connections. A sketch of this technique also appears after this list.
What content it’s best for: Structured data from databases.
Strengths: Very low false positives (close to 0). Allows you to protect customer/sensitive data while ignoring other, similar data used by employees (like their personal credit cards for online orders).
Weaknesses: Nightly dumps won’t contain transaction data entered since the last extraction. Live connections can affect database performance. Large databases will affect product performance.

3. Exact File Matching: With this technique you take a hash of a file and monitor for any files that match that exact fingerprint.
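Here is the minimal sketch of the rules-based technique from item 1, assuming the common approach of pairing a regular expression with the Luhn checksum so that arbitrary 16-digit strings don’t match. The pattern, function names, and test value are illustrative, not taken from any particular product.

```python
# A minimal sketch of rules-based analysis: a regular expression finds
# candidate card numbers, and the Luhn checksum weeds out random digit runs.
import re

# 13-16 digits, optionally separated by spaces or dashes
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return candidates that match the pattern AND the checksum."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits

print(find_card_numbers("Order placed with card 4111 1111 1111 1111 today"))
# ['4111111111111111']  -- the classic Visa test number passes Luhn
```

In practice this would be the fast first-pass filter; proximity rules (name plus address plus card number) and stricter checks would layer on top of it.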
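And here is a sketch of database fingerprinting/Exact Data Matching from item 2, assuming a nightly CSV extract of customer card numbers. Only hashes of the extracted values are kept in memory, and a candidate number only matches if it appears in the customer set- which is how employees’ personal cards get ignored. The file name, column name, and salt handling are all hypothetical.

```python
# A minimal sketch of Exact Data Matching against a nightly customer extract.
# Only hashes of the real values are kept in the match index, and a candidate
# found by the rules-based pass matches only if it exists in the customer set.
import csv
import hashlib

def fingerprint(value: str, salt: bytes = b"example-salt") -> str:
    """Hash a normalized value so raw customer data never sits in the index."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

def load_fingerprints(extract_path: str) -> set[str]:
    """Build the match set from a nightly CSV dump with a 'card_number' column."""
    index = set()
    with open(extract_path, newline="") as f:
        for row in csv.DictReader(f):
            index.add(fingerprint(row["card_number"]))
    return index

def is_customer_card(candidate_digits: str, index: set[str]) -> bool:
    """True only for exact matches against the extracted customer data."""
    return fingerprint(candidate_digits) in index

# Usage: chain with the rules-based pass above so employee cards are ignored
# index = load_fingerprints("customer_extract.csv")
# for number in find_card_numbers(message_text):
#     if is_customer_card(number, index):
#         quarantine(message)   # hypothetical policy action
```

A live ODBC connection would replace load_fingerprints() with queries against the source database, with the performance trade-offs noted above.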