Data Labels Suck

By Rich

I had a weird discussion with someone who was firmly convinced that you couldn’t possibly have data security without starting with classification and labels. Maybe they read it in a book or something.

The thing is, the longer I research and talk to people about data security, the more I think labels and classification are little more than a way to waste time or spend a lot of money on consulting. Here’s why:

  1. By the time you manually classify something, it’s something (or someplace) else.
  2. Labels aren’t necessarily accurate.
  3. Labels don’t change as the data changes.
  4. Labels don’t reflect changing value in different business contexts.
  5. Labels rarely transfer with data as it moves into different formats.

Labels are fine in completely static environments, but how often do you have one of those? The only time I find them remotely useful is in certain databases, as part of the schema.

Any data of value moves, transforms, and changes so often that there’s no possible way any static label can be effective as a security control. It stuns me that people still think they can run around and add something to document metadata to properly protect it. That’s why I’m a big fan of DLP, as flawed as it may be. It makes way more sense to me to look inside the box and figure out what something is, instead of assuming the label on the outside is correct. Even the DoD crowd struggles mightily with accurate labels, and it’s deeply embedded into their culture.
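To make the "look inside the box" idea concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration: the `Document` type, its `label` field, and the single SSN-style pattern are assumptions for the example, not how any particular DLP product works.

```python
import re
from dataclasses import dataclass

# Hypothetical detector for US SSN-style strings; real tools use far richer analysis.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Document:
    label: str  # whatever someone stamped into the metadata
    body: str   # the actual content


def looks_sensitive(doc: Document) -> bool:
    """Decide from the content itself, ignoring the label entirely."""
    return bool(SSN_LIKE.search(doc.body))


def label_disagrees(doc: Document) -> bool:
    """True when the label claims 'public' but the content suggests otherwise."""
    return doc.label == "public" and looks_sensitive(doc)


# A document whose label went stale after someone pasted in a record.
stale = Document(label="public", body="Employee record: 123-45-6789")
print(label_disagrees(stale))
```

The label here is treated as a hint to compare against, and the decision rests on what the content inspection finds.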

Never trust a label. It’s a rough guide, not a security control.


Seems to me that labels can be useful in indicating different levels of scrutiny, enforcement, quarantine, etc. You use the DLP capabilities to actually look at the data and the small number of labels to figure out what to do with it. Maybe that’s what @ChrisWalsh said, but I’m a bit slow on the uptake.

By Mike Rothman

@Kees: Love the dissertation simile - great combination of cynicism and accuracy.

@Nick: Right on. A simple scheme with 2 or 3 buckets and a small number of rules to decide what goes in which bucket. Back it by checking on actual behavior and correcting/coaching/disciplining people (and adjusting your scheme) as needed. Neither rocket science nor panacea, but good operational practice.

By Chris Walsh

Your comment about manual classification is spot on. I don’t think that classification is the same as labeling, and the main point that Aaron Turner and I have been making for the past 18 months or so is that you must have at the very least, at the time of creation, a method by which you can declare something should be “protected” or “not”. Atop that, you can make more granular your classification scheme if you like (we recommend three to four buckets, no more).

Ex post facto classification of legacy data is a huge problem, and we believe that automation gets you just so far. There are several thoughts as to getting the rest of the way, and we’ve talked about those in several formats - but generally we agree that classification itself is tough and labeling can be harder. But neither is, on its face, dumber than DLP (and both are significantly smarter than, say, IPS).

Classification and labels will help but not solve the problem, and they’re not the solution any more than DLP products are. Reliance on either labels or classification or DLP products to fix the problem is not the right approach.

In talking with big-time data shops we have learned that the best approach is to train users to handle data correctly, and in order to accomplish that you need to train users how to recognize sensitive information (another large organization leader said that sensitive info is like obscene materials - as Justice Stewart stated - really difficult to define, but he knew it when he saw it). Awareness and motivation for users will be the only way to start scaling the approach.  Also, the first step has to be elimination of as many records as possible before embarking on any sort of classification/labeling.

As for the moving and morphing data, of course - but if you haven’t labeled the stuff at the get-go, you’ll never know whether the data was intended to be protected or treated as public. The other comments about data discovery and real world applicability? Hear, hear.

By nickselby

I’ve never understood data labeling. How does anybody get the electrons to sit still long enough to put a sticker on them? And then bits go and flip and darn it all if you don’t have to go and re-label them all over again. So annoying. :)

“It can also include a series of very manual…” blah blah - screw that. Seriously. :) If people are still basing their practices off the Orange Book, circa 1983, then that’s a whole other kind of serious problem!

People need to work harder at being lazy. Simple classification schemes save many headaches… oh, and btw, don’t forget to differentiate classification from authorization… a big mistake I’ve seen made a lot. :)


By Ben

Data classification is useful particularly when combined with data discovery tools as a means to support business process owners in thinking about the value of the information on which they depend to get their job done.

A data classification process is a little bit like a PhD dissertation: it is much more important that it is written than that it is read.

By Kees Leune


I *am* taking issue with classification- not the high level part, but the process of actually going out and classifying data. Perhaps you’ve never worked someplace that did that (which would be good), but I hear it all the time. People literally thinking that to protect data they need to run around and classify, and label, each file.

I’m not the one creating this split, *that’s how it is in the real world*. You haven’t been exposed to it, and that’s great, but I’ve seen far too many failed classification and labeling efforts. Believe me, I wish this was some theoretical analyst crap I made up.

As for information vs. data- information is data with value. 16 digits is data. 16 digits with a name, CVV code, and expiration date is information. Value is often determined by the context.

I think you still aren’t grokking what I’m talking about- classification *is not* merely the high level process of putting together levels and assigning types of data to those levels. It can also include a series of very manual, and usually ineffective, processes beyond that. I’m not adding complexity, I’m adding clarity, and you can’t simplify this to just what you did in your past.

By Rich

Rich, dude, I come back to my point nearly 18 months ago - you’re not taking issue with classification, just with labeling (this time at least your subject line reflects that). However, you said in your post “...I think labels and classification are little more than a way to waste time…” This directly contradicts what you’ve now just said, that “The first step of information classification- determining the relative value of information asset types, is important and [you] recommend it all the time.”

I think your most important comment, though, is the last line of your last response. “[S]top confusing high level classification and prioritization with low level data classification and labeling and we can move on.” Why in the world are you trying to create 2 levels of classification? Forget about all the labeling stuff, I think we’re in full agreement there. Is the fundamental disconnect here that you are differentiating between “information” and “data”? Why would you do that? How are you defining them that they are different?

Your credit card DLP example is an interesting starting point… please explain if you think the credit card string of numbers is information or data (or is it both?). From there maybe I can figure out why you think there’s a difference between the two. I still think you’re overcomplicating things unnecessarily.

As for DLP, I’m not touching that… I will, however, point out, again, that the basis of your DLP training ties directly into your classification scheme. :)


By Ben


You still aren’t understanding, so let me simplify this as much as I can.

Information classification is the process of determining relative value of assets by asset type. Data classification is the process of analyzing *a single piece of data*. Labeling is the mechanism to tag that piece of data with the classification level. You still think I’m lumping it all together? If so, read that again.

The first step of information classification- determining the relative value of information asset types, is important and I recommend it all the time. That’s not data classification, the two are different.

Even asset classification is sometimes useful/possible- e.g. classifying an entire database or application, and the involved servers, as important.

What’s totally unrealistic is manually classifying, then tagging/labeling, large amounts of individual data (e.g. documents or rows in a DB). I’ve never seen it work since it’s far too prone to error, and at best gives a false sense of security.

Now, onto automation. Go play with a DLP solution- I can say, “find all credit card numbers in all documents” and it will do it. I can say, “don’t let someone transfer a file with credit card numbers onto a USB device” and it will do it. No, the data isn’t self describing and defending, but rather than relying on static tags we rely on *what’s in the file itself*.
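As a rough illustration of the "find all credit card numbers" rule, here is a minimal Python sketch of content-based detection. It is an assumption-laden toy, not how any specific DLP product works: the function names (`find_card_numbers`, `allow_usb_copy`) and the regular expression are hypothetical, and the only real technique used is the standard Luhn checksum, which weeds out most random digit strings.

```python
import re

# Candidate runs of 13 to 19 digits, optionally separated by spaces or hyphens.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings in the text that look like card numbers."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits


def allow_usb_copy(file_text: str) -> bool:
    """Hypothetical policy: block the copy if the content holds a card number."""
    return not find_card_numbers(file_text)
```

For example, `allow_usb_copy("card 4111 1111 1111 1111 on file")` would refuse the transfer regardless of what label, if any, the file carries, because the decision is made from the content.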

So stop confusing high level classification and prioritization with low level data classification and labeling and we can move on.

By Rich

Actually, Rich, I’ve made no assumptions here. My comments apply equally to data or systems. Where you seem to be confused is that you think classification == labeling. You do *not* need to do explicit labeling (of systems OR data). You need a classification scheme that is simple and straightforward that is immediately translatable into level of protection. It’s a manual process today because the data doesn’t (yet) speak for itself - we have to apply analysis at some level.

Your whole take on this topic strikes me as being completely backwards. Classification is the cornerstone of information risk management. If you cannot determine the relative importance of an asset, then you cannot determine the appropriate level of controls. Data is just as much an asset as anything else (probably more so, really).

You’ve talked at length in the past about “self-describing data” - show me how you do that in real, practical, applied terms. Don’t disparage current practices /that work/ unless you can provide a concrete solution to the contrary. When you do, I’ll then point out that the self-description is still based on a classification scheme that you’ve defined somewhere in policy.


By Ben


You don’t seem to understand the differences between system and data classification, and asset and data labeling.

Data classification and labeling refer to the process of manually classifying, then labeling, a piece of data- typically on a per document basis for unstructured data, or a per-row basis in databases.

This is very different from coarse classification of systems, subnets, assets, and such, which is an effective tool in helping orient your overall security strategy.

I’m not overcomplicating the simple, you just assumed I was talking about something else.

By Rich
