
Data Labels Suck

I had a weird discussion with someone who was firmly convinced that you couldn’t possibly have data security without starting with classification and labels. Maybe they read it in a book or something.

The thing is, the longer I research and talk to people about data security, the more I think labels and classification are little more than a way to waste time or spend a lot of money on consulting. Here’s why:

  1. By the time you manually classify something, it’s something (or someplace) else.
  2. Labels aren’t necessarily accurate.
  3. Labels don’t change as the data changes.
  4. Labels don’t reflect changing value in different business contexts.
  5. Labels rarely transfer with data as it moves into different formats.

Labels are fine in completely static environments, but how often do you have one of those? The only time I find them remotely useful is in certain databases, as part of the schema.

Any data of value moves, transforms, and changes so often that there’s no possible way any static label can be effective as a security control. It stuns me that people still think they can run around and add something to document metadata to properly protect it. That’s why I’m a big fan of DLP, as flawed as it may be. It makes way more sense to me to look inside the box and figure out what something is, instead of assuming the label on the outside is correct. Even the DoD crowd struggles mightily with accurate labels, and it’s deeply embedded into their culture.

Never trust a label. It’s a rough guide, not a security control.

—Rich


Comments:


By Bill Nye  on  07/08  at  04:20 PM

Data labels and classification aren’t security controls, they are a dependency of security controls. The sensitivity of the data still matters, wherever that data may be. How else do you assess risk to that data? If you have a way of indicating trust levels within those classifications on specific instances of data, then through commonly known methods you can track where that data moves from initial pristine copies. If you’re an organization that can’t accurately classify your data as sensitive and recognize the ramifications of mishandling that data, you shouldn’t have that data in the first place.

I recently wrote quite a lot of code for a specialized tool that does just this. Data labels and classification run far deeper than you’re considering here.

By Rich  on  07/08  at  05:47 PM

Bill,

I propose using real-time analysis over data labels… tools that can understand the current context as well as the content. Then you have a better chance of knowing what’s in a file, rather than relying on a static label.

Most organizations I’ve worked with aren’t close to being able to classify all their data. For them, it makes sense to develop a high-level classification scheme, but let tools do most of the classification.

There will still be a role for more manual controls, and labels may have limited uses (as I mentioned in the post), but I just don’t see how static labels can work in the long term.

The exception is some gov environments. But I’m interested in how you implemented this, assuming you can share at all, and if it’s the kind of thing that can generalize to other organizations?

By Ben  on  07/09  at  07:00 AM

Ugh. I can’t believe you’re bringing up this old ax again. Didn’t we settle this something like 18 months ago? To go back to those old days, your problem was always with the labels themselves and not classification in general. Your idea about dynamic classification, etc, is all good and fine, but show me something that exists in the real world.

More importantly, you seem to grossly overcomplicate the simple here. As Bill says above, you absolutely must have a basic classification scheme to separate “really important” from “everyday important.” I completely disagree that this cannot be done statically. It’s very straightforward to put a stake in the ground and define what is “really important” in general terms and then let everything else automatically fall through to “everyday important.” The rule should be written such that it errs on the side of caution, catching more than may be strictly needed, which is ok.

The fact of the matter is that certain controls are simply too expensive to apply to *all* data sets. Thus, you *must* have a way to differentiate between data that needs the more expensive controls and data that can be protected at “normal” levels. You then define a baseline for protection as well as a “super-baseline” of protection and you’re off and running. Again, though, it’s ok if you have to apply the “super-baseline” to more than is necessary, just so long as it’s not everything.
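For what it’s worth, Ben’s “stake in the ground” rule is simple enough to sketch in a few lines. The marker tags below are hypothetical placeholders, not anyone’s actual scheme - the point is just the fall-through default that errs on the side of caution:

```python
# Hypothetical sketch of a two-tier classification rule: enumerate what counts
# as "really important" and let everything else fall through to the default.
REALLY_IMPORTANT_MARKERS = {"cardholder_data", "phi", "trade_secret"}  # assumed tags

def classify(attributes: set[str]) -> str:
    """Return the protection tier for a data set based on its attribute tags."""
    # Err on the side of caution: a single sensitive marker escalates the whole set.
    if attributes & REALLY_IMPORTANT_MARKERS:
        return "really important"
    return "everyday important"
```

Everything unmatched lands in the cheaper “everyday important” baseline, which is the fall-through behavior Ben describes.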

Bottom line: I’ve done this, in real life, it wasn’t confusing, and it wasn’t difficult. Once the classification rule was in place, everything else followed. Note that we did *not* focus on labeling in the traditional sense because there was nothing to physically label. You seem to consistently link these two things (classification and labeling) inextricably together, when one does not follow from the other.

-ben

By Rich  on  07/09  at  08:38 AM

Ben,

You don’t seem to understand the differences between system and data classification, and asset and data labeling.

Data classification and labeling refer to the process of manually classifying, then labeling, a piece of data - typically on a per-document basis for unstructured data, or a per-row basis in databases.

This is very different than coarse classification of systems, subnets, assets, and such which is an effective tool in helping orient your overall security strategy.

I’m not overcomplicating the simple, you just assumed I was talking about something else.

By Ben  on  07/09  at  09:18 AM

Actually, Rich, I’ve made no assumptions here. My comments apply equally to data or systems. Where you seem to be confused is that you think classification == labeling. You do *not* need to do explicit labeling (of systems OR data). You need a classification scheme that is simple and straightforward that is immediately translatable into level of protection. It’s a manual process today because the data doesn’t (yet) speak for itself - we have to apply analysis at some level.

Your whole take on this topic strikes me as being completely backwards. Classification is the cornerstone of information risk management. If you cannot determine the relative importance of an asset, then you cannot determine the appropriate level of controls. Data is just as much an asset as anything else (probably more so, really).

You’ve talked at length in the past about “self-describing data” - show me how you do that in real, practical, applied terms. Don’t disparage current practices /that work/ unless you can provide a concrete solution to the contrary. When you do, I’ll then point out that the self-description is still based on a classification scheme that you’ve defined somewhere in policy.

-ben

By Rich  on  07/09  at  09:58 AM

Ben,

You still aren’t understanding, so let me simplify this as much as I can.

Information classification is the process of determining relative value of assets by asset type. Data classification is the process of analyzing *a single piece of data*. Labeling is the mechanism to tag that piece of data with the classification level. You still think I’m lumping it all together? If so, read that again.

The first step of information classification - determining the relative value of information asset types - is important and I recommend it all the time. That’s not data classification; the two are different.

Even asset classification is sometimes useful/possible- e.g. classifying an entire database or application, and the involved servers, as important.

What’s totally unrealistic is manually classifying, then tagging/labeling, large amounts of individual data (e.g. documents or rows in a DB). I’ve never seen it work, since it’s far too prone to error and at best gives a false sense of security.

Now, onto automation. Go play with a DLP solution- I can say, “find all credit card numbers in all documents” and it will do it. I can say, “don’t let someone transfer a file with credit card numbers onto a USB device” and it will do it. No, the data isn’t self describing and defending, but rather than relying on static tags we rely on *what’s in the file itself*.
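To make the “look inside the box” point concrete, here’s a toy illustration of the kind of content inspection a DLP engine performs for the credit card case - a regex for card-like strings plus a Luhn checksum. This is a simplified sketch, not any particular product’s implementation:

```python
import re

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: filters out random 16-digit strings that aren't card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Candidate PANs: 16 digits, optionally separated by spaces or dashes
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def find_card_numbers(text: str) -> list[str]:
    """Return Luhn-valid card-like numbers found in a blob of text."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The same check works no matter where the file moves or what its metadata says, which is exactly the advantage over a static label.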

So stop confusing high level classification and prioritization with low level data classification and labeling and we can move on.

By Ben  on  07/09  at  10:53 AM

Rich, dude, I come back to my point nearly 18 months ago - you’re not taking issue with classification, just with labeling (this time at least your subject line reflects that). However, you said in your post “...I think labels and classification are little more than a way to waste time…” This directly contradicts what you’ve now just said, that “The first step of information classification- determining the relative value of information asset types, is important and [you] recommend it all the time.”

I think your most important comment, though, is the last line of your last response. “[S]top confusing high level classification and prioritization with low level data classification and labeling and we can move on.” Why in the world are you trying to create 2 levels of classification? Forget about all the labeling stuff, I think we’re in full agreement there. Is the fundamental disconnect here that you are differentiating between “information” and “data”? Why would you do that? How are you defining them that they are different?

Your credit card DLP example is an interesting starting point… please explain if you think the credit card string of numbers is information or data (or is it both?). From there maybe I can figure out why you think there’s a difference between the two. I still think you’re overcomplicating things unnecessarily.

As for DLP, I’m not touching that… I will, however, point out, again, that the basis of your DLP training ties directly into your classification scheme. :)

-ben

By Rich  on  07/09  at  03:52 PM

Ben,

I *am* taking issue with classification- not the high level part, but the process of actually going out and classifying data. Perhaps you’ve never worked someplace that did that (which would be good), but I hear it all the time. People literally think that to protect data they need to run around and classify, and label, each file.

I’m not the one creating this split, *that’s how it is in the real world*. You haven’t been exposed to it, and that’s great, but I’ve seen far too many failed classification and labeling efforts. Believe me, I wish this was some theoretical analyst crap I made up.

As for information vs. data- information is data with value. 16 digits is data. 16 digits with a name, CVV code, and expiration date is information. Value is often determined by the context.

I think you still aren’t grokking what I’m talking about- classification *is not* merely the high level process of putting together levels and assigning types of data to those levels. It can also include a series of very manual, and usually ineffective, processes beyond that. I’m not adding complexity, I’m adding clarity, and you can’t simplify this to just what you did in your past.

By Kees Leune  on  07/09  at  04:08 PM

Data classification is useful particularly when combined with data discovery tools as a means to support business process owners in thinking about the value of the information on which they depend to get their job done.

A data classification process is a little bit like a PhD dissertation: it is much more important that it is written than that it is read.

By Ben  on  07/09  at  04:20 PM

I’ve never understood data labeling. How does anybody get the electrons to sit still long enough to put a sticker on them? And then bits go and flip and darn it all if you don’t have to go and re-label them all over again. So annoying. :)

“It can also include a series of very manual…” blah blah - screw that. Seriously. :) If people are still basing their practices off the Orange Book, circa 1983, then that’s a whole other kind of serious problem!

People need to work harder at being lazy. Simple classification schemes save many headaches… oh, and btw, don’t forget to differentiate classification from authorization… a big mistake I’ve seen made a lot. :)

-ben

By nickselby  on  07/14  at  02:53 PM

Your comment about manual classification is spot on. I don’t think that classification is the same as labeling, and the main point that Aaron Turner and I have been making for the past 18 months or so is that you must have, at the very least, a method at the time of creation by which you can declare something should be “protected” or “not”. On top of that, you can make your classification scheme more granular if you like (we recommend three to four buckets, no more).

Ex post facto classification of legacy data is a huge problem, and we believe that automation gets you just so far. There are several thoughts as to getting the rest of the way, and we’ve talked about those in several formats - but generally we agree that classification itself is tough and labeling can be harder. But neither is, on its face, dumber than DLP (and both are significantly smarter than, say, IPS).

Classification and labels will help but not solve the problem, and they’re not the solution any more than DLP products are. Reliance on either labels or classification or DLP products to fix the problem is not the right approach.

In talking with big-time data shops we have learned that the best approach is to train users to handle data correctly, and to accomplish that you need to train users to recognize sensitive information (a leader at another large organization said that sensitive info is like obscene material - as Justice Stewart put it, really difficult to define, but he knew it when he saw it). Awareness and motivation for users will be the only way to start scaling the approach. Also, the first step has to be eliminating as many records as possible before embarking on any sort of classification/labeling.

As for the moving and morphing data, of course - but if you haven’t labeled the stuff at the get-go, you’ll never know whether the data was intended to be protected or treated as public. The other comments about data discovery and real world applicability? Hear, hear.

By Chris Walsh  on  07/24  at  12:39 PM

@Kees: Love the dissertation simile - great combination of cynicism and accuracy.

@Nick: Right on.  A simple scheme with 2 or 3 buckets and a small number of rules to decide what goes in which bucket.  Back it by checking on actual behavior and correcting/coaching/disciplining people (and adjusting your scheme) as needed.  Neither rocket science nor panacea, but good operational practice.

By Mike Rothman  on  07/24  at  02:41 PM

Seems to me that labels can be useful in indicating different levels of scrutiny, enforcement, quarantine, etc. You use the DLP capabilities to actually look at the data and the small number of labels to figure out what to do with it. Maybe that’s what @ChrisWalsh said, but I’m a bit slow on the uptake.
