Data Classification Is Dead

I know what’s running through your head right now.

“WTF?!? Mogull’s totally lost it. Isn’t he that data/information-centric security dude?”

Yes I am (the info-centric guy, not the insane bit), and here’s the thing:

The concept that you can run around, analyze, and tag your data throughout the enterprise, then keep it current through changing business contexts and requirements, is totally ridiculous. Sure, we have tools today that can scan our environment and tag files based on policies, but that just applies a static classification in a dynamic environment. I have yet to talk with a customer who really does enterprise-wide data classification successfully, except for a few discrete bits of data (like credit card numbers). The truth is that's data identification, not data classification.
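To make the identification side concrete, here is a minimal Python sketch of spotting credit card numbers by content alone, with no tag or label involved. It is only an illustration, not how any particular DLP product works: the `CANDIDATE` pattern and the function names are assumptions made up for this example, and the Luhn checksum is used simply to cut down on false positives.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Identify likely card numbers from content, without relying on any tag."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test number, not a real card.
    print(find_card_numbers("Card: 4111 1111 1111 1111 on file"))
```

Note what the sketch does not do: it says nothing about how valuable the match is in a given business situation, which is exactly the gap between identifying data and classifying it.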

Enterprise content is just too volatile for static tags to really represent its value.

Even those of you in defense/intelligence don’t really do granular data classification. You just hit things with a big sledgehammer. “Is it Top Secret? Then we keep it totally isolated. What, this bit isn’t Top Secret but it’s on a Top Secret server? Frack it, we’ll just make it all Top Secret and be done with it. Need to pull it out? Go fill out this form.”

This post was inspired by a conversation yesterday where another information-centric wonk criticized the idea that data can be self-describing in any meaningful way, which is part of my principles of information-centric security. While he caught the first point, he missed my meaning in the second (policies and controls must account for business context): the data self-describes in such a way that business context can then be applied to determine its value in that situation. I know it sounds like science fiction, but we're starting to see real-world scenarios, and I'll be the first to admit this is going to be a big area of advance over the next few years.
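One way to picture that second point is as a function of two inputs: what the content says about itself, and the business context at the moment of access. The Python sketch below is purely hypothetical. The `Descriptor` and `Context` fields, the `sensitivity` function, and the earnings-draft rule are all invented for illustration and do not describe any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """What the content says about itself (hypothetical fields)."""
    kind: str
    owner_dept: str


@dataclass(frozen=True)
class Context:
    """The business situation at the moment of access (hypothetical fields)."""
    user_dept: str
    earnings_released: bool


def sensitivity(desc: Descriptor, ctx: Context) -> str:
    """Derive a value judgment at access time instead of reading a stored label."""
    if desc.kind == "earnings_draft":
        if ctx.earnings_released:
            return "public"  # the same content stops being sensitive after release
        if ctx.user_dept == desc.owner_dept:
            return "internal"
        return "restricted"
    return "internal"  # assumed default for everything else


if __name__ == "__main__":
    draft = Descriptor(kind="earnings_draft", owner_dept="finance")
    print(sensitivity(draft, Context(user_dept="sales", earnings_released=False)))
    print(sensitivity(draft, Context(user_dept="sales", earnings_released=True)))
```

The descriptor never changes, yet the answer does, which is the behavior a static tag applied once cannot reproduce.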

Now there is one piece of data classification that isn’t dead (I like sensational headlines just like the next person). That’s the business process of prioritizing information. That’s where you sit down with business executives and determine what information is more valuable than other information for your organization. It will drive all the protective strategies and dynamic protections we talk about when applying information-centric security. That’s absolutely vital to successful information security.

Thus we prioritize and identify information, but this is different than data classification, which is the concept that after these two steps, we can apply static labels as a way of protecting information.

That, my friend, is not only dead, it was never really alive.

10 Comments

PCI Blog - Compliance Demystified » Blog Arc 2008-05-07

[...] going to step out on a limb here and contradict what others have been saying about data classification. Data classification is not [...]

rybolov 2008-05-01

Hi Rich, I've been thinking about this for a whole week now. Yeah, it took that long to distill the idea down to something concise. I think what you really want to say is that you only really need to determine what is the most critical 25% of the data types that you have, and that an exhaustive classification exercise does not return much value for the effort expended on it. Let's be honest here: you're taking the high-water mark anyway, so once you've figured out what that is, the other data types don't matter. In other words, when it comes to data classification, 25% is "close enough for Government work"!

rmogull 2008-05-01

@Ben- exactly, except the DLP stuff is coming along reasonably well. @Roman- sounds kind of ideal. I've worked with gov clients that have to manually declassify anything coming out of a TS system back down. You can do it, but it's a manual process. @rybolov- absolutely.

Roman 2008-04-30

Just to pick at a couple nits as well.. You said: "Even those of you in defense/intelligence don't *really* do granular data classification. You just hit things with a big sledgehammer. 'Is it Top Secret? Then we keep it totally isolated. What, this bit isn't Top Secret but it's on a Top Secret server? Frack it, we'll just make it all Top Secret and be done with it. Need to pull it out? Go fill out this form.'" Actually, just because data is on a system that is accredited to process top secret doesn't mean we've made all the data on it top secret. All data, in whatever form, is to be labelled with the proper classification. Just because a system is accredited for TS doesn't mean it even contains any TS; all the data may well be unclassified (a la 'brand new workstation'). In fact, there is a serious push regarding 'overclassification' of data, along with the fact that only certain individuals are allowed to classify data in the first place (see 'Original Classification Authority'). Now, one item that I don't recall exactly from netsec podcast #103 was where someone (Paul? Martin?) mentioned how one can have classified or sensitive data on a system, but then connect it to the 'Net and it gets stolen, so much for data classification. I can only imagine this sort of scenario in the commercial world, considering that the military/intel communities don't just accredit the system; the accreditation incorporates the entire network, and you do not connect networks of differing classifications. If done right (barring mistakes), the only data allowed on a system directly connected to the 'Net is unclassified. My apologies if this is not as coherent as it is in my own mind; it's late and I listened to the podcast this morning.

Network Security Podcast » Blog Archive 2008-04-29

[...] Data Classification is Dead - Rich says so! [...]

rmogull 2008-04-25

Here's how I see the difference: With information classification, we still start with prioritization (hopefully) and then run around trying to tag each bit of data, server, whatever by hand. Sometimes we use tools, but most of those pretty much look for keywords. I just don't think that's viable, and I know a lot of people that talk about it, and none that do it comprehensively. On the other side, we can prioritize our data, then build content-based controls (like DLP, DAM, etc.) to enforce the security without relying on metatags. And yes Ben, I'm picking at nits :)

Ben 2008-04-25

So, to bring this all together, it's not so much data classification that's dead, but traditional data labeling. I think we can all agree with that. How many places are actually labeling their data in this manner, anyway? The last few places I've been, we maintained a list/chart/spreadsheet of data types/examples and their corresponding classification. I like the idea of content-based controls and automating the enforcement. However, something or someone is still going to have to input the label somewhere, even if it's just an additional row, or a header in a microformat, at least until AI can read and recognize data effectively (DLP-style, perhaps, but even smarter).

Rob Lewis 2008-04-24

Good move writing this while Hoff is out of the country, or we would all be wading through an 8 page response by now! :) We intuitively want to classify and label something, and the data is the first thing we tend to look to because the IT world is object-centric. While data labels can be static, users in various roles can be dynamic. This sentence: “That's the business process of prioritizing information. That's where you sit down with business executives and determine what information is more valuable than other information for your organization”, is really part of creating the business rules. If you think about it, these rules are often made up using terms such as John, the CSO, the human resources dept., or the outside contractor Dave, of company ABC. In other words, they are made up of users, groups or roles. Yet we then proceed to classify data using the object-centric approach and wonder why there is a disconnect between business rules and security rules. If a staff person changes positions and/or roles, how does static labelling respond to that change? To support business rules, the data labelling (and the security rules) should follow the trust ranking placed on the user-role.

Ben 2008-04-24

Ummm… you lost me… so, we need to "prioritize and identify information" and that will somehow magically make it clear how to protect it… but we aren't going to label that data in any way… when I think of data classification, I always think of it as being the process of identifying and prioritizing, with the labeling being how you then document the output of that process… I mean, aren't you really picking at nits here? You still need the sensitivity of the data described, you're just bristling at the current SOP for labeling? Nice try, though? :)

Interesting Bits - April 24th 2008-04-23

Available - Realtime IT Compliance - Rebbecca Harold did a webcast for ISSA that is now available. Security4all: The dangers of Web 2.0: information gathering tactics 101 - Benny Ketelslegers has a