I tend to avoid “security jazz” blog posts – esoteric arguments contrasting what we should be doing in security against what we do today. These rants don’t really help IT professionals get their jobs done, so I skip them. But this is going to be such a post, because I need to talk about big data security approaches. Many of you will stop reading at this point. But for you data architects, CISOs, and security product development teams learning how to plan for big data security (particularly those of you who have been asking me lately) and wanting to understand the arcane research that influences my recommendations, read on.

I got started on this topic by considering what big data security will look like in coming years. I was reacting to the apparently random recommendations in the general security press. I eventually decided that the answer is simply unknown. I can’t fairly slam the press for their apparently moronic recommendations, because I cannot be sure they will not turn out to be correct. Stock-picking monkeys have made fools of professional traders, and it is likely to happen again with big data security predictions. As big data continues its metamorphosis – in data storage, data and node management, system orchestration, and query methods – the ways we secure these clusters will change. A series of industry research papers (PDF), blog posts, and academic research projects on big data convince me that we are still very early in big data’s evolution. Across them we see some evolutionary changes (such as the Berkeley AMPLab’s Spark product), as well as some total rethinks of how to do analysis with big data (such as Google’s Pregel).

I am raising this topic here because I think it merits an open discussion. I am frequently asked how to approach big data security, and given that big data currently looks like Hadoop and Cassandra, there are specific actionable steps that make sense for these types of clusters. But for someone architecting security products, this model might well be obsolete by the time the product goes live. Based upon research findings from last year, things like masking, encryption, tokenization, identity management, and API security all make sense in Hadoop. When I speak with vendors who are looking to design big-data-specific security products, I need to caveat all recommendations with “as far as we know today”. I certainly cannot say that in five years anyone will still be using Hadoop. I guess Hadoop will still be a big player, but who knows? It could be Dremel, a SQL-like system, in which case we will be able to apply many techniques we have already evolved for relational stores. If fashion dictates a Pregel-like ant swarm of worker threads, not so much.

Here is where I come to the predictions and recommendations. I would like to recommend that you embed as much security into the application layer as you can. That’s the best place to control access and decide who can see what. The application is the gateway to the data, where you can abstract away many of the complexities of the underlying data management layer to focus on user rights and business logic enforcement. Application-layer controls also scale security with the application. These are the reasons I think (Updated) Intel Mashery, Axway Vordel, and CA Layer7 are important. But we cannot yet tell where big data is going – we don’t know what applications, orchestration, queries, data storage, or even architectures will look like going forward – so it is impossible to know whether any security model we design will be absurd in a few years. The safe approach, given the uncertainty of big data environments, would be to protect at the data layer. That means using encryption, masking, and tokenization technologies that don’t expose sensitive data to big data environments. Making that work currently requires big data security clusters fronting big data analytics clusters – not terribly efficient, and you need another cluster (perhaps twice as many, depending on scale).
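To make “protect at the data layer” a bit more concrete, here is a minimal Python sketch of masking and tokenizing a record before it ever reaches the cluster. Everything in it is an assumption for illustration only: the field names, the hard-coded key, and the helper names `tokenize`, `mask`, and `protect_record` are hypothetical, and the keyed-hash token is a one-way stand-in rather than the reversible, vault-backed tokenization that commercial products provide.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real deployment would fetch it from
# a key manager that lives outside the big data cluster.
SECRET_KEY = b"example-key-not-for-production"


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a sensitive value with a deterministic keyed token.

    HMAC-SHA256 yields the same token for the same input, so joins and
    group-bys still work, but the original value cannot be read back.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Hide everything except the last `visible` characters."""
    if visible <= 0 or len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


def protect_record(record: dict, tokenize_fields: set, mask_fields: set) -> dict:
    """Return a copy of `record` that is safe to load into the cluster."""
    protected = {}
    for field, value in record.items():
        if field in tokenize_fields:
            protected[field] = tokenize(str(value))
        elif field in mask_fields:
            protected[field] = mask(str(value))
        else:
            protected[field] = value  # non-sensitive fields pass through
    return protected


if __name__ == "__main__":
    raw = {"user_id": "alice", "card": "4111111111111111", "amount": 42}
    print(protect_record(raw, tokenize_fields={"user_id"}, mask_fields={"card"}))
```

The point of the sketch is only that the transformation happens before ingestion, so whatever storage or query engine the cluster runs next year never sees the raw values. It says nothing about how to do this at scale, which is exactly where the extra cluster comes in.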

Then I realize that IT folks, trying to get their jobs done, will ignore all this overly abstract mumbo-jumbo and fall back on what they are comfortable with: the encapsulation/walled garden model of security. Put a firewall “in front” of the cluster, sealing it off (virtually) from the rest of IT. Hard firewall shell on the outside, chewy lack of security on the inside. At this point we can appreciate the Jaquith/Hoff Security Hamster Sine Wave of Pain model as a useful tool. You can show how each of these choices is right … and wrong. We will play catch-up because we have no choice in the matter.