It’s been three and a half years since we published our research paper on Securing Big Data. That research paper has been one of the more popular papers we’ve ever written. And it’s no wonder as NoSQL adoption was faster than we expected; we see hundreds of new projects popping up, leveraging the scale, analytics and low cost of these platforms. It’s not hyperbole to claim it has revolutionized the database market over the last 5 years, and community support behind these platforms – and especially Hadoop – is staggering.

At the time we wrote the last paper security, Hadoop – much less the other platforms – was something of a barren wasteland. They did not include basic controls for data protection, most third party tools could not scale along with NoSQL and thus were of little use to developers, and leaders of NoSQL firms directed resources to improving performance and scalability, not security. Heck, in 2012 the version of Hadoop I evaluated did not even require and administrative password!

But when it comes to NoSQL security, and Hadoop specifically, things have changed dramatically. As we advise clients on how to implement security controls, there are many new options to consider. And while there remains some gaps in monitoring and assessment capabilities, Hadoop has (mostly) reached security parity with the relational platforms of old. We can’t call it a barren wasteland any longer, so to accurately advise people on approaches and tools to leverage, we can no longer refer them back to that original paper.

So we are kicking off a new research series to refresh this paper. Most of the content will be new. And this time we will do this a little bit differently that the last time. First, we are going to provide less background on what makes NoSQL different than relational databases, as most people in IT are now comfortable with the architectural and functional distinctions between the two. Second, most of our recommendations will still apply to NoSQL platforms in general, but this research will be more focused on Hadoop as we get a majority of questions on Hadoop security despite dozens of alternatives. Finally, as there are lots more aspects to talk about, we’ll weave preventative and detective controls into a more operational (i.e.: day to day management) model for both data and database infrastructure.

Here is how we are laying out the series:

Hadoop Architecture and Assembly — The goal with this post is to succinctly outline what Hadoop and similar styles of NoSQL clusters look like, how they are assembled and how they are used. In this light we get a better idea of the security challenges and what sort of protections need to be leveraged. As developers and data scientists stretch systems from a performance and scalability standpoint, and custom assemblage of open source and commercial products, there really is no such thing as a standard Hadoop deployment. So with these considerations in mind we will map out threats to the cluster.

Use Cases & Security Architectures — This post will discuss the strategic considerations for deploying security for big data. Depending upon which model you choose, you change where certain types of threats are addressed, and consequentially what tools you will rely upon to provide security. Or stated another way, the security model you choose will dictate what security technologies you need to prevent and detect threats. There are several approaches that organizations take to secure Hadoop and other NoSQL clusters. These range from securing the network around the cluster, Identity Management, to maintaining security controls on each node within the cluster, or even taking a data centric approach to security. We’ll go over the major trends we see today, and discuss the advantages and pitfalls of each approach.

Building Security Into the Cluster — Here is where we discuss how all of the pieces fit together. There are many security controls available, and each address a specific threat vector an attacker may employ. We’ll focus on security controls you want to build into your cluster from the start: identity, authorization, transport layer security, application security and data encryption. This will focus on the base security controls that allow you to define how the cluster should be used from a security standpoint.

Operational Security — Here we will focus on the day to day security controls for monitoring ongoing security and discovering user behavior and ongoing security operations. Aspects like configuration management, patching, logging, monitoring, and node validation. We’ll even discuss integrating a DevOps approach to cluster administration to improve speed and consistency.

Commercial Hadoop and NoSQL variants — Hadoop is the dominant flavor of ‘big data’ in use today. In this section we will discuss what the commercial Hadoop platform vendors are doing to promote security for their customers with a blend of open source, home grown and 3rd party security product support. There is no reason to roll you’re own security out of necessity as commercial variants often add on their own products or provide bundles for you. Each offers unique capabilities and each has a vision of what their customers should focus on, so we will cover some of the current offerings. We will also offer some advice on the application of security to non-Hadoop platforms. While Hadoop is the most commonly used platform, there are specialized flavors of NoSQL that are eminently appropriate for certain business challenges and are in wide use. Some even use HDFS or other Hadoop components that allow the use of the same security controls across different clusters. We will close out this section discussing where the security controls we have already discussed can be deployed in non-Hadoop environments where appropriate.

As with our original paper, this is not intended to be an exhaustive look at all potential security options, but to get the IT and development teams who run these clusters basic security controls in place.

Up next, Hadoop Architecture and Assembly.