Our goal for this post is to succinctly outline what Hadoop (and most NoSQL) clusters look like, how they are assembled, and how they are used. This provides better understanding of the security challenges, and what sort of protections need to be leveraged to secure them. Developers and data scientists continue to stretch system performance and scalability, using customized combinations of open source and commercial products, so there is really no such thing as a ‘standard’ Hadoop deployment. With these considerations in mind, it is time to map out threats to the cluster.

NoSQL databases enable companies to collect, manage, and analyze incredibly large data sets. Thousands of firms are working on big data projects, from small startups to large enterprises. Since our original paper in 2012 the rate of adoption has only increased; platforms such as Hadoop, Cassandra, Mongo, and RIAK are now commonplace, with some firms supporting multiple installations. In just a couple years they went from “rogue IT” to “core systems”. Most firms recognized the value of “big data”, acknowledged these platforms are essential, and tasked IT teams with bringing them “under IT governance”. Most firms today are taking their first steps to retrofit security and governance controls onto Hadoop.

Let’s dig into how all the pieces fit together:

Architecture and Data Flow

Hadoop has been wildly successful because it scales well, can be configured to handle a wide variety of use cases, and is very inexpensive compared to relational and data warehouse alternatives. Which is all another way of saying it’s cheap, fast, and flexible. To show why and how it scales, let’s take a look at a Hadoop cluster architecture:

There are several things to note here. The architecture promotes scaling and performance. It provides parallel processing, and additional nodes provide ‘horizontal’ scalability. This architecture is also inherently multi-tenant, supporting multiple applications across one or more file groups. But there are a lot of moving parts; each node communicates with its peers to ensure that data is properly replicated, nodes are on-line and functional, storage is optimized, and application requests are being processed. We’ll dig into specific threats to Hadoop clusters later in this series.

Hadoop Stack

To appreciate Hadoop’s flexibility, you need to understand that a cluster can be fully customized. It is useful to think about the Hadoop framework as a ‘stack’, much like a LAMP stack, but much less standardized. While Pig and Hive are commonly used, the ability to mix and match components makes deployments much more diverse. For example, Sqoop and Yarn are alternative data access services. You can select different big data environments to support columnar, graph, document, XML, or multidimensional data. And over the last couple years MapReduce has largely given way to SQL query engines – with Spark, Drill, Impala, and Hive all accommodating increasing use of SQL-style queries. This modularity offers great flexibility to assemble and tailor clusters to behave and perform exactly as desired. But it also makes security more difficult – each option brings its own security options and issues.

The beauty part is that you can set up a cluster to satisfy your usability, scalability, and performance goals. You can tailor it to specific types of data, or add modules to facilitate analysis of certain data sets. But that flexibility brings complexity. Each module runs a specific version of code, has its own configuration, and may require independent authentication to work in the cluster. Many pieces must work in tandem here to process data, so each requires its own security review.

Some of you reading this are already familiar with the architecture and component stack of a Hadoop cluster. You may be asking, “Why we are we going through these basics?”. To understand threats and appropriate responses, you need to first understand how all the pieces of the cluster work together. Each component interface is a trust relationship, and each relationships is a target. Each component offers attacker a specific set of potential exploits, and defenders have a corresponding set of options for attack detection and prevention. Understanding architecture and cluster composition is the first step to putting together your security strategy.

Our next post will present several strategies used to secure big data. Each model includes basic benefits and requires supplementary security tools. After selecting a strategy, you can put together a collection of security controls to meet your objectives.