Before I dig into today’s post I want to share a couple of observations. First, my new copy of the Harvard Business Review just arrived. The cover story is “Getting Control of Big Data”. It’s telling that HBR thinks big data is a trend important enough to warrant a full spread, and feels business managers need to understand big data and the benefits and risks it poses to business. As soon as I finish this post I intend to dive into these articles. Now that I have just about finished this research effort, I look forward to contrasting what I have discovered with their perspective.

Second, when we talk about big data security, we are really referring to both data and infrastructure security. We want to protect the application (or database, if you prefer that term) that manages data, with the end goal of protecting the information under management. If an attacker can access data directly, bypassing the database management system, they will. Barring a direct path to the information, they will look for weaknesses in or ways to subvert the database application. So it’s important to remember that when we talk about database security we mean both data and infrastructure protection.

Finally, a point about clarity. Big data security is one of the tougher topics to describe, especially as we here at Securosis prefer to describe things in black and white terms for the sake of clarity. But for just about every rule we establish and every emphatic statement we make, we have to acknowledge exceptions. Given the variety of different big data distributions and add-on capabilities, you can likely find a single instance of every security control described in today’s post. But it’s usually a single security control, like encryption, with the other security controls absent from the various packages. Nothing offers even a partial suite of solutions, much less a comprehensive offering.


Today I want to discuss operational security of big data environments. Unlike yesterday’s post, which discussed architectural security issues endemic to the platform, this one addresses security controls of an operational nature. That includes “turning the dials” tasks like configuration management and access controls, as well as “bolt-on” capabilities such as auditing and security gateways. We see the greatest impact in these areas, and vendors are jumping in with security offerings to fill the gaps.

Normally when we consider how to secure data repositories, we consider the following major areas:

  • Encryption: Encryption is the standard way to protect data at rest from undesired access. And just because folks don’t use archiving features to back up data does not mean a rogue DBA or cloud service manager won’t. I think two or three of the more obscure NoSQL variants provide encryption for data at rest, but most do not. And the majority of available encryption products offer neither sufficient horizontal scalability nor adequate transparency for use with big data. This is a critical issue.
  • Administrative data access: Each node has an admin, and each admin can read the node’s data if they choose. As with encryption, we need a boundary or facility to provide separation of duties between different administrators. The requirement is the same as on relational platforms – but big data platforms lack the array of built-in facilities, documentation, and third-party tools that relational platforms offer to address it. Unwanted direct access to data files or data node processes can be addressed through a combination of access controls, separation of roles, and encryption technologies, but out of the box, data is only as secure as your least trustworthy administrator. It’s up to the system designer to select controls to close this gap.
  • Configuration and patch management: When managing a cluster of servers, it’s common to have nodes running different configurations and patch levels. And if you’re using dissimilar platforms to support the cluster, you need to figure out how to handle management. Existing configuration management tools work for underlying platforms, and HDFS Federation will help with cluster management, but careful planning is still necessary. I will go into more detail about how in the next post, when I make recommendations. The cluster may tolerate nodes cycling without loss of data or service interruption, but reboots can still cause serious performance issues, depending on which nodes are affected and how the cluster is configured. The upshot is that people don’t patch, fearing user complaints. Perhaps you have heard that one before.
  • Authentication of applications/clients: Hadoop uses Kerberos to authenticate users and add-on services to the HDFS cluster. But a rogue client can be inserted onto the network if a Kerberos ticket is stolen or duplicated. This is more of a concern when embedding credentials in virtual and cloud environments, where it’s relatively easy to introduce an exact replica of a client app or service. A clone of a node is often all that’s needed to introduce a corrupted node or service into a cluster. It’s easy to impersonate a node or service in the cluster, but doing so requires an attacker to compromise the management plane of your environment or obtain a backup of a client. Although it is a pain to set up, strong authentication through Kerberos is one of your principal security tools; it helps solve the critical problem of who can access Hadoop services.
  • Audit and logging: One area with a variety of add-on capabilities is logging. Scribe and LogStash are open source tools that integrate into most big data environments, as do a number of commercial products. So you just need to find a compatible tool, install it, integrate it with other systems such as SIEM or log management, and then actually review the results. Without actually looking at the data and developing policies to detect fraud, logging is not useful.
  • Monitoring, filtering, and blocking: There are no built-in monitoring tools to look for misuse or block malicious queries. In fact I don’t believe anyone has ever described what a malicious query might look like in a big data environment – other than crappy MapReduce scripts written by bad programmers. It’s assumed that you’ll authenticate clients through Kerberos if you care about security, and MapReduce access is gated by digest authentication. There are several monitoring tools for big data environments, but most review data and user requests at the API layer. The problem is that these solutions require a ‘choke point’, a path through which all client connections must pass. Our own David Mortman called these ‘after-market speed regulators’, as the security bottleneck inherently limits performance. Most deliver the basic security value they claim, but either require you to alter the deployment model or do not scale – or both.
  • API security: I am loath to include this item, simply because the subject demands an entire white paper of its own. Further, I have not yet decided whether it belongs in the architectural or operational section, as it doesn’t quite fit either. It covers integration with directory services, mapping OAuth tokens to API services, filtering requests, input validation, managing policies across nodes, and so on. Heck, some of the APIs even work without authentication, so people still haven’t admitted there is a problem. Again, there are a handful of off-the-shelf solutions to help address API security issues, but most are based on a gateway that funnels users through a single interface for all requests. There are many important issues to consider here, but they are beyond the scope of this paper.
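
To make the configuration and patch management item above a little more concrete, here is a minimal Python sketch of one way to spot nodes running different patch levels. It is an illustration under stated assumptions, not a description of any particular tool: the function name find_patch_drift, the node names, and the package versions are all hypothetical, and in practice the inventory would come from whatever configuration management system you already run.

```python
from collections import defaultdict


def find_patch_drift(inventory):
    """Report packages whose installed version differs across nodes.

    `inventory` maps a node name to a {package: version} dict.
    Returns {package: {version: [nodes...]}} for inconsistent packages only.
    """
    # Group nodes by (package, version) so disagreements become visible.
    versions = defaultdict(lambda: defaultdict(list))
    for node, packages in inventory.items():
        for package, version in packages.items():
            versions[package][version].append(node)

    all_nodes = set(inventory)
    drift = {}
    for package, by_version in versions.items():
        reporting = {n for nodes in by_version.values() for n in nodes}
        missing = sorted(all_nodes - reporting)
        # A package drifts if nodes disagree on its version,
        # or if some nodes do not report it at all.
        if len(by_version) > 1 or missing:
            entry = {v: sorted(nodes) for v, nodes in by_version.items()}
            if missing:
                entry["(not installed)"] = missing
            drift[package] = entry
    return drift


# Hypothetical inventory for a three-node cluster (illustrative values only).
inventory = {
    "datanode-01": {"hadoop": "1.0.3", "openssl": "1.0.1c"},
    "datanode-02": {"hadoop": "1.0.3", "openssl": "1.0.0j"},
    "namenode-01": {"hadoop": "1.0.3", "openssl": "1.0.1c"},
}

drift = find_patch_drift(inventory)
for package, by_version in drift.items():
    print(package, by_version)
```

A report like this does not patch anything. It only shows where the cluster is inconsistent, so you can plan which nodes to cycle and when.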

In summary, there are various bits and pieces here to build with, but the selection of relevant and usable general security tools is decidedly sparse. Using a combination of built-in authentication services and add-on security products you can address the most glaring security weaknesses without killing performance or scalability – so we will discuss how next.
