Securing Big Data: Operational Security Issues
Before I dig into today’s post I want to share a couple of observations. First, my new copy of the Harvard Business Review just arrived. The cover story is “Getting Control of Big Data”. It’s telling that HBR considers big data a trend important enough to warrant a full spread, and feels business managers need to understand big data and the benefits and risks it poses to their businesses. As soon as I finish this post I intend to dive into those articles. Now that I have just about finished this research effort, I look forward to contrasting what I have discovered with their perspective.

Second, when we talk about big data security, we are really referring to both data and infrastructure security. We want to protect the application (or database, if you prefer that term) that manages data, with the end goal of protecting the information under management. If an attacker can access data directly, bypassing the database management system, they will. Barring a direct path to the information, they will look for weaknesses in, or ways to subvert, the database application. So it’s important to remember that when we talk about database security we mean both data and infrastructure protection.

Finally, a point about clarity. Big data security is one of the tougher topics to describe, especially as we here at Securosis prefer to describe things in black and white terms for the sake of clarity. But for just about every rule we establish and every emphatic statement we make, we have to acknowledge exceptions. Given the variety of big data distributions and add-on capabilities, you can likely find an instance of every security control described in today’s post somewhere. But it’s usually a single control, such as encryption, with the other controls absent from that package. Nothing offers even a partial suite of solutions, much less a comprehensive offering.

Today I want to discuss operational security of big data environments. Unlike yesterday’s post, which covered architectural security issues endemic to the platform, it is now time to address security controls of an operational nature. That includes “turning the dials” adjustments such as configuration management and access controls, as well as “bolt-on” capabilities such as auditing and security gateways. These are the areas where we see the greatest impact, and where vendors are jumping in with security offerings to fill the gaps. Normally when we look at how to secure data repositories, we consider the following major areas:

Encryption: The standard for protecting data at rest is encryption, which guards against undesired access. And just because folks don’t use archiving features to back up data does not mean a rogue DBA or cloud service manager won’t. I think two or three of the more obscure NoSQL variants provide encryption for data at rest, but most do not. And the majority of available encryption products offer neither sufficient horizontal scalability nor adequate transparency for use with big data. This is a critical issue.

Administrative data access: Each node has an administrator, and each administrator can read the node’s data if they choose. As with encryption, we need a boundary or facility that provides separation of duties between different administrators. The requirement is the same as on relational platforms, but big data platforms lack the relational world’s array of built-in facilities, documentation, and third-party tools to address it.
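To make the last two points concrete, here is a minimal sketch of application-layer encryption, where records are encrypted before they ever reach the cluster, so node administrators and anyone reading raw data files see only ciphertext. It assumes the Python cryptography package; write_to_cluster() and load_key_from_kms() are hypothetical stand-ins for your storage client and key management service, and real key management is the hard part this sketch waves away.

```python
# Minimal sketch: encrypt records client-side before they reach the cluster,
# so node administrators and anyone reading raw data files see only ciphertext.
# Assumes the 'cryptography' package; write_to_cluster() and load_key_from_kms()
# are hypothetical stand-ins for your storage client and key management service.
from cryptography.fernet import Fernet


def load_key_from_kms() -> bytes:
    # In practice this key must come from an external KMS or HSM, never from
    # a file on the data nodes; otherwise the administrative boundary disappears.
    return Fernet.generate_key()  # placeholder for demonstration only


def write_to_cluster(path: str, payload: bytes) -> None:
    # Stand-in for an HDFS/NoSQL client call.
    print(f"writing {len(payload)} encrypted bytes to {path}")


def store_record(path: str, record: str, key: bytes) -> None:
    cipher = Fernet(key)
    write_to_cluster(path, cipher.encrypt(record.encode("utf-8")))


if __name__ == "__main__":
    key = load_key_from_kms()
    store_record("/data/customers/part-0001", "alice,555-1212,platinum", key)
```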
Unwanted direct access to data files or data node processes can be addressed through a combination of access controls, separation of roles, and encryption technologies like the one sketched above, but out of the box, data is only as secure as your least trustworthy administrator. It’s up to the system designer to select controls to close this gap.

Configuration and patch management: When managing a cluster of servers, it’s common to find nodes running different configurations and patch levels. And if you’re using dissimilar platforms to support the cluster, you need to figure out how to handle management across them. Existing configuration management tools work for the underlying platforms, and HDFS Federation will help with cluster management, but careful planning is still necessary. I will go into more detail about how in the next post, when I make recommendations. The cluster may tolerate nodes cycling without data loss or service interruption, but reboots can still cause serious performance issues, depending on which nodes are affected and how the cluster is configured. The upshot is that people don’t patch, for fear of user complaints. Perhaps you have heard that one before. The drift-detection sketch at the end of this post shows one crude way to spot inconsistent nodes.

Authentication of applications/clients: Hadoop uses Kerberos to authenticate users and add-on services to the HDFS cluster. But a rogue client can be inserted onto the network if a Kerberos ticket is stolen or duplicated. This is more of a concern with credentials embedded in virtual and cloud environments, where it’s relatively easy to introduce an exact replica of a client app or service. A clone is often all that’s needed to introduce a corrupted node or service into a cluster: it’s easy to impersonate a node or a service, though it requires the attacker to compromise the management plane of your environment or obtain a backup of a client. Kerberos is a pain to set up, but strong authentication is one of your principal security tools; it addresses the critical question of who can access Hadoop services. The keytab sketch at the end of this post shows the basic workflow.

Audit and logging: One area with a variety of add-on capabilities is logging. Scribe and LogStash are open source tools that integrate into most big data environments, as do a number of commercial products. So you just need to pick a compatible tool, install it, integrate it with other systems such as SIEM or log management, and then actually review the results. Without looking at the data and developing policies to detect fraud, logging is not useful; the log-review sketch at the end of this post shows the kind of policy check I mean.

Monitoring, filtering, and blocking: There are no built-in monitoring tools to look for misuse or block malicious queries. In fact I don’t believe anyone has ever described what a malicious query looks like in a big data environment, other than crappy MapReduce queries.
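On the configuration management point: even without full configuration management coverage, a small script can flag nodes that have drifted. This is an illustration under assumptions, not a recommended tool; the node names, file paths, and read_remote_file() helper are hypothetical placeholders for however you already reach your nodes.

```python
# Minimal sketch: detect configuration drift by comparing hashes of key config
# files across nodes. read_remote_file() is a hypothetical stand-in for however
# you fetch files from nodes (ssh, an agent, a configuration API, etc.).
import hashlib
from collections import Counter

NODES = ["node01", "node02", "node03"]          # hypothetical host names
CONFIG_FILES = ["/etc/hadoop/conf/hdfs-site.xml",
                "/etc/hadoop/conf/core-site.xml"]


def read_remote_file(node: str, path: str) -> bytes:
    # Placeholder: in reality, return the file's contents fetched from the node.
    return f"example contents of {path}".encode()


def fingerprint(node: str) -> str:
    digest = hashlib.sha256()
    for path in CONFIG_FILES:
        digest.update(read_remote_file(node, path))
    return digest.hexdigest()


def report_drift(nodes):
    prints = {node: fingerprint(node) for node in nodes}
    baseline, _count = Counter(prints.values()).most_common(1)[0]
    return [node for node, fp in prints.items() if fp != baseline]


if __name__ == "__main__":
    print("Drifted nodes:", report_drift(NODES))
```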
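For the authentication item, here is a minimal sketch of a client service obtaining its own Kerberos ticket from a keytab rather than relying on embedded passwords or copied tickets. It assumes MIT Kerberos command line tools (kinit, klist) are installed; the keytab path, principal, and realm are hypothetical examples, and this shows the workflow rather than a hardened configuration.

```python
# Minimal sketch: a service authenticates to the KDC with its own keytab before
# touching the cluster, instead of embedding passwords or copying tickets around.
# The keytab path, principal, and realm below are hypothetical examples.
import subprocess

KEYTAB = "/etc/security/keytabs/etl-service.keytab"
PRINCIPAL = "etl-service/host01.example.com@EXAMPLE.COM"


def obtain_ticket(keytab: str, principal: str) -> None:
    # 'kinit -kt <keytab> <principal>' requests a ticket non-interactively.
    subprocess.run(["kinit", "-kt", keytab, principal], check=True)


def show_credentials() -> None:
    # 'klist' prints the current ticket cache so you can verify what you hold.
    subprocess.run(["klist"], check=True)


if __name__ == "__main__":
    obtain_ticket(KEYTAB, PRINCIPAL)
    show_credentials()
    # From here, Hadoop client libraries pick up the ticket cache and use it to
    # authenticate requests to cluster services via Kerberos.
```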
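And for audit and logging, a sketch of what “actually review the results” might look like. It assumes HDFS-style audit lines made of key=value pairs (ugi=, cmd=, src=), a hypothetical hdfs-audit.log file, and an arbitrary threshold; the point is only that some detection policy, however crude, has to sit on top of whatever Scribe or LogStash collects.

```python
# Minimal sketch: a trivial "review the logs" policy. Parse HDFS-style audit
# lines (key=value pairs such as ugi=, cmd=, src=) and flag users who open an
# unusually large number of files under a sensitive path. The log format and
# threshold here are simplified assumptions, not a reference parser.
import re
from collections import Counter

SENSITIVE_PREFIX = "/data/customers"   # hypothetical sensitive directory
READ_THRESHOLD = 1000                  # arbitrary example threshold

PAIR = re.compile(r"(\w+)=(\S+)")


def parse(line: str) -> dict:
    return dict(PAIR.findall(line))


def flag_bulk_readers(lines) -> list:
    reads = Counter()
    for line in lines:
        fields = parse(line)
        if fields.get("cmd") == "open" and \
           fields.get("src", "").startswith(SENSITIVE_PREFIX):
            reads[fields.get("ugi", "unknown")] += 1
    return [(user, n) for user, n in reads.items() if n > READ_THRESHOLD]


if __name__ == "__main__":
    with open("hdfs-audit.log") as fh:  # hypothetical collected audit log
        for user, count in flag_bulk_readers(fh):
            print(f"review: {user} opened {count} files under {SENSITIVE_PREFIX}")
```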