Our previous two posts outlined several security issues inherent to big data architecture, and operational security issues common to big data clusters. With those in mind, how can one go about securing a big data cluster? What tools and techniques should you employ?
Before we can answer those questions we need some ground rules, because not all ‘solutions’ are created equally. Many vendors claim to offer big data security, but they are really just selling the same products they offer for other back office systems and relational databases. Those products might work in a big data cluster, but only by compromising the big data model to make it fit the restricted envelope of what they can support. Their constraints on scalability, coverage, management, and deployment are all at odds with the essential big data features we have discussed. Any security product for big data needs a few characteristics:
- It must not compromise the basic functionality of the cluster
- It should scale in the same manner as the cluster
- It should not compromise the essential characteristics of big data
- It should address – or at least mitigate – a security threat to big data environments or data stored within the cluster.
So how can we secure big data repositories today? The following is a list of common challenges, with security measures to address them:
- User access: We use identity and access management systems to control users, including both regular and administrator access.
- Separation of duties: We use a combination of authentication, authorization, and encryption to provide separation of duties between administrative personnel. We use application space, namespace, or schemata to logically segregate user access to a subset of the data under management.
- Indirect access: To close “back doors” – access to data outside permitted interfaces – we use a combination of encryption, access control, and configuration management.
- User activity: We use logging and user activity monitoring (where available) to alert on suspicious activity and enable forensic analysis.
- Data protection: Removal of sensitive information prior to insertion and data masking (via tools) are common strategies for reducing risk. But the majority of big data clusters we are aware of already store redundant copies of sensitive data. This means the data stored on disk must be protected against unauthorized access, and data encryption is the de facto method of protecting sensitive data at rest. In keeping with the requirements above, any encryption solution must scale with the cluster, must not interfere with MapReduce capabilities, and must not store keys on hard drives along with the encrypted data – keys must be handled by a secure key manager.
- Eavesdropping: We use SSL and TLS encryption to protect network communications. Hadoop offers SSL, but its implementation is limited to client connections. Cloudera offers good integration of TLS; otherwise look for third party products to close this gap.
- Name and data node protection: By default Hadoop HTTP web consoles (JobTracker, NameNode, TaskTrackers, and DataNodes) allow access without any form of authentication. The good news is that Hadoop RPC and HTTP web consoles can be configured to require Kerberos authentication. Bi-directional authentication of nodes is built into Hadoop, and available in some other big data environments as well. Hadoop’s model is built on Kerberos to authenticate applications to nodes, nodes to applications, and client requests for MapReduce and similar functions. Care must be taken to secure granting and storage of Kerberos tickets, but this is a very effective method for controlling what nodes and applications can participate on the cluster.
- Application protection: Big data clusters are built on web-enabled platforms – which means that remote injection, cross-site scripting, buffer overflows, and logic attacks against and through client applications are all possible avenues of attack for access to the cluster. Countermeasures typically include a mixture of secure code development practices (such as input validation, and address space randomization), network segmentation, and third-party tools (including Web Application Firewalls, IDS, authentication, and authorization). Some platforms offer built-in features to bolster application protection, such as YARN’s web application proxy service.
- Archive protection: As backups are largely an intractable problem for big data, we don’t need to worry much about traditional backup/archive security. But just because legitimate users cannot perform conventional backups does not mean an attacker would not create at least a partial backup. We need to secure the management plane to keep unwanted copies of data or data nodes from being propagated. Access controls, and possibly network segregation, are effective countermeasures against attackers trying to gain administrative access, and encryption can help protect data in case other protections are defeated.
In the end, our big data security recommendations boil down to a handful of standard tools which can be effective in setting a secure baseline for big data environments:
- Use Kerberos: This is effective method for keeping rogue nodes and applications off your cluster. And it can help protect web console access, making administrative functions harder to compromise. We know Kerberos is a pain to set up, and (re-)validation of new nodes and applications takes work. But without bi-directional trust establishment it is too easy to fool Hadoop into letting malicious applications into the cluster, or into accepting introduce malicious nodes – which can then add, alter, or extract data. Kerberos is one of the most effective security controls at your disposal, and it’s built into the Hadoop infrastructure, so use it.
- File layer encryption: File encryption addresses two attacker methods for circumventing normal application security controls. Encryption protects in case malicious users or administrators gain access to data nodes and directly inspect files, and it also renders stolen files or disk images unreadable. Encryption protects against two of the most serious threats. Just as importantly, it meets our requirements for big data security tools – it is transparent to both Hadoop and calling applications, and scales out as the cluster grows. Open source products are available for most Linux systems; commercial products additionally offer external key management, trusted binaries, and full support. This is a cost-effective way to address several data security threats.
- Management: Deployment consistency is difficult to ensure in a multi-node environment. Patching, application configuration, updating the Hadoop stack, collecting trusted machine images, certificates, and platform discrepancies, all contribute to what can easily become a management nightmare. The good news is that most of you will be deploying in cloud and virtual environments. You can leverage tools from your cloud provider, hypervisor vendor, and third parties (such as Chef and Puppet) to automate pre-deployment tasks. Machine images, patches, and configuration should be fully automated and updated prior to deployment. You can even run validation tests, collect encryption keys, and request access tokens before nodes are accessible to the cluster. Building the scripts takes some time up front but pays for itself in reduced management time later, and additionally ensures that each node comes up with baseline security in place.
- Log it!: Big data is a natural fit for collecting and managing log data. Many web companies started with big data specifically to manage log files. Why not add logging onto your existing cluster? It gives you a place to look when something fails, or if someone thinks perhaps you have been hacked. Without an event trace you are blind. Logging MR requests and other cluster activity is easy to do, and increases storage and processing demands by a small fraction, but the data is indispensable when you need it.
- Secure communication: Implement secure communication between nodes, and between nodes and applications. This requires an SSL/TLS implementation that actually protects all network communications rather than just a subset. Cloudera appears to get this right, and some cloud providers offer secure communication options as well; otherwise you will likely need to integrate these services into your application stack.
When we speak with big data architects and managers, we hear their most popular security model is to hide the cluster within their infrastructure, where attackers don’t notice it. But these repositories are now common, and lax security makes them a very attractive target. Consider the recommendations above a minimum for preventative security. And these are the easy security measures – they are simple, cost-effective, and scalable, and they addresses specific security deficiencies with big data clusters. Nothing suggested here impairs performance, scalability, or functionality. Yes, it’s more work to set up, but relatively simple to manage and maintain.
Several other threats are less easily addressed. API security, monitoring, user authentication, granular data access, and gating MapReduce requests, are all issues without simple answers. In some cases, such as with application security, there are simply too many variables – with security controls requiring tradeoffs that are entirely dependent upon the particular environment. Some controls, including certain types of encryption and activity monitoring, require modifications to deployments or data choke points that can slow data input to a snail’s pace. In many cases we have nothing to recommend – technologies have not yet evolved sufficiently to handle real-world big data requirements. That does not mean you cannot fill the gaps with your own solutions. But for some problems, there are no simple solutions. Our recommendation can greatly improve overall security of big data clusters, and we highlight outstanding operational and architectural security issues to help you develop your own security measures.
This concludes our short introduction to big data security. As always, please comment if you feel we missed something or feel we have gotten anything wrong. Big data security is both complex and rapidly changing; we know there are many different perspectives and encourage you to share yours.