Securing Hadoop: Technical Recommendations
Before we wrap up this series on securing Hadoop databases, I am happy to announce that Vormetric has asked to license this content, and Hortonworks is evaluating a license as well. It's community support that allows us to bring you this research free of charge. I have also received a couple of email and Twitter responses to the content; if you have more input to offer, now is the time to send it along so it can be evaluated with the rest of the feedback as we assemble the final paper in the coming week. And with that, on to the recommendations.

The following are our security recommendations to address security issues with Hadoop and NoSQL database clusters. The last time we made recommendations, we joked that many security tools broke Hadoop scalability: your cluster was secure because it was likely no one would use it. Fast forward four years, and both commercial and open source technologies have advanced considerably; they not only address the threats you're worried about, but were designed specifically for Hadoop. This means the odds that a security tool will compromise cluster performance and scalability are low, and the integration hassles of old are mostly behind us. In fact, it is because of the rapid technical advancements in the open source community that we have done an about-face on where to look for security capabilities. We are no longer focused on just third-party security tools, but largely on the open source community, which has helped close the major gaps in Hadoop security. That said, many of these capabilities are new and, like most new things, lack a degree of maturity. You still need to go through a tool selection process based on your needs, and then do the integration and configuration work.

Requirements

As security in and around Hadoop is still relatively young, it is not a foregone conclusion that every security tool will work with a clustered NoSQL database. We still witness instances where vendors parade the same old products they offer for other back-office systems and relational databases. To ensure you are not duped by security vendors, you still need to do your homework: evaluate products to ensure they are architecturally and environmentally consistent with the cluster architecture, not in conflict with the essential characteristics of Hadoop. Any security control used for NoSQL must meet the following requirements:

1. It must not compromise the basic functionality of the cluster.
2. It should scale in the same manner as the cluster.
3. It should address a security threat to NoSQL databases or data stored within the cluster.

Our Recommendations

In the end, our big data security recommendations boil down to a handful of standard tools which can be effective in setting a secure baseline for Hadoop environments:

Use Kerberos for node authentication: At the outset of this project we believed we would no longer recommend Kerberos; its implementation and deployment challenges suggested customers would go in a different direction. We were 100% wrong. Our research showed that adoption has increased considerably over the last 24 months, specifically because the enterprise distributions of Hadoop have streamlined Kerberos integration, making it reasonably easy to deploy. Now, more than ever, Kerberos is being used as a cornerstone of cluster security. It remains effective for validating nodes and, for some, authenticating users, and other security controls piggy-back off Kerberos as well. Kerberos is one of the most effective security controls at our disposal; it is built into the Hadoop infrastructure, and enterprise bundles make it accessible, so we recommend you use it.
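To make this concrete, here is a minimal sketch of what a Kerberized Hadoop client looks like in Java, assuming a cluster where Kerberos authentication is already enabled. The principal name and keytab path are hypothetical placeholders; in practice the authentication mode is set cluster-wide in core-site.xml by your distribution's tooling.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Require Kerberos for RPC authentication (normally set in core-site.xml).
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Authenticate this service from a keytab. Principal and path are
        // hypothetical; the keytab file itself must be protected on disk.
        UserGroupInformation.loginUserFromKeytab(
                "etl-service@EXAMPLE.COM",
                "/etc/security/keytabs/etl-service.keytab");

        // Subsequent Hadoop calls run as the authenticated principal.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/data")));
    }
}
```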
Use file layer encryption: Simply stated, this is how you protect data at rest. File encryption defends against two attacker techniques for circumventing application security controls: it protects data when malicious users or administrators gain access to data nodes and directly inspect files, and it renders stolen files or copied disk images unreadable. And if you need to address compliance or data governance requirements, data encryption is not optional. While it may be tempting to rely on encrypted SAN/NAS storage devices, they do not provide protection from credentialed user access, granular protection of files, or multi-key support. File layer encryption provides consistent protection regardless of OS, platform, or storage type, with some products even protecting encryption operations in memory. Just as important, encryption meets our requirements for big data security: it is transparent to both Hadoop and calling applications, and it scales out as the cluster grows. But you have a choice to make: use open source HDFS encryption, or a third-party commercial product. The open source option is freely available and includes open source key management support, but keep in mind that the HDFS encryption engine only protects data on HDFS, leaving other types of files exposed, and it lacks the external key management, trusted binaries, and vendor support that commercial products provide. Commercial variants that work at the file system layer cover all files. Free is always nice, but for many of those we polled, complete coverage and support tilted the balance for enterprise customers. Regardless of which option you choose, this is a mandatory security control (a sketch of the open source route follows at the end of this section).

Use key management: File layer encryption is not effective if an attacker can access the encryption keys. Many big data cluster administrators store keys on local disk drives because it is quick and easy, but that is also insecure, because keys can be collected by the platform administrator or an attacker. We also still see keytab files sitting around unprotected in file systems. Use a key management service to distribute keys and certificates, and manage different keys for each group, application, and user. This requires additional setup, and possibly a commercial key management product to scale with your big data environment, but it is critical. Most of the encryption controls we recommend depend on key/certificate security (see the second sketch below).

Use Apache Ranger: In the original version of this research, we were most worried about the use of a dozen modules with Hadoop, all deployed with ad hoc configurations, hidden within the complexities of the cluster, each offering up a unique attack surface to potential attackers. Deployment validation
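As promised above, a minimal sketch of the open source route: creating an HDFS encryption zone so that everything written under a directory is transparently encrypted with a named key. The NameNode URI, directory, and key name are hypothetical placeholders, and the key must already exist in the configured key provider (see the next sketch).

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class CreateEncryptionZone {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Administrative handle to the NameNode; the URI is a placeholder.
        HdfsAdmin admin = new HdfsAdmin(
                new URI("hdfs://namenode.example.com:8020"), conf);

        // Files written under /data/pii are encrypted per-file with keys
        // derived from "pii-key"; reads are decrypted transparently for
        // authorized clients, while raw files on data nodes stay unreadable.
        admin.createEncryptionZone(new Path("/data/pii"), "pii-key");
    }
}
```

The same operation is available from the command line via `hadoop key create pii-key` followed by `hdfs crypto -createZone -keyName pii-key -path /data/pii`.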
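And the second sketch: provisioning that key through Hadoop's KeyProvider API against a central KMS, rather than leaving key material in local files where a node administrator can lift it. The KMS address and key name are hypothetical placeholders.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.crypto.key.KeyProvider;
import org.apache.hadoop.crypto.key.KeyProviderFactory;

public class ProvisionKey {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point clients at a central KMS instead of local key files;
        // the address is a placeholder and normally lives in core-site.xml.
        conf.set("hadoop.security.key.provider.path",
                 "kms://https@kms.example.com:9600/kms");

        List<KeyProvider> providers = KeyProviderFactory.getProviders(conf);
        KeyProvider provider = providers.get(0);

        // Create one key per group/application so access can be granted
        // and revoked independently.
        KeyProvider.Options options = KeyProvider.options(conf);
        options.setCipher("AES/CTR/NoPadding");
        options.setBitLength(128);
        provider.createKey("pii-key", options);
        provider.flush();
    }
}
```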