Before we wrap up this series on securing Hadoop databases, I am happy to announce that Vormetric has asked to license this content, and Hortonworks is also evaluating a license as well. It’s community support that allows us to bring you this research free of charge. Also, I’ve received a couple email and twitter responses to the content; if you have more input to offer, now is the time to send it along to be evaluated with the rest of the feedback as we will assembled the final paper in the coming week. And with that, onto the recommendations.

The following are our security recommendations to address security issues with Hadoop and NoSQL database clusters. The last time we made recommendations we joked that many security tools broke Hadoop scalability; you’re cluster was secure because it was likely no one would use it. Fast forward four years and both commercial and open source technologies have advanced considerably, not only addressing threats you’re worried about, but were designed specifically for Hadoop. This means the possibility a security tool will compromise cluster performance and scalability are low, and that integration hassles of old are mostly behind us.

In fact, it’s because of the rapid technical advancements in the open source community that we have done an about-face on where to look for security capabilities. We are no longer focused on just 3rd party security tools, but largely the open source community, who helped close the major gaps in Hadoop security. That said, many of these capabilities are new, and like most new things, lack a degree of maturity. You still need to go through a tool selection process based upon your needs, and then do the integration and configuration work.


As security in and around Hadoop is still relatively young, it is not a forgone conclusion that all security tools will work with a clustered NoSQL database. We still witness instances where vendors parade the same old products they offer for other back-office systems and relational databases. To ensure you are not duped by security vendors you still need to do your homework: Evaluate products to ensure they are architecturally and environmentally consistent with the cluster architecture — not in conflict with the essential characteristics of Hadoop.

Any security control used for NoSQL must meet the following requirements: 1. It must not compromise the basic functionality of the cluster. 2. It should scale in the same manner as the cluster. 3. It should address a security threat to NoSQL databases or data stored within the cluster.

Our Recommendations

In the end, our big data security recommendations boil down to a handful of standard tools which can be effective in setting a secure baseline for Hadoop environments:

  1. Use Kerberos for node authentication: We believed – at the outset of this project – that we would no longer recommend Kerberos. Implementation and deployment challenges with Kerberos suggested customers would go in a different direction. We were 100% wrong. Our research showed that adoption has increased considerably over the last 24 months, specifically in response to the enterprise distributions of Hadoop have streamlined the integration of Kerberos, making it reasonably easy to deploy. Now, more than ever, Kerberos is being used as a cornerstone of cluster security. It remains effective for validating nodes and – for some – authenticating users. But other security controls piggy-back off Kerberos as well. Kerberos is one of the most effective security controls at our disposal, it’s built into the Hadoop infrastructure, and enterprise bundles make it accessible so we recommend you use it.
  2. Use file layer encryption: Simply stated, this is how you will protect data. File encryption protects against two attacker techniques for circumventing application security controls: Encryption protects data if malicious users or administrators gain access to data nodes and directly inspect files, and renders stolen files or copied disk images unreadable. Oh, and if you need to address compliance or data governance requirements, data encryption is not optional. While it may be tempting to rely upon encrypted SAN/NAS storage devices, they don’t provide protection from credentialed user access, granular protection of files or multi-key support. And file layer encryption provides consistent protection across different platforms regardless of OS/platform/storage type, with some products even protecting encryption operations in memory. Just as important, encryption meets our requirements for big data security — it is transparent to both Hadoop and calling applications, and scales out as the cluster grows. But you have a choice to make: Use open source HDFS encryption, or a third party commercial product. Open source products are freely available, and has open source key management support. But keep in mind that HDFS encryption engine only protects data on HDFS, leaving other types of files exposed. Commercial variants that work at the file system layer cover all files. Second, they lack some support for external key management, trusted binaries, and full support that commercial products do. Free is always nice, but for many of those we polled, complete coverage and support tilted the balance for enterprise customers. Regardless of which option you choose, this is a mandatory security control.
  3. Use key management: File layer encryption is not effective if an attacker can access encryption keys. Many big data cluster administrators store keys on local disk drives because it’s quick and easy, but it’s also insecure as keys can be collected by the platform administrator or an attacker. And we are seeing Keytab file sitting around unprotected in file systems. Use key management service to distribute keys and certificates; and manage different keys for each group, application, and user. This requires additional setup and possibly commercial key management products to scale with your big data environment, but it’s critical. Most of the encryption controls we recommend depend on key/certificate security.
  4. Use Apache Ranger: In the original version of this research we were most worried about the use of a dozen modules with Hadoop, all deployed with ad-hoc configuration, hidden within the complexities of the cluster, each offering up a unique attack surface to potential attackers. Deployment validation remains at the top of our list of concerns, but Apache Ranger provides a consistent management plane for setting configurations and usage policies to protect data within the cluster. You’ll still need to address issues of patching the Hadoop stack, application configuration, managing trusted machine images, and platform discrepancies. Some of those interviewed used automation scripts and source code control, others leveraged more tradition patch management systems to keep track of revisions, while still others have a management nightmare on their hands. We also recommend use of automation tools, such as Chef and Puppet, to orchestrate pre-deployment tasks in configuration, assembling from trusted images, patching, issuing keys and even running tools like vulnerability scanners prior to deployment. Building the scripts and setting up these services takes time up front, but pays for itself in reduced management time and effort later, and ensures that each node comes online with baseline security in place.
  5. Use Logging and monitoring: To perform forensic analysis, diagnose failures, or investigate unusual behavior, you need a record of activity. You can leverage built in functions of Hadoop to create event logs, and even leverage the cluster itself to store events. Tools like LogStash, Log4J and Kafka help with streaming, management and searching. And there are plug-ins available to stream standardized syslog feeds to supporting SIEM platforms or even Splunk. We also recommend the use of more context aware monitoring tools; in 2012 none of the activity monitoring tools worked with big data platforms, but now they are. These capabilities usually plug into supporting modules like Hive and collect the query, parameters and specifics about the user/application that issues the query. This approach goes beyond basic logging as it can detect misuse and even alter the results a user sees, These also can feed events into native logs, SIEM or even Database Activity Monitoring tools.
  6. Use secure communication: Implement secure communication between nodes, and between nodes and applications. This requires an SSL/TLS implementation that actually protects all network communications rather than just a subset. This imposes a small performance penalty on transfer of large data sets around the cluster, but the burden is shared across all nodes. The real issue for most is setup, certificate issuance and configuration.

The the use of encryption, authentication and platform management tools will greatly improve the security of Hadoop clusters, and close off all of the easy paths attackers use to steal information or compromise functions. For some challenges, such as authentication, Hadoop provides excellent integration with Active Directory and LDAP services. For authorization, there are modules and services that support fine-grained control over data access, thankfully moving beyond simple role based access controls and making application developers jobs far easier. The Hadoop community has largely embraced security, and done so far faster than we imagined possible on 2012.

On security implementation strategies, when we speak with Hadoop architects and IT managers, we still hear that their most popular security model is to hide the entire cluster with network segmentation, and hope attackers can’t get by the application. But that’s not such a bad thing as almost every one we spoke with has continuously evolved other areas of security within their cluster. And much like Hadoop itself, administrators and cluster architects are getting far more sophisticated with security. Most we spoke have road-mapped all of these recommended controls, and are taking that next step to fulfill compliance obligations. Consider the recommendations above a minimum set of preventative security measures. These are easy to recommend — they are simple, cost-effective, and scalable, and they addresses real security deficiencies with big data clusters. Nothing suggested here harms performance, scalability, or functionality. Yes, they are more work to set up, but relatively simple to manage and maintain.

We hope you find this paper useful. If you have any questions or want to discuss specifics of your situation, feel free to send us a note at