Securing Hadoop: Technical Recommendations

Before we wrap up this series on securing Hadoop databases, I am happy to announce that Vormetric has asked to license this content, and Hortonworks is also evaluating a license. It’s community support that allows us to bring you this research free of charge. I’ve also received a couple of email and Twitter responses to the content; if you have more input to offer, now is the time to send it along so it can be evaluated with the rest of the feedback, as we will assemble the final paper in the coming week. And with that, on to the recommendations.

The following are our security recommendations to address security issues with Hadoop and NoSQL database clusters. The last time we made recommendations we joked that many security tools broke Hadoop scalability; your cluster was secure because it was likely no one would use it. Fast forward four years and both commercial and open source technologies have advanced considerably, not only addressing the threats you’re worried about, but doing so with tools designed specifically for Hadoop. This means the likelihood that a security tool will compromise cluster performance and scalability is low, and the integration hassles of old are mostly behind us.

In fact, it’s because of the rapid technical advancements in the open source community that we have done an about-face on where to look for security capabilities. We are no longer focused on just third-party security tools, but largely on the open source community, which helped close the major gaps in Hadoop security. That said, many of these capabilities are new and, like most new things, lack a degree of maturity. You still need to go through a tool selection process based upon your needs, and then do the integration and configuration work.

Requirements

As security in and around Hadoop is still relatively young, it is not a foregone conclusion that all security tools will work with a clustered NoSQL database.
We still witness instances where vendors parade the same old products they offer for other back-office systems and relational databases. To ensure you are not duped by security vendors, you still need to do your homework: evaluate products to ensure they are architecturally and environmentally consistent with the cluster architecture, not in conflict with the essential characteristics of Hadoop. Any security control used for NoSQL must meet the following requirements:

  1. It must not compromise the basic functionality of the cluster.
  2. It should scale in the same manner as the cluster.
  3. It should address a security threat to NoSQL databases or data stored within the cluster.

Our Recommendations

In the end, our big data security recommendations boil down to a handful of standard tools which can be effective in setting a secure baseline for Hadoop environments:

Use Kerberos for node authentication: We believed, at the outset of this project, that we would no longer recommend Kerberos. Implementation and deployment challenges with Kerberos suggested customers would go in a different direction. We were 100% wrong. Our research showed that adoption has increased considerably over the last 24 months, specifically because the enterprise distributions of Hadoop have streamlined the integration of Kerberos, making it reasonably easy to deploy. Now, more than ever, Kerberos is being used as a cornerstone of cluster security. It remains effective for validating nodes and, for some, authenticating users. Other security controls piggy-back off Kerberos as well. Kerberos is one of the most effective security controls at our disposal, it’s built into the Hadoop infrastructure, and enterprise bundles make it accessible, so we recommend you use it.

Use file layer encryption: Simply stated, this is how you will protect data.
File encryption protects against two attacker techniques for circumventing application security controls: it protects data if malicious users or administrators gain access to data nodes and directly inspect files, and it renders stolen files or copied disk images unreadable. If you need to address compliance or data governance requirements, data encryption is not optional. While it may be tempting to rely upon encrypted SAN/NAS storage devices, they don’t provide protection from credentialed user access, granular protection of files, or multi-key support. File layer encryption also provides consistent protection across different platforms regardless of OS/platform/storage type, with some products even protecting encryption operations in memory. Just as important, encryption meets our requirements for big data security: it is transparent to both Hadoop and calling applications, and scales out as the cluster grows.

But you have a choice to make: use open source HDFS encryption, or a third-party commercial product. Open source products are freely available and have open source key management support. But keep two things in mind. First, the HDFS encryption engine only protects data on HDFS, leaving other types of files exposed; commercial variants that work at the file system layer cover all files. Second, open source options lack some of the external key management support, trusted binaries, and full support that commercial products offer. Free is always nice, but for many of the enterprise customers we polled, complete coverage and support tilted the balance. Regardless of which option you choose, this is a mandatory security control.

Use key management: File layer encryption is not effective if an attacker can access encryption keys. Many big data cluster administrators store keys on local disk drives because it’s quick and easy, but it’s also insecure, as keys can be collected by the platform administrator or an attacker.
And we are seeing keytab files sitting around unprotected in file systems. Use a key management service to distribute keys and certificates, and manage different keys for each group, application, and user. This requires additional setup and possibly commercial key management products to scale with your big data environment, but it’s critical. Most of the encryption controls we recommend depend on key/certificate security.

Use Apache Ranger: In the original version of this research we were most worried about the use of a dozen modules with Hadoop, all deployed with ad-hoc configuration, hidden within the complexities of the cluster, each offering up a unique attack surface to potential attackers.

Deployment validation
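
The key management concern raised above, keytab and key files left readable on local disk, is something you can spot-check on a single node with a few lines of script. What follows is a minimal Python sketch and not part of the original research: the function name `find_exposed_keytabs`, the `.keytab` file suffix, and the rule that any group or other permission bit counts as "exposed" are all illustrative assumptions. It only looks at POSIX permission bits, so it says nothing about ACLs, backups, or who holds root.

```python
import os
import stat


def find_exposed_keytabs(root, suffixes=(".keytab",)):
    """Walk `root` and return sorted (path, octal_mode) pairs for files whose
    names end in one of `suffixes` and that grant any permission to group or
    other users.

    Assumption (hypothetical policy): a keytab should be accessible to its
    owner only, e.g. mode 0o600 or 0o400.
    """
    exposed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                # os.stat follows symlinks, so the target's mode is checked.
                mode = stat.S_IMODE(os.stat(path).st_mode)
            except OSError:
                # Unreadable entry or broken symlink: skip rather than guess.
                continue
            # Flag the file if group or other has any read/write/execute bit.
            if mode & (stat.S_IRWXG | stat.S_IRWXO):
                exposed.append((path, oct(mode)))
    return sorted(exposed)
```

Calling `find_exposed_keytabs("/etc/security")` (the directory is only an example; point it at wherever your distribution places keytabs) returns an empty list when every matching file is owner-only. A check like this finds the symptom on one host; it does not replace moving keys off local disk and into a key management service.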

Totally Transparent Research is the embodiment of how we work at Securosis. It’s our core operating philosophy, our research policy, and a specific process. We initially developed it to help maintain objectivity while producing licensed research, but its benefits extend to all aspects of our business.

Going beyond Open Source Research, and a far cry from the traditional syndicated research model, we think it’s the best way to produce independent, objective, quality research.

Here’s how it works:

  • Content is developed ‘live’ on the blog. Primary research is generally released in pieces, as a series of posts, so we can digest and integrate feedback, making the end results much stronger than traditional “ivory tower” research.
  • Comments are enabled for posts. All comments are kept except for spam, personal insults of a clearly inflammatory nature, and completely off-topic content that distracts from the discussion. We welcome comments critical of the work, even if somewhat insulting to the authors. Really.
  • Anyone can comment, and no registration is required. Vendors or consultants with a relevant product or offering must properly identify themselves. While their comments won’t be deleted, the writer/moderator will “call out”, identify, and possibly ridicule vendors who fail to do so.
  • Vendors considering licensing the content are welcome to provide feedback, but it must be posted in the comments - just like everyone else. There is no back channel influence on the research findings or posts.
  • Analysts must reply to comments and defend the research position, or agree to modify the content.
  • At the end of the post series, the analyst compiles the posts into a paper, presentation, or other delivery vehicle. Public comments and input factor into the research, where appropriate.
  • If the research is distributed as a paper, significant commenters/contributors are acknowledged in the opening of the report. If they did not post their real names, handles used for comments are listed. Commenters do not retain any rights to the report, but their contributions will be recognized.
  • All primary research will be released under a Creative Commons license. The current license is Non-Commercial, Attribution. The analyst, at their discretion, may add a Derivative Works or Share Alike condition.
  • Securosis primary research does not discuss specific vendors or specific products/offerings, unless used to provide context, contrast or to make a point (which is very very rare).
  • Although quotes from published primary research (and published primary research only) may be used in press releases, said quotes may never mention a specific vendor, even if the vendor is mentioned in the source report. Securosis must approve any quote to appear in any vendor marketing collateral.
  • Final primary research will be posted on the blog with open comments.
  • Research will be updated periodically to reflect market realities, based on the discretion of the primary analyst. Updated research will be dated and given a version number.
  • For research that cannot be developed using this model, such as complex principles or models that are unsuited for a series of blog posts, the content will be chunked up and posted at or before release of the paper to solicit public feedback, and provide an open venue for comments and criticisms.
  • In rare cases Securosis may write papers outside of the primary research agenda, but only if the end result can be non-biased and valuable to the user community to supplement industry-wide efforts or advances. A “Radically Transparent Research” process will be followed in developing these papers, where absolutely all materials are public at all stages of development, including communications (email, call notes).
  • Only the free primary research released on our site can be licensed. We will not accept licensing fees on research we charge users to access.
  • All licensed research will be clearly labeled with the licensees. No licensed research will be released without indicating the sources of licensing fees. Again, there will be no back channel influence. We’re open and transparent about our revenue sources.

In essence, we develop all of our research out in the open, and not only seek public comments, but keep those comments indefinitely as a record of the research creation process. If you believe we are biased or not doing our homework, you can call us out on it and it will be there in the record. Our philosophy involves cracking open the research process, and using our readers to eliminate bias and enhance the quality of the work.

On the back end, here’s how we handle this approach with licensees:

  • Licensees may propose paper topics. The topic may be accepted if it is consistent with the Securosis research agenda and goals, but only if it can be covered without bias and will be valuable to the end user community.
  • Analysts produce research according to their own research agendas, and may offer licensing under the same objectivity requirements.
  • The potential licensee will be provided an outline of our research positions and the potential research product so they can determine if it is likely to meet their objectives.
  • Once the licensee agrees, development of the primary research content begins, following the Totally Transparent Research process as outlined above. At this point, there is no money exchanged.
  • Upon completion of the paper, the licensee will receive a release candidate to determine whether the final result still meets their needs.
  • If the content does not meet their needs, the licensee is not required to pay, and the research will be released without licensing or with alternate licensees.
  • Licensees may host and reuse the content for the length of the license (typically one year). This includes placing the content behind a registration process, posting on white paper networks, or translation into other languages. The research will always be hosted at Securosis for free without registration.

Here is the language we currently place in our research project agreements:

Content will be created independently of LICENSEE with no obligations for payment. Once content is complete, LICENSEE will have a 3 day review period to determine if the content meets corporate objectives. If the content is unsuitable, LICENSEE will not be obligated for any payment and Securosis is free to distribute the whitepaper without branding or with alternate licensees, and will not complete any associated webcasts for the declining LICENSEE. Content licensing, webcasts and payment are contingent on the content being acceptable to LICENSEE. This maintains objectivity while limiting the risk to LICENSEE. Securosis maintains all rights to the content and to include Securosis branding in addition to any licensee branding.

Even this process itself is open to criticism. If you have questions or comments, you can email us or comment on the blog.