Securosis

Research

Securing Hadoop: Operational Security Issues

Beyond the architectural security issues endemic to Hadoop and NoSQL platforms discussed in the last post, IT teams expect some common security processes and supporting tools familiar from other data management platforms. That includes “turning the dials” on configuration management, vulnerability assessment, and maintaining patch levels across a complex assembly of supporting modules. The day-to-day processes IT managers follow to ensure typical application platforms are properly configured have evolved over years – core platform capabilities, community contributions, and commercial third-party support to fill in gaps. Best practices, checklists, and validation tools to verify things like admin rights are sufficiently tight, and that nodes are patched against known and perhaps even unknown vulnerabilities. Hadoop security has come a long way in just a few years, but it still lacks the maturity in day to day operational security offerings, and it is here that we find most firms continue to struggle. The following is an overview of the most common threats to data management systems, where operational controls offer preventative security measures to close off most common attacks. Again we will discuss the challenges, then map them to mitigation options. Authentication and authorization: Identity and authentication are central to any security effort – without them we cannot determine who should get access to data. Fortunately the greatest gains in NoSQL security have been in identity and access management. This is largely thanks to providers of enterprise Hadoop distributions, who have performed much of the integration and setup work. We have evolved from simple in-database authentication and crude platform identity management to much better integrated LDAP, Active Directory, Kerberos, and X.509 based authentication options. Leveraging those capabilities we can use established roles for authorization mapping, and sometimes extend to fine-grained authorization services with Apache Sentry, or custom authorization mapping controlled from within the calling application the database. Administrative data access: Most organizations have platform administrators and NoSQL database administrators, both with access to the cluster’s files. To provide separation of duties – to ensure administrators cannot view content – a facility is needed to segregate administrative roles and keep unwanted access to a minimum. Direct access to files or data is commonly addressed through a combination of role based-authorization, access control lists, file permissions, and segregation of administrative roles – such as with separate administrative accounts, bearing different roles and credentials. This provides basic protection, but cannot protect archived or snapshotted content. Stronger security requires a combination of data encryption and key management services, with unique keys for each application or cluster. This prevents different tenants (applications) in a shared cluster from viewing each other’s data. Configuration and Patch Management: With a cluster of servers, which may have hundreds of nodes, it is common to run different configurations and patch levels at one time. As nodes are added we see configuration skew. Keeping track of revisions is difficult. Existing configuration management tools can cover the underlying platforms, and HDFS Federation will help with cluster management, but they both leave a lot to be desired – including issuing encryption keys, avoiding ad hoc configuration changes, ensuring file permissions are set correctly, and ensuring TLS is correctly configured. NoSQL systems do not yet have counterparts for the configuration management tools available for relational platforms, and even commercial Hadoop distributions offer scant advice on recommended configurations and pre-deployment checklists. But administrators still need to ensure configuration scripts, patches, and open source code revisions are consistent. So we see NoSQL databases deployed on virtual servers and cloud instances, with home-grown pre-deployment scripts. Alternatively a “golden master” node may embody extensive configuration and validation, propagated automatically to new nodes before they can be added into the cluster. Software Bundles: The application and Hadoop stacks are assembled from many different components. Underlying platforms and file systems also vary – with their own configuration settings, ownership rights, and patch levels. We see organizations increasingly using source code control systems to handle open source version management and application stack management. Container technologies also help developers bundle up consistent application deployments. Authentication of applications and nodes: If an attacker can add a new node they control to the cluster, they can exfiltrate data from the cluster. To authenticate nodes (rather than users) before they can join a cluster, most firms we spoke with either employ X.509 certificates or Kerberos. Both can authenticate users as well, but we draw this distinction to underscore the threat of rogue applications or nodes being added to the cluster. Deployment of these services brings risks as well. For example if a Kerberos keytab file can be accessed or duplicated – perhaps using credentials extracted from virtual image files or snapshots – a node’s identity can be forged. Certificate-based identity options implicitly complicate setup and deployment, but properly deployed they can provide strong authentication and stronger security. Audit and Logging: If you suspect someone has breached your cluster, can you detect it, or trace back to the root cause? You need an activity record, which is usually provided by event logging. A variety of add-on logging capabilities are available, both open source and commercial. Scribe and LogStash are open source tools which integrate into most big data environments, as do a number of commercial products. You can leverage the existing cluster to store logs, build an independent cluster, or even leverage other dedicated platforms like a SIEM or Splunk. That said, some logging options do not provide an auditor sufficient information to determine exactly what actions occurred. You will need to verify that your logs are capturing both the correct event types and user actions. A user ID and IP address are insufficient – you also need to know what queries were issued. Monitoring, filtering, and blocking: There are no built-in monitoring tools to detect misuse or block malicious queries. There isn’t even yet a consensus on what a malicious big data query looks like – aside from crappy MapReduce scripts written by bad programmers. We are just seeing the first viable releases of Hadoop activity monitoring tools. No longer the “after-market speed regulators” they once were, current tools typically embedded into a

Share:
Read Post
dinosaur-sidebar

Totally Transparent Research is the embodiment of how we work at Securosis. It’s our core operating philosophy, our research policy, and a specific process. We initially developed it to help maintain objectivity while producing licensed research, but its benefits extend to all aspects of our business.

Going beyond Open Source Research, and a far cry from the traditional syndicated research model, we think it’s the best way to produce independent, objective, quality research.

Here’s how it works:

  • Content is developed ‘live’ on the blog. Primary research is generally released in pieces, as a series of posts, so we can digest and integrate feedback, making the end results much stronger than traditional “ivory tower” research.
  • Comments are enabled for posts. All comments are kept except for spam, personal insults of a clearly inflammatory nature, and completely off-topic content that distracts from the discussion. We welcome comments critical of the work, even if somewhat insulting to the authors. Really.
  • Anyone can comment, and no registration is required. Vendors or consultants with a relevant product or offering must properly identify themselves. While their comments won’t be deleted, the writer/moderator will “call out”, identify, and possibly ridicule vendors who fail to do so.
  • Vendors considering licensing the content are welcome to provide feedback, but it must be posted in the comments - just like everyone else. There is no back channel influence on the research findings or posts.
    Analysts must reply to comments and defend the research position, or agree to modify the content.
  • At the end of the post series, the analyst compiles the posts into a paper, presentation, or other delivery vehicle. Public comments/input factors into the research, where appropriate.
  • If the research is distributed as a paper, significant commenters/contributors are acknowledged in the opening of the report. If they did not post their real names, handles used for comments are listed. Commenters do not retain any rights to the report, but their contributions will be recognized.
  • All primary research will be released under a Creative Commons license. The current license is Non-Commercial, Attribution. The analyst, at their discretion, may add a Derivative Works or Share Alike condition.
  • Securosis primary research does not discuss specific vendors or specific products/offerings, unless used to provide context, contrast or to make a point (which is very very rare).
    Although quotes from published primary research (and published primary research only) may be used in press releases, said quotes may never mention a specific vendor, even if the vendor is mentioned in the source report. Securosis must approve any quote to appear in any vendor marketing collateral.
  • Final primary research will be posted on the blog with open comments.
  • Research will be updated periodically to reflect market realities, based on the discretion of the primary analyst. Updated research will be dated and given a version number.
    For research that cannot be developed using this model, such as complex principles or models that are unsuited for a series of blog posts, the content will be chunked up and posted at or before release of the paper to solicit public feedback, and provide an open venue for comments and criticisms.
  • In rare cases Securosis may write papers outside of the primary research agenda, but only if the end result can be non-biased and valuable to the user community to supplement industry-wide efforts or advances. A “Radically Transparent Research” process will be followed in developing these papers, where absolutely all materials are public at all stages of development, including communications (email, call notes).
    Only the free primary research released on our site can be licensed. We will not accept licensing fees on research we charge users to access.
  • All licensed research will be clearly labeled with the licensees. No licensed research will be released without indicating the sources of licensing fees. Again, there will be no back channel influence. We’re open and transparent about our revenue sources.

In essence, we develop all of our research out in the open, and not only seek public comments, but keep those comments indefinitely as a record of the research creation process. If you believe we are biased or not doing our homework, you can call us out on it and it will be there in the record. Our philosophy involves cracking open the research process, and using our readers to eliminate bias and enhance the quality of the work.

On the back end, here’s how we handle this approach with licensees:

  • Licensees may propose paper topics. The topic may be accepted if it is consistent with the Securosis research agenda and goals, but only if it can be covered without bias and will be valuable to the end user community.
  • Analysts produce research according to their own research agendas, and may offer licensing under the same objectivity requirements.
  • The potential licensee will be provided an outline of our research positions and the potential research product so they can determine if it is likely to meet their objectives.
  • Once the licensee agrees, development of the primary research content begins, following the Totally Transparent Research process as outlined above. At this point, there is no money exchanged.
  • Upon completion of the paper, the licensee will receive a release candidate to determine whether the final result still meets their needs.
  • If the content does not meet their needs, the licensee is not required to pay, and the research will be released without licensing or with alternate licensees.
  • Licensees may host and reuse the content for the length of the license (typically one year). This includes placing the content behind a registration process, posting on white paper networks, or translation into other languages. The research will always be hosted at Securosis for free without registration.

Here is the language we currently place in our research project agreements:

Content will be created independently of LICENSEE with no obligations for payment. Once content is complete, LICENSEE will have a 3 day review period to determine if the content meets corporate objectives. If the content is unsuitable, LICENSEE will not be obligated for any payment and Securosis is free to distribute the whitepaper without branding or with alternate licensees, and will not complete any associated webcasts for the declining LICENSEE. Content licensing, webcasts and payment are contingent on the content being acceptable to LICENSEE. This maintains objectivity while limiting the risk to LICENSEE. Securosis maintains all rights to the content and to include Securosis branding in addition to any licensee branding.

Even this process itself is open to criticism. If you have questions or comments, you can email us or comment on the blog.