Securosis

Research

Securing Big Data: Recommendations and Open Issues

Our previous two posts outlined several security issues inherent to big data architecture, and operational security issues common to big data clusters. With those in mind, how can you go about securing a big data cluster? What tools and techniques should you employ?

Before we can answer those questions we need some ground rules, because not all ‘solutions’ are created equal. Many vendors claim to offer big data security, but they are really just selling the same products they offer for other back office systems and relational databases. Those products might work in a big data cluster, but only by compromising the big data model to make it fit the restricted envelope of what they can support. Their constraints on scalability, coverage, management, and deployment are all at odds with the essential big data features we have discussed.

Any security product for big data needs a few characteristics:

  • It must not compromise the basic functionality of the cluster.
  • It should scale in the same manner as the cluster.
  • It should not compromise the essential characteristics of big data.
  • It should address – or at least mitigate – a security threat to big data environments or data stored within the cluster.

So how can we secure big data repositories today? The following is a list of common challenges, with security measures to address them:

  • User access: We use identity and access management systems to control users, including both regular and administrator access.
  • Separation of duties: We use a combination of authentication, authorization, and encryption to provide separation of duties between administrative personnel. We use application space, namespace, or schemata to logically segregate user access to a subset of the data under management.
  • Indirect access: To close “back doors” – access to data outside permitted interfaces – we use a combination of encryption, access control, and configuration management.
  • User activity: We use logging and user activity monitoring (where available) to alert on suspicious activity and enable forensic analysis.
  • Data protection: Removal of sensitive information prior to insertion and data masking (via tools) are common strategies for reducing risk. But the majority of big data clusters we are aware of already store redundant copies of sensitive data. This means the data stored on disk must be protected against unauthorized access, and data encryption is the de facto method of protecting sensitive data at rest. In keeping with the requirements above, any encryption solution must scale with the cluster, must not interfere with MapReduce capabilities, and must not store keys on hard drives along with the encrypted data – keys must be handled by a secure key manager.
  • Eavesdropping: We use SSL and TLS encryption to protect network communications. Hadoop offers SSL, but its implementation is limited to client connections. Cloudera offers good integration of TLS; otherwise look for third-party products to close this gap.
  • Name and data node protection: By default Hadoop HTTP web consoles (JobTracker, NameNode, TaskTrackers, and DataNodes) allow access without any form of authentication. The good news is that Hadoop RPC and HTTP web consoles can be configured to require Kerberos authentication. Bi-directional authentication of nodes is built into Hadoop, and available in some other big data environments as well. Hadoop’s model is built on Kerberos to authenticate applications to nodes, nodes to applications, and client requests for MapReduce and similar functions. Care must be taken to secure granting and storage of Kerberos tickets, but this is a very effective method for controlling which nodes and applications can participate in the cluster.
  • Application protection: Big data clusters are built on web-enabled platforms – which means that remote injection, cross-site scripting, buffer overflows, and logic attacks against and through client applications are all possible avenues of attack for access to the cluster. Countermeasures typically include a mixture of secure code development practices (such as input validation and address space randomization), network segmentation, and third-party tools (including Web Application Firewalls, IDS, authentication, and authorization). Some platforms offer built-in features to bolster application protection, such as YARN’s web application proxy service.
  • Archive protection: As backups are largely an intractable problem for big data, we don’t need to worry much about traditional backup/archive security. But just because legitimate users cannot perform conventional backups does not mean an attacker would not create at least a partial backup. We need to secure the management plane to keep unwanted copies of data or data nodes from being propagated. Access controls, and possibly network segregation, are effective countermeasures against attackers trying to gain administrative access, and encryption can help protect data in case other protections are defeated.

In the end, our big data security recommendations boil down to a handful of standard tools which can be effective in setting a secure baseline for big data environments:

  • Use Kerberos: This is an effective method for keeping rogue nodes and applications off your cluster. It can also help protect web console access, making administrative functions harder to compromise. We know Kerberos is a pain to set up, and (re-)validation of new nodes and applications takes work. But without bi-directional trust establishment it is too easy to fool Hadoop into letting malicious applications into the cluster, or into accepting malicious nodes – which can then add, alter, or extract data.
    Kerberos is one of the most effective security controls at your disposal, and it’s built into the Hadoop infrastructure, so use it.
  • File layer encryption: File encryption addresses two attacker methods for circumventing normal application security controls. It protects data if malicious users or administrators gain access to data nodes and directly inspect files, and it also renders stolen files or disk images unreadable. Encryption thus protects against two of the most serious threats. Just as importantly, it meets our requirements for big data security tools – it is transparent to both Hadoop and calling applications, and scales out as the cluster grows. Open source products are available for most Linux systems; commercial products additionally offer external key management, trusted binaries, and full support. This is a cost-effective way to address several
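
To make the Kerberos recommendation a little more concrete, here is a minimal Python sketch that checks a Hadoop core-site.xml for the settings that move RPC and web console authentication off their permissive defaults. The three property names are standard Hadoop configuration keys, but the decision to check exactly these three, and the helper names load_properties and audit, are our assumptions for illustration. A real deployment also needs principals, keytabs, and per-service settings that this sketch does not cover.

```python
import xml.etree.ElementTree as ET

# Expected values for a Kerberos-enabled cluster. Which properties to enforce
# is an assumption made for this sketch, not a complete hardening checklist.
REQUIRED = {
    "hadoop.security.authentication": "kerberos",   # RPC authentication
    "hadoop.security.authorization": "true",        # service-level authorization
    "hadoop.http.authentication.type": "kerberos",  # web console authentication
}


def load_properties(xml_text):
    """Parse Hadoop-style <configuration> XML into a {name: value} dict."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def audit(xml_text, required=REQUIRED):
    """Return (property, expected, actual) for every missing or wrong setting."""
    props = load_properties(xml_text)
    findings = []
    for name, expected in required.items():
        actual = props.get(name)  # None when the property is absent
        if actual is None or actual.lower() != expected:
            findings.append((name, expected, actual))
    return findings


if __name__ == "__main__":
    # Hypothetical configuration with one correct, one wrong, one missing value.
    sample = """
    <configuration>
      <property><name>hadoop.security.authentication</name><value>kerberos</value></property>
      <property><name>hadoop.security.authorization</name><value>false</value></property>
    </configuration>
    """
    for name, expected, actual in audit(sample):
        print(f"{name}: expected {expected!r}, found {actual!r}")
```

A check like this fits naturally into the configuration management already recommended for closing back doors: run it against every node's configuration and treat any finding as a reason to keep that node out of the cluster.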
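
The data protection item above mentions removing or masking sensitive information before it is inserted into the cluster. As a rough sketch of that idea, the Python below replaces selected fields with a keyed hash (HMAC-SHA256) before a record is loaded. This is one simple form of masking, not a description of how any particular masking tool works. The field names, the mask_record helper, and the 16-character token length are hypothetical, and the key is assumed to come from an external key manager rather than from disk.

```python
import hashlib
import hmac
import os

# Hypothetical field names; adjust to the schema being loaded.
SENSITIVE_FIELDS = frozenset({"ssn", "email"})


def mask_value(value, key):
    """Return a keyed, one-way token for value (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation keeps tokens short; it also raises the collision risk.
    return digest[:16]


def mask_record(record, key, sensitive=SENSITIVE_FIELDS):
    """Copy a record, replacing non-empty sensitive fields with tokens."""
    masked = {}
    for field, value in record.items():
        if field in sensitive and value is not None:
            masked[field] = mask_value(str(value), key)
        else:
            masked[field] = value
    return masked


if __name__ == "__main__":
    # Stand-in for a key fetched from a key manager; the real key should
    # never be hard-coded or stored alongside the data.
    key = os.urandom(32)
    row = {"user_id": 42, "email": "alice@example.com", "ssn": "078-05-1120"}
    print(mask_record(row, key))
```

Because the same key always yields the same token, masked fields can still be grouped, joined, or counted in MapReduce jobs without exposing the original values.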


Totally Transparent Research is the embodiment of how we work at Securosis. It’s our core operating philosophy, our research policy, and a specific process. We initially developed it to help maintain objectivity while producing licensed research, but its benefits extend to all aspects of our business.

Going beyond Open Source Research, and a far cry from the traditional syndicated research model, we think it’s the best way to produce independent, objective, quality research.

Here’s how it works:

  • Content is developed ‘live’ on the blog. Primary research is generally released in pieces, as a series of posts, so we can digest and integrate feedback, making the end results much stronger than traditional “ivory tower” research.
  • Comments are enabled for posts. All comments are kept except for spam, personal insults of a clearly inflammatory nature, and completely off-topic content that distracts from the discussion. We welcome comments critical of the work, even if somewhat insulting to the authors. Really.
  • Anyone can comment, and no registration is required. Vendors or consultants with a relevant product or offering must properly identify themselves. While their comments won’t be deleted, the writer/moderator will “call out”, identify, and possibly ridicule vendors who fail to do so.
  • Vendors considering licensing the content are welcome to provide feedback, but it must be posted in the comments - just like everyone else. There is no back channel influence on the research findings or posts.
  • Analysts must reply to comments and defend the research position, or agree to modify the content.
  • At the end of the post series, the analyst compiles the posts into a paper, presentation, or other delivery vehicle. Public comments and input factor into the research where appropriate.
  • If the research is distributed as a paper, significant commenters/contributors are acknowledged in the opening of the report. If they did not post their real names, handles used for comments are listed. Commenters do not retain any rights to the report, but their contributions will be recognized.
  • All primary research will be released under a Creative Commons license. The current license is Non-Commercial, Attribution. The analyst, at their discretion, may add a Derivative Works or Share Alike condition.
  • Securosis primary research does not discuss specific vendors or specific products/offerings, unless used to provide context, contrast or to make a point (which is very very rare).
  • Although quotes from published primary research (and published primary research only) may be used in press releases, said quotes may never mention a specific vendor, even if the vendor is mentioned in the source report. Securosis must approve any quote to appear in any vendor marketing collateral.
  • Final primary research will be posted on the blog with open comments.
  • Research will be updated periodically to reflect market realities, based on the discretion of the primary analyst. Updated research will be dated and given a version number.
  • For research that cannot be developed using this model, such as complex principles or models that are unsuited for a series of blog posts, the content will be chunked up and posted at or before release of the paper to solicit public feedback, and provide an open venue for comments and criticisms.
  • In rare cases Securosis may write papers outside of the primary research agenda, but only if the end result can be non-biased and valuable to the user community to supplement industry-wide efforts or advances. A “Radically Transparent Research” process will be followed in developing these papers, where absolutely all materials are public at all stages of development, including communications (email, call notes).
  • Only the free primary research released on our site can be licensed. We will not accept licensing fees on research we charge users to access.
  • All licensed research will be clearly labeled with the licensees. No licensed research will be released without indicating the sources of licensing fees. Again, there will be no back channel influence. We’re open and transparent about our revenue sources.

In essence, we develop all of our research out in the open, and not only seek public comments, but keep those comments indefinitely as a record of the research creation process. If you believe we are biased or not doing our homework, you can call us out on it and it will be there in the record. Our philosophy involves cracking open the research process, and using our readers to eliminate bias and enhance the quality of the work.

On the back end, here’s how we handle this approach with licensees:

  • Licensees may propose paper topics. The topic may be accepted if it is consistent with the Securosis research agenda and goals, but only if it can be covered without bias and will be valuable to the end user community.
  • Analysts produce research according to their own research agendas, and may offer licensing under the same objectivity requirements.
  • The potential licensee will be provided an outline of our research positions and the potential research product so they can determine if it is likely to meet their objectives.
  • Once the licensee agrees, development of the primary research content begins, following the Totally Transparent Research process as outlined above. At this point, there is no money exchanged.
  • Upon completion of the paper, the licensee will receive a release candidate to determine whether the final result still meets their needs.
  • If the content does not meet their needs, the licensee is not required to pay, and the research will be released without licensing or with alternate licensees.
  • Licensees may host and reuse the content for the length of the license (typically one year). This includes placing the content behind a registration process, posting on white paper networks, or translation into other languages. The research will always be hosted at Securosis for free without registration.

Here is the language we currently place in our research project agreements:

Content will be created independently of LICENSEE with no obligations for payment. Once content is complete, LICENSEE will have a 3 day review period to determine if the content meets corporate objectives. If the content is unsuitable, LICENSEE will not be obligated for any payment and Securosis is free to distribute the whitepaper without branding or with alternate licensees, and will not complete any associated webcasts for the declining LICENSEE. Content licensing, webcasts and payment are contingent on the content being acceptable to LICENSEE. This maintains objectivity while limiting the risk to LICENSEE. Securosis maintains all rights to the content and to include Securosis branding in addition to any licensee branding.

Even this process itself is open to criticism. If you have questions or comments, you can email us or comment on the blog.