Securing Big Data: Architectural Issues

In the previous post we went to some length to define what big data is, because the architectural model is critical to understanding how it poses different security challenges than traditional databases, data warehouses, and massively parallel processing environments. What distinguishes big data environments is their fundamentally different deployment model: highly distributed, elastic data repositories enabled by the Hadoop Distributed File System (HDFS). A distributed file system provides many of the essential characteristics (distributed redundant storage across resources) and enables massively parallel computation. But specific aspects of how each layer of the stack integrates – such as how data nodes communicate with clients and resource management facilities – raise many concerns. For those of you not familiar with the model, the Hadoop architecture pairs a central master/name server with a large number of distributed data nodes.

Architectural Issues

  • Distributed nodes: The idea that “Moving Computation is Cheaper than Moving Data” is key to the model. Data is processed anywhere processing resources are available, enabling massively parallel computation. It also creates a complicated environment with a large attack surface, and it’s harder to verify consistency of security across all nodes in a highly distributed cluster of possibly heterogeneous platforms (a configuration-audit sketch follows this list).
  • ‘Sharded’ data: Data within big data clusters is fluid, with multiple copies moving to and from different nodes to ensure redundancy and resiliency. This automated movement makes it very difficult to know precisely where data is located at any moment in time, or how many copies are available. This runs counter to traditional centralized data security models, where data is wrapped in various protections until it’s used for processing. Big data is replicated in many places and moves as needed. The ‘containerized’ data security model is missing – as are many other relational database concepts.
  • Write once, read many: Big data clusters handle data differently than other data management systems. Rather than the classical “Insert, Update, Select, and Delete” set of basic operations, they focus on write (Insert) and read (Select). Some big data environments don’t offer delete or update capabilities at all. It’s a ‘write once, read many’ model, which is excellent for performance, and a great way to collect a sequence of events and track changes over time – but removing or overwriting sensitive data can be problematic. Data management is optimized for insertion and query performance, at the expense of content manipulation.
  • Inter-node communication: Hadoop and the vast majority of available add-ons that extend core functions don’t communicate securely. TLS and SSL are rarely available; when they are – as with HDFS proxies – they only cover client-to-proxy communication, not proxy-to-node sessions. Cassandra does offer well-engineered TLS, but it’s the exception.
  • Data access/ownership: Role-based access is central to most database security schemes. Relational and quasi-relational platforms include roles, groups, schemas, label security, and various other facilities for limiting user access, based on identity, to an authorized subset of the available data set. Most big data environments offer access limitations at the schema level, but no finer granularity than that. It is possible to mimic these more advanced capabilities in big data environments, but that requires the application designer to build the functions into applications and data storage (a minimal example follows this list).
  • Client interaction: Clients interact with resource managers and nodes. While gateway services for loading data can be defined, clients communicate directly with both the master/name server and individual data nodes. This tradeoff limits our ability to protect nodes from clients, clients from nodes, and even name servers from nodes. Worse, the distribution of self-organizing nodes runs counter to many security tools – gateways, firewalls, monitoring – which require a ‘chokepoint’ deployment architecture. Security gateways assume linear processing, and become clunky or overly restrictive in peer-to-peer clusters.
  • No security: Finally, and perhaps most importantly, big data stacks build in almost no security. As of this writing – aside from service-level authorization, access control integration, and web proxy capabilities from YARN – no facilities are available to protect data stores, applications, or core Hadoop features. All big data installations are built on a web services model, with few or no facilities for countering common web threats (e.g., anything on the OWASP Top Ten), so most installations are vulnerable to well-known attacks.
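Verifying security consistency across distributed nodes, as flagged above, is tedious but scriptable. Below is a minimal sketch – not a vetted tool – which assumes SSH access to each node and the common /etc/hadoop/conf layout; the host names are hypothetical. It pulls core-site.xml from every node and flags drift in two security-relevant properties: hadoop.security.authorization (service-level authorization) and hadoop.rpc.protection (which can require integrity or privacy on RPC traffic).

    #!/usr/bin/env python3
    # Sketch: check that security-relevant Hadoop settings match across nodes.
    # Host names and config path are assumptions for illustration.
    import subprocess
    import xml.etree.ElementTree as ET

    NODES = ["name-node-01", "data-node-01", "data-node-02"]   # hypothetical
    CONF = "/etc/hadoop/conf/core-site.xml"                    # common, not universal
    WATCHED = ("hadoop.security.authorization", "hadoop.rpc.protection")

    def fetch_conf(node):
        """Pull core-site.xml from a node over SSH; return {name: value}."""
        out = subprocess.run(["ssh", node, "cat " + CONF],
                             capture_output=True, text=True, check=True).stdout
        root = ET.fromstring(out)
        return {p.findtext("name"): p.findtext("value")
                for p in root.iter("property")}

    baseline = fetch_conf(NODES[0])
    for node in NODES[1:]:
        conf = fetch_conf(node)
        for prop in WATCHED:
            if conf.get(prop) != baseline.get(prop):
                print("MISMATCH on %s: %s=%r (baseline %r)"
                      % (node, prop, conf.get(prop), baseline.get(prop)))

Even a crude check like this beats assuming every node in an elastic cluster was provisioned identically.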
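And because most platforms stop at schema-level access control, finer-grained restrictions land in application code, as noted under data access/ownership. Here is a minimal sketch of one common pattern – label-based record filtering enforced by the application before results reach the caller. The roles, labels, and records are invented for illustration; a real system would tie roles to an identity store.

    # Sketch: application-enforced, label-based record filtering -- the kind
    # of logic the application designer must supply when the data store has
    # no row- or label-level security. Roles and labels are illustrative.
    ROLE_LABELS = {
        "analyst": {"public", "internal"},
        "auditor": {"public", "internal", "restricted"},
    }

    def visible_records(records, role):
        """Return only records whose sensitivity label the role may read."""
        allowed = ROLE_LABELS.get(role, set())   # unknown role -> sees nothing
        return [r for r in records if r.get("label") in allowed]

    records = [
        {"id": 1, "label": "public",     "payload": "press release"},
        {"id": 2, "label": "restricted", "payload": "salary data"},
    ]
    print(visible_records(records, "analyst"))   # only record 1 is returned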
There are a couple of other issues with securing big data at the architectural level which are not issues with big data specifically, but with security products in general. Any security capability added to a big data environment needs to scale with the data; most ‘bolt-on’ security does not scale this way, and simply cannot keep up. Because security controls are not built into the products, there is a classic impedance mismatch between NoSQL environments and aftermarket security tools. Most security vendors have adapted their existing offerings as well as they can – usually working at data load time – but only a handful of traditional security products can dynamically scale along with a Hadoop cluster.

The next post will go into day-to-day operational security issues.


My Security Fail (and Recovery) for the Week

I remember sitting at lunch with a friend and well-respected member of our security community as I described the architecture we used to protect our mail server. I’m not saying it’s perfect, but this person responded with, “That’s insane – I know people selling 0-days to governments who don’t go that far.” On another occasion I was talking with someone with vastly more network security knowledge and experience than me – someone who once protected a site attacked daily by very knowledgeable predators – and he was… confused as to why I architected the systems the way I did.

Yesterday, it saved my ass.

Now I’m not 100% happy with our current security model for mail. There are aspects that are potentially flawed, but I’ve done my best to reduce the risk while still maintaining the usability we want. I actually have plans to close the last couple of holes I’m not totally comfortable with, but our risk is still relatively low even with them.

Here’s what happened. Back when we first set up our mail infrastructure we hit a problem with our VPN connections. Our mail is on a fully segregated network, and we had some problems with our ISP and IPSec-based VPNs, even though I tried multiple back-end options. Timing-wise we hit a point where I had to move forward, so I set up PPTP and mandated strong passwords (as in, I reviewed or set all of them myself). Only a handful of people have VPN access anyway, and at the time a properly constructed PPTP password was still very secure. That delusion started dying this summer, and was fully buried yesterday thanks to a new cloud-based MS-CHAP cracking tool released by Moxie Marlinspike and David Hulton. The second I saw that article I shut down the VPN.

But here’s how my paranoia saved my ass. As a fan of hyper-segregation, early on I decided to never trust the VPN. I put additional security hardware behind the VPN, with extremely restrictive policies. Connecting via the VPN gave you very little access to anything, with the mail server still completely walled off (a UTM dedicated only to the mail server). Heck, the only two things you could try to hack behind the VPN were the VPN server itself and the UTM… nothing else is directly connected to that network, and all that traffic is monitored and filtered.

When I initially set things up people questioned my choice to put one security appliance behind another like that. But I didn’t want to have to rely on host security for the key assets if someone compromised anyone connected to our VPN. In this case, it worked out for me.

Now I set all this up pre-cloud, but you can build a similar architecture in many VPC or private cloud environments (you need two virtual NICs and a virtual UTM, although even simple firewall rules can go a long way to help – see the sketch below). This is also motivation to finish the next part of my project, which involves yet another UTM and server.
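To make the cloud translation concrete: even plain security group rules can reproduce a rough version of the dedicated-UTM pattern. Below is a minimal boto3 sketch – the VPC ID, VPN subnet, and port list are hypothetical – that exposes a mail server to the VPN segment on mail ports only, with everything else denied by default.

    # Sketch: a restrictive VPC security group standing in for the UTM.
    # VPC ID, CIDR, and ports are hypothetical -- adjust to your environment.
    import boto3

    ec2 = boto3.client("ec2")

    sg = ec2.create_security_group(
        GroupName="mail-server-restricted",
        Description="Mail ports only, and only from the VPN subnet",
        VpcId="vpc-0example",                             # hypothetical VPC
    )

    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[
            {"IpProtocol": "tcp", "FromPort": p, "ToPort": p,
             "IpRanges": [{"CidrIp": "10.20.30.0/24"}]}   # hypothetical VPN subnet
            for p in (25, 587, 993)                       # SMTP, submission, IMAPS
        ],
    )
    # Security groups deny ingress by default, so nothing else on the VPN
    # segment can reach the mail server.

It is not a full UTM – there is no traffic inspection – but it enforces the same core idea: never let a VPN connection imply trust.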


Totally Transparent Research is the embodiment of how we work at Securosis. It’s our core operating philosophy, our research policy, and a specific process. We initially developed it to help maintain objectivity while producing licensed research, but its benefits extend to all aspects of our business.

Going beyond Open Source Research, and a far cry from the traditional syndicated research model, we think it’s the best way to produce independent, objective, quality research.

Here’s how it works:

  • Content is developed ‘live’ on the blog. Primary research is generally released in pieces, as a series of posts, so we can digest and integrate feedback, making the end results much stronger than traditional “ivory tower” research.
  • Comments are enabled for posts. All comments are kept except for spam, personal insults of a clearly inflammatory nature, and completely off-topic content that distracts from the discussion. We welcome comments critical of the work, even if somewhat insulting to the authors. Really.
  • Anyone can comment, and no registration is required. Vendors or consultants with a relevant product or offering must properly identify themselves. While their comments won’t be deleted, the writer/moderator will “call out”, identify, and possibly ridicule vendors who fail to do so.
  • Vendors considering licensing the content are welcome to provide feedback, but it must be posted in the comments - just like everyone else. There is no back channel influence on the research findings or posts.
    Analysts must reply to comments and defend the research position, or agree to modify the content.
  • At the end of the post series, the analyst compiles the posts into a paper, presentation, or other delivery vehicle. Public comments and input factor into the research, where appropriate.
  • If the research is distributed as a paper, significant commenters/contributors are acknowledged in the opening of the report. If they did not post their real names, handles used for comments are listed. Commenters do not retain any rights to the report, but their contributions will be recognized.
  • All primary research will be released under a Creative Commons license. The current license is Non-Commercial, Attribution. The analyst, at their discretion, may add a Derivative Works or Share Alike condition.
  • Securosis primary research does not discuss specific vendors or specific products/offerings, unless used to provide context, contrast, or to make a point (which is very, very rare).
    Although quotes from published primary research (and published primary research only) may be used in press releases, said quotes may never mention a specific vendor, even if the vendor is mentioned in the source report. Securosis must approve any quote to appear in any vendor marketing collateral.
  • Final primary research will be posted on the blog with open comments.
  • Research will be updated periodically to reflect market realities, at the discretion of the primary analyst. Updated research will be dated and given a version number.
  • For research that cannot be developed using this model, such as complex principles or models unsuited for a series of blog posts, the content will be chunked up and posted at or before release of the paper to solicit public feedback, and to provide an open venue for comments and criticisms.
  • In rare cases Securosis may write papers outside of the primary research agenda, but only if the end result can be non-biased and valuable to the user community to supplement industry-wide efforts or advances. A “Radically Transparent Research” process will be followed in developing these papers, where absolutely all materials are public at all stages of development, including communications (email, call notes).
  • Only the free primary research released on our site can be licensed. We will not accept licensing fees on research we charge users to access.
  • All licensed research will be clearly labeled with the licensees. No licensed research will be released without indicating the sources of licensing fees. Again, there will be no back channel influence. We’re open and transparent about our revenue sources.

In essence, we develop all of our research out in the open, and not only seek public comments, but keep those comments indefinitely as a record of the research creation process. If you believe we are biased or not doing our homework, you can call us out on it and it will be there in the record. Our philosophy involves cracking open the research process, and using our readers to eliminate bias and enhance the quality of the work.

On the back end, here’s how we handle this approach with licensees:

  • Licensees may propose paper topics. The topic may be accepted if it is consistent with the Securosis research agenda and goals, but only if it can be covered without bias and will be valuable to the end user community.
  • Analysts produce research according to their own research agendas, and may offer licensing under the same objectivity requirements.
  • The potential licensee will be provided an outline of our research positions and the potential research product so they can determine if it is likely to meet their objectives.
  • Once the licensee agrees, development of the primary research content begins, following the Totally Transparent Research process as outlined above. At this point, there is no money exchanged.
  • Upon completion of the paper, the licensee will receive a release candidate to determine whether the final result still meets their needs.
  • If the content does not meet their needs, the licensee is not required to pay, and the research will be released without licensing or with alternate licensees.
  • Licensees may host and reuse the content for the length of the license (typically one year). This includes placing the content behind a registration process, posting on white paper networks, or translation into other languages. The research will always be hosted at Securosis for free without registration.

Here is the language we currently place in our research project agreements:

Content will be created independently of LICENSEE with no obligations for payment. Once content is complete, LICENSEE will have a 3 day review period to determine if the content meets corporate objectives. If the content is unsuitable, LICENSEE will not be obligated for any payment and Securosis is free to distribute the whitepaper without branding or with alternate licensees, and will not complete any associated webcasts for the declining LICENSEE. Content licensing, webcasts and payment are contingent on the content being acceptable to LICENSEE. This maintains objectivity while limiting the risk to LICENSEE. Securosis maintains all rights to the content and to include Securosis branding in addition to any licensee branding.

Even this process itself is open to criticism. If you have questions or comments, you can email us or comment on the blog.