Securing Big Data: Architectural Issues

In the previous post we went to some length to define what big data is, because the architectural model is critical to understanding how it poses different security challenges than traditional databases, data warehouses, and massively parallel processing environments. What distinguishes big data environments is their fundamentally different deployment model: highly distributed, elastic data repositories enabled by the Hadoop Distributed File System (HDFS). A distributed file system provides many of the essential characteristics (distributed redundant storage across resources) and enables massively parallel computation. But specific aspects of how each layer of the stack integrates – such as how data nodes communicate with clients and resource management facilities – raise many concerns. For those of you not familiar with the model, the Hadoop architecture pairs a central master/name server with a large number of distributed data nodes.

Architectural Issues

  • Distributed nodes: The idea that “Moving Computation is Cheaper than Moving Data” is key to the model. Data is processed anywhere processing resources are available, enabling massively parallel computation. It also creates a complicated environment with a large attack surface, and it’s harder to verify consistency of security across all nodes in a highly distributed cluster of possibly heterogeneous platforms (a configuration-audit sketch follows this list).
  • ‘Sharded’ data: Data within big data clusters is fluid, with multiple copies moving to and from different nodes to ensure redundancy and resiliency. This automated movement makes it very difficult to know precisely where data is located at any moment in time, or how many copies are available. This runs counter to traditional centralized data security models, where data is wrapped in various protections until it’s used for processing. Big data is replicated in many places and moves as needed. The ‘containerized’ data security model is missing – as are many other relational database concepts.
  • Write once, read many: Big data clusters handle data differently than other data management systems. Rather than the classical “Insert, Update, Select, and Delete” set of basic operations, they focus on write (Insert) and read (Select). Some big data environments don’t offer delete or update capabilities at all. It’s a ‘write once, read many’ model, which is excellent for performance, and a great way to collect a sequence of events and track changes over time – but removing or overwriting sensitive data can be problematic. Data management is optimized for insertion and query performance, at the expense of content manipulation.
  • Inter-node communication: Hadoop and the vast majority of available add-ons that extend core functions don’t communicate securely. TLS and SSL are rarely available; when they are – as with HDFS proxies – they only cover client-to-proxy communication, not proxy-to-node sessions. Cassandra does offer well-engineered TLS, but it’s the exception.
  • Data access/ownership: Role-based access is central to most database security schemes. Relational and quasi-relational platforms include roles, groups, schemas, label security, and various other facilities for limiting user access, based on identity, to an authorized subset of the available data set. Most big data environments offer access limitations at the schema level, but no finer granularity than that. It is possible to mimic these more advanced capabilities in big data environments, but that requires the application designer to build the functions into applications and data storage (a minimal example follows this list).
  • Client interaction: Clients interact with resource managers and nodes. While gateway services for loading data can be defined, clients communicate directly with both the master/name server and individual data nodes. This tradeoff limits our ability to protect nodes from clients, clients from nodes, and even name servers from nodes. Worse, the distribution of self-organizing nodes runs counter to many security tools – gateways, firewalls, monitoring – which require a ‘chokepoint’ deployment architecture. Security gateways assume linear processing, and become clunky or overly restrictive in peer-to-peer clusters.
  • No security: Finally, and perhaps most importantly, big data stacks build in almost no security. As of this writing – aside from service-level authorization, access control integration, and web proxy capabilities from YARN – no facilities are available to protect data stores, applications, or core Hadoop features. All big data installations are built on a web services model, with few or no facilities for countering common web threats (e.g., anything on the OWASP Top Ten), so most installations are vulnerable to well-known attacks.
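Verifying security consistency across distributed nodes, as flagged above, is tedious but scriptable. Below is a minimal sketch – not a vetted tool – which assumes SSH access to each node and the common /etc/hadoop/conf layout; the host names are hypothetical. It pulls core-site.xml from every node and flags drift in two security-relevant properties: hadoop.security.authorization (service-level authorization) and hadoop.rpc.protection (which can require integrity or privacy on RPC traffic).

    #!/usr/bin/env python3
    # Sketch: check that security-relevant Hadoop settings match across nodes.
    # Host names and config path are assumptions for illustration.
    import subprocess
    import xml.etree.ElementTree as ET

    NODES = ["name-node-01", "data-node-01", "data-node-02"]   # hypothetical
    CONF = "/etc/hadoop/conf/core-site.xml"                    # common, not universal
    WATCHED = ("hadoop.security.authorization", "hadoop.rpc.protection")

    def fetch_conf(node):
        """Pull core-site.xml from a node over SSH; return {name: value}."""
        out = subprocess.run(["ssh", node, "cat " + CONF],
                             capture_output=True, text=True, check=True).stdout
        root = ET.fromstring(out)
        return {p.findtext("name"): p.findtext("value")
                for p in root.iter("property")}

    baseline = fetch_conf(NODES[0])
    for node in NODES[1:]:
        conf = fetch_conf(node)
        for prop in WATCHED:
            if conf.get(prop) != baseline.get(prop):
                print("MISMATCH on %s: %s=%r (baseline %r)"
                      % (node, prop, conf.get(prop), baseline.get(prop)))

Even a crude check like this beats assuming every node in an elastic cluster was provisioned identically.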
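And because most platforms stop at schema-level access control, finer-grained restrictions land in application code, as noted under data access/ownership. Here is a minimal sketch of one common pattern – label-based record filtering enforced by the application before results reach the caller. The roles, labels, and records are invented for illustration; a real system would tie roles to an identity store.

    # Sketch: application-enforced, label-based record filtering -- the kind
    # of logic the application designer must supply when the data store has
    # no row- or label-level security. Roles and labels are illustrative.
    ROLE_LABELS = {
        "analyst": {"public", "internal"},
        "auditor": {"public", "internal", "restricted"},
    }

    def visible_records(records, role):
        """Return only records whose sensitivity label the role may read."""
        allowed = ROLE_LABELS.get(role, set())   # unknown role -> sees nothing
        return [r for r in records if r.get("label") in allowed]

    records = [
        {"id": 1, "label": "public",     "payload": "press release"},
        {"id": 2, "label": "restricted", "payload": "salary data"},
    ]
    print(visible_records(records, "analyst"))   # only record 1 is returned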
There are a couple of other issues with securing big data at the architectural level which are not issues with big data specifically, but with security products in general. Any security capability added to a big data environment needs to scale with the data; most ‘bolt-on’ security does not scale this way, and simply cannot keep up. Because security controls are not built into the products, there is a classic impedance mismatch between NoSQL environments and aftermarket security tools. Most security vendors have adapted their existing offerings as well as they can – usually working at data load time – but only a handful of traditional security products can dynamically scale along with a Hadoop cluster.

The next post will go into day-to-day operational security issues.


My Security Fail (and Recovery) for the Week

I remember sitting at lunch with a friend and well-respected member of our security community as I described the architecture we used to protect our mail server. I’m not saying it’s perfect, but this person responded with, “That’s insane – I know people selling 0-days to governments who don’t go that far.” On another occasion I was talking with someone with vastly more network security knowledge and experience than me – someone who once protected a site attacked daily by very knowledgeable predators – and he was… confused as to why I architected the systems the way I did.

Yesterday, it saved my ass.

Now I’m not 100% happy with our current security model for mail. There are aspects that are potentially flawed, but I’ve done my best to reduce the risk while still maintaining the usability we want. I actually have plans to close the last couple of holes I’m not totally comfortable with, but our risk is still relatively low even with them.

Here’s what happened. Back when we first set up our mail infrastructure we hit a problem with our VPN connections. Our mail is on a fully segregated network, and we had some problems with our ISP and IPSec-based VPNs, even though I tried multiple back-end options. Timing-wise we hit a point where I had to move forward, so I set up PPTP and mandated strong passwords (as in, I reviewed or set all of them myself). Only a handful of people have VPN access anyway, and at the time a properly constructed PPTP password was still very secure. That delusion started dying this summer, and was fully buried yesterday thanks to a new cloud-based MS-CHAP cracking tool released by Moxie Marlinspike and David Hulton. The second I saw that article I shut down the VPN.

But here’s how my paranoia saved my ass. As a fan of hyper-segregation, early on I decided to never trust the VPN. I put additional security hardware behind the VPN, with extremely restrictive policies. Connecting via the VPN gave you very little access to anything, with the mail server still completely walled off (a UTM dedicated only to the mail server). Heck, the only two things you could try to hack behind the VPN were the VPN server itself and the UTM… nothing else is directly connected to that network, and all that traffic is monitored and filtered.

When I initially set things up people questioned my choice to put one security appliance behind another like that. But I didn’t want to have to rely on host security for the key assets if someone compromised anyone connected to our VPN. In this case, it worked out for me.

Now I set all this up pre-cloud, but you can build a similar architecture in many VPC or private cloud environments (you need two virtual NICs and a virtual UTM, although even simple firewall rules can go a long way to help – see the sketch below). This is also motivation to finish the next part of my project, which involves yet another UTM and server.
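To make the cloud translation concrete: even plain security group rules can reproduce a rough version of the dedicated-UTM pattern. Below is a minimal boto3 sketch – the VPC ID, VPN subnet, and port list are hypothetical – that exposes a mail server to the VPN segment on mail ports only, with everything else denied by default.

    # Sketch: a restrictive VPC security group standing in for the UTM.
    # VPC ID, CIDR, and ports are hypothetical -- adjust to your environment.
    import boto3

    ec2 = boto3.client("ec2")

    sg = ec2.create_security_group(
        GroupName="mail-server-restricted",
        Description="Mail ports only, and only from the VPN subnet",
        VpcId="vpc-0example",                             # hypothetical VPC
    )

    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[
            {"IpProtocol": "tcp", "FromPort": p, "ToPort": p,
             "IpRanges": [{"CidrIp": "10.20.30.0/24"}]}   # hypothetical VPN subnet
            for p in (25, 587, 993)                       # SMTP, submission, IMAPS
        ],
    )
    # Security groups deny ingress by default, so nothing else on the VPN
    # segment can reach the mail server.

It is not a full UTM – there is no traffic inspection – but it enforces the same core idea: never let a VPN connection imply trust.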


Totally Transparent Research is the embodiment of how we work at Securosis. It’s our core operating philosophy, our research policy, and a specific process. We initially developed it to help maintain objectivity while producing licensed research, but its benefits extend to all aspects of our business.

Going beyond Open Source Research, and a far cry from the traditional syndicated research model, we think it’s the best way to produce independent, objective, quality research.

Here’s how it works:

  • Content is developed ‘live’ on the blog. Primary research is generally released in pieces, as a series of posts, so we can digest and integrate feedback, making the end results much stronger than traditional “ivory tower” research.
  • Comments are enabled for posts. All comments are kept except for spam, personal insults of a clearly inflammatory nature, and completely off-topic content that distracts from the discussion. We welcome comments critical of the work, even if somewhat insulting to the authors. Really.
  • Anyone can comment, and no registration is required. Vendors or consultants with a relevant product or offering must properly identify themselves. While their comments won’t be deleted, the writer/moderator will “call out”, identify, and possibly ridicule vendors who fail to do so.
  • Vendors considering licensing the content are welcome to provide feedback, but it must be posted in the comments - just like everyone else. There is no back channel influence on the research findings or posts.
    Analysts must reply to comments and defend the research position, or agree to modify the content.
  • At the end of the post series, the analyst compiles the posts into a paper, presentation, or other delivery vehicle. Public comments and input factor into the research, where appropriate.
  • If the research is distributed as a paper, significant commenters/contributors are acknowledged in the opening of the report. If they did not post their real names, handles used for comments are listed. Commenters do not retain any rights to the report, but their contributions will be recognized.
  • All primary research will be released under a Creative Commons license. The current license is Non-Commercial, Attribution. The analyst, at their discretion, may add a Derivative Works or Share Alike condition.
  • Securosis primary research does not discuss specific vendors or specific products/offerings, unless used to provide context, contrast, or to make a point (which is very, very rare).
    Although quotes from published primary research (and published primary research only) may be used in press releases, said quotes may never mention a specific vendor, even if the vendor is mentioned in the source report. Securosis must approve any quote to appear in any vendor marketing collateral.
  • Final primary research will be posted on the blog with open comments.
  • Research will be updated periodically to reflect market realities, at the discretion of the primary analyst. Updated research will be dated and given a version number.
  • For research that cannot be developed using this model, such as complex principles or models unsuited for a series of blog posts, the content will be chunked up and posted at or before release of the paper to solicit public feedback, and to provide an open venue for comments and criticisms.
  • In rare cases Securosis may write papers outside of the primary research agenda, but only if the end result can be non-biased and valuable to the user community to supplement industry-wide efforts or advances. A “Radically Transparent Research” process will be followed in developing these papers, where absolutely all materials are public at all stages of development, including communications (email, call notes).
  • Only the free primary research released on our site can be licensed. We will not accept licensing fees on research we charge users to access.
  • All licensed research will be clearly labeled with the licensees. No licensed research will be released without indicating the sources of licensing fees. Again, there will be no back channel influence. We’re open and transparent about our revenue sources.

In essence, we develop all of our research out in the open, and not only seek public comments, but keep those comments indefinitely as a record of the research creation process. If you believe we are biased or not doing our homework, you can call us out on it and it will be there in the record. Our philosophy involves cracking open the research process, and using our readers to eliminate bias and enhance the quality of the work.

On the back end, here’s how we handle this approach with licensees:

  • Licensees may propose paper topics. The topic may be accepted if it is consistent with the Securosis research agenda and goals, but only if it can be covered without bias and will be valuable to the end user community.
  • Analysts produce research according to their own research agendas, and may offer licensing under the same objectivity requirements.
  • The potential licensee will be provided an outline of our research positions and the potential research product so they can determine if it is likely to meet their objectives.
  • Once the licensee agrees, development of the primary research content begins, following the Totally Transparent Research process as outlined above. At this point, there is no money exchanged.
  • Upon completion of the paper, the licensee will receive a release candidate to determine whether the final result still meets their needs.
  • If the content does not meet their needs, the licensee is not required to pay, and the research will be released without licensing or with alternate licensees.
  • Licensees may host and reuse the content for the length of the license (typically one year). This includes placing the content behind a registration process, posting on white paper networks, or translation into other languages. The research will always be hosted at Securosis for free without registration.

Here is the language we currently place in our research project agreements:

Content will be created independently of LICENSEE with no obligations for payment. Once content is complete, LICENSEE will have a 3 day review period to determine if the content meets corporate objectives. If the content is unsuitable, LICENSEE will not be obligated for any payment and Securosis is free to distribute the whitepaper without branding or with alternate licensees, and will not complete any associated webcasts for the declining LICENSEE. Content licensing, webcasts and payment are contingent on the content being acceptable to LICENSEE. This maintains objectivity while limiting the risk to LICENSEE. Securosis maintains all rights to the content and to include Securosis branding in addition to any licensee branding.

Even this process itself is open to criticism. If you have questions or comments, you can email us or comment on the blog.