Securing Big Data: Architectural Issues

By Adrian Lane

In the previous post we went to some length to define what big data is – because the architectural model is critical to understanding how it poses different security challenges than traditional databases, data warehouses, and massively parallel processing environments.

What distinguishes big data environments is their fundamentally different deployment model: highly distributed and elastic data repositories enabled by the Hadoop Distributed File System (HDFS). A distributed file system provides many of the essential characteristics (distributed redundant storage across resources) and enables massively parallel computation. But specific aspects of how each layer of the stack integrates – such as how data nodes communicate with clients and resource management facilities – raise many security concerns.

For those of you not familiar with the model, this is the Hadoop architecture.

[Figure: Hadoop Architecture]

Architectural Issues

  • Distributed nodes: The idea that “Moving Computation is Cheaper than Moving Data” is key to the model. Data is processed anywhere processing resources are available, enabling massively parallel computation. It also creates a complicated environment with a large attack surface, and it’s harder to verify consistency of security across all nodes in a highly distributed cluster of possibly heterogeneous platforms.
  • ‘Sharded’ data: Data within big data clusters is fluid, with multiple copies moving to and from different nodes to ensure redundancy and resiliency. This automated movement makes it very difficult to know precisely where data is located at any moment in time, or how many copies are available. This runs counter to traditional centralized data security models, where data is wrapped in various protections until it’s used for processing. Big data is replicated in many places and moves as needed. The ‘containerized’ data security model is missing – as are many other relational database concepts.
  • Write once, read many: Big data clusters handle data differently than other data management systems. Rather than the classical “Insert, Update, Select, and Delete” set of basic operations, they focus on write (Insert) and read (Select). Some big data environments don’t offer delete or update capabilities at all. It’s a ‘write once, read many’ model, which is excellent for performance. It’s also a great way to collect a sequence of events and track changes over time, but removing and overwriting sensitive data can be problematic. Data management is optimized for performance of insertion and query processing, at the expense of content manipulation.
  • Inter-node communication: Hadoop and the vast majority of available add-ons that extend core functions don’t communicate securely. TLS and SSL are rarely available. When they are – as with HDFS proxies – they only cover client-to-proxy communication, not proxy-to-node sessions. Cassandra does offer well-engineered TLS, but it’s the exception.
  • Data access/ownership: Role-based access is central to most database security schemes. Relational and quasi-relational platforms include roles, groups, schemas, label security, and various other facilities for limiting user access, based on identity, to an authorized subset of the available data set. Most big data environments offer access limitations at the schema level, but no finer granularity than that. It is possible to mimic these more advanced capabilities in big data environments, but that requires the application designer to build these functions into applications and data storage.
  • Client interaction: Clients interact with resource managers and nodes. While gateway services for loading data can be defined, clients communicate directly with both the master/name server and individual data nodes. The tradeoff this imposes is limited ability to protect nodes from clients, clients from nodes, and even name servers from nodes. Worse, the distribution of self-organizing nodes runs counter to many security tools, such as gateways, firewalls, and monitoring, which require a ‘chokepoint’ deployment architecture. Security gateways assume linear processing, and become clunky or overly restrictive in peer-to-peer clusters.
  • NoSecurity: Finally, and perhaps most importantly, big data stacks build in almost no security. As of this writing – aside from service-level authorization, access control integration, and web proxy capabilities from YARN – no facilities are available to protect data stores, applications, or core Hadoop features. All big data installations are built upon a web services model, with few or no facilities for countering common web threats (i.e. anything on the OWASP Top Ten), so most big data installations are vulnerable to well-known attacks.
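
The data access/ownership point above says finer-grained controls have to be built into the application and its data storage. A minimal sketch of what that might look like, assuming each record carries a hypothetical `label` field and the application keeps its own role-to-label mapping (`ROLE_LABELS`, `Record`, and `readable_records` are illustrative names, not part of Hadoop or any add-on):

```python
from dataclasses import dataclass

# Hypothetical mapping maintained by the application itself,
# not by the big data platform.
ROLE_LABELS = {
    "analyst": {"public"},
    "clinician": {"public", "phi"},
}


@dataclass(frozen=True)
class Record:
    key: str
    label: str  # sensitivity label stored alongside the data
    value: str


def readable_records(role, records):
    """Return only the records whose label the role is cleared for."""
    allowed = ROLE_LABELS.get(role, set())  # unknown roles see nothing
    return [r for r in records if r.label in allowed]


RECORDS = [
    Record("r1", "public", "clinic opening hours"),
    Record("r2", "phi", "patient lab result"),
]
```

The filter only helps if every read path goes through it; a client that talks directly to a data node bypasses it entirely, which is the client interaction problem described above.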

There are a couple of other issues with securing big data at the architectural level, which are not specific to big data but apply to security products in general. Security capabilities added to a big data environment need to scale with the data. Most ‘bolt-on’ security does not scale this way, and simply cannot keep up. Because the security controls are not built into the products, there is a classic impedance mismatch between NoSQL environments and aftermarket security tools. Most security vendors have adapted their existing offerings as well as they can – usually working at data load time – but only a handful of traditional security products can dynamically scale along with a Hadoop cluster.

The next post will go into day-to-day operational security issues.

I agree wholeheartedly with Bert. Very well stated and informative blog with several great points throughout. In particular, you note that big data is ill-equipped to counter any of the web threats on the 2010 OWASP Top Ten.

Guarding against these threats requires that sensitive data such as passwords, credit cards, health records, and personal information be encrypted everywhere it is stored long term. It also calls for a standard encryption algorithm, strong access controls that prevent unauthorized access, and good, smart key management. Gazzang, whom I work for, has several big data customers who recognize that sensitive data left unencrypted represents a significant threat to their business.

You can go a long way toward protecting big data at rest by transparently encrypting at the datanode level. Transparent encryption means admins don’t need to modify their existing Hadoop cluster, and the obfuscation happens after the reads and writes. Access to the data, meanwhile, should be controlled on datanode servers. It’s also important to ensure that your keys are stored apart from the sensitive data you’ve encrypted.

Your blog also mentions that “removing and overwriting sensitive data can be problematic.” One way to effectively kill sensitive data that you no longer need is to encrypt it and revoke the key. Of course, before performing this task, a user needs to be sure this data will never be needed again, because once the key is revoked, your data is as good as gone.
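
The encrypt-then-revoke idea can be sketched roughly as follows. This is an illustration under loud assumptions: `KeyStore` is a hypothetical in-memory stand-in for a key manager kept apart from the data nodes, and the SHA-256 counter keystream is a toy placeholder used only to keep the example self-contained. A real deployment would use a vetted authenticated cipher (AES-GCM, for instance) from an established library.

```python
import hashlib
import secrets


class KeyStore:
    """Hypothetical key store, kept apart from the encrypted data."""

    def __init__(self):
        self._keys = {}

    def create(self, key_id):
        self._keys[key_id] = secrets.token_bytes(32)

    def get(self, key_id):
        return self._keys[key_id]  # raises KeyError once the key is revoked

    def revoke(self, key_id):
        del self._keys[key_id]


def _keystream(key, nonce, length):
    # Toy keystream for illustration only; NOT a production cipher.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(store, key_id, plaintext):
    nonce = secrets.token_bytes(16)  # fresh nonce per message
    stream = _keystream(store.get(key_id), nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt(store, key_id, blob):
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(store.get(key_id), nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))
```

After `store.revoke(key_id)`, every replica of the ciphertext becomes unreadable at once, however many copies the cluster has scattered across nodes. That only holds if the key itself was never copied elsewhere.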

Looking forward to the next column in this series.

By David Tishgart

Adrian, great analysis. I am really enjoying this, in the sense that I am learning a lot, not that it turns out that Big Data systems are so open. And it is important to remember that these systems may easily include highly sensitive data, particularly since they often include client data or things like personally identifiable health care data.

By Bert Latamore
