In the previous post we went to some length to define what big data is – because the architectural model is critical to understanding how it poses different security challenges than traditional databases, data warehouses, and massively parallel processing environments.

What distinguishes big data environments is their fundamentally different deployment model. Highly distributed and elastic data repositories enabled by the Hadoop File System. A distributed file system provides many of the essential characteristics (distributed redundant storage across resources) and enables massively parallel computation. But specific aspects of how each layer of the stack integrates – such as how data nodes communicate with clients and resource management facilities – raise many concerns.

For those of you not familiar with the model, this is the Hadoop architecture.

Architectural Issues

  • Distributed nodes: The idea that “Moving Computation is Cheaper than Moving Data” is key to the model. Data is processed anywhere processing resources are available, enabling massively parallel computation. It also creates a complicated environment with a large attack surface, and it’s harder to verify consistency of security across all nodes in a highly distributed cluster of possibly heterogeneous platforms.
  • ‘Sharded’ data: Data within big data clusters is fluid, with multiple copies moving to and from different nodes to ensure redundancy and resiliency. This automated movement makes it very difficult to know precisely where data is located at any moment in time, or how many copies are available. This runs counter to traditional centralized data security models, when data is wrapped in various protections until it’s used for processing. Big data is replicated in many places and moves as needed. The ‘containerized’ data security model is missing – as are many other relational database concepts.
  • Write once, read many: Big data clusters handle data differently than other data management systems. Rather than the classical “Insert, Update, Select, and Delete” set of basic operations, they focus on write (Insert) and read (Select). Some big data environments don’t offer delete or update capabilities at all. It’s a ‘write once, read many’ model, which is excellent for performance. And it’s a great way to collect a sequence of events and track changes over time, but removing and overwriting sensitive data can be problematic. Data management is optimized for performance of insertion and query processing, at the expense of content manipulation.
  • Inter-node communication: Hadoop and the vast majority of available add-ons that extend core functions don’t communicate securely. TLS and SSL are rarely available. When they are – as with HDFS proxies – they only cover client-to-proxy communication, not proxy-to-node sessions. Cassandra does offer well-engineered TLS, but it’s the exception.
  • Data access/ownership: Role-based access is central to most database security schemes. Relational and quasi-relational platforms include roles, groups, schemas, label security, and various other facilities for limiting user access, based on identity, to an authorized subset of the available data set. Most big data environments offer access limitations at the schema level, but no finer granularity than that. It is possible to mimic these more advanced capabilities in big data environments, but that requires the application designer to build these functions into applications and data storage.
  • Client interaction: Clients interact with resource managers and nodes. While gateway services for loading data can be defined, clients communicate directly with both the master/name server and individual data nodes. The tradeoff this imposes is limited ability to protect nodes from clients, clients from nodes, and even name servers from nodes. Worse, the distribution of self-organizing nodes runs counter to many security tools such as gateways/firewalls/monitoring which require a ‘chokepoint’ deployment architecture. Security gateways assume linear processing, and become clunky or or overly restrictive in peer-to-peer clusters.
  • NoSecurity: Finally, and perhaps most importantly, big data stacks build in almost no security. As of this writing – aside from service-level authorization, access control integration, and web proxy capabilities from YARN – no facilities are available to protect data stores, applications, or core Hadoop features. All big data installations are built upon a web services model, with few or no facilities for countering common web threats, (i.e. anything on the OWASP Top Ten) so most big data installations are vulnerable to well known attacks.

There are a couple other issues with securing big data on an architectural level, which are not issues specifically with big data, but with security products in general. To add security capabilities into a big data environment, they need to scale with the data. Most ‘bolt-on’ security does not scale this way, and simply cannot keep up. Because the security controls are not built into the products, there is a classic impedance mismatch between NoSQL environments and aftermarket security tools. Most security vendors have adapted their existing offerings as well as they can – usually working at data load time – but only a handful of traditional security products can dynamically scale along with a Hadoop cluster.

The next post will go into day to day operational security issues.