Securing Hadoop: Enterprise Security For NoSQL
Hadoop is now enterprise software. There, I said it. I know lots of readers in the IT space still look at Hadoop as an interloper, or worse, part of the rogue IT problem. But more than half of the enterprises we spoke with are running Hadoop somewhere within the organization, and a small percentage are running Mongo, Cassandra or Riak in parallel with Hadoop for specific projects. Discussions about what 'big data' is, whether it is a viable technology, or even whether open source can be considered 'enterprise software' are long past. What began as proof-of-concept projects has matured into critical application services. And with that change, IT teams are now tasked with getting a handle on Hadoop security, responding with questions like "How do I secure Hadoop?" and "How do I map existing data governance policies to NoSQL databases?"

Security vendors will tell you that attacks on corporate IT systems and data breaches are both prevalent, so with gobs of data under management, Hadoop provides a tempting target for hackers. All of which is true, but as of today there have not been major data breaches where Hadoop played a part in the story, so this sort of FUD carries little weight with IT operations. Make no mistake, though: security is a requirement. As sensitive information, customer data, medical histories, intellectual property and just about every other type of data used in enterprise computing now commonly lands in Hadoop clusters, the 'C' word (i.e., Compliance) has become part of IT's daily vocabulary. One of the big changes we have seen over the last couple of years is Hadoop becoming business-critical infrastructure; another, directly caused by the first, is that IT is being tasked with bringing existing clusters in line with enterprise compliance requirements. This is somewhat challenging, as a fresh install of Hadoop suffers the same weak points traditional IT systems do, so it takes work to get security set up and the required reports produced. For clusters that are already up and running, teams need to choose technologies and a deployment roadmap that do not upset ongoing operations. On top of that, the in-house tools you use to secure things like SAP, or the SIEM infrastructure you use for compliance reporting, may not be suitable for NoSQL.

Building security into the cluster

The number of security solutions that are compatible with – if not outright built for – Hadoop is the biggest change since 2012. All of the major security pillars – authentication, authorization, encryption, key management and configuration management – are covered, and the tools are viable. Most of the advancements have come from the firms that provide enterprise distributions of Hadoop. They have built, and in many cases contributed back to the open source community, security tools that accomplish the basics of cluster security. When you look at the threat-response models introduced in the previous two posts, every compensating security control is now available. Better still, these vendors have done a lot of the integration legwork for services like Kerberos, taking much of the pain out of deployment. Here are some of the components and functions that were not available – or not viable – in 2012.

LDAP/AD Integration – Technically, AD and LDAP integration were available in 2012, but these services have since advanced and are easier to integrate than before. In fact, this area has received the most attention, and integration is as simple as a setup wizard on some of the commercial platforms. The benefits are obvious: firms can leverage existing access and authorization schemes and defer user and role management to external sources.
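Before moving on to the newer components, it may help to see what that directory integration looks like at the configuration level. Below is a minimal sketch, written as a small Python script that emits the relevant core-site.xml properties, of pointing Hadoop's group lookups at Active Directory. The property names follow Hadoop's LdapGroupsMapping documentation; the directory host, bind account and search base are hypothetical placeholders, and the commercial platforms mentioned above generate equivalent entries through their setup wizards.

```python
# Sketch: generate core-site.xml entries that point Hadoop group lookups at
# Active Directory / LDAP. Property names follow Hadoop's LdapGroupsMapping;
# the hostname, bind DN and search base below are hypothetical placeholders.
from xml.sax.saxutils import escape

ldap_properties = {
    "hadoop.security.group.mapping":
        "org.apache.hadoop.security.LdapGroupsMapping",
    "hadoop.security.group.mapping.ldap.url": "ldaps://ad.example.com:636",
    "hadoop.security.group.mapping.ldap.bind.user":
        "CN=hadoop-svc,OU=Service Accounts,DC=example,DC=com",
    "hadoop.security.group.mapping.ldap.bind.password.file":
        "/etc/hadoop/conf/ldap-bind.password",
    "hadoop.security.group.mapping.ldap.base": "DC=example,DC=com",
    "hadoop.security.group.mapping.ldap.search.filter.user":
        "(&(objectClass=user)(sAMAccountName={0}))",
    "hadoop.security.group.mapping.ldap.search.filter.group":
        "(objectClass=group)",
}

def to_core_site_xml(props: dict) -> str:
    """Render the properties as <property> blocks for core-site.xml."""
    blocks = [
        "  <property>\n    <name>{}</name>\n    <value>{}</value>\n  </property>"
        .format(escape(name), escape(value))
        for name, value in props.items()
    ]
    return "<configuration>\n{}\n</configuration>".format("\n".join(blocks))

if __name__ == "__main__":
    print(to_core_site_xml(ldap_properties))
```

The specific values matter less than the pattern: users and groups stay in the directory, and the cluster defers to it for identity.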
Apache Ranger – Ranger is one of the more interesting technologies to become available, and it closes the biggest gap: per-module security policies and configuration management. It gives cluster administrators a tool to set policies for different modules such as Hive, Kafka, HBase or YARN. What's more, those policies are set in the context of the module: file and directory policies for HDFS, SQL policies for Hive, and so on. This helps with data governance and compliance, as administrators can define how a cluster should be used, and how data is to be accessed, in ways that simple role-based access controls cannot (see the policy sketch after this list).

Apache Knox – You can think of Knox, in its simplest form, as a Hadoop firewall. More correctly, it is an API gateway. It handles HTTP and RESTful requests, enforcing authentication and usage policies on inbound requests and blocking everything else. Knox can serve as a 'virtual moat' around a cluster, or be combined with network segmentation to further reduce the network attack surface (see the gateway example after this list).

Apache Atlas – Atlas is a proposed open source governance framework for Hadoop. It allows annotation of files and tables, establishes relationships between data sets, and can even import metadata from other sources. These features are helpful for reporting, data discovery and access control. Atlas is new, and we expect it to mature significantly over the coming years, but for now it offers some valuable tools for basic data governance and reporting.

Apache Ambari – Ambari is a facility for provisioning and managing Hadoop clusters. It helps administrators set configurations and propagate changes across the entire cluster. During our interviews we only spoke with two firms using this capability, but we received positive feedback from both. We also spoke with a handful of companies who had written their own configuration and launch scripts, with pre-deployment validation checks, usually for cloud and virtual machine deployments. This latter approach was more time-consuming to create but offered greater capability, with each function orchestrated within IT operational processes (e.g., continuous deployment, failure recovery, DevOps). For most, Ambari's ability to get a cluster up and running quickly and to provide consistent cluster management is a big win and a suitable choice.

Monitoring – Hive, PIQL, Impala, Spark SQL and similar modules offer SQL or pseudo-SQL syntax. This means the activity monitoring, dynamic masking, redaction and tokenization technologies originally developed for relational databases can be applied to these platforms as well.
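As promised above, here is a sketch of what a module-aware Ranger policy looks like when provisioned programmatically rather than through the admin console. The endpoint and payload shape follow Ranger's public v2 policy REST API; the admin URL, Hive service name, database and group names are hypothetical, and a production script would use a dedicated service account and proper certificate validation.

```python
# Sketch: create a Hive policy through Apache Ranger's public REST API.
# Endpoint and payload shape follow Ranger's documented v2 policy API; the
# admin URL, service name, database and group below are hypothetical.
import requests

RANGER_ADMIN = "https://ranger.example.com:6182"  # Ranger Admin API

policy = {
    "service": "hadoopdev_hive",        # the Ranger service covering this Hive instance
    "name": "sales_read_only_analysts",
    "isEnabled": True,
    "resources": {
        "database": {"values": ["sales"]},
        "table": {"values": ["*"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {
            "groups": ["analysts"],     # a directory group, courtesy of LDAP/AD integration
            "accesses": [{"type": "select", "isAllowed": True}],
        }
    ],
}

resp = requests.post(
    f"{RANGER_ADMIN}/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "change-me"),        # use a dedicated service account in practice
    verify=True,                        # Ranger's TLS chain, or a local CA bundle path
)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```

The same pattern applies to HDFS paths, HBase tables or Kafka topics: only the resources block changes, which is what makes policies repeatable and auditable across modules.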
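And here is the gateway example referenced earlier: a client listing an HDFS directory through Knox rather than talking to the NameNode directly. The URL pattern follows Knox's standard gateway/topology routing for WebHDFS; the gateway host, topology name ('default'), path and credentials are placeholders.

```python
# Sketch: list an HDFS directory through Apache Knox instead of calling the
# NameNode's WebHDFS endpoint directly. Knox authenticates the caller (HTTP
# Basic against LDAP in the default topology) and proxies the request into
# the cluster. Gateway host, topology, path and credentials are hypothetical.
import requests

KNOX_GATEWAY = "https://knox.example.com:8443/gateway/default"

resp = requests.get(
    f"{KNOX_GATEWAY}/webhdfs/v1/data/landing",
    params={"op": "LISTSTATUS"},
    auth=("analyst1", "change-me"),   # checked by Knox, not by the NameNode
    verify=True,                      # Knox's TLS certificate
)
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"])
```

Because clients only ever see the gateway address, Knox pairs naturally with network segmentation: nothing else inside the cluster needs to be reachable from outside its subnet.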