Securing Big Data: Recommendations and Open Issues
Our previous two posts outlined several security issues inherent to big data architecture, and operational security issues common to big data clusters. With those in mind, how can one go about securing a big data cluster? What tools and techniques should you employ? Before we can answer those questions we need some ground rules, because not all ‘solutions’ are created equally. Many vendors claim to offer big data security, but they are really just selling the same products they offer for other back office systems and relational databases. Those products might work in a big data cluster, but only by compromising the big data model to make it fit the restricted envelope of what they can support. Their constraints on scalability, coverage, management, and deployment are all at odds with the essential big data features we have discussed. Any security product for big data needs a few characteristics: It must not compromise the basic functionality of the cluster It should scale in the same manner as the cluster It should not compromise the essential characteristics of big data It should address – or at least mitigate – a security threat to big data environments or data stored within the cluster. So how can we secure big data repositories today? The following is a list of common challenges, with security measures to address them: User access: We use identity and access management systems to control users, including both regular and administrator access. Separation of duties: We use a combination of authentication, authorization, and encryption to provide separation of duties between administrative personnel. We use application space, namespace, or schemata to logically segregate user access to a subset of the data under management. Indirect access: To close “back doors” – access to data outside permitted interfaces – we use a combination of encryption, access control, and configuration management. User activity: We use logging and user activity monitoring (where available) to alert on suspicious activity and enable forensic analysis. Data protection: Removal of sensitive information prior to insertion and data masking (via tools) are common strategies for reducing risk. But the majority of big data clusters we are aware of already store redundant copies of sensitive data. This means the data stored on disk must be protected against unauthorized access, and data encryption is the de facto method of protecting sensitive data at rest. In keeping with the requirements above, any encryption solution must scale with the cluster, must not interfere with MapReduce capabilities, and must not store keys on hard drives along with the encrypted data – keys must be handled by a secure key manager. Eavesdropping: We use SSL and TLS encryption to protect network communications. Hadoop offers SSL, but its implementation is limited to client connections. Cloudera offers good integration of TLS; otherwise look for third party products to close this gap. Name and data node protection: By default Hadoop HTTP web consoles (JobTracker, NameNode, TaskTrackers, and DataNodes) allow access without any form of authentication. The good news is that Hadoop RPC and HTTP web consoles can be configured to require Kerberos authentication. Bi-directional authentication of nodes is built into Hadoop, and available in some other big data environments as well. Hadoop’s model is built on Kerberos to authenticate applications to nodes, nodes to applications, and client requests for MapReduce and similar functions. Care must be taken to secure granting and storage of Kerberos tickets, but this is a very effective method for controlling what nodes and applications can participate on the cluster. Application protection: Big data clusters are built on web-enabled platforms – which means that remote injection, cross-site scripting, buffer overflows, and logic attacks against and through client applications are all possible avenues of attack for access to the cluster. Countermeasures typically include a mixture of secure code development practices (such as input validation, and address space randomization), network segmentation, and third-party tools (including Web Application Firewalls, IDS, authentication, and authorization). Some platforms offer built-in features to bolster application protection, such as YARN’s web application proxy service. Archive protection: As backups are largely an intractable problem for big data, we don’t need to worry much about traditional backup/archive security. But just because legitimate users cannot perform conventional backups does not mean an attacker would not create at least a partial backup. We need to secure the management plane to keep unwanted copies of data or data nodes from being propagated. Access controls, and possibly network segregation, are effective countermeasures against attackers trying to gain administrative access, and encryption can help protect data in case other protections are defeated. In the end, our big data security recommendations boil down to a handful of standard tools which can be effective in setting a secure baseline for big data environments: Use Kerberos: This is effective method for keeping rogue nodes and applications off your cluster. And it can help protect web console access, making administrative functions harder to compromise. We know Kerberos is a pain to set up, and (re-)validation of new nodes and applications takes work. But without bi-directional trust establishment it is too easy to fool Hadoop into letting malicious applications into the cluster, or into accepting introduce malicious nodes – which can then add, alter, or extract data. Kerberos is one of the most effective security controls at your disposal, and it’s built into the Hadoop infrastructure, so use it. File layer encryption: File encryption addresses two attacker methods for circumventing normal application security controls. Encryption protects in case malicious users or administrators gain access to data nodes and directly inspect files, and it also renders stolen files or disk images unreadable. Encryption protects against two of the most serious threats. Just as importantly, it meets our requirements for big data security tools – it is transparent to both Hadoop and calling applications, and scales out as the cluster grows. Open source products are available for most Linux systems; commercial products additionally offer external key management, trusted binaries, and full support. This is a cost-effective way to address several