Now that we have sketched out the elements of a Hadoop cluster and what one looks like, let's talk about threats to these databases. We want to consider both the database infrastructure itself and the data under management. Given the complexity of a Hadoop cluster, the task is closer to securing an entire data center than a typical relational database. All the features that provide flexibility, scalability, performance, and openness create specific security challenges. The following are some specific threats to clustered databases.

Data access & ownership: Role-based access is central to most database security schemes, and NoSQL is no different. Relational and quasi-relational platforms include roles, groups, schemas, label security, and various other facilities for limiting user access to subsets of the available data. Most big data environments now offer integration with identity stores, along with role-based facilities to divide data access between groups of users. That said, authentication and authorization require cooperation between the application designer and the IT team managing the cluster. Leveraging existing Active Directory or LDAP services helps tremendously with defining user identities, and pre-defined roles may be available for limiting access to sensitive data (a configuration sketch follows this list of threats).

Data at rest protection: The standard for protecting data at rest is encryption, which guards against attempts to access data outside established application interfaces. With Hadoop systems we worry about people stealing archives or reading files directly from disk. Encrypted files are protected against access by users without the encryption keys. Replication effectively replaces backups for big data, but beware a rogue administrator or cloud service manager creating their own backups; encryption limits how data can be copied from the cluster. Unlike in 2012, when the lack of suitable encryption was a serious issue, Apache now offers HDFS encryption as an option. This is a major advance, but remember that it covers only HDFS, and you will need to fill the gaps around key management and key storage. Several commercial Hadoop vendors offer transparent encryption, and third parties have advanced the state of the art with transparent encryption options for both HDFS and non-HDFS on-disk formats, coupled with parallel progress in key management.

Inter-node communication: Hadoop and the vast majority of big data platforms (Cassandra, MongoDB, Couchbase, etc.) do not communicate securely by default – they use unencrypted RPC over TCP/IP. TLS and SSL capabilities are bundled with big data distributions, but they are not typically used between applications and databases – and almost never for inter-node communication. This leaves data in transit, as well as application queries, open to inspection and tampering.

Client interaction: Clients interact with both resource managers and nodes. While gateway services can be created for loading data, clients communicate directly with both resource managers and individual data nodes. Compromised clients can send malicious data or links to either service. This model facilitates efficient communication but makes it difficult to protect nodes from clients, clients from nodes, and even name servers from nodes. Worse, the distribution of self-organizing nodes is a poor fit for security tools such as gateways, firewalls, and monitors. Many security tools are designed around a choke point or span port, which may not be available in a peer-to-peer mesh cluster.

Distributed nodes: One of the reasons big data makes sense is the old truism "moving computation is cheaper than moving data". Data is processed wherever resources are available, enabling massively parallel computation. Unfortunately this produces complicated environments with a large attack surface. With so many moving parts, it is difficult to verify consistency or security across a highly distributed cluster of (possibly heterogeneous) platforms. Patching, configuration management, node identity, and data at rest protection – and consistent deployment of each – are all issues.
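Two of the threats above – data access & ownership and data at rest protection – map directly to Hadoop configuration. The sketch below, with placeholder hostnames, accounts, and endpoints, shows the core-site.xml properties commonly used to map cluster users onto groups held in an existing LDAP or Active Directory store, and to point HDFS transparent encryption at an external key management service (KMS). It is a minimal illustration rather than a complete deployment: an encryption key and encryption zone must still be created (with the "hadoop key create" and "hdfs crypto -createZone" commands) before any files are actually encrypted.

    <!-- core-site.xml: resolve user groups from an existing directory service -->
    <property>
      <name>hadoop.security.group.mapping</name>
      <value>org.apache.hadoop.security.LdapGroupsMapping</value>
    </property>
    <property>
      <name>hadoop.security.group.mapping.ldap.url</name>
      <value>ldap://directory.example.com:389</value>   <!-- placeholder directory server -->
    </property>
    <property>
      <name>hadoop.security.group.mapping.ldap.bind.user</name>
      <value>cn=hadoop-svc,ou=services,dc=example,dc=com</value>   <!-- placeholder bind account -->
    </property>
    <property>
      <name>hadoop.security.group.mapping.ldap.base</name>
      <value>dc=example,dc=com</value>
    </property>

    <!-- core-site.xml: back HDFS encryption zones with an external key management server -->
    <property>
      <name>hadoop.security.key.provider.path</name>
      <value>kms://https@kms.example.com:9600/kms</value>   <!-- placeholder KMS endpoint -->
    </property>

With group mapping in place, HDFS permissions and higher-level authorization tools can reference directory groups rather than locally defined users, and the external KMS keeps encryption keys off the data nodes they protect.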
Threat-response models

One or more security countermeasures are available to mitigate each threat identified above. The following diagram shows which options are at your disposal for each threat, to help you choose a 'preventative' security measure. We don't have room to go into much detail on the tradeoffs of each response – each area really deserves its own paper – but we do want to mention a couple of areas where we have seen the most change since our original research four years ago.

If your goal is to protect session privacy – either between clients and data nodes, or for inter-node communication – Transport Layer Security (TLS) is your first choice (a configuration sketch appears at the end of this section). This was unheard of in 2012, but since then about 25% of the companies we spoke with have implemented SSL or TLS for inter-node communication – not just between applications and name servers. Transport encryption protects all communications from access or modification by attackers. Some firms instead use network segmentation and firewalls to ensure that attackers cannot access network traffic. This approach is less robust but much easier to implement. Some clusters were deployed to third-party cloud services, where virtualized network services make sniffing nearly impossible; these companies typically chose not to encrypt internal cluster communications.

Enforcing data usage is one of the areas where we have seen the most progress, thanks to database links into existing Active Directory and LDAP identity stores. This seems obvious now, but it was a rarity in 2012, when data architects were focused on scalability and getting basic analytics up and running. Fortunately support for linking identity stores to Hadoop clusters has advanced considerably, making it much easier to leverage existing roles and management infrastructure.

But we also have other tools at our disposal. We don't see it often, but a handful of organizations encrypt sensitive data elements at the application layer, so the information is stored as encrypted elements. This way the application manages decryption and key management functions, and can offer additional controls over who can see which information. This is very secure, but it must be designed in from the beginning and coded into the application. Retrofitting application-layer encryption into an existing application and database stack is highly challenging at best, which is why we also see wide usage of masking and redaction technologies – from both enterprise Hadoop vendors and third-party security vendors. These technologies offer fine-grained control over which data is displayed to which users, and can be easily built into existing clusters to protect sensitive data.
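For reference, the transport encryption discussed above is largely a configuration exercise in current Hadoop releases. The sketch below, with placeholder file paths and passwords, shows the properties typically used to require encrypted RPC between nodes, encrypt the block transfer protocol between clients and data nodes, and force HTTPS on the cluster's web interfaces. It assumes authentication (typically Kerberos) is already enabled, since RPC protection and data transfer encryption are negotiated as part of authenticated connections, and it assumes each node has been provisioned with its own keystore and truststore.

    <!-- core-site.xml: require integrity checking and encryption on Hadoop RPC -->
    <property>
      <name>hadoop.rpc.protection</name>
      <value>privacy</value>
    </property>

    <!-- hdfs-site.xml: encrypt block data transfer and serve web UIs over HTTPS only -->
    <property>
      <name>dfs.encrypt.data.transfer</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.http.policy</name>
      <value>HTTPS_ONLY</value>
    </property>

    <!-- ssl-server.xml: per-node keystore and truststore (placeholder paths and passwords) -->
    <property>
      <name>ssl.server.keystore.location</name>
      <value>/etc/hadoop/ssl/node.jks</value>
    </property>
    <property>
      <name>ssl.server.keystore.password</name>
      <value>changeit</value>
    </property>
    <property>
      <name>ssl.server.truststore.location</name>
      <value>/etc/hadoop/ssl/truststore.jks</value>
    </property>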