Securing Big Data: Security Issues with Hadoop Environments
How do I secure "big data"? A simple and common question, but one without a direct answer, simple or otherwise. We know thousands of firms are working on big data projects, from small startups to large enterprises. New technologies enable any company to collect, manage, and analyze incredibly large data sets. As these systems become more common, the repositories are more likely to be stuffed with sensitive data. Only after companies are reliant on "big data" do they ask "How can I secure it?" This question comes up so often, attended by so much interest and confusion, that it's time for an open discussion on big data security.

We want to cover several areas to help people get a better handle on the challenges. Specifically, we want to cover three things:

Why It's Different Architecturally: What's different about these systems, both in how they process information and how they are deployed? We will list some of the specific architectural differences and discuss how they impact data and database security.

Why It's Different Operationally: We will go into detail on operational security issues with big data platforms. We will offer perspective on the challenges in securing big data and the deficiencies of the systems used to manage it, particularly their lack of native security features.

Recommendations and Open Issues: We will outline strategies for securing these data repositories, with tactical recommendations for securing certain facets of these environments. We will also highlight some gaps where no good solutions exist.

Getting back to our initial question, how to secure big data: what is so darn difficult about answering it? For starters, "What is big data?" Before we can offer advice on securing anything, we need to agree on what we're talking about. We can't discuss big data security without an idea of what "big data" means. But there is a major problem: the term is so overused that it has become almost meaningless.
When we talk to customers, developers, vendors, and members of the media, they all have their own idea of what "big data" is, but unfortunately they are all different. It's a complex subject, and even the wiki definition fails to capture the essence. Like art, everyone knows it when they see it, but nobody can agree on a definition.

Defining Big Data

What we know is that big data systems can store very large amounts of data; can manage that data across many systems; and provide some facility for data queries, data consistency, and systems management. So does "big data" mean any giant data repository? No. We are not talking about giant mainframe environments. We're not talking about grid clusters, massively parallel databases, SAN arrays, cloud-in-a-box, or even traditional data warehouses. We have had the capability to create very large data repositories and databases for decades. The challenge is not to manage a boatload of data; many platforms can do that. And it's not just about analysis of very large data sets. Various data management platforms provide the capability to analyze large amounts of data, but their cost and complexity make them non-viable for most applications. The big data revolution is not about new thresholds of scalability for storage and analysis.

Can we define big data as a specific technology? Can we say that big data is any Hadoop HDFS/Lustre/Google GFS/sharded storage system? No; again, big data is more than managing a big data set. Is big data any MapReduce cluster? Probably not, because it's more than how you query large data sets. Heck, even PL/SQL subsystems in Oracle can be set up to work like MapReduce. Is big data an application? Actually, it's all of these things and more. When we talk to developers, the people actually building big data systems and applications, we get a better idea of what we're talking about. The design simplicity of these platforms is what attracts developers.
They are readily available, and their (relatively) low cost of deployment makes them accessible to a wider range of users. With all these traits combined, large-scale data analysis becomes cost-effective. Big data is not a specific technology; it's defined more by a collection of attributes and capabilities. Sound familiar? It's more than a little like the struggle to define cloud computing, so we'll steal from the NIST cloud computing definition and start with some essential characteristics. We define big data as any data repository with the following characteristics:

Handles large amounts (a petabyte or more) of data
Distributed, redundant data storage
Parallel task processing
Provides data processing (MapReduce or equivalent) capabilities
Central management and orchestration
Inexpensive, relatively speaking
Hardware agnostic
Accessible: both (relatively) easy to use, and available as a commercial or open source product
Extensible: basic capabilities can be augmented and altered

In a nutshell: big, cheap, and easy data management. The "big data" revolution is built on these three pillars; the ability to scale data stores at greatly reduced cost makes it all possible. It's data analytics available to the masses. It may or may not have traditional "database" capabilities (indexing, transactional consistency, or relational mapping). It may or may not be fault tolerant. It may or may not have failover capabilities (redundant control nodes). It may or may not allow complex data types. It may or may not provide real-time results to queries. But big data offers all those other characteristics, and it turns out that they, even without traditional database features, are enough to get useful work done.

So does big data mean the Hadoop framework? Yes. The Hadoop framework (e.g. HDFS, MapReduce, YARN, Common) is the poster child for big data, and it offers all the characteristics we outlined.
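The MapReduce capability mentioned above is worth making concrete, since it recurs throughout any discussion of Hadoop. The sketch below is a minimal, single-process word count in plain Python that mimics the three phases of the model: map (emit key/value pairs), shuffle (group values by key), and reduce (aggregate each group). The function names are our own illustration, not Hadoop APIs, and a real Hadoop job would run each phase distributed across many nodes:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    # The three phases chained together, exactly as a cluster would,
    # except everything here runs serially in one process.
    return reduce_phase(shuffle(map_phase(lines)))
```

For example, `word_count(["big data", "big cheap easy"])` returns `{"big": 2, "data": 1, "cheap": 1, "easy": 1}`. The security-relevant point is that in a real cluster these phases ship code and intermediate data across many commodity nodes, which is precisely what makes the environment hard to secure.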
Most big data systems actually use one or more Hadoop components, and extend some or all of its basic functionality. Amazon's SimpleDB also satisfies the requirements, although it is architected differently than Hadoop. Google's proprietary BigTable architecture is very similar to Hadoop, but we exclude