How do I secure “big data”? A simple and common question. But one without a direct answer – simple or otherwise.

We know thousands of firms are working on big data projects, from small startups to large enterprises. New technologies enable any company to collect, manage, and analyze incredibly large data sets. As these systems become more common, the repositories are more likely to be stuffed with sensitive data. Only after companies are reliant on “big data” do they ask “How can I secure it?”

This question comes up so much, attended by so much interest interest and confusion, that it’s time for an open discussion on big data security. We want to cover several areas to help people get a better handle on the challenges. Specifically, we want to cover three things:

  • Why It’s Different Architecturally: What’s different about these systems, both in how they process information and how they are deployed? We will list some of the specific architectural differences and discuss how how they impact data and database security.
  • Why It’s Different Operationally: We will go into detail on operational security issues with big data platforms. We will offer perspective on the challenges in securing big data and the deficiencies of the systems used to manage it – particularly their lack of native security features.
  • Recommendations and Open Issues: We will outline strategies for securing these data repositories, with tactical recommendations for securing certain facets of these environments. We will also highlight some gaps where no good solutions exist.

Getting back to our initial question – how to secure big data – what is so darn difficult about answering? For starters, “What is big data?” Before we can offer advice on securing anything, we need to agree what we’re talking about. We can’t discus big data security without an idea of what “big data” means. But there is a major problem: the term is so overused that it has become almost meaningless. When we talk to customers, developers, vendors, and members of the media, they all have their have their own idea of what “big data” is – but unfortunately they are all different. It’s a complex subject and even the wiki definition fails to capture the essence. Like art, everyone knows it when they see it, but nobody can agree on a definition.

Defining Big Data

What we know is that big data systems can store very large amounts of data; can manage that data across many systems; and provide some facility for data queries, data consistency, and systems management.

So does “big data” mean any giant data repository? No. We are not talking about giant mainframe environments. We’re not talking about Grid clusters, massively parallel databases, SAN arrays, cloud-in-a-box, or even traditional data warehouses. We have had the capability to create very large data repositories and databases for decades. The challenge is not to manage a boatload of data – many platforms can do that. And it’s not just about analysis of very large data sets. Various data management platforms provide the capability to analyze large amounts of data, but their cost and complexity make them non-viable for most applications. The big data revolution is not about new thresholds of scalability for storage and analysis.

Can we define big data as a specific technology? Can we say that big data is any Hadoop HDFS/Lustre/Google GFS/shard storage system? No – again, big data is more than managing a big data set. Is big data any MapReduce cluster? Probably not, because it’s more than how you query large data sets. Heck, even PL/SQL subsystems in Oracle can be set up to work like MapReduce. Is big data an application? Actually, it’s all of these things and more.

When we talk to developers, the people actually building big data systems and applications, we get a better idea of what we’re talking about. The design simplicity of these these platforms is what attracts developers. They are readily available, and their (relatively) low cost of deployment makes them accessible to a wider range of users. With all these traits combined, large-scale data analysis becomes cost-effective. Big data is not a specific technology – it’s defined more by a collection of attributes and capabilities.

Sound familiar? It’s more than a little like the struggle to define cloud computing, so we’ll steal from the NIST cloud computing definition and start with some essential characteristics.

We define big data as any data repository with the following characteristics:

  • Handles large amounts (petabyte or more) of data
  • Distributed, redundant data storage
  • Parallel task processing
  • Provides data processing (MapReduce or equivalent) capabilities
  • Central management and orchestration
  • Inexpensive – relatively
  • Hardware agnostic
  • Accessible – both (relatively) easy to use, and available as a commercial or open source product
  • Extensible – basic capabilities can be augmented and altered

In a nutshell: big, cheap, and easy data management. The “big data” revolution is built on these three pillars – the ability to scale data stores at greatly reduced cost is makes it all possible. It’s data analytics available to the masses. It may or may not have traditional ‘database’ capabilities (indexing, transactional consistency, or relational mapping). It may or may not be fault tolerant. It may or may not have failover capabilities (redundant control nodes). It may or may not allow complex data types. It may or may not provide real-time results to queries. But big data offers all those other characteristics, and it turns out that they – even without traditional database features – are enough to get useful work done.

So does big data mean the Hadoop framework? Yes. The Hadoop framework (e.g. HDFS, MapReduce, YARN, Common) is the poster child for big data, and it offers all the characteristics we outlined. Most big data systems actually use one or more Hadoop components, and extend some or all of its basic functionality. Amazon’s SimpleDB also satisfies the requirements, although it is architected differently than Hadoop. Google’s proprietary BigTable architecture is very similar to Hadoop, but we exclude proprietary systems which are not widely available. Our definition leaves some grey areas – we are open to suggestions – but it helps understand market trends. For the remainder of this series, unless we say otherwise, we will focus on the Hadoop framework and related extensions (Cassandra, MongoDB, Couch, Riak, etc.) as together they represent the majority of customer use cases.

It’s helpful to think about the Hadoop framework as a ‘stack’, much like a LAMP stack. Normally these pieces are grouped together, but you can mix and match, or add onto the stack, as needed. For example, there are optional data access services like Sqoop and Hive. Lustre, GFS, and GPFS are data storage alternatives to HDFS. Or you can extend HDFS functionality with tools like Scribe. The entire stack can be configured and extended as needed. This modular approach offers great flexibility but makes security more difficult, as we will see in upcoming posts.

Next we will look at deployment models and architectural security issues. We may have big, cheap, and easy data management, but the security capabilities are lagging behind.