Securing Big Data: Security Issues with Hadoop Environments

How do I secure “big data”? A simple and common question. But one without a direct answer – simple or otherwise.

We know thousands of firms are working on big data projects, from small startups to large enterprises. New technologies enable any company to collect, manage, and analyze incredibly large data sets. As these systems become more common, the repositories are more likely to be stuffed with sensitive data. Only after companies are reliant on “big data” do they ask “How can I secure it?”

This question comes up so much, attended by so much interest and confusion, that it’s time for an open discussion on big data security. We want to cover several areas to help people get a better handle on the challenges. Specifically, we want to cover three things:

Why It’s Different Architecturally: What’s different about these systems, both in how they process information and how they are deployed? We will list some of the specific architectural differences and discuss how they impact data and database security.
Why It’s Different Operationally: We will go into detail on operational security issues with big data platforms. We will offer perspective on the challenges in securing big data and the deficiencies of the systems used to manage it – particularly their lack of native security features.
Recommendations and Open Issues: We will outline strategies for securing these data repositories, with tactical recommendations for securing certain facets of these environments. We will also highlight some gaps where no good solutions exist.

Getting back to our initial question – how to secure big data – what is so darn difficult about answering? For starters, “What is big data?” Before we can offer advice on securing anything, we need to agree what we’re talking about. We can’t discuss big data security without an idea of what “big data” means. But there is a major problem: the term is so overused that it has become almost meaningless. When we talk to customers, developers, vendors, and members of the media, they all have their own idea of what “big data” is – but unfortunately they are all different. It’s a complex subject and even the wiki definition fails to capture the essence. Like art, everyone knows it when they see it, but nobody can agree on a definition.

Defining Big Data

What we know is that big data systems can store very large amounts of data; can manage that data across many systems; and provide some facility for data queries, data consistency, and systems management.

So does “big data” mean any giant data repository? No. We are not talking about giant mainframe environments. We’re not talking about Grid clusters, massively parallel databases, SAN arrays, cloud-in-a-box, or even traditional data warehouses. We have had the capability to create very large data repositories and databases for decades. The challenge is not to manage a boatload of data – many platforms can do that. And it’s not just about analysis of very large data sets. Various data management platforms provide the capability to analyze large amounts of data, but their cost and complexity make them non-viable for most applications. The big data revolution is not about new thresholds of scalability for storage and analysis.

Can we define big data as a specific technology? Can we say that big data is any Hadoop HDFS/Lustre/Google GFS/shard storage system? No – again, big data is more than managing a big data set. Is big data any MapReduce cluster? Probably not, because it’s more than how you query large data sets. Heck, even PL/SQL subsystems in Oracle can be set up to work like MapReduce. Is big data an application? Actually, it’s all of these things and more.
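For readers who have not seen the MapReduce query style mentioned above, here is a minimal single-process sketch in plain Python. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, map_reduce) and the sample input are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record: here, (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key: here, sum the counts."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce in sequence over all records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Each string stands in for a data block stored on a different node.
blocks = ["big data is big", "data is cheap"]
counts = map_reduce(blocks)
print(counts)
```

The point of the pattern is that map_phase and reduce_phase never need to see the whole data set, which is what lets the work be spread across many inexpensive machines.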

When we talk to developers, the people actually building big data systems and applications, we get a better idea of what we’re talking about. The design simplicity of these platforms is what attracts developers. They are readily available, and their (relatively) low cost of deployment makes them accessible to a wider range of users. With all these traits combined, large-scale data analysis becomes cost-effective. Big data is not a specific technology – it’s defined more by a collection of attributes and capabilities.

Sound familiar? It’s more than a little like the struggle to define cloud computing, so we’ll steal from the NIST cloud computing definition and start with some essential characteristics.

We define big data as any data repository with the following characteristics:

Handles large amounts (petabyte or more) of data
Distributed, redundant data storage
Parallel task processing
Provides data processing (MapReduce or equivalent) capabilities
Central management and orchestration
Inexpensive – relatively
Hardware agnostic
Accessible – both (relatively) easy to use, and available as a commercial or open source product
Extensible – basic capabilities can be augmented and altered
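One way to apply the list above is as a simple checklist. The sketch below is only an illustration: the field names are our shorthand for the nine characteristics, and the candidate platform is hypothetical rather than an assessment of any real product.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Characteristics:
    """One flag per essential characteristic in the list above."""
    petabyte_scale: bool
    distributed_redundant_storage: bool
    parallel_task_processing: bool
    data_processing: bool          # MapReduce or equivalent
    central_management: bool
    inexpensive: bool              # relatively
    hardware_agnostic: bool
    accessible: bool
    extensible: bool


def missing(platform):
    """Return the names of the characteristics the platform lacks."""
    return [f.name for f in fields(platform) if not getattr(platform, f.name)]


def is_big_data(platform):
    """Under this definition, a platform qualifies only if nothing is missing."""
    return not missing(platform)


# Hypothetical candidate: scales well, but is costly and closed to extension.
candidate = Characteristics(
    petabyte_scale=True,
    distributed_redundant_storage=True,
    parallel_task_processing=True,
    data_processing=True,
    central_management=True,
    inexpensive=False,
    hardware_agnostic=True,
    accessible=True,
    extensible=False,
)

print(is_big_data(candidate))
print(missing(candidate))
```

Treating every characteristic as required is the strictest reading of the definition; the grey areas are where a platform misses only one or two.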

In a nutshell: big, cheap, and easy data management. The “big data” revolution is built on these three pillars – the ability to scale data stores at greatly reduced cost is what makes it all possible. It’s data analytics available to the masses. It may or may not have traditional ‘database’ capabilities (indexing, transactional consistency, or relational mapping). It may or may not be fault tolerant. It may or may not have failover capabilities (redundant control nodes). It may or may not allow complex data types. It may or may not provide real-time results to queries. But big data offers all those other characteristics, and it turns out that they – even without traditional database features – are enough to get useful work done.

So does big data mean the Hadoop framework? Yes. The Hadoop framework (e.g. HDFS, MapReduce, YARN, Common) is the poster child for big data, and it offers all the characteristics we outlined. Most big data systems actually use one or more Hadoop components, and extend some or all of its basic functionality. Amazon’s SimpleDB also satisfies the requirements, although it is architected differently than Hadoop. Google’s proprietary BigTable architecture is very similar to Hadoop, but we exclude proprietary systems which are not widely available. Our definition leaves some grey areas – we are open to suggestions – but it helps understand market trends. For the remainder of this series, unless we say otherwise, we will focus on the Hadoop framework and related extensions (Cassandra, MongoDB, Couch, Riak, etc.) as together they represent the majority of customer use cases.

It’s helpful to think about the Hadoop framework as a ‘stack’, much like a LAMP stack. Normally these pieces are grouped together, but you can mix and match, or add onto the stack, as needed. For example, there are optional data access services like Sqoop and Hive. Lustre, GFS, and GPFS are data storage alternatives to HDFS. Or you can extend HDFS functionality with tools like Scribe. The entire stack can be configured and extended as needed. This modular approach offers great flexibility but makes security more difficult, as we will see in upcoming posts.
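To make the mix-and-match point concrete, here is a small sketch that treats the stack as a set of interchangeable parts. The component names come from the text above, but the grouping into core, storage options, and add-ons, and the function name components_to_review, are assumptions made for illustration, not an official Hadoop taxonomy.

```python
# Components named in the text, grouped by role. The role names and this
# grouping are illustrative assumptions, not an official Hadoop taxonomy.
CORE = frozenset({"MapReduce", "YARN", "Common"})
STORAGE_OPTIONS = frozenset({"HDFS", "Lustre", "GFS", "GPFS"})
OPTIONAL_ADDONS = frozenset({"Sqoop", "Hive", "Scribe"})


def components_to_review(storage="HDFS", addons=()):
    """Return the set of components one deployment would need to review."""
    if storage not in STORAGE_OPTIONS:
        raise ValueError(f"unknown storage layer: {storage}")
    unknown = set(addons) - OPTIONAL_ADDONS
    if unknown:
        raise ValueError(f"unknown add-ons: {sorted(unknown)}")
    return set(CORE) | {storage} | set(addons)


default_stack = components_to_review()
custom_stack = components_to_review("GPFS", addons=["Hive", "Sqoop"])

print(sorted(default_stack))
print(sorted(custom_stack))
# Components in the custom stack that a default-stack review would not cover.
print(sorted(custom_stack - default_stack))
```

Every substitution or add-on changes the list of components that must be reviewed, which is one reason a single security checklist rarely fits two deployments.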

Next we will look at deployment models and architectural security issues. We may have big, cheap, and easy data management, but the security capabilities are lagging behind.

3 Replies to “Securing Big Data: Security Issues with Hadoop Environments”

Adrian Lane September 30, 2012 at 3:43 pm

@John — I think you’re right that the velocity of input data is a common distinguishing trait. I’ll add to the essential characteristics when I put the paper together.

@Bert – You hit on a couple important points, and why classifying big data is so difficult. It’s certainly not tied to the relational model and is free of the data definition burden. That said, there are relational variants of big data clusters. In fact, perhaps I need to add to the essential characteristics that big data is _not_ defined by the storage model. We know key-value stores are common, but there are also wide column, XML, document, object and grid storage options, and even multi-model systems like Alchemy. Thanks for raising the issue.

As far as Teradata – I was on the fence about this platform for a while. It’s quasi-relational and offers the same parallel search capabilities. I came to the conclusion that it fails the data redundancy test, as it does not provide that degree of resiliency and does not work on the assumption nodes will fail during operation. It can be argued it fails the (relatively) low-cost and extensibility tests as well.

Thanks for the comments.

-Adrian

Bert Latamore September 24, 2012 at 5:46 pm

One thing I would like to suggest adding to your definition is that “Big Data” heavily implies the inclusion of unstructured and/or semi-structured data types such as relationship data that are very hard or impossible to include in traditional SQL structured databases. This allows users to ask new kinds of questions such as “Who are the primary influencers among this group of people?” or “Where does our customer support process break down, causing customers to leave us rather than renewing?” These and others like them are important questions in business, and this is the source of much of the value of Big Data. It will revolutionize how businesses and government entities operate and relate to the public. And this is why a Teradata database, for instance, is not Big Data.

John Piekos September 20, 2012 at 3:51 pm

Hi Adrian,

One aspect of your Big Data definition that I think you overlooked is Velocity. In addition to big data being “big” (petabytes or more, as you say), often this data is coming in extremely fast, tens of thousands to millions per second. The ability of big data systems to ingest these feeds is crucial for businesses to react in “near real-time” to changing conditions.

Nice summary – I look forward to future posts.

Thanks,

John Piekos
http://www.voltdb.com
