Security Analytics with Big Data: Defining Big Data

By Adrian Lane
Today we pick up our Security Analytics with Big Data series where we left off. But first it’s worth reiterating that this series was originally intended to describe how big data made security analytics better. But when we started to interview customers it became clear that they are just as concerned with how big data can make their existing infrastructure better. They want to know how big data can augment SIEM and the impact of this transition on their organization. It has taken some time to complete our interviews with end users and vendors to determine current needs and capabilities. And the market is moving fast – vendors are pushing to incorporate big data into their platforms and leverage the new capabilities. I think we have a good handle on the state of the market, but as always we welcome comments and input.
So far we have outlined the reasons big data is being looked at as a transformative technology for SIEM, as well as common use cases, with the latter post showing how customer desires differ from what we have come to expect. My original outline addressed a central question: “How is big data analysis different from traditional SIEM?”, but it has since become clear that we need to fully describe what big data is first.
This post demystifies big data by explaining what it is and what it isn’t, so potential buyers can compare what big data actually is against what their SIEM vendor is selling. Are vendors really using big data, or is it the same thing they have been selling all along? You need to understand what big data is before you can tell whether a vendor’s offering is valuable or snake oil. Some vendors are (deliberately) sloppy, and their big data offerings may not actually be big data at all. They might offer a relational data store with a “Big Data” label stuck on, or a proprietary flat file storage format without any of the features that make big data platforms powerful.
Let’s start with Wikipedia’s Big Data page. Wikipedia’s definition (as of this writing) captures the principal challenges big data is intended to address: increased Volume (quantity of data), Velocity (rate of data accumulation), and Variety (different types of data) – also called the 3Vs. But Wikipedia fails to actually define big data. The term “big data” has been so overused, with so many incompatible definitions, that it has become meaningless.
The current poster child for big data is Apache Hadoop, an open source platform based on Google’s MapReduce and Google File System papers. A Hadoop installation is built as a clustered set of commodity hardware, with each node providing storage and processing capabilities. Hadoop provides tools for data storage, data organization, query management, cluster management, and client management.
It is helpful to think about the Hadoop framework as a ‘stack’, like the LAMP stack. These Hadoop components are normally grouped together, but you can replace each component, or add new ones, as desired. Some clusters add optional data access services such as Sqoop and Hive. Lustre, GFS, and GPFS can be swapped in as the storage layer. Or you can extend HDFS functionality with tools like Scribe. You can select or design a big data architecture specifically to support columnar, graph, document, XML, or multidimensional data. This modular approach enables customization and extension to satisfy specific customer needs.
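The swappable-layer idea above is easier to see in miniature. The sketch below is purely conceptual – these are not real Hadoop, HDFS, or Lustre APIs – but it shows how a query layer written against an abstract storage interface lets you replace the storage backend without touching the rest of the stack:

```python
# Conceptual sketch of big data's modular design -- illustrative only,
# not actual Hadoop, HDFS, or Lustre interfaces.

class StorageLayer:
    """Abstract storage layer; any backend implementing these
    two methods can be plugged into the stack."""
    def write(self, key, value):
        raise NotImplementedError
    def read(self, key):
        raise NotImplementedError

class HDFSLikeStorage(StorageLayer):
    """Stand-in for a distributed block store."""
    def __init__(self):
        self._blocks = {}
    def write(self, key, value):
        self._blocks[key] = value
    def read(self, key):
        return self._blocks[key]

class LustreLikeStorage(StorageLayer):
    """Stand-in for an alternate clustered filesystem."""
    def __init__(self):
        self._files = {}
    def write(self, key, value):
        self._files[key] = value
    def read(self, key):
        return self._files[key]

class Cluster:
    """The query layer depends only on the StorageLayer interface,
    so either backend can be swapped in without changing this code."""
    def __init__(self, storage):
        self.storage = storage
    def ingest(self, key, record):
        self.storage.write(key, record)
    def query(self, key):
        return self.storage.read(key)

# Either backend works without changing the cluster code.
for backend in (HDFSLikeStorage(), LustreLikeStorage()):
    cluster = Cluster(backend)
    cluster.ingest("event-1", {"src_ip": "10.0.0.5", "action": "login"})
    print(cluster.query("event-1")["src_ip"])  # 10.0.0.5
```

Real platforms swap far richer components (storage, orchestration, query engines), but the principle is the same: layers are defined by interfaces, not implementations.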
But that is still not a definition. And Hadoop is not the only player. Users might choose Cassandra, CouchDB, MongoDB, or Riak instead – or investigate 120 or more alternatives. Each platform is different – focusing on its own particular computational problem area, replicating data across the cluster in its own way, with its own storage and query models, and so on. One common thread is that every big data system is based on a ‘NoSQL’ (non-relational) database; they also embrace many non-relational technologies to improve scalability and performance. Unlike relational databases, which we define by their use of relational keys, table storage, and various other common traits, there is no such commonality among NoSQL platforms. Each layer of a big data environment may be radically different, so there is much less common functionality than we see among relational platforms.
But we have seen this problem before – the term “Cloud Computing” used to be similarly meaningless, but we have come to grips with the many different cloud service and consumption models. We lacked a good definition until NIST defined cloud computing based on a series of essential characteristics. So we took a similar approach, defining big data as a framework of utilities and characteristics common to all NoSQL platforms.
- Very large data sets (Volume)
- Extremely fast insertion (Velocity)
- Multiple data types (Variety)
- Clustered deployments
- Provides complex data analysis capabilities (MapReduce or equivalent)
- Distributed and redundant data storage
- Distributed parallel processing
- Modular design
- Hardware agnostic
- Easy to use (relatively)
- Available (commercial or open source)
- Extensible – designers can augment or alter functions
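The “complex data analysis capabilities (MapReduce or equivalent)” item in the list above deserves a concrete illustration. The following is a minimal, single-process sketch of the MapReduce pattern, counting events per source IP in some invented log lines; on a real cluster the map and reduce phases run in parallel across many nodes, which is what makes the approach scale:

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce pattern.
# A real platform distributes these phases across cluster nodes;
# the log lines here are invented for illustration.

log_lines = [
    "10.0.0.5 GET /login",
    "10.0.0.9 GET /index",
    "10.0.0.5 POST /login",
]

def map_phase(line):
    """Emit (key, value) pairs -- here, one count per source IP."""
    src_ip = line.split()[0]
    yield (src_ip, 1)

def reduce_phase(key, values):
    """Aggregate all values emitted for one key."""
    return (key, sum(values))

# Shuffle step: group intermediate pairs by key.
grouped = defaultdict(list)
for line in log_lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

results = dict(reduce_phase(k, v) for k, v in grouped.items())
print(results)  # {'10.0.0.5': 2, '10.0.0.9': 1}
```

Trivial at this scale, but the same map/shuffle/reduce structure is how big data platforms run aggregations over billions of events.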
There are more essential characteristics to big data than just the 3Vs. Additional essential capabilities include data management, cost reduction, more extensive analytics than SQL, and customization (including a modular approach to orchestration, access, task management, and query processing). This broader collection of characteristics captures the big data value proposition, and offers a better understanding of what big data is and how it behaves.
What does it look like?
This is a typical big data cluster architecture; multiple nodes cooperate to manage data and process queries. A central node manages the cluster and client connections, and clients communicate directly with the name node and individual data nodes as necessary for query operations.
This simplified view shows the critical components, but a big data cluster could easily comprise 500 nodes hosting 30 applications. More nodes enable faster data insertion, and parallel query processing improves responsiveness substantially. 500 nodes should be overkill for your SIEM installation, but big data can solve much larger problems than security analytics.
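The name-node/data-node interaction described above amounts to a scatter-gather query: the coordinating node sends the same query to every data node in parallel, then merges the partial results. The sketch below simulates this with threads standing in for data nodes; the shard contents and query are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated scatter-gather query across a cluster.
# Each inner list stands in for one data node's local shard;
# the records are invented for illustration.

data_nodes = [
    ["10.0.0.5 login-fail", "10.0.0.9 login-ok"],
    ["10.0.0.5 login-fail", "10.0.0.7 login-ok"],
    ["10.0.0.5 login-ok"],
]

def query_node(records, predicate):
    """Each node scans only its local shard of the data."""
    return [r for r in records if predicate(r)]

def scatter_gather(nodes, predicate):
    """Run the query on all nodes in parallel, then merge results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(lambda n: query_node(n, predicate), nodes)
    return [r for partial in partials for r in partial]

failures = scatter_gather(data_nodes, lambda r: "login-fail" in r)
print(len(failures))  # 2
```

Because each node scans only its own shard, adding nodes splits the work further – which is why query responsiveness improves as the cluster grows.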
Why Are Companies Adopting Big Data?
Thinking of big data simply as a system that holds “a lot of data”, or even limiting its definition to the 3Vs, is reminiscent of The Blind Men and the Elephant – each blind man only perceives one facet of the whole.
The popularity of big data is largely due to its incredibly cheap analytics capabilities. The big data revolution has been driven by three simple evolutionary changes in the market: inexpensive commodity computing resources, availability of a boatload of interesting data to analyze, and virtually free analytics tools. Together they created the widespread demand which big data systems address. Once organizations see what big data can do for marketing and sales data they wonder what it can do for other computationally complex challenges such as threat and fraud detection.
Conversely, you could think about it this way: the big data revolution happened not because companies of all sizes suddenly stumbled over millions of dollars earmarked to buy “big iron”, but because they can now perform advanced analytics for pennies. We have been able to build systems that process and store vast quantities of data for a couple of decades, but they required multi-million-dollar investments just to get started. These large and expensive systems were scarce, data sets of the relevant scale were much harder to obtain, and personnel who could manage and program them were incredibly rare and thus expensive.
It is helpful to keep these three drivers in mind when considering what big data is, and is not. We could spend another 20 pages defining big data, but this should be enough information to make meaningful comparisons between different solutions. If an underlying data management architecture does not support distribution of queries across nodes, it will be less able than a true big data system to deliver timely information. If a “big data” infrastructure is proprietary, it will be much harder to leverage open source or commodity commercial tools to extend platform capabilities, or to find people who can implement your policies and build security analytics from them.
Big data means better scalability, better analysis, and faster results at lower cost. Who can deliver, and how? We will answer these questions later in this series. Next I will discuss how big data technologies address the 3Vs and their potential to provide better security analytics.