Security Analytics with Big Data: New Events and New Approaches
By Adrian Lane
So why are we looking at big data, and what problems can we expect it to solve that we couldn’t before? Most SIEM platforms struggle to keep up with emerging needs for two reasons. The first is that threat data does not come neatly packaged from traditional sources, such as syslog and netflow events. There are many different types of data, data feeds, documents, and communications protocols that contain diverse clues to data breaches or ongoing attacks. We see clear demand to analyze a broader data set in hopes of detecting advanced attacks. The second issue is that many types of analysis, correlation, and enrichment are computationally demanding. Much like traditional multi-dimensional data analysis platforms, crunching the data takes horsepower. More data is being generated; add more types of data we want, and multiply that by additional analyses – and you get a giant gap between what you need to do and what you can presently do.
Our last post considered what big data is and how NoSQL database architectures inherently address several of the SIEM pain points. In fact, the 3Vs (Volume, Velocity, & Variety) of big data coincide closely with three of the main problems faced by SIEM systems today: scalability, performance, and effectiveness. This is why big data is such an important advancement for SIEM. Volume and velocity problems are addressed by clustering systems to divide load across many commodity servers, and variety through the inherent flexibility of big data / NoSQL. But of course there is more to it.
Analysis: Looking at More
Two of the most serious problems with current SIEM solutions are that they struggle with the amount of data to be managed, and they cannot deal with the “data velocity” of near-real-time events. Additionally, they need to accept and parse new and diverse data types to support new types of analysis. There are many different types of event data, any of which might contain clues to security threats. Common data types include:
Human-readable data: There is a great deal of data which humans can process easily, but which is much more difficult for machines – including blog comments and Twitter feeds. Tweets, discussion fora, Facebook posts, and other types of social media are all valuable for threat intelligence. Some attacks are coordinated in fora, which means companies want to monitor these fora for warnings of possible or imminent attacks, and perhaps even details of the attacks. Some botnet command and control (C&C) communications occur through social media, so there is potential to detect infected machines through this traffic.
Telemetry feeds: Cell phone geolocation, lists of sites serving malware, mobile device IDs, HR feeds of employee status, and dozens of other real-time data feeds denote changes in status, behavior, and risk profiles. Some of these feeds are analyzed as the stream of events is captured, while others are collected and analyzed for new behaviors. There are many different use cases but security practitioners, observing how effectively retail organizations are able to predict customer buying behavior, are seeking the same insight into threats.
Financial data: We were surprised to learn how many customers use financial data purchased from third parties to help detect fraud. The use cases we heard centered around SIEM for external attacks against web services, but they were also analyzing financial and buying history to predict misuse and account compromise.
Contextual data: This is anything that makes other data more meaningful. Contextual data might indicate automated processes rather than human behavior – a too-fast series of web requests, for example, might indicate a bot rather than a human customer. Contextual data also includes risk scores generated by arbitrary analysis of metadata, and detection of odd or inappropriate series of actions. Some is simply collected from a raw event source while other data is derived through analysis. As we improve our understanding of where to look for attack and breach clues, we will leverage new sources of data and examine them in new ways. SIEM generates some contextual data today, but collection of a broader variety of data enables better analysis and enrichment.
Identity and Personas: Today many SIEMs link with directory services to identify users. The goal is to link a human user to their account name. With cloud services, mobile devices, distributed identity stores, identity certificates, and two-factor identity schemes, it has become much harder to link human beings to account activity. As authentication and authorization facilities become more complex, SIEM must connect to and analyze more and different identity stores and logs.
Network Data: Some of you are saying “What? I thought all SIEMs looked at network flow data!” Actually, some do but others don’t. Some collect and alert on specific known threats, but those account for only a tiny portion of what passes down the wire. Cheap storage makes it feasible to store more network events and perform behavioral computation on general network trends, service usage, and other pre-computed aggregate views of network traffic.
In the future we may be able to include all data. Each of these examples demonstrates what will be possible in the short term. In the long term we may record any and all useful or interesting data. If we can link in data sets that provide different views or help us make better decisions, we will.
We already collect many of these data types, but we have been missing the infrastructure to analyze them meaningfully.
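The contextual data example above, a too-fast series of web requests that suggests a bot rather than a human, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name flag_fast_clients, the threshold of 5 requests per 1-second window, and the sample client IDs are all hypothetical and do not come from any particular SIEM product.

```python
from collections import defaultdict, deque


def flag_fast_clients(events, max_requests=5, window_seconds=1.0):
    """Return the set of client IDs that exceed max_requests in a sliding window.

    events: iterable of (timestamp_seconds, client_id), sorted by timestamp.
    The threshold and window size are illustrative assumptions, not tuned values.
    """
    recent = defaultdict(deque)  # client_id -> timestamps still inside the window
    flagged = set()
    for ts, client in events:
        window = recent[client]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(client)
    return flagged


# Hypothetical sample: client-a sends a request every 0.1 s, client-b every 2 s.
events = [(0.1 * i, "client-a") for i in range(10)]
events += [(2.0 * i, "client-b") for i in range(10)]
events.sort()

if __name__ == "__main__":
    print(flag_fast_clients(events))
```

The resulting flag is itself derived contextual data: it could be attached to later events from the same client as a simple risk indicator.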
Analysis: Doing It Better
One limitation of many SIEM platforms is their dependence on relational databases. Even if you strip away relational constructs that limit insertion performance, they still rely on a SQL language with traditional language processors. The fundamental relational database architecture was designed and optimized for relational queries. Flexibility is severely limited by SQL – statements always include WHERE clauses, and we have a limited number of comparison operators for searching. At a high level we may have Java support, but the actual queries still devolve to SQL statements. SQL may be a trusty hammer but we are running into more and more problems which don’t look like nails.
Big data platforms offer many different non-SQL languages and tools, with different data access constructs which can be bolted together with filtering, tagging, and indexing: Pig, Piql, Mahout, Crunch, AVRO, Hive, Dremel, and so on. The key is that we are no longer limited to a single data query/access model. You can do very complex comparisons, bundled in different programming languages, and optimized for different types of analysis. This enables better and faster analysis, and new analyses that are simply impossible in a pure-SQL world. Many NoSQL distributions also support SQL queries, and even more offer SQL-like syntax, but big data has given us a much broader range of choices.
Analysis: Doing It Faster
One essential characteristic of big data is the way its architecture scales up to very large data sets – clustering breaks up analysis and data management into smaller, more manageable chunks. Commodity hardware and open source software make it cost effective. But in one key way, big data is exactly the opposite of a relational platform. Relational databases focus on a central, confined data model – all data is brought to a central location for management. Big data, in contrast, scatters data across many different servers. Relational databases move data to a central location with plenty of processing power. Big data systems leverage whatever computing power is available, preferably near the data.
This is critical because requests are no longer bottlenecked by a single large server – instead they can be shared across many smaller servers, possibly hundreds. The most common method of distributing work this way is MapReduce. A query is distributed – mapped – to many different nodes. Each node examines only its own small subset of data, mapping it against the specified query, and returning its own matches. The results from all nodes are filtered and deduplicated, yielding the ‘reduced’ result. The NoSQL platform coordinates all work among the nodes, which enables hundreds of nodes to work on smaller problems in parallel.
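The map and reduce steps described above can be illustrated with a toy, single-machine sketch in Python. It is not how Hadoop or any other platform implements MapReduce: the shards, field names, and the failed-login query are hypothetical, and a thread pool merely stands in for a cluster of nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical event shards: each inner list stands in for the data held by one node.
SHARDS = [
    [{"src": "10.0.0.5", "action": "login_failed"},
     {"src": "10.0.0.7", "action": "login_ok"}],
    [{"src": "10.0.0.5", "action": "login_failed"},
     {"src": "10.0.0.9", "action": "login_failed"}],
    [{"src": "10.0.0.5", "action": "login_failed"}],
]


def map_shard(events):
    """Map step: a node scans only its own shard and returns partial counts."""
    return Counter(e["src"] for e in events if e["action"] == "login_failed")


def reduce_counts(partials):
    """Reduce step: merge the per-node partial results into one answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def failed_logins_by_source(shards):
    # The thread pool plays the role of the coordinating platform,
    # handing one shard to each worker and collecting the results.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_shard, shards))
    return reduce_counts(partials)


if __name__ == "__main__":
    print(failed_logins_by_source(SHARDS))
```

No single worker ever sees the whole data set; each returns a small partial result, and only those partial results are combined.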
Understand that platforms like Hadoop usually do one or two things really well, but not everything. That is why we see so many different combinations of NoSQL technologies under the one big data umbrella, often combined in a single product or solution. Some vendors talk about Cassandra as their base infrastructure integrated with Hadoop for its MapReduce capabilities. Some use Apache Lucene to augment Hadoop with fully indexed low-latency search. One vendor told us about their experiments with Chukwa on top of Hadoop. There are many options, and no big data bundle is the overall best. Each vendor is looking to close performance and scalability gaps in their existing solution, and selecting a combination of technologies to suit their needs.
But there is another important point to make here: Big data has infinite permutations, so how can you know your vendor is really offering big data and addressing performance, scalability, and flexibility? Without a clear and agreed-upon definition it is all too easy for vendors to paint their old gear with the Big Data brush to attract attention. If you see words such as ‘proprietary’ you need to dig deeper! Does the proprietary solution actually have the essential characteristics we mentioned in the previous post? Will it integrate with other off-the-shelf commercial and open source big data platforms? Will it scale cost effectively? Can you share data and analysis with other platforms? Hopefully the answers to all these questions are ‘yes’, but the burden of proof is on the vendor if they don’t use a recognized NoSQL distribution such as Hadoop, Cassandra, Riak, or HPCC. The value of these platforms has already been established and they are all under active development.
Next we will go into more detail on key platform issues when we address how big data augments SIEM architectures. We will also highlight how big data is integrated into SIEM.