Security Analytics with Big Data: New Events and New Approaches
So why are we looking at big data, and what problems can we expect it to solve that we couldn't before? Most SIEM platforms struggle to keep up with emerging needs for two reasons. The first is that threat data does not come neatly packaged from traditional sources such as syslog and netflow events. Many different types of data, data feeds, documents, and communications protocols contain clues to data breaches or ongoing attacks, and we see clear demand to analyze this broader data set in hopes of detecting advanced attacks. The second issue is that many types of analysis, correlation, and enrichment are computationally demanding. Much like traditional multi-dimensional data analysis platforms, crunching the data takes horsepower. More data is being generated; add more types of data we want, multiply that by additional analyses, and you get a giant gap between what you need to do and what you can presently do.

Our last post considered what big data is and how NoSQL database architectures inherently address several of the SIEM pain points. In fact, the 3Vs (Volume, Velocity, and Variety) of big data coincide closely with three of the main problems faced by SIEM systems today: scalability, performance, and effectiveness. This is why big data is such an important advancement for SIEM. Volume and velocity problems are addressed by clustering systems to divide load across many commodity servers, and variety by the inherent flexibility of big data / NoSQL platforms. But of course there is more to it.

Analysis: Looking at More

Two of the most serious problems with current SIEM solutions are that they struggle with the amount of data to be managed and cannot deal with the "data velocity" of near-real-time events. Additionally, they need to accept and parse new and diverse data types to support new types of analysis. There are many different types of event data, any of which might contain clues to security threats. Common data types include:

Human-readable data: There is a great deal of data which humans can process easily but which is much more difficult for machines, including blog comments and Twitter feeds. Tweets, discussion fora, Facebook posts, and other types of social media are all valuable for threat intelligence. Some attacks are coordinated in fora, which means companies want to monitor those fora for warnings of possible or imminent attacks, and perhaps even details of the attacks. Some botnet command and control (C&C) communications occur through social media, so there is potential to detect infected machines through this traffic.

Telemetry feeds: Cell phone geolocation, lists of sites serving malware, mobile device IDs, HR feeds of employee status, and dozens of other real-time data feeds denote changes in status, behavior, and risk profiles. Some of these feeds are analyzed as the stream of events is captured, while others are collected and analyzed later for new behaviors. There are many different use cases, but security practitioners, observing how effectively retail organizations predict customer buying behavior, are seeking the same insight into threats.

Financial data: We were surprised to learn how many customers use financial data purchased from third parties to help detect fraud. The use cases we heard centered on SIEM for external attacks against web services, but these customers were also analyzing financial and buying history to predict misuse and account compromise.

Contextual data: This is anything that makes other data more meaningful. Contextual data might indicate automated processes rather than human behavior – a too-fast series of web requests, for example, might indicate a bot rather than a human customer (see the sketch after this list). Contextual data also includes risk scores generated by arbitrary analysis of metadata, and detection of odd or inappropriate series of actions. Some of it is simply collected from a raw event source, while other data is derived through analysis. As we improve our understanding of where to look for attack and breach clues, we will leverage new sources of data and examine them in new ways. SIEM generates some contextual data today, but collecting a broader variety of data enables better analysis and enrichment.

Identity and Personas: Today many SIEMs link with directory services to identify users; the goal is to link a human user to an account name. With cloud services, mobile devices, distributed identity stores, identity certificates, and two-factor authentication schemes, it has become much harder to link human beings to account activity. As authentication and authorization facilities become more complex, SIEM must connect to and analyze more and different identity stores and logs.

Network Data: Some of you are saying "What? I thought all SIEMs looked at network flow data!" Actually, some do and others don't. Some collect and alert on specific known threats, but only on a tiny portion of what passes down the wire. Cheap storage makes it feasible to store more network events and perform behavioral computation on general network trends, service usage, and other pre-computed aggregate views of network traffic. In the future we may be able to include all of this data.
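The bot-versus-human point under Contextual data is easy to make concrete. The following is a minimal sketch, not any SIEM vendor's implementation: it assumes web events arrive as simple records with a session identifier and a timestamp, and the threshold and names used here (REQUESTS_PER_MINUTE_LIMIT, flag_likely_bots) are invented for illustration. It derives a per-session request rate and emits it as a small piece of contextual enrichment, along with a likely-bot flag, that downstream correlation could consume.

```python
# Hypothetical sketch: deriving contextual data from raw web logs by flagging
# request bursts that look automated rather than human. The threshold and all
# names here are illustrative assumptions, not taken from any product.
from collections import defaultdict
from datetime import datetime, timedelta

REQUESTS_PER_MINUTE_LIMIT = 120  # assumption: sustained rates above this look automated

def flag_likely_bots(events):
    """events: iterable of dicts like {"session": "s1", "ts": datetime, "url": "/login"}.
    Returns one enrichment record per session: request count, rate, and a likely-bot flag."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event["session"]].append(event["ts"])

    context = {}
    for session, timestamps in sessions.items():
        timestamps.sort()
        # Use at least a one-second window to avoid dividing by zero on single-event sessions.
        window = max(timestamps[-1] - timestamps[0], timedelta(seconds=1))
        rate_per_minute = len(timestamps) / (window.total_seconds() / 60.0)
        context[session] = {
            "requests": len(timestamps),
            "rate_per_minute": round(rate_per_minute, 1),
            "likely_bot": rate_per_minute > REQUESTS_PER_MINUTE_LIMIT,
        }
    return context

if __name__ == "__main__":
    now = datetime(2013, 6, 1, 12, 0, 0)
    burst = [{"session": "s1", "ts": now + timedelta(seconds=i * 0.2), "url": "/search"} for i in range(600)]
    human = [{"session": "s2", "ts": now + timedelta(seconds=i * 20), "url": "/cart"} for i in range(10)]
    print(flag_likely_bots(burst + human))
```

Run against a mix of a scripted burst and a slower, human-paced session, the burst is flagged and the human session is not. In practice the fixed threshold would be tuned, or replaced by a risk score derived from broader metadata, as described above.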
Each of these examples demonstrates what will be possible in the short term. In the long term we may record any and all useful or interesting data. If we can link in data sets that provide different views or help us make better decisions, we will. We already collect many of these data types; what we have been missing is the infrastructure to analyze them meaningfully.

Analysis: Doing It Better

One limitation of many SIEM platforms is their dependence on relational databases. Even if you strip away the relational constructs that limit insertion performance, they still rely on the SQL language and traditional query processors. The fundamental relational database architecture was designed and optimized for relational queries. Flexibility is severely limited by SQL – statements always include FROM and WHERE clauses, and we have a limited number of comparison operators for searching. At a high level we may have Java support, but the actual queries still devolve down to SQL statements. SQL may be a trusty
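To make the FROM/WHERE point concrete, here is a hedged illustration in Python rather than a definitive statement about any particular platform. The comment shows the fixed SELECT ... FROM ... WHERE shape a relational SIEM query is forced into; the function below shows the kind of user-relative, multi-step computation that big data platforms typically express as map/reduce-style code over raw events. The event format and names (login_events, rarity_scores) are hypothetical.

```python
# A relational SIEM query is constrained to the SELECT ... FROM ... WHERE shape, e.g.:
#   SELECT user, COUNT(*) FROM login_events WHERE status = 'failed' GROUP BY user;
# The sketch below is a single-process stand-in for a map/reduce-style job that
# computes an arbitrary derived value over raw events. All names are hypothetical.
from collections import Counter, defaultdict

def rarity_scores(login_events):
    """login_events: iterable of dicts like {"user": "alice", "host": "db01"}.
    Returns a score near 1.0 for user-to-host pairings that are rare for that user."""
    pair_counts = Counter()          # tally each (user, host) pairing (what a map/combine phase would emit)
    user_totals = defaultdict(int)   # total logins per user
    for event in login_events:
        pair_counts[(event["user"], event["host"])] += 1
        user_totals[event["user"]] += 1

    # Reduce-style step: a pairing seen rarely for that user scores close to 1.0 (suspicious).
    return {
        (user, host): 1.0 - (count / user_totals[user])
        for (user, host), count in pair_counts.items()
    }

if __name__ == "__main__":
    events = [{"user": "alice", "host": "mail"}] * 50 + [{"user": "alice", "host": "hr-db"}]
    for pair, score in sorted(rarity_scores(events).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 2))
```

The point is not that SQL cannot compute such a score (with subqueries or window functions it often can), but that expressing and evolving this kind of logic as code over raw, loosely structured events is far more flexible than working through fixed relational clauses and a rigid schema.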