Security Analytics with Big Data: Use Cases
Why do we use big data for security analytics? Aside from big data hype in the press, what motivates customers to look for new solutions? On the other side of the coin, why are vendors altering their products to use – or at least integrate with – big data? In our discussions with customers, they cite performance and scalability, particularly for security event analysis. In fact, this research project was originally outlined as a broad examination of the potential of big data for security analytics. But the customers we speak with don’t care about generalities – they need to solve existing problems, specifically around installed SIEM and log management systems. So we refocused this research on a specific need: to scale beyond what customers have today and get more from existing investments. Big data is a means to that end.

Today’s post focuses on customer use cases and delves into why SIEM, log management, and other event-centric monitoring systems struggle under evolving requirements.

Data velocity and clustered data management are new terms in IT, but they define two core characteristics of big data. This is no coincidence – as IT practitioners learn more about the promise of big data, they apply its capabilities to the problems of existing SIEM solutions. The inherent strengths of big data overlap beautifully with SIEM deficiencies in the areas of scalability, analysis speed, and rapid data insertion. And given the potential for greater analysis capabilities, big data is viewed as a way to both keep pace with exploding volumes of event data and do more with it.

Specific use cases drive interest in big data. Big data analytics capabilities are expanding, and they complement SIEM. But the reason big data is such a major trend is that it addresses important issues in existing platforms. To serve prospective buyers we need to understand the issues that drive them to investigate new products and solutions.
The basic issues above are the ones that always seem to plague SIEM: scaling, efficiency, and detection of threats – but those are generic placeholders for more specific demands.

Use Cases

More (Types of) Data – The problem we heard most often was “We need to analyze more types of data to get better analysis”. The reason to include more data types, beyond traditional netflow and syslog event streams, is to derive actionable information from the sea of data. Threat intelligence is not a simple signature, and detection is more complex than reviewing a single event. Communications data such as Twitter streams, blog comments, voice, and other rich data sources are unstructured and require different parsing algorithms to interpret. Netflow and syslog data is highly structured, with each element defined by its location within a record. Blog comments, phishing emails, botnet C&C, or malicious files? Not so much.

The problems with accommodating more types of data are scalability and usability. First, adding data types means handling more data, and existing systems often can’t handle any more. Adding capacity to already taxed systems often requires costly add-ons. Rolling out additional data collectors, and servers to process their output, takes months, and the cost in IT time can be prohibitive as well. And all of that assumes the SIEM architecture can scale up to greater volumes of data coming in faster. Second, many of these systems cannot handle alternative data types – either they normalize the data in a way that strips much of its value, or the system lacks suitable tools for analyzing alternate (raw) data types. Most systems have evolved to include configuration management and identity information, but they don’t handle Twitter feeds or diverse threat intelligence. Given evolving attack profiles, the flexibility to capture and dig into any data type is now a key requirement.
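To make the structured-versus-unstructured distinction concrete, here is a minimal Python sketch. The log line layout, field names, and sample text are all hypothetical and chosen only for illustration; real syslog and netflow formats differ, and the regular expressions are deliberately loose. The point is simply that a positional record can be parsed by location, while free text needs a different kind of extraction.

```python
import re

# Hypothetical, simplified log line: every field sits at a known position.
STRUCTURED = "2013-03-12T10:15:32Z fw01 DENY 10.0.0.5 203.0.113.9 443"

# Hypothetical free text, such as a blog comment or phishing email body.
UNSTRUCTURED = (
    "Hey, your account was locked! Verify at http://login.example.test/reset "
    "or contact 198.51.100.23 before Friday."
)

# Assumed field order for the illustrative record above.
FIELDS = ("timestamp", "host", "action", "src_ip", "dst_ip", "dst_port")


def parse_structured(line):
    """Map whitespace-separated tokens to field names purely by position."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError("unexpected field count")
    return dict(zip(FIELDS, tokens))


# Loose patterns for illustration; the IP pattern does not validate octet ranges.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_indicators(text):
    """Pull candidate URLs and IPv4 addresses out of free text."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so an IP inside a URL is not counted twice.
    remainder = URL_RE.sub(" ", text)
    ips = IP_RE.findall(remainder)
    return {"urls": urls, "ips": ips}


if __name__ == "__main__":
    print(parse_structured(STRUCTURED))
    print(extract_indicators(UNSTRUCTURED))
```

The structured parser is a few lines because position carries the meaning. The free-text extractor has to guess at what matters, which is why normalizing such data into fixed fields tends to strip much of its value.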
Anti-Drill-Down – We have seen steady advances in aggregation, correlation, dashboards, and data enrichment to help security folks identify security threats faster. But these iterative advancements have not kept pace with the volume of security data that needs to be parsed, nor with the diversity of attack signatures. Overall situational awareness has not improved, and the signal-to-noise ratio has gotten worse rather than better. The entire process – the entire mindset – has been called into question. Today the typical process is as follows:

a) An event or combination of events that looks interesting is captured.
b) SIEM correlates and enriches data to provide better context, analyzes data in terms of rules, and generates an alert if it detects an anomaly.
c) To verify that a suspicious event is indeed a threat, generally a human must “drill down” into a combination of machine-readable and human-readable data to make sense of it.

The security practitioner must cross-reference multiple data sources. Enrichment is handy, but too much manual analysis is still required to weed out false positives. In many cases the analyst extracts data to run other scripts or tools to produce the final analysis – we have even seen exports to MS Excel to find outliers and detect fraud. We need better analytics tools with more options than simple SQL queries and pattern matching. The types of analysis SIEMs can perform are limited, and most SIEM solutions lack programmatic extensions to enable more complex analysis. “The net result is we always get a blob of stuff we have to sift through, then verify, investigate, validate, and often adjust the policy to filter out more detritus.” The anti-drill-down use case offers more automated checking, using more powerful analytics and data mining tools than simple scripts and SQL queries.

Architectural Limitations – Some customers attribute their performance issues – especially the lack of timely threat analysis – to SIEM architecture and process.
It takes time to gather data, move it to a central location, normalize it, correlate it, and then enrich it. This generally makes near-real-time analysis a fantasy. Queries run on centralized event servers and often take minutes to complete, while compliance reports generally take hours. Some users report that the volume of data stresses their systems, and queries on relational servers take too long to complete. Centralized computation limits the speed and timeliness of analysis and reporting. The current