As we discussed in our last post on Critical Incident Response Gaps, we tend to gather too much of the wrong kinds of information, too early in the process. To clarify that a little bit, we are still fans of collecting as much data as you can, because once you miss the opportunity to collect something you’ll never get another chance. Our point is that there is a tendency to try to boil the ocean with analysis of all sorts of data. That causes failure and has plagued technologies like SIEM, because customers try to do too much too soon.

Remember, the objective from an operational standpoint is to react faster, which means discovering as quickly as possible that you have an issue, and then engaging your incident response process. But merely responding quickly isn’t useful if your response is inefficient or ineffective, which is why the next objective is to react better.

Collecting the Right Data at the Right Time

Balancing all the data collection sources available today is like walking a high wire, in a stiff breeze, after knocking a few back at the local bar. We definitely don’t lack for potential information sources, but many organizations find themselves either overloaded with data or missing key information when it’s time for investigation. The trick is to realize that you need three kinds of data:

  1. Data to support continuous monitoring and incident alerts/triggers. This is the stuff you look at on a daily basis to figure out when to trigger an incident.
  2. Data to support your initial response process. Once an incident triggers, these are the first data sources you consult to figure out what’s going on. This is a subset of all your data sources. Keep in mind that not all incidents will tie directly to one of these sources, so sometimes you’ll still need to dive into the ocean of lower-priority data.
  3. Data to support post-incident investigation and root cause analysis. This is a much larger volume of data, some of it archived, used for the full in-depth investigation.
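As a rough sketch, you could think of this taxonomy as a simple mapping from data sources to the response tier they support. The source names below are illustrative assumptions, not a recommended inventory:

```python
# Hypothetical sketch: classifying collected data sources into the three
# tiers described above. Source names are illustrative examples only.
DATA_TIERS = {
    "monitoring": [          # tier 1: continuous monitoring, alerts/triggers
        "firewall_logs", "ids_alerts", "netflow",
    ],
    "initial_response": [    # tier 2: first sources consulted after a trigger
        "server_auth_logs", "database_audit_logs",
    ],
    "forensics": [           # tier 3: deep archive for root cause analysis
        "full_packet_capture", "endpoint_images", "archived_logs",
    ],
}

def tier_of(source):
    """Return which response tier a given data source supports."""
    for tier, sources in DATA_TIERS.items():
        if source in sources:
            return tier
    return "unclassified"
```

The point of the exercise is less the code than the discipline: every source you collect should have a known tier, so you know whether it belongs in daily analysis or the deep archive.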

One of the Network Security Fundamentals I wrote about early in the year was called Monitor Everything, because I fundamentally believe in data collection and driving issue identification from the data. Adrian pushed back pretty hard, pointing out that monitoring everything may not be practical, and that the focus should be on monitoring the right stuff. As usual, the answer is somewhere in the middle: collect (almost) everything and analyze the right stuff. That seems to make the most sense.

Collection is fairly simple. You can generate a tremendous amount of data, but with the log management tools available today scale is generally not an issue. Analysis of that data, on the other hand, is still very problematic; when we mention too much of the wrong kinds of information, that’s what we are talking about. To address this issue, we advocate segmenting your network into vaults and analyzing traffic and events within the critical vaults at a deep level.

So basically it’s about collecting all you can within the limits of reason and practicality, then analyzing the right information sources for early indications of problems, so you can then engage the incident response process. You start with a set of sources to support your continuous monitoring and analysis, followed by a set of prioritized data to support initial incident management, and close with a massive archive of different data sources, again based on priorities.

Continuous Monitoring

We have done a lot of research into SIEM and Log Management, as well as advanced monitoring (Monitoring up the Stack). That’s the kind of information to use in your ongoing operational analysis. For those vaults (trust zones) you deem critical, you want to monitor and analyze:

  • Perimeter networks and devices: The bad guys tend to start out there, so they need to cross the perimeter to get to the good stuff. Look for issues on those devices.
  • Identity: Who is as important as what, so analyze access to specific resources – especially within a privileged user context.
  • Servers: We are big fans of anomaly detection and whitelisting on critical servers such as domain controllers and app servers, so you can be alerted to funky stuff happening at the server level – which usually indicates something that warrants investigation.
  • Database: Likewise, correlating database anomalies against other types of traffic (such as reconnaissance and network exfiltration) can indicate a breach in progress. Better to know that early, before your credit card brand notifies you.
  • File Integrity: Most attacks involve some change to key system files, so by monitoring their integrity you can pinpoint when an attacker is trying to make changes. You can even block these attacks using technology like HIPS, but that’s another story for another day.
  • Application: Finally, you should be able to profile normal transactions and user interactions for your key applications (those accessing protected data) and watch for non-standard activities. Again, they don’t always indicate a problem, but do allow you to prioritize investigation.
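To make the server whitelisting idea above concrete, here is a minimal sketch. The approved baseline is a hypothetical example; in practice it would come from profiling the server during known-good operation:

```python
# Minimal whitelisting sketch for a critical server: anything running
# that is not on the approved baseline warrants investigation.
# The baseline below is a hypothetical example, not a recommendation.
APPROVED = {"sshd", "ntpd", "mysqld", "httpd"}

def flag_anomalies(running_processes):
    """Return the processes that fall outside the approved baseline."""
    return sorted(set(running_processes) - APPROVED)
```

A real deployment would baseline far more than process names (binaries, hashes, network behavior), but the alerting logic is the same: deviation from the profile triggers investigation, not automatic blocking.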

We recommend focusing on your most important zones, but keep in mind that you need some baseline monitoring of everything. The two most common sources we see for baselines are network monitoring and endpoint & server logs (or whatever security tools you have on those systems).

Full Packet Capture Sandwich

One emerging advanced monitoring capability – the most interesting to us – is full packet capture. Rich wrote about this earlier this year. Basically these devices capture all the traffic on a given network segment. Why? In a nutshell, it’s the only way you can really piece together exactly what happened, because you have the actual traffic. In a forensic investigation this is absolutely crucial, and it provides detail you cannot get from log records.

Going back to our Data Breach Triangle, you need some kind of exfiltration for a real breach. So we advocate heavy perimeter egress filtering and monitoring, to (hopefully) prevent valuable data from escaping your network. Capturing all network traffic isn’t really practical for any organization of scale, but perimeter traffic should be doable.

Additionally, along with our vaulting concepts, we recommend organizations look to deploy full packet capture on the most critical internal segments as well. That’s what attackers will go for, so if you are capturing data from the key internal networks as well as perimeter traffic (which is why we call this the sandwich), you have a better chance to piece together what happened.

We believe monitoring these sources for the critical vaults (trust zones) and integrating full packet capture needs to be a key part of your security operational processes. What about less-critical internal zones? You can probably minimize analysis by just focusing on things like IDS alerts and NetFlow output. That should be enough to pinpoint egregious issues and investigate as appropriate. But you probably can’t analyze and/or capture all traffic on all your networks at all times – this is likely to be well past the point of diminishing returns.

Sources and Sizing

The good news is that you are likely already collecting most of the data you need, which tends to be in the form of log records. Regardless of how deeply you analyze the data, you want to collect as much as is feasible. That means pulling logs from pretty much all the devices you can. Depending on the platform you use for data collection, you can implement a set of less sophisticated log aggregators and only send data from critical segments upstream to a SIEM for analysis. We described this ring architecture on pages 23-26 of Understanding and Selecting SIEM/Log Management.
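The ring architecture boils down to one routing decision per event: the local aggregator keeps everything, but only events from critical segments go upstream. A minimal sketch of that decision, with hypothetical segment addresses:

```python
# Sketch of the "ring" routing decision: the local aggregator stores all
# events, but only forwards those from critical segments to the SIEM.
# The segment CIDRs below are hypothetical examples.
import ipaddress

CRITICAL_SEGMENTS = [
    ipaddress.ip_network("10.10.0.0/24"),   # e.g. the cardholder-data vault
    ipaddress.ip_network("10.20.5.0/24"),   # e.g. domain controllers
]

def forward_to_siem(event):
    """Decide whether a log event should go upstream for SIEM analysis."""
    src = ipaddress.ip_address(event["source_ip"])
    return any(src in net for net in CRITICAL_SEGMENTS)
```

In practice this filtering lives in your aggregator's forwarding rules rather than custom code, but the principle is the same: cheap storage everywhere, expensive analysis only where it counts.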

In addition to your logs, you also want to mine your identity system, the configurations of your key devices, and network flow data. Many security analysis platforms can gather data from all sorts of sources, so that is less of a constraint.

From a sizing standpoint, we believe it’s important to be able to analyze log data over at least a 90-day period. Today’s attackers are patient and persistent, meaning they aren’t just trying to do a smash and grab – they stretch their attack timelines to 30, 60, or even 90 days. So you have two vectors for sizing your system: the number of critical segments you analyze and how long you keep the data. We prefer greater retention across more critical resources, rather than retaining and analyzing everything over a short time horizon.
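The 90-day retention target translates directly into a back-of-the-envelope storage estimate. The event rate and average event size below are hypothetical inputs; plug in your own measurements:

```python
# Back-of-the-envelope log storage sizing for a retention window.
# events_per_second and avg_event_bytes are measured inputs; the
# example numbers below are hypothetical.
def log_storage_gb(events_per_second, avg_event_bytes, retention_days):
    """Estimate raw (uncompressed) log storage for the retention window."""
    seconds = retention_days * 24 * 60 * 60
    return events_per_second * avg_event_bytes * seconds / 1e9

# Example: 2,000 events/sec at 300 bytes each over 90 days comes to
# roughly 4,666 GB of raw data before compression.
```

Compression typically shrinks this considerably, but the raw number is the honest starting point for deciding how many segments you can afford to retain at full depth.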

For full packet capture – depending on the size of your perimeter, key network segments, and your storage capabilities – it may not be realistic to capture more than 4-7 days of traffic. If so, you’ll need to continue traditional log monitoring and analysis on your perimeter and critical segments to catch the low and slow attacks, with full packet capture for the last week to dig deeper once you’ve identified the pattern.
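The same kind of arithmetic shows why the capture window ends up at only a few days. Given a storage budget and a sustained link rate (the figures below are hypothetical), the window falls out directly:

```python
# Rough sizing for a full packet capture window: how many days of
# traffic fit in a given storage budget. The example link rate and
# storage budget are hypothetical.
def capture_days(storage_tb, avg_link_mbps):
    """Days of capture a storage budget supports at a sustained link rate."""
    bytes_per_day = avg_link_mbps * 1e6 / 8 * 86400
    return storage_tb * 1e12 / bytes_per_day

# Example: 20 TB of storage on a link averaging a sustained 400 Mbps
# supports roughly 4.6 days of capture, consistent with the 4-7 day
# window mentioned above.
```

Run the numbers for your own perimeter before committing to a capture window; sustained utilization, not peak link speed, is what drives the estimate.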

This is the first leg of the data collection environment: what you need to do on an ongoing basis. Another incident response gap we pointed out is not having enough of the right data later in the process. That means you must be able to dig deep when doing forensic analysis, to ensure you understand the attack and respond appropriately. Next we’ll dig into what that means and how your tactics must evolve given the new types of persistent attacks in play.