React Faster and Better: Respond, Investigate, and Recover
By Rich
After you have validated and filtered the initial alert, then escalated to contain and respond to the incident, you may need to escalate for further specialized response, investigation, and (hopefully) recovery.
This next layer of escalation varies more among the organizations we have talked with than the earlier layers do, due to large differences in available resources, skill sets, and organizational priorities. As with the rest of this series, though, the essential roles are fairly consistent.
Tier 3: Respond, Investigate, and Recover
Tier 3 is where incident response management, specialized resources, and the heavy hitters reside. In some cases escalation may be little more than a notification that something is going on. In others it might be a request for a specialist such as a malware analyst for endpoint forensics analysis. This is also the level where most in-depth investigation is likely to occur – including root cause analysis and management of recovery operations. Finally, this level might include all-hands-on-deck response for a massive incident with material loss potential.
Despite the variation in when Tier 3 begins, the following structure aligns at a high level with the common processes we see:
- Escalate response: Some incidents, while not requiring the involvement of higher management, may need specialized resources that aren’t normally involved in a Tier 2 response. For example, if an employee is suspected of leaking data you may need a forensic examiner to look at their laptop. Other incidents require the direct involvement of incident response management and top-tier response professionals. We have listed this as a single step, but it is really a self-contained response cycle of constantly evaluating needs and pulling in the right people – all the way up to executive management if necessary.
- Investigate: You always investigate to some degree during an incident, but depending on its nature there may be far more investigation after initial containment and remediation. As with most steps in Tier 3, the lines aren’t necessarily black and white. For certain kinds of incidents – particularly advanced attacks – the investigation and response (and even containment) are carried out in lockstep. For example, if you detect customized malware, you will need to perform a concurrent malware analysis, system forensic analysis, and network forensic analysis.
- Determine root cause: Before you can close an incident you need to know why it happened and how to prevent it from happening again. Was it a business process failure? Human error? Technical flaw? You don’t always need this level of detail to remediate and get operations back up and running on a temporary basis, but you do need it to fully recover – and more importantly to ensure it doesn’t happen again. At least not using the same attack vector.
- Recover: Remediation gets you back up and running in the short term, but in recovery you finish closing the holes and restore normal operations. The bulk of recovery operations are typically handled by non-security IT operations teams, but at least partially under the direction of the security team. Permanent fixes are applied, permanent holes closed, and any restored data examined to ensure you aren’t re-introducing the very problems that allowed the incident in the first place.
- (Optional) Prosecute or Discipline: Depending on the nature of the incident you may need to involve law enforcement and carry a case through to prosecution, or at least discipline/fire an employee. Since nothing involving lawyers except billing ever moves quickly, this can extend many years beyond the official end of an incident.
Tier 3 is where the buck stops. There are no other internal resources to help if an incident exceeds capabilities. In that case, outside contractors/specialists need to be brought in, who are then (effectively) added to your Tier 3 resources.
We described Tier 1 as dispatchers, and Tier 2 as firefighters. Sticking with that analogy, Tier 3 is composed of chiefs, arson investigators, and rescue specialists. These are the folks with the strongest skills and most training in your response organization.
- Primary responsibilities: Ultimate incident management. Tier 3 handles incidents that require senior incident management and/or specialized skills. These senior individuals manage incidents, use their extensive skills for complex analysis and investigation, and coordinate multiple business units and teams. They also coordinate, train, and manage lower level resources.
- Incidents they manage: Anything that Tier 2 can’t handle. These are typically large or complex incidents, or more-constrained incidents that might involve material losses or extensive investigation. A good rule of thumb is that if you need to inform senior or executive management, or involve law enforcement and/or human resources, it’s likely a Tier 3 incident. This tier also includes specialists such as forensics investigators, malware analysts, and those who focus on a specific domain as opposed to general incident response.
- When they escalate: If the incident exceeds the combined response capabilities of the organization. In other words, if you need outside help, or if something is so bad (e.g., a major public breach) that executive management becomes directly involved.
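The rule of thumb above can be sketched as a small routing function. This is only an illustration of the decision logic: the `Incident` fields, the `route` function, and the returned labels are hypothetical names invented for this example, and a real organization would define its own criteria.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All field names are hypothetical, chosen to mirror the criteria in the text.
    exceeds_tier2: bool = False                  # Tier 2 cannot handle it
    material_loss_possible: bool = False         # potential material losses
    needs_executive_notification: bool = False   # senior/executive management must be told
    needs_law_enforcement_or_hr: bool = False    # law enforcement and/or HR involved
    needs_specialist: bool = False               # e.g. forensics or malware analysis
    exceeds_internal_capabilities: bool = False  # beyond what the organization can do alone


def route(incident: Incident) -> str:
    """Return a label for who should own the incident (labels are assumptions)."""
    # Tier 3 escalates only when internal resources are exhausted.
    if incident.exceeds_internal_capabilities:
        return "tier3+external"
    # Any one of the rule-of-thumb criteria makes this a Tier 3 incident.
    if any([
        incident.exceeds_tier2,
        incident.material_loss_possible,
        incident.needs_executive_notification,
        incident.needs_law_enforcement_or_hr,
        incident.needs_specialist,
    ]):
        return "tier3"
    return "tier2"


if __name__ == "__main__":
    suspected_leak = Incident(needs_law_enforcement_or_hr=True, needs_specialist=True)
    print(route(suspected_leak))
```

In practice these calls are judgment-based rather than boolean, so treat the sketch as a checklist, not as automation.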
These responders and managers have a combination of broad and deep skills. They manage large incidents with multiple factors and perform the deep investigations to support full recovery and root cause analysis. They tend to use a wide variety of specialized tools, including those they write themselves. It’s impossible to list all the options out, but here are the main categories:
- Network (full packet capture) forensics: You’ve probably noticed this category appearing at all the levels. While the focus in the other response tiers is more on alerting and visualization, at this level you are more likely to dig deep into the packets to fully understand what’s going on for both immediate response and later investigation. If you don’t capture it you can’t analyze it, and full packet capture is essential for the advanced incident response that is the focus here. Once data is gone you can’t get it back – thus our incessant focus on capturing as much as you can, when you can.
- Endpoint forensics: From investigating an employee stealing data for their next job, to analyzing a compromised server, endpoint forensics are essential for any investigation. In previous tiers we focused more on the network side because we were more concerned with initial response and containment, but now is the time to dig in and perform the detailed investigation. This is where we need to really understand what happened, and that typically requires a deep look at the endpoint.
- Log management and analysis: Keeping with the emphasis on deeper investigation, logs become one of the most important resources. Collection and analysis tools speed up the process, even when you also rely on your own home-grown scripts and tools. Indexing and search become critical aspects of these tools, as the sheer amount of data can be overwhelming and most is not relevant to any particular investigation. Also pay attention to the capability of the tool to visualize the data and drill down. Basically anything that is going to help you pinpoint what happened and why might be useful here.
- Database and application lab: These aren’t tools like the rest, but places to replicate databases and/or applications for offline analysis. In the real world you can’t always do this in an isolated lab, so you might have to work on a live system (including analysis of volatile memory). Both options might be as simple as reviewing source code and scripts, or as complex as performing live memory analysis on a production application.
- Forensics and malware analysis lab: Although this might be the same as your database and application lab, you’ll use a different set of tools for this analysis. Here you are more concerned with picking things apart than replicating a production environment. One-way write blockers and debuggers are tools of choice.
- Other security tools: Some other useful security tools we see at this level are Data Loss Prevention, Web Application Firewalls, Database Activity Monitoring, and File Activity Monitoring. They provide information that slips through gaps in the rest of a monitoring infrastructure, because these tools are purpose built to dig into very specific aspects of IT operations. Technically you could recover much of what they provide through other sources like network forensics, but these offer more context for quicker analysis (and, in the case of DAM, might be your only source). These are not deployed nearly as widely as most other security tools, especially the relatively new File Activity Monitors, but we do see them appearing more often.
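As an illustration of the home-grown scripts mentioned under log management, here is a minimal sketch that narrows a pile of log lines down to a time window and a set of indicators. The log format (`<ISO-8601 timestamp> <host> <message>`), the function names, and the sample data are all assumptions made for this example; real logs will need their own parser.

```python
from datetime import datetime


def parse_line(line):
    # Assumed format: "<ISO-8601 timestamp> <host> <message>".
    # Raises ValueError if the line has too few fields or a bad timestamp.
    ts, host, message = line.split(" ", 2)
    return datetime.fromisoformat(ts), host, message


def search_logs(lines, start, end, indicators):
    """Return (timestamp, host, message) tuples that fall inside [start, end]
    and whose message contains at least one indicator string."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            ts, host, message = parse_line(line)
        except ValueError:
            continue  # skip lines that do not match the assumed format
        if start <= ts <= end and any(ind in message for ind in indicators):
            hits.append((ts, host, message))
    return sorted(hits)  # chronological order


# Hypothetical sample data, using documentation/private IP ranges.
SAMPLE = [
    "2011-01-05T10:14:59 web01 GET /index.html 200",
    "2011-01-05T10:15:30 db01 login failed for user admin from 10.0.0.5",
    "2011-01-05T10:16:02 web01 outbound connection to 203.0.113.7:443",
    "2011-01-05T11:30:00 web01 outbound connection to 203.0.113.7:443",
    "garbage line",
]

if __name__ == "__main__":
    window_start = datetime(2011, 1, 5, 10, 15)
    window_end = datetime(2011, 1, 5, 10, 30)
    for ts, host, message in search_logs(
        SAMPLE, window_start, window_end, ["203.0.113.7", "10.0.0.5"]
    ):
        print(ts.isoformat(), host, message)
```

A linear scan like this is fine for a few files, but it is exactly the kind of work that indexing and search in a log management tool are meant to replace at scale.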
The key thing to remember about Tier 3 is that it’s more about the people than the tools or even process. This is where the most senior and experienced incident response professionals reside, regardless of how you equip them and assign responsibilities. That said, proper tools will make those people and processes much more effective. But keep reality clearly in focus: these folks represent the last line of defense, so if you exceed their capabilities, your options are either to live with the losses or bring in outside help.
We’ll wrap up this series with a reasonably detailed scenario showing the process in action.