Incident Response Fundamentals: Contain, Investigate, and Mitigate
By Rich
In our last post, we covered the first steps of incident response – the trigger, escalation, and size up. Today we’re going to move on to the next three steps – containment, investigation, and mitigation.
Now that I’m thinking bigger picture, incident response really breaks down into three large phases. The first phase covers your initial response – the trigger, escalation, size up, and containment. It’s the part when the incident starts, you get your resources assembled and responding, and you take a stab at minimizing the damage from the incident.
The next phase is the active management of the incident. We investigate the situation and actively mitigate the problem.
The final phase is the clean up, where we make sure we really stopped the incident, recover from the after effects, and try to figure out why this happened and how we can prevent it in the future. This includes the mop up (cleaning the remnants of the incident and making sure there aren’t any surprises left behind), your full investigation and root cause analysis, and Q/A (quality assurance) of your response process.
Now since we’re writing this as we go, I technically should have included containment in the previous post, but didn’t think of it at the time. I’ll make sure it’s all fixed before we hit the whitepaper.
Containing an ongoing incident is one of the most challenging tasks in incident response. You lack a complete picture of what’s going on, yet you have to take proactive actions to minimize damage and potential incident growth. And you have to do it fast. Adding to the difficulty is the fact that in some cases your instincts to stop the problem may actually exacerbate the situation.
This is where training, knowledge, and experience are absolutely essential. Specific plans for certain major incident categories are also important.
- For “standard” virus infections and attacks your policy might be to isolate those systems on the network so the infection doesn’t spread. This might include full isolation of a laptop, or blocking any email attachments on a mail server.
- For those of you dealing with a well-funded persistent attacker (yeah, APT), the last thing you want to do is start taking known infected systems offline. This usually leads the attacker to trigger a series of deeper exploits, and you might end up with five compromised systems for every one you clean. In this case your containment may be to stop putting new sensitive data in any location accessed by those compromised systems (this is just an example; responding to these kinds of attackers is most definitely a complex art in and of itself).
- For employee data theft, you first get HR, legal, and physical security involved. They may direct you to instantly lock the employee out, or perhaps just monitor their device and/or limit access to sensitive information while they build a case.
- For compromise of a financial system (like credit card processing), you may decide to suspend processing and/or migrate to an alternative platform until you can determine the cause later in your response.
These are just a few quick examples, but the goal is clear – make sure things do not get worse. But you have to temper this defensive instinct with any needs for later investigation/enforcement, the possibility that your actions might make the situation worse, and the potential business impact. And although it’s not possible to build scenarios for every possible incident, you want to map out your intended responses for the top dozen or so, to make sure that everyone knows what they should be doing to contain the damage.
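One way to map out those intended responses is a simple lookup from incident category to pre-agreed containment steps. The sketch below is only an illustration: the category keys, the `Playbook` type, and the `containment_plan` function are hypothetical names, and the actions are paraphrased from the examples above rather than taken from any real plan. Anything without a playbook falls back to escalation instead of improvised action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    """Pre-agreed containment guidance for one incident category (hypothetical type)."""
    category: str
    first_actions: tuple[str, ...]
    avoid: tuple[str, ...] = ()  # instinctive moves that may make things worse


# Hypothetical category keys; the actions paraphrase the examples in this post.
PLAYBOOKS = {
    "commodity_malware": Playbook(
        category="commodity_malware",
        first_actions=(
            "Isolate infected systems on the network",
            "Block email attachments on the mail server",
        ),
    ),
    "persistent_attacker": Playbook(
        category="persistent_attacker",
        first_actions=(
            "Stop placing new sensitive data where compromised systems can reach it",
        ),
        avoid=("Taking known infected systems offline right away",),
    ),
    "employee_data_theft": Playbook(
        category="employee_data_theft",
        first_actions=(
            "Involve HR, legal, and physical security",
            "Follow their direction: lock out, monitor, or limit access",
        ),
    ),
    "financial_system_compromise": Playbook(
        category="financial_system_compromise",
        first_actions=(
            "Consider suspending processing",
            "Consider migrating to an alternative platform",
        ),
    ),
}

# Assumed fallback for incidents nobody planned for.
DEFAULT = Playbook(
    category="unplanned",
    first_actions=("Escalate for a containment decision before acting",),
)


def containment_plan(category: str) -> Playbook:
    """Return the agreed playbook, or the escalation fallback if none exists."""
    return PLAYBOOKS.get(category, DEFAULT)
```

Keeping an explicit `avoid` field is a design choice that reflects the point made earlier: in some incidents the instinctive response is exactly what you should not do.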
At this point you have a general idea of what’s going on and have hopefully limited the damage. Now it’s time to really dig in and figure out exactly what you are facing.
Remember – at this point you are in the middle of an active incident; your focus is to gather just as much information as you need to mitigate the problem (stop the bad guys, since this series is security-focused) and to collect it in a way that doesn’t preclude subsequent legal (or other) action. Now isn’t the time to jump down the rabbit hole and determine every detail of what occurred, since that may draw valuable resources from the actual mitigation of the problem.
The nature of the incident will define what tools and data you need for your investigation, and there’s no way we can cover them all in this series. But here are some of the major options, some of which we’ll cover in more detail when we get to deeper investigation and root cause analysis later in the process.
- Network security monitoring tools: This includes a range of network security tools such as network forensics, DLP, IDS/IPS, application control, and next-generation firewalls. The key is that the more useful tools not only collect a ton of information, but also include analysis and/or correlation engines that help you quickly sift through massive volumes of it.
- Log Management and SIEM: These tools collect a lot of data from heterogeneous sources that you can use to support investigations. Log Management and SIEM are converging, which is why we include both of them here. You can check out our report on this technology to learn more.
- System Forensics: A good forensics tool (or set of tools) is one of your best friends in an investigation. While you might not use its full capabilities until later in the process, it allows you to collect forensically-valid images of systems to support later investigations while providing valuable immediate information.
- Endpoint OS and EPP logs: Operating systems collect a fair bit of log information that may be useful to pinpoint issues, as does your endpoint protection platform (most of the EPP data is likely synced to its server). Access logs, if available, may be particularly useful in any incident involving potential data loss.
- Application and Database Logs: Including data from security tools like Database Activity Monitoring and Web Application Firewalls.
- Identity, Directory and DHCP logs: To determine who logged in when, and what IP addresses were assigned. In many investigations, understanding who is involved is as important as figuring out what they are doing.
- Analysis Tools: Sorting through all this data isn’t necessarily the easiest thing in the world. Most incident responders and investigators use a collection of tools/scripts to help sort through the data and focus on what’s important.
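As an illustration of the kind of small script responders lean on, the sketch below ties an IP address seen in an alert back to a host via DHCP leases, then to users via login records. It assumes the logs have already been parsed into records. The `Lease` and `Login` types, their field names, and the sample data are all hypothetical, and real DHCP and directory logs will need their own parsing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Lease:
    """One parsed DHCP lease (hypothetical record layout)."""
    ip: str
    hostname: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class Login:
    """One parsed login event from identity/directory logs (hypothetical layout)."""
    user: str
    hostname: str
    time: datetime


def host_for_ip(leases: Iterable[Lease], ip: str, when: datetime) -> Optional[Lease]:
    """Return the lease that held `ip` at time `when`, or None if no lease matches."""
    for lease in leases:
        if lease.ip == ip and lease.start <= when < lease.end:
            return lease
    return None


def users_on_host(logins: Iterable[Login], hostname: str,
                  when: datetime, window: timedelta) -> list[str]:
    """Return sorted user names that logged in to `hostname` within `window` of `when`."""
    return sorted({
        login.user for login in logins
        if login.hostname == hostname and abs(login.time - when) <= window
    })


if __name__ == "__main__":
    # Made-up sample data: the same IP is reassigned at noon.
    leases = [
        Lease("10.0.0.5", "laptop-a", datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 12)),
        Lease("10.0.0.5", "laptop-b", datetime(2024, 1, 1, 12), datetime(2024, 1, 1, 18)),
    ]
    logins = [
        Login("alice", "laptop-a", datetime(2024, 1, 1, 8, 5)),
        Login("bob", "laptop-b", datetime(2024, 1, 1, 12, 30)),
    ]
    alert_time = datetime(2024, 1, 1, 13)
    lease = host_for_ip(leases, "10.0.0.5", alert_time)
    if lease is not None:
        print(lease.hostname, users_on_host(logins, lease.hostname, alert_time, timedelta(hours=1)))
```

The lease lookup matters because the same address can belong to different machines over the course of a day, so matching on IP alone can point at the wrong system.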
Essentially we are using all the information collected by the tools described in our data collection and monitoring post.
For those of you with the resources and expertise, additional tools such as decompilers and code analysis may be extremely useful.
Finally, many of you in smaller organizations will need to rely more on the output from off-the-shelf security tools. These are only helpful when they collect and present the right information. For example, an IDS with the wrong rules set is pretty useless. This is, again, where experience and knowledge of the risks and threats your organization faces are key. It also highlights the importance of learning from every incident, so that both your response and your tools improve each time.
Mitigation is the shortest section of our process to describe, but the most important.
Following our incident command principles, “management by objective” comes into play here. To mitigate an incident we set a series of discrete, achievable goals, and assign resources to handle them.
For example, let’s take an old-fashioned malware infection. While the overall goal is to stop the infection and clean your systems, that’s not very specific or achievable without more detail. Your incident action plan might look more like:
- Update our email security gateway signatures to block the current malware at the mail server, both internally and from the outside.
- Temporarily disable attachments of file types known to be vectors for the infection.
- Use our EPP tool with an updated signature to determine which systems are infected. If the vendor can’t provide a signature, send resources to the lab to determine whether there is an alternative way to detect the infection – such as outbound network traffic or a custom scanning script in conjunction with the vulnerability scanner.
- Suspend write access on file servers hosting potentially infected files. Possibly disable file sharing on common repositories, if needed.
- Block any outbound network channels known to be used by the malware – this could be specific command and control destinations, or entire ports & protocols.
- Create a list of all known infected systems.
- Assign resources to clean and/or re-image any known infected systems. Do so in an orderly way with a timeline, ensuring that data to help the investigation is captured properly before the machines are blown away.
- Assign resources to determine whether further (deeper) scanning is required to identify infected systems.
- Scan file repositories for additional malware in centralized storage.
- If there is a vector for the malware to potentially infect servers, create a scanning and investigation plan to identify infected servers. …
Something like this should get you through the initial mitigation, and then you need an action plan for the rest of your investigation and to restore the organization to normal operations.
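To make "management by objective" concrete, an action plan like the one above can be tracked as a list of discrete objectives, each with an owner and a status. This is a minimal sketch, not a prescribed tool: the `Objective` and `ActionPlan` types, the status values, and the team names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    IN_PROGRESS = "in progress"
    DONE = "done"


@dataclass
class Objective:
    """One discrete, achievable goal with a named owner (hypothetical type)."""
    description: str
    owner: str
    status: Status = Status.OPEN


@dataclass
class ActionPlan:
    """An incident action plan: an ordered list of objectives."""
    incident: str
    objectives: list[Objective] = field(default_factory=list)

    def add(self, description: str, owner: str) -> Objective:
        """Append a new open objective and return it so its status can be updated."""
        objective = Objective(description, owner)
        self.objectives.append(objective)
        return objective

    def outstanding(self) -> list[Objective]:
        """Return every objective that is not yet done, in plan order."""
        return [o for o in self.objectives if o.status is not Status.DONE]


if __name__ == "__main__":
    # Team names are placeholders; objectives are taken from the example plan.
    plan = ActionPlan("malware infection")
    gateway = plan.add("Update email security gateway signatures", "mail team")
    plan.add("Block outbound channels known to be used by the malware", "network team")
    plan.add("Create a list of all known infected systems", "desktop support")

    gateway.status = Status.DONE
    for objective in plan.outstanding():
        print(f"{objective.owner}: {objective.description} [{objective.status.value}]")
```

Requiring an owner on every objective mirrors the point that goals only get met when resources are assigned to them.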
An incident is considered mitigated once you have halted the spread of damage, and regained the ability to continue operations. That doesn’t mean everything is back to normal, or that the incident is closed, but rather that the worst is over and you can start trying to figure out what really happened, and begin returning to completely normal operations. We will begin that discussion next.