TI+IR/M: The New Incident (Response) & Management Process Model
By Mike Rothman
Now that we have the inputs (both internal and external) to our incident response/management process, we are ready to go operational. So let’s map out the IR/M process in detail to show where threat intelligence and other security data allow you to respond faster and more effectively.
Trigger and Escalate
You start the incident management process with a trigger that kicks off the incident response process, and the basic information you gather varies based on what triggered the alert. You may get alerts from all over the place, including any of your monitoring systems and the help desk. Nobody has a shortage of alerts – the problem is finding the critical alerts and taking immediate action. Not all alerts require a full incident response – much of what you already deal with on a day-to-day basis is handled by existing security processes. Incident response/management is about those situations that fall outside the range of your normal background noise.
Where do you draw the line? That depends entirely on your organization. In a small business, a single system infected with malware might require a response because all devices have access to critical information. But a larger company might handle the same infection within standard operational processes. Regardless of where the line is drawn, communication is critical. All parties must be clear on which situations require a full incident investigation and which do not before you can decide whether to pull the trigger.
For any incident you need a few key pieces of information early to guide next steps. These include:
- What triggered the alert?
- If someone was involved or reported it, who are they?
- What is the reported nature of the incident?
- What is the reported scope of the incident? This is basically the number and nature of systems/data/people involved.
- Are any critical assets involved?
- When did the incident occur, and is it ongoing?
- Are there any known precipitating events for the incident? Is there a clear cause?
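The questions above can be captured in a small intake record so the initial responder sees at a glance what is still unknown. This is a minimal sketch, not a prescribed schema: the class name `IncidentIntake` and every field name are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class IncidentIntake:
    """Initial picture of an incident; all field names are illustrative."""
    trigger: str                                   # what triggered the alert
    reporter: Optional[str] = None                 # who reported it, if anyone
    nature: Optional[str] = None                   # reported nature of the incident
    systems_involved: List[str] = field(default_factory=list)  # reported scope
    critical_assets: List[str] = field(default_factory=list)   # empty if none known
    occurred_at: Optional[datetime] = None
    ongoing: Optional[bool] = None                 # None means "not yet known"
    known_cause: Optional[str] = None

    def open_questions(self) -> List[str]:
        """Return the names of fields the responder has not filled in yet."""
        missing = []
        for name in ("reporter", "nature", "occurred_at", "ongoing", "known_cause"):
            if getattr(self, name) is None:
                missing.append(name)
        if not self.systems_involved:
            missing.append("systems_involved")
        return missing
```

An intake created from a monitoring alert will usually start with most fields empty; `open_questions()` then doubles as the responder's checklist.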
Gather what you can from this list to provide an initial picture of what’s going on. When the initial responder judges an incident to be more serious it’s time to escalate. You should have guidelines for escalation, such as:
- Involvement of designated critical data or systems.
- Malware infecting a certain number of systems.
- Sensitive data detected leaving the organization.
- Unusual traffic/behavior that could indicate an external compromise.
Once you escalate it is time to assign an appropriate resource, request additional resources if needed, and begin the response with triage.
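The escalation guidelines listed above could be written down as explicit rules so every responder applies the same line. The sketch below is one possible encoding; the threshold of five systems and all names (`Assessment`, `escalation_reasons`, `should_escalate`) are hypothetical, since where the line sits depends on your organization.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold: the guideline only says "a certain number of systems".
MALWARE_SYSTEM_THRESHOLD = 5


@dataclass
class Assessment:
    """What the initial responder knows so far (illustrative field names)."""
    critical_assets_involved: bool = False
    infected_system_count: int = 0
    sensitive_data_leaving: bool = False
    unusual_external_traffic: bool = False


def escalation_reasons(a: Assessment) -> List[str]:
    """Return every guideline the incident trips; an empty list means no escalation."""
    reasons = []
    if a.critical_assets_involved:
        reasons.append("designated critical data or systems involved")
    if a.infected_system_count >= MALWARE_SYSTEM_THRESHOLD:
        reasons.append("malware on %d systems" % a.infected_system_count)
    if a.sensitive_data_leaving:
        reasons.append("sensitive data detected leaving the organization")
    if a.unusual_external_traffic:
        reasons.append("traffic or behavior suggesting external compromise")
    return reasons


def should_escalate(a: Assessment) -> bool:
    """Escalate as soon as at least one guideline is tripped."""
    return bool(escalation_reasons(a))
```

Returning the reasons rather than a bare yes/no gives the incident handler a head start on what resources to request.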
Before you can do anything, you will need to define accountabilities among the team. That means specifying the incident handler, or the responsible party until a new responsible party is defined. You also need to line up resources to help based on answers to the questions above, to make sure you have the right expertise and context to work through the incident. We have more detail on staffing the response in our Incident Response Fundamentals series.
The next thing to do is to narrow down the scope of data you need to analyze. As discussed in the last post, you spend considerable time collecting events and logs, as well as network and endpoint forensics. This is a tremendous amount of data so narrowing down the scope of what you investigate is critical. You might filter on the segments attacked, or logs of the application in question. Perhaps you will take forensics from endpoints at a certain office if you believe the incident was contained. This is all to make the data mining process manageable.
With all this shiny big data technology, do you need to actually move the data? Of course not, but you will need flexible filters so you can see only items relevant to this incident in your forensic search results. Time is of the essence in any response, so you cannot afford to get bogged down with meaningless and irrelevant results as you work through collected data.
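A flexible filter of this kind can be sketched as a lazy generator that narrows the view without copying or moving the underlying data. The `Event` shape and the `in_scope` function are assumptions for illustration; a real deployment would express the same filters in whatever query language your log or forensics platform provides.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class Event:
    """Minimal illustrative event record."""
    timestamp: datetime
    segment: str
    application: str
    message: str


def in_scope(
    events: Iterable[Event],
    segments: Optional[Set[str]] = None,
    applications: Optional[Set[str]] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> Iterator[Event]:
    """Lazily yield only events relevant to this incident; None means 'no filter'."""
    for event in events:
        if segments is not None and event.segment not in segments:
            continue
        if applications is not None and event.application not in applications:
            continue
        if start is not None and event.timestamp < start:
            continue
        if end is not None and event.timestamp > end:
            continue
        yield event
```

Because nothing is materialized until you iterate, the same filter can be tightened or loosened as your understanding of the incident changes.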
Once you have filters in place you will want to start analyzing the data to answer several questions:
- Who is attacking you?
- What tactics are they using?
- What is the extent of the potential damage?
You may have an initial idea based on the alert that triggered the response, but now you need to prove that hypothesis. This is where threat intelligence plays a huge role in accelerating your response. Based on the indicators you found, a TI service can help identify a potentially responsible party. Or at least a handful of them. Every adversary has their preferred tactics, and whether through adversary analysis (discussed in Really Responding Faster) or actual indicators, you want to leverage external information to understand the attacker and their tactics. It is a bit like having a crystal ball, allowing you to focus your efforts on what the attacker likely did, and where.
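To make the indicator-matching idea concrete, here is a rough sketch that ranks possible adversaries by how many of your observed indicators a TI feed attributes to them. The feed format (a mapping from indicator to adversary names) and the function name are assumptions; commercial TI services expose their own formats and APIs, which are not modeled here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def candidate_adversaries(
    observed: Iterable[str],
    feed: Dict[str, Set[str]],
) -> List[Tuple[str, int]]:
    """Rank adversaries by how many observed indicators the feed attributes to them."""
    hits: Counter = Counter()
    for indicator in set(observed):              # count each indicator once
        for adversary in feed.get(indicator, ()):
            hits[adversary] += 1
    return hits.most_common()                    # highest overlap first
```

A high overlap is a lead to investigate, not proof of attribution; it tells you whose known tactics to look for first.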
Then you need to size up or scope out the damage. This comes down to the responder’s initial impressions as they roll up to the scene. The goal here is to take the initial information provided and expand on it as quickly as possible to determine the true extent of the incident.
To determine scope you will want to start digging into the data to establish the systems, networks, and data involved. You won’t be able to pinpoint every single affected device at this point – the goal is to get a handle on how big a problem you might be facing, and generate some ideas on how to best mitigate it.
Finally, based on the incident handler’s initial assessment, you need to decide whether this requires a formal investigation due to potential law enforcement impact. If so you will need to start thinking about chain of custody for the evidence so you can prove the data was not tampered with, and tracking the incident in a case management system. Some organizations treat every incident this way, and that’s fine. But not all organizations have the resources or capabilities for that, in which case you will need a pre-defined set of criteria to determine whether to pursue a full-blown formal incident response.
Quarantine and Image
As you get deeper into response you will have some more decisions to make. The first is how to most effectively contain the damage. Will you take the devices offline? Or will you leave them on and monitor the crap out of them, to see what you can learn about your adversary? Another option if you run an advanced security program is to distribute disinformation to your adversary, sending them down a false path or getting them to further identify themselves.
There are many options for quarantining a device to contain an attack. You could move it onto a separate network with access to nothing, or disconnect it from the network altogether. You could have the device log out or turn it off. What you cannot do is act rashly – you need to make sure things do not get worse. Many malware kits (and attackers) will wipe a device if it is powered down or disconnected from the network, as a self-destruct mechanism. This is a time to think before you act.
Which brings up the next task: to take forensic images of affected devices, especially if you are thinking about chain of custody within the context of a formal investigation. In this case how you capture and store images is as important as what you find. You need to make sure your responders understand how the law works, and what can provide a basis for reasonable doubt in court. Yes, it sucks that responders need to worry about this stuff, but better they learn ahead of time than when a perpetrator walks off with your data scot-free due to a technicality.
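One common building block for showing an image was not tampered with is a cryptographic hash recorded at acquisition time and re-checked later. The sketch below uses SHA-256 from Python's standard library; the custody entry layout and the function names are illustrative assumptions, and a hash log alone does not satisfy any particular legal standard.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a (possibly very large) image file in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_entry(path: str, handler: str, action: str) -> dict:
    """Record who did what to which image, and its hash at that moment."""
    return {
        "path": path,
        "sha256": sha256_of_file(path),
        "handler": handler,
        "action": action,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def still_matches(path: str, entry: dict) -> bool:
    """True if the image on disk still has the hash recorded in the entry."""
    return sha256_of_file(path) == entry["sha256"]
```

Appending an entry each time the image changes hands, and verifying the hash before each analysis, gives you a simple trail to hand to counsel or law enforcement.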
Once images are taken and devices are quarantined (or at least safe so they cannot cause more damage), it is time to dig deep into the attack to really understand what happened. Our research has shown a timeline approach provides a clear path to figuring out what happened. You start the timeline with the initial attack vector (identify root cause) and follow the adversary as they systematically work towards achieving their mission. They move laterally within your environment and compromise additional devices on the way to their target. To ensure a complete cleanup you will want the ability to pinpoint exactly which devices were affected, and preferably to review the exfiltrated data via full packet captures of perimeter networks.
Again, investigation is more art than science. Sometimes you cannot work logically with a clear timeline because a lot of stuff happened at the same time. So focus on what you know. At some point a device was compromised. At a subsequent point data was exfiltrated. Now systematically fill in gaps in between to understand what the attacker did and how. Stay focused on the completeness of the investigation as well – a missed compromised device is sure to mean reinfection somewhere down the line.
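The timeline approach can be sketched in a few lines: order what you know, flag the long stretches you cannot yet explain, and keep a running list of affected devices. All names here (`TimelineEntry`, `build_timeline`, `unexplained_gaps`, `affected_devices`) are hypothetical, and the gap threshold is something you would choose per incident.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple, Tuple


class TimelineEntry(NamedTuple):
    when: datetime
    device: str
    what: str


def build_timeline(entries: Iterable[TimelineEntry]) -> List[TimelineEntry]:
    """Order known attacker activity from initial vector to exfiltration."""
    return sorted(entries, key=lambda entry: entry.when)


def unexplained_gaps(
    timeline: List[TimelineEntry], longer_than: timedelta
) -> List[Tuple[TimelineEntry, TimelineEntry]]:
    """Return consecutive pairs separated by more than the given duration."""
    return [
        (earlier, later)
        for earlier, later in zip(timeline, timeline[1:])
        if later.when - earlier.when > longer_than
    ]


def affected_devices(timeline: Iterable[TimelineEntry]) -> List[str]:
    """Every device that appears anywhere in the timeline, for cleanup."""
    return sorted({entry.device for entry in timeline})
```

Each gap returned is a prompt to go back to the filtered data and ask what the attacker was doing between those two points.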
Next you perform a damage assessment. What was lost? How much data? Exfiltrated data tends to be encrypted by attackers, so you may not be able to break the crypto during your investigation, but the files will yield clues. You know how big the exfiltration was. Your investigation identified the impacted devices. So get out your detective hat and start putting pieces together. For example, if 2GB of data was exfiltrated from the finance network, it probably wasn’t the 4TB schematics to your new product on the research segment. Common sense goes a long way during investigation.
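That common-sense elimination can be sketched as a simple filter over an inventory of data sets. This is a deliberately crude heuristic under stated assumptions: it ignores compression and partial theft, and the `DataSet` inventory and `plausible_losses` function are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

GB = 10**9
TB = 10**12


@dataclass(frozen=True)
class DataSet:
    """An entry in a hypothetical inventory of sensitive data."""
    name: str
    segment: str
    size_bytes: int


def plausible_losses(
    exfil_bytes: int,
    impacted_segments: Set[str],
    datasets: Iterable[DataSet],
) -> List[str]:
    """Keep data sets on an impacted segment that could fit in the exfiltrated volume."""
    return [
        d.name
        for d in datasets
        if d.segment in impacted_segments and d.size_bytes <= exfil_bytes
    ]
```

With a 2GB exfiltration from the finance segment, a 4TB schematics archive on the research segment drops off the list on both counts, leaving a much shorter set of candidates to examine.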
Now that you have a better handle on the attack and are equipped with a timeline of attacker activity, you can start mitigating. There are many ways to ensure the attack doesn’t happen again. Some are temporary, including shutting down access to certain devices via certain protocols. Or locking down traffic in and out of critical servers. Or disabling attachments on emails. Or even blocking outbound communication to certain geographies, based on adversary intelligence. There are also more ‘permanent’ mitigations, such as putting in place a service or product to block denial of service attacks. Or possibly wiping affected devices and starting over.
Regardless, you will need to establish a list of mitigation activities to address the incident, and marshal resources to get it done. These resources could be internal or external, depending on the extent of the compromise and the availability of internal resources. We favor big bang device remediation, rather than a rolling thunder approach of incremental cleaning/reimaging. You want to eradicate the adversary from your environment – tipping them off that you know about the attack and are cleaning it slowly just provides opportunity for them to dig deeper.
Of course if it’s a widespread attack that will cause unplanned downtime you won’t be popular with either affected employees or the operations team making changes. As you manage incidents keep in mind that your objective is to contain the damage to the organization and ensure it doesn’t happen again – not to get Christmas cards from everyone.
When is your mitigation done? Once you have halted the spread of damage and regained the ability to continue operations. Your environment may not be pretty as you finish the mitigation, having resorted to a bunch of temporary workarounds to protect your information and make sure devices are no longer affected. But you don’t get points for elegance – always favor speed over style.
Once operations are back online and the damage is contained it is time to take a step back and clean up any actions disrupting normal business operations, making sure you are comfortable that particular attack will not happen again. This might involve leaving new controls or policies, implemented during the response, in place. For example during an incident you might block all traffic on a certain port to disable the command and control network of a malware infection, or make certain servers read-only to avoid an adversary downloading malware to their file systems.
The clean-up is complete when you have restored all changes to where you were before the incident, or accepted specific changes as a permanent part of your standards/configurations. Some changes – such as updating patch levels or configurations on devices – will clearly stay, while temporary workarounds need to be backed out as you return to your revised normal.
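That completion test is easy to track if every change made during the response is logged with a status. The sketch below assumes a simple three-state model; the `Status` values and function names are illustrative rather than taken from any particular change-management tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Status(Enum):
    TEMPORARY = "temporary workaround still in place"
    ROLLED_BACK = "backed out"
    ACCEPTED = "accepted as permanent standard"


@dataclass
class Change:
    description: str
    status: Status = Status.TEMPORARY


def outstanding(changes: Iterable[Change]) -> List[str]:
    """Changes that have been neither backed out nor accepted as permanent."""
    return [c.description for c in changes if c.status is Status.TEMPORARY]


def cleanup_complete(changes: Iterable[Change]) -> bool:
    """Clean-up is done when no temporary workaround remains undecided."""
    return not outstanding(list(changes))
```

Logging changes as you make them during mitigation, when speed matters more than style, is what makes this list trustworthy later.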
While the incident managers focus on completing the investigation and cleaning out temporary controls, IT operations handles updating software and restoring normal operations. This could mean updating patches on all systems, checking for and cleaning malware, restoring systems from backup and bringing them back up to date, etc.
At this point you have completed your investigation and any remaining activities are out of your hands and the responsibility of IT operations. You know what happened, why, and what needs to be done to minimize the chance of a similar incident in the future.
Your last step is to analyze the response process itself: did you detect the incident quickly enough? Respond fast enough? Respond effectively? What do you need to learn to improve the process? The result might be that you’re happy with how your team managed the incident. But there will be opportunities for improvement, which may involve changes limited to the team itself, changes to technology use or configurations, or broader organizational changes (education, network configuration, and so on). During this postmortem response analysis, there cannot be any sacred cows. No one is perfect and it’s okay to make mistakes – once. You don’t want to make the same mistake again.
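The first two postmortem questions become easier to answer, and to compare across incidents, if you record a handful of timestamps per incident. This is a minimal sketch; which milestones you track and the names used here (`IncidentClock` and its fields) are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentClock:
    """Key milestones for one incident (illustrative field names)."""
    compromised_at: datetime        # earliest attacker activity on the timeline
    detected_at: datetime           # the trigger that started the response
    response_started_at: datetime   # incident handler assigned
    contained_at: datetime          # spread halted, operations resumed

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.compromised_at

    @property
    def time_to_respond(self) -> timedelta:
        return self.response_started_at - self.detected_at

    @property
    def time_to_contain(self) -> timedelta:
        return self.contained_at - self.response_started_at
```

The numbers do not say whether the response was effective, but they show where the time went and whether process changes are shrinking it.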
You cannot completely prevent attacks, so the key is to optimize your response process to detect and manage problems as quickly and efficiently as possible, which brings us full circle back to threat intelligence. You also need to learn about your adversary during this process. You were attacked once and will likely be attacked again. How will you stay on top of adversaries’ tactics, make sure you are keeping up, and stay ready for the new attacks that will be coming your way?
Threat intelligence drives that feedback loop to make sure you are adapting your controls as often as needed to be ready for adversaries, rather than learning what needs to change during another incident response.
We will wrap up this series by running through a scenario, showing how threat intelligence and this updated incident response/management process can both accelerate and improve the effectiveness of your response.