You did well. You followed your incident response plan and the fire is out. Too bad that was the easy part, and you now get to start the long journey from ending a crisis all the way back to normal. If we get back to our before, during, and after segmentation, this is the ‘after’ part.
In the vast majority of incidents the real work begins after the immediate incident is over, when you’re faced with the task of returning operations to status quo ante, finding out the root cause of the problem, and putting controls in place to ensure it doesn’t happen again.
The after part of the process consists of three phases (Mop up, Analyze, and Q/A), two of which overlap and can be performed concurrently. And remember – we are describing a full incident response process and tend to use major situations in our examples, but everything we are talking about scales down for smaller incidents too, which might be managed by a single person in a matter of minutes or hours. The process should scale both up and down, depending on the severity and complexity of an incident, but even dealing with what seems to be the simplest incident requires a structured process. That way you won’t miss anything.
We steal the term “mop up” from the world of firefighting – where cleaning up after yourself may literally involve a mop. Hopefully we won’t need to break out the mops in an IT incident (though stranger things have happened), but the concept is the same – clean up after yourself, and do what’s required to restore normal operations. This usually occurs concurrently with your full investigation and root cause analysis.
There are two aspects to mopping up, each performed by different teams:
- Cleaning up incident response changes: During a response we may take actions that disrupt normal business operations, such as shutting down certain kinds of traffic, filtering email attachments, and locking down storage access. During the mop up we carefully return to our pre-incident state, but only as we determine it's safe to do so, and some controls implemented during the response may remain in place. For example, during an incident you might have blocked all traffic on a certain port to disable the command and control network of a malware infection. During the mop up you might reopen the port, or open it and filter certain egress destinations. Mop up is complete when you have either reverted all changes to their pre-incident state or accepted specific changes as a permanent part of your standards/configurations. Some changes – such as updating patch levels – will clearly stay, while others – including temporary workarounds – need to be backed out as a permanent solution goes into place.
- Restoring operations: While the incident responders focus on investigation and cleaning out temporary controls they put in place during the incident, IT operations handles updating software and restoring normal operations. This could mean updating patch levels on all systems, or checking for and cleaning malware, or restoring systems from backup and bringing them back up to date, and so on.
The incident response team defines the plan to safely return to operations and cleans up the remnants of its actions, while IT operations teams face the tougher task of getting all the systems and networks where they need to be on a ‘permanent’ basis (not that anything in IT is permanent, but you know what we mean).
Investigation and Analysis
The initial incident is under control, and operations are being restored to normal as a result of the mop up. Now is when you start the in-depth investigation of the incident to determine its root cause and figure out what you need to do to prevent a similar incident in the future.
Since you’ve handled the immediate problem, you should already have a good idea of what happened, but that’s a far cry from a full investigation. To use a medical analogy, think of it as switching from treating the symptoms to treating the source of the infection. To go back to our malware example, you can often manage the immediate incident even without knowing how the initial infection took place. Or in the case of a major malicious data leak, you switch from containing the leak and taking immediate action against the employee to building the forensic evidence required for legal action, and ensuring the leak becomes an isolated incident, not a systematic loss of data.
In the investigation we piece together all the information we collected as part of the incident response with as much additional data as we can find, to help produce an accurate timeline of what happened and why. This is a key reason we push heavy monitoring so strongly, as a core process throughout your organization – modern incidents and attacks can easily slip through the gaps of 'point' tools and basic logs. Extensive monitoring of all aspects of your environment (both the infrastructure and up the stack), often using a variety of technologies, provides more complete information for investigation and analysis.
We have already talked about various data sources throughout this series, so instead of rehashing them, here are a few key areas that tend to provide more useful nuggets of information:
- Beyond events: Although IDS/IPS, SIEM, and firewall logs are great to help manage an ongoing incident, they may provide an incomplete picture during your deeper investigation. They tend to only record information when they detect a problem, which doesn’t help much if you don’t have the right signature or trigger in place. That’s where a network forensics (full network packet capture) solution comes in – by recording everything going on within the network, these devices allow you to look for the trails you would otherwise miss, and piece together exactly what happened using real data.
- System forensics: Some of the most valuable tools for analyzing servers and endpoints are system forensics tools. OS and application logs are all too easy to fudge during an attack. These tools are also critical for incidents where you may want to take legal action, because you’ll need strong evidence to support your case.
- Avoid log tampering: Ideally you will manage server and application logs using a central solution, as opposed to leaving them on the systems that generate them. Sending the logs to a secure remote location (a log aggregation platform) makes it much harder for an attacker to cover their tracks. But the value of aggregating logs isn’t merely to limit tampering – you can also be alerted if a source suddenly drops offline (which otherwise you wouldn’t notice), which typically indicates an error or a bad guy covering his tracks.
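The "source suddenly drops offline" alert in that last point can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a product feature: assume each log source reports a last-seen timestamp to the aggregator, and anything silent longer than a threshold gets flagged. All source names here are made up.

```python
from datetime import datetime, timedelta

def silent_sources(last_seen, now, threshold=timedelta(minutes=15)):
    """Return log sources that have sent nothing within `threshold`.

    last_seen: dict mapping source name -> datetime of its most recent event.
    A source going quiet usually means a crash, a config error, or an
    attacker covering their tracks -- all worth an alert.
    """
    return sorted(src for src, ts in last_seen.items() if now - ts > threshold)

if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    last_seen = {
        "web01-syslog": now - timedelta(minutes=2),
        "db01-audit":   now - timedelta(hours=3),   # suspiciously quiet
        "fw-edge":      now - timedelta(minutes=5),
    }
    print(silent_sources(last_seen, now))  # ['db01-audit']
```

A real aggregation platform would track this continuously and per source type, but the principle is the same: silence is itself a signal worth monitoring.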
Here are some good starting points based on incident type:
- Employees: For anything involving an employee, start with forensics of their system and then crawl through the various accounts they have access to (applications and storage). Then check for unusual activities on their coworkers' systems and accounts, because they might have used one of those to cover their tracks. Also look at your network monitoring tools, such as DLP, URL filtering, email archiving, and even network forensics for unusual communications. You are trying to establish a pattern of behavior, and that involves understanding exactly how the employee is interacting with systems.
- Malware: For targeted malware, network forensics and system forensics are your friends. System forensics and malware analysis help figure out what you are dealing with, and network forensics pinpoint how the attack code got in and what it was trying to do behind your perimeter.
- Web applications: For web applications (and associated servers and storage), if you have a Web Application Firewall and/or Database Activity Monitoring, these tools may provide more information more quickly than crawling through (potentially) dozens of system, application, and database logs. This is another area where network forensics can help, assuming the tool recorded all the relevant traffic. But we can’t assume this is the case for public web applications – especially those with high traffic levels.
- Old school attack: SIEM is usually the first place to start for a ‘traditional’ network attack because unsophisticated attacks tend to leave a prominent trail; next check network forensics, IDS/IPS/firewall logs, and local system log files for affected systems.
Remember that just because an incident occurred on an endpoint or server, you don’t have to rely purely on information sources directly associated with that device. At minimum you’ll want to validate the information you gather from the system(s) in question, so network traffic might be your best evidence in case of an endpoint compromise. Keep in mind that investigation is all about correlation: pulling together bits of information from different sources, and then (often painstakingly) analyzing them to piece together the big picture.
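As a deliberately simplified illustration of that correlation step, here is a sketch that merges events from several sources into the single chronological timeline an investigator typically builds first. The source names and event fields are hypothetical, and real tools have to normalize clocks and log formats before anything can be compared; here we assume the timestamps are already comparable.

```python
from datetime import datetime

def build_timeline(*sources):
    """Merge per-source event lists into one chronological timeline.

    Each source is a list of (timestamp, source_name, description)
    tuples. Sorting across sources is what turns isolated events into
    a narrative: the proxy download, then the new service, then the
    beacon -- a story no single tool's log tells on its own.
    """
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])

if __name__ == "__main__":
    ids_alerts = [(datetime(2024, 1, 1, 9, 14), "ids", "outbound beacon detected")]
    proxy_logs = [(datetime(2024, 1, 1, 9, 2), "proxy", "download from unknown host")]
    endpoint   = [(datetime(2024, 1, 1, 9, 5), "endpoint", "new service installed")]
    for ts, src, what in build_timeline(ids_alerts, proxy_logs, endpoint):
        print(ts.isoformat(), src, what)
```

The painstaking part in practice is not the merge but deciding which of thousands of events actually belong on the timeline.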
It takes a few years' worth of experience to really understand the depths of a forensic investigation. In this series, we're merely providing some of the more useful starting points.
At the conclusion of this phase you should have the best picture possible of exactly what happened, why, and how it can (hopefully) be prevented in the future. Not that you can prevent all incidents – but in those cases your goal is to reduce the time it takes to detect and respond to the incident.
Guidance for When the Lawyers Get Involved
Forensics, investigation, and incident analysis are each a science unto themselves, with many books and training courses available. But a few core principles are important to follow, even if you are a single responder in a small organization:
- Keep a case log: It’s absolutely critical to keep a log of everything you do during a response and investigation. Even handwritten notes/logs are okay. Make sure you record dates and times, and remember that this log itself is a piece of evidence and must be kept secure. In a larger organization you may have a case management/incident response application, while those of you in smaller organizations will capture it all manually.
- Maintain data integrity: From system forensics to analysis of network logs, never work with a “live” system or data without first making a forensically valid duplicate. The act of viewing files itself changes the associated metadata, which may be important if you later take legal action (or even if you don’t). Clearly you need to exercise judgment here – you won’t be creating forensic backups of everything you touch in every incident, but you had better know what you are doing. Once you start messing with a system, without a copy you can’t go back and revert your own actions, and you risk losing valuable data.
- Maintain chain of custody: Again, this isn’t necessarily something you’ll do for every incident, but if you have the slightest inkling there might be legal repercussions, you need to maintain a full chain of custody for all the evidence. That means evidence is stored in secure locations, with full logs of everyone who has touched it and what they have done to it.
If something might go to court it's critical to keep a case log, maintain data (evidence) integrity, and maintain the chain of custody.
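To make those principles concrete, here is one minimal way a small shop might keep a timestamped, tamper-evident case log in code. Chaining each entry's hash to the previous one is just one common approach to integrity (if any earlier entry is altered, verification fails from that point on), and every name here is illustrative rather than a standard tool.

```python
import hashlib
import json
from datetime import datetime, timezone

class CaseLog:
    """Append-only case log where each entry hashes the previous one.

    Altering any earlier entry breaks every later hash, which makes
    tampering evident on verification -- a lightweight stand-in for
    the secure storage a real evidence process requires.
    """

    def __init__(self):
        self.entries = []

    def record(self, actor, action):
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "prev": prev_hash,
        }
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self.entries.append(entry)

    def verify(self):
        prev_hash = "0" * 64
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            body = {k: v for k, v in entry.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode()
            if hashlib.sha256(payload).hexdigest() != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True

if __name__ == "__main__":
    log = CaseLog()
    log.record("analyst1", "imaged laptop drive, sha256 recorded")
    log.record("analyst1", "transferred image to evidence locker")
    print(log.verify())  # True
    log.entries[0]["action"] = "something else"
    print(log.verify())  # False
```

A handwritten notebook satisfies the same requirement; what matters is that every action is recorded with a date and time, and that the record itself can be shown to be intact.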
At this point you have completed your investigation and either operations are fully restored or the remaining activities are out of your hands and the responsibility of IT operations. You know what happened, why, and what needs to be done to minimize the chance of a similar incident in the future.
The last step is to analyze the response process itself: did you detect the incident quickly enough? Respond fast enough? Respond effectively? What do you need to learn to improve the process?
The result might be that you’re happy with the response, though that’s probably a long shot. Most good teams almost always figure out ways to improve their response, which may involve changes limited to the team itself, changes to technology use or configurations, or broader organizational changes (education, network structure, and so on). During this post-mortem response analysis, there cannot be any sacred cows. No one is perfect and it’s okay to make mistakes – once. You don’t want to make the same mistake again.
You can’t completely prevent attacks, so the key is to optimize your response process to detect and manage problems as quickly and efficiently as possible.
As we wrap up this series, we'll talk about how to phase this kind of incident response process into your environment, since it's unlikely you'll adopt the entire thing in one fell swoop.