Incident Response Fundamentals: Mop up, Analyze, and QA
You did well. You followed your incident response plan and the fire is out. Too bad that was the easy part: now you get to start the long journey from the end of the crisis all the way back to normal. Returning to our before, during, and after segmentation, this is the 'after' part. In the vast majority of incidents the real work begins once the immediate incident is over, when you're faced with the task of returning operations to status quo ante, determining the root cause of the problem, and putting controls in place to ensure it doesn't happen again. The after part of the process consists of three phases (Mop up, Analyze, and QA), two of which overlap and can be performed concurrently. And remember: we are describing a full incident response process and tend to use major situations in our examples, but everything we are talking about scales down for smaller incidents too, which might be managed by a single person in a matter of minutes or hours. The process should scale both up and down, depending on the severity and complexity of an incident, but even what seems to be the simplest incident deserves a structured process. That way you won't miss anything.

Mop up

We steal the term "mop up" from the world of firefighting, where cleaning up after yourself may literally involve a mop. Hopefully we won't need to break out the mops in an IT incident (though stranger things have happened), but the concept is the same: clean up after yourself, and do what's required to restore normal operations. This usually occurs concurrently with your full investigation and root cause analysis. There are two aspects to mopping up, each performed by different teams:

Cleaning up incident response changes: During a response we may take actions that disrupt normal business operations, such as shutting down certain kinds of traffic, filtering email attachments, and locking down storage access. During the mop up we carefully return to our pre-incident state, but only as we determine it's safe to do so, and some controls implemented during the response may remain in place. For example, during an incident you might have blocked all traffic on a certain port to disable the command and control network of a malware infection. During the mop up you might reopen the port, or open it and filter certain egress destinations. Mop up is complete when every change has either been backed out to its pre-incident state or accepted as a permanent part of your standards and configurations (one way to keep track of these changes is sketched below). Some changes, such as updated patch levels, will clearly stay, while others, including temporary workarounds, need to be backed out as a permanent solution goes into place.

Restoring operations: While the incident responders focus on the investigation and on cleaning out the temporary controls they put in place during the incident, IT operations handles updating software and restoring normal operations. This could mean updating patch levels on all systems, checking for and cleaning malware, restoring systems from backup and bringing them back up to date, and so on. The incident response team defines the plan to safely return to operations and cleans up the remnants of its own actions, while IT operations teams face the tougher task of getting all the systems and networks where they need to be on a 'permanent' basis (not that anything in IT is permanent, but you know what we mean).
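To make that first aspect a little more concrete, here is a minimal sketch of one way a response team might record each temporary control alongside its rollback, so mop up becomes a deliberate review of what to revert and what to keep rather than an exercise in memory. This is purely illustrative, not part of the process itself: the iptables rule is an example, and the `gateway-ctl` command is a hypothetical stand-in for whatever email gateway tooling you actually use.

```python
"""Track temporary incident-response controls and back them out during mop up.

A minimal sketch: each control is recorded with the command that applied it
and the command that reverses it. During mop up, controls marked 'accepted'
stay in place (they become part of the standard configuration); everything
else gets rolled back. The commands below are illustrative only.
"""
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class TemporaryControl:
    description: str
    apply_cmd: List[str]      # command used during the response
    rollback_cmd: List[str]   # command that restores the pre-incident state
    accepted: bool = False    # True = keep permanently, skip rollback


controls = [
    TemporaryControl(
        description="Block outbound TCP/6667 to cut off suspected C&C traffic",
        apply_cmd=["iptables", "-A", "OUTPUT", "-p", "tcp", "--dport", "6667", "-j", "DROP"],
        rollback_cmd=["iptables", "-D", "OUTPUT", "-p", "tcp", "--dport", "6667", "-j", "DROP"],
    ),
    TemporaryControl(
        description="Strip executable email attachments at the gateway",
        apply_cmd=["gateway-ctl", "attachments", "--block", "exe"],    # hypothetical CLI
        rollback_cmd=["gateway-ctl", "attachments", "--allow", "exe"],  # hypothetical CLI
        accepted=True,  # decided to keep this control after the incident
    ),
]


def mop_up(dry_run: bool = True) -> None:
    """Roll back every temporary control that was not accepted as permanent."""
    for control in controls:
        if control.accepted:
            print(f"KEEP   : {control.description}")
            continue
        print(f"REVERT : {control.description}")
        if not dry_run:
            subprocess.run(control.rollback_cmd, check=True)


if __name__ == "__main__":
    mop_up(dry_run=True)  # review the plan first; rerun with dry_run=False to execute
```

Whatever form the record takes (a script like this, a ticket queue, or a shared spreadsheet), the point is that every response-time change leaves mop up with an explicit disposition: reverted, or accepted into the standard configuration.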
Investigation and Analysis

The initial incident is under control, and operations are being restored to normal as a result of the mop up. Now is when you start the in-depth investigation of the incident to determine its root cause and figure out what you need to do to prevent a similar incident in the future. Since you've handled the immediate problem, you should already have a good idea of what happened, but that's a far cry from a full investigation. To use a medical analogy, think of it as switching from treating the symptoms to treating the source of the infection. To go back to our malware example, you can often manage the immediate incident without even knowing how the initial infection took place. Or in the case of a major malicious data leak, you switch from containing the leak and taking immediate action against the employee to building the forensic evidence required for legal action, and ensuring the leak remains an isolated incident rather than a systematic loss of data.

In the investigation we piece together all the information we collected as part of the incident response with as much additional data as we can find, to produce an accurate timeline of what happened and why. This is a key reason we push heavy monitoring so strongly as a core process throughout your organization: modern incidents and attacks can easily slip through the gaps of 'point' tools and basic logs. Extensive monitoring of all aspects of your environment (both the infrastructure and up the stack), often using a variety of technologies, provides more complete information for investigation and analysis. We have already talked about various data sources throughout this series, so instead of rehashing them, here are a few key areas that tend to provide more useful nuggets of information:

Beyond events: Although IDS/IPS, SIEM, and firewall logs are great for managing an ongoing incident, they may provide an incomplete picture during your deeper investigation. They tend to record information only when they detect a problem, which doesn't help much if you didn't have the right signature or trigger in place. That's where a network forensics (full network packet capture) solution comes in: by recording everything going on within the network, these devices allow you to look for the trails you would otherwise miss and piece together exactly what happened using real data (a small sketch of this kind of after-the-fact traffic analysis appears at the end of this section).

System forensics: Some of the most valuable tools for analyzing servers and endpoints are system forensics tools. OS and application logs are all too easy to fudge during an attack. These tools are also
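As a rough illustration of the 'Beyond events' point above, the sketch below pulls every TCP packet touching a suspect host out of a full packet capture and prints a simple timeline. It assumes the third-party scapy library is installed, and the capture file name and suspect address are made up for the example. A dedicated network forensics platform does this at far greater scale, but the principle is the same: the raw traffic is available for reconstruction whether or not anything triggered an alert at the time.

```python
"""Build a simple connection timeline to a suspect host from a full packet capture.

A minimal sketch assuming the third-party scapy library (`pip install scapy`);
the capture file and suspect address below are hypothetical examples.
"""
from datetime import datetime, timezone

from scapy.all import IP, TCP, rdpcap  # pcap reader and protocol layer classes

SUSPECT_HOST = "203.0.113.50"            # address identified during the response (example)
CAPTURE_FILE = "perimeter-capture.pcap"  # full packet capture from the forensics tool (example)


def connection_timeline(pcap_path: str, suspect: str):
    """Yield (timestamp, src, dst, dport) for every TCP packet touching the suspect host."""
    for pkt in rdpcap(pcap_path):
        if IP in pkt and TCP in pkt:
            ip = pkt[IP]
            if suspect in (ip.src, ip.dst):
                ts = datetime.fromtimestamp(float(pkt.time), tz=timezone.utc)
                yield ts, ip.src, ip.dst, pkt[TCP].dport


if __name__ == "__main__":
    for ts, src, dst, dport in connection_timeline(CAPTURE_FILE, SUSPECT_HOST):
        print(f"{ts.isoformat()}  {src} -> {dst}:{dport}")
```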