Incident Response in the Cloud Age: Addressing the Skills Gap

By Mike Rothman
As we described in our last post, incident response in the Cloud Age requires an evolved response process, in light of data sources you didn’t have before, including external threat intelligence, and the ability to analyze data in ways that weren’t possible just a few years ago. You also need to factor in that access to certain telemetry, especially network telemetry, is limited because you no longer control the networks.
But even with these advances, the security industry needs to face the intractable problem that comes up in pretty much every discussion we have with senior security types. It’s people, folks. There simply are not enough skilled investigators (forensicators) to meet demand. And those who exist tend to hop from job to job, maximizing their earning potential. As they should – given free markets and all.
But this creates huge problems if you are running a security team and need to build and maintain a staff of analysts, hunters, and responders. So where can you find folks in a seller’s market? You have a few choices:
- Develop them: You certainly can take high-potential security professionals and teach them the art of incident response. Or given the skills gap, lower-potential security professionals. Sigh. This involves a significant investment in training, and a lot of the skills needed will be acquired in the crucible of an active incident.
- Buy them: If you have neither the time nor the inclination to develop your own team of forensicators, you can get your checkbook out. You’ll be competing for these folks against consulting firms that can keep them highly utilized, so those firms are willing to pay up for talent to keep their billable hours clicking along. And large enterprises can break their typical salary bands to get the talent they need as well. This approach is not cheap.
- Rent them: Speaking of consulting firms, you can also find forensicators by entering into an agreement with a firm that provides incident response services. Which seems to be every security company nowadays. It’s that free market thing again. This will obviously be the most expensive, because you are paying for the overhead of partners to do a bait and switch and send a newly minted SANS-certified resource to deal with your incident. OK, maybe that’s a little facetious. But only a bit.
The reality is that you’ll need all of the above to fully staff your team. Developing a team is your best long-term option, but understand that some of those folks will inevitably head to greener pastures right after you train them up. If you need to stand up an initial team you’ll need to buy your way in and then grow. And it’s a good idea to have a retainer in place with an external response firm to supplement your resources during significant incidents.
Changing the Game
It doesn’t make a lot of sense to play a game you know you aren’t going to win. Finding enough specialized resources to sufficiently staff your team probably fits into that category. So you need to change the game. Thinking about incident response differently covers a lot, including:
- Narrow focus: As discussed earlier, you can leverage threat intelligence and security analytics to more effectively prioritize efforts when responding to incidents. Retrospectively searching for indicators of malicious activity and analyzing captured data to track anomalous activity enables you to focus efforts on those devices or networks where you can be pretty sure there are active adversaries.
- On the job training: In all likelihood your folks are not yet ready to perform very sophisticated malware analysis and response, so they will need to learn on the job. Be patient with your I/R n00bs and know they’ll improve, likely pretty quickly. Mostly because they will have plenty of practice – incidents happen daily nowadays.
- Streamline the process: To do things differently you need to optimize your response processes as well. That means not fully doing some things that, given more time and resources, you might. You need to make sure your team doesn’t get bogged down doing things that aren’t absolutely necessary, so it can triage and respond to as many incidents as possible.
- Automate: Finally, you can (and will need to) automate the I/R process where possible. With advancing orchestration and integration options as applications move to the cloud, it is becoming more feasible to apply large doses of automation to remove a lot of the manual (and resource-intensive) activities from the hands of your valuable team members, letting machines do more of the heavy lifting.
Streamline and Automate
You can’t do everything. You don’t have enough time or people. Looking at the process map in our last post, the top half is about gathering and aggregating information, which is largely not a human-dependent function. You can procure threat intelligence data and integrate that directly into your security monitoring platform, which is already collecting and aggregating internal security data.
Initial triage and sizing up of incidents can be automated to a degree as well. We mentioned triggered capture, so when an alert triggers you can automatically start collecting data from potentially impacted devices and networks. This information can be packaged up and then compared against known indicators of malicious activity or misuse (both internal and external), and against your internal baselines.
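A first pass at that triage step can be sketched in a few lines. This is a hypothetical illustration, not any specific product’s logic: the indicator lists, event fields, and baseline are all invented for the example.

```python
# Hypothetical sketch: automated first-pass triage of a triggered capture.
# Indicator sets and event schema are illustrative only.

KNOWN_BAD_HASHES = {"44d88612fea8a8f36de82e1278abb02f"}  # example value
KNOWN_BAD_DOMAINS = {"evil.example.net"}                 # example value

def triage(events, baseline_processes):
    """Compare captured events against known indicators and a process
    baseline, returning only the subset worth routing to a responder."""
    suspicious = []
    for event in events:
        if event.get("file_hash") in KNOWN_BAD_HASHES:
            suspicious.append((event, "known-bad hash"))
        elif event.get("dns_query") in KNOWN_BAD_DOMAINS:
            suspicious.append((event, "known-bad domain"))
        elif event.get("process") and event["process"] not in baseline_processes:
            suspicious.append((event, "process not in baseline"))
    return suspicious

events = [
    {"host": "web-01", "process": "nginx"},              # matches baseline
    {"host": "web-01", "process": "xmrig"},              # anomalous process
    {"host": "db-02", "dns_query": "evil.example.net"},  # known-bad domain
]
hits = triage(events, baseline_processes={"nginx", "postgres"})
```

The point is less the matching logic than the routing: only the two suspicious events reach a human, while the baseline-conforming noise never consumes analyst time.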
At that point you can route the package of information to a responder, who can start to take action. The next step is to quarantine devices and take forensic images, which can be largely automated as well. As more and more infrastructure moves into the cloud, software-defined networks and infrastructure can automatically take the devices in question out of the application flow and quarantine them. Forensic images can be taken automatically with an API call, and added to your investigation artifacts. If you don’t have fully virtualized infrastructure, a number of automation and orchestration tools are appearing to provide an integration layer for these kinds of functions.
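The quarantine-and-image step might look something like the sketch below. The `CloudAPI` class is a stand-in for whatever SDK your provider exposes; its method names are invented for illustration, and real code would call the provider’s security-group and snapshot APIs instead.

```python
# Hypothetical sketch of automated quarantine and forensic imaging.
# CloudAPI is a stub so the workflow is runnable end to end; swap in
# your provider's actual SDK calls.

class CloudAPI:
    def __init__(self):
        self.quarantined = []
        self.snapshots = []

    def move_to_isolation_group(self, instance_id):
        # Real code would swap the instance's security group / SDN segment.
        self.quarantined.append(instance_id)

    def snapshot_volume(self, instance_id):
        # Real code would call the provider's snapshot API.
        snap_id = f"snap-{instance_id}"
        self.snapshots.append(snap_id)
        return snap_id

def quarantine_and_image(api, instance_id, artifacts):
    """Pull a suspect instance out of the traffic flow, then capture a
    forensic image and record it with the investigation artifacts."""
    api.move_to_isolation_group(instance_id)
    artifacts.append(api.snapshot_volume(instance_id))

api, artifacts = CloudAPI(), []
quarantine_and_image(api, "i-0abc123", artifacts)
```

Because both steps are API calls, the whole sequence can fire off the triage alert itself, before a responder has even opened the case.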
When it comes time to do damage assessment, this can largely be streamlined due to new technologies as well. As mentioned above, retrospective searching allows you to search your environment for known bad malware samples and behaviors consistent with the incident being investigated. That will provide clues to the timeline and extent of compromise. Compare this to the olden days (like a year ago, ha!) when you had to wait for the device to start doing something suspicious, and hope the right folks were looking at the console when bad behavior began.
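Retrospective search amounts to scanning data you have already collected for the incident’s indicators and ordering the hits in time. A minimal sketch, with an invented event schema and indicators:

```python
# Hypothetical sketch of retrospective search: scan previously collected
# events for indicators tied to the current incident and build a timeline.

def retrospective_search(events, iocs):
    """Return (timestamp, host, indicator) hits sorted by time, giving a
    first cut at when and where the compromise spread."""
    hits = [(e["ts"], e["host"], e["indicator"])
            for e in events if e["indicator"] in iocs]
    return sorted(hits)

history = [
    {"ts": "2016-06-01T09:14", "host": "web-01", "indicator": "bad.dll"},
    {"ts": "2016-06-01T08:02", "host": "jump-03", "indicator": "bad.dll"},
    {"ts": "2016-06-01T10:30", "host": "db-02", "indicator": "normal.exe"},
]
timeline = retrospective_search(history, iocs={"bad.dll"})
```

The earliest hit in the sorted timeline points at the likely initial point of compromise, which is exactly the clue you used to wait (and hope) for someone to notice on a console.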
In a cloud-native environment (where the application was built specifically to run in the cloud), there really isn’t any mitigation or cleanup required, at least on the application stack. The instances taken out of the application for investigation are replaced with known-good instances that have not been compromised. The application remains up and unaffected by the attack. Attacks on endpoints still require either cleanup or reimaging, although endpoint isolation technologies make it quicker and easier to get devices back up and running.
In terms of watching for the same attack moving forward, you can feed the indicators you found during the investigation back into your security analytics engine and watch for them as things happen, rather than after the attack. Your detection capabilities should improve with each investigation, thanks to this positive feedback loop.
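That feedback loop can be as simple as a watchlist the analytics engine consults on every new event. A trivial sketch, with invented names:

```python
# Hypothetical sketch of the detection feedback loop: indicators found
# during an investigation are folded into a watchlist checked in real time.

class Watchlist:
    def __init__(self):
        self.indicators = set()

    def learn_from_investigation(self, found_indicators):
        """Feed investigation findings back into forward-looking detection."""
        self.indicators |= set(found_indicators)

    def matches(self, event_indicator):
        return event_indicator in self.indicators

watchlist = Watchlist()
watchlist.learn_from_investigation(["bad.dll", "evil.example.net"])
```

Each closed investigation grows the set, so detection gets incrementally better without any extra analyst effort.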
It also makes sense to invest in an incident response management system/platform that will structure activities in a way that standardizes your response process. These response workflows make sure the right stuff happens during every response, because the system requires it. Remember, you are dealing with folks who aren’t as experienced, so having a set of tasks for them to undertake, especially when dealing with an active adversary, can ensure a full and thorough investigation happens. This kind of structure and process automation can magnify the impact of limited staff with limited skills.
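The core mechanic of such a platform is simple to sketch: the incident cannot be closed until every mandatory step is complete, no matter who is running the response. The step names below are illustrative, not a prescribed process.

```python
# Hypothetical sketch of a response workflow that enforces required steps.
# A real I/R management platform tracks far more, but the core idea is the
# same: the system, not the responder's memory, guarantees completeness.

REQUIRED_STEPS = ["triage", "quarantine", "image", "damage_assessment", "report"]

class ResponseWorkflow:
    def __init__(self):
        self.completed = []

    def complete(self, step):
        self.completed.append(step)

    def can_close(self):
        """Closing is only allowed once every required step is done."""
        return all(step in self.completed for step in REQUIRED_STEPS)

workflow = ResponseWorkflow()
for step in ["triage", "quarantine", "image"]:
    workflow.complete(step)
# can_close() stays False until damage_assessment and report are done too
```

For a less experienced responder, the required-step list is the guardrail: skipping the damage assessment under pressure simply isn’t an option the system offers.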
It may seem harsh, but successful I/R in the Cloud Age requires you to think differently. You need to take inexperienced responders, and make them more effective and efficient. Using a scale of 1-10, you should look for people ranked 4-6. Then with training, a structured I/R process, and generous automation, you may be able to have them function at a level of 7-8, which is a huge difference in effectiveness.