When we do a process-centric research project, it works best to wrap up the series with a scenario that illuminates the concepts we’ve discussed and makes things a bit more tangible.

In this situation, imagine you work for a mid-sized retailer that uses a mixture of in-house technology and SaaS, and has recently moved a key warehousing system to an IaaS provider after rebuilding the application for cloud computing. You’ve got a modest-sized security team of 10, which is not enough, but a bit more than many of your peers have. Senior management understands why security is important (to a point) and gives you decent leeway, especially regarding the new IaaS application. In fact, you were consulted during the IaaS architecture phase and provided guidance (with some help from your friends at Securosis) on building a Resilient Cloud Network Architecture and securing the cloud control plane. You also had the opportunity to integrate some orchestration and automation technology into the cloud technology stack.

## The Trigger

You have your team on pretty high alert because a number of your competitors have recently been targeted by an organized crime ring that gained a foothold in their environments and proceeded to steal a ton of information about customers, pricing, and merchandising strategies. Given that this isn’t your first rodeo, you know that where there is smoke there is usually fire, so you decide to task one of your more talented security admins with a little proactive _hunting_ in your environment, just to make sure there isn’t anything going on.

The admin starts to poke around by searching internal security data for some of the more recent samples of malware found in the attacks on the other retailers. The samples were provided by the retail industry’s ISAC (Information Sharing and Analysis Center). The admin gets a hit on one of the samples, confirming what your gut told you: you’ve got an active adversary on the network. So now you need to engage the incident response process.
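
For illustration, that IOC sweep might look something like the minimal sketch below, which assumes the endpoint telemetry can be exported as JSON Lines records with a `sha256` field and that the ISAC indicators arrive as a plain list of hashes (both hypothetical formats; your telemetry and intel feeds will differ).

```python
import json
from pathlib import Path

# Hypothetical inputs: ISAC-supplied malware hashes and exported endpoint telemetry.
ISAC_HASHES_FILE = Path("isac_malware_sha256.txt")
TELEMETRY_FILE = Path("endpoint_telemetry.jsonl")


def load_indicators(path: Path) -> set[str]:
    """Load one SHA-256 hash per line, ignoring blanks and comments."""
    return {
        line.strip().lower()
        for line in path.read_text().splitlines()
        if line.strip() and not line.startswith("#")
    }


def sweep_telemetry(telemetry: Path, indicators: set[str]) -> list[dict]:
    """Return telemetry records whose file hash matches a known-bad indicator."""
    hits = []
    with telemetry.open() as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("sha256", "").lower() in indicators:
                hits.append(record)
    return hits


if __name__ == "__main__":
    for hit in sweep_telemetry(TELEMETRY_FILE, load_indicators(ISAC_HASHES_FILE)):
        print(f"IOC hit on host {hit.get('hostname')}: {hit.get('path')} ({hit.get('sha256')})")
```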

## Job 1: Initial Triage

Now that you know there is a _situation_, you assemble the response team. There aren’t a lot of you, and half the team has to stay focused on operational tasks, since taking down systems wouldn’t make you popular with senior management or the investors. You also don’t want to jump the gun until you know what you’re dealing with, so you inform the senior team of the situation but don’t take any systems offline. Yet.

Since the adversary is active on the internal network, they most likely entered via phishing or another social engineering attack. The admin’s searches show 5 devices with indications of the malware, so those devices are taken off the production network immediately. They aren’t shut down, but placed on a separate network with Internet access so as not to tip off the adversary that you’ve discovered their presence on your network.

Then you check the network forensics tool, looking for indications that data has been leaking. There are a few suspicious file transfers, and luckily you integrated the firewall’s egress filtering capability with the network forensics tool. So once the firewall showed anomalous traffic being sent to known bad sites (via a threat intelligence integration on the firewall), you started capturing the network traffic originating from the devices that triggered the firewall alert. Automatically. That automation stuff sure makes things easier than doing everything manually.
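
As a rough illustration of that kind of automation, the sketch below assumes a hypothetical firewall webhook payload containing the offending internal IP, and simply shells out to tcpdump to start a scoped capture; a real integration would use whatever APIs your firewall and forensics tools actually expose.

```python
import json
import subprocess
from datetime import datetime, timezone


def start_capture_for_alert(alert: dict, iface: str = "eth0") -> subprocess.Popen:
    """Kick off a packet capture scoped to the host that tripped the egress alert.

    `alert` is a hypothetical firewall webhook payload containing the internal
    source IP; the capture itself is plain tcpdump writing to a pcap file.
    """
    src_ip = alert["source_ip"]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    pcap_path = f"/var/forensics/{src_ip}_{stamp}.pcap"
    cmd = ["tcpdump", "-i", iface, "-w", pcap_path, f"host {src_ip}"]
    return subprocess.Popen(cmd)


if __name__ == "__main__":
    # Example payload a firewall with a threat-intel integration might POST.
    sample_alert = json.loads(
        '{"source_ip": "10.20.30.40", "destination": "known-bad.example.com", "feed": "ti-blocklist"}'
    )
    proc = start_capture_for_alert(sample_alert)
    print(f"Capture started, pid {proc.pid}")
```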

As part of your initial triage, you’ve got endpoint telemetry telling you there are issues and network forensics data giving you a clue as to what’s leaking. This is enough to know that you not only have an active adversary, but also that you more than likely have lost data. So you fire up the case management system, which will structure the investigation and store all of its artifacts.

The team is tasked with their responsibilities and sent on their way to get things done. You make the trek to the executive floor to keep senior management updated on the incident.

## Check the Cloud

The attack seems to have started on the internal network, but you don’t want to take chances, and need to make sure the new cloud-based application isn’t at risk. A quick check of the cloud console shows strange activity on one of the instances. An instance within the presentation layer of the cloud stack was flagged by the IaaS provider’s monitoring system because of an unauthorized change on that specific instance. Looks like setting up the configuration monitoring service was time well spent.

Since security was involved in the architecture of the cloud stack, you are in good shape. The application was built to be isolated, so even though it seems the presentation layer has been compromised, the adversaries can’t get to anything of value. And the clean-up has _already happened_. Once the IaaS monitoring system threw an alert, the instance in question was taken out of service and put into a special security group accessible only to the investigators, and a forensic server was spun up to begin analysis. Another example of orchestration and automation really facilitating the incident response process.
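
To make that automation concrete, here is a minimal sketch assuming the IaaS provider is AWS and the orchestration uses boto3; the security group, AMI, and instance identifiers are placeholders, and a production runbook would add error handling and logging.

```python
import boto3

ec2 = boto3.client("ec2")

QUARANTINE_SG = "sg-0quarantine0000000"   # security group reachable only by investigators (placeholder)
FORENSICS_AMI = "ami-0forensics00000000"  # pre-built forensic workstation image (placeholder)


def quarantine_instance(instance_id: str) -> None:
    """Swap the instance's security groups for the investigation-only group
    and snapshot its volumes for offline analysis."""
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[QUARANTINE_SG])
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    for res in reservations:
        for inst in res["Instances"]:
            for mapping in inst.get("BlockDeviceMappings", []):
                vol_id = mapping["Ebs"]["VolumeId"]
                ec2.create_snapshot(VolumeId=vol_id,
                                    Description=f"IR snapshot of {instance_id}")


def launch_forensic_server(subnet_id: str) -> str:
    """Spin up a forensic analysis instance inside the quarantine security group."""
    resp = ec2.run_instances(ImageId=FORENSICS_AMI, InstanceType="m5.large",
                             MinCount=1, MaxCount=1,
                             SubnetId=subnet_id,
                             SecurityGroupIds=[QUARANTINE_SG])
    return resp["Instances"][0]["InstanceId"]
```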

The presentation layer has to handle large variances in traffic, so it was built using auto-scaling technology and immutable servers. Once the (potentially) compromised instance was removed from the group, another instance with a clean configuration was spun up and took on the workload. It’s not clear whether this attack is related to the other incident, so you pull down the information about the cloud attack and feed it into the case management system. But even if it is related, this attack isn’t presenting danger at this point, so it’s set aside so you can focus on the internal attack and probable exfiltration.
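
A minimal sketch of that replacement step, again assuming AWS Auto Scaling via boto3 with a placeholder group name: detaching the instance without decrementing desired capacity prompts the group to launch a clean instance from its immutable image, while the detached one remains available to the investigators.

```python
import boto3

autoscaling = boto3.client("autoscaling")


def replace_compromised_instance(instance_id: str, asg_name: str = "presentation-layer-asg") -> None:
    """Detach the suspect instance from its auto-scaling group without reducing
    desired capacity, so the group launches a clean replacement from the
    immutable image while the detached instance stays around for forensics."""
    autoscaling.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=False,
    )
```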

## Building the Timeline

Now that you’ve done the initial triage, it’s time to dig into the attack and start building a timeline of what happened. You start by looking at the compromised endpoints and the network metadata to see what the adversaries did. By examining the endpoint telemetry, you’ve deduced that _Patient Zero_ was a contractor on the human resources (HR) team. This individual was tasked with reviewing the inbound resumes submitted to the main HR email address and doing the initial qualification screening for an open position. The resume was a malicious Word file using a pretty old Windows 7 attack. It turns out the contractor was using their own machine, which hadn’t been patched and was vulnerable to the attack. You can’t be irritated with the contractor; it was their job to open those files. The malware rooted the device, connected it to a botnet, and then installed a remote access trojan (RAT) to allow the adversary to take control of the device and start a systematic attack on the rest of the infrastructure.

As an aside, you remember that the organization’s BYOD policy allows contractors to use their own machines. The operational process failure was not inspecting the machine when it connected to the network to make sure it was patched and had an authorized configuration. That’s something that needs to be scrutinized as part of the post-mortem.

Once the adversary had a presence on the network, they proceeded to compromise another 4 devices, ultimately ending up on both the CFO’s and the VP of Merchandising’s devices. The network forensic metadata showed how the adversary moved laterally and took advantage of the weak segmentation within the internal networks. There are only so many hours in the day, and the focus had been on making sure the perimeter was strong and ingress traffic was scrutinized.

With both the CFO’s and the VP of Merchandising’s devices compromised, it was pretty straightforward to see the exfiltration in the network metadata. A quick comparison of the file sizes of the transfers flagged by the egress filter showed that the latest quarterly board report was most likely exfiltrated, as well as a package of merchandising comps and plans for an exclusive launch with a very hot new fashion company. It was a bit surprising that the adversary didn’t bother to encrypt the stolen data, but evidently they bet that a mid-sized retailer wouldn’t have sophisticated DLP or egress content filtering in place. Or maybe they just didn’t care whether anyone knew what was exfiltrated.
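
As a simple illustration of that comparison, the sketch below matches flagged transfer sizes against the sizes of known sensitive files within a tolerance; the file names and byte counts are hypothetical, and real DLP or forensics tooling would do this far more rigorously.

```python
from pathlib import Path


def match_transfers_to_files(transfer_sizes: list[int],
                             candidate_files: list[Path],
                             tolerance: float = 0.05) -> dict[int, list[Path]]:
    """Map each flagged outbound transfer to sensitive files of roughly the same
    size (within `tolerance`), as a rough indicator of what was exfiltrated."""
    matches: dict[int, list[Path]] = {}
    for size in transfer_sizes:
        close = [
            f for f in candidate_files
            if f.is_file() and abs(f.stat().st_size - size) <= tolerance * size
        ]
        if close:
            matches[size] = close
    return matches


if __name__ == "__main__":
    # Hypothetical transfer sizes (bytes) pulled from the egress filter's alerts.
    flagged = [4_812_392, 27_340_118]
    sensitive = [Path("board_report_q3.pdf"), Path("merch_comps_launch.zip")]
    for size, files in match_transfers_to_files(flagged, sensitive).items():
        print(f"{size} bytes roughly matches: {', '.join(str(f) for f in files)}")
```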

Your organization clearly has a more mature security program, since the egress filter trigger started a full packet capture of the outbound traffic from all of the compromised devices. So you know exactly what was taken, when, and where it went. That will be useful when talking to law enforcement and possibly prosecuting someone at some point, but right now it’s little consolation.

## Cleaning Up the Mess

Now that you have the incident timeline, it’s time to clean things up and return your environment to a known good state. The first step is to clean up the affected machines. The executives are cranky because the decision was made to reimage their machines, but knowing the adversary used persistence techniques in the other attacks, it’s more prudent to just wipe the devices.

The information related to the incident will need to be aggregated and packaged up for law enforcement and the general counsel ahead of the inevitable public disclosure. You take another note that the case management system earned its keep, both tracking the activity related to the incident and providing a place to store the artifacts from the investigation with proper chain of custody. Given the team’s small staff, that leverage will make the next incident response go more smoothly.

Finally, the incident was discovered by a savvy admin doing some hunting on your networks. So to close the active part of the investigation, you task the same resource with going back through the environment and hunting to make sure this attack has been fully eradicated and that no other attacks are in process. Given the small size of the team, it’s not easy to devote resources to hunting, but given the results, this is an activity that will need to happen on a monthly cadence.

## Closing the Loop

To finalize the incident, you hold a post-mortem with the extended team, including representatives from the general counsel’s office. The threat intelligence being used needs to be revisited and scrutinized, given that a compromised device connected to a botnet and it wasn’t detected. And the rules on the egress filters have been tightened, based on the reality that if the exfiltrated data had been encrypted, the response would have been dramatically more complicated. The post-mortem also provided a great opportunity to reinforce the importance of having security involved in the application architecture process, given how well the new IaaS application stood up under attack.

Yet this was another reminder that sometimes a skilled admin who can follow their instincts is the best defense. The tools in place definitely helped accelerate the response, identify root cause faster, and remediate more effectively. But doing Incident Response in the Cloud Age involves both people and technology, as well as internal and external data, to ensure an effective and efficient investigation and eventual remediation.
