Incident Response in the Cloud Age: More Data, No Data, or Both?
By Mike Rothman
As we discussed in the first post of this series, incident response needs to change, given disruptions such as cloud computing and the availability of new data sources, including external threat intelligence. We wrote a paper called Leveraging Threat Intelligence in Incident Response (TI+IR) back in 2014 to update our existing I/R process map. Here is what we came up with:
So what has changed in the two years since we published that paper? Back then the cloud was nascent and we didn’t know if DevOps was going to work. Today both the cloud and DevOps are widely acknowledged as the future of computing and how applications will be developed and deployed. Of course we will take a while to get there, but they are clearly real already, and upending pretty much all the existing ways security currently works, including incident response.
The good news is that our process map still shows how I/R can leverage additional data sources and the other functions involved in performing a complete and thorough investigation. The bad news is that it is hard to find enough staff to fill all the functions described on the map, but we’ll deal with that in our next post. For now let’s focus on integrating additional data sources, including external threat intelligence, and handling emerging cloud architectures.
More Data (Threat Intel)
We explained why threat intelligence matters to incident response in our TI+IR paper:
To really respond faster you need to streamline investigations and make the most of your resources, a message we’ve been delivering for years. This starts with an understanding of what information would interest attackers. From there you can identify potential adversaries and gather threat intelligence to anticipate their targets and tactics. With that information you can protect yourself, monitor for indicators of compromise, and streamline your response when an attack is (inevitably) successful.
You need to figure out the right threat intelligence sources, and how to aggregate the data and run the analytics. We don’t want to rehash a lot of what’s in the TI+IR paper, but the most useful information sources include:
- Compromised Devices: This data source provides external notification that a device is acting suspiciously by communicating with known bad sites or participating in botnet-like activities. Services are emerging to mine large volumes of Internet traffic to identify such devices.
- Malware Indicators: Malware analysis continues to mature rapidly, getting better and better at understanding exactly what malicious code does to devices. This enables you to define both technical and behavioral indicators, across all platforms and devices, to search for within your environment, as described in gory detail in Malware Analysis Quant.
- IP Reputation: The most common reputation data is based on IP addresses and provides a dynamic list of known bad and/or suspicious addresses, based on data such as spam sources, torrent usage, DDoS traffic indicators, and web attack origins. IP reputation has evolved since its introduction, and now features scores comparing the relative maliciousness of different addresses, factoring in additional context such as Tor nodes/anonymous proxies, geolocation, and device ID to further refine reputation.
- Malicious Infrastructure: One specialized type of reputation often packaged as a separate feed is intelligence on Command and Control (C&C) networks and other servers/sources of malicious activity. These feeds track global C&C traffic and pinpoint malware originators, botnet controllers, compromised proxies, and other IP addresses and sites to watch for as you monitor your environment.
- Phishing Messages: Most advanced attacks seem to start with a simple email. Given the ubiquity of email and the ease of adding links to messages, attackers typically find email the path of least resistance to a foothold in your environment. Isolating and analyzing phishing email can yield valuable information about attackers and tactics.
As depicted in the process map above, you integrate both external and internal security data sources, then perform analytics to isolate the root cause of the attacks and figure out the damage and extent of the compromise. Critical success factors in dealing with all this data are the ability to aggregate it somewhere, and then to perform the necessary analysis.
This aggregation happens at multiple layers of the I/R process, so you’ll need to store and integrate all the I/R-relevant data. Physical integration is putting all your data into a single store, and then using it as a central repository for response. Logical integration uses valuable pieces of threat intelligence to search for issues within your environment, using separate systems for internal and external data. We are not religious about how you handle it, although there are advantages to centralizing all data in one place. As long as you can do your job, which means collecting TI and using it to focus investigation, either way works. Vendors providing big data security all want to be your physical aggregation point, but results are what matter, not where you store data.
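As a rough illustration of logical integration, the sketch below keeps external indicators and internal events in separate structures and only joins them at query time. Everything specific here is an assumption made for the example: the feed contents, the field names (`host`, `src_ip`, `dst_ip`), the 0-100 score scale, and the `min_score` threshold. Real feeds and log schemas will differ.

```python
from ipaddress import ip_address, ip_network

# Hypothetical external feed: network -> reputation score (0-100, higher is worse).
# These are reserved documentation ranges, not real indicators.
THREAT_FEED = {
    "203.0.113.0/24": 90,
    "198.51.100.7/32": 60,
}

# Hypothetical internal events, e.g. pulled from a firewall or proxy log store.
INTERNAL_EVENTS = [
    {"host": "web-01", "src_ip": "10.0.0.5", "dst_ip": "203.0.113.44"},
    {"host": "web-02", "src_ip": "10.0.0.6", "dst_ip": "192.0.2.10"},
    {"host": "db-01", "src_ip": "10.0.0.9", "dst_ip": "198.51.100.7"},
]


def match_indicators(events, feed, min_score=50):
    """Return events whose destination falls inside a flagged network."""
    networks = [(ip_network(cidr), score) for cidr, score in feed.items()]
    hits = []
    for event in events:
        dst = ip_address(event["dst_ip"])
        for network, score in networks:
            if score >= min_score and dst in network:
                hits.append({**event, "indicator": str(network), "score": score})
                break  # one matching indicator is enough to flag the event
    return hits


suspicious = match_indicators(INTERNAL_EVENTS, THREAT_FEED)
for hit in suspicious:
    print(hit["host"], hit["dst_ip"], hit["score"])
```

Raising `min_score` narrows the search to the worst-rated indicators, which is one simple way to use TI to focus an investigation rather than flood it.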
Of course we are talking about a huge amount of data, so your choices for both data sources and I/R aggregation platform are critical parts of building an effective response process.
No Data (Cloud)
So what happens to response now that you don’t control a lot of the data used by your corporate systems? The data may reside with a Software as a Service (SaaS) provider, or your application may be deployed in a cloud computing service. In data centers with traditional networks it’s pretty straightforward to run traffic through inspection points, capture data as needed, and then perform forensic investigation. In the cloud, not so much.
To be clear, moving your computing to the cloud doesn’t totally eliminate your ability to monitor and investigate your systems, but your visibility into what’s happening on those systems using traditional technologies is dramatically limited.
So the first step for I/R in the cloud has nothing to do with technology. It’s all about governance. Ugh. I know most security professionals just felt a wave of nausea hit. The G word is not what anyone wants to hear. But it’s pretty much the only way to establish the rules of engagement with cloud service providers. What kinds of things need to be defined?
- SLAs: One of the first things we teach in our cloud security classes is the need to have strong Service Level Agreements (SLAs) with cloud providers. And these SLAs need to be established before you sign a deal. You don’t have much leverage during negotiations, but you have none after you sign. SLAs should cover response time, access to specific data types, proactive alerts (them telling you when they have an issue), etc. We suggest you refer to the Cloud Security Alliance Guidance for specifics about proper governance structures for cloud computing.
- Hand-offs and Escalations: At some point there will be an issue, and you’ll need access to data the cloud provider has. How will that happen? The time to work through these issues is not while your cloud technology stack is crawling with attackers. Like all aspects of I/R, practice makes pretty good – there is no such thing as perfect. That means you need to practice your data gathering and hand-off processes with your cloud providers. The escalation process within the service provider also needs to be very well defined to make sure you can get adequate response under duress.
Once the proper governance structure is in place, you need to figure out what data is available to you in the various cloud computing models. In a SaaS offering you are pretty much restricted to logs (mostly activity, access, and identity logs) and information about access to the SaaS provider’s APIs. This data is quite limited, but can help figure out whether an employee’s account has been compromised, and what actions the account performed. Depending on the nature of the attack and the agreement with your SaaS provider, you may also be able to get some internal telemetry, but don’t count on that.
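To make that concrete, here is a minimal sketch of working with SaaS activity logs to spot a possibly compromised account and list what it did. The record layout (`time`, `user`, `country`, `action`) is hypothetical, since every SaaS provider exports its own schema, and the "first-seen country" baseline is a deliberately naive heuristic chosen for brevity.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical SaaS audit records; real providers use their own schemas.
AUDIT_LOG = [
    {"time": "2016-05-01T09:00:00", "user": "alice", "country": "US", "action": "login"},
    {"time": "2016-05-01T09:05:00", "user": "alice", "country": "US", "action": "view_report"},
    {"time": "2016-05-02T03:12:00", "user": "alice", "country": "RO", "action": "login"},
    {"time": "2016-05-02T03:14:00", "user": "alice", "country": "RO", "action": "export_contacts"},
    {"time": "2016-05-02T10:00:00", "user": "bob", "country": "DE", "action": "login"},
]


def flag_new_location_activity(records):
    """Flag activity from a country that differs from the user's first-seen country."""
    baseline = {}                # user -> country of the earliest event
    flagged = defaultdict(list)  # user -> (time, action) pairs from other countries
    for rec in sorted(records, key=lambda r: datetime.fromisoformat(r["time"])):
        home = baseline.setdefault(rec["user"], rec["country"])
        if rec["country"] != home:
            flagged[rec["user"]].append((rec["time"], rec["action"]))
    return dict(flagged)


for user, actions in flag_new_location_activity(AUDIT_LOG).items():
    print(user, actions)
```

The output is only a lead for an analyst: travel and VPN use will produce false positives, so treat flagged activity as a starting point for the investigation, not a verdict.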
If you run your applications in an Infrastructure as a Service (IaaS) environment you will have access to logs (activity, access, and identity) of your cloud infrastructure activity at a granular level. Obviously a huge difference from SaaS is that you control the servers and networks running in your IaaS environment, so you can instrument your application stacks to provide granular activity logging, and route network traffic through an inspection/capture point to gather network forensics. Additionally many of the IaaS providers have fairly sophisticated offerings to provide configuration change data and provide light security assessments to pinpoint potential security issues, both of which are useful during incident response.
If you run a private or hybrid cloud that connects to cloud environments at an IaaS provider as well as your own data center, you will also have access to logs generated by the virtualization infrastructure. As we alluded to earlier, regardless of where the application runs, you can (and should) be instrumenting the application itself to provide granular logging and activity monitoring to detect misuse. With the limited visibility in the cloud, you really don’t have a choice but to both build security into your cloud technology stacks, and make sure you are able to generate application logs to provide sufficient data to support an investigation.
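One way to instrument an application is to emit structured audit events that a collector can index later. The sketch below uses only Python's standard `logging` and `json` modules. The logger name `app.audit` and the field names (`user`, `source_ip`, `resource`, `outcome`) are assumptions for illustration, not a standard schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed via `extra=` land as attributes on the record.
        for key in ("user", "source_ip", "resource", "outcome"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_audit_logger(stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger("app.audit")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep audit events out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


audit = build_audit_logger()
audit.info(
    "document_download",
    extra={"user": "alice", "source_ip": "198.51.100.7",
           "resource": "/reports/q1.pdf", "outcome": "allowed"},
)
```

One JSON object per line keeps the events easy to ship to whatever aggregation point you use, whether that is a physical store or a separate system queried on demand.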
Capture the Flag
In the cloud, whether it’s SaaS, IaaS, or hybrid cloud, you are unlikely to get access to the full network packet stream. You will have access to the specific instances you run in the cloud (whether IaaS or hybrid cloud), but obviously the type of telemetry you can gather will vary. So how much forensics information is enough?
- Full Network Packet Capture: Packets are useful for knowing exactly what happened and being able to reconstruct and play back sessions. To capture packets you need either virtual taps to redirect network traffic to capture points, or to run network traffic through sensors in the cloud. But faster networks and less visibility are making full packet capture less feasible.
- Capture and Release: This approach involves capturing the packet stream and deriving metadata about network traffic dynamics, and content in the stream as well. It’s more efficient because you aren’t necessarily keeping the full data stream, but get a lot more information than can be gleaned from network flows. This still requires inline sensors or virtual taps to capture traffic before releasing it.
- Triggered Capture: When a suspicious alert happens you may want to capture the traffic and logs before and after the alert on the devices/networks in question. That requires at least a capture and release approach (to get the data), and provides flexibility to only capture when you think something is important, so it’s more efficient than full network packet capture.
- Network Flows: It will be increasingly common to get network flow data, which provides source and destination information for network traffic through your cloud environment, and enables you to see if there was some kind of anomalous activity prior to the compromise.
- Instance Logs: The closest analogy is the increasingly common endpoint detection and forensics offerings. If you deploy them within your cloud instances, you can figure out what happened, but may lack context on who and why unless you are also fully capturing device activity. Also understand that these tools will need to be updated to handle the nuances of working in the cloud, including autoscaling and virtual networking.
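The triggered capture idea above can be sketched with a small rolling buffer: keep the last few records in memory, and only persist them (plus a few that follow) when an alert fires. The class name, the `pre`/`post` window sizes, and the use of integers as stand-ins for packets or log records are all assumptions for this example. A real sensor would work on traffic from a virtual tap or inline capture point.

```python
from collections import deque


class TriggeredCapture:
    """Keep a rolling window of recent records; persist context around an alert."""

    def __init__(self, pre=3, post=2):
        self.pre_buffer = deque(maxlen=pre)  # rolling "before" window
        self.post = post                     # records to keep after a trigger
        self.remaining = 0                   # post-trigger records still wanted
        self.current = None                  # capture in progress, if any
        self.captures = []                   # completed captures

    def observe(self, record, alert=False):
        if self.current is not None:
            # A capture is in progress: keep this record as "after" context.
            self.current.append(record)
            if alert:
                self.remaining = self.post   # a new alert extends the window
            else:
                self.remaining -= 1
        elif alert:
            # Start a capture seeded with the buffered "before" context.
            self.current = list(self.pre_buffer) + [record]
            self.remaining = self.post
        if self.current is not None and self.remaining <= 0:
            self.captures.append(self.current)
            self.current = None
        self.pre_buffer.append(record)


capture = TriggeredCapture(pre=3, post=2)
for packet_id in range(1, 11):
    capture.observe(packet_id, alert=(packet_id == 5))
print(capture.captures)  # [[2, 3, 4, 5, 6, 7]]
```

Only six of the ten records are retained, which is the efficiency argument for triggered capture over keeping the full stream.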
We’ve always been fans of more rather than less data. But as we move into the Cloud Age practitioners need to be much more strategic and efficient about how and where to get data to drive incident response. It will come from external sources, as well as some logical sensors and capture points within the clouds (both public and private) in use. The increasing speed of networks and telemetry available from instances/servers, especially in data centers, will continue to challenge the scale of data collection infrastructure, so scale is a key consideration for I/R in the Cloud Age.
All this I/R data now requires technology that can actually analyze it within a reasonable timeframe. We hear a lot about “big data” for security monitoring these days. Regardless of what it’s called by the industry hype machine, you need technologies to index, search through, and find patterns within data – even when you don’t know exactly what you’re looking for, to start. Fortunately other industries – including retail – have been analyzing data to detect unseen and unknown patterns for years (they call it “business intelligence”), and many of their analytic techniques are available to security.
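As a toy example of finding a pattern without knowing exactly what you are looking for, the sketch below flags hosts whose outbound volume sits far from the median, using the median absolute deviation as a simple robust yardstick. The host names, byte counts, and the `threshold` of 5.0 are invented for illustration. This is one basic statistical technique, not a description of what any particular analytics product does.

```python
from statistics import median

# Hypothetical per-host outbound byte counts aggregated from flow records.
OUTBOUND_BYTES = {
    "web-01": 120_000,
    "web-02": 135_000,
    "web-03": 110_000,
    "web-04": 128_000,
    "db-01": 9_800_000,
}


def find_outliers(totals, threshold=5.0):
    """Flag hosts whose volume is far from the median, measured in MAD units."""
    values = list(totals.values())
    mid = median(values)
    mad = median(abs(v - mid) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to compare against
    return sorted(h for h, v in totals.items() if abs(v - mid) / mad > threshold)


print(find_outliers(OUTBOUND_BYTES))  # ['db-01']
```

A database server pushing far more data outbound than its peers is not proof of exfiltration, but it is exactly the kind of unexpected pattern worth pulling into an investigation.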
This scale issue is compounded by cloud usage requiring highly distributed collection infrastructure, which makes I/R collection more art than science, so you need to be constantly learning how much data is enough. The process feedback loop is absolutely critical to make sure that when the right data is not captured, the process evolves to collect the necessary infrastructure telemetry, and instrument applications to ensure sufficient visibility for thorough investigation.
But in the end, incident response always depends on people to some degree, and finding enough of them is the problem nowadays. So our next post will tackle talent for incident response, and the potential shifts as cloud computing continues to take root.