Our last AESP post covered a number of approaches to preventing attacks on endpoints and servers. Of course prevention remains the shiny object most practitioners hope to achieve. If they can stop the attack before the device is compromised, there is nothing to clean up. We continue to remind everyone that hope is not a strategy, and counting on blocking every attack before it reaches your devices usually ends badly.
As we detailed in the introduction, you need to plan for compromise because it will happen. Adversaries have gotten much better, the attack surface has increased dramatically, and you aren’t going to prevent every attack. So pwnage will happen, and what you do next is critical, both to protecting the critical information in your environment and to your success as a security professional.
So let’s reiterate one of our core security philosophies: Once the device is compromised, you need to shorten the window between compromise and when you know the device has been owned. Simple to say but very hard to do. The way to get there is to change your focus from prevention to a more inclusive process, including detection and investigation…
Our introduction described detection:
You cannot prevent every attack, so you need a way to detect attacks after they get through your defenses. There are a number of different options for detection – most based on watching for patterns that indicate a compromised device. The key is to shorten the time between when the device is compromised and when you discover it has been compromised.
To be fair, there is a gray area between detection and prevention, at least from an endpoint and server standpoint. With the exception of application control, the prevention techniques described in the last post depend on actually detecting the bad activity first. If you are looking at controls using advanced heuristics, you detect the malicious behavior first – then you block it. In an isolation context you run executables in the walled garden, but you don’t really do anything until you detect bad activity – then you kill the virtual machine or process under attack.
But there is more to detection than just figuring out what to block. Detection in the broader sense needs to include finding attacks you missed during execution because:
- You didn’t know it was malware at the time, which happens frequently – especially given how quickly attackers innovate. Advanced attackers have stockpiles of unknown exploits (0-days) they use as needed. So your prevention technology could be working as designed, but still not recognize the attack. There is no shame in that.
- Alternatively, the prevention technology may have missed the attack. This is common as well because advanced adversaries specialize in evading known preventative controls.
So how can you detect after compromise? Monitor other data sources for indicators that a device has been compromised. This series is focused on protecting endpoints and servers, but looking only at devices is insufficient. You also need to monitor the network for a full perspective on what’s really happening, using a couple of techniques:
- Network-based malware detection: One of the most reliable ways to identify compromised devices is to watch for communications with known botnets. You can look for specific traffic patterns, or for communications to known botnet IP addresses. We covered these concepts in both the NBMD 2.0 and TI+SM series.
- Egress/Content Filtering: You can also look for content that should not be leaving the confines of your network. This may involve a broad DLP deployment – or possibly looking for sensitive content on your web filters, email security gateways, and next generation firewalls.
Keep in mind that every endpoint and server device has a network stack of some sort. Thus a subset of this monitoring can be performed within the device, by looking at traffic that enters and leaves the stack.
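To make the botnet-communication check above concrete, here is a minimal sketch that flags devices talking to known-bad destinations. The blocklist entries, device names, and the `find_suspect_devices` helper are all hypothetical (the addresses come from reserved documentation ranges); a real deployment would pull the list from a threat intelligence feed and read connections from flow or proxy logs.

```python
import ipaddress

# Hypothetical blocklist; real entries would come from a threat intelligence feed.
KNOWN_BOTNET_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # reserved documentation range
    ipaddress.ip_network("198.51.100.7/32"),  # single address in a documentation range
]


def find_suspect_devices(connections, bad_networks=KNOWN_BOTNET_NETWORKS):
    """Return {device: sorted list of flagged destinations}.

    `connections` is an iterable of (source_device, destination_ip) pairs.
    """
    suspects = {}
    for device, dest in connections:
        addr = ipaddress.ip_address(dest)
        if any(addr in net for net in bad_networks):
            suspects.setdefault(device, set()).add(dest)
    return {device: sorted(dests) for device, dests in suspects.items()}


# Illustrative outbound connection records.
connections = [
    ("laptop-17", "192.0.2.10"),
    ("laptop-17", "203.0.113.45"),
    ("server-03", "198.51.100.7"),
    ("server-03", "198.51.100.8"),
]

print(find_suspect_devices(connections))
```

A hit here only marks a device as suspicious. It still needs to be verified before you treat it as compromised.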
As mentioned above, threat intelligence (TI) is making detection much more effective, facilitated by information sharing between vendors and organizations. With TI you can become aware of new attacks, emerging botnets, websites serving malware, and a variety of other things you haven’t seen yet and therefore don’t know are bad. Basically you leverage TI to look for attacks even after they enter your network and possibly compromise your devices. We call this retrospective searching. This works by either a) using file trajectory – tracking all file activity on all devices, looking for malware files/droppers as they appear and move through your network; or b) looking for attack indicators on devices with detailed activity searching on endpoints – assuming you collect sufficient endpoint data.
It may seem like you are getting ahead of the threat, but you aren’t really. Instead you are looking for likely attacks – because attackers reuse tactics and malware against different organizations, there is a good chance that malware which has already hit others will show up in your environment before long.
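Retrospective searching can be sketched as a lookup of newly learned indicators against file activity you have already recorded. The event fields (`time`, `device`, `sha256`, `path`), the shortened placeholder hashes, and the `retrospective_search` function are assumptions for illustration, not any product's schema.

```python
from collections import defaultdict


def retrospective_search(file_events, new_bad_hashes):
    """Find past file activity involving hashes that are now known to be bad.

    `file_events` is an iterable of dicts with the hypothetical keys
    "time" (ISO 8601 string), "device", "sha256", and "path".
    Returns {sha256: [matching events, oldest first]}.
    """
    hits = defaultdict(list)
    for event in file_events:
        if event["sha256"] in new_bad_hashes:
            hits[event["sha256"]].append(event)
    # ISO 8601 timestamps in the same format sort correctly as strings.
    return {h: sorted(evts, key=lambda e: e["time"]) for h, evts in hits.items()}


# Illustrative history; the hashes are shortened placeholders.
file_events = [
    {"time": "2014-03-02T10:40:00", "device": "server-03",
     "sha256": "aaa111", "path": "C:/Temp/invoice.exe"},
    {"time": "2014-03-02T09:15:00", "device": "laptop-17",
     "sha256": "aaa111", "path": "C:/Users/al/Downloads/invoice.exe"},
    {"time": "2014-03-02T09:20:00", "device": "laptop-22",
     "sha256": "bbb222", "path": "C:/Windows/notepad.exe"},
]

# Indicator that arrived from a feed after the events were recorded.
new_bad_hashes = {"aaa111"}

hits = retrospective_search(file_events, new_bad_hashes)
for sha256, events in hits.items():
    print(sha256, [e["device"] for e in events])
```

The search is only as good as the history you kept, which is why the data capture decisions later in this post matter.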
Once you identify a suspicious device you need to verify whether the device is really compromised. This verification involves scrutinizing what the endpoint has done recently for indicators of compromise or other such activity that would confirm a successful attack. We’ll describe how to capture that information later in this post.
Once you validate the endpoint has been compromised, you go into incident response/containment mode. We described the investigation process in the introduction as:
Once you detect an attack you need to verify the compromise and understand what it actually did. This typically involves a formal investigation, including a structured process to gather forensic data from devices, triage to determine the root cause of the attack, and searching to determine how broadly the attack has spread within your environment.
As we described in React Faster and Better, there are a number of steps in a formal investigation. We won’t rehash them here, but to investigate a compromised endpoint and/or server you need to capture a bunch of forensic information from the device, including:
- Memory contents
- Process lists
- Disk images (to capture the state of the file system)
- Registry values
- Executables (to support malware analysis and reverse engineering)
- Network activity logs
As part of the investigation you also need to understand the attack timeline. This enables you to identify the first compromised device (Patient Zero), as well as all affected devices, so you can effectively contain the damage when you reach the remediation phase. This timeline shows you how the malware got into your network in the first place, and how it proliferated to infect other devices.
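As a rough illustration of building that timeline, the sketch below takes indicator sightings per device, keeps the earliest sighting for each, and orders the devices by first appearance. The input shape and the `build_timeline` name are hypothetical. The earliest device in your data is only a candidate for Patient Zero, since the true first infection may predate your telemetry.

```python
from datetime import datetime


def build_timeline(indicator_hits):
    """Order affected devices by the first time an indicator was seen on each.

    `indicator_hits` is an iterable of (device, ISO 8601 timestamp) pairs.
    Returns a list of (device, first_seen) tuples, earliest first.
    """
    first_seen = {}
    for device, timestamp in indicator_hits:
        seen = datetime.fromisoformat(timestamp)
        if device not in first_seen or seen < first_seen[device]:
            first_seen[device] = seen
    return sorted(first_seen.items(), key=lambda item: item[1])


# Illustrative sightings gathered during the investigation.
indicator_hits = [
    ("server-03", "2014-03-02T10:40:00"),
    ("laptop-17", "2014-03-02T09:15:00"),
    ("laptop-40", "2014-03-03T08:05:00"),
    ("laptop-17", "2014-03-02T11:00:00"),
]

timeline = build_timeline(indicator_hits)
patient_zero = timeline[0][0]  # earliest device observed, a candidate only
for device, seen in timeline:
    print(seen.isoformat(), device)
```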
This highlights one of the biggest problems in handling modern malware – getting rid of it completely. Even if you wipe an infected device to bare metal and reimage it, unless you identify all the other infected devices in your environment and clean them successfully, the malware will cause additional trouble. So as part of investigations you need to isolate all devices affected by the attack and clean them once and for all.
You can’t just rely on behavioral indicators (the device behaving badly) to identify additional affected devices, because the malware may be lying dormant and awaiting instructions from the bot master. You need to analyze detailed telemetry from endpoints and servers to determine whether these indicators are present – which brings us to the proverbial glue that enables both detection and investigation of attacks on endpoints and servers.
Capture Two Birds (with One Agent)
As we explained in our 2014 Endpoint Security Buyer’s Guide, to really investigate a device, you need to capture what’s happening on the endpoints and servers at a very granular level. This includes file activity, registry changes, privilege escalation, executed programs, network activity, and a variety of other activities happening on endpoints. We call this function Device Activity Monitoring, and it is also called ETDR (Endpoint Threat Detection and Response).
The key functions in device activity monitoring start with data capture. To get the data needed for a comprehensive investigation you need to capture data continuously. Of course that might not be practical on all devices, in which case you will use a trigger to start full collection. For example if a user clicks a link in an email that takes the browser to a suspicious site, you would start pulling detailed data from the endpoint, because a compromise is likely imminent.
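Trigger-based collection might look something like the sketch below: devices stay in a lightweight baseline mode until a trigger event opens a window of full capture. The event names, the one-hour window, and the `CollectionPolicy` class are assumptions made for illustration, not how any particular agent works.

```python
# Hypothetical event types that should switch a device to full capture.
TRIGGER_EVENTS = {"suspicious_url_visit", "unknown_executable_launched"}


class CollectionPolicy:
    """Decide per device whether to collect baseline or full telemetry."""

    def __init__(self, full_capture_seconds=3600):
        self.full_capture_seconds = full_capture_seconds
        self._full_until = {}  # device -> time (in seconds) when full capture ends

    def on_event(self, device, event_type, now):
        """Open (or extend) the full-capture window when a trigger fires."""
        if event_type in TRIGGER_EVENTS:
            self._full_until[device] = now + self.full_capture_seconds

    def mode(self, device, now):
        """Return "full" while the device is inside its capture window."""
        return "full" if now < self._full_until.get(device, 0) else "baseline"


policy = CollectionPolicy()
policy.on_event("laptop-17", "suspicious_url_visit", now=1000)

print(policy.mode("laptop-17", now=2000))  # inside the window
print(policy.mode("laptop-17", now=5000))  # window has expired
print(policy.mode("laptop-22", now=2000))  # never triggered
```

The trade-off is that anything that happened before the trigger fired is only visible at baseline detail, which is why continuous capture is preferable wherever it is practical.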
Another capture decision is where to store the data. There is a battle brewing between products that capture this device telemetry data and store it on customer premises, and those which store data in the cloud. There are pros and cons to both approaches. On one side you will hear a lot about the security implications of moving such sensitive data to the cloud – and those are legitimate concerns. On the other hand, the need for large-scale analysis of aggregated and anonymized data, to identify emerging patterns across organizations, favors a cloud-based model. Mr. Market will determine the right approach soon enough, but where to store your telemetry data is definitely a deployment decision you need to make when selecting an approach.
Next, the activity monitoring technology should have adequate hooks for threat intelligence (TI) integration. The vendor’s research team can and should populate agents with emerging attack indicators, IP and file reputation, etc., to provide a basis for detecting advanced attacks. But one research feed is not enough, so you will want a product flexible enough to ingest other feeds – likely through industry-standard TI formats and exchange mechanisms such as STIX, TAXII, OpenIOC, OTX, et al.
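Ingesting more than one feed mostly comes down to normalizing indicators and remembering who reported what. The sketch below assumes each feed has already been parsed into simple (type, value) pairs; parsing the actual feed formats is out of scope here, and the feed names and the `merge_feeds` function are hypothetical.

```python
from collections import defaultdict


def merge_feeds(feeds):
    """Combine indicators from several feeds, tracking which feeds reported each.

    `feeds` maps a feed name to an iterable of (indicator_type, value) pairs.
    Values are trimmed and lowercased so the same indicator matches across feeds.
    """
    merged = defaultdict(set)
    for feed_name, indicators in feeds.items():
        for indicator_type, value in indicators:
            key = (indicator_type, value.strip().lower())
            merged[key].add(feed_name)
    return {key: sorted(names) for key, names in merged.items()}


# Illustrative feeds with placeholder indicators.
feeds = {
    "vendor": [("ip", "203.0.113.45"), ("sha256", "ABC123")],
    "community": [("sha256", "abc123 "), ("domain", "bad.example")],
}

merged = merge_feeds(feeds)
for (indicator_type, value), sources in sorted(merged.items()):
    print(indicator_type, value, sources)
```

An indicator reported by several independent sources is a reasonable candidate for higher priority, though that weighting is a judgment call.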
Finally, endpoints and servers generate a huge amount of data, so it’s also necessary for the product to perform big data style analysis on the telemetry dataset, to identify patterns and develop relationships between data sources. Having the data is the first step. Supplementing it with external information to help prioritize focus areas is second. Being able to analyze the data to provide useful information to security practitioners and incident responders is the third leg of the device activity monitoring triangle.
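One simple example of this kind of analysis is prevalence: an executable seen on only one or two devices across the whole environment deserves more attention than one running everywhere. The sketch below is a toy version, assuming telemetry has been reduced to (device, file hash) pairs; the `rare_executables` name, the threshold, and the hash labels are hypothetical.

```python
from collections import defaultdict


def rare_executables(executions, max_devices=1):
    """Return hashes of executables seen on at most `max_devices` devices.

    `executions` is an iterable of (device, file_hash) pairs.
    """
    devices_by_hash = defaultdict(set)
    for device, file_hash in executions:
        devices_by_hash[file_hash].add(device)
    return sorted(
        file_hash
        for file_hash, devices in devices_by_hash.items()
        if len(devices) <= max_devices
    )


# Illustrative execution telemetry with placeholder hash labels.
executions = [
    ("laptop-17", "hash-office"),
    ("laptop-22", "hash-office"),
    ("server-03", "hash-office"),
    ("laptop-17", "hash-dropper"),
    ("laptop-22", "hash-browser"),
    ("server-03", "hash-browser"),
]

print(rare_executables(executions))
```

Rare does not mean malicious (think of in-house tools), so results like these feed prioritization rather than automatic blocking.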
If you missed on prevention, you need to detect bad behavior by infected endpoints and servers, and then verify and investigate the attack. But that doesn’t entirely solve the problem – you still have active malware on the devices. Now it’s time to remediate.
Remediation tends to fall within the purview of Operations, but security teams can get a quick win by providing very detailed data and recommendations for remediation to Ops, to help focus their efforts and aid them in fully cleaning up the attack. We will discuss that Quick Win as we wrap up this series, in our next and final post.