Endpoint Advanced Protection Buyer's Guide: Key Capabilities for Response and Hunting
By Mike Rothman
The next set of key Endpoint Detection and Response (EDR) capabilities we will discuss is focused on response and hunting.
Response begins after the attack has happened. Basically, Pandora’s Box is open and an active adversary is on your endpoints, probably stealing your stuff. So you need to understand the depth of the attack, and to focus on containment and returning the environment to a known safe state as quickly as possible.
Understand that detection and response are considered different use cases when evaluating endpoint security vendors, but you aren’t really going to buy detection without buying a response capability as well. That would be like buying binoculars so you could spot forest fires, with no plan for what to do when you found one. In this case you can’t just call the friendly Rangers. You detect and validate an attack – then you need to respond.
Detection and response functions are so closely aligned that functionality between them blurs. For clarity, in our vernacular detection results in an enriched alert which is sent to a security analyst. The analyst responds by validating the alert, figuring out the potential damage, determining how to contain the attack, and then working with operations to provide an orchestrated response. Ideally, detection is largely automated and response is where a human comes into play.
We understand reality is a bit more complicated, but this oversimplification makes the explanation and exploration simpler, as well as the evaluation and selection process for detection and response technologies.
Endpoint response starts with data collection. For detection you have the option not to store or maintain endpoint telemetry. We don’t think that makes any sense, but people make poor choices every day. But for response we clearly need to mine endpoint data to figure out exactly what happened and assess the damage. Data management and accessibility is the first key capability of a response platform.
Data types: So what do you store? In a perfect world you would store everything, and some offerings include full recording of pretty much everything that takes place on all endpoints, which they typically call “Full DVR”. But of course that requires capturing and storing a ton of data, so a reasonable approach is to derive metadata and perform broader (full) recording on all devices you suspect of being compromised. At minimum you’ll want to gather endpoint logs for all system-level activities and configuration changes, file usage statistics with associated hashes, full process information (including parent and child process relationships), user identity/authentication activities (such as logins and entitlement changes), and network sessions. More importantly for selection, your response offering should be able to collect as much data, with as much granularity, as you deem necessary. Don’t give up on data collection because your response platform doesn’t support it.
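To make the minimum data set concrete, here is a sketch of what an endpoint telemetry record might look like. The field and function names are illustrative assumptions, not any vendor's actual schema:

```python
from dataclasses import dataclass, field

# Minimal sketch of the telemetry types listed above. Field names are
# illustrative, not any vendor's schema.

@dataclass
class EndpointEvent:
    device_id: str
    timestamp: float   # epoch seconds
    event_type: str    # "process" | "file" | "auth" | "network" | "config"
    details: dict = field(default_factory=dict)

def process_event(device_id, ts, pid, ppid, image, cmdline):
    """Full process info, including the parent/child relationship."""
    return EndpointEvent(device_id, ts, "process",
                         {"pid": pid, "ppid": ppid,
                          "image": image, "cmdline": cmdline})

def file_event(device_id, ts, path, sha256):
    """File usage with its hash, so it can be matched to threat intel later."""
    return EndpointEvent(device_id, ts, "file",
                         {"path": path, "sha256": sha256})
```

The point of the common envelope (device, time, type) plus per-type details is that every record can be placed on a single device timeline during response, whatever its type.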
Data storage/management: Once you figure out what you will collect, you get into the mechanics of actually storing it; you’ll want some flexibility. Many of the data management decision points are similar between detection and response – particularly around cost and performance. But response data is needed longer, is more valuable than most of the full data feed you use for detection – which should contain mostly innocuous baseline traffic – and requires granular analysis and mining, so the storage architecture becomes more pertinent.
Local collection: Historically, well before the cloud was a thing, endpoint telemetry was stored locally on each device. Storage is relatively plentiful on endpoint devices and data doesn’t need to be moved, so this is a cost-efficient option. But you cannot perform analysis across endpoints to respond to broader campaigns without combining the data, so at some point you need central aggregation. Another concern with local collection is the possibility of data being tampered with, or inaccessible when you need it.
Central aggregation: The other approach is to send all telemetry to a central aggregation point, typically in the cloud. This requires a bunch of storage and consumes network resources to send the data to the central aggregation point. But because you are likely buying a service, if the vendor decides to store stuff in the cloud that’s their problem. Your concern is the speed and accuracy of analysis of your endpoint telemetry, and your ability to drill down into it during response. The rest of the architecture can vary depending on how the vendor’s product works. Focus on how you can get at the data when you need it.
Hybrid: We increasingly see a hybrid approach, where a significant amount of data is stored locally (where storage is reasonably cheap), and relevant metadata is sent to a central spot (typically in the cloud) for analytics. This approach is efficient by leveraging the advantages of both local storage and central analytics. But if you need to drill down into the data that could be a problem, because it isn’t all in one place, and data on-device could have been either tampered with or destroyed during the attack. Make sure you understand how to access endpoint-specific telemetry during investigation.
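The hybrid model described above can be sketched in a few lines. This is a toy illustration under assumed names (real agents persist to disk and ship metadata over the network, not to in-memory lists):

```python
# Sketch of the hybrid model: full-fidelity events stay on the endpoint;
# only compact metadata is forwarded for central analytics.

local_store = []    # full events, kept on the device
central_store = []  # metadata only, aggregated centrally

META_KEYS = ("device_id", "timestamp", "event_type")

def record(event: dict):
    local_store.append(event)                      # everything stays local
    central_store.append({k: event[k] for k in META_KEYS})  # metadata goes up

def drill_down(device_id: str):
    """Investigation path back to the device-resident full telemetry."""
    return [e for e in local_store if e["device_id"] == device_id]
```

The risk the paragraph flags is visible here: `drill_down` only works if the device-resident `local_store` is still intact and reachable, which an attacker may have prevented.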
Device imaging: Historically this has been the purview of purpose-built incident response platforms. But as EDR continues to evolve, the capability to pull a forensic image from a device is important – both to ensure proper chain of custody (in the event of prosecution) and to support deeper investigation.
You got an alert from the detection process, and you have been systematically collecting data; now your SOC analyst needs to figure out whether the alert is valid or a false positive. Historically a lot of this has been by feel, and experienced responders often have a hunch that something is malicious. But as we have pointed out many times, we don’t have enough experienced responders, so we need to use technology more effectively to validate alerts.
Case management: The objective is to make each analyst as effective and efficient as possible, so you should have a place for all the information related to an alert to be stored. This includes enrichment data from threat intel (described above) and other artifacts gathered during validation. This also should feed into a broader incident response platform, if the forensics/response team uses one.
Visualization: To reliably and quickly validate an alert, it is very helpful to see a timeline of all activity on a device. That way you can see if child processes have been spawned unnecessarily, registry entries have been added without reason, configuration changes have been made, or network traffic volume is outside the normal range. Or about a thousand other activities that show up in a timeline. An analyst needs to perform a quick scan of device activity and figure out what requires further investigation. Visualization can cover one or several devices, but be wary of overcomplicating the console. It is definitely possible to present too much information.
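At its core, the timeline view is just device telemetry sorted by time and rendered compactly. A minimal sketch, assuming events carry a timestamp, type, and one-line summary:

```python
def timeline(events):
    """Render device activity as a time-ordered list an analyst can scan."""
    lines = []
    for e in sorted(events, key=lambda e: e["timestamp"]):
        lines.append(f'{e["timestamp"]:>10.1f}  {e["event_type"]:<8}  {e["summary"]}')
    return "\n".join(lines)
```

The "too much information" warning in the paragraph applies directly: the value of a view like this comes from the one-line summaries, not from dumping every raw field into the console.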
Drill down: Once an analyst has figured out which activity in the timeline raises concerns, they drill into it. They should be able to see the process tree if it’s a process issue, the destination of suspicious network traffic, or whatever else is available and relevant. From there they’ll find other things to investigate, so being able to jump between different events (and across devices) helps identify the root cause of attacks quickly. There is also a decision to be made regarding whether you need full DVR/reconstruction capabilities when drilling down. Obviously the more granular the available telemetry, the more accurate the validation and root cause analysis. But with increasingly granular metadata available, you might not need full capture. Decide during the proof of concept evaluation, which we will discuss later.
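The process-tree drill down is straightforward to sketch: index the parent/child relationships captured in process telemetry, then walk the spawn chain from a suspicious process. Names here are illustrative:

```python
from collections import defaultdict

def build_process_tree(process_events):
    """Index parent -> children so an analyst can walk the spawn chain."""
    children = defaultdict(list)
    for e in process_events:
        children[e["ppid"]].append(e["pid"])
    return children

def descendants(tree, root_pid):
    """All processes spawned, directly or transitively, by root_pid."""
    found, stack = [], [root_pid]
    while stack:
        pid = stack.pop()
        for child in tree.get(pid, []):
            found.append(child)
            stack.append(child)
    return found
```

This is why parent/child relationships were called out in the data collection discussion: without the `ppid` field, the spawn chain cannot be reconstructed after the fact.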
Workflows and automation: The more structured you can make your response function, the better a chance your junior analysts have of finding the root cause of an attack, and figuring out how to contain and remediate it. Response playbooks for a variety of different kinds of endpoint attacks within the EDR environment help standardize and structure the response process. Additionally, being able to integrate with automation platforms to streamline response – at least the initial phases – dramatically improves effectiveness.
Real-time polling: When drilling down, it sometimes becomes apparent that other devices are involved in an attack, so the ability to jump to other devices during validation provides additional information and context for understanding the depth of the attack and number of devices involved. This is critical supporting documentation when the containment plan is defined.
Sandbox integration: During validation you will also want to check whether an executed file is actually malware. Agents can store executables, and integrate with network-based sandboxes to explode and analyze files – to figure out both whether a file is malicious and also what it does. This provides context for eventual containment and remediation steps. Ideally this integration will be native, and enable you to select an executable within the response console to send to the sandbox, with the verdict and associated report filed with the case.
Once an alert is validated and the device impact understood, the question is what short-term actions can contain the damage. This is largely an integration function, where you will want to do a number of things.
Quarantine/Isolation: The first order of business is to ensure the device doesn’t cause any more damage, so you’ll want to isolate the device by locking down its communications, typically only to the endpoint console. Responders can still access the machine but adversaries cannot. Alternatively, it is useful to have an option to assign the device to a quarantine network using network infrastructure integration, to enable ongoing observation of adversary activity.
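To show what "locking down communications, typically only to the endpoint console" means in practice, here is a sketch that generates host-firewall rules in iptables syntax. Real agents do this through OS firewall APIs rather than shelling out, and the console IP is an assumed parameter:

```python
def isolation_rules(console_ip: str):
    """Generate host-firewall rules (iptables syntax) that drop all traffic
    except to/from the EDR console, so responders keep access while the
    adversary loses it."""
    return [
        f"iptables -A INPUT -s {console_ip} -j ACCEPT",
        f"iptables -A OUTPUT -d {console_ip} -j ACCEPT",
        "iptables -A INPUT -j DROP",
        "iptables -A OUTPUT -j DROP",
    ]
```

Rule order matters: the console ACCEPT rules must precede the blanket DROP rules, otherwise the responder is locked out along with the attacker.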
Search: Most attacks are not limited to a single machine, so you’ll need to figure out quickly whether any other devices have been attacked as part of a broader campaign. Some of that takes place during validation as the analyst pivots, but figuring out the breadth of an attack requires them to search the entire environment for indicators of the attack, typically via metadata.
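The metadata search described above amounts to sweeping aggregated telemetry for known indicators. A minimal sketch, assuming file-hash indicators and the metadata fields used earlier:

```python
def sweep(metadata, iocs):
    """Return device IDs whose telemetry matches any known indicator
    (here: file hashes), to scope the breadth of a campaign."""
    hits = set()
    for m in metadata:
        if m.get("sha256") in iocs:
            hits.add(m["device_id"])
    return hits
```

In a real deployment the indicator set would also include domains, IPs, registry keys, and behavioral patterns, but the shape of the query is the same: match indicators against centrally aggregated metadata, and return the affected devices for the containment plan.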
Natural language/cognitive search: An emerging search capability is use of natural language search terms instead of arcane Boolean operators. This helps less sophisticated analysts be more productive.
Remediation: Once the containment strategy is determined, the ability to remediate the device from within the endpoint response console (via RDP or shell access) facilitates returning it to its pre-attack configuration. This may also involve integration with endpoint configuration management tools to restore the machine to a standard configuration.
At the end of the detection/response process, the extent of the campaign should be known and impacted devices should be remediated. The detection/response process is reactive, triggered by an alert. But if you want to turn the tables a bit, to be a bit more proactive in finding attacks and active adversaries, you will look into hunting.
Threat hunting has come into vogue over the past few years, as more mature organizations decided they no longer wanted to be at the mercy of their monitoring and detection environments, and wanted a more active role in finding attackers. So their more accomplished analysts started looking for trouble. They went hunting for adversaries rather than waiting for monitors to report attacks in progress.
But hunting selection criteria are very similar to detection criteria. You need to figure out what behaviors and activities to hunt for, then you seek them out. You start with a hypothesis, and run through scenarios to either prove or disprove it. Once you find suspicious activity you work through traditional response functions such as searching, drilling down into endpoint telemetry, and pivoting to other endpoints, following the trail.
Hunters tend to be experienced analysts who know what they are looking for – the key is to have tools to minimize busywork, and let them focus on finding malicious activity. The best tools for hunting are powerful yet flexible. These are the most useful capabilities for a hunter:
Retrospective search: Hunters often know what they want to focus on – based on an emerging attack, threat intel, or a sense of what tactics they would use as an attacker. Enabling hunters to search through historical telemetry from the organization’s endpoints enables them to find activity which might not have triggered an alert at the time, possibly because it wasn’t a known attack.
Comprehensive drill down: Given the sophistication of a typical hunter, they should be equipped with a very powerful view into suspicious devices. That typically warrants full device telemetry capture, allowing analysis of the file system and process map, along with memory and the registry. Attacks that weren’t detected at the time were likely taking evasive measures, and thus require low-level device examination to determine intent.
Enrichment: Once a hunter is on the trail of an attacker, they need a lot of supporting information to map TTPs (Tactics, Techniques, and Procedures) to possible adversaries, track network activity, and possibly reverse engineer malware samples. Having the system enrich and supplement the case file with related information streamlines their activity and keeps them on the trail.
Analytics: Behavioral anomalies aren’t always apparent, even when the hunter knows what they are looking for. Advanced analytics to find potential patterns of malicious activity, and a way to drill down further (as described above), also streamline hunting.
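As a toy illustration of the kind of screen such analytics might run, here is a crude fleet-baseline check: flag devices whose process-spawn counts sit far outside the norm, so a hunter knows where to drill down. Real products use far richer models; the metric and threshold here are assumptions:

```python
from statistics import mean, stdev

def anomalous_devices(spawn_counts, threshold=2.5):
    """Flag devices whose process-spawn count is a statistical outlier
    against the fleet baseline (simple z-score screen)."""
    counts = list(spawn_counts.values())
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [d for d, c in spawn_counts.items() if (c - mu) / sigma > threshold]
```

The output is a starting point, not a verdict: each flagged device still needs the drill-down and enrichment steps above before anyone calls it malicious.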
Case management: As with response, a hunter will want to store artifacts and other information related to the hunt, and have a comprehensive case file populated in case they find something. Case management capabilities (described above, under response) tend to provide this capability for all use cases.
Hunting tools are remarkably similar to detection and response tools. The difference is whether the first thread in the investigation comes from an alert, or is found by a hunter. From there the processes are very similar, meaning the tool criteria are also very close.