To state the obvious, traditional security operations is broken. Every organization faces more sophisticated attacks, the possibility of targeted adversaries, and far more complicated infrastructure; compounding the problem, we have fewer skilled resources to execute on security programs. Obviously it’s time to evolve security operations by leveraging technology to both accelerate human work and take care of rote, tedious tasks which don’t add value. So security orchestration and automation are terms you will hear pretty consistently from here on out.
Some security practitioners resist the idea of automation, mostly because if done incorrectly the ramifications are severe and likely career-limiting. So we’ve advocated a slow and measured approach, starting with use cases that won’t crater the infrastructure if something goes awry. We discussed two of those in depth: enriching alerts and accelerating incident response, in our Regaining Balance post. The value of being able to respond to more alerts, better, is obvious. So we expect technologies focused on this (small) aspect of security operations to become pervasive over the next 2-3 years.
But the real leverage lies not just in making post-attack functions work better. The question is: How can you improve your security posture and make your environment more resilient by orchestrating and automating security controls? That’s what this post will dig into. But first we need to set some rules of engagement for what automation of this sort looks like. And more importantly, how you can establish trust in what you are automating. Ultimately the Future of Security Operations hinges on this concept. Without trust, you are destined to remain in the same hamster wheel of security pain (h/t to Andy Jaquith). Attack, alert, respond, remediate, repeat. Obviously that hasn’t worked too well, or we wouldn’t continue having the same conversations year after year.
The Need for Trustable Automation
It’s always interesting to broach the topic of security automation with folks who have had negative experiences with early (typically network-centric) automation. They instantaneously break out in hives when discussing automatically reconfiguring anything. We get it. When there is downtime or another adverse situation, ops people get fired and can’t pay their mortgages. Predictably, survival instincts kick in, limiting use of automation.
Thus our focus on Trustable Automation – which means you tread carefully, building trust in both your automated processes and the underlying decisions that trigger them. Iterate your way to broader use of automation with a simple phased approach.
- Human approval: The first step is to insert a decision point into the process where a human takes a look and ensures the proper functions will happen as a result of automation. This is basically putting a big red button in the middle of the process and giving an ops person the ability to perform a few checks and then hit it. It’s faster but not really fast, because it still involves waiting on a human. Accept that some processes are so critical they never get past human approval, because the organization just cannot risk a mistake.
- Automation with significant logging: The next step is to let functions happen automatically, while making sure to log pretty much everything and have humans keep close tabs on it. Think of this as taking the training wheels off, but staying within a few feet of the bike, just in case it tips over. Or running an application in Debug mode so you can see exactly what is happening. If something unexpected does happen, you’ll be right there to figure out what went wrong and correct it. As you build trust in the process, we recommend you continue to scrutinize logs, even when things go perfectly. This helps you understand how frequently changes are made, and which changes they are. Basically you are developing a baseline of your automated process, which you can use in the next phase.
- Automation with guardrails: Finally you reach the point where you don’t need to step through every process. The machines are doing their job. That said, you still don’t want things to go haywire. Now you leverage the baseline you developed using automation with logging. With these thresholds you can build guardrails to make sure nothing happens outside your tolerances. For example, suppose you are automatically adding entries to an egress IP blacklist to stop internal traffic going to known bad locations, and all of a sudden the address of your SaaS CRM system is about to be added to the blacklist because of a faulty threat intel update. A guardrail can prevent that update and alert administrators to investigate the threat intel feed. Obviously this requires a fundamental understanding of the processes being automated, and an ability to distinguish low-risk changes which should be made automatically from those which require human review. But that level of knowledge is what engenders trust, right?
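To make the egress blacklist guardrail a little more concrete, here is a minimal Python sketch. The function name, the `protected_networks` list, and the `max_batch` threshold are all our own assumptions for illustration (the threshold stands in for the baseline you built during the logging phase); this is not any particular product’s API.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class GuardrailResult:
    applied: list = field(default_factory=list)  # safe to push automatically
    held: list = field(default_factory=list)     # needs a human to look first


def apply_blocklist_update(candidates, protected_networks, max_batch=100):
    """Split a proposed blacklist update into entries to apply and entries to hold.

    protected_networks: CIDR ranges that must never be blocked automatically
    (hypothetical example: the range used by your SaaS CRM provider).
    max_batch: assumed baseline; a larger update is held in full for review.
    """
    result = GuardrailResult()
    protected = [ip_network(net) for net in protected_networks]

    # Guardrail 1: an update far bigger than the baseline is suspicious by itself.
    if len(candidates) > max_batch:
        result.held = list(candidates)
        return result

    # Guardrail 2: never auto-block an address inside a protected range.
    for ip in candidates:
        addr = ip_address(ip)
        if any(addr in net for net in protected):
            result.held.append(ip)
        else:
            result.applied.append(ip)
    return result
```

Anything that lands in `held` would trigger the alert to administrators, while `applied` entries flow through without a human in the loop.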
Once you have built some trust in your automated process, you still want a conceptual net to make sure you don’t go splat if something doesn’t work as intended. The second requirement for trustable automation is rollback. You need to be able to quickly and easily get back to a known good configuration. So when rolling out any kind of automation (whether via scripting or a platform), you’ll want to make sure you store state information, and have the capability to reverse any changes quickly and completely. And yes, this is something you’ll want to test extensively, both as you select an automation platform and once you start using it.
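As a rough illustration of storing state so changes can be reversed, the sketch below snapshots a configuration before every automated change. The `ReversibleConfig` class and the dictionary-shaped configuration are assumptions we made up for the example; a real platform would persist snapshots somewhere durable rather than in memory.

```python
import copy


class ReversibleConfig:
    """Keeps a snapshot of every prior state so any change can be undone."""

    def __init__(self, initial):
        self._current = copy.deepcopy(initial)
        self._history = []  # stack of known good states, oldest first

    @property
    def current(self):
        # Hand out a copy so callers cannot modify state behind our back.
        return copy.deepcopy(self._current)

    def apply(self, change):
        """Apply `change` (a function that mutates a dict) to a working copy.

        The live state is replaced only if the change completes without
        raising, and the previous state is pushed onto the history stack.
        """
        candidate = copy.deepcopy(self._current)
        change(candidate)
        self._history.append(self._current)
        self._current = candidate

    def rollback(self):
        """Return to the most recent known good state."""
        if not self._history:
            raise RuntimeError("no earlier state to roll back to")
        self._current = self._history.pop()
```

For example, after `cfg.apply(lambda c: c["blocked"].append("198.51.100.20"))`, a single call to `cfg.rollback()` restores the earlier blacklist. This is also the piece you would exercise heavily in testing.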
The point is that as you design orchestration and automation functions, you have a lot of flexibility to get there at your own pace. Some folks have a high threshold for pain and jump in with both feet, understanding at some point they will likely need to clean up a mess. Others choose to tiptoe toward an automated future, adding use cases as they build comfort in the ability of their controls to work without human involvement. There is no right answer – you’ll reach this orchestrated and automated future when you get there. But you will get there.
Given increasing trust in a more automated approach to SecOps, let’s discuss additional use cases which highlight the power of this approach.
We mentioned guardrails as one of the phases of implementing automation into your operational processes. Let’s dig a little deeper into some examples of how guardrails work within a security context. There are many other examples of putting guardrails around operations, network, and storage processes. But we’re security folks so let’s look at security guardrails.
- Unauthorized privilege escalation: Let’s say you receive an alert of privilege escalation on a high-profile device (perhaps the CFO’s phone). The trigger would be a log event of the escalation, which would result in rolling back the change and firing a high-priority alert at the SOC. If the change is legitimate you can always recommit. The CFO may be a bit miffed that your machines interrupted their work, but this kind of guardrail makes sure privileges remain where they should be, unless approved.
- Rogue devices: An unknown WiFi access point was detected using passive network scanning. It’s not in your CMDB, as it would be if it went through the enterprise provisioning process, nor is it a type of device that would be installed by the enterprise networking team, so it’s safer to just take the device off the network until you can figure out why it’s there and whether it’s legit.
- Deploy new IPS rules: Finally, similar to the egress IP blacklist above, IPS rules are automatically updated based on a trusted threat intel feed. But what happens if traffic from your biggest customer is blocked because that application traffic looks like reconnaissance? In this case you can flag the customer’s network as one that shouldn’t ever be blocked and send a high-profile alert to investigate. Worst case, the block was legitimate (and the customer’s network was compromised); then you work with the customer to remedy their situation, but you were protected.
All these examples are simplistic, but you can look at any runbook and understand the edge cases which would be problematic if bad changes happen automatically. Build guardrails for those specific situations and then allow your machines to do their thing without threatening your environment.
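The rogue device guardrail, for instance, boils down to a couple of lookups. Here is a simplistic Python sketch; the function name, the MAC-based CMDB lookup, and the vendor prefix check are assumptions for illustration, and the middle "alert" outcome (an approved device type that never went through provisioning) is our own addition rather than part of the example above.

```python
def triage_access_point(mac, cmdb_macs, approved_vendor_prefixes):
    """Decide what to do with a WiFi access point found by passive scanning.

    cmdb_macs: MAC addresses recorded by the enterprise provisioning process.
    approved_vendor_prefixes: first three octets (e.g. "aa:bb:cc") of device
    types the networking team actually installs. Both are hypothetical inputs.
    """
    mac = mac.lower()
    known = {m.lower() for m in cmdb_macs}
    approved = {p.lower() for p in approved_vendor_prefixes}

    if mac in known:
        return "allow"        # went through provisioning, nothing to do
    if mac[:8] in approved:
        return "alert"        # right kind of device, but not provisioned
    return "quarantine"       # unknown device of an unknown type: pull it
```

The decision logic is trivial; the hard part is knowing your environment well enough to fill in the two lists.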
Another popular process for automation is handling phishing messages. Phishing is increasingly common, and it’s resource-intensive to manually deal with every inbound message (shocking, right?). This is a perfect scenario for automation, which could look like this:
- Receive phishing message: Your email security service flags a message as a phishing attempt and forwards it to a mailbox set up to trigger your automated process.
- Block egress: Phish tend to travel in schools, so odds are good that similar messages will be sent to many of your users. So you take the message from the phishing mailbox, extract the URL, and then automatically update your DNS server to divert requests for that server to a safe internal address, which instead displays educational material about clicking on phishing messages.
- Investigate endpoint: A user being targeted by a phish might be targeted by many things, so you’ll want to keep an eye on that device and automatically update your Endpoint Detection and Response (EDR) tool to increase logging frequency and depth on that device. You’ll also put the employee on a watch list in your SIEM/UBA product, so they are subject to additional scrutiny.
- Pay it forward: You are unlikely to be the only organization to be targeted by this phishing campaign, so you can automatically package up the information you got from analyzing the message and networking specifics, and forward them to your site takedown service. They will find the responsible ISP and initiate a request to take down the malicious site. Then folks less sophisticated than you can benefit as well.
You can also attach this phishing operational process to your incident response process. If your EDR information indicates a potential device compromise, you can automatically start capturing network traffic from that device and send it all to your response platform to initiate investigation.
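The "block egress" step is the easiest part of this flow to sketch. The Python below pulls hostnames out of a flagged message body and builds the DNS overrides pointing them at an internal address. The function names, the `sinkhole_ip` parameter, and the `never_sinkhole` allow list are assumptions for illustration; pushing the records to your actual DNS server is deliberately left out.

```python
import re
from urllib.parse import urlparse

# Deliberately loose pattern: grab anything that looks like an http(s) URL.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hostnames(message_body):
    """Return the sorted, de-duplicated hostnames found in a message body."""
    hosts = set()
    for url in URL_PATTERN.findall(message_body):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # malformed URL; skip it rather than fail the whole run
        if host:
            hosts.add(host.lower().rstrip("."))
    return sorted(hosts)


def build_sinkhole_records(message_body, sinkhole_ip, never_sinkhole=()):
    """Map each suspicious hostname to the internal education page address.

    never_sinkhole is a guardrail: hostnames (say, your own domains) that
    must not be diverted even if they appear in a phishing message.
    """
    skip = {h.lower() for h in never_sinkhole}
    return {
        host: sinkhole_ip
        for host in extract_hostnames(message_body)
        if host not in skip
    }
```

The resulting dictionary is what you would hand to whatever mechanism updates your DNS server, ideally with the logging and rollback described earlier.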
We just talked about an inbound use case (phishing), so let’s flip perspective to dig into an exfiltration use case.
- DLP alert fires: Unfortunately you probably get a number of DLP alerts every day. Typically many are never investigated, due to the volume of activity and a lack of skilled resources to triage and investigate.
- Classify the issue: You receive many different kinds of alerts, so it makes sense to kick off different runbooks depending on type. For simplicity’s sake let’s just say you consider the leak of account numbers (or other personal data) in email an inadvertent error, while an encrypted package going through the gateway is considered malicious.
- Kick off an educational process: If the alert is deemed inadvertent you send a request to your security awareness training platform (via API) to register the user into a training module on protecting customer data. They can complete the training and be on their way without intervention by security personnel.
- Capture endpoint data: If you determine the incident might be malicious, you immediately run a scan and then monitor the endpoint very closely. This process should also alert the SOC to a potential issue and start assembling the case file, as described under the incident response process.
- Quarantine device: Depending on the results of your scan and telemetry analysis, if there is a concern of compromise you can automatically quarantine the device, pull images of device memory and storage, and escalate with a more urgent alert for an incident which requires investigation.
- Determine proliferation: Once the type of attack is identified from the endpoint scan, you can automatically search through existing endpoint security data to identify devices which were attacked similarly.
Almost this entire process can run in an automated fashion, leveraging logic and conditional tests within the process. Depending on type, an alert might kick off several different runbooks, each taking its urgency and potential severity into account. Some organizations want human hands involved in the response process, so they establish interrupts in the process for analyst review and possible intervention. For instance you could hold quarantine of endpoint devices for approval by an analyst. The process is the same, except for an additional manual approval gate prior to quarantine and remediation.
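The conditional logic and the optional analyst gate might look something like this Python sketch. The alert fields (`kind`, `compromise_suspected`) and the step names are hypothetical, and we collapse the scan result into a single flag to keep things short; a real runbook would evaluate it after the scan completes.

```python
def plan_dlp_response(alert, require_approval=True):
    """Return the ordered runbook steps for a DLP alert.

    alert: dict with a "kind" key and an optional "compromise_suspected"
    flag (both hypothetical field names).
    require_approval: when True, an analyst must sign off before quarantine.
    """
    kind = alert["kind"]

    # Inadvertent leak: education, with no security staff involved.
    if kind == "personal_data_in_email":
        return ["enroll_user_in_training"]

    # Possibly malicious: scan, watch the endpoint, and loop in the SOC.
    if kind == "encrypted_package_at_gateway":
        steps = ["scan_endpoint", "alert_soc", "open_case_file"]
        if alert.get("compromise_suspected"):
            if require_approval:
                steps.append("await_analyst_approval")  # the manual gate
            steps += [
                "quarantine_device",
                "capture_memory_and_disk",
                "search_similar_endpoints",
            ]
        return steps

    # Anything we have not classified goes to a human.
    return ["alert_soc"]
```

Flipping `require_approval` to `False` is the only difference between the gated and fully automated versions, which is the point: the process stays the same.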
You design your automated processes to work for your organization and its requirements. As mentioned above, you work toward full automation at a pace that works for you.
Updating SaaS Web Proxy
Finally, let’s see how this approach works if you need to integrate with services which don’t run on-premises. Many organizations have embraced SaaS-based secure web services, but some want more granular control over which sites and networks their users can access. So you can decide to supplement your service’s built-in IP blacklist with multiple threat intelligence services to make sure you don’t miss anything.
- Aggregate threat intel: All your external data feeds can be aggregated in a threat intel platform (or your SIEM if you prefer), where you perform some normalization to see if any of several services identify a suspect IP address as bad.
- Block validated bad sites: If an IP address shows up in multiple threat lists, it should obviously be blocked. But your SaaS service may already be blocking it, so you first poll the service for the IP’s status. If it’s already blocked, do nothing. If it’s not, use the SaaS API to add the address to their blacklist.
- Monitor potentially bad sites: For an IP address showing up on just one list (meaning your suspicion is not validated), you send an API request to the service to tighten policies for that IP. This likely entails more detailed logging, and perhaps capturing packets to and from that address. Depending on the sophistication of your internal security team, you might also send them an alert to perform additional investigation on that IP for a final determination.
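The decision logic in these three steps can be sketched in a few lines of Python. The function name, the feed names, and the `min_sources` threshold are assumptions, and the `already_blocked` set stands in for polling the SaaS service for each IP’s status; we don’t model any real vendor’s API here.

```python
from collections import Counter


def decide_actions(feeds, already_blocked, min_sources=2):
    """Decide what to do with each IP address seen in the aggregated feeds.

    feeds: dict mapping a feed name to an iterable of IP address strings.
    already_blocked: IPs the SaaS service reports as blocked (stand-in for
    the status poll).
    Returns a dict mapping each IP to "block", "noop", or "monitor".
    """
    counts = Counter()
    for ips in feeds.values():
        counts.update(set(ips))  # a feed counts once per IP, even if repeated

    actions = {}
    for ip, sources in counts.items():
        if sources >= min_sources:
            # Validated by multiple lists: block unless the service already does.
            actions[ip] = "noop" if ip in already_blocked else "block"
        else:
            # Seen on one list only: tighten policy and watch.
            actions[ip] = "monitor"
    return actions
```

Each "block" and "monitor" result would then translate into a call to the service’s API.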
This example shows the value and importance of APIs to the automation process. There is a logical flow, and the API enables clean integration and higher-order logic to be integrated into the process.
These additional use cases illustrate the flexibility of this approach, and its value to SecOps – which is why we believe it’s the future. You can automate where possible, supplement with internal resources as appropriate, and ultimately embrace these capabilities at whatever pace works for your organization.
But the core process is similar, regardless of the degree of automation you embrace. You need great familiarity with your processes, an understanding of expected behavior, and a plan for unexpected edge cases. You need to slowly build trust in both the triggers for your processes and what happens when they are initiated. This happens by first having humans ride shotgun on the process, approving each step. Then you run without human intervention, but with detailed and granular logging to make sure you understand each step. Finally you let the machine do its thing, with safety guardrails to ensure your process doesn’t run amok and disrupt availability.
We expect orchestration and automation to become the Future of Security Operations. So the sooner you start figuring out how to apply these tactics in your environment, the sooner you can give yourself (and your organization) a chance to keep pace with the attacks coming your way.