How Cloud Security Managers Should Respond to Meltdown and Spectre

I hope everyone enjoyed the holidays… just in time to return to work, catch up on email, and watch the entire Internet burn down thanks to a cluster of hardware vulnerabilities built into pretty much every computing platform available.

I won’t go into details or background on Meltdown and Spectre (note: if I ever discover a vulnerability, I want it named “CutYourF-ingHeartOutWithSpoon”). Instead I want to talk about them in the context of the cloud, short-term and long-term implications, and some response strategies. These are incredibly serious vulnerabilities – not only due to their immediate implications, but also because they will draw increased scrutiny to a set of hardware weaknesses, which in turn are likely to require a generational fix (a computer generation – not your kids).

Meltdown

Briefly, Meltdown increases the risk of a multi-tenancy break. This has impacts on three levels:

It potentially enables any instance or guest on a system to read all the memory on that system. This is the piece which cloud providers have almost completely patched.
On a single system, it could also allow code in a container to read the memory of the entire server. This is likely also patched by cloud providers (AWS/Google/Microsoft).
Because Function as a Service (‘serverless’) offerings are really implemented as code in containers, the same issues apply to these products.

Meltdown is a privilege escalation vulnerability and requires a malicious process to be run on the system – you cannot use it to gain an initial foothold or exploitation, but to do things like steal secrets from memory once you have presence.

Meltdown in its current form on major cloud providers is likely not an immediate security risk. But just to be safe I recommend immediately applying Meltdown patches at the operating system level to any instances you have running. This would have been far worse if there hadn’t been a coordinated disclosure between researchers, hardware and operating system vendors, and cloud providers. You may see some performance degradation, but anything that uses autoscaling shouldn’t really notice.

Spectre

Spectre is a different group of vulnerabilities which relies on a different set of hardware-related issues. Right now Spectre only allows access to memory the application already has access to. This is still a privilege escalation issue because it’s useful for things like allowing hostile JavaScript code in a browser access to data outside its sandbox. This also seems like it could be an issue for anything which runs multiple processes in a sandbox (such as containers), and might allow reading data from other guests or containers on the same host.

Exploitation is difficult, the cloud providers are on it, and there is nothing to be done right now – other than to pay attention.

So for both attacks, your short-term action is to patch instances and keep an eye on upcoming patches.

Oh – and if you run a private cloud, you really need to patch everything yesterday and be prepared to replace all your hardware within the next few years. All your hardware. Oops.

Long-term implications and recommendations

These are complex vulnerabilities related to deeply embedded hardware functionality. Spectre itself is more an entire vulnerability/exploit class than a single patchable vulnerability. Right now we seem to have the protections we need available, and the performance implications appear manageable (although the performance impact will be costly for some customers).

The bigger concern is that we don’t know what other variants of both vulnerability classes may appear (or be discovered by malicious actors who don’t make them public). The consensus among my researcher friends is that this is a new area of study; while it’s not completely novel, it’s definitely drawing highly intelligent and experienced eyeballs. I will be very surprised if we don’t see more variants and implications over the next few years. Hardware manufacturers need to update chip designs, which is a slow process, and even then they are likely to leave holes which researchers will eventually discover.

Let’s not mince words – this is a very big deal for cloud computing. The immediate risk is very manageable but we need to be prepared for the long-term implications.

As this evolves, here is what I recommend:

Obviously, immediately patch all your operating systems on all your instances to the best of your ability. Hopefully cloud provider mitigations at the hypervisor level are already protecting you, but it’s still better to be safe. Start with a focus on instances where memory leaks are the worst threat.
For highly sensitive workloads (e.g., encryption) immediately consider moving to dedicated tenancy and don’t run any less-privileged workloads on the same hardware. Dedicated tenancy means you rent a whole box from your cloud provider, and only your workloads run on it. This eliminates much of the concern of guest to host breaks.
Migrate to dedicated PaaS where possible, especially for things like encryption operations. For example if you move to an AWS Elastic Load Balancer and perform discrete application data encryption in KMS, your crypto operations and keys are never exposed in the memory of any general-purpose system. This is the critical piece: the hardware underpinning these services isn’t used for anything other than the assigned service. So another tenant cannot run a malicious process to read the box’s physical memory. If you can’t run malicious code as a tenant, then even if you break multi-tenancy you still need to compromise the entire system – which cloud providers are damn good at preventing. Removing customers’ ability to run arbitrary processes is a massive roadblock to exploitation of these kinds of vulnerabilities.
Continue to migrate workloads to Function as a Service (also called ‘serverless’ and ‘Lambda’), but recognize there still are risks. Moving to servlerless pushes more responsibility for mitigating future vulnerabilities in these (and any other) classes onto your cloud provider, but since tenants can run nearly arbitrary code there is always a chance of future issues. Right now my feeling is that the risk is low, and far lower than running things on your own servers or even instances.
Adopt DevOps and automation, because that’s the best way to move fast, fix a lot of things at once, and be prepared for the next exploit with a cute logo and its own PR team.

If I missed anything, feel free to drop a comment or hit me up at @rmogull on Twitter.