Update: Amazon published some details. Less than 10% of AWS systems are affected, and the vulnerability will be disclosed October 1st. As suspected this is about Xen – not the
Yesterday I received notice that Amazon Web Services is force rebooting one of my instances. Then more emails started rolling in, and it looks like many (or all) of them will be rebooted during a single maintenance window. It has been a few years since this happened, and the reason ties into how AWS updates the servers your instances run on. We actually teach this in our cloud security training class, including how to architect your own cloud so you might not have to do the same thing – with, of course, many caveats.
My initial assumption was application of a quiet security patch, and that looks dead on:
And here is what looks like that vuln:
XSA-108 | 2014-10-01 12:00 | none (yet) assigned | (Prereleased, but embargoed)
How AWS updates servers
Amazon uses a modified version of the Xen hypervisor. Our understanding of their architecture indicates they do not support live migration. Live migration, available under VMware as vMotion, allows you to move a running virtual machine from one physical host to another without shutting it down.
When you build a cloud, host servers consist of (at least) a hypervisor with management and connectivity components. Sometimes, as with OpenStack, you even have a usable operating system. All these components need to be updated periodically. Some updates require rebooting the host server. To update the hypervisor you typically need to shut down the virtual machines (instances) running on top of it.
There are two common ways to manage these updates to reduce downtime:
- Update a host without any virtual machines running on it, then live migrate instances from a vulnerable host to a patched one. Then update the vulnerable host once all its instances are running elsewhere.
- If you cannot live migrate, do the same thing by shutting down and restarting the instances. If you built your cloud properly you can set a rule in the controller to not launch instances on the vulnerable host while preparing to reboot. Then the simple act of shutting down and relaunching the instance will automatically migrate it to a patched host.
In case you didn’t realize, every time you shut an instance down and start it again you likely move to a new host server. That is just normal cloud automation at work. When AWS has a large security patch like this they cannot rely on all customers conveniently relaunching during the desired window, so they need to take a maintenance window and do it for all affected users. Simple reboots generally do not trigger a host migration because a reboot doesn’t actually shutdown the entire instance – the virtual machine just executes the operating system shutdown and reboot procedures, but the instance is never destroyed or completely halted.
Many people don’t architect resilient servers to handle reboots, which is the problem. Or the reboots require some manual testing. This is why I am a massive fan of DevOps – its techniques provide extra resiliency for situations like this – but that’s for another post.
Our cloud security training covers this, and one critical security requirement when building a private (or public) cloud is to understand your patching requirements and their implications for instances. For example if you architect for live migration you can reduce required reboots, by accepting different implications and constraints.