As far back as I can remember, I have been a fan of testing your defenses. Some people call it pen testing, others refer to it as an assurance process, but the point is the same either way. The bad folks test your defenses every day, and if you aren’t using the same tactics to find out what they can get, you’re going to have a bad day. Maybe not today, maybe not even tomorrow. But the clock is ticking.

Truly understanding your security posture gets even harder when you start thinking about the cloud and the complexities of architecting a totally new infrastructure. We have a zillion dollars’ worth of systems management tools installed to monitor and manage our data centers, although I reserve judgment on how suck-tastic that investment has been. Now that we are moving many things to the cloud (whatever that means), it’s time to revisit how we test our infrastructure.

The existing systems management (and security) vendors are falling all over themselves to position their existing products as appropriate for managing cloud operations, but most of their solutions are heavy on slide decks and virtual appliances (same stuff, different wrapper), and lighter on the actual management technology. In fairness, it’s still early, so we shouldn’t totally count out the systems management incumbents, right? I mean, those are some innovative organizations, [sarcasm]no?[/sarcasm]

Yet, this cloud thing will force us to totally rethink how we run operations, and thus how we test our environments. The good news is that many of the cloud services leaders are more than happy to share what they are doing, so you can learn what works and what doesn’t, avoiding the school of hard knocks. I mean, when before has a company basically shared its data center architecture? Thanks, Facebook.

And now Netflix is sharing some of their management approaches. Netflix’s concept is to use a Simian Army (not literal monkeys, but automated testing processes) to put their infrastructure through the wringer. To see where it breaks. To pinpoint performance issues. And to do it continuously. They even have a Chaos Gorilla, which takes entire availability zones out of play, so they can see how their infrastructure reacts.
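To make the idea concrete, here is a minimal sketch of the pattern in Python. To be clear, this is not Netflix's code. The `Instance` class, the zone names, and the toy health check are all made up for illustration, and it runs against an in-memory fleet rather than any real cloud API. It just shows the two moves described above: kill one random instance, or take out a whole zone, then check whether the service is still standing.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical stand-in for a cloud instance; not a real provider object.
    instance_id: str
    zone: str
    running: bool = True


def pick_victim(fleet, rng):
    """Pick one running instance at random, or None if nothing is running."""
    candidates = [inst for inst in fleet if inst.running]
    return rng.choice(candidates) if candidates else None


def kill_zone(fleet, zone):
    """Gorilla-style failure: mark every running instance in one zone as down."""
    killed = [inst for inst in fleet if inst.zone == zone and inst.running]
    for inst in killed:
        inst.running = False
    return killed


def service_is_up(fleet, min_running=1):
    """Toy health check: the service survives if enough instances remain."""
    return sum(inst.running for inst in fleet) >= min_running


if __name__ == "__main__":
    fleet = [
        Instance("web-1", "zone-a"),
        Instance("web-2", "zone-a"),
        Instance("web-3", "zone-b"),
    ]
    victim = pick_victim(fleet, random.Random())
    victim.running = False  # simulate losing a single instance
    print("killed", victim.instance_id, "service up:", service_is_up(fleet))

    kill_zone(fleet, "zone-a")  # simulate losing an entire zone
    print("zone-a down, service up:", service_is_up(fleet))
```

In a real setup the "kill" would be an API call to your provider and the health check would hit your actual service, but the loop is the same: break something on purpose, then measure what happened.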

The same discipline applies to security. You need to build a set of hacking simians to try to break your stuff. No, it won’t be easy, and you’ll need to do a lot of manual scripting and integration to build a security monkey. Although there are some offerings focused on running continuous testing processes (like Core’s new Insight product), it’s still early in this market. So you’ll need to do a lot of the work yourself. But the alternative is having your dirty undergarments posted on pastebin.
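What might the simplest possible security monkey look like? Here is a rough sketch, assuming you can export your firewall rules as a list of dictionaries with `port` and `cidr` keys. That rule format, the function names, and the list of "sensitive" ports are my own assumptions for illustration, not any vendor's schema. The monkey flags sensitive ports that are open to the entire Internet.

```python
import ipaddress

# Assumption: ports we never want exposed to the world (SSH, MySQL, RDP).
SENSITIVE_PORTS = {22, 3306, 3389}


def is_world_open(cidr):
    """True if the CIDR covers every address (0.0.0.0/0 or ::/0)."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def find_exposed_rules(rules, sensitive_ports=SENSITIVE_PORTS):
    """Return the rules that open a sensitive port to all addresses.

    Each rule is a dict with 'port' and 'cidr' keys (hypothetical format).
    """
    return [
        rule
        for rule in rules
        if rule["port"] in sensitive_ports and is_world_open(rule["cidr"])
    ]


if __name__ == "__main__":
    rules = [
        {"name": "web", "port": 443, "cidr": "0.0.0.0/0"},
        {"name": "ssh-anywhere", "port": 22, "cidr": "0.0.0.0/0"},
        {"name": "ssh-office", "port": 22, "cidr": "203.0.113.0/24"},
    ]
    for rule in find_exposed_rules(rules):
        print("exposed:", rule["name"], "port", rule["port"])
```

Run something like this on a schedule against every environment and you have the beginnings of a monkey. The hard part, as noted, is the scripting and integration needed to pull real rules out of your real infrastructure.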

But don’t forget my standard caveat: when you test using live ammo, be careful! Given the economics of cloudy things, you should have a test environment that looks an awful lot like your production environment. And let the monkeys loose on your test environment early and often. But some of these monkeys can/should be used on the production stuff. Although you can make the test environment look the same, it’s not. We learn that hard lesson over and over again.

In the post, Netflix talks about shutting down production instances (with a lot of oversight, obviously), just to see what happens. It reminds me a bit of the kanban process in manufacturing, in that you mess with the working system to find the breaking points, to see where you can make it more efficient. The assumption that everything is working fine has never held water. The question is whether you search for what’s broken, or wait for it to find you.

But most of all, I love both the metaphor and the message of Netflix’s approach. These guys test their stuff, so when half of AWS goes down they stay up. Obviously this isn’t a panacea (as their recent outage showed), but clearly there is something going on over at Netflix. So jump on the monkey bandwagon – they are taking over the world anyway.