Resilient Cloud Network Architectures: Design Patterns
We introduced resilient cloud networks in this series’ first post. We define them as networks using cloud-specific features to provide both stronger security and higher availability for your applications. This post will dig into two different design patterns, and show how cloud networking enables higher resilience. Network Segregation by Default Before we dive into design patterns let’s make sure we are all clear on using network segmentation to improve your security posture, as discussed in our first post. We know segmentation isn’t novel, but it is still difficult in a traditional data center. Infrastructure running different applications gets intermingled, just to efficiently use existing hardware. Even in a totally virtualized data center, segmentation requires significant overhead and management to keep all applications logically isolated – which makes it rare. What is the downside of not segmenting properly? It offers adversaries a clear path to your most important stuff. They can compromise one application and then move deeper into your environment, accessing resources not associated with the application stack they first compromised. So if they bust one application, there is a high likelihood they’ll end up with free rein over everything in the data center. The cloud is different. Each server in a cloud environment is associated with a security group, which defines with very fine granularity which other devices it can communicate with, and over what protocols. This effectively enables you to contain an adversary’s ability to move within your environment, even after compromising a server or application. This concept is often called limiting blast radius. So if one part of your cloud environment goes boom, the rest of your infrastructure is unaffected. This is a key concept in cloud network architecture, highlighted in the design patterns below. PaaS Air Gap To demonstrate a more secure cloud network architecture, consider an Internet-facing application with both web server and application server tiers. Due to the nature of the application, communications between the two layers are through message queues and notifications, so the web servers don’t need to communicate directly with each other. The application server tier connects to the database (a Platform as a Service offering from the cloud provider). The application server tier also communicates with a traditional data center to access other internal corporate data outside the cloud environment. An application must be architected for the get-go to support this design. You aren’t going to redeploy your 20-year-old legacy general ledger application to this design. But if you are architecting a new application, or can rearchitect existing applications, and want total isolation between environments, this is one way to do it. Let’s describe the design. Network Security Groups The key security control typically used in this architecture is a Network Security Group, allowing access to the app servers only from the web servers, and only to the specific port and protocol required. This isolation limits blast radius. To be clear, the NSG is applied individually to each instance – not to subnets. This avoids a flat network, where all instances within a subnet have unrestricted access to all subnet peers. PaaS Services In this application you wouldn’t open access from the web server NSG to the app server NSG, because the architecture doesn’t require direct communication between web servers and app servers. Instead the cloud provider offers a message queue platform and notification service which provide asynchronous communication between the web and application tiers. So even if the web servers are compromised, the app servers are not accessible. Further isolation is offered by a PaaS database, also offered by the cloud service provider. You can restrict requests to the PaaS DB to specific Network Security Groups. This ensures only the right instances can request information from the database service, and all requests are authorized. Connection to the Data Center The application may require data from the data center, so the app servers have access to the needed data through a VPN. You route all traffic to the data center through this inspection and control point. Typically it’s better not to route cloud traffic through inspection bottlenecks, but in this design pattern it’s not a big deal, because the traffic needs to pass over a specific egress connection to the data center, so you might as well inspect there as well. You ensure ingress traffic over that connection can only go to the app server security group. This ensures that an adversary who compromises your network cannot access your whole cloud network by bouncing through your data center. Advantages of This Design Isolation between Web and App Servers: By putting the auto-scaling groups in a Network Security Groups, you restrict their access to everything. No Direct Connection: In this design pattern you can block direct traffic to the application servers from anywhere but the VPN. Intra-application traffic is asynchronous via the message queue and notification service, so isolation is complete. PaaS Service: This architecture uses cloud provider services, with strong built-in security and resilience. Cloud providers understand that security and availability are core to their business. What’s next for this kind of architecture? To advance this architecture you could deploy mirrors of the application in different zones within a region to limit the blast radius in case one device is compromised, and to provide additional resiliency in case of a zone failure. Additionally, if you use immutable servers within each auto-scale group, you can update/patch/reconfigure instances automatically by changing the master image and having auto-scaling replace the old instances with new ones. This limits configuration drift and adversary persistence. Multi-Region Web Site This architecture was designed to deploy a website in multiple regions, with availability as close to 100% as possible. This design is far more expensive than even running in multiple zones within a single region, because you need to pay for network traffic between regions (compared to free intra-region traffic); but if uptime is essential to your business, this architecture improves resiliency. This is an externally facing application so you run traffic through a cloud WAF to get rid of obvious attack traffic. Inbound sessions can