Resilient Cloud Network Architectures: Design Patterns
By Mike Rothman
We introduced resilient cloud networks in this series’ first post. We define them as networks using cloud-specific features to provide both stronger security and higher availability for your applications. This post will dig into two different design patterns, and show how cloud networking enables higher resilience.
Network Segregation by Default
Before we dive into design patterns let’s make sure we are all clear on using network segmentation to improve your security posture, as discussed in our first post. We know segmentation isn’t novel, but it is still difficult in a traditional data center. Infrastructure running different applications gets intermingled, just to efficiently use existing hardware. Even in a totally virtualized data center, segmentation requires significant overhead and management to keep all applications logically isolated – which makes it rare.
What is the downside of not segmenting properly? It offers adversaries a clear path to your most important stuff. They can compromise one application and then move deeper into your environment, accessing resources not associated with the application stack they first compromised. So if they bust one application, there is a high likelihood they’ll end up with free rein over everything in the data center.
The cloud is different. Each server in a cloud environment is associated with a security group, which defines with very fine granularity which other devices it can communicate with, and over what protocols. This effectively enables you to contain an adversary’s ability to move within your environment, even after compromising a server or application. This concept is often called limiting blast radius. So if one part of your cloud environment goes boom, the rest of your infrastructure is unaffected.
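To make blast radius concrete, here is a minimal sketch in Python. It is not tied to any cloud provider's API; the server names and the allowed flows are hypothetical, chosen only to contrast a flat network with a segmented one. It computes an upper bound on lateral movement by following every permitted flow from a compromised server.

```python
from collections import deque


def blast_radius(start, allowed_flows):
    """Return every node reachable from `start` by following allowed flows.

    This is an upper bound on lateral movement: it assumes the adversary
    compromises each hop along the way.
    """
    seen = {start}
    pending = deque([start])
    while pending:
        node = pending.popleft()
        for peer in allowed_flows.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                pending.append(peer)
    seen.discard(start)
    return seen


# Hypothetical servers belonging to three unrelated applications.
servers = ["web-1", "app-1", "db-1", "ledger-1", "hr-1"]

# Flat network: every server can talk to every other server.
flat = {a: [b for b in servers if b != a] for a in servers}

# Segmented: each server reaches only what its security group permits.
segmented = {
    "web-1": ["app-1"],
    "app-1": ["db-1"],
    "db-1": [],
    "ledger-1": [],
    "hr-1": [],
}

flat_reach = blast_radius("web-1", flat)
segmented_reach = blast_radius("web-1", segmented)
print(sorted(flat_reach))
print(sorted(segmented_reach))
```

In the flat case a compromise of web-1 exposes all four other servers. With segmentation the exposure stops at that application's own stack, and the unrelated ledger and HR servers stay out of reach.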
This is a key concept in cloud network architecture, highlighted in the design patterns below.
PaaS Air Gap
To demonstrate a more secure cloud network architecture, consider an Internet-facing application with both web server and application server tiers. Due to the nature of the application, communications between the two layers are through message queues and notifications, so the web servers and application servers don’t need to communicate directly with each other. The application server tier connects to the database (a Platform as a Service offering from the cloud provider). The application server tier also communicates with a traditional data center to access other internal corporate data outside the cloud environment.
An application must be architected from the get-go to support this design. You aren’t going to redeploy your 20-year-old legacy general ledger application to this design. But if you are architecting a new application, or can rearchitect existing applications, and want total isolation between environments, this is one way to do it. Let’s describe the design.
Network Security Groups
The key security control typically used in this architecture is a Network Security Group, allowing access to the app servers only from the web servers, and only to the specific port and protocol required. This isolation limits blast radius. To be clear, the NSG is applied individually to each instance – not to subnets. This avoids a flat network, where all instances within a subnet have unrestricted access to all subnet peers.
In this application you wouldn’t open access from the web server NSG to the app server NSG, because the architecture doesn’t require direct communication between web servers and app servers. Instead the cloud provider offers a message queue platform and notification service which provide asynchronous communication between the web and application tiers. So even if the web servers are compromised, the app servers are not accessible.
Further isolation is offered by a PaaS database, also offered by the cloud service provider. You can restrict requests to the PaaS DB to specific Network Security Groups. This ensures only the right instances can request information from the database service, and all requests are authorized.
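The rules described in this section can be sketched as a default-deny allow-list. This is an illustration in plain Python, not any provider's rule syntax: the group names (web-sg, app-sg, queue-service, paas-db) and the port numbers are assumptions made up for the example.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str
    destination: str
    port: int


# Hypothetical allow-list. Anything not listed here is denied.
ALLOWED = {
    Rule("internet", "web-sg", 443),
    Rule("web-sg", "queue-service", 443),
    Rule("app-sg", "queue-service", 443),
    Rule("app-sg", "paas-db", 5432),
}


def is_allowed(source, destination, port):
    """Default deny: a flow passes only if an explicit rule matches it."""
    return Rule(source, destination, port) in ALLOWED


print(is_allowed("web-sg", "queue-service", 443))  # web tier posts to the queue
print(is_allowed("web-sg", "app-sg", 443))         # no direct web-to-app path
print(is_allowed("web-sg", "paas-db", 5432))       # web tier cannot query the DB
```

Note that no rule names app-sg as a destination from the web tier. Both tiers talk only to the queue service, which is what creates the air gap, and only app-sg appears as a source for the database.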
Connection to the Data Center
The application may require data from the data center, so the app servers have access to the needed data through a VPN. You route all traffic to the data center through this inspection and control point. Typically it’s better not to route cloud traffic through inspection bottlenecks, but in this design pattern it’s not a big deal, because the traffic needs to pass over a specific egress connection to the data center, so you might as well inspect there as well.
You ensure ingress traffic over that connection can only go to the app server security group. This ensures that an adversary who compromises your data center network cannot use it as a stepping stone to reach the rest of your cloud network.
Advantages of This Design
Isolation between Web and App Servers: By putting the auto-scaling groups in Network Security Groups, you restrict each tier’s access to only the services it explicitly needs.
No Direct Connection: In this design pattern you can block direct traffic to the application servers from anywhere but the VPN. Intra-application traffic is asynchronous via the message queue and notification service, so isolation is complete.
PaaS Service: This architecture uses cloud provider services, with strong built-in security and resilience. Cloud providers understand that security and availability are core to their business.
What’s next for this kind of architecture? To advance this architecture you could deploy mirrors of the application in different zones within a region to limit the blast radius in case one device is compromised, and to provide additional resiliency in case of a zone failure.
Additionally, if you use immutable servers within each auto-scale group, you can update/patch/reconfigure instances automatically by changing the master image and having auto-scaling replace the old instances with new ones. This limits configuration drift and adversary persistence.
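The immutable-server idea can be sketched as follows. This is a simplified model, not an auto-scaling API: the Instance type, the image names, and the generated instance IDs are all hypothetical. The point it illustrates is that stale instances are replaced from the current image and never patched in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    image: str


def replace_stale(group, current_image):
    """Return a new group in which every instance runs `current_image`.

    Stale instances are terminated and replaced, never patched in place.
    """
    refreshed = []
    for number, instance in enumerate(group, start=1):
        if instance.image == current_image:
            refreshed.append(instance)
        else:
            refreshed.append(Instance(f"i-new-{number}", current_image))
    return refreshed


group = [
    Instance("i-old-1", "image-v1"),
    Instance("i-old-2", "image-v1"),
    Instance("i-old-3", "image-v2"),
]
group = replace_stale(group, "image-v2")
print([i.instance_id for i in group])
```

Because a replacement starts from a clean image, anything an adversary left behind on an old instance disappears with it. A real auto-scale group would also replace instances in batches to keep capacity available, which this sketch leaves out.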
Multi-Region Web Site
This architecture was designed to deploy a website in multiple regions, with availability as close to 100% as possible. This design is far more expensive than even running in multiple zones within a single region, because you need to pay for network traffic between regions (compared to free intra-region traffic); but if uptime is essential to your business, this architecture improves resiliency.
This is an externally facing application so you run traffic through a cloud WAF to get rid of obvious attack traffic. Inbound sessions can be routed intelligently to either region via the DNS service. So you can route based on server utilization, network traffic, geography, or a variety of other criteria. This is a great example of software-defined security, programming traffic distribution within cloud stacks.
- Network Security Groups: In this design pattern you implement Network Security Groups to lock down traffic into the app servers. That isn’t shown specifically because it would greatly complicate the diagram. But Network Security Groups for the web and app servers should be part of this architecture.
- Compute Layer: Application and web servers are in auto-scale groups within each region. The load balancer distributes the sessions between the regions intelligently as well, to ensure the most effective usage of the web site.
- Database Layer: If this were just a multi-zone deployment you wouldn’t need to worry about database replication, as that capability is built into PaaS databases. But that is not the case here. You are operating in multiple regions, so you need to replicate your databases. It is like having two separate data centers. We cannot tackle the network architecture to support database replication here, because that would also overcomplicate the architecture. We just need to point out another way operating in multiple regions adds complexity.
- Static Files: This website of course includes a variety of static files, so you need to figure out how to keep the file stores in sync between regions as well. Using Network Security Groups, you can lock access to the storage buckets down to specific groups or instances. That’s a good way to make sure you don’t get malware files loaded up onto your website. Cross-region replication is a service your cloud provider may offer so you don’t need to build it yourself.
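If your provider does not offer cross-region replication, or you want to confirm that two file stores have not drifted apart, a comparison along the following lines is one option. This is a sketch under stated assumptions: the file names and contents are invented, and each store is modeled as an in-memory dictionary instead of a real storage bucket.

```python
import hashlib


def manifest(files):
    """Map each file name to the SHA-256 digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def plan_sync(primary, replica):
    """Compare two manifests; return (names to copy, names to delete)."""
    to_copy = sorted(n for n, d in primary.items() if replica.get(n) != d)
    to_delete = sorted(n for n in replica if n not in primary)
    return to_copy, to_delete


# Hypothetical contents of the static file stores in two regions.
region_a = manifest({"index.html": b"<h1>v2</h1>", "logo.png": b"png-bytes"})
region_b = manifest({"index.html": b"<h1>v1</h1>", "old.css": b"body{}"})

to_copy, to_delete = plan_sync(region_a, region_b)
print(to_copy)
print(to_delete)
```

Comparing digests instead of timestamps also flags a file that was altered in only one region, which is worth investigating on an externally facing site.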
Advantages of This Design
This architecture is primarily intended to show how the cloud enables you to easily establish an application in multiple regions, similar to multiple data centers. So what’s unique about the cloud? You can take the entire stack in Region A and copy it to Region B with a few clicks in your cloud console. Of course you’d have some networks to reconfigure, but in most ways the environment is identical. Auto-scale groups work off the same images, so the operational overhead of supporting multiple cloud regions is drastically lower than operating across multiple data centers.
There are also significant availability advantages. In case a region goes down, the DNS service will detect that automatically and route all new incoming sessions to the available region. Existing sessions in the region will be lost, so there is some collateral damage, but that happens any time you lose a data center. Those sessions can be re-established to the available region. When the region recovers DNS will automatically start sending new sessions to it. This is all transparent to the application and users.
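The failover behavior described above can be sketched as a routing decision made for each new session. This is a toy model, not a DNS service API: the region names, the health flags, and the utilization figures are assumptions, and a real service would derive health from its own periodic checks.

```python
def pick_region(regions):
    """Return the healthy region with the lowest utilization.

    `regions` maps a region name to {"healthy": bool, "utilization": float}.
    """
    healthy = {name: info for name, info in regions.items() if info["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda name: healthy[name]["utilization"])


# Hypothetical state reported by health checks.
regions = {
    "region-a": {"healthy": True, "utilization": 0.30},
    "region-b": {"healthy": True, "utilization": 0.60},
}
normal = pick_region(regions)

regions["region-a"]["healthy"] = False  # region-a goes down
during_outage = pick_region(regions)

regions["region-a"]["healthy"] = True   # region-a recovers
after_recovery = pick_region(regions)

print(normal, during_outage, after_recovery)
```

Only new sessions pass through this decision. Sessions already established in the failed region are lost and have to reconnect, as noted above.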
You can also provide a central spot for both static files and logs by using cross-region replication. This capability is currently specific to Amazon Web Services, but we expect it to be a critical feature on all cloud infrastructure platforms at some point. As is increasingly the case in the cloud, services which previously required extreme planning and/or additional products to manage are now built-in platform features.
That’s a good note on which to wrap up this quick series. The cloud provides many capabilities that enable you to deploy applications significantly more securely and reliably than rolling them out in your own datacenter. Of course there will be resistance to the new way of thinking – there always is. But you can combat this resistance by using information like the suggestions in this series (and in our other blog posts and reference architectures) to highlight the obvious advantages of these new technologies.