This is the third post in our series, “Network Operations and Security Professionals’ Guide to Managing Public Cloud Journeys”, which we will release as a white paper after we complete the draft and have some time for public feedback. You might want to start with our first and second posts. Special thanks to Gigamon for licensing. As always, the content is being developed completely independently using our Totally Transparent Research methodology.
Learning cloud adoption patterns doesn’t just help us identify key problems and risks – we can use them to guide operational decisions to address the issues they consistently raise. This research focuses on managing networks and network security, but the patterns include broad security and operational implications which cover all facets of your cloud journey. Governance issues aside, we find that networking is typically one of the first areas of focus for organizations, so it’s a good target for our first focused research. (For the curious, IAM and compliance are two other top areas organizations focus on, and struggle with, early in the process).
Recommendations for a Safe and Smooth Journey
Developer Led
Mark sighed with relief and satisfaction as he validated the VPN certs were propagated and approved the ticket for firewall rule change. The security group was already in good shape and they managed to avoid having to add any kind of direct connect to the AWS account for the formerly-rogue project.
He pulled up their new cloud assessment dashboard and all the critical issues were clear. It would still take the IAM team and the project’s developers a few months to scale down unneeded privileges but… not his problem. The federated identity portal was already hooked up and he would get real time alerts on any security group changes.
“Now onto the next one,” he mumbled after he glanced at his queue and lost his short-lived satisfaction.
“Hey, stop complaining!” remarked Sarah, “We should be clear after this backlog now that accounting is watching the credit cards for cloud charges; just run the assessment and see what we have before you start complaining.”
Having your entire organization dragged into the cloud thanks to the efforts of a single team is disconcerting, but not unmanageable. The following steps will help you both bring the errant project under control and build a base for moving forward. This was the first adoption pattern we started to encounter a decade ago as cloud started growing, so there are plenty of lessons to pull from. Based on our experiences, a few principles really help manage the situation:
- Remember that to meet this pattern you should be new to either the cloud in general, or to this cloud platform specifically. These are not recommendations for unsanctioned projects covered by your existing experience and footprint.
- Don’t be antagonistic. Yes, the team probably knew better and shouldn’t have done it… but your goal now is corrective actions, not punitive.
- Your goal is to reduce urgent risks while developing a plan to bring the errant project into the fold.
- Don’t simply apply your existing policies and tooling from other environments to this one. You need tooling and processes appropriate for this cloud provider.
- In our experience, despite the initial angst, these projects are excellent opportunities to learn your initial lessons on this platform, and to start building out for a larger supported program. If you keep one eye on immediate risks and the other on long-term benefits, everything should be fine.
The following recommendations go a long way towards reducing risks and increasing your chances of success. But before the bullet points we have one overarching recommendation: As you gain control over the unapproved project, use it to learn the particulars of this cloud provider and build out your core cloud management capabilities. When you assess, set yourself up to support your next ten assessments. When you enable monitoring and visibility, do so in a way which supports your next projects. Wherever possible build a core service rather than a one-off.
- Step one is to figure out what you are dealing with:
- How many environments are involved? How many accounts, subscriptions, or projects?
- How are the environments structured? This involves mapping out the application, the PaaS services in use (providers offer PaaS services such as load balancers and serverless capabilities), the IAM, the network(s), and the data storage.
- How are the services configured?
- How are the networks structured and connected? The Software Defined Networks (SDN) used by all major cloud platforms only look the same on the surface – under the hood they are quite a bit different.
- And, most importantly, where does this project touch other enterprise resources and data? This is essential for understanding your exposure. Are there unknown VPN connections? Did someone peer through an existing dedicated network pipe? Is the project talking to an internal database over the Internet? We’ve seen all these and more.
- Then prioritize your biggest risks:
- Internet exposures are common and one of the first things to lock down. We commonly see resources such as administrative servers and jump boxes exposed to the Internet at large. In nearly every single assessment we find at least one instance or container with port 22 exposed to the world. The quick fix for these is to lock them down to your known IP address ranges. Cloud providers’ security groups simply drop traffic which doesn’t meet the rules, so they are an extremely effective security control and a better first step than trying to push everything through an on-premise firewall or virtual appliance.
- Identity and Access Management is the next big piece to focus on. This research is focused more on networking, so we won’t spend much time on this here. But when developers build out environments they almost always over-privilege access to themselves and application components. They also tend to use static credentials, because unsanctioned projects are unlikely to integrate into your federated identity management. Sweep out static credentials, enable federation, and turn on MFA everywhere you can.
- Misconfigurations of cloud services are next. Look for public storage buckets, unsecured API gateways, and other services which are Internet exposed but won’t show up if you only look at the virtual networks.
- After cleaning those up it’s time to start layering in longer-term remediations and your gameplan. This is a huge topic so we will focus on network management and security:
- On early discovery of developer-led projects, it is very common to want to tie the errant cloud account back into your on-premise network for connectivity and management. This instinct is usually wrong. Networking wasn’t involved at the start, so it is unlikely there is an established network connection, and adding one won’t necessarily provide any benefits. If the account is okay on its own, leave it. While outside the scope of this research, a wide range of techniques is available to provide necessary services to disconnected cloud accounts… or cloud-native connections such as service endpoints which achieve the same goals without the heavy lifting on fat pipes and CIDR segmenting.
- We aren’t suggesting you don’t manage the network – we are saying you don’t need to simply wire it up to your existing infrastructure to manage it or the resources it contains.
- A big complicating problem to integrating an unplanned SDN is the existing IP addressing (if there’s even a virtual network – a real question thanks to the new serverless architectures). This may be further motivation to keep it as a separate enclave.
- We assume you followed our advice above and locked down the perimeter. Now it’s important to fully map out all the internal connections, including connections between different virtual networks and accounts which are peered or otherwise connected using cloud-native techniques such as service endpoints.
- One of the most common networking mistakes seen in these kinds of projects is too-open internal networks. Clouds default to least privilege, but it is still all too easy to just open everything up to reduce friction during development. Use your map to start compartmentalizing internally. This may include network structure changes (routing and subnet modifications) which are easier to update with API calls and console clicks than stringing wires between routers.
- Security groups should reference each other (in Azure you need Application Security Groups) instead of relying on IP addressing for internal cloud connections. This is fundamental to cloud networks, but not where people with traditional network security backgrounds tend to start.
- Virtual security appliances (we are mostly talking about firewalls, IDS, and IPS) should only be used when security groups and native cloud capabilities can’t meet your needs. Virtual appliances are expensive to run because cloud providers charge for their compute cycles, and they create unnecessary chokepoints which affect performance and reliability. The most common situations where you still need them are FQDN-based outbound filtering and specific blocklists which are hard or impossible to enforce with cloud-native security groups.
- Lastly, once everything is in a known good state, you should implement continuous configuration assessments and guardrails to keep things that way. For example in a production application you should generate an alert for any security group change, creation of new internet gateways, and other structural changes. All providers support monitoring these changes but you will likely need third-party tooling to pull the results together across providers and accounts.
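The first risk named above, administrative ports open to the whole Internet, is easy to check for once you have an export of security group rules. The following is a minimal sketch, not a provider API: the rule dictionaries, their field names, and the group names are hypothetical, and a real export will look different for each provider.

```python
import ipaddress

# Hypothetical, provider-neutral rule shape. Real exports differ per provider.
RULES = [
    {"group": "sg-jump", "port_from": 22, "port_to": 22, "source": "0.0.0.0/0"},
    {"group": "sg-web", "port_from": 443, "port_to": 443, "source": "0.0.0.0/0"},
    {"group": "sg-admin", "port_from": 0, "port_to": 65535, "source": "::/0"},
    {"group": "sg-db", "port_from": 5432, "port_to": 5432, "source": "10.0.1.0/24"},
]

ADMIN_PORTS = {22, 3389}  # SSH and RDP; an assumption, adjust to your environment


def is_world_open(cidr: str) -> bool:
    """True when the CIDR covers the entire IPv4 or IPv6 address space."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def exposed_admin_rules(rules, admin_ports=ADMIN_PORTS):
    """Return the rules which open any admin port to the whole Internet."""
    findings = []
    for rule in rules:
        covers_admin = any(
            rule["port_from"] <= port <= rule["port_to"] for port in admin_ports
        )
        if covers_admin and is_world_open(rule["source"]):
            findings.append(rule)
    return findings


for finding in exposed_admin_rules(RULES):
    print(finding["group"], finding["source"])
```

The quick fix described above then becomes replacing the world-open source on each finding with your known IP address ranges.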
Overall the key to handling this situation is to avoid panic, focus on obvious risks first, and then take your time to sweep through the rest in as cloud and provider specific a way as possible. Use it as a base to build your program, understanding that you will need to make short-term sacrifices to handle any significant exposures.
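The guardrail recommendation (alert on any security group change or new internet gateway in production) can be sketched as a simple event filter. The event names below are EC2 API actions as they appear in AWS CloudTrail, used purely as an example. The flattened event shape, the account IDs, and the actor names are hypothetical, and other providers use different names.

```python
# EC2 API action names as recorded by AWS CloudTrail. Other providers use
# different names, so treat this set as an example, not a complete list.
STRUCTURAL_EVENTS = {
    "AuthorizeSecurityGroupIngress",
    "AuthorizeSecurityGroupEgress",
    "RevokeSecurityGroupIngress",
    "RevokeSecurityGroupEgress",
    "CreateSecurityGroup",
    "DeleteSecurityGroup",
    "CreateInternetGateway",
    "AttachInternetGateway",
}

PRODUCTION_ACCOUNTS = {"111111111111"}  # hypothetical account ID


def alerts_for(events, watched=STRUCTURAL_EVENTS, accounts=PRODUCTION_ACCOUNTS):
    """Return one alert line per structural change in a production account."""
    alerts = []
    for event in events:
        if event["account"] in accounts and event["name"] in watched:
            alerts.append(f'{event["account"]}: {event["name"]} by {event["actor"]}')
    return alerts


# Hypothetical, simplified events (real audit records carry many more fields).
SAMPLE_EVENTS = [
    {"account": "111111111111", "name": "AuthorizeSecurityGroupIngress", "actor": "dev-role"},
    {"account": "111111111111", "name": "DescribeInstances", "actor": "dev-role"},
    {"account": "222222222222", "name": "CreateInternetGateway", "actor": "sandbox-user"},
]

for line in alerts_for(SAMPLE_EVENTS):
    print(line)
```

Keeping the watched event names and the production account list as data makes it easier to reuse the same filter for the next account you bring under management.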
Data Center Transformation
Sarah snagged an extra chair outside Mark’s cubicle as he shoved a pile of office detritus to the side to make space for her laptop.
“Okay,” she started, “I published the PrivateLink endpoint for the log receiver and set the internal domain name, but I need you to open up the security group and approve my PR on the CloudFormation templates to deploy all the service endpoints into the VPCs.”
“No problem,” he replied, “we got approval from the cloud team last week so we’re good to go. Do we need to talk to the server image team to embed the DNS for the log agents?”
“They already have it and are publishing the new base AMIs that need it. We think most teams will just set their agents to save instance and container logs directly to S3, but some of the legacy stuff still needs to push them on the network. We are also letting teams use their own PrivateLink addresses if they want to do a swap out of a local collector.”
“Nice,” said Mark, “this will really help us drop some peering connections on the transit gateway. And I’m meeting with the database team next week to see if we can start moving them over.”
Large multi-year data center moves are some of the most complex projects in information technology. Moving everything from one physical location to another is a massive undertaking. Doing so while keeping services up and running, without shutting the business down (either planned or unplanned) even more so. Swapping to an entirely different technology foundation at the same time? That can be the definition of insanity, yet every single organization of any size does it at some point.
The most common mistakes we see involve shoehorning traditional architectural and security concepts into the cloud – which can lead to extended timelines, increased costs, and long-term management issues. A few principles will keep you moving in the right direction:
- If you are bad at network management and security in your existing data center, you will be pleasantly surprised at how little changes in cloud. Look at cloud as an opportunity to do things better, ideally in a cloud-native way. Don’t just bring across your existing practices without change – especially bad habits.
- Time is your friend. Don’t rush, and don’t let your cloud provider push you into moving faster than you are comfortable with. Their priorities are not yours.
- Don’t assume your existing tools and processes will work well in the cloud. We see many organizations bring things across due to employee familiarity, or because they already have licenses. Those aren’t the best reasons to deploy something in an entirely new operating environment.
- That said, these days many products offer extensions for the cloud. You should still evaluate them instead of assuming they will meet your needs in the cloud, but they might be a useful bridge.
- Learn first, move second. Take the time (if you have it) to hire and build the skills needed to operate on your new platform. You absolutely cannot expect your existing team to handle both the current environment and cloud if you don’t give them the time to learn the skills and do the job.
In the developer led pattern we had to balance closing immediate risks against simultaneously building support for an entirely new operating environment and preparing for long-term support. Scary and difficult, but also usually self-constrained to something manageable like a single application stack. In a data center transformation the challenges are scaling, transitioning completely to a new environment, and any need to carry over legacy resources not designed to run in cloud.
- Start by building your plan:
- You will not want to run everything in a single huge account/subscription/project on just 1-3 virtual networks. This is all too common and falls apart within 18-24 months due to service limits, how cloud networks work differently, and cloud-native application requirements.
- You will want multiple cloud environments (accounts/subscriptions/projects are the terms used by different providers) and very likely multiple virtual networks in each environment. These are needed for blast radius control, managing service limits, and limiting IAM scopes.
- Map out your existing applications and environments (networks, cross-app connectivity, associated security controls, and related supporting services such as DNS and logging). Create a registry and then prioritize and sequence your moves.
- Map out your application dependencies. You might have 50 applications which all connect back to a shared customer database. This directly impacts how you structure your accounts, virtual networks, and connectivity options.
- Design a flexible architecture. Think of it as a scaffold to build on as you pull project by project into the cloud. You don’t need to definitively plan out every piece of the migration before you move, unless you really like spending massive amounts of money on project managers and consultants.
- Then start building the scaffold:
- Start with foundational shared services you will need across all your cloud environments: logging/monitoring receivers, cloud assessment (cloud security posture management), cloud automation (including cloud detection and response), other visibility/monitoring tools, and IAM.
- You will likely need at least one transit network (a central virtual network used to peer your other virtual networks, even across cloud environments). Design this network (in its own account) for transit only… don’t design it to contain any actual resources (except possibly some shared services).
- Many shared services work better as “endpoint services” which are published within the cloud provider but don’t require network peering outside. Implementation is quite different at each cloud provider, so we can’t get more specific in this research, but endpoint services really enable you to take advantage of cloud software defined networks, and reduce reliance on fixed IP addresses and traditional network segmentation.
- Build infrastructure as code templates for your “landing zones” for the new accounts you will create for various projects. These can and should embed foundational security controls, such as links to transit networks and endpoint services (as appropriate), baseline network security controls, and implementation of the assessment, monitoring/logging, visibility, automation, IAM, and other core tools you use to track each of your environments.
- Don’t forget, these are just pointers to get you started – we aren’t trying to downplay the complexity of these projects.
- With the scaffold in place it’s time to start migrating workloads:
- Think of this as an iterative process. Just as you build a scaffold and smaller environments, move your projects over in a prioritized order to help you learn as you go.
- As you move each project over, try to refactor and rearchitect to the best of your ability. For example you should “fit the network to the application” – you can now have multiple software defined networks, each containing the bare minimum to support one project. This really helps reduce attack surface and provides compartmentalization.
- Keep up with continuous assurance. Mistakes happen and your shared monitoring, visibility, and remediation tools will help reduce exposure. Don’t wait until the end for one big assessment.
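The registry and dependency mapping from the planning step can feed directly into sequencing. As a rough sketch, assuming a hypothetical registry in which each application lists what it depends on, Python's standard `graphlib` module can group applications into migration waves. Moving shared dependencies first is one possible policy, not a rule from this research, and your priorities may call for a different order.

```python
from graphlib import TopologicalSorter

# Hypothetical registry: application name -> the services it depends on.
DEPENDENCIES = {
    "customer-db": set(),
    "log-receiver": set(),
    "crm": {"customer-db", "log-receiver"},
    "marketing-site": {"crm"},
    "billing": {"customer-db"},
}


def migration_waves(dependencies):
    """Group applications into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(migration_waves(DEPENDENCIES), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

A circular dependency surfaces as an error here, which is itself a useful finding: those applications probably need to move together or be refactored before they move.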
These migrations and transformations can be overwhelming if you try to plan everything out as one giant project. If you think in terms of building central services and a scaffold, then migrating projects iteratively, you reduce risk while increasing your chance of success.
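One way to picture a landing zone template is as a baseline that every new account starts from, with project-specific overrides layered on top and a check that the foundational controls stay on. This sketch uses plain dictionaries rather than any real infrastructure as code tool, and every key, value, and control name in it is hypothetical.

```python
import copy

# Hypothetical baseline for a new account; all keys and values are illustrative.
BASELINE = {
    "logging": {"enabled": True, "destination": "central-log-endpoint"},
    "assessment": {"enabled": True},
    "network": {"transit_attachment": False, "default_inbound": "deny"},
}

REQUIRED_ON = ("logging", "assessment")  # controls a project may not disable


def build_landing_zone(project, overrides=None):
    """Merge project overrides into a copy of the baseline and validate it."""
    config = copy.deepcopy(BASELINE)
    for section, values in (overrides or {}).items():
        config.setdefault(section, {}).update(values)
    for control in REQUIRED_ON:
        if not config[control].get("enabled"):
            raise ValueError(f"{project}: required control '{control}' is disabled")
    config["project"] = project
    return config


zone = build_landing_zone("crm", {"network": {"transit_attachment": True}})
print(zone["network"])
```

In practice the same idea would live in whatever template format your provider and tooling use. The point of the sketch is only that the baseline is shared and the required controls are enforced at build time.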
Snap Migration
Clarice swapped in the new security group and closed out the last (for now) high priority ticket. She checked her queue and the latest assessment results and everything looked okay.
“Well,” she thought to herself, “I guess it’s time to start hitting the internal groups.”
She Slacked Bill, “I think I have the dev teams locked down to our sanctioned CIDRs, how’s the service endpoint project going?”
“Pretty good… the log receiver is set up and we are close to cutting over the customer DB. We still need to peer the CRM stack’s network, but I think we can start weaning off some of the marketing apps and shut down those direct network connections.”
“Cool. Paul is assessing the rest of the spaghetti mess. It will take a bit to break out most of the apps into their own accounts, but at least we have a good base for the new projects.”
Snap migration can be the riskiest of all adoption patterns. Short timelines, critical resources, and rarely the skills and staff needed. They combine the messiness of the developer led pattern with the scale of a data center transformation. In our experience these projects often include a heavy dose of cloud provider or consultant pressure to move fast and gloss over complexity.
Let’s start with our principles:
- Your primary objective is to minimize immediate risk while creating a baseline to use as you clean things up over time after the cutover.
- Get the right people with the right skills. This includes training, hiring, and consulting. Make sure you really vet the people you are bringing in – even your cloud provider’s experts may be fresh out of school with little real-world experience.
- Don’t just copy and paste your existing network into the cloud. This approach always fails within 18-24 months, for many reasons we have already cited.
- Constantly look for opportunities to manage blast radius. Use multiple virtual networks and accounts, and only connect them where needed.
- You typically won’t have time in a snap migration for any serious refactoring or rearchitecting. Instead focus on a strong scaffold and management controls, with the expectation that you can start making things a little more cloud native once the main cutover is complete.
These are simply bad situations, which you need to make the best of. Making some smart decisions early on will go a long way to helping you set yourself up for iterative cleanup after the mad rush is over.
- Start by building a scaffold – not a parking lot.
- Follow our recommendations for the data center transformation pattern.
- While you might need to replicate your current network, nothing says you have to do that in a single virtual network. With peering and transit networks, you can architect your new cloud network with subnets in separate virtual networks and accounts based on projects, then connect them together with your cloud provider’s peering capabilities. For example you can create the 10.0.1.0/24 subnet in one virtual network in one cloud account, and the 10.0.2.0/24 subnet in an entirely different virtual network and account, then peer them together.
- This improves your long-term security because account segregation, even across peered networks, helps manage the service limit and IAM issues which cause so many problems when everything is in one account. For example if different projects share the same virtual network, it is hard to designate IAM privileges so the various administrators cannot affect each other’s resources.
- Knowing your subnets and connectivity requirements is a key factor for success.
- As with our data center transformation pattern, build your shared services after (or concurrently with) your network scaffold.
- Be cautious and judicious allowing Internet access. Controlling the public perimeter early is crucial. Quite a bit can be accidentally opened up during data migrations, as teams rush to throw assets into the cloud, so make sure you keep a continuous eye on things.
- Also track the network connections to your on-premise environments. At some point many of these openings should be shut down, as projects complete migration and no longer need to call back to the doomed data center.
- To the best of your ability also implement in-cloud network segregation with security groups. Another issue we often see is excessive security group openings within the network – ops, devs, or even security may not know all the right port and protocol combinations for a given application. There is literally zero cost to more security groups, which are effectively firewalls around every resource. Use them to your advantage and dial down permissions.
- In the long term you will want to sweep through and refactor and rearchitect where you can. This is much easier if you migrated into multiple accounts and virtual networks.
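Splitting subnets such as 10.0.1.0/24 and 10.0.2.0/24 across separate virtual networks and accounts only works if the address ranges you plan to peer don't collide, since peered networks generally cannot have overlapping ranges. Here is a small sketch using Python's standard `ipaddress` module to check a plan before building it. The account and network names are hypothetical, and the third entry is included only to show what a conflict looks like.

```python
import ipaddress
from itertools import combinations

# Hypothetical plan: (account, virtual network) -> CIDR to be peered.
NETWORKS = {
    ("account-a", "vnet-crm"): "10.0.1.0/24",
    ("account-b", "vnet-marketing"): "10.0.2.0/24",
    ("account-c", "vnet-legacy"): "10.0.2.128/25",  # deliberately conflicting
}


def overlapping_pairs(networks):
    """Return every pair of planned networks whose address ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (first, second)
        for first, second in combinations(parsed, 2)
        if parsed[first].overlaps(parsed[second])
    ]


for first, second in overlapping_pairs(NETWORKS):
    print(f"{first} overlaps {second}")
```

Running the check against the full subnet inventory before the cutover is much cheaper than discovering a conflict when a peering request fails.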
Native New Build
Maria checked the assessment results from Dev and everything looked good. The Internet facing bit was just a single page app hosted in S3, but the Lambda functions needed network access to hit the Elasticsearch cluster. The security groups were locked down tight and the logging all hooked in using S3 and SNS so they didn’t need to link back using the logging PrivateLink. The security and networking IAM roles had the right permissions for the monitoring tools and the IR team could escalate to write access as needed.
“Hey John, do you know what org unit we are dropping this marketing app into? I want to check the SCPs to make sure nothing will break.”
“Yep, let me check… ” he replied, “looks like the default marketing one.”
“Cool, I’ll go approve it for prod and promote the Terraform build.”
Cloud native doesn’t mean a project is inherently secure, but it does completely shift the security and networking focus. The key principles are:
- Cloud security and operations start with architecture and end with automation. A well-designed architecture will reduce most risks. Automation maintains a strong and safe posture over time.
- Serverless, containers, and other emerging technologies are the norm. You may or may not have networks, and the networks you do have will be quite different from traditional infrastructure.
- Your public-facing perimeter is more than just what your virtual networks expose. Many services in cloud providers are (potentially) directly public-facing, and must be managed at the configuration level.
- Subdomain takeovers in cloud are very common due to these services. Make sure you are monitoring at the DNS level, not just IP addresses.
- The biggest issues we see for this pattern are mostly related to governance. Dev teams are allowed to move fast and break things, and while there is nothing inherently wrong with that, it becomes a problem when they move faster than security can contain risks. Early engagement, architectural support, continuous monitoring, and strong team relations are essential for success.
- Fit networks to applications. We will talk more about this later in recommendations, but this is a core philosophy: start with the application’s needs, and build the network to fit them.
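Monitoring at the DNS level for subdomain takeovers largely comes down to finding records which still point at a provider-hosted name after the underlying resource is gone. The sketch below assumes you can export your DNS records and a list of provider hostnames which are still live. The record shape, the `.storage.cloud.example` suffix, and all hostnames are made up for illustration.

```python
# Hypothetical DNS export; hostnames and the provider suffix are illustrative.
DNS_RECORDS = [
    {"name": "www.example.com", "type": "CNAME", "target": "site-prod.storage.cloud.example"},
    {"name": "promo.example.com", "type": "CNAME", "target": "promo-2019.storage.cloud.example."},
    {"name": "api.example.com", "type": "A", "target": "203.0.113.10"},
]

# Provider-hosted names which still map to a resource you own (assumed input).
LIVE_TARGETS = {"site-prod.storage.cloud.example"}

PROVIDER_SUFFIXES = (".storage.cloud.example",)


def dangling_records(records, live_targets, provider_suffixes):
    """Return names whose CNAME points at a provider hostname that is not live."""
    findings = []
    for record in records:
        target = record["target"].rstrip(".").lower()
        if (
            record["type"] == "CNAME"
            and target.endswith(provider_suffixes)
            and target not in live_targets
        ):
            findings.append(record["name"])
    return findings


print(dangling_records(DNS_RECORDS, LIVE_TARGETS, PROVIDER_SUFFIXES))
```

Each finding is a name someone else could claim by creating a resource with the orphaned provider hostname, so the usual remedy is to delete or repoint the record.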
As your organization becomes more and more cloud-native, you will want to start with your people, and with setting up a secure foundation for individual projects to execute on:
- Invest in people. Hire smart, train them, and allow them to become experts on the platforms you deploy on. When you transition employees with traditional skills to build cloud-native projects, don’t force them to split their time. Let them focus.
- Your scaffold will be similar to the ones we recommend for data center transformation, but you should plan on different network and security architectures. In many cloud-native deployments there is no customer-managed network.
- Rely more on object storage (such as S3), service endpoints, API gateways, and other tools which don’t require managing IP addresses for shared services. That said, you will still need some virtual networks and a transit gateway.
- Set standards for your container networks and integrate them into your overall network management. Publish guidelines and even templates to build an easy path for independent teams to follow. Container networks can be easy to lose track of, especially when they are self-contained.
- Continuous integration and infrastructure as code are your friends. Develop supported templates for different patterns (e.g., serverless, containers, standard virtual networks) which integrate your monitoring, logging, management, and security tools. Project teams can build these into their own templates; offering an easy path again helps encourage compliance.
- You will need to continuously monitor and enforce your standards across hundreds or even thousands of cloud accounts. Build this early and automate provisioning through infrastructure as code and other automation capabilities.
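Continuous enforcement across many accounts is, at its core, a repeated comparison of what each account looks like against the standard you published. A minimal sketch of that comparison follows, using a recursive dictionary diff. The standard, the account names, and the configuration keys are all hypothetical, and collecting the actual configurations is left to your assessment tooling.

```python
# Hypothetical standard and per-account snapshots; all keys are illustrative.
STANDARD = {
    "logging": {"enabled": True},
    "network": {"default_inbound": "deny"},
}

ACCOUNTS = {
    "acct-marketing": {"logging": {"enabled": True}, "network": {"default_inbound": "deny"}},
    "acct-crm": {"logging": {"enabled": False}, "network": {}},
}


def drift(expected, actual, path=""):
    """Return (setting, expected, found) for every setting that deviates."""
    diffs = []
    for key, want in expected.items():
        here = f"{path}.{key}" if path else key
        if key not in actual:
            diffs.append((here, want, None))  # setting is missing entirely
        elif isinstance(want, dict) and isinstance(actual[key], dict):
            diffs.extend(drift(want, actual[key], here))
        elif actual[key] != want:
            diffs.append((here, want, actual[key]))
    return diffs


for account, config in ACCOUNTS.items():
    for setting, want, found in drift(STANDARD, config):
        print(f"{account}: {setting} expected {want!r}, found {found!r}")
```

The diff only reports settings named in the standard, so teams remain free to add their own configuration on top without generating noise.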
As a final reminder, cloud native architectures and operations are very different. Your core skills and objectives are the same, but the implementation details are incredibly different and often don’t even translate between cloud providers. Providers launch new features and services on a daily basis, further challenging overworked security and operations teams.
Learn, take your time, work well with project teams, be nimble, and if you are in management… give your people time to keep up with the rapid rate of change.