Securing SAP Clouds: Architecture and Operations
By Adrian Lane
This post will discuss several key differences in application architecture and operations – with a direct impact on security – which you need to reconsider when migrating to cloud services. These are the areas which make operations easier and security better.
As companies move large business-critical applications to the cloud, they typically do it backwards. To start getting familiar with the cloud, most people we speak with opt for cheap storage. Once a toe is in the water they place some development, testing, and failover servers in the cloud to backstop on-premise systems. These are less critical than production servers, where firms do not tolerate missteps. By default firms design their first cloud systems and applications to mirror what they already have in existing data centers. That means they carry over the same architecture, network topology, operational model, and security models. Developers and operations teams work with a familiar model, can leverage existing skills, and can focus on learning the nuances of their new cloud service. More often than not, once these teams are up to speed, they expect to migrate production systems fully to the cloud. Logical, right? It’s good until you move production to the cloud, when it becomes very wrong.
Long-term, this approach creates problems. It’s the “Lift and Shift” model of cloud deployment, where you create an exact copy of what you have today, just running on a service provider’s platform. The issues are many and varied. This approach fails to take into account the inherent resiliency of cloud services. It doesn’t embrace automatic scaling up and down for efficient resource usage. From our perspective the important failures are around security capabilities. This approach fails to embrace ephemeral servers, highly segmented networks, automated patching, or agile incident response – all of which enable companies to respond to security issues faster, more efficiently, and more accurately than possible with existing systems.
Network and Application Segmentation
Most firms have a security ‘DMZ’, an untrusted zone between the outside world and their internal network, and behind it a flat internal network. There are good reasons this less-than-ideal setup is common. Segregating networks in a data center is hard – users and applications leverage many different resources. Segregating networks often requires special hardware and software, and becomes expensive to implement and difficult to maintain. But attackers commonly move from where they breached a company network, either “East/West” between servers or “North/South” to gain control of applications as well. ‘Pivoting’ this way, to compromise as much as possible, is exactly why we segregate networks and applications.
But this is exactly the sort of capability provided by default with cloud services. If you’re leveraging SAP’s Hana Cloud Platform, or running SAP Hana on an IaaS provider like AWS, network segregation is built in. First, inbound ports and protocols are disabled by default, eliminating many of the avenues attackers use to penetrate servers. You open only those ports and protocols you need. Second, SAP and AWS are inherently multi-tenant services, so individual accounts – and their assigned resources – are fully segregated and protected from other users. This enables you to limit the “blast radius” of a compromise to the resources in a single account. Application-by-application segregation is not new, but ease of use makes it newly feasible in the cloud. In some cases you can even leverage both PaaS and IaaS simultaneously – letting one cloud serve as an “air gap” for another. Your cloud service provider offers added advantages of running under different account credentials, roles, and firewalls. You can specify exactly which users can access specific ports, require TLS, and limit inbound connections to approved IP addresses.
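The default-deny posture described above can be sketched in a few lines of Python. This is only an illustration of the idea, not any provider's API: the `Rule` type, the `is_allowed` helper, and the sample addresses (taken from documentation-only IP ranges) are all hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """One explicit inbound allowance (hypothetical shape, not a vendor API)."""
    port: int
    protocol: str             # e.g. "tcp"
    source_cidr: str          # approved source address range
    require_tls: bool = True  # reject plaintext connections by default


def is_allowed(rules, port, protocol, source_ip, uses_tls):
    """Default deny: a connection passes only if some rule explicitly matches."""
    addr = ip_address(source_ip)
    for rule in rules:
        if (
            rule.port == port
            and rule.protocol == protocol
            and addr in ip_network(rule.source_cidr)
            and (uses_tls or not rule.require_tls)
        ):
            return True
    return False  # nothing matched, so the traffic is dropped


# Illustrative rule set: HTTPS only, and only from one approved range.
RULES = [Rule(port=443, protocol="tcp", source_cidr="203.0.113.0/24")]

if __name__ == "__main__":
    print(is_allowed(RULES, 443, "tcp", "203.0.113.10", uses_tls=True))  # approved source, open port
    print(is_allowed(RULES, 22, "tcp", "203.0.113.10", uses_tls=True))   # port was never opened
    print(is_allowed(RULES, 443, "tcp", "198.51.100.7", uses_tls=True))  # source not approved
```

The point of the design is that an empty rule list admits nothing; every open port, protocol, and source range has to be a deliberate entry.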
“Immutable servers” have radically changed how we approach security. Immutable servers do not change once they go into production. You completely remove login access to the server. PaaS providers leverage this approach to ensure their administrators cannot access your underlying resources. For IaaS it means there is no administrative access to servers. In Hana, for example, your team only logs into the application layer, and the underlying servers do not offer administrator logins for the service provider – that capability is disabled. Your operating systems and applications cannot be changed, and administrative ports and accounts are disabled entirely. If you need to update an OS or application you alter the server configuration or select a new version of the application code in a cloud console, and then start new application servers and shut down the old versions.
Hana Cloud Platform (HCP) does not yet leverage immutable servers, but it is on the roadmap. Regular automated replacement is a huge shock, and it takes most IT operations folks a long time to wrap their heads around, but it is something you should embrace early for the security and productivity gains. Preventing hostile administrative access to servers is one key advantage. And auditors love the fact that third parties do not have access.
Limiting “blast radius” means limiting which resources an attacker can access after initial compromise. We reduce blast radius by preventing attackers from pivoting elsewhere, by reducing the number of accessible services. There are a couple of approaches. One is use of VPCs and the cloud’s native hyper-segregation. Most vulnerable ports, protocols, and permissions are simply unavailable. Another approach is to deploy different SAP features and add-ons in different user accounts, leveraging the isolation capabilities built into multi-tenant clouds. If a specific user or administrative account is breached, your exposure is limited to the resources in that account. This sounds radical but it is not particularly difficult to implement. Some firms we have spoken with manage hundreds – or even thousands – of accounts to segregate development, QA, and production systems.
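To make the account-per-environment idea concrete, here is a small sketch that compares exposure in a segregated layout against a single flat account. The inventory, the account names, and the `blast_radius` helper are all made up for illustration, and the sketch assumes accounts are fully isolated from each other with no cross-account roles.

```python
from collections import defaultdict

# Hypothetical inventory of (account, resource) pairs; names are illustrative.
SEGREGATED = [
    ("dev", "dev-app-server"),
    ("dev", "dev-database"),
    ("qa", "qa-app-server"),
    ("prod", "prod-app-server"),
    ("prod", "prod-database"),
]

# The same resources, all deployed under one shared account.
FLAT = [("shared", resource) for _, resource in SEGREGATED]


def resources_by_account(inventory):
    """Group resource names by the account that owns them."""
    grouped = defaultdict(set)
    for account, resource in inventory:
        grouped[account].add(resource)
    return grouped


def blast_radius(inventory, breached_accounts):
    """Resources exposed if the given accounts are compromised.

    Assumes full isolation between accounts, so a breach never spreads
    beyond the accounts listed in breached_accounts.
    """
    grouped = resources_by_account(inventory)
    exposed = set()
    for account in breached_accounts:
        exposed |= grouped.get(account, set())
    return exposed


if __name__ == "__main__":
    print(sorted(blast_radius(SEGREGATED, {"dev"})))
    print(sorted(blast_radius(FLAT, {"shared"})))
```

In the segregated layout a breached development account exposes only development resources; in the flat layout the same breach reaches everything.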
Most firms we speak with have a firewall to protect their internal network from outsiders, and identity and access management to gate user access to SAP features. Beyond that most security is not at the application layer – instead it is at the network layer. Intrusion detection, data loss prevention, extrusion filtering, user behavior monitoring, and similar security capabilities all work by inspecting network traffic. In the cloud you rely far more on application-layer security, along with application logs and agent-based monitors, to collect events for security analysis. You need to understand which network-oriented security measures become obsolete because the attack vectors they address have become non-issues, and find suitable replacements to address threats which remain an issue.
Patching and Change Management
In general, all organizations hate to patch. It requires server downtime, synchronizing the efforts of different teams to install, and then testing before applications or servers can go back into production. Small manual fixes and configuration changes are often needed to make the new code work, and all too often they never make it outside administrator skulls into change management systems. And there is always a chance a patch may break an application, requiring a patch rollback and recovery to a previous state.
One of the biggest obstacles for IT and security professionals is the idea that we patch servers regularly, automating much of the patching and rollout process. In PaaS, infrastructure patching is handled for you on a regular basis. It occurs quietly behind the scenes, without service interruption, often without the customer being aware of any changes. Providers can manage this because each logical server is actually multiple virtual servers, behind a load balancer, regularly cycled by the service provider to keep current and healthy. And as the platform is fully API-enabled, you can leverage these capabilities to roll out patches for your own applications using the Hana cloud.
On IaaS you run multiple instances of your applications in an autoscale group behind your load balancer, rolling out new patched instances as needed. You can leave unpatched versions up and running, allowing load balancers to steer traffic away from them, until you are satisfied and ready to terminate the older instances. When something goes wrong with a server or application instance, it’s easiest to replace it with a fresh image. You can automatically scan for misconfigurations in metadata, the network, or applications across your Hana instance, and remediate with vendor APIs.
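The rollout pattern above (launch patched instances, steer traffic away from the old ones, terminate later) can be modeled in plain Python. This is a simulation of the idea under stated assumptions, not a real autoscaling API: `Instance`, `rolling_replace`, `terminate_drained`, and the `health_check` callback are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    image: str               # image/version the instance was launched from
    in_service: bool = True  # whether the load balancer sends it traffic


def rolling_replace(pool, new_image, health_check):
    """Replace each in-service instance running an old image with a fresh one.

    A replacement is launched and health-checked first. Only if it passes is
    the old instance drained (taken out of rotation). Drained instances stay
    in the pool, idle, so they can be terminated or restored later.
    Note: drained instances are the same objects as in the input pool.
    """
    result = list(pool)
    for old in pool:
        if old.image == new_image or not old.in_service:
            continue
        fresh = Instance(name=f"{old.name}-{new_image}", image=new_image)
        if not health_check(fresh):
            # Stop the rollout; remaining old instances keep serving traffic.
            break
        old.in_service = False  # steer traffic away, but do not terminate yet
        result.append(fresh)
    return result


def terminate_drained(pool):
    """Drop instances that are no longer in service."""
    return [inst for inst in pool if inst.in_service]


if __name__ == "__main__":
    pool = [Instance("app-1", "v1"), Instance("app-2", "v1")]
    pool = rolling_replace(pool, "v2", health_check=lambda inst: True)
    print([(i.name, i.image, i.in_service) for i in pool])
    pool = terminate_drained(pool)  # once you are satisfied with the new version
    print([(i.name, i.image) for i in pool])
```

Keeping the drained instances around until an explicit `terminate_drained` call mirrors the text: the old version stays available until you decide the patched one is good.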
In some cases cloud providers roll out one or more patches per day, so there is a shrinking window for attackers to exploit known flaws. As a reference for how effective this can be, some cloud providers were able to patch hundreds of thousands of servers against the Heartbleed vulnerability in under 48 hours, before attackers could weaponize the exploit. This is a tremendous advantage for security. It means we do not need to be three, six, or twelve months behind on security patches – to SAP, servers, or our own applications. Most attackers leverage known vulnerabilities which have not been patched, so fast-flux patching is generally faster than they can react. It also means that if an attacker does manage to compromise a server or application instance, they cannot “camp out” long because their victim servers will be replaced soon.
When a company discovers malware on their network, the long process of discovering which servers are infected begins. For each infected server, most organizations physically quarantine the machine, make backups, bring the physical server or image to the Security Operations Center for analysis, and then requisition a new server. We bring up incident response because it needs to be entirely re-examined. If you are using SAP’s PaaS service for Hana and believe there has been a compromise, you are limited to application-layer logs and whatever SAP has contractually guaranteed to provide – typically not much. But IaaS and PaaS enable you to automate most of your response, which has traditionally been highly labor-intensive.
With IaaS you have even greater control over resources. With a very small set of API calls to your cloud service you can isolate a server, pull instance metadata, snapshot all storage and memory, change ownership and access rights of the live running instance to the security group, launch analysis tools, and launch a replacement server. It all runs within seconds. One of the most time-consuming security tasks for IT operations can be reduced to a background utility script. This is self-healing infrastructure, with operations and orchestration capabilities to make it reality.
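The response sequence just described could be organized as an ordered runbook. The sketch below uses a stand-in `RecordingCloud` class rather than a real provider SDK; every step name and method here is hypothetical, and a real script would map each step to the provider's actual API calls.

```python
# Ordered response steps, following the sequence described in the text.
RESPONSE_STEPS = [
    "isolate_instance",
    "pull_instance_metadata",
    "snapshot_storage_and_memory",
    "reassign_access_to_security",
    "launch_analysis_tools",
    "launch_replacement_server",
]


class RecordingCloud:
    """Stand-in for a provider SDK (hypothetical); it only records calls."""

    def __init__(self, failing_step=None):
        self.calls = []
        self.failing_step = failing_step  # lets you simulate an API failure

    def run(self, step, instance_id):
        if step == self.failing_step:
            raise RuntimeError(f"{step} failed for {instance_id}")
        self.calls.append((step, instance_id))


def respond_to_incident(cloud, instance_id, steps=RESPONSE_STEPS):
    """Run the steps in order and stop at the first failure.

    Stopping early means a replacement is never launched on top of a
    failed isolation or a missing snapshot. Returns a small report.
    """
    completed = []
    for step in steps:
        try:
            cloud.run(step, instance_id)
        except RuntimeError as exc:
            return {"instance": instance_id, "completed": completed, "error": str(exc)}
        completed.append(step)
    return {"instance": instance_id, "completed": completed, "error": None}


if __name__ == "__main__":
    report = respond_to_incident(RecordingCloud(), "server-42")
    print(report)
```

Putting the steps in a single list keeps the ordering reviewable by both Operations and Security, and the returned report gives the security team a record of how far the automation got.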
To achieve fundamentally better security at lower cost, you need to redesign application deployments a bit. Realizing the benefits of Agile cloud operations requires a time investment. You need to understand how cloud services work, and to create automation scripts which embody your processes for automated patch management, secure deployment, and incident response. You will need to move much of your operations to a continuous integration model – code, scripts, and manifests must be automatically assembled; security credentials must be issued; and deployment must become automatic. Hopefully we have made clear that there is considerable overlap between what security teams typically prescribe and how operations teams work. Leveraged properly, these productivity advantages also produce security advantages. The great part is that these features are simply part of the cloud service you are already paying for.
That said, there is still a lot of work to do. To implement segregated application stacks and/or segregated networks, you need to alter your deployment model for applications to leverage granular isolation. To leverage “best of breed” cloud services you will re-architect or even break apart applications into smaller services, each running on the cloud best suited to its task. To take advantage of immutable servers you need to standardize your application configurations, and for Hana on IaaS to script your server startup and configuration processes. To patch with agility you must evolve how you apply and test patches, and your entire patch management process. Most of Operations’ work to support incident response becomes a straightforward script. This will save Operations hours of manual labor – after you script the process and set up connections for Security Operations to receive server images. But over the long term it requires less Operations work, and security is integrated into the process.