Right now I’m working on updating many of my little command line tools into releasable versions. It’s a mixed bag of things I’ve written for demos, training classes, clients, or Trinity (our mothballed product). A few of these are security automation tools I’m building for clients, to give them a skeleton framework for building out their own automation programs. Basically, this is what we created Trinity for, since Trinity itself isn’t releasable.
One question that comes up a lot when I’m handing this off is why I write custom Ruby/Python/whatever code instead of using CloudFormation or Terraform scripts. If you are responsible for cloud automation at all, this is a super important question to ask yourself.
The correct answer is there isn’t one single answer. It depends as much on your experience and preferences as anything else. Each option can handle much of the job, at least for configuration settings and implementing a known-good state. Here are my personal thoughts from the security pro perspective.
CloudFormation and Terraform are extremely good for creating known good states and immutable infrastructure and, in some cases, updating and restoring to those states. I use CloudFormation a lot and am starting to also leverage Terraform more (because it is cross-cloud capable). They both do a great job of handling a lot of the heavy lifting and configuring pieces in the proper order (managing dependencies) which can be tough if you script programmatically. Both have a few limits:
- They don’t always support all the cloud provider features you need, which forces you to bounce outside of them.
- They can be difficult to write and manage at scale, which is why many organizations that make heavy use of them use other languages to actually create the scripts. This makes it easier to update specific pieces without editing the entire file and introducing typos or other errors.
- They can push updates to stacks, but if you made any manual changes I’ve found these frequently break. Thus they are better for locked-down production environments that are totally immutable and not for dev/test or manually altered setups.
- They aren’t meant for other kinds of automation, like assessing or modifying in-use resources. For example, you can’t use them for incident response or to check specific security controls.
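The second limitation above is exactly why many shops generate templates from code instead of hand-editing one giant file. Here is a minimal Ruby sketch of the idea; the bucket resources are purely illustrative, not from any real stack:

```ruby
require 'json'

# Build a CloudFormation template as a plain Ruby hash, so individual
# resources can be generated, reviewed, and updated independently
# instead of editing one monolithic file by hand.
def build_template(bucket_names)
  resources = {}
  bucket_names.each_with_index do |name, i|
    resources["LogBucket#{i}"] = {
      'Type' => 'AWS::S3::Bucket',
      'Properties' => { 'BucketName' => name }
    }
  end
  {
    'AWSTemplateFormatVersion' => '2010-09-09',
    'Resources' => resources
  }
end

template = build_template(['audit-logs', 'flow-logs'])
puts JSON.pretty_generate(template)
```

Change the input list, regenerate, and you get a fresh template without ever touching the JSON directly, which sidesteps the typo problem entirely.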
I’m not trying to be negative here – they are awesome tools, totally essential to cloud and DevOps. But there are times you want to attack the problem in a different way.
Let me give you a specific use case. I’m currently writing a “new account provisioning” tool for a client. Basically, when a team at the client starts up a new Amazon account, this shovels in all the required security controls. IAM, monitoring, etc. Nearly all of it could be done with CloudFormation or Terraform but I’m instead writing it as a Ruby app. Here’s why:
- I’m using Ruby to abstract complexity from the security team and make security easy. For example, to create new Identity and Access Management policies, users, and roles, the team can point the tool towards a library of files and the tool iterates through and builds them in the right order. The security team only needs to focus on that library of policies and not the other code to build things out. This, for them, will be easier than adding it to a large provisioning template. I could take that same library and actually build a CloudFormation template dynamically the same way, but…
- … I can also use the same code base to fix existing accounts or (eventually) assess and modify an account that’s been changed in the future. For example, I can (and will) be able to assess an account, and if the policies don’t match, enable the user to repair it with flexibility and precision. Again, this can be done without the security pro needing to understand a lot of the underlying complexity.
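The library-driven approach in the first bullet can be sketched in a few lines of Ruby. This is an illustrative skeleton, not the client tool itself, and the file-naming convention is an assumption I made up for the example:

```ruby
require 'json'
require 'tmpdir'

# Order matters: managed policies must exist before the roles that
# attach them, and roles before the users that assume them.
BUILD_ORDER = %w[policy role user].freeze

# Walk a library of JSON files named like "policy-audit.json" or
# "user-analyst.json" and return them in safe build order, so the
# security team only maintains the policy files themselves.
def load_policy_library(dir)
  entries = Dir.glob(File.join(dir, '*.json')).map do |path|
    kind = File.basename(path).split('-').first
    { kind: kind,
      name: File.basename(path, '.json'),
      doc:  JSON.parse(File.read(path)) }
  end
  entries.sort_by { |e| BUILD_ORDER.index(e[:kind]) || BUILD_ORDER.size }
end

# Demo with a throwaway directory standing in for the real library.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'user-analyst.json'), '{}')
  File.write(File.join(dir, 'policy-readonly.json'), '{}')
  load_policy_library(dir).each { |e| puts "build #{e[:kind]}: #{e[:name]}" }
end
```

Each entry would then be handed to the provisioning code (or to a template generator), which is where the dependency ordering pays off.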
Those are the two key reasons I sometimes drop from templates to code. I can make things simpler and also use the same ‘base’ for more complex scenarios that the infrastructure as code tools aren’t meant to address, such as ‘fixing’ existing setups and allowing more granular decisions on what to configure or overwrite. Plus, I don’t have to wait for the templates to support new cloud provider features; I can add capabilities any time there is an API, and with modern cloud providers, if there’s a feature, it has an API.
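The ‘fixing existing setups’ scenario boils down to drift detection: diff the desired state against the actual state and repair only what changed. A hedged sketch, with the actual-state fetch stubbed out (in the real tool it would be an IAM API call):

```ruby
# Compare the desired IAM policy statements from the policy library
# against what is actually attached in the account, and report the
# difference so the tool can repair only what drifted.
def policy_drift(desired, actual)
  missing = desired['Statement'] - actual['Statement']
  extra   = actual['Statement'] - desired['Statement']
  { missing: missing, extra: extra,
    in_sync: missing.empty? && extra.empty? }
end

desired = { 'Statement' => [{ 'Effect' => 'Deny',
                              'Action' => 's3:DeleteBucket' }] }
actual  = { 'Statement' => [] } # stub: would come from the IAM API
puts policy_drift(desired, actual)[:missing].inspect
```

The security pro sees “this statement is missing, repair it?” rather than the raw API plumbing, which is the whole point of abstracting the complexity.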
In practice you can mix and match these approaches. I have my biases, and maybe some of it is just that I like to learn the APIs and features directly. I do find that having all these code pieces gives me a lot more options for various use cases, including using them to actually generate the templates when I need them and they might be the better choice. For example, one of the features of my framework is installing a library of approved CloudFormation templates into a new account to create pre-approved architecture stacks for common needs.
It all plays together. Pick what makes sense for you, and hopefully this will give you a bit of insight into how I make the decision.
Posted at Wednesday 16th November 2016 9:58 pm
(0) Comments •
By Adrian Lane
I had a brief conversation today about security for cloud database deployments, and their two basic questions encapsulated many conversations I have had over the last few months. It is relevant to a wider audience, so I will discuss them here.
The first question I was asked was, “Do you think that database security is fundamentally different in the cloud than on-premise?”
Yes, I do. It’s not the same. Not that we no longer need IAM, assessment, monitoring, or logging tools, but the way we employ them changes. And there will be more focus on things we have not worried about before – like the management plane – and far less on things like archival and physical security. But it’s very hard to compare apples to apples here, because of fundamental changes in the way cloud works. You need to shift your approach when securing databases run on cloud services.
The second question was, “Then how are things different today from 2011 when you wrote about cloud database security?”
Database security has changed in three basic ways:
1) Architecture: We no longer leverage the same application and database architectures. It is partially about applications adopting microservices, which both promotes micro-segmentation at the network and application layers, and breaks the traditional approach of closely tying the application to a database. Architecture has also developed in response to evolving database services. We see a need for more types of data, with far more dynamic lookup and analysis than transaction support. Together these architectural changes lead to more segmented deployments, with more granular control over access to data and database services.
2) Big Data: In 2011 I expected people to push their Oracle, MS SQL Server, and PostgreSQL installations into the cloud, to reduce costs and scale better. That did not happen. Instead firms prefer to start new projects in the cloud rather than moving existing projects. Additionally we see strong adoption of big data platforms such as Hadoop and Dynamo. These are different platforms with slightly different security issues and security tools than the relational platforms which dominated the previous two decades. And in an ecosystem like Hadoop applications running on the same data lake may be exposed to entirely different service layers.
3) Database as a Service: At Securosis we were a bit surprised by how quickly the cloud vendors embraced big data. Now they offer big data (along with other relational database platforms) as a service. “Roll your own” has become much less necessary. Basic security around internal table structures, patching, administrative access, and many other facets is now handled by vendors to reduce your headaches. We can avoid installation issues. Licensing is far, far easier. It has become so easy to stand up a new relational database or big data cluster this way that running databases on Infrastructure as a Service now seems antiquated.
I have not gone back through everything I wrote in 2011, but there are probably many more subtle differences. The question itself also overlooks another important difference: security is now embedded in cloud services. None of us here at Securosis anticipated how fast cloud platform vendors would introduce new and improved security features. They have advanced their security offerings much faster than any other platform or service offering I’ve ever seen, and done a much better job with quality and ease of use than anyone expected. There are good reasons for this. In most cases the vendors were starting from a clean slate, unencumbered by legacy demands. Additionally, they knew security concerns were an impediment to enterprise adoption. To remove their primary customer objections, they needed to show that security was at least as good as on-premise.
In conclusion, if you are moving new or existing databases to the cloud, understand that you will be changing tools and process, and adjusting your biggest priorities.
Posted at Monday 14th November 2016 10:00 pm
By Mike Rothman
We have been fans of testing the security of infrastructure and applications as long as we can remember doing research. We have always known attackers are testing your environment all the time, so if you aren’t also self-assessing, inevitably you will be surprised by a successful attack. And like most security folks, we are no fans of surprises.
Security testing and assessment has gone through a number of iterations. It started with simple vulnerability scanning. You could scan a device to understand its security posture, which patches were installed, and what remained vulnerable on the device. Vulnerability scanning remains a function at most organizations, driven mostly by a compliance requirement.
As useful as it was to understand which devices and applications were vulnerable, a simple scan provides limited information. A vulnerability scanner cannot recognize that a vulnerable device is not exploitable due to other controls. So penetration testing emerged as a discipline to go beyond simple context-less vulnerability scanning, with humans trying to steal data.
Pen tests are useful because they provide a sense of what is really at risk. But a penetration test is resource-intensive and expensive, especially if you use an external testing firm. To address that, we got automated pen testing tools, which use actual exploits in a semi-automatic fashion to simulate an attacker.
Regardless of whether you use carbon-based (human) or silicon-based (computer) penetration testing, the results describe your environment at a single point in time. As soon as you blink, your environment will have changed, and your findings may no longer be valid.
With the easy availability of penetration testing tools (notably the open source Metasploit), defending against a pen testing tool has emerged as the low bar of security. Our friend Josh Corman coined HDMoore’s Law, after the leader of the Metasploit project. Basically, if you cannot stop a primitive attacker using Metasploit (or another pen testing tool), you aren’t very good at security.
The low bar isn’t high enough
As we lead enterprises through developing security programs, we typically start with adversary analysis. It is important to understand what kinds of attackers will be targeting your organization and what they will be looking for. If you think your main threat is a 400-pound hacker in their parents’ basement, defending against an open source pen testing tool is probably sufficient.
But do any of you honestly believe an unsophisticated attacker wielding a free penetration testing tool is all you have to worry about? Of course not. The key thing to understand about adversaries is simple: They don’t play by your rules. They will attack when you don’t expect it. They will take advantage of new attacks and exploits to evade detection. They will use tactics that look like a different adversary to raise a false flag.
The adversary will do whatever it takes to achieve their mission. They can usually be patient, and will wait for you to screw something up. So the low bar of security represented by a pen testing tool is not good enough.
The increasing sophistication of adversaries is not your only challenge assessing your environment and understanding risk. Technology infrastructure seems to be undergoing the most significant set of changes we have ever seen, and this is dramatically complicating your ability to assess your environment.
First, you have no idea where your data actually resides. Between SaaS applications, cloud storage services, and integrated business partner networks, the boundaries of traditional technology infrastructure have been extended unrecognizably, and you cannot assume your information is on a network you control. And if you don’t control the network it becomes much harder to test.
The next major change underway is mobility. Between an increasingly disconnected workforce and an explosion of smart devices accessing critical information, you can no longer assume your employees will access applications and data from your networks. Realizing that authorized users needing legitimate access to data can be anywhere in the world, at any time, complicates assessment strategies as well.
Finally, the push to public cloud-based infrastructure makes it unclear where your compute and storage are, as well. Many of the enterprises we work with are building cloud-native technology stacks using dozens of services across cloud providers. You don’t necessarily know where you will be attacked, either.
To recap, you no longer know where your data is, where it will be accessed from, or where your computation will happen. And you are chartered to protect information in this dynamic IT environment, which means you need to assess the security of your environment as often as practical. Do you start to see the challenge of security assessment today, and how much more complicated it will be tomorrow?
We Need Dynamic Security Assessment
As discussed above, a penetration test represents a point in time snapshot of your environment, and is obsolete when complete, because the environment continues to change. The only way to keep pace with our dynamic IT environment is dynamic security assessment. The rest of this series will lay out what we mean by this, and how to implement it within your environment.
As a little prelude to what you’ll learn, a dynamic security assessment tool includes:
- A highly sophisticated simulation engine, which can imitate typical attack patterns from sophisticated adversaries without putting production infrastructure in danger.
- An understanding of the network topology, to model possible lateral movement and isolate targeted information and assets.
- A security research team to leverage both proprietary and public threat intelligence, and to model the latest and greatest attacks to avoid unpleasant surprises.
- An effective security analytics function to figure out not just what is exploitable, but also how different workarounds and fixes will impact infrastructure security.
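The topology-modeling bullet above can be illustrated with a toy example: represent allowed connections as a graph, then ask whether an attacker’s foothold can reach a target asset. A real simulation engine models far more (credentials, vulnerabilities, segmentation rules); this Ruby sketch shows only the lateral-movement reachability core, with a made-up three-tier topology:

```ruby
# Breadth-first search over an adjacency list of allowed connections:
# can an attacker who owns `from` eventually reach `to`?
def reachable?(topology, from, to)
  seen  = [from]
  queue = [from]
  until queue.empty?
    node = queue.shift
    return true if node == to
    (topology[node] || []).each do |neighbor|
      next if seen.include?(neighbor)
      seen  << neighbor
      queue << neighbor
    end
  end
  false
end

topology = {
  'web'  => ['app'],        # web tier may call the app tier
  'app'  => ['db'],         # app tier may call the database
  'mgmt' => ['web', 'app']  # management network reaches both tiers
}

puts reachable?(topology, 'web', 'db')  # compromise of web reaches data
puts reachable?(topology, 'db', 'web')  # no path back out of the db tier
```

Run continuously against the live topology, this kind of query is what turns a point-in-time pen test finding into a standing assessment.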
We would like to thank SafeBreach as the initial potential licensee of this content. As you may remember, we research using our Totally Transparent Research methodology, which requires foresight on the part of our licensees. It enables us to post our papers in our Research Library without paywalls, registration, or any other blockage to you reading (and hopefully enjoying) our research.
We will start describing Dynamic Security Assessment in our next post.
Posted at Thursday 10th November 2016 9:13 pm
By Adrian Lane
Our last post in this series covers two key areas: Monitoring and Auditing. We have more to say, in the first case because most development and security teams are not aware of these options, and in the latter because most teams hold many misconceptions and considerable fear on the topic. So we will dig into these two areas essential to container security programs.
Every security control we have discussed so far has been preventative. Essentially these are security efforts that remove vulnerabilities or make it hard for anyone to exploit them. We address known attack vectors with well-understood responses such as patching, secure configuration, and encryption. But vulnerability scans can only take you so far. What about issues you are not expecting? What if a new attack variant gets by your security controls, or a trusted employee makes a mistake? This is where monitoring comes in: it’s how you discover the unexpected stuff. Monitoring is critical to a security program – it’s how you learn what is effective, track what’s really happening in your environment, and detect what’s broken.
For container security it is no less important, but today it’s not something you get from Docker or any other container provider.
Monitoring tools work by first collecting events, and then examining them in relation to security policies. The events may be requests for hardware resources, IP-based communication, API requests to other services, or sharing information with other containers. Policy types are varied. We have deterministic policies, such as which users and groups can terminate resources, which containers are disallowed from making external HTTP requests, or what services a container is allowed to run. And we have dynamic – also called ‘behavioral’ – policies, which detect issues such as containers calling undocumented ports, using 50% more memory than typical, or uncharacteristically exceeding runtime parameter thresholds. Combining deterministic whitelist and blacklist policies with dynamic behavior detection provides the best of both worlds, enabling you to detect both simple policy violations and unexpected variations from the ordinary.
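To make the two policy styles concrete, here is a deliberately tiny Ruby sketch that evaluates one container event against both: a deterministic blacklist of destination ports, and a behavioral check against an observed memory baseline. The specific ports and the 50% threshold mirror the examples above; none of this is from any real product.

```ruby
# Ports a container should never call: telnet, and the
# unauthenticated Docker API port, as illustrative examples.
BLOCKED_PORTS = [23, 2375].freeze

# Return the list of policy violations for a single event.
# A deterministic rule fires on exact matches; the behavioral rule
# fires when memory use runs 50% above the container's baseline.
def violations(event, baseline_mem_mb)
  found = []
  found << :blocked_port   if BLOCKED_PORTS.include?(event[:dest_port])
  found << :memory_anomaly if event[:mem_mb] > baseline_mem_mb * 1.5
  found
end

puts violations({ dest_port: 2375, mem_mb: 120 }, 200).inspect
puts violations({ dest_port: 443,  mem_mb: 400 }, 200).inspect
```

A real monitor evaluates many more signals (system calls, image IDs, connectivity), but the shape is the same: events in, deterministic and behavioral rules applied, violations out.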
We strongly recommend that your security program include monitoring container activity. Today, a couple container security vendors offer monitoring products. Popular evaluation criteria for differentiating products and determining suitability include:
- Deployment Model: How does the product collect events? What events and API calls can it collect for inspection? Typically these products use either of two models for deployment: an agent embedded in the host OS, or a fully privileged container-based monitor running in the Docker environment. How difficult is it to deploy collectors? Do the host-based agents require a host reboot to deploy or update? You will need to assess what type of events can be captured.
- Policy Management: You will need to evaluate how easy it is to build new policies – or modify existing ones – within the tool. You will want to see a standard set of security policies from the vendor to help speed up deployment, but over the lifetime of the product you will stand up and manage your own policies, so ease of management is key to your long-term happiness.
- Behavioral Analysis: What, if any, behavioral analysis capabilities are available? How flexible are they, meaning what types of data can be used in policy decisions? Behavioral analysis requires starting with system monitoring to determine ‘normal’ behavior. The criteria for detecting aberrations are often limited to a few sets of indicators, such as user ID or IP address. The more you have available – such as system calls, network ports, resource usage, image ID, and inbound and outbound connectivity – the more flexible your controls can be.
- Activity Blocking: Does the vendor provide the capability to block requests or activity? It is useful to block policy violations in order to ensure containers behave as intended. Care is required, as these policies can disrupt new functionality, causing friction between Development and Security, but blocking is invaluable for maintaining Security’s control over what containers can do.
- Platform Support: You will need to verify your monitoring tool supports the OS platforms you use (CentOS, CoreOS, SUSE, Red Hat, etc.) and the orchestration tool (such as Swarm, Kubernetes, Mesos, or ECS) of your choice.
Audit and Compliance
What happened with the last build? Did we remove sshd from that container? Did we add the new security tests to Jenkins? Is the latest build in the repository?
Many of you reading this may not know the answer off the top of your head, but you should know where to get it: log files. Git, Jenkins, JFrog, Docker, and just about every development tool you use creates log files, which we use to figure out what happened – and often what went wrong. There are people outside Development – namely Security and Compliance – who have similar security-related questions about what is going on with the container environment, and whether security controls are functioning. Logs are how you get these external teams the answers they need.
Most of the earlier topics in this research, such as build environment and runtime security, have associated compliance requirements. These may be externally mandated like PCI-DSS or GLBA, or internal security requirements from internal audit or security teams. Either way the auditors will want to see that security controls are in place and working. And no, they won’t just take your word for it – they will want audit reports for specific event types relevant to their audit. Similarly, if your company has a Security Operations Center, in order to investigate alerts or determine whether a breach has occurred, they will want to see all system and activity logs over a period of time in order to reconstruct events. You really don’t want to get too deep into this stuff – just get them the data and let them worry about the details.
The good news is that most of what you need is already in place. During our investigation for this series we did not speak with any firms which did not have Splunk, log storage, or SIEM on-premise, and in many cases all three were available. Additionally the vast majority of code repositories, build controllers, and container management systems – specifically the Docker runtime and Docker Trusted Registry – produce event logs, in formats which can be consumed by various log management and Security Information and Event Management (SIEM) systems. As do most third-party security tools for image validation and monitoring. You will need to determine how easy this is to leverage. Some tools simply dump syslog-format information into a directory, and it’s up to you to drop this into Splunk, an S3 bucket, Loggly, or your SIEM tool. In other cases – most, actually – you can specify CEF, JSON, or some other format, and the tools can automatically link to the SIEM of your choice, sending events as they occur.
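For a feel of the difference between formats, here is a sketch that emits the same container event as JSON and as a CEF line. The CEF header fields (vendor, product, version) are placeholders, and the event itself is invented for illustration:

```ruby
require 'json'

# Render an event hash as a CEF line. CEF uses a pipe-delimited
# header (version|vendor|product|device version|signature|name|severity)
# followed by space-separated key=value extensions.
def to_cef(event, severity: 5)
  ext = "src=#{event[:src]} dst=#{event[:dst]} act=#{event[:action]}"
  "CEF:0|ExampleVendor|ContainerMonitor|1.0|" \
    "#{event[:id]}|#{event[:name]}|#{severity}|#{ext}"
end

event = { id: 100, name: 'image-not-signed', src: '10.0.1.5',
          dst: '10.0.2.9', action: 'blocked' }

puts JSON.generate(event) # structured, for S3/Loggly/modern SIEMs
puts to_cef(event)        # flat, for CEF-consuming SIEM connectors
```

Whichever format your tools emit, the point is the same: get the events somewhere the Security and Compliance teams can query them without your help.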
This concludes our research on Building a Container Security Program. We covered a ton of different aspects – both production and non-production. We tried to offer sufficient depth to be helpful, without overwhelming you with details. If we missed something you feel is important, or you have unanswered questions, please drop us a note. We will address it in the comments below, or in the final paper, as appropriate. Your feedback helps make these series and papers better, so please help us and other readers out.
Posted at Wednesday 9th November 2016 8:30 pm
By Adrian Lane
This post will focus on the ‘runtime’ aspects of container security. Unlike the tools and processes discussed in previous sections, here we will focus on containers in production systems. This includes which images are moved into production repositories, security around selecting and running containers, and the security of the underlying host systems.
- The Control Plane: Our first order of business is ensuring the security of the control plane – the platforms for managing host operating systems, the scheduler, the container engine(s), the repository, and any additional deployment tools. Again, as we advised for build environment security, we recommend limiting access to specific administrative accounts: one with responsibility for operating and orchestrating containers, and another for system administration (including patching and configuration management). We recommend network segregation and physical (for on-premise) or logical segregation (for cloud and virtual) systems.
- Running the Right Container: We recommend establishing a trusted image repository and ensuring that your production environment can only pull containers from that trusted source. Ad hoc container management is a good way to facilitate bypassing of security controls, so we recommend scripting the process to avoid manual intervention and ensure that the latest certified container is always selected. Second, you will want to check application signatures prior to putting containers into the repository. Trusted repository and registry services can help, by rejecting containers which are not properly signed. Fortunately many options are available, so find one you like. Keep in mind that if you build many containers each day, a manual process will quickly break down. You’ll need to automate the work and enforce security policies in your scripts. Remember, it is okay to have more than one image repository – if you are running across multiple cloud environments, there are advantages to leveraging the native registry in each. Beware the discrepancies between platforms, which can create security gaps.
- Container Validation and BOM: What’s in the container? What code is running in your production environment? How long ago did we build this container image? These are common questions asked when something goes awry. In case of container compromise, a very practical question is: how many containers are currently running this software bundle? One recommendation – especially for teams which don’t perform much code validation during the build process – is to leverage scanning tools to check pre-built containers for common vulnerabilities, malware, root account usage, bad libraries, and so on. If you keep containers around for weeks or months, it is entirely possible a new vulnerability has since been discovered, and the container is now suspect. Second, we recommend using the Bill of Materials capabilities available in some scanning tools to catalog container contents. This helps you identify other potentially vulnerable containers, and scope remediation efforts.
- Input Validation: At startup containers accept parameters, configuration files, credentials, JSON, and scripts. In some more aggressive scenarios, ‘agile’ teams shove new code segments into a container as input variables, making existing containers behave in fun new ways. Either through manual review, or leveraging a third-party security tool, you should review container inputs to ensure they meet policy. This can help you prevent someone from forcing a container to misbehave, or simply prevent developers from making dumb mistakes.
- Container Group Segmentation: Docker does not provide container-level restriction on which containers can communicate with other containers, systems, hosts, IPs, etc. Basic network security is insufficient to prevent one container from attacking another, calling out to a Command and Control botnet, or other malicious behavior. If you are using a cloud services provider you can leverage their security zones and virtual network capabilities to segregate containers and specify what they are allowed to communicate with, over which ports. If you are working on-premise, we recommend you investigate products which enable you to define equivalent security restrictions. In this way each application has an analogue to a security group, which enables you to specify which inbound and outbound ports are accessible to and from which IPs, and can protect containers from unwanted access.
- Blast Radius: A good option when running containers in cloud services, particularly IaaS clouds, is to run different containers under different cloud user accounts. This limits the resources available to any given container. If a given account or container set is compromised, the same cloud service restrictions which prevent tenants from interfering with each other limit possible damage between accounts and projects. For more information see our post on limiting blast radius with user accounts.
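The Bill of Materials bullet above is ultimately a lookup problem, and a tiny one to code. This Ruby sketch (with invented image names and package versions) answers the incident-response question directly: how many running containers include a package that has just turned out to be vulnerable?

```ruby
# bom maps each image to the packages it was built with; running is
# the current container inventory. Select every container whose image
# contains the suspect package.
def containers_with_package(bom, running, package)
  running.select { |c| (bom[c[:image]] || []).include?(package) }
end

bom = {
  'web:1.2' => ['openssl-1.0.1f', 'bash-4.3'],
  'db:3.1'  => ['bash-4.3']
}
running = [
  { id: 'c01', image: 'web:1.2' },
  { id: 'c02', image: 'db:3.1' }
]

hits = containers_with_package(bom, running, 'openssl-1.0.1f')
puts hits.map { |c| c[:id] }.inspect
```

The hard part in practice is producing an accurate BOM at build time; once you have it, scoping remediation is queries like this one.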
In Docker’s early years, when people talked about ‘container’ security, they were really talking about how to secure the Linux operating system underneath Docker. Security was more about the platform and traditional OS security measures. If an attacker gained control of the host OS, they could pretty much take control of anything they wanted in containers. The problem was that security of containers, their contents, and even the Docker engine were largely overlooked. This is one reason we focused our research on the things that make containers – and the tools that build them – secure.
That said, no discussion of container security can be complete without some mention of OS security, so here we will cover the basics. But we will not go into depth on securing the underlying OS. We could not do it justice within this research; there is already a huge amount of quality documentation available for the operating system of your choice, and there are much more knowledgeable sources to address your concerns and questions on OS security.
- Kernel Hardening: Docker security depends fundamentally on the underlying operating system to limit access between ‘users’ (containers) on the system. This resource isolation model is built atop a virtual map called Namespaces, which maps specific users or groups of users to a subset of resources (networks, files, IPC, etc.) within their Namespace. Containers should run under a specified user ID. Hardening starts with a secure kernel, strips out unwanted services and features, and then configures Namespaces to limit (segregate) resource access. It is essential to select an OS platform which supports Namespaces, to constrain which kernel resources the container can access and control user/group resource utilization. Don’t mix Docker and non-Docker services – the trust models don’t align correctly. You will want to script setup and configuration of your kernel deployments to ensure consistency. Periodically review your settings as operating system security capabilities evolve.
- Docker Engine: Docker security has come a long way, and the Docker engine can now perform a lot of the “heavy lifting” for containers. Docker now has full support for Linux kernel features including Namespaces and Control Groups (cgroups) to isolate containers and container types. We recommend advanced isolation via Linux kernel features such as SELinux or AppArmor, on top of GRSEC compatible kernels. Docker exposes these Linux kernel capabilities at either the Docker daemon level or the container level, so you have some flexibility in resource allocation. But there is still work to do to properly configure your Docker deployment.
- Container Isolation: We have discussed resource isolation at the kernel level, but you should also isolate Docker engine/OS groups – and their containers – at the network layer. For container isolation we recommend mapping groups of mutually trusted containers to separate machines and/or network security groups. For containers running critical services or management tools, consider running one container per VM/physical server for on-premise applications, or grouping them into a dedicated cloud VPC to limit attack surface and minimize an attacker’s ability to pivot, should a service or container be compromised.
- Cloud Container Services: Several cloud services providers offer to tackle platform security issues on your behalf, typically abstracting away some lower-level implementation layers by offering Containers as a Service. By delegating underlying platform-level security challenges to your cloud provider, you can focus on application-layer issues and realize the benefits of containers, without worrying about platform security or scalability.
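Several of the isolation controls above map directly onto docker run flags, and scripting them (rather than typing them per container) keeps deployments consistent. A hedged Ruby sketch: the flags are real Docker options, but the specific values are illustrative defaults, not tuned recommendations for any particular workload.

```ruby
# Assemble docker run arguments applying the isolation controls
# discussed above. Values are example defaults only.
def hardened_run_args(image, user: '1000:1000', mem: '256m')
  ['docker', 'run',
   '--cap-drop=ALL',                    # drop all Linux capabilities
   '--security-opt=no-new-privileges',  # block privilege escalation
   '--read-only',                       # immutable root filesystem
   "--user=#{user}",                    # never run as root in the container
   "--memory=#{mem}",                   # cgroup memory limit
   '--pids-limit=100',                  # cap process count (fork bombs)
   image]
end

puts hardened_run_args('registry.example.com/web:1.2').join(' ')
```

Individual containers can then add back only the specific capabilities they need, which is far safer than starting from the permissive defaults.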
Platform security for containers is a huge field, and we have only scratched the surface. If you want to learn more the OS platform providers, Docker, and many third-party security providers offer best practice guidance, research papers, and blogs which discuss this in greater detail.
Note that the majority of security controls in the post are preventative – efforts to prevent what we expect an attacker to attempt. We set a secure baseline to make it difficult for attackers to compromise containers – and if they do, to limit the damage they can cause. In our next and final post in this series we will discuss monitoring, logging, and auditing events in a container system. We will focus on examining what is really going on, and discovering what we don’t know in terms of security.
Posted at Tuesday 8th November 2016 2:00 am
I have received some great feedback on my post last week on bastion accounts and networks. Mostly that I left some gaps in my explanation which legitimately confused people. Plus, I forgot to include any pretty pictures. Let’s work through things a bit more.
First, I tended to mix up bastion accounts and networks, often saying “account/networks”. This was a feeble attempt to discuss something I mostly implement in Amazon Web Services that can also apply to other providers. In Amazon an account is basically an AWS subscription. You sign up for an account, and you get access to everything in AWS. If you sign up for a second account, all that is fully segregated from every other customer in Amazon. Right now (and I think this will change in a matter of weeks) Amazon has no concept of master and sub accounts: each account is totally isolated unless you use some special cross-account features to connect parts of accounts together. For customers with multiple accounts AWS has a mechanism called consolidated billing that rolls up all your charges into a single account, but that account has no rights to affect other accounts. It pays the bills, but can’t set any rules or even see what’s going on.
It’s like having kids in college. You’re just a checkbook and an invisible texter.
If you, like Securosis, use multiple accounts, then they are totally segregated and isolated. It’s the same mechanism that prevents any random AWS customer from seeing anything in your account. This is very good segregation. There is no way for a security issue in one account to affect another, unless you deliberately open up connections between them. I love this as a security control: an account is like an isolated data center. If an attacker gets in, he or she can’t get at your other data centers. There is no cost to create a new account, and you only pay for the resources you use. So it makes a lot of sense to have different accounts for different applications and projects. Free (virtual) data centers for everyone!!!
This is especially important because of cloud metastructure – all the management components, such as web consoles and APIs, which enable you to do things like create and destroy entire class B networks with a couple API calls. If you lump everything into a single account, more administrators (and other power users) need more access, and they all have more power to disrupt more projects. This is compartmentalization and segregation of duties 101, but we have never before had viable options for breaking everything into isolated data centers. And from an operational standpoint, the more you move into DevOps and PaaS, the harder it is to have everyone running in one account (or a few) without stepping on each other.
These are the fundamentals of my blast radius post.
One problem comes up when customers need a direct connection from their traditional data center to the cloud provider. I may be all rah rah cloud awesome, but practically speaking there are many reasons you might need to connect back home. Managing this for multiple accounts is hard, but more importantly you can run into hard limits due to routing and networking issues.
That’s where a bastion account and network comes in. You designate an account for your Direct Connect, then peer any other accounts that need data center access into that account (in AWS, using cross-account VPC peering support). I have been saying “bastion account/network” because in AWS this is a dedicated account with its own dedicated VPC (virtual network) for the connection. Azure and Google use different structures, so it might be a dedicated virtual network within a larger account, but still isolated to a subscription, sub-account, or whatever mechanism they support to segregate projects. This means:
- Not all your accounts need this access, so you can focus on the ones which do.
- You can tightly lock down the network configuration and limit the number of administrators who can change it.
- Those peering connections rely on routing tables, and you can better isolate what each peered account or network can access.
- One big Direct Connect essentially “flattens” the connection into your cloud network. This means anyone in the data center can route into and attack your applications in the cloud. The bastion structure provides multiple opportunities to better restrict network access to destination accounts. It is a way to protect your cloud(s) from your data center.
- A compromise in one peered account cannot affect another account. AWS networking does not allow two accounts peered to the same account to talk to each other. So each project is better isolated and protected, even without firewall rules.
For example the administrator of a project can have full control over their account and usage of AWS services, without compromising the integrity of the connection back to the data center, which they cannot affect – they only have access to the network paths they were provided. Their project is safe, even if another project in the same organization is totally compromised.
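The isolation property described above – peering is not transitive, so two accounts peered to the bastion cannot reach each other – can be modeled in a few lines (account names are made up):

```python
# Toy model of non-transitive VPC peering: spokes peer only with the
# bastion (hub); traffic cannot hop spoke -> hub -> spoke.
peerings = {("bastion", "project-a"), ("bastion", "project-b")}

def can_reach(src, dst):
    """Reachable only over a direct peering; peering is not transitive."""
    return (src, dst) in peerings or (dst, src) in peerings

print(can_reach("project-a", "bastion"))    # True  - direct peering
print(can_reach("project-a", "project-b"))  # False - no transit through hub
```

This is exactly why a compromised project account cannot use the bastion as a springboard into its siblings.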
Hopefully this helps clear things up. Multiple accounts and peering is a powerful concept and security control. Bastion networks extend that capability to hybrid clouds. If my embed works, below you can see what it looks like (a VPC is a virtual network, and you can have multiple VPCs in a single account):
Posted at Monday 7th November 2016 8:36 pm
By Adrian Lane
This post is focused on security testing your code and container, and verifying that both conform to security and operational practices. One of the major advances over the last year or so is the introduction of security features for the software supply chain, from both Docker itself and a handful of third-party vendors. All the solutions focus on slightly different threats to container construction, with Docker providing tools to certify that containers have made it through your process, while third-party tools are focused on vetting the container contents. So Docker provides things like process controls, digital signing services to verify chain of custody, and creation of a Bill of Materials based on known trusted libraries. In contrast, third-party tools harden container inputs, analyze resource usage, perform static code analysis, analyze the composition of libraries, and check against known malware signatures; they can then perform granular policy-based container delivery based on the results. You will need a combination of both, so we will go into a bit more detail:
Container Validation and Security Testing
- Runtime User Credentials: We could go into great detail here about runtime user credentials, but will focus on the most important thing: Don’t run the container processes as root, as that gives attackers a foothold to attack other containers or the Docker engine. If you get that right you’re halfway home for IAM. We recommend using specific user accounts with restricted permissions for each class of container. We do understand that roles and permissions change over time, which requires some work to keep permission maps up to date, but this provides a failsafe when developers change runtime functions and resource usage.
- Security Unit Tests: Unit tests are a great way to run focused test cases against specific modules of code – typically created as your dev teams find security and other bugs – without needing to build the entire product every time. This can cover things such as XSS and SQLi testing of known attacks against test systems. Additionally, the body of tests grows over time, providing a regression testbed to ensure that vulnerabilities do not creep back in. During our research, we were surprised to learn that many teams run unit security tests from Jenkins. Even though most are moving to microservices, fully supported by containers, they find it easier to run these tests earlier in the cycle. We recommend unit tests somewhere in the build process to help validate the code in containers is secure.
- Code Analysis: A number of third-party products perform automated binary and white box testing, failing the build if critical issues are discovered. We recommend you implement code scans to determine if the code you build into a container is secure. Many newer tools have full RESTful API integration within the software delivery pipeline. These tests usually take a bit longer to run, but still fit within a CI/CD deployment framework.
- Composition Analysis: A useful technique is to check library and supporting code against the CVE (Common Vulnerabilities and Exposures) database to determine whether you are using vulnerable code. Docker and a number of third parties provide tools for checking common libraries against the CVE database, and they can be integrated into your build pipeline. Developers are not typically security experts, and new vulnerabilities are discovered in common tools weekly, so an independent checker to validate components of your container stack is essential.
- Resource Usage Analysis: What resources does the container use? What external systems and utilities does it depend upon? To manage the scope of what containers can access, third-party tools can monitor runtime access to environment resources both inside and outside the container. Basically, usage analysis is an automated review of resource requirements. These metrics are helpful in a number of ways – especially for firms moving from a monolithic to a microservices architecture. Stated another way, this helps developers understand what references they can remove from their code, and helps Operations narrow down roles and access privileges.
- Hardening: Over and above making sure what you use is free of known vulnerabilities, there are other tricks for securing applications before deployment. One is to check the contents of the container and remove items that are unused or unnecessary, reducing attack surface. Don’t leave hard-coded passwords, keys, or other sensitive items in the container – even though this makes things easy for you, it makes them much easier for attackers. Some firms use manual scans for this, while others leverage tools to automate scanning.
- App Signing and Chain of Custody: As mentioned earlier, automated builds include many steps and small tests, each of which validates that some action was taken to prove code or container security. You want to ensure that the entire process was followed, and that somewhere along the way some well-intentioned developer did not subvert the process by sending along untested code. Docker now provides the means to sign code segments at different phases of the development process, and tools to validate the signature chain. While the code should be checked prior to being placed into a registry or container library, the work of signing images and containers happens during build. You will need to create specific keys for each phase of the build, sign code snippets on test completion but before the code is sent onto the next step in the process, and – most importantly – keep these keys secured so an attacker cannot create their own code signature. This gives you some guarantee that the vetting process proceeded as intended.
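As a rough illustration of the composition analysis step above, the sketch below checks a container’s library manifest against a tiny, hypothetical vulnerable-versions list; a real pipeline would pull this data from the CVE database or a commercial feed:

```python
# Illustrative composition check: compare container libraries against a
# (hypothetical) feed of known-vulnerable versions. The data here is a
# stand-in; real checks query the CVE database or a vendor service.
VULNERABLE = {
    "openssl": {"1.0.1f", "1.0.1g"},
    "bash":    {"4.3"},
}

def vet_container(manifest):
    """Return the (library, version) pairs that fail the check."""
    return [(lib, ver) for lib, ver in manifest.items()
            if ver in VULNERABLE.get(lib, set())]

manifest = {"openssl": "1.0.1f", "bash": "4.4", "zlib": "1.2.8"}
failures = vet_container(manifest)
print(failures)        # [('openssl', '1.0.1f')]
print(bool(failures))  # True -> fail the build, don't ship the container
```

The useful property is the gate at the end: a non-empty failure list stops the container from reaching the registry.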
Posted at Monday 7th November 2016 5:30 pm
Mike and Rich had a call this week with another prospect who was given some pretty bad cloud advice. We spend a little time trying to figure out why we keep seeing so much bad advice out there (seriously, BIG B BAD not OOPSIE bad). Then we focus on the key things to look for to figure out when someone is leading you down the wrong path in your cloud migration.
Oh… and for those with sensitive ears, time to engage the explicit flag.
Watch or listen:
Posted at Monday 7th November 2016 2:00 pm
By Adrian Lane
As we mentioned in our last post, most people don’t seem to consider the build environment when thinking about container security, but it’s important. Traditionally, the build environment is the domain of developers, and they don’t share a lot of details with outsiders (in this case, Operations folks). But this is beginning to change with Continuous Integration (CI) or full Continuous Deployment (CD), and more automated deployment. The build environment is more likely to go straight into production. This means that operations, quality assurance, release management, and other groups find themselves having to cooperate on building automation scripts and working together more closely. Collaboration means a more complex, distributed working environment, with more stakeholders having access. DevOps is rapidly breaking down barriers between groups, even getting some security teams to contribute test scripts and configuration updates. Better controls are needed to restrict who can alter the build environment and update code, and an audit process to validate who did what.
Don’t forget why containers are so attractive to developers. First, a container simplifies building and packaging application code – abstracting the app from its physical environment – so developers can worry about the application rather than its supporting systems. Second, the container model promotes lightweight services, breaking large applications down into small pieces, easing modification and scaling – especially in cloud and virtual environments. Finally, a very practical benefit is that container startup is nearly instant, allowing agile scaling up and down in response to demand. It is important to keep these in mind when considering security controls, because any control that reduces one of these core advantages will not be considered, or is likely to be ignored.
Build environment security breaks down into two basic areas. The first is access and usage of the basic tools that form the build pipeline – including source code control, build tools, the build controller, container management facilities, and runtime access. At Securosis we often call this the “management plane”, as these interfaces – whether API or GUI – are used to set access policies, automate behaviors, and audit activity. Second is security testing of your code and the container, validating that it conforms to security and operational practices. This post will focus on the former.
Securing the Build
Here we discuss the steps to protect your code – more specifically to protect build systems, to ensure they implement the build process you intended. This is conceptually very simple, but there are many pieces to this puzzle, so implementation can get complicated.
People call this Secure Software Delivery, Software Supply Chain Management, and Build Server Security – take your pick. It is management of the assemblage of tools which oversee and implement your process. For our purposes today these terms are synonymous.
Following is a list of recommendations for securing platforms in the build environment to ensure secure container construction. We include tools from Docker and others that automate and orchestrate source code, building, the Docker engine, and the repository. For each tool you will employ a combination of identity management, roles, platform segregation, secure storage of sensitive data, network encryption, and event logging. Some of the subtleties follow.
- Source Code Control: Stash, Git, GitHub, and several variants are common. Source code control is one of the tools with a wide audience, because it is now common for Security, Operations, and Quality Assurance to all contribute code, tests, and configuration data. Distributed access means all traffic should run over SSL or VPN connections. User roles and access levels are essential for controlling who can do what, but we recommend requiring token-based or certificate-based authentication, with two-factor authentication as a minimum for all administrative access.
- Build Tools and Controllers: The vast majority of development teams we spoke with use build controllers like Bamboo and Jenkins, with these platforms becoming an essential part of their automated build processes. These provide many pre-, post-, and intra-build options, and can link to a myriad of other facilities, complicating security. We suggest full network segregation of the build controller system(s), and locking down network connections to the source code controller and Docker services. If you can, deploy build servers as on-demand containers – this ensures standardization of the build environment and consistency of new containers. We recommend you limit access to the build controllers as tightly as possible, and leverage built-in features to restrict capabilities when developers need access. We also suggest locking down configuration and control data to prevent tampering with build controller behavior. We recommend keeping any sensitive data, such as ssh keys, API access keys, database credentials, and the like, in a secure database or data repository (such as a key manager, .dmg file, or vault) and pulling credentials on demand to ensure sensitive data never sits on disk unprotected. Finally, enable logging facilities or add-ons available for the build controller, and stream output to a secure location for auditing.
- Docker: You will use Docker as a tool for pre-production as well as production, building the build environment and test environments to vet new containers. As with build controllers like Jenkins, you’ll want to limit Docker access in the build environment to specific container administrator accounts. Limit network access to accept content only from the build controller system(s) and whatever trusted repository or registry you use.
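The “pull credentials on demand” advice might look something like the sketch below, where the vault client is a stand-in (an in-memory dict) for a real key manager API, and the function names are ours:

```python
# Sketch of on-demand credential fetching for a build step. In practice
# you would call your key manager's API and hold the secret only for
# the duration of the operation - never write it to disk.
_VAULT = {"registry/push-token": "s3cr3t-token"}  # stand-in for a real vault

def with_secret(name, action):
    """Fetch a secret, use it, and drop it."""
    secret = _VAULT[name]          # real code: API call to the key manager
    try:
        return action(secret)
    finally:
        secret = None              # let the reference go out of scope

result = with_secret("registry/push-token",
                     lambda tok: f"pushed with token of length {len(tok)}")
print(result)  # pushed with token of length 12
```

The design point is that the secret exists only inside the callback; build scripts and controller configuration never contain the credential itself.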
Our next post will discuss validation of individual containers and their contents.
Posted at Sunday 6th November 2016 9:00 pm
By Mike Rothman
Now that we have gotten through 80% of the Endpoint Advanced Protection lifecycle we can focus on remediation, and then how to start getting value from these new alternatives.
Once you have detailed information from the investigation, what are the key decision points? As usual, to simplify we step back to the who, what, where, when, and how of the situation. And yes, any time we can make a difficult problem feel like being back in grade school, we do.
- Who? The first question is about organizational dynamics. In this new age, when advanced attackers seem to be the norm, who should take lead in remediation? Without delving into religion or other politics, the considerations are really time and effectiveness. Traditionally IT Operations has tools and processes for broad changes, reimaging, or network-based workarounds. But for advanced malware or highly sensitive devices, or when law enforcement is involved, you might also want a small Security team which can remediate targeted devices.
- What? This question is less relevant because you are remediating a device, right? There may be some question of whether to prevent further outbreaks at the network level by blocking certain sites, applications, users, or all of the above, but ultimately we are talking about endpoints.
- Where? One of the challenges of dealing with endpoints is that you have no idea where a device will be at any point in time. So remote remediation is critical to any Endpoint Advanced Protection lifecycle. There are times you will need to reimage a machine, and that’s not really feasible remotely. But having a number of different options for remediation depending on device location can ensure minimal disruption to impacted employees.
- When? This is one of the most challenging decisions, because there are usually reasonable points for both sides of the argument: whether to remediate devices immediately, or whether to quarantine the device and observe the adversary a bit to gain intelligence. We generally favor quick and full eradication, which requires leveraging retrospection to figure out all impacted devices (even if they aren’t currently participating in the attack) and cleaning devices as quickly as practical. But there are times which call for more measured remediation.
- How? This question is whether reimaging the device, or purging malware without reimaging, is the right approach. We favor reimaging because of the various ways attackers can remain persistent on a device. Even if you think a device has been cleaned… perhaps it really wasn’t. But with the more granular telemetry gathered by today’s endpoint investigation and forensics tools (think DVR playback), it is possible to reliably back out all the changes made, even within the OS innards. Ultimately the decision comes back to the risk posed by the device, as well as disruption to the employee. The ability to both clean and reimage is key to the remediation program.
There is a broad range of available actions, so we advocate flexibility in remediation – as in just about everything. We don’t think there is any good one-size-fits-all approach any more; each remediation needs to be planned according to risk, attacker sophistication, and the skills and resources available between Security and Operations teams. Taking all that into account, you can choose the best approach.
One of the most frustrating aspects of doing security is having to spend money on things you know don’t really work. Traditional endpoint protection suites fit into that category. Which begs the question: are Endpoint Advanced Protection products robust enough, effective enough, and broad enough to replace the EPP incumbents?
To answer this question you must consider it from two different standpoints. First, the main reason you renew your anti-malware subscription each year is for that checkbox on a compliance checklist. So get a sense of whether your assessor/auditor would give you a hard time if you come up with something that doesn’t use signatures to detect malicious activity. If they are likely to push back, maybe find a new assessor. Kidding aside, we haven’t seen much pushback lately, in light of the overwhelming evidence that Endpoint Advanced Detection/Prevention is markedly more effective at blocking current attacks. That said, it would be foolish to sign a purchase order to swap out protection on 10,000 devices without at least putting a call into your assessor and understanding whether there is precedent for them to accept a new style of agent.
You will also need to look at your advanced endpoint offering for feature parity. Existing EPP offerings have been adding features (to maintain price points) for a decade. A lot of stuff you don’t need has been added, but maybe there is some you do use. Make sure replacing your EPP won’t leave a gap you will just need to fill with another product.
Keep in mind that some EPP features are now bundled into operating systems. For example, full disk encryption is now available free as part of the operating system. In some cases you need to manage these OS-level capabilities separately, but that weighs against an expensive renewal which doesn’t effectively protect endpoints.
Finally, consider price. Pretty much every enterprise tells us they want to reduce the number of security solutions they need. And supporting multiple agents and management consoles to protect endpoints doesn’t make much sense. In your drive to consolidate, play off aggressive new EAP vendors against desperate incumbents willing to perform unnatural acts to keep business.
Endpoint protection has been a zero-sum game for a while. Pretty much every company has some kind of endpoint protection strategy. So every deal that one vendor wins is lost by at least one competitor. Vendors make it very easy to migrate to their products by providing tools and services to facilitate the transition. Of course you need to verify what’s involved in moving wholesale to a new product, but the odds are it will be reasonably straightforward.
Many new EAP tools are managed in the cloud. Typically that saves you from needing to install an onsite management server to test and deploy. This makes things much easier and facilitates migration – employees can connect to a cloud-based software installation/distribution engine, without needing to bring devices to HQ for upgrades. Some organizations still resist cloud-based management; if this sounds like you, you’ll want to check with the vendor to ensure they can support on-premise installation.
Finally, when planning the migration you need to consider which security functions should be implemented on each category of devices, as defined by the risk they pose. Earlier in this series we talked about categorizing devices into risk buckets, and implementing controls based on the risk they present. You can install or enable different EAP modules depending on the needs of the employee or device.
The vendor may well make it worth your while to license all their capabilities on all your devices. There is nothing wrong with that, if the price is right. But do not consider only purchase price – keep in mind the total cost of managing the various capabilities across all your devices. Also consider the impact on employees in terms of device performance and user experience. Not every device needs application whitelisting, for example. Or EDR, given the challenge of moving endpoint telemetry across the network.
Finally, any new EAP offering needs to play nice with existing enterprise security tools. Here are a few, with their integration points.
- Network Controls: If you detect an attack on an endpoint and isolate the C&C (Command and Control) network it’s connecting to, wouldn’t it be great to automagically block that address so other devices don’t connect to that bot network? That’s why many EAP vendors also offer network security devices, or at least partner with those players to offer an integrated experience.
- Security Monitoring/Analytics: An EAP product – especially EDR functionality – generates a bunch of telemetry which can be useful within your security monitoring environment. So the ability to send it directly to a SIEM or security analytics program helps leverage it in any analyses you perform.
- Forensics/Case Management: If you can foresee a situation where you’ll want to prosecute an attacker, you need the ability to integrate with your existing case management product. This is about protecting the chain of custody of captured data, and allowing more sophisticated forensics tools to use endpoint data to better determine what malware does to a device.
- Operations Platform: Finally, we need to highlight potential integration with an IT ops platform, especially as it relates to endpoint hygiene and asset management. An EAP product gathers much more detailed device data, which can be very useful to Operations.
Security is too complicated for any tool to stand on its own, so any EAP offering’s ability to send and receive data, to and from your other security tools, is a key selection criterion.
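For a sense of what the telemetry handoff to a SIEM might look like, here is a hypothetical sketch – the field names are illustrative, not any vendor’s schema:

```python
# Hypothetical shape of EDR telemetry forwarded to a SIEM: one JSON
# object per event, which most SIEMs can ingest over syslog or HTTPS.
import json

def to_siem_record(host, event_type, detail):
    return json.dumps({
        "source": "eap-agent",   # field names are illustrative
        "host": host,
        "event": event_type,
        "detail": detail,
    }, sort_keys=True)

record = to_siem_record("laptop-042", "process_start",
                        "powershell.exe -enc ...")
print(record)
```

Structured records like this are what make the downstream analytics and case management integrations possible.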
With that we have run through the Endpoint Advanced Protection lifecycle. At this point in time we see legitimate alternatives to the ineffective EPP products which have been holding you and your organization hostage for years. But before jumping in with both feet test the tool, plan and stage your migration, and most importantly implement a risk-based approach to protecting endpoints. There are many alternatives for protecting devices, so it’s more important than ever to match your security controls to the risk presented by the device.
Posted at Friday 4th November 2016 9:06 pm
In an earlier post I mentioned bastion accounts or virtual networks. Amazon calls these “transit VPCs” and has a good description. Before I dive into details, the key difference is that I focus on using the concept as a security control, while Amazon focuses on network connectivity and resiliency. That’s why I call these “bastion accounts/networks”.
Here is the concept and where it comes from:
- As I have written before, we recommend you use multiple accounts with a partitioned network architecture, which often results in 2-4 accounts per cloud application stack (project). This limits the ‘blast radius’ of an account compromise, and enables tighter security control on production accounts.
- The problem is that a fair number of applications deployed today still need internal connectivity. You can’t necessarily move everything up to the cloud right away, and many organizations have entirely legitimate reasons to keep some things internal. If you follow our multiple-account advice, this can greatly complicate networking and direct connections to your cloud provider.
- Additionally, if you use a direct connection with a monolithic account & network at your cloud provider, that reduces security on the cloud side. Your data center is probably the weak link – unless you are as good at security as Amazon/Google/Microsoft. But if someone compromises anything on your corporate network, they can use it to attack cloud assets.
- One answer is to create a bastion account/network. This is a dedicated cloud account, with a dedicated virtual network, for the direct connection back to your data center. You then peer the bastion network as needed with any other accounts at your cloud provider. This structure enables you to still use multiple accounts per project, with a smaller number of direct connections back to the data center.
- It even supports multiple bastion accounts, which only link to portions of your data center, so they only gain access to the necessary internal assets, thus providing better segregation. Your ability to do this depends a bit on your physical network infrastructure, though.
- You might ask how this is more secure. It provides more granular access to other accounts and networks, and enables you to restrict access back to the data center. When you configure routing you can ensure that virtual networks in one account cannot access another account. If you just use a direct connect into a monolithic account, it becomes much harder to manage and maintain those restrictions.
- It also supports more granular restrictions from your data center to your cloud accounts (some of which can be enforced at a routing level – not just firewalls), and because you don’t need everything to phone home, accounts which don’t need direct access back to the data center are never exposed.
A bastion account is like a weird-ass DMZ to better control access between your data center and cloud accounts; it enables multiple account architectures which would otherwise be impossible. You can even deploy virtual routing hardware, as per the AWS post, for more advanced configurations.
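The routing-level restriction described above can be sketched as a spoke account’s route table which sends only the data center prefix through the bastion peering connection (CIDRs and IDs are hypothetical):

```python
# Sketch of the "restrict at the routing level" idea: a spoke account's
# route table sends only the data-center prefix through the bastion
# peering connection, so nothing else is reachable over that path.
DATA_CENTER_CIDR = "10.0.0.0/16"      # hypothetical on-premise range

def spoke_routes(local_cidr, peering_id):
    return [
        {"destination": local_cidr,       "target": "local"},
        {"destination": DATA_CENTER_CIDR, "target": peering_id},
        # no 0.0.0.0/0 via the peering: the bastion is not a gateway
    ]

for route in spoke_routes("10.42.0.0/16", "pcx-bastion"):
    print(route["destination"], "->", route["target"])
```

Because there is no default route through the peering connection, a compromised spoke cannot use the bastion to reach anything beyond the data center prefix you chose to expose.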
It’s far too late on a Friday for me to throw a diagram together, but if you really want one or I didn’t explain clearly enough, let me know via Twitter or a comment and I’ll write it up next week.
Posted at Friday 4th November 2016 9:05 pm
By Adrian Lane
After a somewhat lengthy hiatus – sorry about that – I will close out this series over the next couple days.
In this post I want to discuss container threat models – specifically for Docker containers. Some of these are known threats and issues, some are purely lab exercises for proof-of-concept, and others are threat vectors which attackers have yet to exploit – likely because there is so much low-hanging fruit for them elsewhere.
So what are the primary threats to container environments?
One area that needs protection is the build environment. It’s not first on most people’s lists for container security, but it’s first on mine because it’s the easiest place to insert malicious code. Developers tend to loathe security in development as it slows them down. This is why there is an entire industry dedicated to test data management and masked data: developers tend to do an end-run around security if it slows down their build and testing process.
What kinds of threats are we talking about specifically? Things like malicious or moronic source code changes. Malicious or moronic alterations to automated build controllers. Configuration scripts with errors, or with credentials sitting around. The addition of insecure libraries or back-rev/insecure versions of existing code. We want to know if the runtime code has been scanned for vulnerabilities. And we worry about a failure to audit all the above and catch any errors.
What the hell is in the container? What does it do? Is that even the correct version of the container? These are questions I hear a lot from operations folks. They have no idea. Nor do they know what permissions the container has or requires – all too often lazy developers run everything as root, breaking operational security models and opening up the container engine and underlying OS to various attacks. And security folks are unaware of what – if any – container hardening may have been performed. You want to know the container’s contents have been patched, vetted, hardened, and registered prior to deployment.
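One of these checks is easy to automate. Here is a minimal sketch (my own illustration, not a vendor tool) that flags a Dockerfile whose final USER directive is root – or absent entirely, since Docker defaults to root:

```python
def runs_as_root(dockerfile_text):
    """Return True if the image would run as root: either the last
    USER directive names root, or no USER directive exists at all
    (Docker's default user is root)."""
    user = "root"  # the default when no USER directive is given
    for line in dockerfile_text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("USER "):
            user = stripped.split(None, 1)[1]
    return user in ("root", "0")

# No USER directive: defaults to root, so this is flagged.
assert runs_as_root('FROM alpine\nCMD ["sh"]') is True

# Explicit non-root user: passes.
assert runs_as_root('FROM alpine\nUSER app\nCMD ["sh"]') is False
```

Wiring something like this into the build pipeline gives operations an answer to "what permissions does the container require" before deployment, rather than after an incident.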
So what are the threats to worry about? We worry a container will attack or infect another container. We worry a container may quietly exfiltrate data, or just exhibit any other odd behavior. We worry containers have been running a long time, and not rotated to newer patched versions. We worry about whether the network has been properly configured to limit damage from a compromise. And we worry about attackers probing containers, looking for vulnerabilities.
Finally, the underlying platform security is a concern. We worry that a container will attack the underlying host OS or the container engine. If it succeeds it’s pretty much game over for that cluster of containers, and you may have given malicious code resources to pivot and attack other systems.
If you are in the security industry long enough, you see several patterns repeat over and over. One is how each hot new tech becomes all the rage, finds its way into your data center, and before you have a chance to fully understand how it works, someone labels it “business critical”. That’s about when security and operations teams get mandated to secure that hot new technology. It’s a natural progression – every software platform needs to focus on attaining minimum usability, scalability, and performance levels before competitors come and eat their lunch. After a certain threshold of customer adoption is reached – when enterprises really start using it – customers start asking, “Hey, how do we secure this thing?”
The good news is that Docker has reached that point in its evolutionary cycle. Security is important to Docker customers, so it has become important to Docker as well. They have now implemented a full set of IAM capabilities: identity management, authentication, authorization, and (usually) single sign-on or federation – along with encrypted communications to secure data in transit. For the rest of the features enterprises expect: configuration analysis, software assessment, monitoring, logging, encryption for data at rest, key management, development environment security, etc. – you’re looking at a mixture of Docker and third-party solution providers to fill in gaps. We also see cloud providers like Azure and AWS mapping their core security services over the container environment, providing different security models from what you might employ on-premise. This is an interesting time for container security in general… and a bit confusing, as you have a couple different ways to address any given threat. Next we will delve into how to address these threats at each stage of the pipeline, with build environment security.
Posted at Thursday 3rd November 2016 12:40 am
(1) Comments •
The following steps are very specific to AWS, but with minimal modification they will work for other cloud platforms which support multi factor authentication. And if your cloud provider doesn’t support MFA and the other features you need to follow these steps… find another provider.
- Register with a dedicated email address that follows this formula: project-environment-randomseed@example.org. Instead of project name you could use a business unit, cost code, or some other team identifier. The environment is dev/test/prod/whatever. The most important piece is the random seed added to the email address. This prevents attackers from figuring out your naming scheme and then guessing the email addresses tied to your accounts.
- Subscribe the project administrators, someone from central ops, and someone from security to receive email sent to that address.
- Establish a policy that the email account is never otherwise directly accessed or used.
- Disable any access keys (API credentials) for the root account.
- Enable MFA and set it up with a hardware token, not a soft token.
- Use a strong password stored in a password manager.
- Set the account security/recovery questions to random human-readable answers (most password managers can create these) and store the answers in your password manager.
- Write the account ID and username/email on a sticker on the MFA token and lock it in a central safe that is accessible 24/7 in case of emergency.
- Create a full-administrator user account even if you plan to use federated identity. That one can use a virtual MFA device, assuming the virtual MFA is accessible 24/7. This becomes your emergency account in case something really unusual happens, like your federated identity connection breaking down (it happens – I have a call with someone this week who got locked out this way).
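The email formula from the first step is simple enough to script. A minimal sketch (the formula components match the description above; the domain and the 8-character seed length are my assumptions):

```python
import secrets

def project_root_email(project, environment, domain="example.org"):
    """Build a root-account address of the form
    <project>-<environment>-<randomseed>@<domain>. The random seed
    keeps attackers from deriving account addresses from your
    naming scheme alone."""
    seed = secrets.token_hex(4)  # 8 hex characters of randomness
    return f"{project}-{environment}-{seed}@{domain}"

addr = project_root_email("payments", "prod")
local, _, domain = addr.partition("@")
parts = local.split("-")
assert domain == "example.org"
assert parts[0] == "payments" and parts[1] == "prod"
assert len(parts[2]) == 8  # the unguessable seed
```

Generating the address programmatically also makes it trivial to record it (and subscribe the right people) as part of the same account-provisioning run.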
After this you should never need to use your root account. Always try to use a federated identity account with admin rights first, then you can drop to your direct AWS user account with admin rights if your identity provider connection has issues. If you need the root account it’s a break-glass scenario, the worst of circumstances. You can even enforce dual authority on the root account by separating who has access to the password manager and who has access to the physical safe holding the MFA card.
Setting all this up takes less than 10 minutes once you have the process figured out. The biggest obstacle I run into is getting new email accounts provisioned. Turns out some email admins really hate creating new accounts in a timely manner. They’ll be first up against the wall when the revolution comes, so they have that going for them. Which is nice.
Posted at Wednesday 2nd November 2016 6:26 pm
(0) Comments •
Yesterday I warned against building a monolithic cloud infrastructure to move into cloud computing. It creates a large blast radius, is difficult to secure, costs more, and is far less agile than the alternative. But I, um… er… uh… didn’t really mention an alternative.
Here is how I recommend you start a move to the cloud. If you have already started down the wrong path, this is also a good way to start getting things back on track.
- Pick a starter project. Ideally something totally new, but migrating an existing project is okay, so long as you can rearchitect it into something cloud native.
- Applications that are horizontally scalable are often good fits. These are stacks without too many bottlenecks, which allow you to break up jobs and distribute them. If you have a message queue, that’s often a good sign. Data analytics jobs are also a very nice fit, especially if they rely on batch processing.
- Anything with a microservice architecture is also a decent prospect.
- Put together a cloud team for the project, and include ops and security – not just dev. This team is still accountable, but they need extra freedom to learn the cloud platform and adjust as needed. They have additional responsibility for documenting and reporting on their activities to help build a blueprint for future projects.
- Train the team. Don’t rely on outside consultants and advice – send your own people to training specific to their role and the particular cloud provider.
- Make clear that the project is there to help the organization learn, and the goal is to design something cloud native – not merely to comply with existing policies and standards. I’m not saying you should (or can) throw those away, but the team needs flexibility to re-interpret them and build a new standard for the cloud. Meet the objectives of the requirements, and don’t get hung up on existing specifics.
- For example, if you require a specific firewall product, throw that requirement out the window in favor of your cloud provider’s native capabilities. If you require AV scanning on servers, dump it in favor of immutable instances with remote access disabled.
- Don’t get hung up on being cloud provider agnostic. Learn one provider really well before you start branching out. Keep the project on your preferred starting provider, and dig in deep.
- This is also a good time to adopt DevOps practices (especially continuous integration). It is a very effective way to manage cloud infrastructure and platforms.
- Once you get that first successful project up and running, then use that team to expand knowledge to the next team and the next project.
- Let each project use its own cloud accounts (around 2-4 per project is normal). If you need connections back to the data center, then look at a bastion/transit account/virtual network and allow the project accounts to peer with the bastion account.
- Whitelist that team for direct ssh access to the cloud provider to start, or use a jump box/VPN. This reduces the hang-ups of having to route everything through private network connections.
- Use an identity broker (Ping/Okta/RSA/IBM/etc.) instead of allowing the team to create their own user accounts at the cloud provider. Starting off with federated identity avoids some problems you will otherwise hit later.
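The ssh whitelist above maps directly onto a security group rule. A hedged sketch in AWS terms: the helper builds the ingress permission in the structure boto3's `authorize_security_group_ingress` expects, as a pure function so the rule itself can be reviewed and tested without touching an account. The CIDR and description are placeholders.

```python
def ssh_ingress_rule(team_cidrs):
    """Build an EC2 ingress permission whitelisting the team's
    source CIDRs for ssh (TCP port 22)."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [
            {"CidrIp": cidr, "Description": "project team ssh"}
            for cidr in team_cidrs
        ],
    }

rule = ssh_ingress_rule(["203.0.113.0/24"])
assert rule["FromPort"] == 22 and rule["ToPort"] == 22
assert rule["IpRanges"][0]["CidrIp"] == "203.0.113.0/24"

# With boto3 (not executed here), the rule would be applied as:
# ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=[rule])
```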
And that’s it: start with a single project, staff it and train people on the platform they plan to use, build something cloud native, and then take those lessons and use them on the next one.
I have seen companies start with 1-3 of these and then grow them out, sometimes quite quickly. Often they simultaneously start building some centralized support services so everything isn’t in a single team’s silo. Learn and go native early on, at a smaller scale, rather than getting overwhelmed by starting too big. Yeah yeah, too simple, but it’s surprising how rarely I see organizations start out this way.
Posted at Tuesday 1st November 2016 6:49 pm
(0) Comments •
By Mike Rothman
As we discussed previously, despite all the cool innovation happening to effectively prevent compromises on endpoints, the fact remains that you cannot stop all attacks. That means detecting the compromise quickly and effectively, and then figuring out how far the attack has spread within your organization, continues to be critical.
The fact is, until fairly recently endpoint detection and forensics was a black art. Commercial endpoint detection tools were basically black boxes, not really providing visibility to security professionals. And the complexity of purpose-built forensics tools put this capability beyond the reach of most security practitioners. But a new generation of endpoint detection and response (EDR) tools is now available, with much better visibility and more granular telemetry, along with a streamlined user experience to facilitate investigations – regardless of analyst capabilities.
Of course it is better to have a more-skilled analyst than a less-skilled one, but given the hard truth of the security skills gap, our industry needs to provide better tools to make those less-skilled analysts more productive, faster. Now let’s dig into some key aspects of EDR.
In order to perform any kind of detection, you need telemetry from endpoints. This raises the question of how much to collect from each device, and how long to keep it. This borders on religion, but we remain firmly in the camp that more data is better than less. Some tools can provide a literal playback of activity on the endpoint, like a DVR recording of everything that happened. Others focus on log events and other metadata to understand endpoint activity.
You need to decide whether to pull data from the kernel or from user space, or both. Again, we advocate for data, and there are definite advantages to pulling data from the kernel. Of course there are downsides as well, including potential device instability from kernel interference.
Again, we recommend a risk-centric view on protecting endpoints, as discussed in our prevention post. Some devices possess very sensitive information, and you should collect as much telemetry as possible from them. Other devices present less risk to the enterprise, and may only warrant log aggregation and periodic scans.
There are also competing ideas about where to store the telemetry captured from all these endpoint devices. Some technologies are based upon aggregating the data in an on-premise repository, others perform real-time searches using peer-to-peer technology, and a new model involves sending the data to a cloud-based repository for larger-scale analysis.
Again, we don’t get religious about any specific approach. Stay focused on the problem you are trying to solve. Depending on the organization’s sensitivity, storing endpoint data in the cloud may not be politically feasible. On the other hand it might be very expensive to centralize data in a highly distributed organization. So the choice of technology comes down to the adversary’s sophistication, along with the types and locations of devices to be protected.
It’s not like threat intelligence is a new concept in the endpoint protection space. AV signatures are a form of threat intel – the industry just never calls it that. What’s different is that now threat intelligence goes far beyond just hashes of known bad files, additionally looking for behavioral patterns that indicate an exploit. Whether the patterns are called Indicators of Compromise (IoC), Indicators of Attack (IoA), or something else, endpoints can watch for them in real time to detect and identify attacks.
This new generation of threat intelligence is clearly more robust than yesterday’s signatures. But that understates the impact of threat intel on EDR. These new tools provide retrospection, which is searching the endpoint telemetry data store for newly emerging attack patterns. This allows you to see if a new attack has been seen in the recent past on your devices, before you even knew it was an attack.
The goal of detection/forensics is to shorten the window between compromise and detection. If you can search for indicators when you learn about them (regardless of when the attack happens), you may be able to find compromised devices before they start behaving badly, and presumably trigger other network-based detection tactics.
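Retrospection as described above boils down to a search over stored telemetry. A minimal sketch (the record layout and field names are hypothetical, not any EDR product's schema): when new indicators arrive, sweep the historical data for devices that already saw them.

```python
def retrospect(telemetry, new_iocs):
    """Search stored endpoint telemetry for indicators we just
    learned about, returning (device, first-seen date) for each hit
    -- compromises that predate our knowledge of the attack."""
    hits = []
    for record in telemetry:
        if record["sha256"] in new_iocs:
            hits.append((record["device"], record["seen"]))
    return hits

# Hypothetical telemetry store: file hashes observed per device.
telemetry = [
    {"device": "laptop-17", "sha256": "abc123", "seen": "2016-10-12"},
    {"device": "desk-04", "sha256": "deadbeef", "seen": "2016-10-30"},
]

# A feed delivers "deadbeef" as a new indicator; the search shows
# desk-04 was exposed before the indicator even existed.
assert retrospect(telemetry, {"deadbeef"}) == [("desk-04", "2016-10-30")]
```

Real products index far richer telemetry than file hashes, but the shape of the operation is the same: new intel in, historical matches out.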
A key aspect of selecting any kind of advanced endpoint protection product is to ensure the vendor’s research team is well staffed and capable of keeping up with the pace of emerging attacks. The more effective the security research team is, the more emerging attacks you will be able to look for before an adversary can compromise your devices. This is the true power of threat intelligence.
Once you have all of the data gathered and have enriched it with external threat intelligence, you are ready to look for patterns that may indicate compromised devices. Analytics is now a very shiny term in security circles, which we find very amusing. Early SIEM products offered analytics – you just needed to tell them what to look for. And it’s not like math is a novel concept for detecting security attacks. But security marketers are going to market, so notwithstanding the particular vernacular, more sophisticated analytics do enable more effective detection of sophisticated attacks today.
But what does that even mean? First we should probably define the term machine learning, because every company claims they use it to find zero-day attacks and all other badness with no false positives or latency. No, we don’t believe that hype. But the advance of analytical techniques, harnessed by the math ninjas known as data scientists, enables detailed analysis of every attack to find commonalities and patterns. These patterns can then be used to find malicious code or behavior in new suspicious files. Basically security research teams set up their math machines to learn about these patterns. Ergo machine learning. Meh.
The upshot is that these patterns can be leveraged for both static analysis (what the file looks like) and dynamic analysis (what the software does), making detection faster and more accurate.
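To make the static/dynamic distinction concrete, here is a deliberately toy sketch – the feature names and point weights are invented for illustration, and no real product scores files this simply – showing how signals from both what a file looks like and what it does can combine into a single verdict:

```python
# Hypothetical indicators and weights, purely illustrative.
STATIC_POINTS = {"packed": 40, "unsigned": 20}          # what the file looks like
DYNAMIC_POINTS = {"writes_autorun_key": 50, "spawns_shell": 30}  # what it does

def suspicion_score(static_features, observed_behaviors):
    """Sum the points for the static and dynamic indicators seen."""
    score = sum(STATIC_POINTS.get(f, 0) for f in static_features)
    score += sum(DYNAMIC_POINTS.get(b, 0) for b in observed_behaviors)
    return score

# Static analysis alone gives a weak signal...
assert suspicion_score(["packed"], []) == 40
# ...but combined with observed behavior the score crosses a threshold.
assert suspicion_score(["packed"], ["spawns_shell"]) == 70
```

The point of the learned patterns is essentially to derive good indicators and weights automatically from large volumes of attack data, instead of a human hand-tuning a table like this.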
Once you have detected a potentially compromised device you need to engage your response process. We have written extensively about incident response (including Using TI in Incident Response and Incident Response in the Cloud Age), so we won’t go through the details of the IR process again here. Though as we have described, advanced endpoint protection tools now provide both more granular telemetry, and a way to investigate an attack within the management console.
Additionally, these tools increasingly integrate with other response tools in use within your environment. Advanced endpoint protection products bring several capabilities to response, including:
- Attack Visualization: In many cases, being able to visualize the attack on a device is very instructive for understanding how the malware works and what it does to devices. The management consoles of some EAP products offer a visual map to follow the activity of malware on a device – including the process the attack impacted, kernel-level activity, and/or API calls. This timeline of sorts must also specify the files involved in the attack and track network connectivity.
- Understanding Outbreaks: As discussed above, a key aspect of EAP products is their ability to aggregate telemetry and search after the fact to determine if other devices have been attacked by similar malware. This provides invaluable insight into how the attack has proliferated through your environment, and identifies specific devices in need of remediation or quarantine.
- Forensics: You also need the endpoint agent to be able to gather raw telemetry from the device and provide tools to analyze the data. At times, especially when skilled forensicators are involved, they need full data to really dig into what the malware did. A key aspect of forensic analysis is the need to enforce chain of custody for collected data, especially if prosecution is an option.
- Ease of Use: EAP tools have been built for more general security practitioners, rather than only forensics ninja, so user experience has been a focus for helping less experienced professionals be more productive. This requires a much easier workflow for drilling down into attacks, and pivoting to find the root cause.
- Integration with Enterprise Tools: Another key criterion for EAP products is making sure they play nicely with tools already in use. You’ll want to be able to send data directly to a SIEM for further correlation and analysis. You’ll also want to integrate with a case management system to track investigations. Finally, think about integrations with network security controls (including firewalls and web filters) to block C&C sites and other malicious addresses discovered on endpoints, preventing other devices from contacting known-bad Internet addresses.
Finally we should acknowledge another very shiny concept in security circles: hunting. It seems every practitioner aspires to be a hunter nowadays. OK, maybe that’s a little exaggerated, but it’s a cool gig. Hunters go out and proactively look for adversary activity on networks and systems, as opposed to waiting for monitors to alert, and then investigating.
Psychologically, hunting is great for security teams because it puts the team more in control of their environment. Instead of waiting for a tool to tell you things are bad, you can go out and figure it out yourself.
But the reality is that hunting is primarily relevant to the most sophisticated and advanced security teams. It requires staff to look around, and unfortunately most organizations are not sufficiently staffed to achieve core operational goals, so there isn’t much chance they have folks sitting around, available to proactively look for bad stuff.
Keep in mind the tools used by hunters are largely the same ones useful to practitioners focused on validating attacks on endpoints. A hunter needs to be able to analyze granular telemetry from endpoints and other devices. They need to search through telemetry to find activity patterns that could be malicious. They need to forensically investigate a device when they find something suspicious. Hunters also need to retrospectively look for indicators of attack to understand which devices have been targeted. Pretty much what EDR tools do.
To be clear, we aren’t maligning hunting at all. If your organization can devote the resources to stand up a hunting function, that’s awesome. Our point is simply that the tools needed to hunt are pretty much the same tools used by responders to verify alerts.
That’s detection and response as part of an Endpoint Advanced Protection lifecycle. Our next post will wrap up with the sticky questions that need to be answered – including remediation once you find a compromised device, whether an EAP product can replace your existing AV, and how to integrate these tools with existing network and security management controls.
Posted at Tuesday 1st November 2016 1:15 pm
(0) Comments •