Building a Multi-cloud Logging Strategy: Issues and Pitfalls

As we begin our series on Multi-cloud logging, we start with reasons some traditional logging approaches don’t work. I don’t like to start with a negative tone, but we need to point out some challenges and pitfalls which often beset firms on first migration to cloud. That, and it helps frame our other recommendations later in this series. Let’s take a look at some common issues by category.

Tooling

Scale & Performance: Most log management and SIEM platforms were designed and first sold before anyone had heard of clouds, Kafka, or containers. They were architected for ‘hub-and-spoke’ deployments on flat networks, when ‘Scalability’ meant running on a bigger server. This is important because the infrastructure we now monitor is agile – designed to auto-scale up when we need processing power, and back down to reduce costs. The ability to scale up, down, and out is essential to the cloud, but often missing from older logging products which require manual setup, lacking full API enablement and auto-scale capability.
Data Sources: We mentioned in our introduction that some common network log sources are unavailable in the cloud. Contrariwise, as automation and orchestration of cloud resources are via API calls, API logs become an important source. Data formats for these new log sources may change, as do the indicators used to group events or users within logs. For example, servers in auto-scale groups may share a common IP address. But functions and other ‘serverless’ infrastructure are ephemeral, making it impossible to differentiate one instance from the next this way. So your tools need to ingest new types of logs, faster, and change their threat detection methods by source.
Identity: Understanding who did what requires understanding identity. An identity may be a person, service, or device. Regardless, the need to map it, and perhaps correlate it across sources, becomes even more important in hybrid and multi-cloud environments.
Volume: When SIEM first began making the rounds, there were only so many security tools and they were pumping out only so many logs. Between new security niches and new regulations, the array of log sources sending unprecedented amounts of logs to collect and analyze grows every year. Moving from traditional AV to EPP, for example, brings with it a huge log volume increase. Add in EDR logs and you’re really into some serious volumes. On the server side, moving from network and server logs to add application layer and container logs brings a non-trivial increase in volume. There are only so many tools designed to handle modern event rates (X billion events per day) and volumes (Y terabytes per day) without buckling under the load, and more importantly, there are only so many people who know how to deploy and operate them in production. While storage is plentiful and cheap in the cloud, you still need to get those logs to the desired storage from various on-premise and cloud sources – perhaps across IaaS, PaaS, and SaaS. If you think that’s easy, call your SaaS vendor and ask how to export all your logs from their cloud into your preferred log store (S3/ADLS/GCS/etc.). That old saw from Silicon Valley, “But does it scale?” is funny but really applies in some cases.
Bandwidth: While we’re on the topic of ridiculous volumes, let’s discuss bandwidth. Network bandwidth and transport layer security between on-premise and cloud, and between clouds, are non-trivial. There are financial costs, as well as engineering and operational considerations. If you don’t believe me, ask your AWS or Azure salesperson how to move, say, 10 terabytes a day between those two. In some cases the architecture only allows a certain amount of bandwidth for log movement and transport, so consider this when planning migrations and add-ons.
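
To put a figure like 10 terabytes a day in perspective, it helps to convert it into sustained link throughput. The sketch below is a back-of-the-envelope calculation only. It assumes decimal terabytes and perfectly even traffic over 24 hours; the function name and the compression_ratio parameter are ours, for illustration. Real log traffic is bursty, so peak requirements will be higher.

```python
SECONDS_PER_DAY = 86_400


def sustained_mbps(terabytes_per_day: float, compression_ratio: float = 1.0) -> float:
    """Average megabits per second needed to move a daily log volume.

    Assumes decimal units (1 TB = 10**12 bytes) and evenly spread traffic.
    compression_ratio is a hypothetical knob: 5.0 means the data shrinks
    to one fifth of its size before it crosses the wire.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    bits_per_day = terabytes_per_day * 1e12 * 8 / compression_ratio
    return bits_per_day / SECONDS_PER_DAY / 1e6


if __name__ == "__main__":
    print(f"uncompressed: {sustained_mbps(10):.0f} Mbps")
    print(f"5:1 compression: {sustained_mbps(10, compression_ratio=5):.0f} Mbps")
```

Uncompressed, 10 terabytes a day works out to roughly 926 Mbps around the clock, which is most of a dedicated gigabit link before any other traffic is counted.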

Structure

Multi-account Multi-cloud Architectures: Cloud security facilitates things like micro-segmentation, multi-account strategies, closing down all unnecessary network access, and even running different workloads in different cloud environments. This sort of segmentation makes it much more difficult for attackers to pivot if they gain a foothold. It also means you will need to consider which cloud native logs are available, what you need to supplement with other tooling, and how you will stitch all these sources together. Expecting to dump all your events into a syslog style service and let it percolate back on-premise is unrealistic. You need new architectures for log capture, filtering, and analysis. Storage is the easy part.
Monitoring “up the Stack”: As cloud providers manage infrastructure, and possibly applications as well, your threat detection focus must shift from networks to applications. This is both because you lack visibility into network operations and because cloud network deployments are generally more secure, prompting attackers to shift focus. Even if you’re used to monitoring the app layer from a security perspective, for example with a big WAF in front of your on-premise servers, do you know whether your vendor has a viable cloud offering? If you’re lucky enough to have one that works in both places, and you can deploy in cloud as well, answer this (before you initiate the project): Where will those logs go, and how will you get them there?
Storage vs. Ingestion: Data storage in cloud services, especially object storage, is so cheap it is practically free. And long-term data archival cloud services offer huge cost advantages over older on-premise solutions. In essence we are encouraged to store more. But while storage is cheap, it’s not always cheap to ingest more data into the cloud because some logging and analytics services charge based upon volume (gigabytes) and event rates (number of events) ingested into the tool/service/platform. Examples are Splunk, Azure Event Hubs, AWS Kinesis, and Google Stackdriver. Many log sources for the cloud are verbose – both in number of events and in amount of data generated from each. So you will need to architect your solution to be economically efficient, as well as negotiate with your vendors over ingestion of noisy sources such as DNS and proxies. A brief side note on ‘closed’ logging pipelines: Some vendors want to own your logging pipeline on top of your analytics toolset. This may sound convenient because it “just works” (mostly for their business model), but beware lock-in. It brings cost overruns, because you cannot deduplicate or filter pre-ingestion (the meter catches every event), and it brings the opportunity cost of analytical capabilities other tools could provide if only you could feed them a copy of your log stream. The fact that you can afford to move a bunch of logs from place to place doesn’t mean it’s easy. Some logging architectures are not conducive to sending logs to more than one place at a time, and once you are in their system, exporting all logs (not just alerts) to another analytical tool can be incredibly difficult and resource intensive, because events can be ‘cooked’ into their own proprietary format, which you then have to reverse during export to make sense for other analytical tools.
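
To make pre-ingestion filtering and deduplication concrete, here is a minimal sketch of a pipeline stage you control yourself: it drops noisy sources and exact duplicates before the meter sees them, then sends a copy of what remains to more than one destination. Everything here is an assumption for illustration: the event shape (a dict with a "source" key), the function names, and the sinks, which are plain callables standing in for whatever analytics service and object store you actually use.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, Iterator, List, Set

Event = Dict[str, object]


def fingerprint(event: Event) -> str:
    """Stable hash of an event, independent of key order."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def prefilter(events: Iterable[Event], drop_sources: Set[str]) -> Iterator[Event]:
    """Yield events that are neither from a dropped source nor exact duplicates.

    The 'seen' set grows without bound; a real pipeline would limit it
    to a time window or use a bounded cache.
    """
    seen: Set[str] = set()
    for event in events:
        if event.get("source") in drop_sources:
            continue
        digest = fingerprint(event)
        if digest in seen:
            continue
        seen.add(digest)
        yield event


def fan_out(events: Iterable[Event], sinks: List[Callable[[Event], None]]) -> int:
    """Send every event to every sink; return the number of events delivered."""
    delivered = 0
    for event in events:
        for sink in sinks:
            sink(event)
        delivered += 1
    return delivered


if __name__ == "__main__":
    raw = [
        {"source": "api", "action": "CreateUser", "user": "alice"},
        {"user": "alice", "action": "CreateUser", "source": "api"},  # duplicate
        {"source": "dns", "query": "example.com"},                   # noisy source
        {"source": "api", "action": "DeleteUser", "user": "bob"},
    ]
    metered_tool: List[Event] = []   # stand-in for a volume-priced service
    cheap_archive: List[Event] = []  # stand-in for object storage
    count = fan_out(prefilter(raw, {"dns"}), [metered_tool.append, cheap_archive.append])
    print(count, "events delivered to each sink")
```

In practice you would likely still archive the noisy sources cheaply and only keep them away from the metered tool; the point is that this decision is yours only if the filter runs before ingestion and the pipeline can feed more than one destination.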

Process

What to Filter and When: Compliance, regulatory, and contractual commitments prompt organizations to log everything and store it all forever (OK, not quite, but just about). And not just in production, but pre-production, development, and test systems. Combine that with overly chatty cloud logging systems (What do you plan to do with logs of every single API call into and inside your entire cloud?), and you are quickly overloaded. This results in both slower processing and higher costs. Dealing with this problem combines deciding what must be kept vs. filtered; what needs to be analyzed vs. captured for posterity; what is relevant today for security analysis and model building, vs. irrelevant tomorrow. One of the decision points you’ll want to address early is what data you consider perishable/real-time vs. historical/latency-tolerant.
Speed: For several years there has been a movement away from batch processing and toward real-time analysis (footnote: batch can be very slow [days] or very fast [micro-batching within 1-2 second windows], so we use ‘batch’ to mean anything not real-time, more like daily or less frequent). Batch mode, as well as normalization and storage prior to analysis, is becoming antiquated. The use of stream processing infrastructure, machine learning, and “stateless security” enables and even facilitates analysis of events as they are received. Changing the process to analyze in real time is needed to keep pace with attackers and fully automated attacks.
Automated Response: Many large corporations and government agencies suffered tremendously in 2017 from fast-spreading ‘ransomworms’ (also called ‘nukeware’) such as WannaCry/NotPetya. Response models tuned for stealthy low-and-slow IP and PII exfiltration attacks need revisiting. Once fast-movers execute you cannot detect your way past the grenades they leave in your datacenter. They are very obvious and very loud. The good news is that the cloud model inherently enables micro-segmentation and automated response. The cloud also doesn’t rely on ancient identity and network protocols which enable lateral movement, and continue to plague even the best on-premise security shops. Don’t forget that the cloud won’t save you from bad practices; those leave you open to even untargeted shenanigans. Remember the MongoDB massacre of January 2017? Fast response to things that look wrong is key to dropping the net on the bad guys as they try to pwn you. Knowing exactly what you have, its known-good state, and how to leverage new cloud capabilities, are all advantages the blue team needs to use.
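
Returning to the perishable/real-time vs. historical/latency-tolerant decision under “What to Filter and When”: one way to make it explicit is a small routing rule applied to each event before it goes anywhere. The sketch below is illustrative only. The event fields ("type", "environment"), the event type names, and the choice of which environments are retained are all hypothetical placeholders; your compliance commitments determine the real lists.

```python
from enum import Enum
from typing import Dict


class Route(Enum):
    REALTIME = "realtime"  # perishable: analyze as it arrives
    ARCHIVE = "archive"    # latency-tolerant: keep cheaply, analyze later
    DROP = "drop"          # filtered out before ingestion


# Hypothetical examples; replace with the event types you consider perishable.
PERISHABLE_TYPES = {"auth_failure", "iam_policy_change"}

# Hypothetical examples; your commitments may require retaining every environment.
RETAINED_ENVIRONMENTS = {"production", "pre-production"}


def route(event: Dict[str, str]) -> Route:
    """Decide where a single event goes, checking perishability first."""
    if event.get("type") in PERISHABLE_TYPES:
        return Route.REALTIME
    if event.get("environment") in RETAINED_ENVIRONMENTS:
        return Route.ARCHIVE
    return Route.DROP


if __name__ == "__main__":
    samples = [
        {"type": "auth_failure", "environment": "development"},
        {"type": "api_read", "environment": "production"},
        {"type": "api_read", "environment": "test"},
    ]
    for sample in samples:
        print(sample, "->", route(sample).value)
```

Checking perishability before environment is a deliberate choice in this sketch: a suspicious event in a test system still goes to real-time analysis even though routine test logs are dropped.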

Again, our point is not to bash older products, but to point out that cloud environments demand you re-think how you use tools and revisit your deployment models. Most can work with re-engineered deployment. We generally prefer to deploy known technologies when appropriate, which helps reduce the skills gap facing most security and IT teams. But in some cases you will find new and different ways to supplement existing logging infrastructure, and likely run multiple analysis capabilities in parallel.

Next up in our series: Multi-cloud logging architectures and design.

-Adrian & Gal