Network Security Fundamentals: Monitor Everything
By Mike Rothman
As we continue on our journey through the fundamentals of network security, the idea of network monitoring must be integral to any discussion. Why? Because we don’t know where the next attack is coming from, so we need to get better at compressing the window between successful attack and detection, which then drives remediation activities. It’s a concept I coined back at Security Incite in 2006 called React Faster, which Rich subsequently improved upon by advocating Reacting Faster and Better.
React Faster (and better)
I’ve written extensively on the concept of React Faster, so here’s a quick description I penned back in 2008 as part of an analysis of Security Management Platforms, which hits the nail on the head.
New attacks are happening at a fast and furious pace. It is a fool’s errand to spend time trying to anticipate where the issues are. REACT FASTER first acknowledges that all attacks cannot be stopped. Thus, focus remains on understanding typical traffic and application usage trends and monitoring for anomalous behavior, which could indicate an attack. By focusing on detecting attacks earlier and minimizing damage, security professionals both streamline their activities and improve their effectiveness.
Rich’s corollary made the point that it’s not enough to just react faster, but you need to have a plan for how to react:
Don’t just react – have a response plan with specific steps you don’t jump over until they’re complete. Take the most critical thing first, fix it, move to the next, and so on until you’re done. Evaluate, prioritize, contain, fix, and clean.
So monitoring done well compresses the time between compromise and detection, and also accelerates root cause analysis to determine what the response should involve.
Network Security Data Sources
It’s hard to argue with the concept of reacting faster and collecting data to facilitate that activity. But with an infinite amount of data to collect, where do we start? What do we collect? How much of it? For how long? All of these are reasonable questions that need answers as you construct your network monitoring strategy. The major data sources from your network security infrastructure include:
- Firewall: Every monitoring strategy needs to correspond to the most prevalent attack vectors, and that means from the outside in. Yes, the insider threat is real, but script kiddies are alive and well and that means we need to start by looking at our Internet-facing devices. First we pull log and activity information from our firewalls and UTM devices on the perimeter. We look for strange patterns, which usually indicate something is wrong. We want to keep this data long enough to ensure we have sufficient data in the event of a well-executed low and slow attack, which means months rather than days.
- IPS: The next layer in tends to be IPS, looking for patterns of traffic that indicate a known attack. We want the alerts first and foremost. But we also want to collect the raw IPS logs as well. Just because the IPS doesn’t think specific traffic is an attack doesn’t mean it isn’t. It could be a dreaded 0-day, so we want to pull all the data we can off this box as well, since the forensic analysis can pinpoint when attacks first surfaced and also provide guidance as to the extent of the compromise.
- Vulnerability scans: Are those devices vulnerable to a specific attack? Vulnerability scan data is one of the key inputs to SIEM/correlation products. The best way to reduce false positives is not to fire an alert if the target is not vulnerable. Thus we keep scan data on hand, and use it both for real-time analysis and also forensics. If an attack happens during a window of vulnerability (like while you debate the merits of a certain patch with the ops guys), you need to know that.
- Network Flow Data: I’ve always been a big fan of network flow analysis and continue to be mystified that the market never took off, given the usefulness of understanding how traffic flows within and out of a network. All is not lost, since a number of security management products use flow data in their analyses and a few lower end management products use flow data as well. Each flow record is small, so there is no reason not to keep a lot of it. Again, we use this data to both pinpoint potential badness, and also replay attacks to understand how they spread within the organization.
- Device Change Logs: If your network devices get compromised, it’s pretty much game over. Traffic can be redirected, logging suppressed, and lots of other badness can result. So keep track of device configuration and more importantly when those changes happen – which helps isolate the root causes of breaches. Yes, if the logs are turned off, you lose visibility, which can itself indicate an issue. Through the wonders of SNMP, you should collect data from all your routers, switches, and other pipes.
- Content security: Now we can climb the stack a bit to pull information off the content security gateways, since a lot of attacks still show up via phishing emails and malware-laden web links. Again, we aren’t trying to pull this data in necessarily to stop an attack (hopefully the anti-spam box will figure out you aren’t interested in the little blue pill), but rather to gather more information about the attack vectors and how an attack proliferates through your environment. Reacting faster is about learning all we can about what is compromised and responding in the most efficient and effective manner.
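The vulnerability scan point in the list above (don't fire an alert if the target isn't vulnerable) can be sketched in a few lines. This is a minimal illustration, not how any particular SIEM does it: the `Alert` class, the `prioritize_alerts` function, and the vulnerability identifiers are all hypothetical. One assumption worth calling out is that a host with no scan data at all is treated as potentially vulnerable, so its alerts stay actionable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    target_ip: str  # host the IPS says was attacked
    vuln_id: str    # hypothetical ID of the vulnerability the attack exploits


def prioritize_alerts(alerts, scan_results):
    """Split alerts into (actionable, suppressed) using scan data.

    scan_results maps host IP -> set of vulnerability IDs found on that host.
    Hosts missing from scan_results are unknown, so their alerts stay actionable.
    """
    actionable, suppressed = [], []
    for alert in alerts:
        known_vulns = scan_results.get(alert.target_ip)
        if known_vulns is None or alert.vuln_id in known_vulns:
            actionable.append(alert)
        else:
            # Not paged on, but still retained for later forensic analysis.
            suppressed.append(alert)
    return actionable, suppressed


scan_results = {"10.0.0.5": {"VULN-A"}, "10.0.0.9": set()}
alerts = [
    Alert("10.0.0.5", "VULN-A"),   # scanned and vulnerable
    Alert("10.0.0.9", "VULN-A"),   # scanned and not vulnerable
    Alert("10.0.0.77", "VULN-A"),  # never scanned
]
actionable, suppressed = prioritize_alerts(alerts, scan_results)
```

Suppressed alerts are kept rather than dropped, in line with the advice to hold on to raw IPS data for forensics.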
Keeping things focused and pragmatic, you’d like to gather all this data all the time across all the networks. Of course, Uncle Reality comes to visit and understandably, collection of everything everywhere isn’t an option. So how do you prioritize? The objective is to use the data you already have. Most organizations have all of the devices listed above. So all the data sources exist, and should be prioritized based on importance to the business.
Yes, you need to understand which parts of the infrastructure are most important. I’m not a fan of trying to “value” assets, but a couple of categories can be used. Like “not critical,” “critical,” and “if this goes down I’m fired.” It doesn’t have to be a lot more complicated than that. Thus, start aggregating the data for the devices and segments where availability, performance, or security failures will get you tossed. Then you go to the critical devices, and so on. Right, you never get to the stuff that isn’t some form of critical.
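To make the three-bucket idea concrete, here is a small sketch of ordering data sources for onboarding. The tier names mirror the categories above; the device names and the `collection_order` function are made up for illustration.

```python
from enum import IntEnum


class Tier(IntEnum):
    NOT_CRITICAL = 0
    CRITICAL = 1
    FIRED = 2  # "if this goes down I'm fired"


def collection_order(sources):
    """Return source names to onboard, highest tier first.

    sources maps a device or segment name -> Tier.
    Non-critical sources are left out entirely.
    """
    wanted = [(tier, name) for name, tier in sources.items()
              if tier > Tier.NOT_CRITICAL]
    # Sort by tier descending, then by name for a stable, readable order.
    wanted.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in wanted]


sources = {
    "perimeter-fw": Tier.FIRED,
    "lab-switch": Tier.NOT_CRITICAL,
    "dmz-ips": Tier.CRITICAL,
    "core-router": Tier.FIRED,
}
order = collection_order(sources)
```

Here `order` starts with the two "fired" devices, then the critical IPS, and the lab switch never shows up.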
Collection, Aggregation and Architecture
So where do we put all this data? You need some kind of aggregation platform. Most likely it’s a log management looking thing, at least initially. Those data sources listed above are basically log records and the maturity of log management platforms means you can store a lot of data pretty efficiently.
Obviously collecting data doesn’t make it useful. Thus you can’t really discuss log aggregation without a discussion of correlation and analysis of that data. But that is a sticky topic and really warrants its own discussion in the Network Security Fundamentals series. Additionally, it’s now feasible to actually buy a log aggregation service (as opposed to building it yourself), so in future research I’ll also delve into the logic of outsourcing log aggregation. For the purposes of this post, let’s assume you are building your own log collection environment. It’s also worth mentioning that depending on the size of your organization (and collection requirements) there are lower cost and open source options for logging that work well.
In terms of architecture, you want to avoid a situation where management data overwhelms the application traffic on a network. To state the obvious, the closer to the devices you can collect, the less traffic you’ll have running all over the place. So you need to look at a traditional tiered approach. That means collector(s) in each location (you don’t want to be sending raw logs across lower speed WAN links) and then a series of aggregation points depending on the nature of your analysis.
Since most monitoring data gets used for forensic purposes as well, you can leave the bulk of the data at the aggregation points and only send normalized summary data upstream for reporting and analysis. To be clear, a sensor in every nook and cranny of your network will drive up costs, so exercise care to gather only as much data as you need, within the cost constraints of your environment.
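As a rough sketch of what "send only normalized summary data upstream" might look like at a local collector, the function below collapses raw flow records into per-conversation totals. The record layout (source, destination, port, bytes) and the `summarize_flows` name are assumptions for illustration; real flow records carry more fields.

```python
from collections import defaultdict


def summarize_flows(raw_flows):
    """Collapse raw flow records into per-conversation totals.

    raw_flows: iterable of (src_ip, dst_ip, dst_port, byte_count) tuples.
    Returns a dict keyed by (src_ip, dst_ip, dst_port).
    """
    summary = defaultdict(lambda: {"flows": 0, "bytes": 0})
    for src_ip, dst_ip, dst_port, byte_count in raw_flows:
        entry = summary[(src_ip, dst_ip, dst_port)]
        entry["flows"] += 1
        entry["bytes"] += byte_count
    return dict(summary)


# Raw records stay at the collector; only the summary crosses the WAN link.
raw = [
    ("10.1.1.10", "10.2.0.5", 443, 1200),
    ("10.1.1.10", "10.2.0.5", 443, 800),
    ("10.1.1.11", "10.2.0.7", 22, 300),
]
upstream = summarize_flows(raw)
```

The raw records remain available locally for forensics, while the upstream tier receives one row per conversation.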
As you look to devices for collection, one of the criteria to consider is compression and normalization of the data. For most compliance purposes, you’ll need to keep the raw logs and flows, but can achieve 15-20:1 compression on log data, as well as normalizing where appropriate to facilitate analysis. And speaking of analysis…
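To get a feel for what compression does to storage requirements, the sketch below measures a compression ratio with Python's standard `zlib` module and plugs a ratio into a back-of-the-envelope sizing estimate. The sample log lines are synthetic and very repetitive, so the measured ratio says nothing about your own logs. The event rate, event size, and retention period in the usage note are assumed numbers, not recommendations.

```python
import zlib

SECONDS_PER_DAY = 86_400


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Return the raw-to-compressed size ratio for a blob of log data."""
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)


def estimate_storage_bytes(events_per_sec, avg_event_bytes, retention_days, ratio):
    """Estimate compressed storage needed to keep raw logs for retention_days."""
    raw_bytes = events_per_sec * avg_event_bytes * SECONDS_PER_DAY * retention_days
    return raw_bytes / ratio


# Synthetic firewall-style log lines, for illustration only.
sample = b"".join(
    b"Jan 12 10:00:%02d fw1 DENY tcp 203.0.113.%d:4431 -> 10.0.0.5:22\n"
    % (i % 60, i % 250)
    for i in range(5000)
)
measured = compression_ratio(sample)
needed = estimate_storage_bytes(1000, 200, 90, 15)
```

With the assumed 1,000 events per second at 200 bytes each, 90 days of raw logs is roughly 1.56 TB. At 15:1 that drops to about 104 GB.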
Full Packet Capture
We’ve beaten full packet capture into submission over the past few weeks. Rich just posted on the topic in more detail, but any monitoring strategy needs to factor in full network packet capture. To be clear, you don’t want to capture every packet that traverses your network. Rather, capture the network traffic coming into or leaving the really important parts of your environment.
We believe the time is right for full packet capture for most larger organizations that need to be able to piece together an attack quickly after a compromise. At this point, doing any kind of real time analysis on a full packet stream isn’t realistic (at least not in a sizable organization), but this data is critical for forensics purposes. Stay tuned – this is a topic we’ll be digging much deeper into over the rest of the year.
Point of diminishing returns
As with everything in your environment, there can be too much of a good thing. So you need to avoid the point of diminishing returns, where more data becomes progressively less useful. Each organization will have its own threshold for pain in terms of collection, but keep an eye on a couple of metrics to indicate when enough is enough:
- Speed: When your collection systems start to get bogged down, it’s time to back off. During an incident, you need searching speed and large datasets can hinder that. “Backing off” can mean a lot of different things, but mostly it means reducing the amount of time you keep the data live in the system. So you can play around with the archiving windows to find the optimal amount of data to keep.
- Accuracy: Yes, when a collection system gets overwhelmed it starts to drop data. Vendors will dance on a thimble to insist this isn’t the case, but it is. So periodically making sure all the cross-tabs in your management reports actually add up is a good idea. Better you identify these gaps than your ops teams do. If they have to tell you your data and reports are crap, you can throw your credibility out the window.
- Storage overwhelm: When the EMC rep sends you a case of fancy champagne over the holidays, you may be keeping too much data. Keep in mind that collecting lots of data requires lots of storage and despite the storage commodity curve, it still can add up. So you may want to look at archiving and then eventually discarding data outside a reasonable analysis window. If you’ve been compromised for years, no amount of stored data will save your job.
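The accuracy check above can be partly automated by comparing what each device says it sent against what the collection system actually stored. This is a minimal sketch under assumptions: the `find_gaps` function, the device names, and the 1% tolerance are all hypothetical, and it presumes you can get a per-device sent count for the same time window (from device counters, for example).

```python
def find_gaps(sent_counts, stored_counts, tolerance=0.01):
    """Return {source: loss_fraction} for sources losing more than tolerance.

    sent_counts:   source name -> events the device reports having sent.
    stored_counts: source name -> events found in the collection system.
    """
    gaps = {}
    for source, sent in sent_counts.items():
        if sent == 0:
            continue  # nothing sent, so nothing to reconcile
        stored = stored_counts.get(source, 0)
        loss = (sent - stored) / sent
        if loss > tolerance:
            gaps[source] = loss
    return gaps


sent = {"fw1": 1000, "ips1": 500, "sw1": 0}
stored = {"fw1": 995, "ips1": 400}
gaps = find_gaps(sent, stored)
```

In this example `fw1` is within tolerance (0.5% loss) while `ips1` is flagged for losing 20% of its events, which is the kind of gap you want to find before the ops team does.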
Remember, this post deals with the data you want to collect from the network, but that’s not the only stuff we should be monitoring. Not by a long shot, so we’ll be discussing collection at other layers of the computing stack: servers, databases, applications, etc. over time.
Next in the Network Security Fundamentals series I’ll tackle the “C” word of security management – correlation – which drives much of the analysis we do with all of this fancy data we collect.
Excellent stuff. Conveniently I was talking to my supervisor about prioritisation of security aspects within a network yesterday. On our side we were talking about the ability to rank intrusive threats during IDS selection & evaluation depending on the deployment of the network, but the argument holds I think.
The process you seem to be going through maps perfectly onto IDS construction as well, as you would expect: Identify (threats, architecture, resources), monitor, detect, correlate, react.