In the last post on Data Collection we introduced the complicated process of gathering data. Now we need to understand how to put it into a manageable form for analysis, reporting, and long-term storage for forensics.
Aggregation
SIEM platforms collect data from thousands of different sources because these events provide the data we need to analyze the health and security of our environment. In order to get a broad end-to-end view, we need to consolidate what we collect onto a single platform. Aggregation is the process of moving data and log files from disparate sources into a common repository. Collected data is placed into a homogeneous data store – typically a purpose-built flat file repository or relational database – where analysis, reporting, and forensics occur, and archival policies are applied.
The process of aggregation – compiling these dissimilar event feeds into a common repository – is fundamental to Log Management and most SIEM platforms. Data aggregation can be performed by sending data directly into the SIEM/LM platform (which may be deployed in multiple tiers), or an intermediary host can collect log data from the source and periodically move it into the SIEM system. Aggregation is critical because we need to manage data in a consistent fashion: security, retention, and archive policies must be systematically applied. Perhaps most importantly, having all the data on a common platform allows for event correlation and data analysis, which are key to addressing the use cases we have described.
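To make the mechanics concrete, here is a minimal sketch of an intermediary collector – our own illustration, not any vendor’s implementation – that tags events from several hypothetical sources and appends them to a single flat-file repository:

```python
# Minimal aggregation sketch: tag events from disparate sources and append
# them to one common repository. The file paths here are hypothetical.
import json
from datetime import datetime, timezone

SOURCES = {
    "firewall": "/var/log/fw/pix.log",
    "unix_auth": "/var/log/auth.log",
    "web_proxy": "/var/log/proxy/access.log",
}
REPOSITORY = "/data/siem/aggregated.jsonl"   # common flat-file store

def aggregate_once() -> int:
    """Copy each source's entries, tagged with origin and collection time, into the repository."""
    copied = 0
    with open(REPOSITORY, "a") as repo:
        for source_name, path in SOURCES.items():
            try:
                with open(path) as src:
                    for line in src:
                        record = {
                            "collected_at": datetime.now(timezone.utc).isoformat(),
                            "source": source_name,
                            "raw": line.rstrip("\n"),
                        }
                        repo.write(json.dumps(record) + "\n")
                        copied += 1
            except FileNotFoundError:
                continue   # this source is not present on the collection host
    return copied

if __name__ == "__main__":
    # A real collector would track file offsets and handle log rotation;
    # this sketch just performs one batch move into the common repository.
    print(f"aggregated {aggregate_once()} events")
```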
There are some downsides to aggregating data onto a common platform. The first is scale: analysis becomes exponentially harder as the data set grows. Centralized collection means huge data stores, greatly increasing the computational burden on the SIEM/LM platform. Technical architectures can help scale, but ultimately these systems require significant horsepower to handle an enterprise’s data. Systems that utilize central filtering and retention policies require all data to be moved and stored – typically multiple times – increasing the burden on the network.
Some systems scale using distributed processing, where filtering and analysis occur outside the central repository, typically at the distributed data collection points. This reduces the compute burden on the central server and allows processing to occur on smaller, more manageable data sets. It does require that policies, along with the code to process them, be distributed and kept current throughout the network. Distributed agent processes are a handy way to “divide and conquer”, but increase IT administration requirements. This strategy also adds a computational burden on the data collection points, degrading their performance and potentially slowing them enough to drop incoming data.
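As a rough illustration of that trade-off, the sketch below shows a collection point applying a locally cached filter policy before forwarding, so the central repository only stores what passes; the policy format and severity convention are our assumptions, not any product’s API:

```python
# Sketch of distributed filtering at a collection point: apply a locally
# distributed policy before forwarding, so the central SIEM only stores
# events that pass the filter. The policy structure is an assumption.
from typing import Iterable

FILTER_POLICY = {
    "drop_sources": {"debug", "heartbeat"},   # noise we never forward
    "severity_cutoff": 4,                     # syslog-style: forward 0 (emerg) through 4 (warning)
}

def passes_policy(event: dict) -> bool:
    """Return True if this event should be forwarded to the central repository."""
    if event.get("source") in FILTER_POLICY["drop_sources"]:
        return False
    return event.get("severity", 7) <= FILTER_POLICY["severity_cutoff"]

def forward(events: Iterable[dict]) -> list[dict]:
    """Stand-in for shipping events upstream; here we just return the kept set."""
    return [e for e in events if passes_policy(e)]

# Example: only the failed login (severity 3) survives local filtering.
sample = [
    {"source": "heartbeat", "severity": 6, "msg": "agent alive"},
    {"source": "unix_auth", "severity": 3, "msg": "failed login for root"},
]
print(forward(sample))
```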
Data Normalization
If the process of aggregation is to merge dissimilar event feeds into one common platform, normalization takes it one step further by reducing the records to just the common event attributes. As we mentioned in the data collection post, most data sources collect exactly the same base event attributes: time, user, operation, network address, and so on. Facilities like syslog not only group the common attributes, but provide a means to collect supplementary information that does not fit the basic template. Normalization is where known data attributes are fed into a generic template, and anything that doesn’t fit is simply omitted from the normalized event log. After all, to analyze we want to compare apples to apples, so we throw away the oranges for the sake of simplicity.
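Here is a minimal sketch of the idea, using made-up field mappings rather than any vendor’s schema: known attributes are copied into the common template, and anything unmapped is quietly dropped:

```python
# Normalization sketch: map source-specific field names into one common
# template and silently drop anything that does not fit. The field maps
# below are illustrative, not a real vendor schema.
COMMON_FIELDS = ("time", "user", "operation", "src_ip", "dst_ip")

FIELD_MAP = {
    "cisco_pix": {"timestamp": "time", "uname": "user", "action": "operation",
                  "source": "src_ip", "dest": "dst_ip"},
    "juniper_ns": {"start_time": "time", "login": "user", "policy_action": "operation",
                   "src": "src_ip", "dst": "dst_ip"},
}

def normalize(source: str, raw_event: dict) -> dict:
    """Return only the common attributes; everything else is omitted."""
    mapping = FIELD_MAP[source]
    normalized = {common: None for common in COMMON_FIELDS}
    for raw_key, value in raw_event.items():
        if raw_key in mapping:
            normalized[mapping[raw_key]] = value
        # unmapped keys (the "oranges") are simply not carried forward
    return normalized

print(normalize("cisco_pix",
                {"timestamp": "2010-06-14T10:02:11", "uname": "adrian",
                 "action": "Permit", "source": "10.1.10.1", "dest": "10.1.10.9",
                 "xlate_info": "built TCP"}))   # xlate_info is dropped
```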
Depending upon the SIEM or Log Management vendor, the original non-normalized records may be kept in a separate repository for forensics purposes prior to archival or deletion, or they may simply be discarded. In practice, discarding original data is a bad idea, since the full records are required for any kind of legal enforcement. Thus most products keep the raw event logs for a user-specified period prior to archival. In some cases the SIEM platform keeps a link to the original event in the normalized event log, which provides ‘drill-down’ capability to easily reference the extra information collected from the device.
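A simple way to picture that linkage – again an illustrative sketch with an assumed storage layout – is to keep the raw event in its own store and carry its key in the normalized record:

```python
# Sketch of the drill-down pattern: the raw event goes into a separate raw
# store, and the normalized record carries the key needed to pull back the
# original during an investigation. The storage layout is an assumption.
import uuid

COMMON_FIELDS = ("time", "user", "operation")

raw_store = {}         # stand-in for the raw/forensic repository
normalized_store = []  # stand-in for the normalized event log

def ingest(source: str, raw_event: dict) -> None:
    raw_id = str(uuid.uuid4())
    raw_store[raw_id] = raw_event                          # retained per archive policy
    record = {k: raw_event.get(k) for k in COMMON_FIELDS}  # simplified normalization
    record["source"] = source
    record["raw_id"] = raw_id                              # link back to the original
    normalized_store.append(record)

def drill_down(record: dict) -> dict:
    """Fetch the full original event behind a normalized record."""
    return raw_store[record["raw_id"]]

ingest("cisco_pix", {"time": "2010-06-14T10:02:11", "user": "adrian",
                     "operation": "Permit", "xlate_info": "built TCP"})
print(drill_down(normalized_store[0]))   # returns the complete raw event
```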
Normalization allows for predictable and consistent storage of all records, and indexes these records for fast searching and sorting, which is key when battling the clock during an incident investigation. Additionally, normalization allows basic and consistent reporting and analysis to be performed on every event regardless of the data source. When the attributes are consistent, event correlation and analysis – which we will discuss in our next post – are far easier.
Technically, normalization is no longer a requirement on current platforms. Normalization was a necessity in the early days of SIEM, when storage and compute power were expensive commodities and SIEM platforms used relational database management systems for back-end data management. Advances in indexing and searching unstructured data repositories now make it feasible to retain the full source data, preserving the original records and eliminating normalization overhead.
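To show why indexing makes retaining raw data practical, here is a toy inverted index over unnormalized log lines – nothing like a commercial engine, but enough to answer a simple keyword query without any schema (the log lines are illustrative):

```python
# Toy inverted index over raw, unnormalized log lines, to illustrate why
# full-text indexing makes keeping source data feasible. Not how any
# commercial search engine is actually built.
from collections import defaultdict

index = defaultdict(set)   # token -> set of line numbers
raw_lines = []

def index_line(line: str) -> None:
    line_no = len(raw_lines)
    raw_lines.append(line)
    for token in line.lower().split():
        index[token].add(line_no)

def search(*tokens: str) -> list[str]:
    """Return raw lines containing every token (a simple AND query)."""
    if not tokens:
        return []
    hits = set.intersection(*(index[t.lower()] for t in tokens))
    return [raw_lines[i] for i in sorted(hits)]

index_line("PIX %ASA-6-302013: Built outbound TCP connection for user adrian")
index_line("NetScreen policy action=Permit src=111.0.10.2 user=anton")
print(search("built", "tcp"))   # finds the firewall line without any normalization
```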
Enriching the Future
In reality, we are seeing a number of platforms doing data enrichment, adding supplemental information (like geo-location, transaction numbers, application data, etc.) to logs and events to enhance analysis and reporting. Enabled by cheap storage and Moore’s Law, and driven by ever-increasing demand to collect more information to support security and compliance efforts, we expect more platforms to increase enrichment. Data enrichment requires a highly scalable technical architecture, purpose-built for multi-factor analysis and scale, making tomorrow’s SIEM/LM platforms look very similar to current business intelligence platforms.
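As a hedged example of basic enrichment, the sketch below bolts a stubbed geo-location lookup onto normalized events at ingest time; a real deployment would query a geo-IP database or service rather than this hypothetical table:

```python
# Enrichment sketch: add supplemental context (here, a stubbed geo-location
# lookup) to each normalized event. The lookup table is hypothetical; a real
# platform would consult a geo-IP database or service.
GEO_TABLE = {
    "10.1.10.1": {"country": "US", "site": "HQ"},          # illustrative entries
    "111.0.10.2": {"country": "JP", "site": "branch-02"},
}

def enrich(event: dict) -> dict:
    enriched = dict(event)
    geo = GEO_TABLE.get(event.get("src_ip", ""), {})
    enriched["src_country"] = geo.get("country")
    enriched["src_site"] = geo.get("site")
    return enriched

print(enrich({"time": "2010-06-14T10:02:11", "user": "adrian",
              "operation": "Permit", "src_ip": "10.1.10.1"}))
```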
But that just scratches the surface in terms of enrichment, because data from the analysis can also be added to the records. Examples include identity matching across multiple services or devices, behavioral detection, transaction IDs, and even rudimentary content analysis. It is somewhat like having the system take notes and extrapolate additional meaning from the raw data, making the original record more complete and useful. This is a new concept for SIEM, so what enrichment will ultimately encompass is anyone’s guess. But as the core functions of SIEM have standardized, we expect vendors to introduce new ways to derive additional value from the sea of data they collect.
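For instance, identity matching might look something like the following sketch, where accounts seen on different devices resolve to a single identity; the mapping table is a hypothetical directory extract, not a real product feature:

```python
# Sketch of analysis-derived enrichment: resolve the different account names
# seen across devices to a single identity and write it back onto the event.
# The identity map below is a hypothetical directory extract.
IDENTITY_MAP = {
    "adrian": "employee-0042", "alane": "employee-0042",
    "anton": "employee-0107",
}

def add_identity(event: dict) -> dict:
    enriched = dict(event)
    account = enriched.get("user", "").lower()
    enriched["identity"] = IDENTITY_MAP.get(account, "unmatched")
    return enriched

# "alane" on a Windows host and "adrian" on a firewall resolve to one identity.
print(add_identity({"user": "alane", "operation": "login failure"}))
print(add_identity({"user": "adrian", "operation": "Permit"}))
```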
Reader interactions
5 Replies to “Understanding and Selecting SIEM/LM: Aggregation, Normalization, and Enrichment”
@adrian
>I can still run the report and get the same data
Well, you are correct here, you will get the same data: but you might be getting this data from a $3,000 product (text search) or from a $300,000 product (full normalization, categorization, enrichment, etc)
>The performance advantage when omitting normalization
Phrasing it like this would be a bit deceptive – this is a bit like saying “a cost advantage of walking to NY from CA vs flying.” It misses that you win on cost and lose on – possibly huge – value. Also, there are SIEMs that normalize and still perform faster than [many] indexing tools – NitroSecurity, for example.
Thanks for paying attention to these details!
@Anton – I see your point on this. I contend that I can still run the report and get the same data, but I lose the ability to produce clean, easy to read reports without some form of normalization support. My goal was to communicate three things:
1. The motivation and evolution of normalization
2. The performance advantage when omitting normalization
3. That some tools do not require you to normalize, typically when data is stored as flat files.
But I failed to point out what you lose when you omit normalization, and that is significant. I will reword this section in our final report to include your comments, and discuss this more as an option or trade-off.
Thanks for the comment!
-Adrian
Eh…not really.
So:
Aggregate = collect in one big pile.
Normalize = convert to common schema.
If you aggregate, you can run a search for “‘built TCP’ and ‘action=Permit’” in order to see both Cisco and Juniper logs. With more advanced indexing, you can see some fields from both log types returned in your query.
However, if you’d like to see a uniform report such as:
device | user   | src        | dst | whatever
=============================================
PIX    | adrian | 10.1.10.1  | ....
---------------------------------------------
NS     | anton  | 111.0.10.2 | ....
=============================================
etc.
you do need to normalize, i.e. extract the relevant pieces of all firewall log types into a single schema.
And, yes, you can cheat and extract it AFTER you search, but that’d still be “in-memory normalization”.
@Anton – And I thought you had forgotten about us!
I want to make sure that I have not created confusion by using terms incorrectly. I see your argument if we are talking about aggregation, but why must I normalize as well? I do not believe a product like Splunk, for example, normalizes incoming data, yet I can run reports across Juniper and Cisco. Centralized data aggregation should be enough.
-Adrian
I don’t hate you guys, but …
>Technically normalization is no longer a requirement on
>current platforms.
… is completely delusional. Without normalization, you cannot even run a report across 2 different firewalls, say Cisco and Juniper. Or report on login failures across Unix and Windows.
Normalization is where a large percentage of SIEM value lies and no amount of “smart” indexing will change it.