Understanding and Selecting SIEM/LM: Data Management

We covered SIEM and Log Management deployment architectures in depth to underscore how different models are used to deal with scalability and data management issues. In some cases these deployment choices are driven by the underlying data handling mechanism within the product. In other words, each platform stores and manages data differently – and these design decisions have a significant impact on product scalability, data management, and reporting & forensics capabilities. Here we discuss the different internal data storage models, with the advantages and disadvantages of each.

Relational Database

In the early days of this technology, most SIEM and log management systems were built on relational database engines to store events and log records. In this model the SIEM platform maps data attributes from each data source into database columns, so each event is stored in a single database row. There are numerous advantages to this model, including:

Data Validation – As data is inserted into each column, the database verifies data type and range settings. Records that fail integrity checks indicate corrupted files and are omitted from the import, with notification to administrators.
Event Consistency – An event from a Cisco router now looks just like an event from a Juniper router, and vice-versa, as events are normalized before being stored in the table.
Reporting – Reports are easier to generate from validated data columns, and the database can format data when generating the report. Reports also run far faster because column indices filter and order events efficiently.
Analytics – An RDBMS facilitates complex queries across all available attributes and inspected content, and supports correlation.
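
The advantages above can be sketched with a small example. This is a minimal illustration using Python's built-in sqlite3 module, not how any particular SIEM product works; the table layout, column names, constraints, and sample events are all assumptions made up for the sketch. It shows one event per row, type and range checks applied at insertion, and an indexed query driving a simple report.

```python
import sqlite3

# Hypothetical schema: column names and constraints are illustrative only.
SCHEMA = """
CREATE TABLE events (
    id        INTEGER PRIMARY KEY,
    ts        INTEGER NOT NULL CHECK (ts > 0),
    source    TEXT    NOT NULL,
    severity  INTEGER NOT NULL CHECK (severity BETWEEN 0 AND 7),
    action    TEXT    NOT NULL
);
CREATE INDEX idx_events_severity_ts ON events (severity, ts);
"""


def load_events(conn, events):
    """Insert normalized events; return the ones the database rejected."""
    rejected = []
    for event in events:
        try:
            conn.execute(
                "INSERT INTO events (ts, source, severity, action) "
                "VALUES (:ts, :source, :severity, :action)",
                event,
            )
        except sqlite3.IntegrityError:
            # A CHECK or NOT NULL constraint failed: omit the record and
            # keep it aside so an administrator could be notified.
            rejected.append(event)
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up events, already normalized so both router vendors look alike.
sample = [
    {"ts": 1700000000, "source": "cisco-rtr-1", "severity": 3, "action": "deny"},
    {"ts": 1700000005, "source": "juniper-rtr-2", "severity": 3, "action": "deny"},
    {"ts": 1700000009, "source": "cisco-rtr-1", "severity": 42, "action": "deny"},  # out of range
]
rejected = load_events(conn, sample)

# A simple report: event counts per source at severity 3 or lower.
report = conn.execute(
    "SELECT source, COUNT(*) FROM events WHERE severity <= 3 "
    "GROUP BY source ORDER BY source"
).fetchall()
```

Every insert pays for the constraint checks and the index update, which is the overhead discussed next.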

This model for data storage has fallen out of favor due to the overhead of data insertion: as each row is inserted the database must perform the checks and periodically rebuild indices. As daily event volumes scaled from millions to hundreds of millions and billions, this overhead became problematic and resulted in significant scalability issues with SIEM offerings built on RDBMS.

Further, data that does not fit into the tables defined in the relational model is typically left out. Unless there is some other method to maintain the fidelity and integrity of the original event records, this is problematic for forensics. This “selective memory” can also result in data accuracy issues, as truncated records may not correlate properly and can hamper analysis.
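
To make the "selective memory" problem concrete, here is a small sketch. The column list and the sample event are hypothetical. It shows that mapping an event onto a fixed set of columns silently drops every attribute without a column, and that keeping the original record alongside the row is one possible way to preserve fidelity.

```python
import json

# Hypothetical fixed schema, standing in for the columns of a relational table.
COLUMNS = ("ts", "source", "severity", "action")


def to_row(event):
    """Keep only the attributes that have a column; everything else is lost."""
    return {name: event.get(name) for name in COLUMNS}


def to_row_with_raw(event):
    """Same mapping, but also keep the original record for forensics."""
    row = to_row(event)
    row["raw"] = json.dumps(event, sort_keys=True)
    return row


# Made-up event with two attributes the schema has no column for.
event = {
    "ts": 1700000000,
    "source": "fw-edge-1",
    "severity": 4,
    "action": "deny",
    "nat_src": "10.0.0.5",
    "vendor_msg_id": "A-1001",
}

lossy = to_row(event)
dropped = sorted(set(event) - set(lossy))              # attributes that were left out
recovered = json.loads(to_row_with_raw(event)["raw"])  # full original event
```

Storing the raw record costs extra space, but it keeps the original event available when the normalized columns are not enough.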

As a result SIEM/LM architectures based on RDBMS are waning, as products in this space re-architect their backend data stores to address these issues. On the other hand, RDBMS storage is not totally dead – some vendors have instead chosen to streamline data insertion, basically by turning off some RDBMS checks and integrity verification. Others use an RDBMS to supplement a flat file architecture (described below), leveraging the advantages above for reporting and forensics.

Flat File

Flat files, or just ‘files’, are now the most common way to store events for SIEM and Log Management. Files serve as a blank canvas for the vendor, who can introduce any structure they choose to help define, format, and delineate events. Anything that helps with correlation and speeds up future searches is included, and each vendor has its own secret sauce for building files. Each file typically contains a day’s events, possibly from a single source, with each event clearly delineated. The files (in some cases each event) can be tagged with additional information – this is called “log enrichment”. These tags offer some of the contextual benefits of a relational database, and help to define attributes. Some even include a control structure similar to VSAM files. The events may be stored in their raw form, or be normalized prior to insertion. Flat files offer several advantages:

Performance – Since normalization (to the degree necessary) happens before data insertion, there is very little work to be performed at insertion time compared to a relational database. Data is stored as quickly as the physical media can handle, and is often available immediately for searching and analysis.
Flexibility – Stored events are not limited to specific normalized columns as they are in a relational database, but can take any form. Changes to internal file formats are much easier.
Search – Searches can be performed without understanding the underlying structures, using simple keyword search. At least one log management vendor provides a Google-style search capability across data files. Alternately, search can rely upon tags and keywords established by the vendor.
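
As a rough sketch of the flat file model described above, the following stores each event as one line of text, prefixed with enrichment tags, and supports both a plain keyword search and a tag-based search. The on-disk layout (JSON tags, a tab, then the raw event), the function names, and the sample log lines are assumptions for illustration only; real products use their own proprietary formats.

```python
import json
import os
import tempfile


def append_event(path, raw_line, tags):
    """Append one event as a single line: enrichment tags, a tab, the raw text."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(tags, sort_keys=True) + "\t" + raw_line.rstrip("\n") + "\n")


def keyword_search(path, keyword):
    """Case-insensitive scan of every line; needs no knowledge of the layout."""
    needle = keyword.lower()
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if needle in line.lower()]


def tag_search(path, key, value):
    """Return raw events whose enrichment tags contain key == value."""
    results = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tag_text, _, raw = line.rstrip("\n").partition("\t")
            if json.loads(tag_text).get(key) == value:
                results.append(raw)
    return results


# Made-up log lines and tags, written to a temporary file.
log_path = os.path.join(tempfile.mkdtemp(), "events.log")
append_event(
    log_path,
    "Jan  1 00:00:01 rtr1 DENY tcp 203.0.113.9 -> 10.0.0.5:22",
    {"vendor": "cisco", "site": "hq"},
)
append_event(
    log_path,
    "Jan  1 00:00:02 rtr2 permit udp 198.51.100.7 -> 10.0.0.9:53",
    {"vendor": "juniper", "site": "branch"},
)

deny_hits = keyword_search(log_path, "deny")
hq_events = tag_search(log_path, "site", "hq")
```

Insertion here is a single append with no schema checks, which is where the performance advantage comes from. Both searches scan the whole file, which is why vendors typically add their own indexing on top.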

The flat file tradeoffs are twofold. First, the underlying platform provides no RDBMS capabilities, so the SIEM/LM vendor must build any needed facilities for data integrity, normalization, filtering, and indexing from scratch. Second, there is an efficiency tradeoff. Some vendors tag, index, and normalize prior to insertion; others initially record raw events, then later re-read the data to normalize it and rewrite it in the new format. The latter method offers faster insertion, at the expense of greater total storage and processing requirements.
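
To illustrate what "built from scratch" can mean for indexing, here is a minimal sketch of a keyword index over a flat file: it records the byte offset of every line containing each token, so a later lookup can seek directly to matching events rather than scan the file. The tokenizer, function names, and sample data are hypothetical, and a production index would also need persistence, updates, and integrity protection.

```python
import os
import re
import tempfile
from collections import defaultdict

# Hypothetical tokenizer: runs of letters, digits, and a few punctuation marks.
TOKEN = re.compile(r"[A-Za-z0-9_.:-]+")


def build_index(path):
    """Map each lowercase token to the byte offsets of the lines containing it."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            for token in set(TOKEN.findall(line.decode("utf-8").lower())):
                index[token].append(offset)
    return index


def lookup(path, index, token):
    """Seek straight to matching events instead of scanning the whole file."""
    hits = []
    with open(path, "rb") as f:
        for offset in index.get(token.lower(), []):
            f.seek(offset)
            hits.append(f.readline().decode("utf-8").rstrip("\n"))
    return hits


# Made-up events written to a temporary flat file.
events = [
    "1700000000 rtr1 deny tcp 203.0.113.9",
    "1700000004 rtr2 permit udp 198.51.100.7",
    "1700000009 rtr1 deny udp 203.0.113.9",
]
path = os.path.join(tempfile.mkdtemp(), "events.log")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(events) + "\n")

index = build_index(path)
deny_events = lookup(path, index, "DENY")
rtr2_events = lookup(path, index, "rtr2")
```

Building the index at insertion time slows ingestion, while building it in a later pass means re-reading the data, which mirrors the efficiency tradeoff described above.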

The good news is that a few years ago most vendors saw the scalability wall of RDBMS approaching, and began investing in their own back-end data management environments. At this point many platforms feature purpose-built high-performance data stores, and we believe this will be the underlying architecture for these products moving forward.

Of course, we don’t live in an either/or world, so many of the platforms combine some RDBMS capabilities with flat file aspects. Yes, the answer can be ‘both’.
