We covered SIEM and Log Management deployment architectures in depth to underscore how different models are used to deal with scalability and data management issues. In some cases these deployment choices are driven by the underlying data handling mechanism within the product. In other words, each platform stores and manages data differently, and these decisions have a significant impact on product scalability, data management, and reporting and forensics capabilities. Here we discuss the different internal data storage models, with the advantages and disadvantages of each.
Relational Database
In the early days of this technology, most SIEM and log management systems were built on relational database engines to store events and log records. In this model the SIEM platform maps data attributes from each data source into database columns, so each event is stored in a single database row. There are numerous advantages to this model, including:
- Data Validation – As data is inserted into a column, the database verifies data type and range settings. Integrity check failures indicate corrupted files; the offending records are omitted from the import and administrators are notified.
- Event Consistency – An event from a Cisco router now looks just like an event from a Juniper router, and vice-versa, as events are normalized before being stored in the table.
- Reporting – Reports are easier to generate from validated data columns, and the database can format data when generating the report. Column indices make reports run far faster by efficiently filtering and ordering events.
- Analytics – An RDBMS facilitates complex queries across all available attributes and inspected content, and supports correlation.
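To make the relational model concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and severity range are assumptions chosen for illustration, not any vendor's schema. It shows normalized events from two router vendors landing in the same columns, a range check rejecting a corrupt record, and an indexed report query.

```python
import sqlite3

# Hypothetical schema: column names and the 0-7 severity range are assumptions.
SCHEMA = """
CREATE TABLE events (
    id        INTEGER PRIMARY KEY,
    ts        INTEGER NOT NULL,
    vendor    TEXT    NOT NULL,
    src_ip    TEXT    NOT NULL,
    severity  INTEGER NOT NULL CHECK (severity BETWEEN 0 AND 7),
    message   TEXT    NOT NULL
);
CREATE INDEX idx_events_severity_ts ON events (severity, ts);
"""

def load_events(conn, rows):
    """Insert normalized rows; return the rows the database rejected."""
    rejected = []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO events (ts, vendor, src_ip, severity, message) "
                "VALUES (?, ?, ?, ?, ?)", row)
        except sqlite3.IntegrityError:
            # A failed type or range check: skip the record and report it.
            rejected.append(row)
    conn.commit()
    return rejected

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

rows = [
    (1700000000, "cisco", "10.0.0.1", 3, "link down"),
    (1700000005, "juniper", "10.0.0.2", 3, "link down"),
    (1700000009, "cisco", "10.0.0.1", 42, "corrupt severity"),  # fails CHECK
]
rejected = load_events(conn, rows)

# A simple report: events at severity 3 or lower, counted per vendor.
report = conn.execute(
    "SELECT vendor, COUNT(*) FROM events WHERE severity <= 3 "
    "GROUP BY vendor ORDER BY vendor").fetchall()
```

Every insert pays for the constraint check and the index update, which is the per-row overhead that becomes a problem at very high event volumes.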
This model for data storage has fallen out of favor due to the overhead of data insertion: as each row is inserted the database must perform the checks and periodically rebuild indices. As daily event volumes scaled from millions to hundreds of millions and billions, this overhead became problematic and resulted in significant scalability issues with SIEM offerings built on RDBMS.
Further, data that does not fit into the tables defined in the relational model is typically left out. Unless there is some other method to maintain the fidelity and integrity of the original event records, this is problematic for forensics. This “selective memory” can also result in data accuracy issues, as truncated records may not correlate properly and can hamper analysis.
As a result SIEM/LM architectures based on RDBMS are waning, as products in this space re-architect their backend data stores to address these issues. On the other hand, RDBMS storage is not totally dead – some vendors have instead chosen to streamline data insertion, basically by turning off some RDBMS checks and integrity verification. Others use an RDBMS to supplement a flat file architecture (described below), leveraging the advantages above for reporting and forensics.
Flat File
Flat files, or just ‘files’, are now the most common way to store events for SIEM and Log Management. Files serve as a blank canvas for the vendor, as they can introduce any structure they choose to help define, format, and delineate events. Anything that helps with correlation and speeds up future searches is included, and each vendor has their own secret sauce for building files. Each file typically contains a day’s events, possibly from a single source, with each event clearly delineated. The files (in some cases each event) can be tagged with additional information – this is called “log enrichment”. These tags offer some of the contextual benefits of a relational database, and help to define attributes. Some even include a control structure similar to VSAM files. The events may be stored in their raw form, or be normalized prior to insertion. Flat files offer several advantages:
- Performance – Since normalization (to the degree necessary) happens before data insertion, there is very little work to be performed prior to insertion compared to a relational database. Data is stored as quickly as the physical media can handle, and is often available immediately for searching and analysis.
- Flexibility – Stored events are not limited to specific normalized columns as they are in a relational database, but can take any form. Changes to internal file formats are much easier.
- Search – Searches can be performed without understanding the underlying structures, using simple keyword search. At least one log management vendor provides a Google-style search capability across data files. Alternatively, search can rely upon tags and keywords established by the vendor.
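The flat file approach can be sketched in a few lines of Python. This is an illustration only: the one-JSON-record-per-line layout, the tag names, and the file name are assumptions, and real products use their own proprietary formats. It shows raw events appended with enrichment tags, then found either by plain keyword or by tag.

```python
import json
import os
import tempfile

def append_event(path, raw, **tags):
    """Append one raw event plus its enrichment tags as a single line."""
    record = {"tags": tags, "raw": raw}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def search(path, keyword=None, **tags):
    """Scan the file; match on a raw keyword and/or exact tag values."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if keyword is not None and keyword.lower() not in record["raw"].lower():
                continue
            if any(record["tags"].get(k) != v for k, v in tags.items()):
                continue
            hits.append(record["raw"])
    return hits

# One file per day of events (hypothetical naming scheme).
log_dir = tempfile.mkdtemp()
day_file = os.path.join(log_dir, "events-2024-01-01.log")

append_event(day_file, "%LINK-3-UPDOWN: Interface Gi0/1, changed state to down",
             source="router-a", kind="link")
append_event(day_file, "sshd[411]: Failed password for root from 10.0.0.9",
             source="host-b", kind="auth")

keyword_hits = search(day_file, keyword="failed password")  # no schema knowledge needed
tag_hits = search(day_file, kind="link")                    # relies on enrichment tags
```

The write path does almost no work, but the search is a full scan. Anything faster, such as an index, has to be built by the vendor.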
The flat file tradeoffs are twofold. First, any data management capabilities, such as indexing and data integrity, must be built from scratch by the vendor, since the underlying platform provides no RDBMS capabilities. This means the SIEM/LM vendor must provide any needed facilities for data integrity, normalization, filtering, and indexing. Second, there is an efficiency tradeoff. Some vendors tag, index, and normalize prior to insertion; others initially record raw events, later re-read the data to normalize it, and then rewrite the reformatted data. The latter method offers faster insertion, at the expense of greater total storage and processing requirements.
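The efficiency tradeoff between the two ingestion strategies can be illustrated with a toy Python sketch. The "timestamp host message" event format and the function names are made up for this example, and in-memory buffers stand in for files on disk.

```python
import io
import json

def normalize(raw):
    """Toy normalizer for an assumed 'ts host message' line format."""
    ts, host, message = raw.split(" ", 2)
    return {"ts": int(ts), "host": host, "msg": message}

def ingest_normalize_first(events, store):
    """Normalize each event on the insertion path, then write once."""
    for raw in events:
        store.write(json.dumps(normalize(raw)) + "\n")

def ingest_raw_first(events, raw_store, normalized_store):
    """Write raw events immediately; normalize in a later second pass."""
    for raw in events:
        raw_store.write(raw + "\n")      # fast path: no parsing before the write
    raw_store.seek(0)                    # later: re-read everything just written
    for line in raw_store:
        normalized_store.write(json.dumps(normalize(line.rstrip("\n"))) + "\n")

events = [
    "1700000000 fw1 deny tcp 10.0.0.9:4431",
    "1700000002 fw1 permit udp 10.0.0.7:53",
]

a = io.StringIO()
ingest_normalize_first(events, a)

raw, norm = io.StringIO(), io.StringIO()
ingest_raw_first(events, raw, norm)

# Both strategies end with the same normalized data, but the raw-first
# strategy also holds the raw copy and reads every event a second time.
size_normalize_first = len(a.getvalue())
size_raw_first = len(raw.getvalue()) + len(norm.getvalue())
```

In this sketch the raw-first path defers all parsing, which is what makes insertion faster, while its total stored size is the raw data plus the rewritten normalized data.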
The good news is that a few years ago most vendors saw the scalability wall of RDBMS approaching, and began investing in their own back-end data management environments. At this point many platforms feature purpose-built high-performance data stores, and we believe this will be the underlying architecture for these products moving forward.
Of course, we don’t live in an either/or world, so many of the platforms combine some RDBMS capabilities with flat file aspects. Yes, the answer can be ‘both’.
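One way to picture the ‘both’ answer is a flat file that holds the raw events while a small relational index holds selected normalized fields and points back into the file. The Python sketch below is an assumption about how such a hybrid might be wired together, not a description of any specific product. The table and column names are hypothetical.

```python
import os
import sqlite3
import tempfile

def store_event(log, index, raw, host, severity):
    """Append the raw event to the flat file and index selected fields."""
    log.seek(0, os.SEEK_END)
    offset = log.tell()
    data = raw.encode("utf-8")
    log.write(data + b"\n")
    index.execute(
        "INSERT INTO event_index (host, severity, byte_offset, byte_length) "
        "VALUES (?, ?, ?, ?)",
        (host, severity, offset, len(data)))

def fetch_raw(log, index, host):
    """Find events through the relational index, then read the raw bytes."""
    rows = index.execute(
        "SELECT byte_offset, byte_length FROM event_index "
        "WHERE host = ? ORDER BY byte_offset", (host,)).fetchall()
    results = []
    for offset, length in rows:
        log.seek(offset)
        results.append(log.read(length).decode("utf-8"))
    return results

index = sqlite3.connect(":memory:")
index.execute(
    "CREATE TABLE event_index ("
    "host TEXT, severity INTEGER, byte_offset INTEGER, byte_length INTEGER)")
index.execute("CREATE INDEX idx_event_host ON event_index (host)")

log = tempfile.TemporaryFile()  # binary read/write file standing in for a day file
store_event(log, index, "fw1 deny tcp 10.0.0.9", "fw1", 4)
store_event(log, index, "web2 GET /login 500", "web2", 3)
store_event(log, index, "fw1 permit udp 10.0.0.7", "fw1", 6)

fw1_events = fetch_raw(log, index, "fw1")
```

The raw record stays intact for forensics, and only the fields worth querying pay the relational insertion cost.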
2 Replies to “Understanding and Selecting SIEM/LM: Data Management”
A clear explanation of the complex topic of SIEM data management
As always, excellent analysis Adrian. Visionary vendors designed their SIEM/LMs with purpose-built data storage from the outset, while still leveraging general purpose databases for configuration and asset records. A well-designed, proprietary storage technology can support not only high insertion rates and fast queries, but offers a hybrid approach to storage by normalizing select fields in indexed columns, while at the same time storing raw events and flows.
Another consideration is scalability, both vertically and horizontally. Storage at the collection point and intelligent search capability–spreading queries across the distributed database–provides practically unlimited growth. Need higher event or flow collection rates? Did a new office in Burkina Faso just open? Add another node and you get incremental collection and processing capability, and local storage to avoid backhauling data until it’s needed.
In addition, when you build your own data storage, you can customize it for special needs like HA without the need for complex and costly bolt-on software and hardware. Extend the model to the entire architecture and you have a truly open model to accommodate what we’re seeing more and more: the demand for customers and partners to integrate into an ecosystem like Q1 Labs’ Security Intelligence Operating System.