Some of our first customer conversations about big data and SIEM centered on how to integrate the two platforms. Several customers wanted to know how they could pull data from different existing log management and analytics systems into a big data platform. Most had been told by their vendors that big data was coming, and they wanted to know what that integration would look like and how it would affect operations. You likely won’t be doing the integration yourself, but you will need to live with your vendor’s design choices, and the benefit you see depends on their implementation choices.

There are three basic models for integration of big data with SIEM:

Log Management Container

Log Storage Cluster

Some vendors have chosen to integrate big data while keeping their basic architecture: a semi-relational or flat file system supports SIEM functions and fronts a big data cluster, which handles log management tasks. We say ‘semi-relational’ because it is typically a relational platform such as Oracle or SQL Server, but stripped of many relational constructs to improve data insertion rates. SIEM’s event processing and near-real-time alerting remain unchanged: event streams are processed as they arrive, and a specific subset of events and profile information is stored within a relational – or proprietary flat file – database. Data stored within the big data cluster may be correlated, but normalization and enrichment are performed only at the SIEM layer. Raw events are streamed to the big data cluster for long-term storage, possibly compressed. SIEM functions may be supported by queries that reference specific data points within the big data archive, but that support is limited. In essence, big data is used to scale event storage and accommodate events regardless of type or format.
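The data flow in this model can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name, the `alert_types` parameter, and the two in-memory lists standing in for the SIEM database and the big data cluster are all assumptions made for the example.

```python
import gzip
import json
from dataclasses import dataclass, field


@dataclass
class LogManagementContainer:
    """Toy model: a SIEM store in front, a big data archive behind (both hypothetical)."""
    alert_types: set
    siem_store: list = field(default_factory=list)  # stands in for the semi-relational database
    archive: list = field(default_factory=list)     # stands in for the big data cluster

    def ingest(self, raw_event: dict) -> bool:
        """Process one event as it arrives; return True if it matched an alert type."""
        # Every raw event is archived, compressed, regardless of type or format.
        self.archive.append(gzip.compress(json.dumps(raw_event).encode("utf-8")))
        # Only a subset is normalized and kept at the SIEM layer.
        if raw_event.get("type") in self.alert_types:
            normalized = {
                "type": raw_event["type"],
                "user": str(raw_event.get("user", "unknown")).lower(),
            }
            self.siem_store.append(normalized)
            return True
        return False

    def drill_down(self, user: str) -> list:
        """Limited lookup into the archive for one user's raw events."""
        events = (json.loads(gzip.decompress(blob)) for blob in self.archive)
        return [e for e in events if str(e.get("user", "")).lower() == user.lower()]
```

Note that the archive never feeds the alerting path here; it is only consulted on demand, which is what makes this the shallowest of the integration models.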


Peer-to-Peer

Like the example above, in this scenario real-time analysis is performed on the incoming event stream, and basic analysis is performed in a semi-relational or flat file database. The difference here is functional rather than architectural. The two databases are truly peers – each provides half the analysis capability. The big data cluster periodically recalculates behavioral profiles and risk scores, and shares these with SIEM’s real-time analysis component. It also processes complex activity chains and multiple events tied to specific locations, users, or applications that may indicate malicious behavior. The big data cluster does a lot of heavy lifting to mine events, and shares these updated profiles with SIEM to hone policy enforcement. The big data cluster also provides a direct view for Security Operations Centers (SOC) to run ad hoc queries on a complete set of events to look for outliers and anomalous activity.
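The division of labor between the two peers might look something like the following sketch. It assumes a deliberately simple behavioral profile (mean and standard deviation of each user's daily event count); the function names, the event fields `user` and `day`, and the threshold are hypothetical choices for illustration, not how any particular product scores risk.

```python
from collections import Counter
from statistics import mean, pstdev


def recalculate_profiles(archived_events):
    """Batch job (big data side): per-user baseline of daily event counts."""
    daily = Counter((e["user"], e["day"]) for e in archived_events)
    per_user = {}
    for (user, _day), count in daily.items():
        per_user.setdefault(user, []).append(count)
    return {u: {"mean": mean(c), "stdev": pstdev(c)} for u, c in per_user.items()}


def is_anomalous(user, todays_count, profiles, threshold=3.0):
    """Real-time check (SIEM side) against the most recently shared profiles."""
    profile = profiles.get(user)
    if profile is None:
        return True  # no baseline yet: flag for review (an assumption of this sketch)
    # Floor the spread so a perfectly flat baseline does not produce huge scores.
    spread = max(profile["stdev"], 1.0)
    return (todays_count - profile["mean"]) / spread > threshold
```

The batch function would run periodically over the full archive, while the real-time check only ever touches the small profile dictionary it was handed, which is why the two halves can live on different platforms.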

Full Integration

The next option is to leverage only big data for event analysis and long-term log storage. Today most of these SIEM platforms use proprietary file systems – not relational databases or big data. These proprietary systems were born of the same need as big data: to scale to accommodate more data with less insertion overhead. These proprietary repositories were designed to provide clustered data management, distributing queries across multiple machines. But they are not big data – they don’t have the essential characteristics we defined earlier, and often don’t have the 3Vs either.


You will notice that both the peer-to-peer and log management oriented versions use two databases: one relational and one big data. There is really no good reason to maintain a relational database alongside a big data cluster, other than the time it takes to migrate and test the migration. That aside, it is a simple engineering effort to swap out a relational platform for a big data cluster. Big data clusters can be assembled to perform ultra-fast queries, or efficient large-scale analysis, or both types of queries on a single data set. Many relational features are irrelevant to security analytics, so they are either stripped out for performance or left in place, where they reduce performance. Again, there is no reason relational databases must be part of SIEM – the only impediment is the need to re-engineer the platform to swap the new cluster in. This does not exist today, but expect it in the months to come.

Continuing this line of thought, it is very interesting to think of ways to further optimize a SIEM system. You can run more than one big data cluster, each focused on a specific type of operation. So one cluster would run fully indexed SQL queries for fast retrieval while another might run MapReduce queries to find statistical outliers. In terms of implementation, you might choose Cassandra for its indexing capabilities and native compression, and Hadoop for MapReduce and large-scale storage. The graphic to the right shows this possibility. It is also possible to have one data cluster with multiple query engines running against the same data set. The choice is up to your SIEM vendor, but the low cost of data storage and processing capacity means the performance boost, even from redundant data stores, is still likely to outweigh the costs of added processing. The fit for security analytics is largely conjecture, but we have seen both models scale well for various other data analyses.
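To make the two query styles concrete, here is a small pure-Python stand-in that runs both against the same data set. It does not use the Cassandra or Hadoop APIs; the function names, the event field used as a key, and the "more than three times the median" outlier rule are all assumptions chosen to keep the sketch short.

```python
from collections import defaultdict
from functools import reduce
from statistics import median


def build_index(events, key):
    """Fast-retrieval path: precompute an index so lookups avoid a full scan."""
    index = defaultdict(list)
    for event in events:
        index[event[key]].append(event)
    return index


def map_reduce_counts(events, key):
    """Bulk-analysis path: map each event to (key, 1), then reduce by summing."""
    mapped = ((event[key], 1) for event in events)  # map step

    def combine(acc, pair):                          # reduce step
        k, v = pair
        acc[k] = acc.get(k, 0) + v
        return acc

    return reduce(combine, mapped, {})


def outliers(counts, factor=3):
    """Keys whose count exceeds `factor` times the median count."""
    typical = median(counts.values())
    return sorted(k for k, v in counts.items() if v > factor * typical)
```

The index pays its cost at insertion time and answers point queries quickly; the map-reduce pass pays nothing up front but touches every event, which is the trade-off behind running separate clusters, or separate engines, for each style.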


Standalone

Those of you keeping score at home have noticed I am throwing in a fourth option: the standalone or non-integration model. Some of our readers are not actually interested in SIEM at all – they just want to collect security events and run their own reports and analysis without SIEM. It is perfectly feasible to build a standalone big data cluster for security and event analytics. Choose a platform based on the queries you need to optimize for (fast, or efficient, or both if it is worth building multiple optimized clusters), the types of data to mine, and developer comfort. But understand that you will need to build a lot yourself. A wide variety of excellent tools and logging utilities are available as open source or shareware, but you will take responsibility for design, organization, and writing your own analytics. Starting from scratch is not necessarily bad, but all development (tools, queries, reports, etc.) will fall to your team. Should you choose to integrate with SIEM or log management, you will feed events to the big data cluster much as you would with the log management example above.

A few final comments: You can think of these three models as an evolutionary cycle, with most SIEM vendors moving from log management to peer-to-peer. Keeping these models in mind, figure out what your vendor is doing. They pretty much all claim more big data integration than they really provide, and those running on a proprietary file layout tend to claim they have arrived at the promised land of perfect and effortless data storage. But vendors have a good reason to fudge the truth: you.

For years, SIEM vendors have been hearing about their shoddy integration between SIM, SEM, and log management. Their failures to integrate management consoles, policies, and other operational tasks were a sore spot with customers, creating a competitive divide for a number of years. Vendors got tired of black eyes for lack of integration, so many now claim better integration than they can really deliver. For example, some use big data primarily as a log management repository and drill-down archive for Security Operations Center (SOC) support, but claim their architecture is ‘peer-to-peer’.

Honestly, the level of architectural integration matters much less than the performance of key functions. Deeper integration implies better performance, but as long as the platform does what you want, don’t get hung up on labels. Big data adoption is still very young, and all the vendors will continue to incorporate big data capabilities into their platforms while they evolve to better take advantage of them.

Our next post will cover some key operational considerations to be aware of – running big data and writing custom policies are entirely new arenas, and you need to be prepared.