The first four posts our the SIEM series dealt with understanding what SIEM is, and what problems it solves. Now we move into how to select the right product/solution/service for your organization, and that involves digging into the technology behind SIEM and log management platforms. We start with the foundation of every SIEM and Log Management platform: data collection. This is where we collect data from the dozens of different types of devices and applications we monitor. ‘Data’ has a pretty broad meaning – here it typically refers to event and log records but can also include flow records, configuration data, SQL queries, and any other type of standard data we want to pump into the platform for analysis.

It may sound easy, but being able to gather data from every hardware and software vendor on the planet in a scalable and reliable fashion is incredibly difficult. With over 20 vendors in the Log Management and SIEM space, and each vendor using different terms to differentiate their products, it gets very confusing. In this series we will define vendor-neutral terms to describe the technical underpinnings and components of log data collection, to level-set what you really need to worry about. In fact, while log files are what is commonly collected, we will use the term “data collection”, as we recommend gathering more than just log files.

Data Collection Overview

Conceptually, data collection is very simple: we just gather the events from different devices and applications on our network to understand what is going on. Each device generates an event each time something happens, and collects the events into a single repository known as a log file (although it could actually be a database). There are only four components to discuss for data collection, and each one provides a pretty straight-forward function. Here are the functional components:

Fig 1. Agent data collector

Fig 2. Direct connections to the device

Fig 3. Log file collection

  1. Source: There are many different sources – including applications, operating systems, firewalls, routers & switches, intrusion detection systems, access control software, and virtual machines – that generate data. We can even collect network traffic, either directly from the network for from routers that support Netflow-style feeds.
  2. Data: This is the artifact telling us what actually happened. The data could be an event, which is nothing more than a finite number of data elements to describe what happened. For example, this might record someone logging into the system or a service failure. Minimum event data includes the network address, port number, device/host name, service type, operation being performed, result of the operation (success or error code), user who performed the operation, and timestamp. Or the data might just be configuration information or device status. In practice, event logs are pretty consistent across different sources – they all provide this basic information. But each offers additional data, including context. Additional data types may include things such as NetFlow records and configuration files. In practice, most of the data gathered will be events and logs, but we don’t want to arbitrarily restrict our scope.
  3. Collector: This connects to a source device, directly or indirectly, to collect the events. Collectors take different forms: they can be agents residing on the source device (Fig. 1), remote code communicating over the network directly with the device (Fig. 2), an agent writing code writing to a dedicated log repository (Fig. 3), or receivers accepting a log file stream. A collector may be provided by the SIEM vendor or a third party (normally the vendor of the device being monitored). Further, the collector functions differently, depending upon the idiosyncrasies of the device. In most cases the source need only be configured once, and events will be pushed directly to the collector or into a neutral log file read by it. In some cases, the collector must continually request data be sent, polling the source at regular intervals.
  4. Protocol: This is how collector communicates with the source. This is an oversimplification, of course, but think of it as a language or dialect the two agree upon for communicating events. Unfortunately there are lots of them! Sometimes the collector uses an API to communicate directly with the source (e.g., OPSEC LEA APIs, MS WMI, RPC, or SDEE). Sometimes events are streamed over networking protocols such as SNMP, Netflow, or IPFIX. Sometimes the source drops events into a common file/record format, such as syslog, Windows Event Log, or syslog-ng, which is then read by the collector. Additionally, third party applications such as Lasso and Snare provide these features as a service.

Data collection is conceptually simple, but the thousands of potential variations makes implementation a complex mess. It resembles a United Nations meeting: you have a whole bunch of people talking in different languages, each with a particular agenda of items they feel are important, and different ways they want to communicate information. Some are loquacious and won’t shut up, while others need to be poked and prodded just to extract the simplest information. In a nutshell, it’s up to the SIEM and Log Management platforms to act as the interpreters, gathering the information and putting it into some useful form.


Each model for data collection has trade-offs. Agents can be a powerful proxy, allowing the SIEM platform to use robust (sometimes proprietary) connection protocols to safely and reliably move information off devices; in this scenario device setup and configuration is handled during agent installation. Agents can also take full advantage of native device features, and can tune and filter the event stream. But agents have fallen out of favor somewhat. SIEM installations cover thousands of devices, which means agents can be a maintenance nightmare, requiring considerable time to install and maintain. Further, agents’ processing and data storage requirements on the device can affect stability and performance. Finally, most agents require administrative access, which creates am additional security concern on each device.

Another common technique streams events to log files, such as syslog or the Windows Event Log. These may reside on the device, streamed to another server, or sent directly to the log management system. The benefit of this method is that data arrives already formatted using a common protocol and layout. Further, if the events are collected in a file, this removes concerns about synchronization issues and uncollected events lost prior to collection – both problems when working directly with some devices. Unfortunately general-purpose logging systems require some data normalization, which can lose detail.

Some older devices, especially dedicated control systems, simply do not offer full-feature logging, and require API-level integration to collect events. These specialized devices are much more difficult to work with, and require dedicated full-time connections to collect event trails, creating both a maintenance nightmare and a performance penalty on the devices. In these cases you do not have a choice, but need a synchronous connection in order to capture events.

Understand that data collection is not an either/or proposition. Depending on the breadth of your monitoring efforts, you may need to use every technique on some subset of device types and applications. Go into the project with your eyes open, recognizing the different types of collection, and the associated nuances and complexity of each.

In the next post we’ll talk about what to do with all this collected data: prepare it for analysis, which means normalization.