Understanding and Selecting SIEM/LM: Aggregation, Normalization, and Enrichment

In the last post on Data Collection we introduced the complicated process of gathering data. Now we need to understand how to put it into a manageable form for analysis, reporting, and long-term storage for forensics. Aggregation SIEM platforms collect data from thousands of different sources because these events provide the data we need to analyze the health and security of our environment. In order to get a broad end-to-end view, we need to consolidate what we collect onto a single platform. Aggregation is the process of moving data and log files from disparate sources into a common repository. Collected data is placed into a homogenous data store – typically purpose-built flat file repositories or relational databases – where analysis, reporting, and forensics occur; and archival policies are applied. The process of aggregation – compiling these dissimilar event feeds into a common repository – is fundamental to Log Management and most SIEM platforms. Data aggregation can be performed by sending data directly into the SIEM/LM platform (which may be deployed in multiple tiers), or an intermediary host can collect log data from the source and periodically move it into the SIEM system. Aggregation is critical because we need to manage data in a consistent fashion: security, retention, and archive policies must be systematically applied. Perhaps most importantly, having all the data on a common platform allows for event correlation and data analysis, which are key to addressing the use cases we have described. There are some downsides to aggregating data onto a common platform. The first is scale: analysis becomes exponentially harder as the data set grows. Centralized collection means huge data stores, greatly increasing the computational burden on the SIEM/LM platform. Technical architectures can help scale, but ultimately these systems require significant horsepower to handle an enterprise’s data. Systems that utilize central filtering and retention policies require all data to be moved and stored – typically multiple times – increasing the burden on the network. Some systems scale using distributed processing, where filtering and analysis occur outside the central repository, typically at the distributed data collection point. This reduces the compute burden on the central server and allows processing to occur on smaller, more manageable data sets. It does require that policies, along with the code to process them, be distributed and kept current throughout the network. Distributed agent processes are a handy way to “divide and conquer”, but increase IT administration requirements. This strategy also adds a computational burden o the data collection points, degrading their performance and potentially slowing enough to drop incoming data. Data Normalization If the process of aggregation is to merge dissimilar event feeds into one common platform, normalization takes it one step further by reducing the records to just common event attributes. As we mentioned in the data collection post, most data sources collect exactly the same base event attributes: time, user, operation, network address, and so on. Facilities like syslog not only group the common attributes, but provide means to collect supplementary information that does not fit the basic template. Normalization is where known data attributes are fed into a generic template, and anything that doesn’t fit is simply omitted from the normalized event log. After all, to analyze we want to compare apple to apples, so we throw away an oranges for the sake of simplicity. Depending upon the SIEM or Log Management vendor, the original non-normalized records may be kept in a separate repository for forensics purposes prior to later archival or deletion, or they may simply be discarded. In practice, discarding original data is a bad idea, since the full records are required for any kind of legal enforcement. Thus, most products keep the raw event logs for a user-specified period prior to archival. In some cases, the SIEM platform keeps a link to the original event in the normalized event log which provides ‘drill-down’ capability to easily reference extra information collected from the device. Normalization allows for predicable and consistent storage for all records, and indexes these records for fast searching and sorting, which is key when battling the clock in investigating an incident. Additionally, normalization allows for basic and consistent reporting and analysis to be performed on every event regardless of the data source. When the attributes are consistent, event correlation and analysis – which we will discuss in our next post – are far easier. Technically normalization is no longer a requirement on current platforms. Normalization was a necessity in the early days of SIEM, when storage and compute power were expensive commodities, and SIEM platforms used relational database management systems for back-end data management. Advances in indexing and searching unstructured data repositories now make it feasible to store full source data, retaining original data, and eliminating normalization overhead. Enriching the Future In reality, we are seeing a number of platforms doing data enrichment, adding supplemental information (like geo-location, transaction numbers, application data, etc.) to logs and events to enhance analysis and reporting. Enabled by cheap storage and Moore’s Law, and driven by ever-increasing demand to collect more information to support security and compliance efforts, we expect more platforms to increase enrichment. Data enrichment requires a highly scalable technical architecture, purpose-built for multi-factor analysis and scale, making tomorrow’s SIEM/LM platforms look very similar to current business intelligence platforms. But that just scratches the surface in terms of enrichment, because data from the analysis can also be added to the records. Examples include identity matching across multiple services or devices, behavioral detection, transaction IDs, and even rudimentary content analysis. It is somewhat like having the system take notes and extrapolate additional meaning from the raw data, making the original record more complete and useful. This is a new concept for SIEM, so the enrichment will ultimately encompass is anyone’s guess. But as the core functions of SIEM have standardized, we expect vendors to introduce new ways to derive additional value from the sea of data they collect. Other Posts in Understanding and Selecting SIEM/LM Introduction. Use Cases,

Read Post

Totally Transparent Research is the embodiment of how we work at Securosis. It’s our core operating philosophy, our research policy, and a specific process. We initially developed it to help maintain objectivity while producing licensed research, but its benefits extend to all aspects of our business.

Going beyond Open Source Research, and a far cry from the traditional syndicated research model, we think it’s the best way to produce independent, objective, quality research.

Here’s how it works:

  • Content is developed ‘live’ on the blog. Primary research is generally released in pieces, as a series of posts, so we can digest and integrate feedback, making the end results much stronger than traditional “ivory tower” research.
  • Comments are enabled for posts. All comments are kept except for spam, personal insults of a clearly inflammatory nature, and completely off-topic content that distracts from the discussion. We welcome comments critical of the work, even if somewhat insulting to the authors. Really.
  • Anyone can comment, and no registration is required. Vendors or consultants with a relevant product or offering must properly identify themselves. While their comments won’t be deleted, the writer/moderator will “call out”, identify, and possibly ridicule vendors who fail to do so.
  • Vendors considering licensing the content are welcome to provide feedback, but it must be posted in the comments - just like everyone else. There is no back channel influence on the research findings or posts.
    Analysts must reply to comments and defend the research position, or agree to modify the content.
  • At the end of the post series, the analyst compiles the posts into a paper, presentation, or other delivery vehicle. Public comments/input factors into the research, where appropriate.
  • If the research is distributed as a paper, significant commenters/contributors are acknowledged in the opening of the report. If they did not post their real names, handles used for comments are listed. Commenters do not retain any rights to the report, but their contributions will be recognized.
  • All primary research will be released under a Creative Commons license. The current license is Non-Commercial, Attribution. The analyst, at their discretion, may add a Derivative Works or Share Alike condition.
  • Securosis primary research does not discuss specific vendors or specific products/offerings, unless used to provide context, contrast or to make a point (which is very very rare).
    Although quotes from published primary research (and published primary research only) may be used in press releases, said quotes may never mention a specific vendor, even if the vendor is mentioned in the source report. Securosis must approve any quote to appear in any vendor marketing collateral.
  • Final primary research will be posted on the blog with open comments.
  • Research will be updated periodically to reflect market realities, based on the discretion of the primary analyst. Updated research will be dated and given a version number.
    For research that cannot be developed using this model, such as complex principles or models that are unsuited for a series of blog posts, the content will be chunked up and posted at or before release of the paper to solicit public feedback, and provide an open venue for comments and criticisms.
  • In rare cases Securosis may write papers outside of the primary research agenda, but only if the end result can be non-biased and valuable to the user community to supplement industry-wide efforts or advances. A “Radically Transparent Research” process will be followed in developing these papers, where absolutely all materials are public at all stages of development, including communications (email, call notes).
    Only the free primary research released on our site can be licensed. We will not accept licensing fees on research we charge users to access.
  • All licensed research will be clearly labeled with the licensees. No licensed research will be released without indicating the sources of licensing fees. Again, there will be no back channel influence. We’re open and transparent about our revenue sources.

In essence, we develop all of our research out in the open, and not only seek public comments, but keep those comments indefinitely as a record of the research creation process. If you believe we are biased or not doing our homework, you can call us out on it and it will be there in the record. Our philosophy involves cracking open the research process, and using our readers to eliminate bias and enhance the quality of the work.

On the back end, here’s how we handle this approach with licensees:

  • Licensees may propose paper topics. The topic may be accepted if it is consistent with the Securosis research agenda and goals, but only if it can be covered without bias and will be valuable to the end user community.
  • Analysts produce research according to their own research agendas, and may offer licensing under the same objectivity requirements.
  • The potential licensee will be provided an outline of our research positions and the potential research product so they can determine if it is likely to meet their objectives.
  • Once the licensee agrees, development of the primary research content begins, following the Totally Transparent Research process as outlined above. At this point, there is no money exchanged.
  • Upon completion of the paper, the licensee will receive a release candidate to determine whether the final result still meets their needs.
  • If the content does not meet their needs, the licensee is not required to pay, and the research will be released without licensing or with alternate licensees.
  • Licensees may host and reuse the content for the length of the license (typically one year). This includes placing the content behind a registration process, posting on white paper networks, or translation into other languages. The research will always be hosted at Securosis for free without registration.

Here is the language we currently place in our research project agreements:

Content will be created independently of LICENSEE with no obligations for payment. Once content is complete, LICENSEE will have a 3 day review period to determine if the content meets corporate objectives. If the content is unsuitable, LICENSEE will not be obligated for any payment and Securosis is free to distribute the whitepaper without branding or with alternate licensees, and will not complete any associated webcasts for the declining LICENSEE. Content licensing, webcasts and payment are contingent on the content being acceptable to LICENSEE. This maintains objectivity while limiting the risk to LICENSEE. Securosis maintains all rights to the content and to include Securosis branding in addition to any licensee branding.

Even this process itself is open to criticism. If you have questions or comments, you can email us or comment on the blog.