Splunk and Unstructured Data
By Adrian Lane
“What the heck is up with Splunk?” It’s a question I have been getting a lot lately, from end users and SIEM vendors alike. Larry Walsh posted a nice article on how Splunk Disrupts Security Log Auditing. His post prodded me into getting off my butt and blogging about this question.
I wanted to follow up on Splunk after I wrote the post on Amazon’s SimpleDB, as it relates to what I am calling the blob-ification of data: basically, creating so much data that we cannot possibly keep it in a structured environment. Mike Rothman more accurately called it “… the further decomposition of application architecture”. In this case we collect some type of data from some type of device, put it onto some type of storage, and then use a Google-esque search tool to find what we are looking for. And the beauty of Google is that it does not care if it is a web page or voice mail transcript; it will find what you are looking for if you give it reasonable search criteria. In essence, that is the value Splunk provides: a tool to find information in a sea of data.
It is easy to locate information within a structured repository with known attributes and data types, where we know where certain pieces of information are stored. With unstructured data we may not know what we have or where it is located. For some time normalization techniques were used to introduce structure and reduce storage requirements, but that was a short-lived, low-performance approach. Adding attributes to raw data and just linking back to those attributes is far more efficient. Enter Splunk: throw the data into flat files and index those files. Techniques of tokenization, tagging, and indexing help categorize data, with the ultimate goal of correlating events and reporting on unstructured data of differing types. Splunk is not the only vendor who does this; several SIEM and Log Management vendors do the same or similar. My point is not that one vendor is better than another, but to point out the general trend. It is interesting that Splunk’s success in this area has even taken their competitors by surprise.
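To make the "flat files plus an index" idea concrete, here is a minimal sketch in Python. It is not how Splunk works internally; it only illustrates tokenizing raw lines and building an inverted index that links each token back to the untouched raw data. The sample log lines, the token pattern, and the function names (tokenize, build_index, search) are all assumptions made up for illustration, and the tagging step is left out.

```python
import re
from collections import defaultdict

# Hypothetical raw events standing in for a flat file of mixed log types.
RAW_EVENTS = [
    "2008-03-01 10:15:02 sshd failed password for root from 10.0.0.5",
    "2008-03-01 10:15:09 httpd GET /index.html 200 from 10.0.0.7",
    "2008-03-01 10:16:44 sshd accepted password for alice from 10.0.0.5",
]

# Assumed token rule: runs of letters, digits, and a few punctuation marks,
# so IP addresses, timestamps, and paths each survive as a single token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9_./:-]+")


def tokenize(line):
    """Split a raw line into lowercase tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(line)]


def build_index(events):
    """Map each token to the set of line numbers that contain it."""
    index = defaultdict(set)
    for line_no, line in enumerate(events):
        for token in tokenize(line):
            index[token].add(line_no)
    return index


def search(index, events, query):
    """Return the raw lines containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return [events[i] for i in sorted(matches)]


if __name__ == "__main__":
    index = build_index(RAW_EVENTS)
    for hit in search(index, RAW_EVENTS, "sshd 10.0.0.5"):
        print(hit)
```

The raw lines are never normalized or reshaped; the index only holds pointers back to them, which is why this approach copes with data of differing types.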
Larry’s point …
“The growth Splunk is achieving is due, in part, to penetrating deeper into the security marketplace and disrupting the conventional log management and auditing vendors.”
… is accurate. But they are able to do this because of the increased volume of data we are collecting. People are data pack-rats. From experience, less than 1% of the logged data I collect has any value. Far too often, organizations do not invest the time to determine what can be thrown away. Many are too chicken to throw useless data away. They don’t want to discard data, just in case it has value, just in case they need it, just in case it contains the needle in the haystack they need for a forensic investigation. I don’t want to be buried under the wash of useless data. My recommendation is to take the time to understand what data you have, determine what you need, and throw the rest away.
The pessimist in me knows that this is unlikely to happen. We are not going to start throwing data away. Storage and computing power are cheap, and we are going to store every possible piece of data we can. Amazon S3 will be the digital equivalent of those U-Haul Self Storage places where you keep your grandmother’s china and all the crap you really don’t want, but think has value. That means we must have the Google-like search approaches and indexing strategies that vendors like Splunk provide, just to navigate the stuff. Look for unstructured search techniques to be much sought after as data volumes continue to grow out of control.
Hopefully the vendors will begin tagging data with an expiration date.