“What the heck is up with Splunk?” It’s a question I have been getting a lot lately, from both end users and SIEM vendors. Larry Walsh posted a nice article on how Splunk Disrupts Security Log Auditing. His post prodded me into getting off my butt and blogging about this question.
I wanted to follow up on Splunk after I wrote the post on Amazon’s SimpleDB as it relates to what I am calling the blob-ification of data: basically, creating so much data that we cannot possibly keep it in a structured environment. Mike Rothman more accurately called it “… the further decomposition of application architecture”. In this case we collect some type of data from some type of device, put it onto some type of storage, and then we use a Google-esque search tool to find what we are looking for. And the beauty of Google is that it does not care if it is a web page or a voice mail transcript – it will find what you are looking for if you give it reasonable search criteria. In essence, that is the value Splunk provides: a tool to find information in a sea of data.
It is easy to locate information within a structured repository with known attributes and data types, where we know where certain pieces of information are stored. With unstructured data we may not know what we have or where it is located. For some time normalization techniques were used to introduce structure and reduce storage requirements, but that was a short-lived, low-performance approach. Adding attributes to raw data and simply linking back to those attributes is far more efficient. Enter Splunk: throw the data into flat files and index those files. Tokenization, tagging, and indexing techniques help categorize the data, with the ultimate goal of correlating events and reporting on unstructured data of differing types. Splunk is not the only vendor who does this – several SIEM and Log Management vendors do the same or similar. My point is not that one vendor is better than another, but to point out the general trend. It is interesting that Splunk’s success in this area has even taken their competitors by surprise.
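To make the idea concrete, here is a minimal sketch of tokenizing raw log lines and building an inverted index so unstructured entries can be searched without a predefined schema. This is not how Splunk or any particular vendor implements it, and the sample log lines are invented for illustration.

```python
import re
from collections import defaultdict

# A few invented, unstructured log lines in different formats.
raw_logs = [
    "2009-06-12T09:14:02Z sshd[2212]: Failed password for root from 10.1.1.5",
    "Jun 12 09:14:05 fw01 DROP src=10.1.1.5 dst=192.168.0.10 proto=tcp",
    "app=billing level=ERROR msg='timeout talking to 192.168.0.10'",
]

def tokenize(line):
    """Split a raw line into lowercase tokens (words, numbers, IP addresses)."""
    return [t.lower() for t in re.findall(r"[\w\.]+", line)]

# Inverted index: token -> set of line numbers containing that token.
index = defaultdict(set)
for lineno, line in enumerate(raw_logs):
    for token in tokenize(line):
        index[token].add(lineno)

def search(*terms):
    """Return the raw lines containing every search term (AND semantics)."""
    hits = set(range(len(raw_logs)))
    for term in terms:
        hits &= index.get(term.lower(), set())
    return [raw_logs[i] for i in sorted(hits)]

print(search("10.1.1.5"))          # the sshd and firewall entries
print(search("error", "billing"))  # the application log entry
```

The point of the sketch is that neither the index nor the search cares what format the data arrived in; structure is recovered at query time rather than imposed at collection time.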
Larry’s point …
“The growth Splunk is achieving is due, in part, to penetrating deeper into the security marketplace and disrupting the conventional log management and auditing vendors.”
… is accurate. But they are able to do this because of the increased volume of data we are collecting. People are data pack-rats. From experience, less than 1% of the logged data I collect has any value. Far too often, organizations do not invest the time to determine what can be thrown away. Many are too chicken to throw useless data away. They don’t want to discard data, just in case it has value, just in case you need it, just in case it contains the needle in the haystack you need for a forensic investigation. I don’t want to be buried under the wash of useless data. My recommendation is to take the time to understand what data you have, determine what you need, and throw the rest away.
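As a hedged illustration (the event types and record layout here are invented), the kind of triage I am talking about can be as simple as a filter that keeps the event types your written policy says you need and drops the rest before it ever hits long-term storage:

```python
# Invented example: keep only event types we have decided matter;
# everything else is dropped before it reaches long-term storage.
KEEP_EVENT_TYPES = {"auth_failure", "config_change", "privilege_escalation"}

def worth_keeping(event):
    """Return True if the event matches our written retention policy."""
    return event.get("type") in KEEP_EVENT_TYPES

events = [
    {"type": "heartbeat", "host": "web01"},
    {"type": "auth_failure", "host": "web01", "user": "root"},
    {"type": "debug", "host": "app02"},
]

retained = [e for e in events if worth_keeping(e)]
print(retained)  # only the auth_failure event survives
```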
The pessimist in me knows that this is unlikely to happen. We are not going to start throwing data away. Storage and computing power are cheap, and we are going to store every possible piece of data we can. Amazon S3 will be the digital equivalent of those U-Haul Self Storage places where you keep your grandmother’s china and all the crap you really don’t want, but think has value. That means we must have Google-like search approaches and indexing strategies that vendors like Splunk provide just to navigate the stuff. Look for unstructured search techniques to be much sought after as the data volumes continue to grow out of control.
Hopefully the vendors will begin tagging data with an expiration date.
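What I have in mind is something like the following minimal sketch (the field names and the 90-day period are hypothetical): stamp each record with an expiration date at collection time, so that purging later is a mechanical job rather than a judgment call.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # hypothetical retention period

def tag_with_expiration(record, now=None):
    """Stamp a record with an expiration date at collection time."""
    now = now or datetime.now(timezone.utc)
    record["expires_at"] = now + RETENTION
    return record

def purge_expired(records, now=None):
    """Drop every record whose expiration date has passed."""
    now = now or datetime.now(timezone.utc)
    return [r for r in records if r["expires_at"] > now]

store = [tag_with_expiration({"msg": "login failed", "host": "web01"})]
store = purge_expired(store)  # nothing purged yet; run this on a schedule
print(store)
```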
8 Replies to “Splunk and Unstructured Data”
Adrian – interesting post – but Splunk’s problem isn’t logs or other logging components – it’s time. The real problem is in mapping events from one system to those on another, and what happens if the mapping fails (most often because the time-stamps from one system are not comparable to those from another).
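To make the time problem concrete, here is a minimal sketch of getting timestamps from two systems into a common reference before comparing them. The formats and offsets are invented examples, and real clock skew is a harder problem than format conversion.

```python
from datetime import datetime, timezone, timedelta

# Two systems emitting timestamps in different formats and zones (invented examples).
def parse_syslog(ts, year=2009, utc_offset_hours=-5):
    """e.g. 'Jun 12 09:14:05' with no year, recorded in local time (UTC-5)."""
    dt = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
    return dt.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))

def parse_iso(ts):
    """e.g. '2009-06-12T14:14:07+00:00', already in UTC."""
    return datetime.fromisoformat(ts)

a = parse_syslog("Jun 12 09:14:05")
b = parse_iso("2009-06-12T14:14:07+00:00")

# Once both are timezone-aware, the events can actually be compared.
print((b - a).total_seconds())  # 2.0 seconds apart, not 5 hours
```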
As to the massive expansion of data – that also is just not the case. The data presenting itself for analysis now is essentially the same size it used to be in most instances, just adjusted for the increase in transactions – and that is the problem.
What Log Management needs is tools which ensure that its reports actually mean something in the real world; without that, the data in the log is arguably worthless. The key byproduct of this, in my opinion, is that the D&O policies of the officers involved are likely unenforceable as a result, but hey… it is what it is, right?
That said, logging is easy, and security, once a Knowledge Representation Model is created, is also easy, since all the in-scope data is clearly defined and can be addressed within it.
Just my two cents.
Todd Glassey CISM CIFI
I suppose if the only value of log data is reactive, in the sense that you get an alert from another source and now you want to research logs, the Splunk approach is good. It’s my understanding that the primary use case for Splunk is actually in Operations where, for example, a network management system generates an alert based on some SNMP-oriented analysis. In reaction to the SNMP-based alert for some IP address, the admin goes to Splunk and enters the IP address to view logs associated with that IP address.
SIEM is really about being proactive, i.e. generating actionable alerts based on log analysis. Based on my years of experience, in order to do this you need to first “normalize” at least some portion of the logs (the events deemed “interesting”) and then perform some type of analysis, be it statistical or rule-based.
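To illustrate the difference, here is a toy sketch of “normalize then analyze” – not any vendor’s correlation engine, and the log format and threshold are invented. Raw entries are mapped into common fields, then a simple rule runs over them:

```python
import re
from collections import Counter

raw = [
    "sshd[2212]: Failed password for root from 10.1.1.5",
    "sshd[2212]: Failed password for admin from 10.1.1.5",
    "sshd[2212]: Failed password for root from 10.1.1.5",
    "sshd[2213]: Accepted password for alice from 10.2.2.9",
]

def normalize(line):
    """Map a raw sshd line into common fields; returns None for lines we don't parse."""
    m = re.search(r"(Failed|Accepted) password for (\S+) from (\S+)", line)
    if not m:
        return None
    return {"outcome": m.group(1).lower(), "user": m.group(2), "src_ip": m.group(3)}

events = [e for e in (normalize(l) for l in raw) if e]

# Trivial rule: alert if a single source IP has 3 or more failed logins.
failures = Counter(e["src_ip"] for e in events if e["outcome"] == "failed")
for ip, count in failures.items():
    if count >= 3:
        print(f"ALERT: {count} failed logins from {ip}")
```

The search-only approach answers “show me everything about this IP” after the fact; the normalize-and-rule approach is what produces the alert in the first place.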
I am not necessarily saying SIEM has achieved this goal, considering the April 2009 Verizon Business report, which said that of the methods by which the investigated breaches were discovered (Discovery Methods, page 37), 83% were discovered by third parties or non-security employees going about their normal business. Only 6% were found by event monitoring or log analysis.
So in conclusion, Splunk is definitely helpful and SIEM needs to improve dramatically.
You absolutely, positively MUST get rid of data on a regular basis. Anybody who advocates otherwise is ignoring the HUGE legal liability in keeping everything indefinitely. The “it’s too hard” whine is laziness and needs to be confronted harshly. Of the commenters above who say you should just keep it all, I wonder how many of them have had this conversation, and the larger data retention conversation, with counsel lately. This area is one of the top concerns consistently discussed within the ABA InfoSec Committee and eDiscovery & Digital Evidence Committee. In both cases, attorneys are baffled why anybody would think to keep everything indefinitely that isn’t explicitly required by law. The liability is potentially huge. fwiw.
I do not disagree with your points, but offer a different perspective.
Let’s say you are keeping data for compliance. FERPA is a good example. Most say they don’t know what data to keep. But as with several compliance regulations, you get to define what you need to keep! All you have to do is establish a written policy as to what you are keeping and why. That means work: figuring out what to keep, writing the policy, and tuning the data collection. No one really wants to do that work, and no one wants to be wrong down the line, even though there is no regulatory penalty for being wrong if you document the policy. But you need to take the time to do the research. If it becomes important in the future, change the policy and collect the data.
Attorneys are a whole different story. They are paid to minimize liability. Invariably they will say keep everything. And why not? They do not manage the data. And they get paid for discovery and research, so combing mountains of information means they are paid more.
Is it safer to keep everything? Yeah. Is it easier to keep everything? In the short term it requires no work. In the long term the answer is ‘no’, as the reports, discovery, and analysis are _much_ harder.
Adrian,
I think you’re glossing over the difficulties with discarding data a bit. Throwing out broken furniture and ugly silver is one thing, but discarding log data which you might need if you get sued in a year, or discover a breach in 6 months, is a much more serious matter.
At what level does the decision to discard data need to be made? Does a lawyer with a high hourly rate have to read the logs and say there’s nothing there? Can you have a NOC op or help desk rep scan the logs and dump them if they don’t spot anything suspicious? What if they included an IP which suddenly becomes important & interesting *next* month?
The potential downside to discarding data is immense, and storage is cheap enough that it’s often much easier & safer to keep it than perform an adequate analysis.
I could discard a lot of personal email safely, but to know there’s nothing worth saving, I’d have to reread it first, and I don’t have time to do that. This helps avoid mistaken deletion, too…
@Steve – You raise a really good point with multi-line or multi-record application logs. A database transaction may be a single log entry, or 100k. Sometimes it is not possible to filter records because it may not be possible to determine which are needed and which are not. Oh, and while it is fairly obvious you work for Splunk, I just wanted to note it in case there was any doubt among other readers.
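To make the multi-line problem concrete, here is a minimal sketch that assumes a timestamped line starts a new event and indented lines are continuations; real formats vary, and the sample transaction log is invented.

```python
import re

raw = """2009-06-12 09:14:02 BEGIN TRANSACTION 8812
    UPDATE accounts SET balance = balance - 100 WHERE id = 7
    UPDATE accounts SET balance = balance + 100 WHERE id = 9
2009-06-12 09:14:03 COMMIT TRANSACTION 8812
2009-06-12 09:15:00 ERROR deadlock detected
    rolling back transaction 8813"""

STARTS_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2} ")  # a leading timestamp starts a new event

def assemble(lines):
    """Fold indented continuation lines into the preceding timestamped event."""
    events, current = [], []
    for line in lines:
        if STARTS_EVENT.match(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events

for event in assemble(raw.splitlines()):
    print("---\n" + event)
```

Until the physical lines have been reassembled into logical events like this, any keep-or-discard decision risks cutting a transaction in half.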
@Erik – No reason to thank me; it is an interesting trend, and I was a little surprised by how quickly we got to this point. And yes, with filtering and summarization technologies that reduce the data set size, I get to keep important information on disk longer before it is moved off to long-term storage. Sooner or later the need will arise.
-Adrian
Adrian,
Thanks for mentioning Splunk, and your post brings up interesting points.
We recommend that people dump “everything” into Splunk and just keep it. I’d go further and say I’d bet that far less than 1% of that data is ever looked at, reported on, etc. As you point out, it’s likely harder and more risky to remove data than to keep it. This clearly changes when you talk about multiple TB per day (an average large system these days), where even for a wealthy company the I/O required is very expensive, and it’s not clear the data has value or risk. My gut is that data generation growth is clearly outpacing the size/price curve per GB, and will likely do so until massively more scalable and cost-effective media is available.
For the time being, keeping everything is likely the best starting point.
At the same time, we have seen models that look a lot like email spam filtering, where “uninteresting” data is routed to different instances that have shorter retention policies. Summarization is used to capture and compress the data, hopefully with no information loss. Not a great practice for compliance, but for troubleshooting and analytics it can work. Longer term it’s an interesting area for research, and something that, given the size of the data we deal with, needs to be solved.
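As a rough sketch of that routing model (the classification rule and retention tiers are made up for illustration), “interesting” events go to a long-retention store and the rest are summarized into counts on their way to a short-retention one:

```python
from collections import Counter

INTERESTING = {"auth_failure", "error", "config_change"}  # made-up classification

long_retention, short_retention = [], []
summary = Counter()

def route(event):
    """Send interesting events to the long-retention tier; summarize the rest."""
    if event["type"] in INTERESTING:
        long_retention.append(event)
    else:
        summary[event["type"]] += 1      # lossy summarization: keep counts only
        short_retention.append(event)    # raw copy kept briefly for troubleshooting

for e in [{"type": "heartbeat"}, {"type": "error", "msg": "disk full"},
          {"type": "heartbeat"}, {"type": "auth_failure", "user": "root"}]:
    route(e)

print(len(long_retention), dict(summary))  # 2 interesting events; heartbeats reduced to a count
```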
Thanks again, and interesting topic.
e.
cto/co-founder, splunk
Adrian,
Thought I’d add a few comments about what Splunk users have told us. There is definitely a trend towards collecting broader sources of logs, as well as other IT data (events, alerts, etc.) from all levels of the IT and network infrastructure. Sometimes it’s driven by a specific compliance mandate, but often it’s because security analysts need the data for comprehensive incident investigations. One of the most often cited needs is to collect complex, multi-line application logs – especially from critical custom apps (think financial institutions). Whether they’re assessing the impact of an external attacker who penetrated past their traditional defenses or a malicious insider who is abusing trusted access privileges, the application logs can be a key component of the investigation. Splunk collects these custom logs and can search and report on them without the need for custom parsers or connectors.
It does make sense, within the guidelines set forth by compliance and governance mandates, to be able to purge information over time. Systems should be flexible so that users can purge data based on variable criteria: retention depending on data source, data type, etc.
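A flexible purge policy along those lines might be expressed as a simple table of criteria, something like this sketch; the sources, types, and retention periods are placeholders, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Placeholder policy: retention varies by data source and type.
RETENTION_POLICY = {
    ("firewall", "deny"): timedelta(days=30),
    ("app", "error"):     timedelta(days=365),
    ("app", "debug"):     timedelta(days=7),
}
DEFAULT_RETENTION = timedelta(days=90)

def expired(record, now=None):
    """True if the record is older than its (source, type) specific retention."""
    now = now or datetime.now(timezone.utc)
    keep_for = RETENTION_POLICY.get((record["source"], record["type"]), DEFAULT_RETENTION)
    return now - record["timestamp"] > keep_for

records = [
    {"source": "app", "type": "debug",
     "timestamp": datetime.now(timezone.utc) - timedelta(days=10)},
    {"source": "app", "type": "error",
     "timestamp": datetime.now(timezone.utc) - timedelta(days=10)},
]
print([expired(r) for r in records])  # [True, False]: the debug record is purged, the error record kept
```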