I periodically write for Dark Reading, contributing to their Database Security blog. Today I posted What Data Discovery Tools Really Do, introducing how data discovery works within relational database environments. As is the case with many of the posts I write for them, I try not to use the word ‘database’ to preface every description, as it gets repetitive. But sometimes that context is really important.

Ben Tomhave was kind enough to let me know that the post was referenced on the eDiscovery and Digital evidence mailing list. One comment there was, “One recurring issue has been this: If enterprise search is so advanced and so capable of excellent granularity (and so touted), why is ESI search still in the boondocks?” I wanted to add a little color to the post I made on Dark Reading as well as touch on an issue with data discovery for ESI.

Automated data discovery is a relatively new feature for data management, compliance, and security tools. For relational databases specifically, the limitations of these products have only become an issue in the last couple of years as demand has grown, particularly for accurate analysis. The methodologies for rummaging around and finding stuff are effective, but the analysis methods have a little way to go. That’s why we are beginning to see labeling and content inspection. With growing use of flat file and quasi-relational databases, look for labeling and Google-type search to become commonplace.

In my experience, metadata-based data discovery was about 85% effective. Having said that, the number is totally bogus. Why? Most of the stuff I was looking for was easy to find, as the databases were constructed by someone who was good at database design, using good naming conventions and accurate column definitions. In reality you can throw the 85% number out, because if a web application developer is naming columns “Col1, Col2, Col3, … Col56”, and defining them all as text fields up to 50 characters long, the metadata tells you nothing and your effectiveness will be 0%. If you do not have labeling or content analysis to support the discovery process, you are wasting your time. Further, with some of the ISAM and flat file databases, the discovery tools do not crawl the database content properly, forcing some vendors to upgrade their products to support other forms of data management and storage. Given the complexity of environments and the mixture of data and database types, both discovery and analysis components must continue to evolve.
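To make the difference between metadata-based discovery and content analysis concrete, here is a minimal sketch in Python. It is not how any particular product works. It uses an in-memory SQLite database as a stand-in for a real relational database, and the name hints, value patterns, sample size, threshold, and the function names discover_by_metadata and discover_by_content are all assumptions made up for illustration.

```python
import re
import sqlite3

# Hypothetical hints: substrings in a column NAME that suggest sensitive data.
NAME_HINTS = {
    "ssn": re.compile(r"ssn|social", re.IGNORECASE),
    "email": re.compile(r"e?mail", re.IGNORECASE),
}

# Hypothetical patterns: what the VALUES of such a column tend to look like.
CONTENT_PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def list_columns(conn):
    """Yield (table, column) pairs from the SQLite catalog."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        for col in conn.execute(f'PRAGMA table_info("{table}")'):
            yield table, col[1]  # col[1] is the column name


def discover_by_metadata(conn):
    """Flag columns whose names look sensitive. Never reads the data."""
    return [(table, column, label)
            for table, column in list_columns(conn)
            for label, hint in NAME_HINTS.items()
            if hint.search(column)]


def discover_by_content(conn, sample_size=100, threshold=0.8):
    """Flag columns where most sampled values match a known pattern."""
    findings = []
    for table, column in list_columns(conn):
        rows = conn.execute(
            f'SELECT "{column}" FROM "{table}" '
            f'WHERE "{column}" IS NOT NULL LIMIT ?', (sample_size,)).fetchall()
        values = [str(r[0]) for r in rows]
        if not values:
            continue
        for label, pattern in CONTENT_PATTERNS.items():
            hits = sum(1 for v in values if pattern.fullmatch(v))
            if hits / len(values) >= threshold:
                findings.append((table, column, label))
    return findings


# Demo data: one well-named table and one "Col1, Col2" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, ssn TEXT)")
conn.execute("CREATE TABLE webapp (Col1 TEXT, Col2 TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "a@example.com", "123-45-6789"),
    (2, "b@example.com", "987-65-4321"),
])
conn.executemany("INSERT INTO webapp VALUES (?, ?)", [
    ("c@example.com", "111-22-3333"),
    ("d@example.com", "444-55-6666"),
])

print("metadata:", discover_by_metadata(conn))
print("content: ", discover_by_content(conn))
```

On this toy data the metadata pass only flags the columns in the well-named customers table, while the content pass also flags Col1 and Col2 in the webapp table. Real tools have to deal with much messier values and far more data types, so treat the patterns and the 80% threshold as placeholders rather than recommendations.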

Remember that a relational database is highly structured, with columns and tables being fully defined at the time of creation. Data that is inserted goes through integrity checks and, in some cases, must conform to referential integrity checks as well. Your odds of automated tools finding useful information in such databases are far higher because you have definitive descriptions. In flat files or scanned documents? All bets are off.

As part of a project I conducted in early 2009, I spoke with a bunch of attorneys in California and Arizona regarding issues of legal document discovery and management. In that market, document discovery is a huge business and there is a lot of contention in legal circles regarding its use. The process and tools for legal document and data discovery are very different from those for database data discovery. From what I have witnessed and from explanations by people who sit on steering committees for issues pertaining to legal ESI, very little of the data is ever in a relational database. The tools I saw were pure keyword and string pattern matching on flat files. Some of the large firms may have document management software that is a little more sophisticated, but much of it is pure flat file server scanning with reports, because of the sheer volume of data.

What surprised me during my discussions was that document management is becoming a huge issue as large legal firms are attempting to win cases by flooding smaller firms with so many documents that the smaller firms cannot even process the results of the discovery tools. They simply do not have adequate manpower, and it undermines their ability to process their casefiles. The fire around this market has to do with politics and not technology. The technology sucks too, but that’s secondary suckage.