During the big data research project I found myself thinking about how I would secure a NoSQL database if I was responsible for a cluster. One area I can’t help thinking about is Database Activity Monitoring; how I would implement a solution for big databases? The only currently available solution I am aware of is very limited in what it provides. And I think the situation to stay that way for a long time. The ways to collect data with big data clusters, and to deploy monitoring, are straightforward. But analyzing queries will remain a significant engineering challenge. NoSQL tasks are processed very differently than on relational platforms, and the information at your disposal is significantly less.

First some background: With Database Activity Monitoring, you judge a user’s behavior by looking at the queries they send to the database. There are two basic analysis techniques for relational databases: either to examine the metadata associated with relational database queries, or to examine the structure and content of queries themselves. The original and most common method is metadata examination – we look at data including user identity, time of day, origin location of the query, and origin application of the query. Just as importantly we examine which objects are requested – such as column definitions – to see if a user may be requesting sensitive data. We might even look at frequency of queries or quantity of data returned. All these data points can indicate system misuse.

The second method is to examine the query structure and variables provided by the user. There are specific indicators in the where clause of a relational query that can indicate SQL injection or logic attacks on the database. There are specific patterns, such as “1=1”, designed to confuse the query parser into automatically taking action. There are content ‘fingerprints’, such as social secuirty number formats, which indicate sensitive data. And there are adjustments to the from clause, or even usage of optional query elements, designed to mask attacks from the Database Activity Monitor. But the point is that relational query grammars are known, finite, and fully cataloged. It’s easy for databases and monitors to validate structure, and then by proxy user intent.

With big data tasks – most often MapReduce – it’s not quite so easy. MapReduce is a means of distributing a query across many nodes, and reassembling the results from each node. These tasks look a lot more like code than structured relational queries. But it gets worse: the query model could be text search, or an XPath XML parser, or SPARQL. A monitor would need to parse very different query types.

Unfortunately we don’t necessarily know the data storage model of the database, which complicates things. Is it graph data, tuple-store, quasi-relational, or document storage? We get no hints from the selection’s structure or data type, because in a non-relational database that data is not easily accessible. There is no system table to quickly consult for table and column types.

Additionally, the rate at which data moves in and out of the cluster makes dynamic content inspection infeasible. We don’t know the database storage structure and cannot even count on knowing the query model without some inspection and analysis. And – I really hate to say this because the term is so overused and abused – but understanding the intention of a MapReduce task is a halting problem: it’s at least difficult, and perhaps impossible, to dynamically determine whether it is malicious.

So where does that leave us?

I suspect that Database Activity Monitoring for NoSQL databases cannot be as effective as relational database monitoring for a very long time. I expect solutions to work purely by analyzing available metadata available for the foreseeable future, and they will restrict themselves to cookie-cutter MapReduce/YARN deployments in Hadoop environments. I imagine that query analysis engines will need to learn their target database (deployment, data storage scheme, and query type) and adapt to the platforms, which will take several cycles for the vendors to get right. I expect it to be a very long time before we see truly useful systems – both because of the engineering difficulty and because of the diversity of available platforms. I wish I could say that I have seen innovative new approaches to this problem, and they are just over the horizon, but I have not. With so many customers using these systems and pumping tons of information into them – much of it sensitive – demand for security will come. And based on what’s available today I expect the tools to lean heavily toward logging tools and WAF.

That’s my opinion.