I started this series on recommendations for securing NoSQL clusters a couple weeks ago, so sorry for the delay posting the rest of the series. I had some difficulty contacting the people I spoke with during the first part of this “big data” research project, and some vendors were been slow to respond with current product capabilities. As I hoped, launching this series “shook the tree of knowledge”, and several people responded to my inquiries. It has taken a little more time than I thought to schedule calls and parse through the data, but I am finally restarting, and should be able to quickly post the rest of the research. The first step is to describe what NoSQL is and how it differs from the other databases you have been protecting, so you can understand the challenges. Let’s get started… NoSQL Overview In our last research paper on this subject, we defined NoSQL platforms with a set of characteristics which differentiate them from big iron/proprietary MPP/”cloud in the box” platforms. Yes, some folks slapped “big data” stickers on the same ‘ol stuff they have always sold, but most people we speak with now understand that those platforms are not Hadoop-like, so we can dispense with discussion of the essential characteristics. If you need to characterize NoSQL platforms please review our introductory sections on securing big data clusters. Fundamentally, NoSQL is clustered data analytics and management. And please stop calling it “big data” – we are really discussing a building-block approach to databases. Rather than the packaged relational systems we have grown accustomed to over the last two decades, we now assemble different pieces (data management, data storage, orchestration, etc.) to satisfy specific requirements. The early architects of this trend had a chip on their collective shoulders, and they were trying to escape the shadow of ubiquitous relational databases, so they called their movement ‘NoSQL’. That term nicely illustrates their opposition to relational platforms. Not that they do not support SQL – many in fact do support Structured Query Language syntax. Worse, the term “big data” was used by the press to describe this trend, assigning a label taken from the most obvious – but not most important – characteristic of these platforms. Unfortunately that term serves the movement poorly. ‘NoSQL’ is not much better but it is a step in the right direction. These databases can be tailored to focus on speed, size, analytic capabilities, failsafe operation, or some other goal, and they enable computation on a massive scale for very little money. But just as importantly, they are fully customizable to meet different needs – often simultaneously! So what does a NoSQL database look like? The “poster child” is Hadoop. The Hadoop framework (the combination of Hadoop File System (HDFS) with other services such as YARN, Hive, Pig, etc.) is the general architecture employed by most NoSQL clusters, but many more are in wide use – including Cassandra, MongoDB, Couch, AWS SimpleDB, and Riak. There are over 125 known variations, but those account for the majority of customer usage today. NoSQL platforms scale and perform so well because of two key principles: distribution of data management and query processing across many servers (possibly thousands), combined with a modular architecture that allows different services to be swapped in as needed. Architecturally a Hadoop cluster looks like this: It is useful to think of the Hadoop framework as a ‘stack’, much like the famous LAMP stack. These pieces are normally grouped together, but you can mix and match and add to the stack as needed. For example Sqoop and Hive are replacement data access services. You can select a big data environment specifically to support columnar, graph, document, XML, or multidimensional data – all collectively called ‘NoSQL’ because they are not constrained by relational database constructs or a relational query parser. You can install different query engines depending on the type of data being stored. Or you can extend HDFS functionality with logging tools like Scribe. The entire stack can be configured and extended as needed. I have included Lustre, GFS, and GPFS, as they are all technically alternatives to HDFS but not as widely used. But the point is that this modular approach offers great flexibility at the expense of more difficult security, because each option brings its own security options and deficiencies. We get big, cheap, and easy data management and processing – with lagging security capabilities. This diagram illustrates some of the many variables in play. It seems like every customer uses a slightly different setup – tweaking for performance, manageability, and programmer preference. Many of the people we spoke with are on their second or third stack architecture, having replaced components which did not scale or perform as needed. This flexibility is great for functionality, but makes it much more difficult for third-party vendors to produce monitoring or configuration assessment tools. There are few constants (or even knowns) for NoSQL clusters, and things are too chaotic to definitively identify best or worst practices across all configurations. For convenient reference, we offer a list of key differences between relational and NoSQL platforms which impact security. Relational platforms typically scale by replacing a server with a larger one, rather than by adding many new servers. Relational systems have a “walled garden” security model: you attach to the database through well-defined interfaces, but internal workings are generally not exposed. Relational platforms come with many tools such as built-in encryption, SQL validation, centralized administration, full support for identity management, built-in roles, administrative segregation of duties, and labeling capabilities. You can add many of these features to a NoSQL cluster, but still face the fundamental problem of securing a dynamic constellation of many servers. This makes configuration management, patching, and server validation particularly challenging. Despite a few security detractors, NoSQL facilitates data management and very fast analysis on large-scale data warehouses, at very low cost. Cheap, fast, and easy are the three pillars this movement has been built upon – data analytics for the masses. A NoSQL