This is the last post in our Security Analytics with Big Data series. We will end with a discussion of deployment issues and concerns for any big data deployment, and focus on issues specific to leveraging SIEM. Please remember to post comments or ask questions and I will answer in the comments.

Install any big data cluster or SIEM solution that leverages big data, and you will notice that the documentation focuses on how to get up and running quickly and all the wonderful things you can do with the platform. The issues you really want to consider are left unsaid. You have to go digging for problems, but better find them now than after you deploy. There are several important items, but the single biggest challenge today is finding talent to help program and manage big data.

Talent, or Lack Thereof

One of the principal benefits of big data clusters is the ability to apply different programmatic interfaces, or select different query and data management paradigms. This is how we are able to do complex analytics. This is how we get better analyses from the cluster. The problem is that you cannot use it if you cannot code it. The people who manage your SIEM are probably not developers. If you have a Security Operations Center (SOC), odds are many of them have some scripting and programming experience, but probably not with big data. Today’s programmatic interfaces mean you need programmers, and possibly data architects, who understand how to mine the data.

There is another aspect. When we talk to big data project architects, like SOC personnel trying to identify attacks in event data, they don’t always know what they are looking for. They find valuable information hidden in the data, but this isn’t simply the magic of querying a big data cluster – the value comes from talented personnel, including statisticians, writing queries and analyzing the results. After a few dozen – or hundred – rounds of query and review, they start finding interesting things.
People don’t use SIEM this way. They want to quickly set a policy and have it enforced. They want alerts on malicious activity with minimal work.

Those of you not using SIEM, who are building a security analytics cluster from scratch, should not even start the project without an architect to help with system design. Working from your project goals, the architect will help you with platform selection and basic system design. Building the system will take some doing as well: you need someone to help manage the cluster, and programmers to build the application logic and data queries. And you will need someone versed in attacker behaviors to know what to look for and help the programmers stitch things together.

There are only a limited number of qualified people out there today who can perform these roles. As we like to say in development, the quality of the code is directly linked to the quality of the developer. Bad developer, crappy code. Fortunately many big data scientists, architects, and programmers are well educated, but most of them are new to both big data and security. That brilliant intern out of Berkeley is going to make mistakes, so expect some bumps along the way. This is one area where you need to consider leveraging the experience of your SIEM vendor and third parties to see your project through.

Policy Development

Big data policy development is hard in the short term because, as we mentioned above, you cannot code your own policies without a programmer, and possibly a data architect and a statistician. SIEM vendors will eventually strap on abstraction interfaces to simplify big data query development, but we are not there yet. Because of this, you will be more dependent on your SIEM vendor and third-party service providers than before. And your SIEM vendor has yet to build out all the capabilities you want from their big data infrastructure. They will get there, but we are still early in the big data lifecycle.
In many cases the ‘advancements’ in SIEM will be to deliver previously advertised capabilities which now work as advertised. In other cases they will offer considerably deeper analysis because the queries run against more data. Most vendors have been working in this problem space for a decade and understand the classic technical limitations, and now they finally have tools to address those issues. So they are addressing their thorniest issues first. They can also buttress existing near-real-time queries with better behavioral profiles, and provide slightly better risk analysis by looking at more data of more types.

One more facet of this difficulty merits a public discussion. During a radical shift in data management systems, it is foolish to assume that a new (big data or other) platform will use the same queries, or produce exactly the same results. Vet new and revised queries on the new platforms to verify they yield correct information. As we transition to new data management frameworks and query interfaces, the way we access and locate data changes. That is important because, even if we stick to a SQL-like query language and run equivalent queries, we may not get exactly the same results. Whether better, worse, or the same, you need to assess the quality of the new results.

Data Sharing and Privacy

We have talked about the different integration models. Some customers we spoke with want to leverage existing (non-security) information in their security analytics. Some are looking at creating partial copies of data stored in more traditional data mining systems, with the assumption that lower-cost commodity storage makes the iterative cost trivial. Others are looking to derive data from their existing clusters and import that information into Hadoop or their SIEM system.
There is no ‘right’ way to approach this, and you need to decide based on what you want to accomplish, whether existing infrastructure provides benefits big data cannot, and any network bandwidth issues with moving information between these systems. If you