Security Analytics with Big Data: Deployment Issues
By Adrian Lane
This is the last post in our Security Analytics with Big Data series. We will end with a discussion of deployment issues and concerns for any big data deployment, and focus on issues specific to leveraging SIEM. Please remember to post comments or ask questions and I will answer in the comments.
Install any big data cluster or SIEM solution that leverages big data, and you will notice that the documentation focuses on how to get up and running quickly and all the wonderful things you can do with the platform. The issues you really want to consider are left unsaid. You have to go digging for problems, but it is better to find them now than after you deploy. There are several important items, but the single biggest challenge today is finding talent to help program and manage big data.
Talent, or Lack Thereof
One of the principal benefits of big data clusters is the ability to apply different programmatic interfaces, or select different query and data management paradigms. This is how we are able to do complex analytics. This is how we get better analyses from the cluster. The problem is that you cannot use it if you cannot code it. The people who manage your SIEM are probably not developers. If you have a Security Operations Center (SOC), odds are many of them have some scripting and programming experience, but probably not with big data. Today’s programmatic interfaces mean you need programmers, and possibly data architects, who understand how to mine the data.
There is another aspect. When we talk to big data project architects and SOC personnel trying to identify attacks in event data, they don't always know what they are looking for at the outset. They find valuable information hidden in the data, but this isn't simply the magic of querying a big data cluster – the value comes from talented personnel, including statisticians, writing queries and analyzing the results. After a few dozen – or hundred – rounds of query and review, they start finding interesting things.
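To make that query-and-review loop concrete, here is a minimal sketch in Python. The event format, field names, and threshold are all invented for illustration – real event data and query interfaces will differ – but it shows the shape of the exploration: run an aggregation, review the results, adjust the threshold, and run it again.

```python
from collections import Counter

# Hypothetical event format: "timestamp,user,src_ip,action,status"
SAMPLE_EVENTS = [
    "2013-06-10T01:02:03,alice,10.0.0.5,login,failure",
    "2013-06-10T01:02:04,alice,10.0.0.5,login,failure",
    "2013-06-10T01:02:05,alice,10.0.0.5,login,failure",
    "2013-06-10T01:03:00,bob,10.0.0.9,login,success",
]

def failed_logins_by_source(events):
    """Count failed login attempts per source IP."""
    counts = Counter()
    for line in events:
        _ts, _user, src_ip, action, status = line.split(",")
        if action == "login" and status == "failure":
            counts[src_ip] += 1
    return counts

def suspicious_sources(events, threshold=3):
    """One round of review: flag sources at or above the current threshold.

    The analyst tunes `threshold` between rounds as they learn what
    'normal' looks like in their environment.
    """
    counts = failed_logins_by_source(events)
    return {ip for ip, n in counts.items() if n >= threshold}
```

In a real cluster this logic would be expressed as a MapReduce job, Hive query, or similar – which is exactly why you need people who can both write that code and interpret the results.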
People don’t use SIEM this way. They want to quickly set a policy and have it enforced. They want alerts on malicious activity with minimal work.
Those of you not using SIEM, who are building a security analytics cluster from scratch, should not even start the project without an architect to help with system design. Working from your project goals, the architect will help you with platform selection and basic system design. Building the system will take some doing as well: you need someone to help manage the cluster, and programmers to build the application logic and data queries. And you will need someone versed in attacker behaviors who knows what to look for and can help the programmers stitch things together. There are only a limited number of qualified people out there today who can perform these roles. As we like to say in development, the quality of the code is directly linked to the quality of the developer. Bad developer, crappy code. Fortunately many big data scientists, architects, and programmers are well educated, but most of them are new to both big data and security. That brilliant intern out of Berkeley is going to make mistakes, so expect some bumps along the way.
This is one area where you need to consider leveraging the experience of your SIEM vendor and third parties in order to see your project through.
Big data policy development is hard in the short term because, as we mentioned above, you cannot code your own policies without a programmer – and possibly a data architect and a statistician. SIEM vendors will eventually layer abstraction interfaces on top to simplify big data query development, but we are not there yet.
Because of this, you will be more dependent on your SIEM vendor and third party service providers than before. And your SIEM vendor has yet to build out all the capabilities you want from their big data infrastructure. They will get there, but we are still early in the big data lifecycle. In many cases the 'advancements' in SIEM will be to deliver previously advertised capabilities which now work as advertised. In other cases they will offer considerably deeper analysis because the queries run against more data. Most vendors have been working in this problem space for a decade and understand the classic technical limitations, but they finally have tools to address those issues, so they are addressing their thorniest issues first. And they can buttress existing near-real-time queries with better behavioral profiles, and provide slightly better risk analysis by looking at more data of more types.
One more facet of this difficulty merits public discussion. During a radical shift in data management systems, it is foolish to assume that a new (big data or other) platform will use the same queries, or produce exactly the same results. As we transition to new data management frameworks and query interfaces, the way we access and locate data changes. That is important because, even if we stick to a SQL-like query language and run equivalent queries, we may not get exactly the same results. So vet new and revised queries on the new platform to verify they yield correct information; whether the results are better, worse, or the same, you need to assess their quality.
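One simple way to vet migrated queries is to run the 'same' query on both platforms and diff the result sets. A minimal sketch, with entirely hypothetical result rows – real vetting would also account for row ordering, duplicates, and floating-point tolerances:

```python
def diff_results(old_rows, new_rows):
    """Compare result sets from the legacy and migrated platforms.

    Rows are treated as an unordered set of tuples for simplicity;
    returns (missing_from_new, unexpected_in_new). Both empty means
    the platforms agree on this query.
    """
    old, new = set(old_rows), set(new_rows)
    return old - new, new - old

# Hypothetical results of an equivalent query run on both platforms:
# (source_ip, failed_login_count)
legacy = {("10.0.0.5", 3), ("10.0.0.9", 1)}
migrated = {("10.0.0.5", 3), ("10.0.0.7", 2)}

missing, unexpected = diff_results(legacy, migrated)
```

Any row in `missing` or `unexpected` is a prompt to investigate whether the new platform's query semantics, data loading, or time windows differ from the old system's.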
Data Sharing and Privacy
We have talked about the different integration models. Some customers we spoke with want to leverage existing (non-security) information in their security analytics. Some are looking at creating partial copies of data stored in more traditional data mining systems, with the assumption that low-cost commodity storage makes the cost of duplication trivial. Others are looking to derive data from their existing clusters and import that information into Hadoop or their SIEM system. There is no 'right' way to approach this, and you need to decide based on what you want to accomplish, whether existing infrastructure provides benefits big data cannot, and any network bandwidth issues with moving information between these systems.
If you are considering moving sensitive data into your big data cluster, consider how you intend to protect it. This was not a question with traditional SIEM – both because the security model for relational databases was different and because big data is now leveraging more types of data. But your choice will be to secure the cluster itself, as we will discuss next, or to apply security to the data. Several firms we spoke with are using tokenization to substitute sensitive data prior to loading it into the cluster. The benefits of format and data type preservation are less critical in non-relational systems – one of the main benefits of tokenization in payment systems – but it does provide a means of referencing original data values if they are needed. Some firms use masking to strip out sensitive data while retaining value for analytics. Others use format preserving encryption for specific columns.
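To illustrate the tokenization approach, here is a toy sketch of substituting sensitive values before events are loaded into the cluster. The vault here is just an in-memory dict for illustration – a real deployment would use a hardened, access-controlled token vault (or a commercial tokenization product), not application memory:

```python
import secrets

class TokenVault:
    """Toy tokenizer: swap sensitive values for random tokens before
    loading events into the cluster. Analytics run on the tokens; the
    vault lets authorized users reference original values if needed."""

    def __init__(self):
        self._forward = {}   # sensitive value -> token
        self._reverse = {}   # token -> sensitive value

    def tokenize(self, value):
        # Consistent tokenization: the same value always maps to the
        # same token, so joins and counts in the cluster still work.
        if value not in self._forward:
            token = "tok_" + secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token):
        # Only authorized callers should reach this path in practice.
        return self._reverse[token]

vault = TokenVault()
event = {"user": "alice@example.com", "action": "login"}
# The cluster only ever sees the tokenized copy:
safe_event = {**event, "user": vault.tokenize(event["user"])}
```

Note the design choice to tokenize consistently: because the same input always yields the same token, the cluster can still group, count, and join on the tokenized field without ever holding the sensitive value.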
Which option to choose is not an easy question – it depends on how you want to use the data, and requires identifying a solution that will actually scale well enough to meet your needs. Still, if security of sensitive data is at issue, it is often easier to secure the data within the cluster than the cluster itself, given the current state of big data security.
Big Data Platform Security
NoSQL platforms generally offer poor security. The security features built into Hadoop are neither complete nor well thought out. With the exception of a couple commercial big data vendors who bundle security tools into their solutions, out of the box you do not get enough control to secure a NoSQL cluster.
Consider the following types of security controls:
- Data Encryption: To protect data at rest, ensure administrators or other applications cannot gain direct access to files, and prevent leaked information from exposure. We recommend file/OS level encryption because it scales as you add nodes and is transparent to NoSQL operations.
- Authentication and Authorization: Ensure that secure administrative passwords are in place and that application users must authenticate before gaining access to the cluster. Developer, user, and administrator roles should all be segregated. These capabilities are built into some distributions, and can link to internal directory management systems.
- Node Authentication: There is little protection from adding unwanted nodes and applications to a big data cluster, especially in cloud and virtual environments where it is trivial to copy a machine image and start a new instance. Tools like Kerberos help to ensure rogue nodes don’t issue queries or receive copies of the data.
- Key Management: Data encryption is only as strong as key security, so use an external key management system to secure keys and, if possible, help validate key usage.
- Logging: Logging is built into Hadoop and many other clusters. It may seem nonsensical to log system event data when using big data as a SIEM, but consider the security of the cluster as distinct from the security of all other network devices and applications. We recommend that you enable built-in logging or leverage one of the many open-source or commercial logging tools to capture a subset of system events.
- Network Protocol Security: SSL or TLS is built-in or available on most NoSQL distributions. If privacy is at all important, look to implement protocol security to keep your data private.
- Node Validation: Leverage tools to pre-configure, patch, and validate nodes before they are added to the cluster to ensure baseline security. Most customers we spoke with take this approach in virtual or cloud environments, which offer incredibly simple tools for pre-deployment validation.
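As a sketch of what node validation might check, here is a minimal baseline comparison in Python. The baseline fields (patch level, TLS flag, open ports) and their values are invented for illustration – a real deployment would pull this from configuration management or a cloud image pipeline:

```python
# Hypothetical security baseline a node must match before joining the cluster.
BASELINE = {
    "os_patch_level": "2013-05",          # minimum patch date (YYYY-MM)
    "tls_enabled": True,                  # protocol security must be on
    "open_ports": frozenset({22, 50010, 50020}),  # allowed listeners
}

def validate_node(reported):
    """Compare a node's reported configuration against the baseline.

    Returns a list of violations; an empty list means the node passes
    and may be added to the cluster.
    """
    problems = []
    if reported.get("os_patch_level", "") < BASELINE["os_patch_level"]:
        problems.append("os patch level below baseline")
    if not reported.get("tls_enabled"):
        problems.append("TLS not enabled")
    extra = frozenset(reported.get("open_ports", ())) - BASELINE["open_ports"]
    if extra:
        problems.append("unexpected open ports: %s" % sorted(extra))
    return problems

good = {"os_patch_level": "2013-06", "tls_enabled": True, "open_ports": [22, 50010]}
bad = {"os_patch_level": "2013-01", "tls_enabled": False, "open_ports": [22, 23]}
```

The same check helps against the rogue-node problem mentioned above: a copied machine image that drifts from the baseline simply never gets admitted.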
If you are buying a SIEM-based system you will need to check how the vendor secures it. They likely use a subset of these tools, but as they move from a monolithic single-repository model to a cluster these controls become more important. Again, if you hear the word 'proprietary', you need to dig into how the vendor addresses these security challenges.