Posted at Monday 26th July 2010 5:16 pm
(1) Comments •
By Adrian Lane
We have covered the internals of token servers and talked about architecture and integration of token services. Now we need to look at some of the different deployment models and how they match up to different types of businesses. Protecting medical records in multi-company environments is a very different challenge than processing credit cards for thousands of merchants.
Central Token Server
The most common deployment model we see today is a single token server that sits between application servers and the back end transaction servers. The token server issues one or more tokens for each instance of sensitive information that it receives. For most applications it becomes a reference library, storing sensitive information within a repository and providing an index back to the real data as needed. The token service is placed in line with existing transaction systems, adding a new substitution step between business applications and back-end data processing.
As mentioned in previous posts, this model is excellent for security as it consolidates all the credit card data into a single highly secure server; additionally, it is very simple to deploy as all services reside in a single location. And limiting the number of locations where sensitive data is stored and accessed both improves security and reduces auditing, as there are fewer systems to review.
A central token server works well for small businesses with consolidated operations, but does not scale well for larger distributed organizations. Nor does it provide the reliability and uptime demanded by always-on Internet businesses. For example:
- Latency: The creation of a new token, lookup of existing customers, and data integrity checks are computationally complex. Most vendors have worked hard to alleviate this problem, but some still have latency issues that make them inappropriate for financial/point of sale usage.
- Failover: If the central token server breaks down, or is unavailable because of a network outage, all processing of sensitive data (such as orders) stops. Back-end processes that require tokens halt.
- Geography: Remote offices, especially those in remote geographic locations, suffer from network latency, routing issues, and Internet outages. Remote token lookups are slow, and both business applications and back-end processes suffer disproportionately in the event of disaster or prolonged network outages.
To overcome issues in performance, failover, and network communications, several other deployment variations are available from tokenization vendors.
Distributed Token Servers
With distributed token servers, the token databases are copied and shared among multiple sites. Each site has a copy of the tokens and encrypted data. In this model, each site is a peer of the others, with full functionality.
This model solves some of the performance issues with network latency for token lookup, as well as failover concerns. Since each token server is a mirror, if any single token server goes down, the others can share its load. Token generation overhead is mitigated, as multiple servers assist in token generation and distribution of requests balances the load. Distributed servers are costly but appropriate for financial transaction processing.
While this model offers the best option for uptime and performance, synchronization between servers requires careful consideration. Multiple copies mean synchronization challenges: carefully timed updates of data between locations, along with key management so encrypted credit card numbers can be accessed at every site. Finally, with multiple databases all serving tokens, the number of repositories that must be secured, maintained, and audited increases substantially.
Partitioned Token Servers
In a partitioned deployment, a single token server is designated as ‘active’, and one or more additional token servers are ‘passive’ backups. In this model if the active server crashes or is unavailable a passive server becomes active until the primary connection can be re-established. The partitioned model improves on the central model by replicating the (single, primary) server configuration. These replicas are normally at the same location as the primary, but they may also be distributed to other locations. This differs from the distributed model in that only one server is active at a time, and they are not all peers of one another.
Conceptually partitioned servers support a hybrid model where each server is active and used by a particular subset of endpoints and transaction servers, as well as a backup for other token servers. In this case each token server is assigned a primary responsibility, but can take on secondary roles if another token server goes down. While the option exists, we are unaware of any customers using it today.
The partitioned model solves failover issues: if a token server fails, the passive server takes over. Synchronization is easier with this model as the passive server need only mirror the active server, and bi-directional synchronization is not required. Token servers leverage the mirroring capabilities built into the relational database engines, as part of their back ends, to provide this capability.
Next we will move on to use cases.
Posted at Monday 26th July 2010 3:08 pm
(0) Comments •
Our last post covered the core functions of the tokenization server. Today we’ll finish our discussion of token servers by covering the externals: the primary architectural models, how other applications communicate with the server(s), and supporting systems management functions.
There are three basic ways to build a token server:
- Stand-alone token server with a supporting back-end database.
- Embedded/integrated within another software application.
- Fully implemented within a database.
Most of the commercial tokenization solutions are stand-alone software applications that connect to a dedicated database for storage, with at least one vendor bundling their offering into an appliance. All the cryptographic processes are handled within the application (outside the database), and the database provides storage and supporting security functions. Token servers use standard database management systems, such as Oracle and SQL Server, locked down very tightly for security. These may be on the same physical (or virtual) system, on separate systems, or integrated into a load-balanced cluster. In this model (stand-alone server with DB back-end) the token server manages all the database tasks and communications with outside applications. Direct connections to the underlying database are restricted, and cryptographic operations occur within the tokenization server rather than the database.
In an embedded configuration the tokenization software is embedded into the application and supporting database. Rather than introducing a token proxy into the workflow of credit card processing, existing application functions are modified to implement tokens. To users of the system there is very little difference in behavior between embedded token services and a stand-alone token server, but on the back end there are two significant differences. First, this deployment model usually involves some code changes to the host application to support storage and use of the tokens. Second, each token is only useful for one instance of the application. Token server code, key management, and storage of the sensitive data and tokens all occur within the application. The tightly coupled nature of this model makes it very efficient for small organizations, but does not support sharing tokens across multiple systems, and large distributed organizations may find performance inadequate.
Finally, it’s technically possible to manage tokenization completely within the database without the need for external software. This option relies on stored procedures, native encryption, and carefully designed database security and access controls. Used this way, tokenization is very similar to most data masking technologies. The database automatically parses incoming queries to identify and encrypt sensitive data. The stored procedure creates a random token – usually from a sequence generator within the database – and returns the token as the result of the user query. Finally all the data is stored in a database row. Separate stored procedures are used to access encrypted data. This model was common before the advent of commercial third party tokenization tools, but has fallen into disuse due to its lack of advanced security features and failure to leverage external cryptographic libraries & key management services.
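As an illustration, that in-database pattern can be sketched with SQLite, using an autoincrementing primary key as the sequence generator. The `encrypt_value` stub below is a placeholder only – a real deployment would use the database's native strong encryption:

```python
import sqlite3

def encrypt_value(plaintext: str) -> str:
    # Placeholder: reverses the string so the sketch is self-contained.
    # Applying it twice restores the original. Not real encryption.
    return plaintext[::-1]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE vault (
    token INTEGER PRIMARY KEY AUTOINCREMENT,  -- the sequence generator
    ciphertext TEXT NOT NULL
)""")

def tokenize(pan: str) -> int:
    """Analogue of the stored procedure: encrypt, insert, return the token."""
    cur = conn.execute("INSERT INTO vault (ciphertext) VALUES (?)",
                       (encrypt_value(pan),))
    conn.commit()
    return cur.lastrowid

def detokenize(token: int) -> str:
    """Analogue of the separate stored procedure for authorized access."""
    row = conn.execute("SELECT ciphertext FROM vault WHERE token = ?",
                       (token,)).fetchone()
    return encrypt_value(row[0])

t = tokenize("4111111111111111")
```

The token and the encrypted value land in the same row, which is exactly why this model offers so little compartmentalization compared to a dedicated token server.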
There are a few more architectural considerations:
- External key management and cryptographic operations are typically an option with any of these architectural models. This allows you to use more-secure hardware security modules if desired.
- Large deployments may require synchronization of multiple token servers in different, physically dispersed data centers. This support must be a feature of the token server, and is not available in all products. We will discuss this more when we get to usage and deployment models.
- Even when using a stand-alone token server, you may also deploy software plug-ins to integrate and manage additional databases that connect to the token server. This doesn’t convert the database into a token server, as we described in our second option above, but supports communications for distributed systems that need access to either the token or the protected data.
Since tokenization must be integrated with a variety of databases and applications, there are three ways to communicate with the token server:
- Application API calls: Applications make direct calls to the tokenization server procedural interface. While at least one tokenization server requires applications to explicitly access the tokenization functions, this is now a rarity. Because of the complexity of the cryptographic processes and the need for precise use of the tokenization server, vendors now supply software agents, modules, or libraries to support the integration of token services. These reside on the same platform as the calling application. Rather than recoding applications to use the API directly, these supporting modules accept existing communication methods and data formats. This reduces code changes to existing applications, and provides better security – especially for application developers who are not security experts. These modules then format the data for the tokenization API calls and establish secure communications with the tokenization server. This is generally the most secure option, as the code includes any required local cryptographic functions – such as encrypting a new piece of data with the token server’s public key.
- Proxy Agents: Software agents that intercept database calls (for example, by replacing an ODBC or JDBC component). In this model the process or application that sends sensitive information may be entirely unaware of the token process. It sends data as it normally does, and the proxy agent intercepts the request. The agent replaces sensitive data with a token and then forwards the altered data stream. These reside on the token server or its supporting application server. This model minimizes application changes, as you only need to replace the application/database connection and the new software automatically manages tokenization. But it does create potential bottlenecks and failover issues, as it runs in-line with existing transaction processing systems.
- Standard database queries: The tokenization server intercepts and interprets the requests. This is potentially the least secure option, especially for ingesting content to be tokenized.
While it sounds complex, there are really only two functions to implement:
- Send new data to be tokenized and retrieve the token.
- When authorized, exchange the token for the protected data.
The server itself should handle pretty much everything else.
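Those two functions can be sketched in a few lines. All names here (`TokenServer`, `tokenize`, `detokenize`) are hypothetical, and storage, encryption, and real authentication are elided:

```python
import secrets

class TokenServer:
    def __init__(self):
        self._vault = {}          # token -> protected data
        self._index = {}          # data  -> existing token
        self._authorized = set()  # callers approved for de-tokenization

    def tokenize(self, caller: str, data: str) -> str:
        """Send new data, receive a token (reusing any existing one)."""
        # A real server would also verify 'caller' against approved apps.
        if data in self._index:
            return self._index[data]
        token = secrets.token_hex(8)
        self._vault[token] = data
        self._index[data] = token
        return token

    def detokenize(self, caller: str, token: str) -> str:
        """Exchange a token for the protected data -- only when authorized."""
        if caller not in self._authorized:
            raise PermissionError("caller not approved for de-tokenization")
        return self._vault[token]

server = TokenServer()
server._authorized.add("settlement-app")
tok = server.tokenize("pos-terminal-7", "4111111111111111")
```

Note that the second call is gated while the first is not – most endpoints only ever need to submit data and receive tokens back.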
Finally, as with any major application, the token server includes various management functions. But due to security needs, these tend to have additional requirements:
- User management, including authentication, access, and authorization – for user, application, and database connections. Additionally, most tokenization solutions include extensive separation of duties controls to limit administrative access to the protected data.
- Backup and recovery for the stored data, system configuration and, if encryption is managed on the token server, encryption keys. The protected data is always kept encrypted for backup operations.
- Logging and reporting – especially logging of system changes, administrative access, and encryption key operations (such as key rotation). These reports are often required to meet compliance needs, especially for PCI.
In our next post we’ll go into more detail on token server deployment models, which will provide more context for all of this.
Posted at Friday 23rd July 2010 4:22 pm
(0) Comments •
By Mike Rothman
There is nothing like a good old-fashioned mud-slinging battle. As long as you aren’t the one covered in mud, that is. I read about the Death of Snort and started laughing. The first thing they teach you in marketing school is when no one knows who you are, go up to the biggest guy in the room and kick them in the nuts. You’ll get your ass kicked, but at least everyone will know who you are.
That’s exactly what the folks at OISF (who drive the Suricata project) did, and they got Ellen Messmer of NetworkWorld to bite on it. Then she got Marty Roesch to fuel the fire and the end result is much more airtime than Suricata deserves. Not that it isn’t interesting technology, but to say it’s going to displace Snort any time soon is crap. To go out with a story about Snort being dead is disingenuous. But given the need to drive page views, the folks at NWW were more than willing to provide airtime. Suricata uses Snort signatures (for the most part) to drive its rule base. They’d better hope it’s not dead.
But it brings up a larger issue of when a technology really is dead. In reality, there are few examples of products really dying. If you ended up with some ConSentry gear, then you know the pain of product death. But most products are around ad infinitum, even if they aren’t evolved. So those products aren’t really dead, they just become irrelevant. Take Cisco MARS as an example. Cisco isn’t killing it, it’s just not being used as a multi-purpose SIEM, which is how it was positioned for years. Irrelevant in the SIEM discussion, yes. Dead, no.
Ultimately, competition is good. Suricata will likely push the Snort team to advance their technology faster than in the absence of an alternative. But it’s a bit early to load Snort onto the barbie – even if it is the other white meat. Yet, it usually gets back to the reality that you can’t believe everything you read. Actually you probably shouldn’t believe much that you read. Except our stuff, of course.
Photo credit: “Roasted pig (large)” originally uploaded by vnoel
Posted at Friday 23rd July 2010 2:00 pm
(0) Comments •
By Adrian Lane
A couple weeks ago I was sitting on the edge of the hotel bed in Boulder, Colorado, watching the immaculate television. A US-made 30” CRT television in “standard definition”. That’s cathode ray tube for those who don’t remember, and ‘standard’ is the marketing term for ‘low’. This thing was freaking horrible, yet it was perfect. The color was correct. And while the contrast ratio was not great, it was not terrible either. Then it dawned on me that the problem was not the picture, as this is the quality we used to get from televisions. Viewing an old set, operating exactly the same way they always did, I knew the problem was me. High def has so much more information, but the experience of watching the game is the same now as it was then. It hit me just how much our brains were filling in missing information, and we did not mind this sort of performance 10 years ago because it was the best available. We did not really see the names on the backs of football jerseys during those Sunday games, we just thought we did. Heck, we probably did not often make out the numbers either, but somehow we knew who was who. We knew where our favorite players on the field were, and the red streak on the bottom of the screen pounding a blue colored blob must be number 42. Our brain filled in and sharpened the picture for us.
Rich and I had been discussing experience bias, recency bias, and cognitive dissonance during our trip to Denver. We were talking about our recent survey and how to interpret the numbers without falling into bias traps. It was an interesting discussion of how people detect patterns, but like many of our conversations devolved into how political and religious convictions can cloud judgement. But not until I was sitting there, watching television in the hotel, did I realize how much our prior experiences and knowledge shape perception, derived value, and interpreted results. Mostly for the good, but unquestionably some bad.
Rich also sent me a link to a Michael Shermer video just after that, in which Shermer discusses patterns and self deception. You can watch the video and say “sure, I see patterns, and sometimes what I see is not there”, but I don’t think videos like this demonstrate how pervasive this built-in feature is, and how it applies to every situation we find ourselves in.
The television example of this phenomenon was more shocking than some others that have popped into my head since. I have been investing in and listening to high-end audio products such as headphones for years. But I never think about the illusion of a ‘soundstage’ right in front of me, I just think of it as being there. I know the guitar player is on the right edge of the stage, and the drummer is in the back, slightly to the left. I can clearly hear the singer when she turns her head to look at fellow band members during the song. None of that is really in front of me, but there is something in the bits of the digital facsimile on my hard drive that lets my brain recognize all these things, placing the scene right there in front of me.
I guess the hard part is recognizing when and how it alters our perception.
On to the Summary:
Blog Comment of the Week
Remember, for every comment selected, Securosis makes a $25 donation to Hackers for Charity. This week’s best comment goes to Jay Jacobs, in response to FireStarter: an Encrypted Value Is Not a Token.
@Adrian – I must be missing the point, my apologies, perhaps I’m just approaching this from too much of a cryptonerd perspective. Though, I’d like to think I’m not being overly theoretical.
To extend your example, any merchant that wants to gain access to the de-tokenized content, we will need to make a de-tokenization interface available to them. They will have the ability to get at the credit card/PAN of every token they have. From the crypto side, if releasing keys to merchants is unacceptable, require that merchants return ciphertext to be decrypted so the key is not shared… What’s the difference between those two?
Let’s say my cryptosystem leverages a networked HSM. Clients connect and authenticate, send in an account number and get back ciphertext. In order to reverse that operation, a client would have to connect and authenticate, send in cipher text and receive back an account number. Is it not safe to assume that the ciphertext can be passed around safely? Why should systems that only deal in that ciphertext be in scope for PCI when an equivalent token is considered out of scope?
Conversely, how do clients authenticate into a tokenization system? Because the security of the tokens (from an attackers perspective) is basically shifted to that authentication method. What if it’s a password stored next to the tokens? What if it’s mutual SSL authentication using asymmetric keys? Are we just back to needing good key management and access control?
My whole point is that, from my view point, I think encrypting data is getting a bad wrap when the problem is poorly implemented security controls. I don’t see any reason to believe that we can’t have poorly implemented tokenization systems.
If we can’t control access into a cryptosystem, I don’t see why we’d do any better controlling access to a token system. With PCI DSS saying tokenization is “better”, my guess is we’ll see a whole bunch of mediocre token systems that will eventually lead us to realize that hey, we can build just as craptastic tokenization systems as we have cryptosystems.
Posted at Friday 23rd July 2010 6:12 am
(0) Comments •
By Mike Rothman
Now that we’ve been through all the high-level process steps and associated subprocesses for monitoring firewalls, IDS/IPS, and servers; the next step is to start similarly digging into the processes for managing firewalls and IDS/IPS.
But before we begin let’s revisit all the processes and subprocesses for monitoring. We put all the high-level and subprocesses into one graphic, as a central spot for links into each step.
As with all our research, we appreciate any feedback you have on this process and the subprocess steps. It’s critical that we get this right, since we start developing metrics and building a cost model directly from these steps. So if you see something you don’t agree with, or perhaps do things a bit differently, let us know.
Posted at Thursday 22nd July 2010 4:47 pm
(4) Comments •
By Adrian Lane
In our previous post we covered token creation, a core feature of token servers. Now we’ll discuss the remaining behind-the-scenes features of token servers: securing data, validating users, and returning original content when necessary. Many of these services are completely invisible to end users of token systems, and for day to day use you don’t need to worry about the details. But how the token server works internally has significant effects on performance, scalability, and security. You need to assess these functions during selection to ensure you don’t run into problems down the road.
For simplicity we will use credit card numbers as our primary example in this post, but any type of data can be tokenized. To better understand the functions performed by the token server, let’s recap the two basic service requests. The token server accepts sensitive data (e.g., credit card numbers) from authenticated applications and users, responds by returning a new or existing token, and stores the encrypted value when creating new tokens. This comprises 99% of all token server requests. The token server also returns decrypted information to approved applications when presented a token with acceptable authorization credentials.
Authentication is core to the security of token servers, which need to authenticate connected applications as well as specific users. To rebuff potential attacks, token servers perform bidirectional authentication of all applications prior to servicing requests. The first step in this process is to set up a mutually authenticated SSL/TLS session, and validate that the connection is started with a trusted certificate from an approved application. Any strong authentication should be sufficient, and some implementations may layer additional requirements on top.
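In Python's standard `ssl` module, that mutual authentication requirement might be configured roughly as follows. The certificate file names are hypothetical, and loading is guarded so the sketch stands alone:

```python
import os
import ssl

# Server-side TLS context that refuses connections from applications
# that cannot present a certificate signed by an approved CA.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.verify_mode = ssl.CERT_REQUIRED   # demand a client certificate
context.minimum_version = ssl.TLSVersion.TLSv1_2

if os.path.exists("token_server.pem"):
    # The token server's own certificate and private key (hypothetical path)
    context.load_cert_chain("token_server.pem")
if os.path.exists("approved_apps_ca.pem"):
    # CA that signs certificates of approved applications (hypothetical path)
    context.load_verify_locations("approved_apps_ca.pem")
```

Setting `verify_mode` to `CERT_REQUIRED` is what makes the authentication bidirectional: the client validates the server's certificate as usual, and the server rejects any client that fails to present a trusted one.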
The second phase of authentication is to validate the user who issues a request. In some cases this may be a system/application administrator using specific administrative privileges, or it may be one of many service accounts assigned privileges to request tokens or to request a given unencrypted credit card number. The token server provides separation of duties through these user roles – serving requests only from approved users, through allowed applications, from authorized locations. The token server may further restrict transactions – perhaps only allowing a limited subset of database queries.
Although technically the sensitive data in the token database might not be encrypted by the token server, in practice every implementation we are aware of encrypts the content. That means that prior to being written to disk and stored in the database, the data must be encrypted with an industry-accepted ‘strong’ encryption cipher. After the token is generated, the token server encrypts the credit card with a specific encryption key used only by that server. The data is then stored in the database, and thus written to disk along with the token, for safekeeping.
Every current tokenization server is built on a relational database. These servers logically group tokens, credit cards, and related information in a database row – storing these related items together. At this point, one of two encryption options is applied: either field level or transparent data encryption. In field level encryption just the row (or specific fields within it) is encrypted. This allows a token server to store data from different applications (e.g., credit cards from a specific merchant) in the same database, using different encryption keys. Some token systems leverage transparent database encryption (TDE), which encrypts the entire database under a single key. In these cases the database performs the encryption on all data prior to being written to disk. Both forms of encryption protect data from indirect exposure such as someone examining disks or backup media, but field level encryption enables greater granularity, with a potential performance cost.
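A sketch of the field level option, with a separate key per application, might look like this. The SHA-256 counter-mode keystream is purely illustrative – a real token server would use a vetted cipher, and the key material here is hypothetical:

```python
import hashlib

# Hypothetical per-application keys, one per merchant.
KEYS = {"merchant-a": b"key-for-merchant-a",
        "merchant-b": b"key-for-merchant-b"}

def xor_keystream(key: bytes, data: bytes) -> bytes:
    """Toy reversible cipher: XOR against a SHA-256 counter keystream.
    Applying it twice with the same key restores the plaintext."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def store_row(app_id: str, token: str, pan: str) -> dict:
    """Encrypt only the sensitive field, under that application's key.
    The token and app_id columns stay in the clear for lookups."""
    ciphertext = xor_keystream(KEYS[app_id], pan.encode())
    return {"token": token, "app_id": app_id, "pan_enc": ciphertext}

row = store_row("merchant-a", "tok-001", "4111111111111111")
```

The granularity benefit is visible in the structure: because each row records which application it belongs to, each application's data can be protected under its own key, which TDE's single database-wide key cannot offer.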
The token server bundles encryption, hashing, and random number generation features – both to create tokens and to encrypt network sessions and stored data.
Finally, some implementations use asymmetric encryption to protect the data as it is collected within the application (or on a point of sale device) and sent to the server. The data is encrypted with the server’s public key. The connection session will still typically be encrypted with SSL/TLS as well, but to support authentication rather than for any claimed security increase from double encryption. The token server becomes the back end point of decryption, using the private key to regenerate the plaintext prior to generating the proxy token.
Any time you have encryption, you need key management. Key services may be provided directly from the vendor of the token services in a separate application, or by hardware security modules (HSM), if supported. Either way, keys are kept separate from the encrypted data and algorithms, providing security in case the token server is compromised, as well as helping enforce separation of duties between system administrators. Each token server will have one or more unique keys – not shared by other token servers – to encrypt credit card numbers and other sensitive data. Symmetric keys are used, meaning the same key is used for both encryption and decryption. Communication between the token and key servers is mutually authenticated and encrypted.
Tokenization systems also need to manage any asymmetric keys for connected applications and devices.
As with any encryption, the key management server/device/functions must support secure key storage, rotation, and backup/restore.
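One common way to support rotation is to version keys and track which version encrypted each record, then re-encrypt under the new version – sketched here with a placeholder XOR cipher standing in for real encryption:

```python
from itertools import cycle

# Versioned key store; key material here is purely illustrative.
keys = {1: b"old-key-material"}
current_version = 1

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Toy reversible cipher (repeating-key XOR); not real encryption."""
    return bytes(a ^ b for a, b in zip(data, cycle(key)))

# Each stored record tracks the key version that encrypted it.
records = [{"key_ver": 1,
            "pan_enc": xor_cipher(keys[1], b"4111111111111111")}]

def rotate_key(new_key: bytes) -> None:
    """Introduce a new key version and re-encrypt existing records."""
    global current_version
    current_version += 1
    keys[current_version] = new_key
    for rec in records:
        plaintext = xor_cipher(keys[rec["key_ver"]], rec["pan_enc"])
        rec["pan_enc"] = xor_cipher(new_key, plaintext)
        rec["key_ver"] = current_version

rotate_key(b"new-key-material")
```

Tracking the version per record also lets rotation proceed incrementally on large databases, since old and new keys remain available until every record has been migrated.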
Token storage is one of the more complicated aspects of token servers. How tokens are used to reference sensitive data or previous transactions is a major performance concern. Some applications require additional security precautions around the generation and storage of tokens, so tokens are not stored in a directly reference-able format. Use cases such as financial transactions with either single-use or multi-use tokens can require convoluted storage strategies to balance security of the data against referential performance. Let’s dig into some of these issues:
- Multi-token environments: Some systems provide a single token to reference every instance of a particular piece of sensitive data. So a credit card used at a specific merchant site will be represented by a single token regardless of the number of transactions performed. This one-to-one mapping of data to token is easy from a storage standpoint, but fails to support some business requirements. There are many use cases for creating more than one token to represent a single piece of sensitive data, such as anonymizing patient data across different medical record systems, and credit cards used in multiple transactions with different merchants. Most token servers support the multiple-token model, enabling an arbitrary number of tokens to map to a given piece of data.
- Token lookup: Looking up a token in a token server is fairly straightforward: the sensitive data acts as the primary key by which data is indexed. But as the stored data is encrypted, incoming data must first be encrypted prior to performing the lookup. For most systems this is fast and efficient. For high volume servers used for processing credit card numbers the lookup table becomes huge, and token references take significant time to process. The volatility of the system makes traditional indexing unrealistic, so data is commonly lumped together by hash, grouped by merchant ID or some other scheme. In the worst case the token does not exist and must be created. The process is to encrypt the sensitive data, perform the lookup, create a new token if one does not already exist, and (possibly) perform token validation (e.g., LUHN checks). Since not all schemes work well for each use case, you will need to investigate whether the vendor’s performance is sufficient for your application. This is a case where pre-generated sequences or random numbers are used for their performance advantage over tokens based upon hashing or encryption.
- Token collisions: Token servers deployed for credit card processing have several constraints: they must keep the same basic format as the original credit card, expose the real last four digits, and pass LUHN checks. This creates an issue, as the number of tokens that meet these criteria is limited. The limited pool of LUHN-valid numbers with only 12 free digits creates a high likelihood of the same token being created and issued more than once – especially in multi-token implementations. Investigate what precautions your vendor takes to avoid or mitigate token collisions.
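The LUHN constraint above can be made concrete with a short sketch: a validity check, plus a generator that preserves the last four digits of the card. This is illustrative only – real token servers use vetted generation schemes with collision handling:

```python
import random

def luhn_valid(number: str) -> bool:
    """Standard LUHN check: double every second digit from the right,
    subtract 9 from results over 9, and require the sum to end in 0."""
    checksum = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def make_token(pan: str) -> str:
    """Random 16-digit token preserving the real last four, LUHN-valid."""
    last_four = pan[-4:]
    while True:
        candidate = "".join(random.choices("0123456789", k=12)) + last_four
        if luhn_valid(candidate) and candidate != pan:
            return candidate

token = make_token("4111111111111111")
```

With only 12 free digits and roughly one in ten candidates passing the LUHN check, the usable token space is small enough that independent generations will eventually repeat – which is exactly the collision risk described above.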
In our next post we will discuss how token servers communicate with other applications, and the supporting IT services they rely upon.
Posted at Thursday 22nd July 2010 1:18 pm
(0) Comments •
Alex Hutton has a wonderful must-read post on the Verizon security blog on Evidence Based Risk Management.
Alex and I (along with others including Andrew Jaquith at Forrester, as well as Adam Shostack and Jeff Jones at Microsoft) are major proponents of improving security research and metrics to better inform the decisions we make on a day to day basis. Not just generic background data, but the kinds of numbers that can help answer questions like “Which security controls are most effective under XYZ circumstances?”
You might think we already have a lot of that information, but once you dig in, the scarcity of good data is shocking. For example we have theoretical models on password cracking – but absolutely no validated real-world data on how password lengths, strengths, and forced rotation correlate with the success of actual attacks. There’s a ton of anecdotal information and reports of password cracking times – especially within the penetration testing community – but I have yet to see a single large data set correlating password practices against actual exploits.
I call this concept outcomes based security, which I now realize is just one aspect/subset of what Alex defines as Evidence Based Risk Management.
We often compare the practice of security with the practice of medicine. Practitioners of both fields attempt to limit negative outcomes within complex systems where external agents are effectively impossible to completely control or predict. When you get down to it, doctors are biological risk managers. Both fields are also challenged by having to make critical decisions with often incomplete information. Finally, while science is technically the basis of both fields, the pace and scope of scientific information is often insufficient to completely inform decisions.
My career in medicine started in 1990 when I first became certified as an EMT, and continued as I moved on to working as a full time paramedic. Because of this background, some of my early IT jobs also involved work in the medical field (including one involving Alex’s boss about 10 years ago). Early on I was introduced to the concepts of Evidence Based Medicine that Alex details in his post.
The basic concept is that we should collect vast amounts of data on patients, treatments, and outcomes – and use that to feed large epidemiological studies to better inform physicians. We could, for example, see under which circumstances medication X resulted in outcome Y on a wide enough scale to account for variables such as patient age, gender, medical history, other illnesses, other medications, etc.
You would probably be shocked at how little the practice of medicine is informed by hard data. For example if you ever meet a doctor who promotes holistic medicine, acupuncture, or chiropractic, they are making decisions based on anecdotes rather than scientific evidence – all those treatments have been discredited, with some minor exceptions for limited application of chiropractic… probably not what you used it for.
Alex proposes an evidence-based approach – similar to the one medicine is in the midst of slowly adopting – for security. Thanks to the Verizon Data Breach Investigations Report, Trustwave’s data breach report, and little pockets of other similar information, we are slowly gaining more fundamental data to inform our security decisions.
But EBRM faces the same near-crippling challenge as Evidence Based Medicine. In health care the biggest obstacle to EBM is the physicians themselves. Many rebel against the use of the electronic medical records systems needed to collect the data – sometimes for legitimate reasons like crappy software, and at other times due to a simple desire to retain direct control over information. The reason we have HIPAA isn’t to protect your health care data from a breach, but because the government had to step in and legislate that doctors must release and share your healthcare information – which they often considered their own intellectual property.
Not only do many physicians oppose sharing information – at least using the required tools – but they oppose any restrictions on their personal practice of medicine. Some of this is a legitimate concern – such as insurance companies restricting treatments to save money – but in other cases they just don’t want anyone telling them what to do – even optional guidance. Medical professionals are just as subject to cognitive bias as the rest of us, and as a low-level medical provider myself I know that algorithms and checklists alone are never sufficient in managing patients – a lot of judgment is involved.
But it is extremely difficult to balance personal experience and practices with evidence, especially when said evidence seems counterintuitive or conflicts with existing beliefs.
We face these exact same challenges in security:
- Organizations and individual practitioners often oppose the collection and dissemination of the raw data (even anonymized) needed to learn from experience and advance best practices.
- Individual practitioners, regulatory and standards bodies, and business constituents need to be willing to adjust or override their personal beliefs in the face of hard evidence, and to support evolution in security practices based on that evidence rather than personal experience.
Right now I consider the lack of data our biggest challenge, which is why we try to participate as much as possible in metrics projects, including our own. It’s also why I have an extremely strong bias towards outcome-based metrics rather than general risk/threat metrics. I’m much more interested in which controls work best under which circumstances, and how to make the implementation of said controls as effective and efficient as possible.
We are at the very beginning of EBRM. Despite all our research on security tools, technologies, vulnerabilities, exploits, and processes, the practice of security cannot progress beyond the equivalent of witch doctors until we collectively unite behind information collection, sharing, and analysis as the primary sources informing our security decisions.
Seriously, wouldn’t you really like to know when 90-day password rotation actually reduces risk vs. merely annoying users and wasting time?
Posted at Wednesday 21st July 2010 6:07 pm
(8) Comments •
By Mike Rothman
Now that we’ve decomposed each step in the Monitoring process in gory detail, we need to wrap things up by talking about monitoring and maintaining device health. The definition of a device varies – depending on your ability to invest in tools, you might have all sorts of different methods for collecting, storing, and analyzing events/logs.
One of the most commonly overlooked aspects of implementing a monitoring process is the effort required to actually keep things up and running. It includes stuff like up/down checks, patching, and upgrading hardware as necessary. All these functions take time, and if you are trying to really understand what it costs to monitor your environment, they must be modeled and factored into your analysis.
Here we make sure our equipment is operational and working in peak form. OK, perhaps not peak form, but definitely collecting data. Losing data has consequences for the monitoring process so we need to ensure all collectors, aggregators, and analyzers are operating – as well as patched and upgraded to ensure reliability at our scale.
This process is pretty straightforward, so let’s go through it:
- Availability checking: As with any other critical IT device you need to check availability. Is it up? You likely have an IT management system to do this up/down analysis, but there are also low-cost and no-cost ways to check availability of devices (it’s amazing what you can do with a scripting language…).
- Test data collection: The next step is to make sure the data being collected is as expected. You should have spot checks scheduled on a sample set of collectors to ensure collection works as required. You defined test cases during the Collect step, which you can leverage on an ongoing basis to ensure the accuracy and integrity of collected data.
- Update/Patch Software: The collectors run some kind of software on some kind of operating system. That operating system needs to be updated every so often to address security vulnerabilities, software defects, and other problems. Obviously if the collector is a purpose-built device (appliance), you may not need to specifically patch the underlying OS. At times you’ll also need to update the collection application, which is included in this step. We explored this involved process in Patch Management Quant, so we won’t go into detail again here.
- Upgrade hardware: Many of the monitoring systems nowadays use hardware-based collectors and purpose-built appliances for data aggregation, storage, and analysis. Thanks to Moore’s Law and the exponentially increasing amount of data we have to deal with, every couple years your monitoring devices hit the scalability wall. When this happens you need to research and procure new hardware, and then install it with minimum downtime for the monitoring system.
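As a concrete example of the low-cost availability checking mentioned above, here is a minimal sketch; the collector hostnames and syslog port are illustrative placeholders, not a reference to any real deployment:

```python
# Hypothetical sketch of a no-cost availability check: attempt a TCP
# connect to each collector's listening port and report up/down.
# Hostnames and ports below are illustrative assumptions.
import socket

COLLECTORS = [
    ("collector1.example.com", 514),
    ("collector2.example.com", 514),
]

def is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused, or timed out
        return False

for host, port in COLLECTORS:
    status = "up" if is_up(host, port) else "DOWN"
    print(f"{host}:{port} {status}")
```

Run it from cron and mail yourself the DOWN lines – that is roughly the level of effort a small shop needs before investing in a full IT management system.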
Device Type Variances
Many firewalls and IDS/IPS are deployed as appliances, meaning you will manage a separate collection mechanism for them. Many of the server monitoring techniques involve installing agents on the devices, and those may need to be patched and updated.
For this research we are talking specifically about monitoring, so we aren’t worried about keeping the actual devices up and running here (we deal with that in the NSOQ Manage process for Firewalls and IDS/IPS) – only about maintaining the collection devices. So depending on your collection architecture, you should not have much variation across the collection and analysis devices.
We go into such gory detail on these processes because someone has to do the work. Most organizations don’t factor health monitoring & maintenance into their analysis, which skews the cost model. If you use an outsourced monitoring service, you can be sure the service provider is maintaining the collectors, so you need to weigh those costs to get an apples-to-apples comparison of build vs. buy. Through the Network Security Operations Quant process we’re attempting to provide the tools to understand the cost of monitoring your environment. This will help you streamline operations and reduce cost, or make a more informed decision on whether outsourcing is right for your organization.
Posted at Wednesday 21st July 2010 2:13 pm
(3) Comments •
By Mike Rothman
Back when I went to sleepaway camp as a kid I always looked forward to Visiting Day. Mostly for the food, because after a couple weeks of camp food anything my folks brought up was a big improvement. But I admit it was great to see the same families year after year (especially the family that brought enough KFC to feed the entire camp) and to enjoy a day of R&R with your own family before getting back to the serious business of camping.
So I was really excited this past weekend when the shoe was on the other foot, and I got to be the parent visiting XX1 at her camp. First off I hadn’t seen the camp, so I had no context when I saw pictures of her doing this or that. But most of all, we were looking forward to seeing our oldest girl. She’s been gone 3 weeks now, and the Boss and I really missed her.
I have to say I was very impressed with the camp. There were a ton of activities for pretty much everyone. Back in my day, we’d entertain ourselves with a ketchup cap playing a game called Skully. Now these kids have go-karts, an adventure course, a zipline (from a terrifying looking 50 foot perch), ATVs and dirt bikes, waterskiing, and a bunch of other stuff. In the arts center they had an iMac-based video production and editing rig (yes, XX1 starred in a short video with her group), ceramics (including their own wheels and kiln), digital photography, and tons of other stuff. For boys there was rocketry and woodworking (including tabletop lathes and jigsaws). Made me want to go back to camp. Don’t tell Rich and Adrian if I drop offline for a couple weeks, okay?
Everything was pretty clean and her bunk was well organized, as you can see from the picture. Just like her room at home…not! Obviously the counselors help out and make sure everything is tidy, but with the daily inspections and work wheel (to assign chores every day), she’s got to do her part of keeping things clean and orderly. Maybe we’ll even be able to keep that momentum when she returns home.
Most of all, it was great to see our young girl maturing in front of our eyes. After only 3 weeks away, she is far more confident and sure of herself. It was great to see. Her counselors are from New Zealand and Mexico, so she’s gotten a view of other parts of the world and learned about other cultures, and is now excited to explore what the world has to offer. It’s been a transformative experience for her, and we couldn’t be happier.
I really pushed to send her to camp as early as possible because I firmly believe kids have to learn to fend for themselves in the world without the ever-present influence of their folks. The only way to do that is away from home. Camp provides a safe environment for kids to figure out how to get along (in close quarters) with other kids, and to do activities they can’t at home. That was based on my experience, and I’m glad to see it’s happening for my daughter as well. In fact, XX2 will go next year (2 years younger than XX1 is now) and she couldn’t be more excited after visiting.
But there’s more! An unforeseen benefit of camp accrues to us. Not just having one less kid to deal with over the summer – which definitely helps. But sending the kids to camp each summer will force us (well, really the Boss) to let go and get comfortable with the reality that at some point our kids will grow, leave the nest, and fly on their own. Many families don’t deal with this transition until college and it’s very disruptive and painful. In another 9 years we’ll be ready, because we are letting our kids fly every summer. And from where I sit, that’s a great thing.
Photo credits: “XX1 bunk” originally uploaded by Mike Rothman
Recent Securosis Posts
Wow. Busy week on the blog. Nice.
- Pricing Cyber-Policies
- FireStarter: An Encrypted Value is Not a Token!
- Tokenization: The Tokens
- Comments on Visa’s Tokenization Best Practices
- Friday Summary: July 15, 2010
- Tokenization Architecture – The Basics
- Color-blind Swans and Incident Response
- Home Business Payment Security
- Simple Ideas to Start Improving the Economics of Cybersecurity
- Various NSO Quant Posts on the Monitor Subprocesses:
Incite 4 U
We have a failure to communicate! – Chris makes a great point on the How is that Assurance Evidence? blog about the biggest problem we security folks face on a daily basis. It ain’t mis-configured devices or other typical user stupidity. It’s our fundamental inability to communicate. He’s exactly right, and it manifests in our lack of funds in the credibility bank, obviously impacting our ability to drive our security agendas. Holding a senior level security job is no longer about the technology. Not by a long shot. It’s about evangelizing the security program and persuading colleagues to think security first and to do the right thing. Bravo, Chris. Always good to get a reminder that all the security kung-fu in the world doesn’t mean crap unless the business thinks it’s important to protect the data. – MR
Cyber RF – I was reading Steven Bellovin’s post on Cyberwar, and the only thing that came to mind was Sun Tzu’s quote, “Victorious warriors win first and then go to war, while defeated warriors go to war first and then seek to win.” Don’t think I am one of those guys behind the ‘Cyberwar’ bandwagon, or who likes using war metaphors for football – this subject makes me want to gag. Like most posts on this subject, there is an interesting mixture of stuff I agree with, and an equal blend of stuff I totally disagree with. But the reason I loathe the term ‘Cyberwar’ finally dawned on me: it’s not war – it’s about winning through trickery. It’s about screwing someone over for whatever reason. It’s about stealing, undermining, propagandizing, damaging, and every other underhanded trick you use before you do something else underhanded. The term ‘Cyberwar’ creates a myopic over-dramatization that conjures images of guns, bombs, and dolphins with lasers strapped to their heads, when it’s really about getting what you want – whatever that may be. I prefer the term ‘Cyber – Ratfscking’, from a root term coined by Nixon staffers and perfected under the W administration. Sure, we could use plain old terms like ‘war’, ‘espionage’, and ‘theft’, but they do not capture the serendipity of old tricks in a new medium. And I really don’t think the threats have been exaggerated at all, because stealing half a billion dollars in R&D from a rival nation, or changing the outcome of an election, is incredibly damaging and/or useful. But focusing on ‘war’ removes the stigma of politics from the discussion, and makes it sound like a military issue when it’s a more generalized iteration of screwing over your fellow man. – AL
The SLA hammer hits the thumb – I once received a fair bit of guff over stating that your only consistent cloud computing security control is a good, well-written contract with enforceable service level agreements. It turns out even that isn’t always enough – at least if you are in Texas and hosting with IBM. Big Blue is about to lose an $863M contract with the state of Texas due to a string of massive failures. This was a massive project to merge 28 state agencies into two secure data centers, which has been nothing but a nightmare for the agencies involved. But what the heck, the 7-year contract started in 2006 and it only took 4 years to reach the “we really mean it this time” final 30-day warning. Needless to say, I have a Google alert set for 30 days from now to see what really happens. – RM
Defining risk – Jack Jones puts up an interesting thought generator when he asks “What is a risk anyway?” This is a reasonable question we should collectively spend more time on. Risk is one of those words that gets poked, prodded, and manipulated in all sorts of ways for all sorts of purposes. The term is so muddled that no one really knows what it means. But we are expected to reduce, transfer, or mitigate risk systematically, in a way that can easily be substantiated for our auditors. No wonder we security folks are a grumpy bunch! How the hell can we do that? Jack has some ideas, but mostly it’s about not trying to “characterize risks in terms of likelihood or consequence” (both of which are subjective), and instead focusing on getting the terminology right. Good advice. – MR
No SCADA to see here – Almost any time I post something on SCADA security, someone who works in that part of the industry responds with, “there’s no problem – our systems are all proprietary and bad guys can’t possibly figure out how to shut-the-grid-down/trigger-a-flood/blow-up-a-manufacturing-plant.” Not every SCADA engineer thinks like that, but definitely more than we’d like (zero would be the right number). I wonder how they feel about the new Windows malware variant that spreads via USB, and appears to target a specific SCADA system? Not that this attack is worth a 60 Minutes special, but it is yet another sign that someone seems to be targeting our infrastructure – or at minimum performing reconnaissance to learn how to break it. – RM
Buy that network person a beer – As an old networking guy, it’s a little discouraging to see the constant (and ever-present) friction between the security and networking teams. But that’s not going to change any time soon, so I have to accept it. Branden Williams makes a great point about how VLANs (and network segmentation in general) can help reduce scope for PCI – excellent for the security folks. Obviously the devil’s in the details, but clearly you have to keep devices accessing PAN on a separate network, which could mean a lot of things. But less scope is good, so if you don’t have a good relationship with the network team maybe it’s time to fix that. You should make a peace offering. I hear network folks like beer. Or maybe that was just me. – MR
Warm and fuzzy – The Microsoft blog had an article on Writing Fuzzable Code a couple weeks back that I am still trying to wrap my head around. OK, so fuzzing is an art when done right. Sure, to the average QA tester it just looks like you are hurling garbage at the application with a perverse desire to crash it – perhaps so you can heckle the programming team for their incompetence. Seriously, it’s a valuable approach to security testing and a wonderful way to flush out bad programming assumptions and execution. But the Man-in-the-middle approach they discuss is a bit of an oddball. A large percentage of firms capture network activity and replay those sessions with altered parameters and commands for fuzzing and stress testing. Sure, modification of data on the fly is an interesting way to create dynamic tests and keep the test cases up to date, but I am not certain there is enough value to justify fuzzing both producer and consumer as part of a single test. I am still unsure whether their goal was to harden the QA scripts or the communication protocols between two applications. Or perhaps the answer is both. This scenario creates a real-world debugging problem, though – transaction processing communications can get out of sync and crash at some indeterminate time later. The issue may be due to a transaction processing error, the communication dialog, or a plain old unhandled exception. I guess my point is that this seems to save time in test case generation at the expense of being much more difficult to debug. If anyone out there has real-world experience with this form of testing (either inside or outside Microsoft) I would love to hear about your experiences. I guess Microsoft decided on the more thorough (but difficult) test model, but I’m afraid that in most cases the problems will multiply fast, and the advantage in thoroughness (over testing the producer and consumer sides separately) is not enough to justify the inevitable debugging problems. And I’m afraid that for most organizations this level of ambition will make the whole fuzzing process miserable and substantially less useful. – AL
Clarifying the final rule – Thanks to HIPAA, healthcare is one of the anchor verticals for security, so I was surprised to see very little coverage of HHS’ issuance of the final rule for meaningful use. Ed over at SecurityCurve did the legwork and has two posts (Part 1 & Part 2) clarifying what it means and what it doesn’t. The new rules are really about electronic health records (EHR), and HHS has basically declared that the existing HIPAA guidelines are sufficient. They are mandating somewhat better assessment and risk management processes, but that seems pretty squishy. Basically it gets back to enforcement. EHR is a huge can of security worms waiting to be exploited, and unless there is a firm commitment to make examples of some organizations playing fast and loose with EHR, this is more of a ho-hum. But if they do, we could finally get some forward motion on healthcare security. – MR
Posted at Wednesday 21st July 2010 6:59 am
(2) Comments •
By Mike Rothman
In our last Network Security Operations Quant post we discussed the Analyze step. Its output is an alert, which means some set of conditions has been met that may indicate an attack. Great – that means we need to figure out whether there is a real issue or not. That’s the Validate subprocess. Once an alert is validated as an attack, someone will need to deal with it, so we need the Escalate step. These two subprocesses are interdependent so it makes more sense to deal with them together in one post.
In this step you work to understand what happened to generate the alert, assess the risk to your organization, and consider ways to remediate the issue. Pretty straightforward, right? In concept yes, but in reality generally not. Validation requires security professionals to jump into detective mode to piece together the data and build a story about the attack. Think CSI, but for real and without a cool lab. Add the pressure of a potentially company-damaging breach and you can see that validation is not for the faint of heart.
Let’s jump into the subprocesses and understand how this is done. Keep in mind that every security professional may go through these steps in a different order – depending on systems, instrumentation, preferences, and skill set.
- Alert reduction: The first action after an alert fires is to verify the data. Does it make sense? Is it a plausible attack? You need to eliminate the obvious false positives (and then get that feedback to the team generating alerts), basically applying a sniff test to the alert. A typical attack touches many parts of the technology infrastructure, so you’ll have a number of alerts triggered by the same attack. Part of the art of incident investigation is to understand which alerts are related and which are not. Mechanically, once you figure out which alerts may be related, you’ve got to merge them in whatever system you use to track alerts – even if it’s Excel.
- Identify root cause: If you have a legitimate alert, now you need to dig into the specific device(s) under attack and begin forensic analysis, to understand what is happening and the extent of the damage at the device level. This may involve log analysis, configuration checks, malware reverse engineering, memory analysis, and a host of other techniques. The focus of this analysis is to establish the root cause of the attack, so you can start figuring out what is required to fix it – whether it’s a configuration change, firewall or IDS/IPS change, a workaround, or something else. Feedback on the effectiveness of the alert – and how to make it better – then goes back to whoever manages the monitoring rules & policies.
- Determine extent of compromise: Now that you understand the attack, you need to figure out whether this was an isolated situation or you have badness proliferating through the environment, by analyzing other devices for indications of a similar attack. This can be largely automated with scanners, but not entirely. When you find another potentially compromised device you validate once again – now much quicker, since you know what you’re looking for.
- Document: As with the other steps, documentation is critical to make sure other folks know what’s happened, that the ops teams know enough to fix the issue, and that your organization can learn from the attack (post-mortem). So here you’ll close the alert and write up the findings in sufficient detail to inform other folks of what happened, how you suggest they fix it, and what defenses need to change to make sure this doesn’t happen again.
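As a toy illustration of the alert reduction step above, this sketch merges alerts that share a source address within a fixed time window. The field names and the 5-minute window are assumptions for the example – real correlation logic is far richer than this:

```python
# Illustrative alert-merging sketch (not from any real SIEM):
# group alerts by source address, then split each group wherever the
# gap between consecutive alerts exceeds an assumed correlation window.
from collections import defaultdict

WINDOW = 300  # seconds; assumed correlation window

def merge_alerts(alerts):
    """alerts: list of dicts with 'src', 'ts' (epoch secs), 'msg'.
    Returns groups of alerts presumed to belong to one attack."""
    by_src = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        by_src[alert["src"]].append(alert)
    groups = []
    for items in by_src.values():
        current = [items[0]]
        for alert in items[1:]:
            if alert["ts"] - current[-1]["ts"] <= WINDOW:
                current.append(alert)
            else:
                groups.append(current)
                current = [alert]
        groups.append(current)
    return groups

alerts = [
    {"src": "10.0.0.5", "ts": 100, "msg": "IDS: port scan"},
    {"src": "10.0.0.5", "ts": 160, "msg": "FW: blocked connect"},
    {"src": "192.168.1.9", "ts": 5000, "msg": "AV: trojan found"},
]
print(len(merge_alerts(alerts)))  # 2 merged incidents from 3 raw alerts
```

Even something this crude cuts the pile of raw alerts down to a handful of candidate incidents worth a human's time – which is the whole point of the reduction step.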
Large Company vs. Small Company Considerations
Finding the time to do real alert validation is a major challenge, especially for smaller companies that don’t have resources or expertise for heavy malware or log analysis. The best advice we can give is to have a structured and repeatable process for validating alerts. That means defining the tools and the steps involved in investigating an alert in pretty granular detail. Why go to this level? Because a documented process will help you figure out when you are in deep water (over your head on the forensic analysis), and help you decide whether to continue digging or just re-image the machine/device and move on. Maybe it’s clear some kind of Trojan got installed on an endpoint device. Does it matter which one, and what it does to the device? Not if you are just going to re-image it and start over. Of course that doesn’t enable you to really understand how your defenses need to change to defend against this successful attack, but you should at least be able to update rules and policies to more quickly identify the effects of the attack next time, even if you don’t fully understand the root cause.
Larger companies also need this level of documentation for how alerts are validated because they tend to have tiers of analysts. The lower-level analysts run through a series of steps to try to identify the issue. If they don’t come up with a good answer, they pass the question along to a group of more highly trained analysts who can dig deeper. Finally, you may have a handful of Ninja-type forensic specialists (but hopefully at least one) who tackle the nastiest stuff. The number of levels doesn’t matter, just that you have the responsibilities and handoffs between tiers defined.
Now that you’ve identified the attack and what’s involved, it needs to be fixed. This is what escalation is about. Every company has a different idea of who needs to do what, so the escalation path is a large part of defining your policies. Don’t forget the criticality of communicating these policies and managing expectations around responsiveness. Make sure everyone understands how and when something will fall into their lap, and what to do when it happens.
The Escalate subprocess includes:
- Open trouble ticket: You need a mechanism for notifying someone of their task/responsibility. That seems obvious but many security processes fail because separate teams don’t communicate effectively, and plenty of things fall through the cracks. We aren’t saying you need a fancy enterprise-class help desk system, but you do need some mechanism to track what’s happening, who is responsible, and status – and to eventually close out issues. Be sure to provide enough information in the ticket to ensure the responsible party can do their job. Coming back to you over and over again to get essential details is horribly inefficient.
- Route appropriately: Once the ticket is opened it needs to be sent somewhere. The routing rules & responsibilities are defined (and agreed upon) in the Planning phase, so none of this should be a surprise. You find the responsible party, send them the information, and follow up to make sure they got the ticket and understand what needs to be done. Yes, this step can be entirely automated with those fancy (or even not-so-fancy) help desk systems.
- Close alert: Accountability is critical to the ultimate success of any security program. So if the security team just sends an alert over the transom to the ops team and then washes their hands, we assure you things will fall through the cracks. The security team needs to follow up and ensure each issue is taken to closure. Again, a lot of this process can be automated, but we’ve found the most effective solution is to make sure someone’s behind is on the line for making sure the alert gets closed.
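The three steps above can be sketched as a minimal ticket workflow; all names here are illustrative rather than a reference to any real help desk system:

```python
# Hypothetical minimal ticket workflow: open with enough context, route
# to an owner, and require explicit closure so nothing falls through
# the cracks. Purely illustrative field and team names.
from dataclasses import dataclass, field
import itertools

_ids = itertools.count(1)

@dataclass
class Ticket:
    summary: str
    details: str
    assignee: str = "unassigned"
    status: str = "open"
    id: int = field(default_factory=lambda: next(_ids))

def open_ticket(summary: str, details: str) -> Ticket:
    return Ticket(summary, details)

def route(ticket: Ticket, team: str) -> Ticket:
    ticket.assignee = team
    ticket.status = "assigned"
    return ticket

def close(ticket: Ticket, resolution: str) -> Ticket:
    ticket.status = "closed"
    ticket.details += f"\nResolution: {resolution}"
    return ticket

t = open_ticket("Trojan on host X",
                "Root cause: phishing attachment; re-image required")
route(t, "desktop-ops")
close(t, "Host re-imaged, IDS rule updated")
print(t.status)
```

Even a spreadsheet can implement this state machine – the point is that every ticket carries its context and cannot silently stay open forever.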
Large Company vs. Small Company Considerations
The biggest differences you’ll see in this step between large and small companies is the number of moving pieces. Smaller companies tend to have a handful of security/ops folks at best, so there isn’t a lot of alert handoff & escalation, because in most cases the person doing the analysis is fixing the issue and then probably tuning the monitoring system as well.
Larger companies tend to have lots of folks involved in remediation and the like – so in this case documentation, managing expectations, and follow-up are imperative to successfully closing any issue.
But we don’t want to minimize the importance of documentation when closing tickets – even at smaller companies – because in some cases you’ll need to work with external parties or even auditors. If you (or another individual) are responsible for fixing something you validated above, we still recommend filling out a ticket and documenting the feedback on rule changes (even if you’re effectively writing notes to yourself). We realize this is extra work, but it’s amazing how quickly the details fade – especially when you are dealing with many different issues every day – and this documentation helps ensure you don’t make the same mistake twice.
So that’s our Monitoring process, all eight steps in gory detail: Enumerate/Scope, Define Policies, Collect/Store, Analyze, and Validate/Escalate. But we aren’t done yet. However you decide to monitor your environment, some gear is involved. You need to maintain that gear, and that takes time and costs money. We need to model that as well, so you understand what it really costs to monitor your environment.
Posted at Tuesday 20th July 2010 2:05 pm
(2) Comments •
By Mike Rothman
Every time I think I’m making progress on controlling my cynical gene, I see something that sets me back almost to square one. I’ve been in this game for a long time, and although I think subconsciously I know some things are going on, it’s still a bit shocking to see them in print.
What set me off this time is Richard Bejtlich’s brief thoughts on the WEIS 2010 (Workshop on the Economics of Information Security) conference. His first thoughts are around a presentation on cyber insurance. The presenter admitted that the industry has no expected loss data and no financial impact data. Really? They actually admitted that. But it gets better.
Your next question must be, “So how do they price the policies?” It certainly was mine. Yes! They have an answer for that: Price the policies high and see what happens. WHAT? Does Dr. Evil head their policy pricing committee? I can’t say I’m a big fan of insurance companies, and this is the reason why. They are basically making it up. Pulling the premiums out of their butts. Literally. And they would never err in favor of folks buying the policies, so you see high prices.
Clearly this is a chicken & egg situation. They don’t have data because no one shares it. So they write some policies to start collecting data, but they price the policies probably too high for most companies to actually buy. So they still have no data. And those looking for insurance don’t really have any options.
I guess I need to ask why folks are looking for cyber-insurance anyway? I can see the idea of trying to get someone else to pay for disclosure – those are hard costs. Maybe you can throw clean-up into that, but how could you determine what is clean-up required from a specific attack, and what is just crappy security already in place? It’s not like you are insuring Sam Bradford’s shoulder here, so you aren’t going to get a policy to reimburse for brand damage.
Back when I worked for TruSecure, the company had an “insurance” policy guaranteeing something in the event of a breach on a client certified using the company’s Risk Management Methodology. At some point the policy expired, and when trying to renew it, we ran across the same crap. We didn’t know how to model loss data – there was none because the process was perfect. LOL! And they didn’t either. So the quote came back off the charts. Then we had to discontinue the program because we couldn’t underwrite the risk.
Seems almost 7 years later, we’re still in the same place. Actually we’re in a worse place because the folks writing these policies are now aggressively working the system to prevent payouts (see Colorado Casualty/University of Utah) when a breach occurs.
I guess from my perspective cyber-insurance is a waste of time. But I could be missing something, so I’ll open it up to you folks – you’re collectively a lot smarter than me. Do you buy cyber-insurance? For what? Have you been able to collect on any claims? Is the policy just to make your board happy? To cover your ass and shuffle blame to the insurance company? Do tell. Please!
Photo credit: “Dr Evil 700 Billion” originally uploaded by Radio_jct
Posted at Monday 19th July 2010 5:04 pm
(6) Comments •
By Mike Rothman
Everyone in security knows data isn’t the problem. We have all sorts of data – tons of it. The last two steps in the Monitor process (Collect and Store) were focused on gathering data and putting it where we can get to it. What’s in short supply is information. Billions of event/log records don’t mean much if you can’t pinpoint what’s really happening at any given time and send actionable alerts.
So analyzing the data is the next subprocess. Every alert that fires requires a good deal of work – putting all that disparate data into a common format, correlating it, reducing it, finding appropriate thresholds, and then maybe, just maybe, you’ll be able to spot a real attack before it’s too late. Let’s decompose each of these steps to understand whether and how to do this in your environment.
In the high level monitoring process map we described the Analyze step as follows:
The collected data is then analyzed to identify potential incidents based on alerting policies defined in Phase 1. This may involve numerous techniques, including simple rule matching (availability, usage, attack traffic policy violations, time-based rules, etc.) and/or multi-factor correlation based on multiple device types (SIEM).
Wow, that was a mouthful. To break this into digestible steps that you might actually be able to perform, here are 5 subprocesses:
- Normalize Events/Data: In the Collect step we didn’t talk about putting all that data into a common format, but most analyses require some level of event data normalization. As vendor tools become more sophisticated and can do more analysis on unstructured data, the need for normalization is reduced, but we always expect some level of normalizing to be required – if only so we can compare apples to apples, across data types that are largely apples and oranges.
- Correlate: Once we have the data in a common format we look for patterns which may indicate some kind of attack. Of course we need to know what we are looking for to define the rules that we hope will identify attacks. We spoke extensively about setting up these policies in Define Policies. A considerable amount of correlation can be automated but not all of it, so human analysts aren’t going away.
- Reduce Events: Our systems are so interrelated now that any attack touches multiple devices, resulting in many similar events being received by the central event repository. This gets unwieldy quickly, so a key function of the analysis is to eliminate duplicate events. Basically you need to increase the signal-to-noise ratio, and filter out irrelevant events. Note that we don’t mean delete – merely move out of the main analysis engine to keep things streamlined. For forensics we want to retain the full log record.
- Tune Thresholds: If we see 2 failed logins, that might be an attack, or not. If we spot 100 in 10 minutes, something funky is going on. Each rule needs thresholds, below which there is no alert. Rules tend to look like “If [something] happens X times within Y minutes, fire an alert.” But defining X and Y is hard work. You start with a pretty loose threshold that generates many alerts, and quickly tune and tighten those thresholds to keep the number of alerts manageable. When building compound policies, like “if [this] happens X times within Y minutes AND [that] happens Z times within W minutes”, it’s even more fun. Keep in mind that thresholds are more art than science, and require plenty of testing and observation to determine the correct mix. You may also be setting thresholds on baselines set via data capture (this was explained in Define Policies), but these need good thresholds just as much.
- Trigger Alerts: Finally, after you’ve normalized, correlated, reduced, and blown past the threshold, you need to alert someone to something. In this step you send the alert based on the policies defined previously. The alert needs to go to the right person or team, with sufficient information to allow validation and support response. Depending on your internal workflow, the alert might be sent using the monitoring tool, a help desk system, paper, or smoke signals. Okay, smoke signals are out of style nowadays, but you get the point.
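To make these five steps a bit more concrete, here is a rough Python sketch of the analysis pipeline. Everything here is illustrative – the event fields, the single failed-login rule, and the X/Y threshold values are ours, not from any particular SIEM product:

```python
from collections import defaultdict

# Hypothetical rule: "if [failed login] happens X times within Y minutes, alert"
THRESHOLD_X = 100        # event count (X)
WINDOW_Y = 10 * 60       # window in seconds (Y = 10 minutes)

def normalize(raw):
    """Map device-specific fields onto a common schema (field names are illustrative)."""
    return {"time": raw["ts"],
            "src": raw.get("src_ip", "?"),
            "type": raw.get("event", "unknown")}

def analyze(raw_events):
    """Normalize, correlate against the rule, reduce to per-source buckets,
    apply the threshold, and trigger alerts."""
    events = [normalize(e) for e in raw_events]
    # Reduce: collapse matching events into per-source timestamp buckets
    buckets = defaultdict(list)
    for e in events:
        if e["type"] == "failed_login":          # correlate: match the rule
            buckets[e["src"]].append(e["time"])
    alerts = []
    for src, times in buckets.items():
        times.sort()
        # Threshold: X events inside a sliding Y-second window fires one alert
        for i in range(len(times) - THRESHOLD_X + 1):
            if times[i + THRESHOLD_X - 1] - times[i] <= WINDOW_Y:
                alerts.append({"rule": "failed_login_burst", "src": src})
                break
    return alerts
```

A real engine would run many rules over many device types, but the shape is the same: the hard part is picking X and Y, not writing the loop.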
The Most Powerful Correlation Engine
As you can imagine, there is a lot of number crunching involved in this kind of analysis. That usually requires a lot of computing power cranking away on zillions of records in huge data stores – at least it does if you deployed a tool to facilitate monitoring. I’ve met more than a few security professionals who use the world’s most powerful correlation engine much more effectively than any programmed tool. Yes, I’m talking about the human brain. These folks are rare, but they can watch event streams flying past and ‘instinctively’ know when something is not normal.
Is this something you can count on? Not entirely, but we don’t think you can solve this problem purely in software, or without using software. As usual, somewhere in the middle is best for most organizations. We have seen many situations where risk priorities are set via SIEM, and an analyst can then very quickly determine whether the issue requires further investigation/escalation. We’ll talk more about that when we discuss validating the alert.
The attack space is very dynamic, which means the correlation rules and thresholds for alerts that you use today will need to be adapted tomorrow. This doesn’t happen by itself, so your process needs to systematically factor in feedback from analysts about which rules are working and which aren’t. The rules and alerting thresholds get updated accordingly, and hopefully over time the system increases its effectiveness and value.
Device Type Variances
As discussed under Define Policies, each device type generates different data and so requires customized rules, reduction, and thresholds. But the biggest challenge in monitoring the different device types is figuring out the dependencies of rules that incorporate data from more than one device type. Many vendors ship their tools with a set of default policies that map to specific attack vectors and that’s a good start, but tuning those policies for your environment takes time. So when modeling these process steps (to understand the cost of delivering this service internally), we need to factor that demanding tuning process into the mix.
Large vs. Small Company Considerations
Although it’s probably not wise to make assumptions about large company behavior, on the surface a larger enterprise should be able to provide far more feedback and invest more in sophisticated instrumentation to automate a lot of the heavy-duty analysis. This is important because larger environments generate much more data, which makes manual/human analysis infeasible.
Small companies are generally more resource constrained – especially for the tuning process. If the thresholds are too loose, many of the alerts require validation, which is time consuming. With thresholds too tight, things can slip through the cracks. And getting adequate feedback to even go through the tuning process is a challenge when the entire team – which might be just one person – has an overflowing task list.
Monitoring systems aim to improve security and increase efficiency, but recognize that a significant time investment is required to get the system to a point where it generates value. And that investment is ongoing, because the system must be constantly tuned to ensure relevance over time.
Now that we have our alerts generated, we need to figure out if there is anything there. That’s Validation and Escalation, our next set of subprocesses. Stay tuned.
Posted at Monday 19th July 2010 3:13 pm
(1) Comments •
We’ve been writing a lot on tokenization as we build the content for our next white paper, and in Adrian’s response to the PCI Council’s guidance on tokenization. I want to address something that’s really been ticking me off…
In our latest post in the series we described the details of token generation. One of the options, which we had to include since it’s built into many of the products, is encryption of the original value – then using the encrypted value as the token.
Here’s the thing: If you encrypt the value, it’s encryption, not tokenization! Encryption obfuscates the original data; a token removes it.
Conceptually the major advantages of tokenization are:
- The token cannot be reversed back to the original value.
- The token maintains the same structure and data type as the original value.
While format preserving encryption can retain the structure and data type, it’s still reversible back to the original if you have the key and algorithm. Yes, you can add per-organization salt, but this is still encryption. I can see some cases where using a hash might make sense, but only if it’s a format preserving hash.
I worry that marketing is deliberately muddling the terms.
Opinions? Otherwise, I declare here and now that if you are using an encrypted value and calling it a ‘token’, that is not tokenization.
Posted at Monday 19th July 2010 6:35 am
(21) Comments •
By Adrian Lane
In this post we’ll dig into the technical details of tokens: what they are, how they are created, and some of the options for security, formatting, and performance. For those of you who read our stuff and tend to skim the more technical posts, I recommend you stop and pay a bit more attention to this one. Token generation and structure affect the security of the data, the ability to use the tokens as surrogates in other applications, and the overall performance of the system. In order to differentiate the various solutions, it’s important to understand the basics of token creation.
Let’s recap the process quickly. Each time sensitive data is sent to the token server, three basic steps are performed. First, a token is created. Second, the token and the original data are stored together in the token database. Third, the token is returned to the calling application. The goal is to protect sensitive data without losing functionality within applications, so we cannot simply substitute any random blob of data. The format of the token needs to match the format of the original data, so it can be used exactly as if it were the original (sensitive) data. For example, a Social Security token needs to have at least the same size (if not data type) as a Social Security number. Supporting applications and databases can accept the substituted value as long as it matches the constraints of the original value.
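A toy Python sketch of those three steps, with a plain dictionary standing in for the secured token database (real servers encrypt the stored values and enforce strict access controls):

```python
import secrets

class TokenServer:
    """Illustrative sketch of the create / store / return flow."""

    def __init__(self):
        self._vault = {}   # token -> original value; stands in for the token DB

    def tokenize(self, pan: str) -> str:
        # 1. Create: random digits matching the original's length (format preserved)
        token = "".join(secrets.choice("0123456789") for _ in pan)
        # 2. Store: token and original value are kept together in the token database
        self._vault[token] = pan
        # 3. Return the token to the calling application
        return token

    def detokenize(self, token: str) -> str:
        """Only the token server can map a token back to the original value."""
        return self._vault[token]
```

Because the token is random, nothing outside the server's database links it back to the original value.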
Let’s take a closer look at each of the steps.
There are three common methods for creating tokens:
- Random Number Generation: This method substitutes data with a random number or alphanumeric value, and is our recommended method. Completely random tokens offer the greatest security, as the content cannot be reverse engineered. Some vendors use sequence generators to create tokens, grabbing the next value in the series to use for the token – this is not nearly as secure as a fully randomized number, but is very fast and secure enough for most (non-PCI) use cases. A major benefit of random numbers is that they are easy to adapt to any format constraints (discussed in greater detail below), and the random numbers can be generated in advance to improve performance.
- Encryption: This method generates a ‘token’ by encrypting the data. Sensitive information is padded with a random salt to prevent reverse engineering, and then encrypted with the token server’s private key. The advantage is that the ‘token’ is reasonably secure from reverse engineering, but the original value can be retrieved as needed. The downsides, however, are significant – performance is very poor, Format Preserving Encryption algorithms are required, and data can be exposed when keys are compromised or guessed. Further, the PCI Council has not officially accepted format preserving cryptographic algorithms, and is awaiting NIST certification. Regardless, many large and geographically dispersed organizations that require access to original data favor the utility of encrypted ‘tokens’, even though this isn’t really tokenization.
- One-way Hash Function: Hashing functions create tokens by running the original value through a non-reversible mathematical operation. This offers reasonable performance, and tokens can be formatted to match any data type. Like encryption, hashes must be created with a cryptographic salt (some random bits of data) to thwart dictionary attacks. Unlike encryption, tokens created through hashing are not reversible. Security is not as strong as with fully random tokens, but security, performance, and formatting flexibility are all improved over encryption.
Beware that some open source and commercial token servers use token generation methods of dubious value: some use reversible masking, others unsalted encryption algorithms, and both can be easily compromised and defeated.
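For contrast with those weak approaches, here is what the salted one-way hash method described above might look like. This is purely illustrative – the digit-mapping trick stands in for a real format preserving hash, and the salt must be random bytes held only inside the token server:

```python
import hashlib

def hashed_token(value: str, salt: bytes, length: int = 16) -> str:
    """Non-reversible token via salted SHA-256, rendered as a digit string.

    The salt thwarts dictionary attacks against small input spaces like
    card numbers; without it, an attacker could hash every possible PAN
    and look tokens up in a table.
    """
    digest = hashlib.sha256(salt + value.encode()).hexdigest()
    # Map the hex digest onto a digit string of the required length
    # (a stand-in for a proper format preserving hash)
    return str(int(digest, 16))[:length].zfill(length)
```

Note this is deterministic for a given salt: the same input always yields the same token, which is exactly why the salt itself must never leave the token server.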
We mentioned the importance of token formats earlier, and token solutions need to be flexible enough to handle multiple formats for the sensitive data they accept – such as personally identifiable information, Social Security numbers, and credit card numbers. In some cases, additional format constraints must be honored. As an example, a token representing a Social Security Number in a customer service application may need to retain the real last digits. This enables customer service representatives to verify user identities, without access to the rest of the SSN.
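As a sketch of that kind of format constraint, here is a hypothetical helper that randomizes an SSN while preserving its layout and its real trailing digits. The function name and the keep_last parameter are ours, purely for illustration:

```python
import secrets

def ssn_token(ssn: str, keep_last: int = 4) -> str:
    """Random token that keeps the SSN's shape and real last digits.

    Digits in the leading portion are randomized; dashes (and any other
    non-digit characters) are left in place so downstream systems that
    expect the NNN-NN-NNNN layout still accept the value.
    """
    head, tail = ssn[:-keep_last], ssn[-keep_last:]
    randomized = "".join(secrets.choice("0123456789") if c.isdigit() else c
                         for c in head)
    return randomized + tail
```

A customer service application could display or verify the preserved tail without ever handling the full number.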
When tokenizing credit cards, tokens are the same size as the original credit card number – most implementations even ensure that tokens pass the LUHN check. As the token still resembles a card number, systems that use the card numbers need not be altered to accommodate tokens. But unlike the real credit card or Social Security numbers, tokens cannot be used as financial instruments, and have no value other than as a reference to the original transaction or real account. The relationship between a token and a card number is unique for any given payment system, so even if someone compromises the entire token database sufficiently that they can commit transactions in that system (a rare but real possibility), the numbers are worthless outside the single environment they were created for. And most important, real tokens cannot be decrypted or otherwise restored back into the original credit card number.
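To illustrate, here is how a token generator might produce random card-length values that pass the LUHN check. This is a sketch, not any vendor's implementation:

```python
import secrets

def luhn_checksum_ok(number: str) -> bool:
    """Standard LUHN check, as used to validate card-shaped numbers."""
    total = 0
    for i, d in enumerate(int(c) for c in reversed(number)):
        if i % 2 == 1:           # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def luhn_valid_token(length: int = 16) -> str:
    """Random digit token whose final digit is chosen so LUHN passes."""
    body = [secrets.randbelow(10) for _ in range(length - 1)]
    total = 0
    for i, d in enumerate(reversed(body)):
        if i % 2 == 0:           # these positions double once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    check = (10 - total % 10) % 10
    return "".join(map(str, body + [check]))
```

Because the token satisfies the same checksum as a real card number, validation routines in downstream systems accept it unchanged.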
Each data type has different use cases, and tokenization vendors offer various options to accomodate them.
Tokens, along with the data they represent, are stored within a heavily secured database with extremely limited access. The data is typically encrypted (per PCI recommendations), ensuring sensitive data is not lost in the event of a database compromise or stolen media. The token (database) server is the only point of contact with any transaction system, payment system, or collection point, which reduces risk and compliance scope. Access to the database is highly restricted, with administrative personnel denied read access to the data, and even authorized access to the original data limited to carefully controlled circumstances.
As tokens are used to represent the same data for multiple events, possibly across multiple systems, most token servers can issue different tokens for the same user data. A credit card number, for example, may get a different unique token for each transaction. The token server not only generates a new token to represent the new transaction, but is responsible for storing many tokens per user ID. Many use cases require that the token database support multiple tokens for each piece of original data, a one-to-many relationship. This provides better privacy and isolation, if the application does not need to correlate transactions by card number. Applications that rely on the sensitive data (such as credit card numbers) to correlate accounts or other transactions will require modification to use data which is still available (such as a non-sensitive customer number).
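A minimal sketch of that one-to-many relationship, with in-memory dictionaries standing in for the secured token database:

```python
import secrets
from collections import defaultdict

class MultiTokenVault:
    """One-to-many sketch: each tokenize() call for the same value yields a
    fresh token, and all of them map back to the one original value."""

    def __init__(self):
        self._token_to_value = {}                 # token -> original value
        self._value_to_tokens = defaultdict(list) # original value -> all its tokens

    def tokenize(self, value: str) -> str:
        # New random token per transaction, even for a repeat card number
        token = "".join(secrets.choice("0123456789") for _ in value)
        self._token_to_value[token] = value
        self._value_to_tokens[value].append(token)
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]

    def tokens_for(self, value: str) -> list:
        return list(self._value_to_tokens[value])
```

Since two transactions on the same card carry unrelated tokens, nothing outside the vault can tie them together – which is the privacy benefit, and also why correlating applications need a different key such as a customer number.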
Token servers may be internally owned and operated, or provided as a third party service. We will discuss deployment models in an upcoming post.
Token Storage in Applications
When the token server returns the token to the application, the application must safely store the token and effectively erase the original sensitive data. This is critical – not just to secure sensitive data, but also to maintain transactional consistency. An interesting side effect of preventing reverse engineering is that a token by itself is meaningless. It only has value in relation to some other information. The token server has the ability to map the tokenized value back to the original sensitive information, but is the only place this can be done. Supporting applications need to associate the token with something like a user name, transaction ID, merchant customer ID, or some other identifier. This means applications that use token services must be resilient to communications failures, and the token server must offer synchronization services for data recovery.
This is one of the largest potential weaknesses – whenever the original data is collected or requested from the token database, it might be exposed in memory, log files, or virtual memory. This is the most sensitive part of the architecture, and later we’ll discuss some of the many ways to split functions architecturally in order to minimize risk.
At this point you should have a good idea of how tokens are generated, structured, and stored. In our next posts we’ll dig deeper into the architecture as we discuss tokenization server requirements and functions, application integration, and some sample architectural use cases.
Posted at Monday 19th July 2010 4:29 am
(3) Comments •