Tokenization: The TokensBy Adrian Lane
In this post we’ll dig into the technical details of tokens. What they are and how they are created; as well as some of the options for security, formatting, and performance. For those of you who read our stuff and tend to skim the more technical posts, I recommend you stop and pay a bit more attention to this one. Token generation and structure affect the security of the data, the ability to use the tokens as surrogates in other applications, and the overall performance of the system. In order to differentiate the various solutions, it’s important to understand the basics of token creation.
Let’s recap the process quickly. Each time sensitive data is sent to the token server three basic steps are performed. First, a token is created. Second, the token and the original data are stored together in the token database. Third, the token is returned to the calling application. The goal is not just to protect sensitive data without losing functionality within applications, so we cannot simply create any random blob of data. The format of the token needs to match the format of the original data, so it can be used exactly as if it were the original (sensitive) data. For example, a Social Security token needs to have at least the same size (if not data type) as a social security number. Supporting applications and databases can accept the substituted value as long as it matches the constraints of the original value.
Let’s take a closer look at each of the steps.
There are three common methods for creating tokens:
- Random Number Generation: This method substitutes data with a random number or alphanumeric value, and is our recommended method. Completely random tokens offers the greatest security, as the content cannot be reverse engineered. Some vendors use sequence generators to create tokens, grabbing the next value in the series to use for the token – this is not nearly as secure as a fully randomized number, but is very fast and secure enough for most (non-PCI) use cases. A major benefit of random numbers is that they are easy to adapt to any format constraints (discussed in greater detail below), and the random numbers can be generated in advance to improve performance.
- Encryption: This method generates a ‘token’ by encrypting the data. Sensitive information is padded with a random salt to prevent reverse engineering, and then encrypted with the token server’s private key. The advantage is that the ‘token’ is reasonably secure from reverse engineering, but the original value can be retrieved as needed. The downsides, however are significant – performance is very poor, Format Preserving Encryption algorithms are required, and data can be exposed when keys are compromised or guessed. Further, the PCI Council has not officially accepted format preserving cryptographic algorithms, and is awaiting NIST certification. Regardless, many large and geographically disperse organizations that require access to original data favor the utility of encrypted ‘tokens’, even though this isn’t really tokenization.
- One-way Hash Function: Hashing functions create tokens by running the original value through a non-reversible mathematical operation. This offers reasonable performance, and tokens can be formatted to match any data type. Like encryption, hashes must be created with a cryptographic salt (some random bits of data) to thwart dictionary attacks. Unlike encryption, tokens created through hashing are not reversible. Security is not as strong as fully random tokens but security, performance, and formatting flexibility are all improved over encryption.
Beware that some open source and commercial token servers use poor token generation methods of dubious value. Some use reversible masking, and others use unsalted encryption algorithms, and can thus be easily compromised and defeated.
We mentioned the importance of token formats earlier, and token solutions need to be flexible enough to handle multiple formats for the sensitive data they accept – such as personally identifiable information, Social Security numbers, and credit card numbers. In some cases, additional format constraints must be honored. As an example, a token representing a Social Security Number in a customer service application may need to retain the real last digits. This enables customer service representatives to verify user identities, without access to the rest of the SSN.
When tokenizing credit cards, tokens are the same size as the original credit card number – most implementations even ensure that tokens pass the LUHN check. As the token still resembles a card number, systems that use the card numbers need not to be altered to accommodate tokens. But unlike the real credit card or Social Security numbers, tokens cannot be used as financial instruments, and have no value other than as a reference to the original transaction or real account. The relationship between a token and a card number is unique for any given payment system, so even if someone compromises the entire token database sufficiently that they can commit transactions in that system (a rare but real possibility), the numbers are worthless outside the single environment they were created for. And most important, real tokens cannot be decrypted or otherwise restored back into the original credit card number.
Each data type has different use cases, and tokenization vendors offer various options to accomodate them.
Tokens, along with the data they represent, are stored within heavily secured database with extremely limited access. The data is typically encrypted (per PCI recommendations), ensuring sensitive data is not lost in the event of a database compromises or stolen media. The token (database) server is the only point of contact with any transaction system, payment system, or collection point to reduce risk and compliance scope. Access to the database is highly restricted, with administrative personnel denied read access to the data, and even authorized access the original data limited to carefully controlled circumstances.
As tokens are used to represent the same data for multiple events, possibly across multiple systems, most can issue different tokens for the same user data. A credit card number, for example, may get a different unique token for each transaction. The token server not only generates a new token to represent the new transaction, but is responsible for storing many tokens per user ID. Many use cases require that the token database support multiple tokens for each piece of original data, a one-to-many relationship. This provides better privacy and isolation, if the application does not need to be able to correlation transactions by card number. Applications that rely on the sensitive data (such as credit card numbers) to correlate accounts or other transactions will require modification to use data which is still available (such as an non-sensitive customer number).
Token servers may be internally owned and operated, or provided as a third party service. We will discuss deployment models in an upcoming post.
Token Storage in Applications
When the token server returns the token to the application, the application must safely store the token and effectively erase the original sensitive data. This is critical – not just to secure sensitive data, but also to maintain transactional consistency. An interesting side effect of preventing reverse engineering is that a token by itself is meaningless. It only has value in relation to some other information. The token server has the ability to map the tokenized value back to the original sensitive information, but is the only place this can be done. Supporting applications need to associate the token with something like a user name, transaction ID, merchant customer ID, or some other identifier. This means applications that use token services must be resilient to communications failures, and the token server must offer synchronization services for data recovery.
This is one of the largest potential weaknesses – whenever the original data is collected or requested from the token database, it might be exposed in memory, log files, or virtual memory. This is the most sensitive part of the architecture, and later we’ll discuss some of the many ways to split functions architecturally in order to minimize risk.
At this point you should have a good idea of how tokens are generated, structured, and stored. In our next posts we’ll dig deeper into the architecture as we discuss tokenization server requirements and functions, application integration, and some sample architectural use cases.