Tokenization: The Tokens
In this post we’ll dig into the technical details of tokens: what they are, how they are created, and some of the options for security, formatting, and performance. For those of you who read our stuff but tend to skim the more technical posts, I recommend you stop and pay a bit more attention to this one. Token generation and structure affect the security of the data, the ability to use tokens as surrogates in other applications, and the overall performance of the system. To differentiate the various solutions, it’s important to understand the basics of token creation.

Let’s recap the process quickly. Each time sensitive data is sent to the token server, three basic steps are performed. First, a token is created. Second, the token and the original data are stored together in the token database. Third, the token is returned to the calling application.

The goal is to protect sensitive data without losing functionality within applications, so we cannot simply substitute any random blob of data. The format of the token needs to match the format of the original data, so it can be used exactly as if it were the original (sensitive) data. For example, a Social Security token needs to have at least the same size (if not the same data type) as a Social Security number. Supporting applications and databases can accept the substituted value as long as it matches the constraints of the original value. Let’s take a closer look at each of the steps.

Token Creation

There are three common methods for creating tokens – a short sketch of each follows the list:

- Random Number Generation: This method substitutes the data with a random number or alphanumeric value, and is our recommended method. Completely random tokens offer the greatest security, as the content cannot be reverse engineered. Some vendors use sequence generators instead, grabbing the next value in a series for each token – this is not nearly as secure as a fully random number, but it is very fast and secure enough for most (non-PCI) use cases. A major benefit of random numbers is that they are easy to adapt to any format constraint (discussed in greater detail below), and they can be generated in advance to improve performance.

- Encryption: This method generates a ‘token’ by encrypting the data. Sensitive information is padded with a random salt to prevent reverse engineering, and then encrypted with a key known only to the token server. The advantage is that the ‘token’ is reasonably secure from reverse engineering, yet the original value can be retrieved as needed. The downsides, however, are significant: performance is very poor, Format Preserving Encryption algorithms are required, and data can be exposed if keys are compromised or guessed. Further, the PCI Council has not officially accepted format preserving cryptographic algorithms, and is awaiting NIST certification. Regardless, many large and geographically dispersed organizations that require access to the original data favor the utility of encrypted ‘tokens’, even though this isn’t really tokenization.

- One-way Hash Function: Hashing functions create tokens by running the original value through a non-reversible mathematical operation. This offers reasonable performance, and tokens can be formatted to match any data type. Like encryption, hashes must be created with a cryptographic salt (some random bits of data) to thwart dictionary attacks. Unlike encryption, tokens created through hashing are not reversible. Security is not as strong as with fully random tokens, but security, performance, and formatting flexibility are all better than with encryption.
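To make the random-generation option concrete, here is a minimal Python sketch – an illustration, not any vendor’s implementation – that builds a format-matched token from a cryptographically secure random source. The helper name `random_token` is our own:

```python
import secrets
import string

def random_token(original: str) -> str:
    """Generate a random token matching the format of the original value.

    Digits map to random digits, letters to random letters, and any
    separators (dashes, spaces) are preserved in place, so the token
    satisfies the same length and data-type constraints as the original.
    Tokens like these can also be generated in advance and handed out
    on demand to improve performance.
    """
    out = []
    for ch in original:
        if ch.isdigit():
            out.append(secrets.choice(string.digits))
        elif ch.isalpha():
            out.append(secrets.choice(string.ascii_letters))
        else:
            out.append(ch)  # keep separators such as '-' as-is
    return "".join(out)

# A Social Security number and a format-matched token it might produce:
print(random_token("123-45-6789"))  # e.g. '605-82-2517'
```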
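For the encryption approach, the sketch below uses ordinary AES-GCM from the third-party `cryptography` package, with a random nonce playing the role of the salt described above. Two caveats: the output here is not format-preserving (a real deployment of this approach would use a Format Preserving Encryption mode such as FF1 to keep the card-number shape), and key handling is deliberately simplified:

```python
# Requires the third-party 'cryptography' package (pip install cryptography).
import base64
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

KEY = AESGCM.generate_key(bit_length=256)  # held only by the token server

def encrypted_token(pan: str) -> str:
    # The random nonce acts like the salt described above: the same card
    # number encrypts to a different 'token' every time, preventing
    # reverse engineering by comparison.
    nonce = os.urandom(12)
    ciphertext = AESGCM(KEY).encrypt(nonce, pan.encode(), None)
    return base64.urlsafe_b64encode(nonce + ciphertext).decode()

def recover(token: str) -> str:
    # Unlike a true token, the original value can be decrypted on demand.
    raw = base64.urlsafe_b64decode(token.encode())
    return AESGCM(KEY).decrypt(raw[:12], raw[12:], None).decode()

t = encrypted_token("4111111111111111")
assert recover(t) == "4111111111111111"
```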
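And the hashing approach might look like the following sketch, where a random salt is mixed in before hashing; rendering the digest as decimal digits is one illustrative way to meet a numeric format constraint (the token database, not the hash, is what maps tokens back to data):

```python
import hashlib
import os

def hashed_token(value: str) -> str:
    """Derive a non-reversible token from the value plus a random salt.

    The salt thwarts dictionary attacks: without it, an attacker could
    pre-hash every possible SSN and look tokens up directly.
    """
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + value.encode()).hexdigest()
    # Format the digest as digits and truncate to SSN length so it fits
    # a numeric column (collisions are resolved by the token database).
    return str(int(digest, 16))[:9]

print(hashed_token("123-45-6789"))  # e.g. '882613946'
```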
Beware that some open source and commercial token servers use token generation methods of dubious value. Some use reversible masking, and others use unsalted encryption algorithms – both can be easily compromised and defeated.

Token Properties

We mentioned the importance of token formats earlier. Token solutions need to be flexible enough to handle multiple formats for the sensitive data they accept – such as personally identifiable information, Social Security numbers, and credit card numbers. In some cases additional format constraints must be honored. For example, a token representing a Social Security number in a customer service application may need to retain the real last digits. This enables customer service representatives to verify user identities without access to the rest of the SSN.

When tokenizing credit cards, tokens are the same size as the original card number – most implementations even ensure that tokens pass the LUHN check. Because the token still resembles a card number, systems that use card numbers need not be altered to accommodate tokens. But unlike real credit card or Social Security numbers, tokens cannot be used as financial instruments, and have no value other than as a reference to the original transaction or the real account. The relationship between a token and a card number is unique to any given payment system, so even if someone compromised the entire token database sufficiently to commit transactions in that system (a rare but real possibility), the numbers would be worthless outside that single environment. And most important: real tokens cannot be decrypted or otherwise restored into the original credit card number. Each data type has different use cases, and tokenization vendors offer various options to accommodate them.

Token Datastore

Tokens, along with the data they represent, are stored in a heavily secured database with extremely limited access. The data is typically encrypted (per PCI recommendations) to ensure sensitive data is not lost in the event of a database compromise or stolen media. The token (database) server is the only point of contact with any transaction system, payment system, or collection point, which reduces both risk and compliance scope. Access to the database is highly restricted: administrative personnel are denied read access to the data, and even authorized access to the original data is limited to carefully controlled circumstances. Because tokens are used to represent the same data for multiple events, possibly across multiple systems, most token servers can issue different tokens for the same user data. A credit card number, for example, may get a different token for each transaction.
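Since most implementations generate card tokens that pass the LUHN check mentioned under Token Properties, it is worth seeing how simple that check is. A short sketch:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the LUHN check."""
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

assert luhn_valid("4111111111111111")       # a classic test card number
assert not luhn_valid("4111111111111112")   # one digit off fails the check
```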
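Finally, to tie the pieces together, here is a toy sketch of the create-store-return flow described at the top of this post. The in-memory dict, the ‘9’ prefix, and the preserved last four digits are all illustrative assumptions rather than any product’s design – a real token database is encrypted and access-controlled, as described above:

```python
import secrets

# Stand-in for the token database: token -> original value. In a real
# token server this is an encrypted, access-controlled database.
_vault: dict[str, str] = {}

def tokenize(pan: str) -> str:
    """The three steps from the top of this post: create, store, return.

    Each call issues a fresh token, so the same card number presented
    twice yields two different tokens (collision handling omitted).
    """
    # Illustrative format: a '9' prefix, random middle digits, and the
    # real last four digits preserved, as in the SSN example above. A
    # real implementation might also adjust a digit so the token passes
    # the LUHN check (see the previous sketch).
    middle = "".join(secrets.choice("0123456789") for _ in range(len(pan) - 5))
    token = "9" + middle + pan[-4:]
    _vault[token] = pan   # step 2: store the token and original together
    return token          # step 3: return the token to the caller

def detokenize(token: str) -> str:
    """Authorized lookup of the original value; tightly restricted in practice."""
    return _vault[token]

t1 = tokenize("4111111111111111")
t2 = tokenize("4111111111111111")
assert t1 != t2 and detokenize(t1) == detokenize(t2) == "4111111111111111"
```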