The three basic data centric security tools are tokenization, masking, and data element encryption. Let's discuss what they are, how they work, and which security challenges each serves best.

  • Tokenization: You can think of tokenization like a subway or arcade token: it has no cash value, but can be used to ride the train or play a game. In data centric security, a token is provided in lieu of sensitive data. The most common use case today is in credit card processing systems, as a substitute for credit card numbers. A token is basically just a random number – that’s it. The token can be made to look just like the original data type; in the case of credit cards the tokens are typically 16 digits long, they usually preserve the last four original digits, and can even be generated so that they pass the Luhn validation check. But a token is a random value, with no mathematical relationship to the original, and no value other than as a reference to the original in some other (more secure) database. Users may choose to maintain a “token database” which associates the original value with the token, in case they need to look up the original at some point in the future, but this is optional.

Tokenization has advanced far beyond simple value replacement, and is lately being applied to more advanced data types. These days tokens are not just for simple things like credit cards and Social Security numbers, but also for JSON & XML files and web pages. Some tokenization solutions replace data stored within databases, while others can work on data streams – such as replacing unique cell IDs embedded in cellphone tower data streams. This enables both simple and complex data to be tokenized, at rest or in motion – and tokens can look like anything you want. Very versatile and very secure – you can’t steal what’s not there!

Tokenization provides strong security by completely removing the original sensitive values from the secured data – random values cannot be reverse engineered back to the original data. For example, given a database where the primary key is a Social Security number, tokenization can generate unique random tokens which fit in the receiving database. Some firms discard (or never receive) the original value entirely – they don’t need it, and use tokens simply because downstream applications might break without an SSN or compatible surrogate. Users who occasionally need to reference the original values use token vaults or equivalent technologies, which are designed to allow only credentialed administrators access to the original sensitive values under controlled conditions – though a vault compromise would expose all the original values. Vaults are commonly used for PHI and financial data, as mentioned in the last post.
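To make the mechanics concrete, here is a minimal sketch of credit card tokenization in Python. The function names (`make_token`, `tokenize`) and the in-memory vault are illustrative assumptions, not any particular product’s API: a random 16-digit token preserves the last four digits of the original, passes the Luhn check, and an optional vault maps tokens back to the originals.

```python
import random

def luhn_checksum(digits: str) -> int:
    """Luhn mod-10 checksum of a numeric string (0 means the number is valid)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10

def make_token(pan: str) -> str:
    """Random 16-digit token that preserves the last four digits of `pan`
    and passes the Luhn check. No mathematical link to the original."""
    last4 = pan[-4:]
    while True:
        body = "".join(random.choice("0123456789") for _ in range(12))
        candidate = body + last4
        if luhn_checksum(candidate) == 0 and candidate != pan:
            return candidate

# Hypothetical vault: token -> original value. A real vault is a hardened,
# access-controlled store, not a dictionary in application memory.
token_vault = {}

def tokenize(pan: str) -> str:
    token = make_token(pan)
    token_vault[token] = pan
    return token
```

Firms that never need the original value simply skip the vault and keep only the token.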

  • Masking: This is another very popular tool for protecting data elements while retaining the aggregate value of data sets. For example we might substitute an individual’s Social Security number with a random number (as in tokenization), or replace their name with one randomly selected from a phone book, while retaining gender. We might replace date of birth with a random value within X days of the original to preserve approximate age. This way the original (sensitive) values are removed entirely without destroying the aggregate properties of the data set, so it can support later analysis.

Masking is the principal method of creating useful new values without exposing the originals. It is ideally suited for creating data sets which can be used for meaningful analysis without exposing the original data. This is important when you don’t have sufficient resources to secure every system in your enterprise, or don’t fully trust the environment where the data is stored. Different masks can be applied to the same data fields to produce different masked data sets for different use cases. This flexibility preserves much of the value of the original data with minimal risk. Masking is very commonly used with PHI, test data management, and NoSQL analytics databases.

That said, there are potential downsides as well. Masking does not offer quite as strong security as tokenization or encryption (which we will discuss below). The masked data does in fact bear some relationship to the original – while individual fields are anonymized to some degree, preservation of specific attributes of a person’s health record (age, gender, zip code, race, DoB, etc.) may provide more than enough information to re-identify individuals and reverse the masking. Masking can be very secure, but that requires selecting good masking tools and applying a well-reasoned mask to achieve security goals while supporting the desired analytics.
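As a sketch of the idea – the field names and the tiny substitute-name pool here are hypothetical, and a real masking product uses far larger dictionaries and configurable policies – the following masks a record the way described above: randomize the SSN, swap the name, jitter the date of birth within ±30 days, and retain gender.

```python
import random
from datetime import date, timedelta

# Hypothetical substitute-name pool; a real tool draws from large dictionaries.
FAKE_NAMES = ["Alex Morgan", "Sam Lee", "Jordan Price", "Casey Fox"]

def mask_record(record: dict, dob_jitter_days: int = 30) -> dict:
    """Mask a record: randomize SSN, substitute the name, jitter date of
    birth within +/- dob_jitter_days to preserve approximate age, and
    keep gender unchanged for aggregate analysis."""
    masked = dict(record)
    masked["ssn"] = "".join(random.choice("0123456789") for _ in range(9))
    masked["name"] = random.choice(FAKE_NAMES)
    offset = random.randint(-dob_jitter_days, dob_jitter_days)
    masked["dob"] = record["dob"] + timedelta(days=offset)
    # "gender" is deliberately left as-is
    return masked
```

Different masks for different use cases amount to different policies in place of this one function.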

  • Element/Data Field Encryption / Format Preserving Encryption (FPE): Encryption is the go-to security tool for the majority of IT and data security challenges we face today. Properly implemented, encryption produces obfuscated data that cannot be reversed into the original value without the encryption key. What’s more, encryption can be applied to any type of data, such as first and last names, or to entire data structures such as a file or database table. And encryption keys can be provided to select users, keeping data secret from those not entrusted with keys.

But not all encryption solutions are suitable for a data centric security model. Most forms of encryption take human-readable data and transform it into binary format. This is a problem for applications which expect text strings, or databases which require properly formatted Social Security numbers. These binary values create unwanted side effects and often cause applications to crash. So most companies considering data centric security need an encryption cipher that preserves at least format, and often data type as well. Typically these algorithms are applied to specific data fields (e.g. name, Social Security number, or credit card number), and can be used on data at rest or applied to data streams as information moves from one place to the next. These encryption variants are commercially available, and provide the same degree of security as standard ciphers at a modest performance penalty.
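To illustrate what “format preserving” means, here is a toy balanced-Feistel cipher over even-length digit strings: the ciphertext has the same length and character set as the plaintext, and decrypts with the same key. This is strictly illustrative – it is not NIST FF1 and must never be used to protect real data; production deployments should rely on a vetted FPE mode such as FF1.

```python
import hmac
import hashlib

def _f(key: bytes, rnd: int, half: str, width: int) -> int:
    # Keyed round function: HMAC-SHA256, truncated to `width` decimal digits.
    digest = hmac.new(key, bytes([rnd]) + half.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % (10 ** width)

def fpe_encrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Toy balanced Feistel over an even-length decimal string. Output
    preserves length and digit-only format. Illustration only."""
    w = len(digits) // 2
    left, right = digits[:w], digits[w:]
    for rnd in range(rounds):
        left, right = right, f"{(int(left) + _f(key, rnd, right, w)) % 10**w:0{w}d}"
    return left + right

def fpe_decrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Invert fpe_encrypt by running the rounds in reverse."""
    w = len(digits) // 2
    left, right = digits[:w], digits[w:]
    for rnd in reversed(range(rounds)):
        left, right = f"{(int(right) - _f(key, rnd, left, w)) % 10**w:0{w}d}", left
    return left + right
```

Because the output is still a 16-digit numeric string, it drops into a column or message field that expects a credit card number without breaking downstream applications.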

You need to be careful when selecting and implementing an encryption tool. Not all commercial platforms are vetted by experts. Sometimes firms deploy an excellent encryption cipher badly, perhaps by storing encryption keys on disk or accidentally exposing them to the public. It happens to a lot of organizations! Worse still, long-hidden vulnerabilities are sometimes discovered in ubiquitous encryption tools, exposing user data. Encryption does not guarantee security, which is why some firms prefer tokenization or masking to encryption.

That said, a significant advantage of encryption is that many data compliance requirements provide blanket approval for encryption: encrypted data is compliant. The same cannot be said for tokenization or masking. Encryption is also widely integrated with cloud infrastructure and most application platforms. Encryption is an industrial-grade general-purpose security tool, which requires great care during selection and deployment to meet security goals. With proper selection and deployment, encryption protects data very effectively, across diverse scenarios and use cases.

It is also worth mentioning homomorphic encryption, a topic which gets a lot of media coverage but is of little use today. That is because homomorphic encryption is currently little more than a research lab concept. In practice homomorphic systems either require vast computing resources which make them economically unfeasible, or are based on compromised variants of standard encryption protocols. The result is that you can run basic analytics on encrypted data, but without sufficient security – failing the principal requirement. Scientific advances could make homomorphic encryption a viable alternative within our lifetimes, but today it is simply impractical – or worse, snake oil.

  • Discovery: Most data centric security solutions include a discovery module. You need to locate sensitive data in order to know what needs to be secured, and you must identify what type of sensitive information it is in order to understand what security controls should be applied to protect it. Discovery tools handle that.

Data discovery tools are typically geared toward sifting through files or relational databases. File discovery tools first look at storage volumes to see what types of files they contain, then scan their contents for sensitive data. File scanners need credentials to access the files they scan, and typically use a user account with read-only access. Some database scanners can identify databases on the network by scanning IP addresses, but most need to be pointed specifically at the databases to scan. Like file scanners, they need credentials to read relational tables.

Discovery tools commonly employ two methods to discover content: metadata checks and regular expression checks. Metadata checks are common for database scanners: rather than examine every data element in the database, they look at the size, column name, and structure of the data. A column of 16-digit numerics whose name contains two C’s (say, ‘cc_number’) is probably a credit card field. Regular expression checks are also common; they examine the data itself rather than the database schema. You can pattern-match the contents of a file or database column – again, 16 digits is the most popular pattern – and check whatever matches. More advanced scanners employ heuristics to speed up scanning, or tricks like positional checks (e.g., the third element in a sequence of values often contains Social Security numbers). There are many types of discovery, but they all seek to locate sensitive data and identify its type, and often to map each data type to a security control. We will provide more examples of discovery next time.
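A minimal sketch of both techniques follows. The pattern names and the column-name heuristic are illustrative assumptions; real scanners layer on validation such as Luhn checks, context rules, and much richer pattern libraries.

```python
import re

# Illustrative regex patterns for two common sensitive data types.
PATTERNS = {
    "credit_card": re.compile(r"\b\d{16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_text(text: str) -> dict:
    """Regular expression check: return matches per sensitive-data type."""
    hits = {}
    for name, rx in PATTERNS.items():
        matches = rx.findall(text)
        if matches:
            hits[name] = matches
    return hits

def looks_like_card_column(column_name: str, sample: list) -> bool:
    """Metadata heuristic: 16-digit values plus a column name containing
    two C's (e.g. 'cc_number') suggests a credit card field."""
    return (column_name.lower().count("c") >= 2
            and all(v.isdigit() and len(v) == 16 for v in sample))
```

The regex path examines the data itself; the metadata path never reads more than a small sample plus the schema, which is why it scales better against large databases.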

Our next post will take a look at how some firms and security platforms implement data centric security.