Securosis

Research

Understanding and Selecting Data Masking: How It Works

In this post I want to show how masking works, focusing on how masking platforms move and manipulate data. I originally intended to start with architectures and mechanics of masking systems; but it should be more helpful to start by describing the different masking models, how data flows through different systems, and the advantages and disadvantages of each. I will comment on common data sources and destinations, and the issues to consider when considering masking technology. There are many different types of data repositories and services which can be masked, so I will go into detail on these choices. For now we will stick to relational databases, to keep things simple. Let’s jump right in and discuss how the technology works. ETL When most people think about masking, they think about ETL. ‘ETL’ is short for Extraction-Transformation-Load – a concise description of the classic (and still most common) masking process. Sometimes referred to as ‘static’ masking, ETL works against a fixed export from the source repository. Each phase of ETL is typically performed on separate servers: A source data repository, a masking server that orchestrates the transformation, and a destination database. The masking server connects to the source, retrieves a copy of the data, applies the mask to specified columns of data, and then loads the result onto the target server. This process may be partially manual, fully driven by an administrator, or fully automated. Let’s examine the steps in greater detail: Extract: The first step is to ‘extract’ the data from some storage repository – most often a relational database. The extracted data is often formatted to make it easier for the mask to be applied. For example, extraction can performed with a simple SELECT query issued against a database, filtering out unwanted rows and formatting columns in the query. Results may be streamed directly to the masking application for processing or dumped into a file – such as a comma-separated .csv or tab-separated .tsv file. The extracted data is then securely transferred, as an encrypted file or over an encrypted SSL connection, to the masking platform. Transform: The second step is to apply the data mask, transforming sensitive production data into a safe approximation of the original content. See Defining Masking for available transformations. Masks are almost always applied to what database geeks call “columnar data” – which simply means data of the same type is grouped together. For example, a database may contain a ‘customer’ table, where each customer entry includes a social security (SSN). These values are grouped together into a single column, in files and databases alike, making it easier for the masking application to identify which data to mask. The masking application parses through the data, and for each column of data to be masked, it replaces each entry in the column with a masked value. Load: In the last step masked data is loaded into a destination database. The masked data is copied to one or more destination databases, where it is loaded back into tables. The destination database does not contain sensitive data, so it is not subject to the same security and audit requirements as the original database with the unmasked data. ETL is the most generic and most flexible of masking approaches. The logical ETL process flow implemented in dedicated masking platforms, data management tools with integrated masking and encryption libraries, embedded database tools – all the way down to home-grown scripts. I see all these used in production environments, with the level of skill and labor required increasing as you progress down the chain. While many masking platforms replicate the full process – performing extraction, masking, and loading on separate systems – that is not always the case. Here are some alternative masking models and processes. In-place Masking In some cases you need to create a masked copy within the source database – perhaps before moving it to another less sensitive database. In other cases the production data is moved unchanged (securely!) into another system, and then masked at the destination. When production data is discovered on a test system, the data may be masked without being moved at all. All these variations are called “in-place masking” because they skip both movement steps. The masks are applied as before, but inside the database – which raises its own security and performance considerations. There are very good reasons to mask in place. The first is to take advantage of databases’ facility with management and and manipulation of data. They are incredibly adept at data transformation, and offer very high masking performance. Leveraging built-in functions and stored procedures can speed up the masking process because the database has already parsed the data. Masking data in place – replacing data rather than creating a new copy – protects database archives and data files from snooping, should someone access backup tapes or raw disk files. If the security of data after it leaves the production database is your principal concern, then ETL and in-place masking prior to moving data to another location should satisfy security and audit requirements. Many test environments have poor security, which may require masking prior to export or use of a secure ETL exchange, to ensure sensitive data is never exposed on the network or in destination data repository. That said, among enterprise customers we have interviewed, masking data at the source (in the production database) is not a popular option. The additional computational overhead of the masking operation, in addition to the overhead required to read and write the data being transformed, may have an unacceptable impact on database performance. In many organization legacy databases struggle to keep up with day-to-day operation, and cannot absorb the additional load. Masking in the target database (after the data has been moved) is not very popular either – masking solutions are generally purchased to avoid putting sensitive data on insecure test systems, and such customers prefer to avoid loading data into untrusted test systems prior to masking. In-place masking is typically

Share:
Read Post
dinosaur-sidebar

Totally Transparent Research is the embodiment of how we work at Securosis. It’s our core operating philosophy, our research policy, and a specific process. We initially developed it to help maintain objectivity while producing licensed research, but its benefits extend to all aspects of our business.

Going beyond Open Source Research, and a far cry from the traditional syndicated research model, we think it’s the best way to produce independent, objective, quality research.

Here’s how it works:

  • Content is developed ‘live’ on the blog. Primary research is generally released in pieces, as a series of posts, so we can digest and integrate feedback, making the end results much stronger than traditional “ivory tower” research.
  • Comments are enabled for posts. All comments are kept except for spam, personal insults of a clearly inflammatory nature, and completely off-topic content that distracts from the discussion. We welcome comments critical of the work, even if somewhat insulting to the authors. Really.
  • Anyone can comment, and no registration is required. Vendors or consultants with a relevant product or offering must properly identify themselves. While their comments won’t be deleted, the writer/moderator will “call out”, identify, and possibly ridicule vendors who fail to do so.
  • Vendors considering licensing the content are welcome to provide feedback, but it must be posted in the comments - just like everyone else. There is no back channel influence on the research findings or posts.
    Analysts must reply to comments and defend the research position, or agree to modify the content.
  • At the end of the post series, the analyst compiles the posts into a paper, presentation, or other delivery vehicle. Public comments/input factors into the research, where appropriate.
  • If the research is distributed as a paper, significant commenters/contributors are acknowledged in the opening of the report. If they did not post their real names, handles used for comments are listed. Commenters do not retain any rights to the report, but their contributions will be recognized.
  • All primary research will be released under a Creative Commons license. The current license is Non-Commercial, Attribution. The analyst, at their discretion, may add a Derivative Works or Share Alike condition.
  • Securosis primary research does not discuss specific vendors or specific products/offerings, unless used to provide context, contrast or to make a point (which is very very rare).
    Although quotes from published primary research (and published primary research only) may be used in press releases, said quotes may never mention a specific vendor, even if the vendor is mentioned in the source report. Securosis must approve any quote to appear in any vendor marketing collateral.
  • Final primary research will be posted on the blog with open comments.
  • Research will be updated periodically to reflect market realities, based on the discretion of the primary analyst. Updated research will be dated and given a version number.
    For research that cannot be developed using this model, such as complex principles or models that are unsuited for a series of blog posts, the content will be chunked up and posted at or before release of the paper to solicit public feedback, and provide an open venue for comments and criticisms.
  • In rare cases Securosis may write papers outside of the primary research agenda, but only if the end result can be non-biased and valuable to the user community to supplement industry-wide efforts or advances. A “Radically Transparent Research” process will be followed in developing these papers, where absolutely all materials are public at all stages of development, including communications (email, call notes).
    Only the free primary research released on our site can be licensed. We will not accept licensing fees on research we charge users to access.
  • All licensed research will be clearly labeled with the licensees. No licensed research will be released without indicating the sources of licensing fees. Again, there will be no back channel influence. We’re open and transparent about our revenue sources.

In essence, we develop all of our research out in the open, and not only seek public comments, but keep those comments indefinitely as a record of the research creation process. If you believe we are biased or not doing our homework, you can call us out on it and it will be there in the record. Our philosophy involves cracking open the research process, and using our readers to eliminate bias and enhance the quality of the work.

On the back end, here’s how we handle this approach with licensees:

  • Licensees may propose paper topics. The topic may be accepted if it is consistent with the Securosis research agenda and goals, but only if it can be covered without bias and will be valuable to the end user community.
  • Analysts produce research according to their own research agendas, and may offer licensing under the same objectivity requirements.
  • The potential licensee will be provided an outline of our research positions and the potential research product so they can determine if it is likely to meet their objectives.
  • Once the licensee agrees, development of the primary research content begins, following the Totally Transparent Research process as outlined above. At this point, there is no money exchanged.
  • Upon completion of the paper, the licensee will receive a release candidate to determine whether the final result still meets their needs.
  • If the content does not meet their needs, the licensee is not required to pay, and the research will be released without licensing or with alternate licensees.
  • Licensees may host and reuse the content for the length of the license (typically one year). This includes placing the content behind a registration process, posting on white paper networks, or translation into other languages. The research will always be hosted at Securosis for free without registration.

Here is the language we currently place in our research project agreements:

Content will be created independently of LICENSEE with no obligations for payment. Once content is complete, LICENSEE will have a 3 day review period to determine if the content meets corporate objectives. If the content is unsuitable, LICENSEE will not be obligated for any payment and Securosis is free to distribute the whitepaper without branding or with alternate licensees, and will not complete any associated webcasts for the declining LICENSEE. Content licensing, webcasts and payment are contingent on the content being acceptable to LICENSEE. This maintains objectivity while limiting the risk to LICENSEE. Securosis maintains all rights to the content and to include Securosis branding in addition to any licensee branding.

Even this process itself is open to criticism. If you have questions or comments, you can email us or comment on the blog.