
Pragmatic Data Security

Monday, February 01, 2010

Pragmatic Data Security: Discover

By Rich

In the Discovery phase we figure out where the heck our sensitive information is, how it's being used, and how well it's protected. If performed manually, or with too broad an approach, Discovery can be quite difficult and time consuming. In the pragmatic approach we stick with a very narrow scope and leverage automation for greater efficiency. A mid-sized organization can see immediate benefits in a matter of weeks to months, and usually finish a comprehensive review (including all endpoints) within a year.

Discover: The Process

Before we get into the process, be aware that your job will be infinitely harder if you don't have a reasonably up to date directory infrastructure. If you can't figure out your users, groups, and roles, it will be much harder to identify misuse of data or build enforcement policies. Take the time to clean up your directory before you start scanning and filtering for content. Also, the odds are very high that you will find something that requires disciplinary action. Make sure you have a process in place to handle policy violations, and work with HR and Legal before you start finding things that will get someone fired (trust me, those odds are pretty darn high).

You have a couple choices for where to start -- depending on your goals, you can begin with applications/databases, storage repositories (including endpoints), or the network. If you are dealing with something like PCI, stored data is usually the best place to start, since avoiding unencrypted card numbers on storage is an explicit requirement. For HIPAA, you might want to start on the network since most of the violations in organizations I talk to relate to policy violations over email/web/FTP due to bad business processes. For each area, here's how you do it:

  • Storage and Endpoints: Unless you have a heck of a lot of bodies, you will need a Data Loss Prevention tool with content discovery capabilities (I mention a few alternatives in the Tools section, but DLP is your best choice). Build a policy based on the content definition you built in the first phase. Remember, stick to a single data/content type to start. Unless you are in a smaller organization and plan on scanning everything, you need to identify your initial target range -- typically major repositories or endpoints grouped by business unit. Don't pick something too broad or you might end up with too many results to do anything with. Also, you'll need some sort of access to the server -- either by installing an agent or through access to a file share. Once you get your first results, tune your policy as needed and start expanding your scope to scan more systems.
  • Network: Again, a DLP tool is your friend here, although unlike with content discovery you have more options to leverage other tools for some sort of basic analysis. They won't be nearly as effective, and I really suggest using the right tool for the job. Put your network tool in monitoring mode and build a policy to generate alerts using the same data definition we talked about when scanning storage. You might focus on just a few key channels to start -- such as email, web, and FTP, with a narrow IP range/subnet if you are in a larger organization. This will give you a good idea of how your data is being used, identify some bad business processes (like unencrypted FTP to a partner), and show which users or departments are the worst abusers. Based on your initial results you'll tune your policy as needed. Right now our goal is to figure out where we have problems -- we will get to fixing them in a different phase.
  • Applications & Databases: Your goal is to determine which applications and databases have sensitive data, and you have a few different approaches to choose from. This is the part of the process where a manual effort can be somewhat effective, although it's not as comprehensive as using automated tools. Simply reach out to different business units, especially the application support and database management teams, to create an inventory. Don't ask them which systems have sensitive data, ask them for an inventory of all systems. The odds are very high your data is stored in places you don't expect, so to check these systems perform a flat file dump and scan the output with a pattern matching tool. If you have the budget, I suggest using a database discovery tool -- preferably one with built in content discovery (there aren't many on the market, as we'll mention in the Tools section). Depending on the tool you use, it will either sniff the network for database connections and then identify those systems, or scan based on IP ranges. If the tool includes content discovery, you'll usually give it some level of administrative access to scan the internal database structures.
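
To make the "flat file dump plus pattern matching" option a little more concrete, here is a minimal sketch in Python. It is not how any particular DLP or discovery product works; the regular expression, the function names, and the choice to add a Luhn checksum test (a standard way to weed out random digit strings) are all assumptions made for illustration.

```python
import re

# Assumed pattern: 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,15}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_candidates(text: str) -> list[str]:
    """Return digit strings that look like card numbers and pass Luhn."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

Running `find_card_candidates` over each line of a dump gives you a rough list of rows worth a closer look. Expect false positives and misses; a pattern like this only helps with data that fits the pattern.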

I just presented a lot of options, but remember we are taking the pragmatic approach. I don't expect you to try all this at once -- pick one area, with a narrow scope, knowing you will expand later. Focus on wherever you think you might have the greatest initial impact, or where you have known problems. I'm not an idealist -- some of this is hard work and takes time, but it isn't an endless process and you will have a positive impact.

We aren't necessarily done once we figure out where the data is -- for approved repositories, I really recommend you also re-check their security. Run at least a basic vulnerability scan, and for bigger repositories I recommend a focused penetration test. (Of course, if you already know it's insecure you probably don't need to beat the dead horse with another check). Later, in the Secure phase, we'll need to lock down the approved repositories so it's important to know which security holes to plug.

Discover: Technologies

Unlike the Define phase, here we have a plethora of options. I'll break this into two parts: recommended tools that are best for the job, and ancillary tools in case you don't have a budget for anything new. Since we're focused on the process in this series, I'll skip definitions and descriptions of the technologies, most of which you can find in our Research Library.

Recommended Tools

  1. Data Loss Prevention (DLP): This is the best tool for storage, network, and endpoint discovery. Nothing else is nearly as effective.
  2. Database Discovery: While there are only a few tools on the market, they are extremely helpful for finding all the unexpected databases that tend to be floating around most organizations. Some offer content discovery, but it's usually limited to regular expressions/keywords (which is often totally fine for looking within a database).
  3. Database Activity Monitoring (DAM): A couple of the tools include content discovery (some also include database discovery). I only recommend DAM in the Discover phase if you also intend to use it later for database monitoring -- otherwise it's not the right investment.

Ancillary Tools

  1. IDS/IPS/Deep Packet Inspection: There are a bunch of different deep packet inspection network tools -- including UTM, Web Application Firewalls, and web gateways -- that now include basic regular expression pattern matching for "poor man's" DLP functionality. They only help with data that fits a pattern, they don't include any workflow, and they usually have a ton of false positives. If the tool can't crack open file attachments/transfers it probably won't be very helpful.
  2. Electronic Discovery, Search, and Data Classification: Most of these tools perform some level of pattern matching or indexing that can help with discovery. They tend to have much higher false positive rates than DLP (and usually cost more if you're buying new), but if you already have one and budgets are tight they can help.
  3. Email Security Gateways: Most of the email security gateways on the market can scan for content, but they are obviously limited to only email, and aren't necessarily well suited to the discovery process.
  4. FOSS Discovery Tools: There are a couple of free/open source content discovery tools, mostly projects from higher education institutions that built their own tools to weed out improper use of Social Security numbers due to a regulatory change a few years back.

Discover: Case Study

Frank from Billy Bob's Bait Shop and Sushi Outlet decides to use a DLP tool to help figure out where any unencrypted credit card numbers might be stored. He decides to go with a full suite DLP tool since he knows he needs to scan his network, storage, servers in the retail outlets, and employee systems.

Before turning on the tool, he contacts Legal and HR to set up a process in case they find any employees illegally using these numbers, as opposed to the accidental or business-process leaks he also expects to manage. Although his directory servers are a little messy due to all the short-term employees endemic to retail operations, he's confident his core Active Directory server is relatively up to date, especially where systems/servers are concerned.

Since he's using a DLP tool, he develops a three-tier policy to base his discovery scans on:

  1. Using the one database with stored unencrypted numbers, he creates a database fingerprinting policy to alert on exact matches from that database (his DLP tool uses hashes, not the original values, so it isn't creating a new security exposure). These are critical alerts.
  2. His next policy uses database fingerprints of all customer names from the customer database, combined with a regular expression for generic credit card numbers. If a customer name appears with something that matches a credit card number (based on the regex pattern) it generates a medium alert.
  3. His lowest priority policy uses the default "PCI" category built into his DLP tool, which is predominantly basic pattern matching.
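
As a rough illustration of how a three-tier policy like Frank's fits together, here is a small Python sketch. Everything in it is an assumption for the sake of the example: the `classify` and `fingerprint` names, the regular expression, and the use of a plain SHA-256 hash. Commercial DLP tools use their own fingerprinting schemes, and this sketch matches customer names as plain text rather than as fingerprints to keep it short.

```python
import hashlib
import re

# Assumed generic pattern: 13 to 16 digits with optional space/hyphen separators.
CARD_PATTERN = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,15}(?!\d)")


def fingerprint(value: str) -> str:
    """Hash a value so the policy stores a digest, not the original (illustrative only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def classify(text, card_fingerprints, customer_names):
    """Return 'critical', 'medium', 'low', or None for a piece of content."""
    candidates = [re.sub(r"[ -]", "", m.group()) for m in CARD_PATTERN.finditer(text)]
    if not candidates:
        return None
    # Tier 1: exact match against fingerprints of known stored card numbers.
    if any(fingerprint(c) in card_fingerprints for c in candidates):
        return "critical"
    # Tier 2: a card-like pattern appearing together with a known customer name.
    lowered = text.lower()
    if any(name.lower() in lowered for name in customer_names):
        return "medium"
    # Tier 3: generic pattern match only.
    return "low"
```

The ordering matters: the most specific test runs first, so an exact database match is never downgraded to a generic pattern hit.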

He breaks his project down into three phases, to run during overlapping periods:

  1. Using those three policies, he turns on network monitoring for email, web, and FTP.
  2. He begins scanning his storage repositories, starting in the data center. Once he finishes those, he will expand the scans into systems in the retail outlets. He expects his data center scan to go relatively quickly, but is planning on 6-12 months to cover the retail outlets.
  3. He is testing endpoint discovery in the lab, but since their workstation management is a bit messy he isn't planning on trying to install agents and begin scans until the second year of the project.

It took Frank about two months to coordinate with other business/IT units before starting the project. Installing DLP on the network only took a few hours because everything ran through one main gateway, and he wasn't worried about installing any proxy/blocking technology.

Frank immediately saw network results, and found one serious business process problem where unencrypted numbers were included in files being FTPed to a business partner. The rest of his incidents involved individual accidents, and for the most part they weren't losing credit card numbers over the monitored channels.

The content discovery portion took a bit longer since there wasn't a consistent administrative account he could use to access and scan all the servers. Even though they are a relatively small operation, it took about 2 months of full time scanning to get through the data center due to all the manual coordination involved. They found a large number of old spreadsheets with credit card numbers in various directories, and a few in flat files -- especially database dumps from development.

The retail outlets actually took less time than he expected. Most of the servers, except at the largest regional locations, were remotely managed and well inventoried. He found that 20% of them were running on an older credit card transaction system that stored unencrypted credit card numbers.

Remember, this is a 1,000 person organization... if you work someplace with five or ten times the employees and infrastructure, your process will take longer. Don't assume it will take five or ten times longer, though -- it all depends on scope, infrastructure, and a variety of other factors.

–Rich

Wednesday, January 27, 2010

Pragmatic Data Security- Define Phase

By Rich

Now that we've described the Pragmatic Data Security Cycle, it's time to dig into the phases. As we roll through each of these I'm going to break it into three parts: the process, the technologies, and a case study. For the case study we're going to follow a fictional organization through the entire process. Instead of showing you every single data protection option at each phase, we'll focus on a narrow project that better represents what you will likely experience.

Define: The Process

From a process standpoint, this is both the easiest and hardest of the phases. Easy, since there's only one thing you need to do and it isn't very technical or complex; hard, since it may involve coordination across multiple business units and the quest for executive sponsorship.

  1. Identify an executive sponsor to support your efforts. Without management support, the rest of the process will be extremely difficult.
  2. Identify the one piece of information/content/data you want to protect. The definition shouldn't be too broad. For example, "engineering plans" is too broad, but "engineering plans for project X" is acceptable. Using "PCI/NPI/HIPAA" is acceptable, assuming you narrow it down in the next step.
  3. Define and model the information you defined in the step above. For totally unstructured content like engineering plans, identify a repository to use for your definition, or any watermarking/labels you are certain will be available to identify and protect the information. For PCI/NPI/HIPAA determine the exact fields/pieces of data to protect. For PCI it might be only the credit card number, for NPI it might be names and addresses, and for HIPAA it might be ICD9 billing codes. If you are protecting data from a database, also identify the source repository.
  4. Identify key business units with a stake in the information, and contact them to verify the priority, structure, and repositories for this information. It's no fun if you think you're going to protect a database of customer data, only to find out halfway through that it's not really the important one from a business perspective.

That's it: find a sponsor, identify the category, identify the data/repository, and confirm with the business folks.

Define: Technologies

None. This is a manual business process and the only technology you need is something to take notes with... or maybe email to communicate.

Define: Case Study

Billy Bob's Bait Shop and Sushi Outlet is a mid-sized, multi-site retail organization that specializes in "The freshest seafood, for your family or aquatic friends". Billy Bob's consists of a corporate headquarters and a few dozen retail outlets in three states. There are about 1,000 employees, and a growing web business due to their capability to ship fresh bait or sushi to any location in the US overnight.

Billy Bob's is struggling with PCI compliance and wants to avoid a major security breach after seeing the damage a breach caused their major competitor (John Boy's Worms and Grub).

They do not have a dedicated security team, but their CIO designated one of their top network administrators (the former firewall manager) to head up security operations. Frank has a solid history as a network administrator and is familiar with security (including some SANS training and a CISSP class). Due to problems with their first PCI assessment, Frank has the backing of the CIO.

The category of data is PCI. After some research, Frank decides to go with a multilevel definition -- at the top is credit card numbers. Since they are (supposedly) not storing them in a database they could feed to any data protection tools, Frank is starting with a regular expression to identify credit card numbers, and then plans on refining it using customer names (which are stored in the database). He is hoping that whatever tools he picks can use a generic credit card number definition for low-priority alerts, and a credit card (generic) tied with a customer name to trigger higher priority alerts. Frank also plans on using violation counts to help find real problem areas.
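
The violation-count idea is easy to picture with a few lines of Python. This is only a sketch under assumptions of my own: the regular expression, the `violation_counts` and `problem_areas` names, and the idea of grouping by a "location" label (a file share, a mailbox, a subnet) are all hypothetical, not features of a specific tool.

```python
import re
from collections import Counter

# Assumed generic pattern: 13 to 16 digits with optional space/hyphen separators.
CARD_PATTERN = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,15}(?!\d)")


def violation_counts(documents):
    """Count card-like pattern matches per location.

    `documents` is an iterable of (location, text) pairs.
    """
    counts = Counter()
    for location, text in documents:
        counts[location] += len(CARD_PATTERN.findall(text))
    return counts


def problem_areas(counts, threshold):
    """Return locations whose match count meets the threshold, highest first."""
    return [loc for loc, n in counts.most_common() if n >= threshold]
```

A single generic match is often noise; a location with dozens of them is usually a spreadsheet or export someone forgot about, which is why counting helps separate real problems from one-off false positives.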

Frank now has a generic category (PCI), a specific definition (generic regex and customer name from a database) and the repository location (the customer database itself). From the heads of the customer relations and billing, he learned that there are really two databases he needs to worry about: the main transaction processing/records system for the web outlet, and the point of sale transaction processing system for the retail outlets. The web outlet does not store unencrypted credit card numbers, but the retail outlets currently do, and they are working with the transaction processor to fix that. Thus he is adding credit card numbers from the retail database to his list of data sources. Fortunately, they are only stored in the central processing database, and not at the individual retail outlets.

That's the setup -- in our next post we will cover the Discovery process to figure out where the heck all that data is.

–Rich

Wednesday, January 20, 2010

Pragmatic Data Security: Groundwork

By Rich

Back in Part 1 of our series on Pragmatic Data Security, we covered some guiding concepts. Before we actually dig in, there's some more groundwork we need to cover. There are two important fundamentals that provide context for the rest of the process.

The Data Breach Triangle

In May of 2009 I published a piece on the Data Breach Triangle, which is based on the fire triangle every Boy Scout and firefighter is intimately familiar with. For a fire to burn you need fuel, oxygen, and heat -- take any single element away and there's no combustion. Extending that idea: to experience a data breach you need an exploit, data, and an egress route. If you block the attacker from getting in, don't leave them data to steal, or block the stolen data's outbound path, you can't have a successful breach.

[Image: the Data Breach Triangle]

To date, the vast majority of information security spending is directed purely at preventing exploits -- including everything from vulnerability management, to firewalls, to antivirus. But when it comes to data security, in many cases it's far cheaper and easier to block the outbound path, or make the data harder to access in the first place. That's why, as we detail the process, you'll notice we spend a lot of time finding and removing data from where it shouldn't be, and locking down outbound egress channels.

The Two Domains of Data Security

We're going to be talking about a lot of technologies through this series. Data security is a pretty big area, and takes the right collection of tools to accomplish. Think about network security -- we use everything from firewalls, to IDS/IPS, to vulnerability assessment and monitoring tools. Data security is no different, but I like to divide both the technologies and the processes into two major buckets, based on how we access and use the information:

  1. The Data Center and Enterprise Applications -- When a user accesses content through an enterprise application (client/server or web), often backed by a database.
  2. Productivity Tools -- When a user works with information with their desktop tools, as opposed to connecting to something in the data center. This bucket also includes our communications applications. If you are creating or accessing the content in Microsoft Office, or exchanging it over email/IM, it's in this category.

To provide a little more context, our web application and database security tools fall into the first domain, while DLP and rights management generally fall into the second.

Now I bet some of you thought I was going to talk about structured and unstructured data, but I think that distinction isn't nearly as applicable as the data center vs. productivity applications. Not all structured data is in a database, and not all unstructured data is on a workstation or file server. Practically speaking, we need to focus on the business workflow of how users work with data, not where the data might have come from. You can have structured data in anything from a database to a spreadsheet or a PDF file, or unstructured data stored in a database, so that's no longer an effective division when it comes to the design and implementation of appropriate security controls.

The distinction is important since we need to take slightly different approaches based on how a user works with the information, taking into account its transitions between the two domains. We have a different set of potential controls when a user comes through a controlled application, vs. when a user is creating or manipulating content on their desktop and exchanging it through email.

As we introduce and explore the Pragmatic Data Security process, you'll see that we rely heavily on the concepts of the Data Breach Triangle and these two domains of data security to focus our efforts and design the right business processes and control schemes without introducing unneeded complexity.

–Rich

Wednesday, January 13, 2010

Pragmatic Data Security- Introduction

By Rich

Over the past 7 years or so I've talked with thousands of IT professionals working on various types of data security projects. If I were forced to pull out one single thread from all those discussions it would have to be the sheer intimidating potential of many of these projects. While there are plenty of self-constrained projects, in many cases the security folks are tasked with implementing technologies or changes that involve monitoring or managing on a pretty broad scale. That's just the nature of data security -- unless the information you're trying to protect is already in isolated use, you have to cast a pretty wide net.

But a parallel thread in these conversations is how successful and impactful well-defined data security projects can be. And usually these are the projects that start small, and grow over time.

Way back when I started the blog (long before Securosis was a company) I did a series on the Information-Centric Security Cycle (linked from the Research Library). It was my first attempt to pull the different threads of data security together into a comprehensive picture, and I think it still stands up pretty well.

But as great as my inspired work of data-security genius is (*snicker*), it's not overly useful when you have to actually go out and protect, you know, stuff. It shows the potential options for protecting data, but doesn't provide any guidance on how to pull it off.

Since I hate when analysts provide lofty frameworks that don't help you get your job done, it's time to get a little more pragmatic and provide specific guidance on implementing data security. This Pragmatic Data Security series will walk through a structured and realistic process for protecting your information, based on hundreds of conversations with security professionals working on data security projects.

Before starting, there's a bit of good news and bad news:

  1. Good news: there are a lot of things you can do without spending much money.
  2. Bad news: to do this well, you're going to have to buy the right tools. We buy firewalls because our routers aren't firewalls, and while there are a few free options, there's no free lunch.

I wish I could tell you none of this will cost anything and it won't impose any additional effort on your already strained resources, but that isn't the way the world works.

The concept of Pragmatic Data Security is that we start securing a single, well-defined data type, within a constrained scope. We then grow the scope until we reach our coverage objectives, before moving on to additional data types. Trying to protect, or even find, all of your sensitive information at once is just as unrealistic as thinking you can secure even one type of data everywhere it might be in your organization.

As with any pragmatic approach, we follow some simple principles:

  • Keep it simple. Stick to the basics.
  • Keep it practical. Don't try to start processes and programs that are unrealistic due to resources, scope, or political considerations.
  • Go for the quick wins. Some techniques aren't perfect or ideal, but wipe out a huge chunk of the problem.
  • Start small.
  • Grow iteratively. Once something works, expand it in a controlled manner.
  • Document everything. Makes life easier come audit time.

I don't mean to over-simplify the problem. There's a lot we need to put in place to protect our information, and many of you are starting from scratch with limited resources. But over the rest of this series we'll show you the process, and highlight the most effective techniques we've seen.

Tomorrow we'll start with the Pragmatic Data Security Cycle, which forms the basis of our process.

–Rich

Thursday, May 21, 2009

The Pragmatic Data (Information-Centric) Security Cycle

By Rich

Way back when I started Securosis, I came up with something called the Data Security Lifecycle, which I later renamed the Information-Centric Security Cycle. While I think it does a good job of capturing all the components of data security, it's also somewhat dense. That lifecycle was designed to be a comprehensive outline of protective controls and information management, but I've since realized that if you have a specific data security problem, it isn't the best place to start.

In a couple weeks I'll be speaking at the TechTarget Financial Information Security Decisions conference in New York, where I'm presenting Pragmatic Data Security. By "pragmatic" I mean something you can implement as soon as you get home. Where the lifecycle answers the question, "How can I secure all my data throughout its entire lifecycle?" pragmatic data security answers, "How can I protect this specific data at this point in time, in my existing environment?"

It starts with a slimmed down cycle:

[Image: the Pragmatic Data Security Cycle]

  1. Define what information you want to protect (specifically, not general data classification)
  2. Discover where it's located (various tools/techniques, preferably automated, like DLP, rather than manual)
  3. Secure the data where it's stored, and/or eliminate data where it shouldn't be (access controls, encryption)
  4. Monitor data usage (various tools, including DLP, DAM, logs, SIEM)
  5. Protect the data from exfiltration (DLP, USB control, email security, web gateways, etc.)

For example, if you want to protect credit card numbers you'd define them in step 1, use DLP content discovery in step 2 to locate where they are stored, remove them or lock the repositories down in step 3, use DAM and DLP to monitor where they're going in step 4, and use blocking technologies to keep them from leaving the organization in step 5.

All too often I'm seeing people get totally wrapped up in complex "boil the ocean" projects that never go anywhere, vs. defining and solving a specific problem. You don't need to start your entire data security program with some massive data classification program. Pick one defined type of data/information, and just go protect it. Find it, lock it down, watch how it's being used, and stop it from going where you don't want.

Yeah, parts are hard, but hard != impossible. If you keep your focus, any hard problem is just a series of smaller, defined steps.

–Rich