GigaOm offers a fascinating glimpse into Netflix’s EC2 architecture: Netflix shows off how it does Hadoop in the cloud:
“Hadoop is more than a platform on which data scientists and business analysts can do their work. Aside from their 500-plus-nod[sic] cluster of Elastic MapReduce instances, there’s another equally sized cluster for extract-transform-load (ETL) workloads – essentially, taking data from other sources and making it easy to analyze within Hadoop. Netflix also deploys various “development” clusters as needed, presumably for ad hoc experimental jobs.”
The big data users I have spoken with about data security agreed that data masking at that scale is infeasible. Given the rate of data insertion (also called ‘velocity’), masking sensitive data before loading it into a cluster would require “an entire ETL cluster to front the Hadoop cluster”. But apparently it’s doable, and Netflix did just that – fronted its analytics cluster with a data transformation cluster, all within EC2. 500 nodes massaging data for another 500 nodes. While the ETL cluster is not used for masking, note that it is about the same size as the analysis cluster. It’s this one-to-one mapping that I often worry about with security. Ask yourself, “Do we need another whole cluster for masking?” No? Then what about NoSQL activity monitoring? What about IAM, application monitoring, and any other security tasks. Do you start to see the problem with bolting on security? Logging and auditing are embeddable – most everything else is not.
When the Cloud Security Alliance advised reinvestment of some savings back into security, I don’t think this is quite what they had in mind.