“The outcome of any statistical analysis is essentially equally likely independent of whether any individual chooses to opt into the data set or to opt out of the data set.” – Cynthia Dwork
The goal of differential privacy is to protect the privacy of individuals who contribute to records in a database while still allowing for useful statistics to be derived from the aggregate information.
Why don’t you just strip away Personally Identifiable Information (PII)?
There are a couple of infamous examples of large organizations trying to do exactly that. Netflix had a competition, Netflix Prize, in which they anonymized the data set of 480,000 subscribers and challenged anyone to come up with the best collaborative filtering algorithm to predict user ratings for films. Two researchers from The University of Texas, Narayanan and Shmatikov, came up with a method to identify individual users by cross-referencing the data from IMDB. Another example is when the Massachusetts Group Insurance Commission released anonymized data on state employees to help researchers. After the data was publicized Dr. Sweeney, a computer science graduate, was able to cross-reference the data with voter data to uniquely identify governor William Weld and his medical records. These identifying attacks, called linkage attacks, can identify individuals by combining anonymized data sets with other data sets showing a clear need for stronger privacy protections.
So how does Differential Privacy work?
Differential Privacy uses math to determine the amount of disclosed information allowed before an individual’s information can be compromised. Once that is determined, an algorithm is used to add distortion (noise) to the data such that even if the data is released, the data would not be traceable back to an individual person. The amount of noise added is correlated with the accuracy of the information, in particular, the more distorted the data, the less accurate it will be.
As an example, you can think of asking someone a question like “Have you ever smoked?”, then the person flips a coin (in secret), if heads, they answer honestly, if tails, they flip another coin and answer “Yes” if heads, and “No” if tails. As you can see, looking at the individual answer, we do not know if the person has ever smoked or not, but aggregating many answers from many people, we can determine the fraction of people who smoke because the probability of answering with the truth is higher. In this case, a useful statistic can be derived, while still maintaining the privacy of the individual users.
There are currently only a few real-world use cases by Google, Apple and Microsoft.
- Microsoft is working with a power company in California to implement differential privacy on their smart power grid.
- Apple is using differential privacy to figure out popular emojis, words unknown to the QuickType dictionary, ddeep links within apps marked as “eligible for public indexing”, highlighted hints in their Notes app, health data types, and media playback preferences in Safari.
- Google is using their RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) project in Chrome to learn statistics about how unwanted software is hijacking users’ settings. The RAPPOR project is open-source and can be found here:
Sounds great, so should I be using it?
Implementing differential privacy from scratch would be a difficult task. Measuring the privacy budget and deciding which algorithms to use is non-trivial. There are only a handful of real-world examples coming from companies with extremely large databases at their disposal. From these examples, only Google has theirs open-sourced. It would be helpful to take a look at Google’s RAPPOR to see its implementation and get a better understanding of how it could be customized for your own needs. There are many complex parts to differential privacy and zero room for error. An improper implementation will make the private data vulnerable and can expose individuals.
The overall concept is that you are trading accuracy for privacy. Any kind of query will leak a bit of information otherwise the query would be useless. The more information that leaks, the greater the amount of noise is required to minimize leakage. This will ultimately lead to pure noise or useless data. What can be done is adjusting the algorithm to account for the previous leakage. There are also other measures that can be made such as limiting the amount of data from the same source. It may sound bad that the data converges to pure noise, however, if you think of cryptography, anything encoded can be decoded by brute force, it would just take a really long time, making it reasonable to use. Differential privacy works in much the same way, in that, adding the right amount of noise can prolong the inevitability of pure noise or useless data.