Consider dataset containing an entry of our data and without that entry. If we run the same algorithm on both datasets, we get the outputs and . To ensure privacy, we want to make and โindistinguishable.โ That is, for an observer who knows everything except which output corresponds to which dataset, they canโt tell which output is the result of which dataset.
The only way to ensure this is through randomization and probabilistic indistinguishability. One example is ๐ Randomized Response; following this idea, we โrandomizeโ a ๐ฒ Probability Distribution to be similar to the original.
Formally, an algorithm is -differentially private is for any neighboring datasets and , the distributions for the outputs and are -close,
for any subset of outcomes. With , we have perfect privacy, and the higher the , the less privacy we have.
Intuitively, we can treat is the set of โharmfulโ outcomes. This is a guarantee that using our data () or not using it () doesnโt change the risk of the harmful outcomes happening. Note that this is different from guaranteeing that nothing bad will happen.
The post-processing immunity of differential privacy states that any algorithm processing the output of a differentially-private algorithm cannot โundoโ the differential privacy. That is, for some -DP algorithm and post-processing algorithm , is also -DP.
If is an algorithm that tries to recover the input to (that is, outputs or ), then we can show that our chance of a correct answer is not much better than random guessing. First, letโs approximate . Then, the probability of a correct guess for is
Then, adding up the probability of correct guesses for both and gives us
The right hand side is close to . However, a perfect predictor would have each of these correct probabilities be , meaning its sum would be . The right hand sideโs upper bound of thus shows that the best can do is close to random guessing.
Though differential privacy protects against post-processing, the paired output of two parallel -DP and -DP algorithms is -DP. Generally, if there are datasets, each processed with an -DP algorithm, then their composition is -DP.