Assume you want to build a dataset regarding human behaviour to perform some sort of analysis. Say you survey a population of 1000 and ask each person whether they have skipped a traffic signal. ‘Skipped Traffic Signal’ will be one column of this dataset. The data that you are collecting here is sensitive. Now you, the curator of the dataset, want to make this data public so that someone can perform some statistical analysis on it. The analyst can then query the ‘Skipped Traffic Signal’ column of the dataset to find out whether a particular person skipped the signal or not.
But the people in the dataset want their data to be secured. They don’t want an analyst to be able to query something about them and be 100 percent sure of their activities. The question now is: how do we ensure the privacy of every individual while also giving the analyst data that is not too noisy to draw value from? We can conduct a simple experiment to do so:
Each individual tosses a coin. If it lands on heads, the individual has to tell the truth (whether he/she skipped a traffic signal or not).
If the first coin lands on tails, toss the coin again. Now the individual has to answer ‘Yes’ if it lands on heads and ‘No’ if it lands on tails. We aren’t really giving the person a choice here.
Think about the probability of a person responding ‘Yes’ even though that person actually hasn’t skipped a traffic signal. For this to happen, the second coin toss has to occur and it has to land on heads. That’s because if the first toss was heads, the person would have truthfully said that they did not skip the signal. But if the first coin toss is tails and the second coin toss is heads, the person will respond Yes even though the true response is No. P(Yes|No) = P(tails on first toss)*P(heads on second toss) = 0.5*0.5 = 0.25.
What is the probability of a person responding Yes when the true answer is in fact Yes? If the first toss is heads they have to tell the truth, which is Yes. Or the first toss can end up being tails and the second heads. P(Yes|Yes) = P(heads on first toss) + P(tails on first toss)*P(heads on second toss) = 0.5 + 0.5*0.5 = 0.75. Here P(A|B) is a conditional probability, meaning the probability of A happening given that B has already happened. If you want to know more about probabilities you can check out Khan Academy’s course.
Now imagine that the ‘true proportion’ of people who skipped a traffic signal is ‘p’. The proportion of those who did not is naturally ‘1-p’. What is the expected number of “Yes” answers once the coin toss experiment is done? For this, look back at the two probabilities we computed. Either the person skipped the traffic signal and responded Yes (P(Yes|Yes)) or he/she did not and responded Yes (P(Yes|No)).
A. Out of the proportion ‘p’ who did skip the traffic signal, how many people are expected to say “Yes”?
B. Out of the proportion ‘1-p’ who did not skip the traffic signal, how many people are expected to say “Yes”?
If we add up these two expected “Yes” values we get the expected number of “Yes” in the whole population.
The expected value of a discrete random variable X is E[X] = sum over all x of x*P(X = x). This is the formal definition of expected value. If we translate it to our case, it would look something like this: expected proportion of “Yes” = P(Yes|Yes)*p + P(Yes|No)*(1-p) = 0.75*p + 0.25*(1-p) = 0.5*p + 0.25. Multiply this by the population size (1000 here) to get the expected number of “Yes” answers. If you look at the individual components closely, they answer questions A and B respectively.
Hence we have added “noise” to our dataset. As explained in Cynthia Dwork and Aaron Roth’s book “The Algorithmic Foundations of Differential Privacy”, the coin toss gives people plausible deniability. That means that when the coin toss ends up in them answering Yes even though the true answer is No, they can deny it and blame it on the coin toss (the second toss to be specific, since in that toss we don’t really give the person the option to tell the truth).
Now imagine you are an analyst who wants to look at this data of people who tossed the coin. Let us say you want to query the dataset for a particular person to find out whether he/she skipped a traffic signal. Imagine what this would look like if the data had not been made private by the coin toss: you would learn that person’s true answer with certainty. With the coin toss, a Yes may simply be the outcome of the second flip.
The augmented database is the one obtained after the 2 coin flips. The interesting thing happening here is that we multiply the augmented mean by 2 and subtract 0.5 from it. Why is that? Again, look back at the formula we derived for the expected number of “Yes” after the 2 coin flips are done on everyone. This expected value is not a true representation of the population. Even though we are adding privacy for the users, we cannot report this mean back to the analyst as it is. We would still want the analyst to obtain valuable information about the dataset while at the same time preserving the privacy of all users. Since the augmented mean is expected to be 0.5*p + 0.25, solving for p gives p = 2*(augmented mean) - 0.5, which is the estimate we report.
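The two-coin experiment and the correction of the mean can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's own code: the function names (`probability_of_yes`, `randomized_response`, `estimate_true_proportion`), the true rate of 0.3, and the random seed are all hypothetical choices made for the example.

```python
import random
from typing import Sequence


def probability_of_yes(true_answer: bool) -> float:
    """Exact P(Yes | true answer), enumerating the four equally likely flip pairs."""
    outcomes = [(first, second) for first in "HT" for second in "HT"]
    yes_count = sum(
        1
        for first, second in outcomes
        # First flip heads: report the truth. Tails: second flip decides.
        if (true_answer if first == "H" else second == "H")
    )
    return yes_count / len(outcomes)


def randomized_response(true_answer: bool, rng: random.Random) -> bool:
    """Answer reported by one person after the two coin flips."""
    if rng.random() < 0.5:        # first flip lands on heads
        return true_answer        # tell the truth
    return rng.random() < 0.5     # second flip: heads -> Yes, tails -> No


def estimate_true_proportion(reported: Sequence[bool]) -> float:
    """Invert E[augmented mean] = 0.5 * p + 0.25 to estimate p."""
    augmented_mean = sum(reported) / len(reported)
    return 2 * augmented_mean - 0.5


# Hypothetical survey: 1000 people, 30% of whom truly skipped a signal.
rng = random.Random(0)
true_answers = [rng.random() < 0.3 for _ in range(1000)]
reported = [randomized_response(answer, rng) for answer in true_answers]

print("P(Yes|Yes):", probability_of_yes(True))
print("P(Yes|No): ", probability_of_yes(False))
print("true mean:      ", sum(true_answers) / len(true_answers))
print("augmented mean: ", sum(reported) / len(reported))
print("corrected mean: ", estimate_true_proportion(reported))
```

With only 1000 respondents the corrected mean will usually be close to the true mean but not equal to it, and because it is computed as 2*(augmented mean) - 0.5 it can in principle fall slightly outside the range 0 to 1 for small or extreme samples.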