Differential Privacy (Udacity Course)
- Most accurate query with the greatest amount of privacy
- Greatest fit with trusted models in the actual world (don't waste trust)
- Create flexible DP strategies
Types of DP
- Local: add noise to each data point
- Global: add noise to query output
Local DP
Coin flip jaywalking example
- Flip coin 2x
- If first coin flip is heads, answer honestly
- If first coin flip is tails, answer according to the second coin flip (heads for yes, tails for no)
Each person is now protected with plausible deniability
If we collect many samples and 60% answer yes, we can estimate that the true rate is 70%: the observed rate is the average of the true rate and the 50% coin-flip rate, and 70% averaged with 50% gives the 60% we observed.
NB: This privacy technique comes at the cost of accuracy, especially when we only have a few samples. The greater the privacy protection (plausible deniability), the less accurate the results.
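The coin-flip mechanism and the de-skewing step above can be sketched in a short simulation. The function name `randomized_response` and the 70% true jaywalking rate are illustrative choices, not from the course:

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(true_answers, rng):
    """Two-coin-flip mechanism over an array of 0/1 answers."""
    # First flip heads -> answer honestly; tails -> answer per second flip
    first_heads = rng.random(len(true_answers)) < 0.5
    second_flip = (rng.random(len(true_answers)) < 0.5).astype(int)  # heads = yes
    return np.where(first_heads, true_answers, second_flip)

# Simulate 10,000 people, 70% of whom truly jaywalk (hypothetical rate)
true_answers = (rng.random(10_000) < 0.7).astype(int)
noisy = randomized_response(true_answers, rng)

observed = noisy.mean()              # near 0.5*0.7 + 0.5*0.5 = 0.60
estimated_true = 2 * observed - 0.5  # invert the averaging with 50%
```

Solving `observed = 0.5*true + 0.5*0.5` for `true` gives the `2*observed - 0.5` correction; with only a few samples the estimate gets noisy, which is the accuracy cost noted above.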
Types of Noise
- Gaussian
- Laplacian (typical - Delta always zero)
How much noise to add?
- [Type of noise]
- Sensitivity of the query (how much the output can change if one person is removed)
- Desired Epsilon (major privacy parameter)
- Desired Delta (minor privacy parameter - in case a query doesn't satisfy Epsilon)
Laplacian Noise
- Increased or decreased according to a 'scale' parameter B (Beta)
- Beta = Sensitivity(Query) / Epsilon
- Delta always zero with Laplacian Noise
- Sample the noise with np.random.laplace
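A minimal sketch of the Laplacian mechanism using the Beta = Sensitivity(Query) / Epsilon formula above. The function name `laplace_mechanism` and the example database are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(query_result, sensitivity, epsilon, rng):
    """Add Laplacian noise with scale beta = sensitivity / epsilon."""
    beta = sensitivity / epsilon
    return query_result + rng.laplace(loc=0.0, scale=beta)

# Counting query: removing any one person changes the count by at most 1,
# so the sensitivity is 1
db = rng.random(100) < 0.5  # hypothetical boolean database
true_count = db.sum()
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5, rng=rng)
```

Smaller epsilon means larger beta and more noise, i.e. more privacy at the cost of accuracy; delta stays zero because the Laplace mechanism satisfies pure epsilon-DP.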
Perfect Privacy (AI model)
Training a model on a dataset should return the same model even if we remove any person from the training set.
Training a model is kind of like querying a database
Two points of complexity
- Do we always know where 'people' are referenced in the dataset?
- Neural models rarely ever train to the same location, even when trained on the same dataset twice (element of randomness in the training process)
Hospital Scenario
- You have unannotated medical data and want to build a classifier
- 10 partner hospitals have annotated data
Steps
- Ask each hospital to train a model on their own dataset (10 models generated)
- Use each model to predict on your own local dataset, generating 10 labels for each datapoint
- Perform a DP query to generate the final true (DP) label for each datapoint (a noisy max: count the votes for each label across the 10 predictions, add Laplacian noise to the counts, then take the most frequent label)
- Retrain a new model on your local dataset which now has DP labels
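The noisy-max aggregation step above can be sketched as follows. The function name `noisy_max_label`, the 3-class setup, and the epsilon value are illustrative assumptions, not from the course:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_max_label(teacher_labels, num_classes, epsilon, rng):
    """Aggregate one datapoint's labels from the partner-hospital models:
    count votes per class, add Laplacian noise to each count, then take
    the argmax so the released label is differentially private."""
    counts = np.bincount(teacher_labels, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=1.0 / epsilon, size=num_classes)
    return int(np.argmax(counts))

# 10 hypothetical teacher predictions for one datapoint (classes 0-2)
teacher_labels = np.array([1, 1, 1, 1, 1, 1, 2, 0, 1, 1])
dp_label = noisy_max_label(teacher_labels, num_classes=3, epsilon=5.0, rng=rng)
```

Running this per datapoint yields the DP labels used to retrain the local model; this teacher-vote aggregation is the idea behind PATE-style training.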
Misc Notes
https://arxiv.org/abs/1607.00133
Copyright / Privacy can cause implicit bias issues
https://shows.pippa.io/ipse-dixit/episodes/amanda-levendowski-on-copyright-ais-implicit-bias-problem
https://www.wired.com/2014/11/hacker-lexicon-homomorphic-encryption/
Privacy: Control information about your life and other people's access to that information.
Open AI interview on privacy: https://www.youtube.com/watch?v=by08lyQ18EA