# The Cost of Privacy

## Introduction

The increasing concerns that statistical analysis of datasets (containing sensitive personal information) may compromise individual privacy give rise to statistical methods that provide privacy guarantees at the cost of statistical accuracy, but there has been very limited understanding of the optimal tradeoff between accuracy and privacy for many important statistical problems.

Differential privacy, introduced in Dwork et al. (2006), is arguably the most widely adopted definition of privacy in statistical data analysis.

A usual approach to developing differentially private algorithms is perturbing the output of non-private algorithms by random noise.

• when the observations are continuous, differential privacy can be guaranteed by adding Laplace/Gaussian noise to the non-private output
• for discrete data, differential privacy can be achieved by adding Gumbel noise to utility score functions.

Naturally, the processed output suffers from some loss of accuracy, and the goal of the paper is to provide a quantitative characterization of the tradeoff between differential privacy guarantees and statistical accuracy, under the statistical framework.

Specifically, consider the problem for mean estimation and linear regression models in both classical and high-dimensional setting with $(\varepsilon,\delta)$-differential privacy constraint.

A randomized algorithm $M$ is $(\epsilon, \delta)$-differentially private iff for every pair of adjacent datasets $X_{1:n}$ and $X_{1:n}'$, and for any set $S$, $$P(M(X_{1:n})\le S)\le e^\epsilon \cdot P(M(X_{1:n}')\in S) +\delta\,,$$ where we say two datasets $X_{1:n}=\{x_i\}_{i=1}^n$ and $X_{1:n}'=\{x_i'\}_{i=1}^n$ are adjacent iff $\sum_{i=1}^n1(x_i\neq x_i')=1$.

The raw definition in Dwork et al. (2006) is

A mechanism is $\epsilon$-indistinguishable if for all pairs $\x, \x'\in D^n$ which differ in only one entry, for all adversaries $\cA$, and for all transcripts $t$: $$\left\vert \log \left(\frac{\Pr[\cT_\cA(\x)=t]}{\Pr[\cT_\cA(\x')=t]}\right) \right\vert\le\varepsilon\,.$$

According to the definition, the two parameter $\epsilon$ and $\delta$ control the level of privacy against an adversary who attempts to detect the presence of a certain subject in the sample. Roughly speaking, $\epsilon$ is an upper bound on the amount of influence an individual’s record has on the information released and $\delta$ is the probability that this bound fails to hold.

The authors establish the necessary cost of privacy by providing minimax risk lower bounds under the $(\varepsilon, \delta)$-differential privacy constraint.

Published in categories Memo