# Big Data Paradox

By developing measures for data quality, this article suggests a framework to address the following question:

**Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?**

A 5-element Euler-formula-like identity shows that for any dataset of size $n$, probabilistic or not, the difference between the sample average $\bar X_n$ and the population average $\bar X_N$ is the product of three terms:

- a data quality measure, $\rho_{R,X}$, the correlation between $X_j$ and the response/recording indicator $R_j$
- a data quantity measure, $\sqrt{(N-n)/n}$
- a problem difficulty measure, $\sigma_X$

The decomposition provides multiple insights:

- probabilistic sampling ensures high data quality by controlling $\rho_{R,X}$ at the level of $N^{-1/2}$ (???)
- when this control is lost, the estimation error, relative to the benchmark rate $1/\sqrt n$, increases with $\sqrt N$
- the bigness of such Big Data (for population inferences) should be measured by the relative size $f=n/N$, not the absolute size $n$
- when combining data sources for population inferences, the relatively tiny but higher-quality ones should be given far more weight than their sizes suggest
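
The second and third insights follow from a one-line rearrangement of the identity. As a sketch in the notation above, with $f=n/N$, dividing the error by the benchmark $\sigma_X/\sqrt n$ gives

\[\frac{\bar X_n-\bar X_N}{\sigma_X/\sqrt n} = \rho_{R,X}\sqrt{\frac{N-n}{n}}\,\sqrt n = \rho_{R,X}\sqrt{N-n} = \rho_{R,X}\sqrt{N(1-f)}\]

So for fixed $\rho_{R,X}$ and $f$, the error measured in units of $\sigma_X/\sqrt n$ grows like $\sqrt N$; the ratio stays bounded only when $\rho_{R,X}$ is of order $N^{-1/2}$.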

Without taking data quality into account, population inferences with Big Data are subject to a **Big Data Paradox**: the more the data, the surer we fool ourselves.

**data defect index**

For population inferences, a key “policy proposal” of the paper is to shift from the traditional focus on assessing probabilistic uncertainty

\[\text{Standard Error}\propto \frac{\sigma}{\sqrt n}\]

to the practice of ascertaining systematic error in non-probabilistic Big Data, captured by

\[\text{Relative Bias} \propto \rho \sqrt N\]

The key question is:

**how to compare two datasets with different quantities and different qualities?**

Consider a finite population indexed by $j=1,\ldots,N$.

- $\bar G_N$: population average of $\{G_j\equiv G(X_j),j=1,\ldots,N\}$

Given a sample $I_n$ of size $n$, the sample average is

\[\bar G_n = \frac 1n\sum_{j\in I_n}G_j = \frac{\sum_{j=1}^NR_jG_j}{\sum_{j=1}^NR_j}\]

where $R_j=1$ if $j\in I_n$ and $R_j=0$ otherwise. The “R” (response/recording) gives the R-mechanism its name.

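The difference between $\bar G_n$ and $\bar G_N$ can be written as

\[\bar G_n-\bar G_N = \frac{\operatorname{Cov}_J(R_J, G_J)}{E_J(R_J)}\]

where $J$ is a random index distributed uniformly on $\{1,\ldots,N\}$, so that $E_J$ and $\operatorname{Cov}_J$ are finite-population averages. Let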

- $\rho_{R,G}=\operatorname{Corr}_J(R_J, G_J)$ be the population correlation between $R_J$ and $G_J$,
- $f=E_J(R_J)=n/N$ be the sampling rate,
- $\sigma_G$ be the standard deviation of $G_J$.
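
One step is worth making explicit, since the final identity relies on it. Writing $\sigma_R$ for the standard deviation of $R_J$ (a symbol introduced here only for illustration), $R_J$ is binary with mean $f$, so $\sigma_R^2=f(1-f)$ and the covariance factors as

\[\operatorname{Cov}_J(R_J,G_J)=\rho_{R,G}\,\sigma_R\,\sigma_G=\rho_{R,G}\sqrt{f(1-f)}\,\sigma_G\]

Dividing by $E_J(R_J)=f$ turns $\sqrt{f(1-f)}$ into $\sqrt{(1-f)/f}$.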

Then

\[\bar G_n - \bar G_N = \rho_{R, G} \times \sqrt{\frac{1-f}{f}} \times \sigma_G\]

Of the three factors, the most critical, yet the most challenging to assess, is **data quality**.
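
Because the identity is purely algebraic, it can be checked numerically on any finite population. The sketch below is a minimal illustration in Python with NumPy, not code from the paper: the simulated population, the logistic recording rule, and the function name `error_decomposition` are all assumptions made for the example. It compares a selective R-mechanism with a simple random sample of the same size.

```python
import numpy as np

def error_decomposition(G, R):
    """Return (actual error, rho, quantity term, sigma) for values G and 0/1 indicator R."""
    G = np.asarray(G, dtype=float)
    R = np.asarray(R, dtype=float)
    f = R.mean()                            # sampling rate f = n / N
    error = G[R == 1].mean() - G.mean()     # sample average minus population average
    rho = np.corrcoef(R, G)[0, 1]           # finite-population correlation of R and G
    quantity = np.sqrt((1 - f) / f)         # data quantity term
    sigma = G.std()                         # population SD (divides by N, not N - 1)
    return error, rho, quantity, sigma

rng = np.random.default_rng(0)
N = 100_000
G = rng.normal(size=N)                      # hypothetical population values

# Assumed selective mechanism: units with larger G are more likely to be recorded.
p_record = 1.0 / (1.0 + np.exp(-0.5 * G))
R_selective = (rng.random(N) < p_record).astype(int)

# Simple random sample of the same size n, for comparison.
n = int(R_selective.sum())
R_srs = np.zeros(N, dtype=int)
R_srs[rng.choice(N, size=n, replace=False)] = 1

for label, R in [("selective", R_selective), ("SRS", R_srs)]:
    error, rho, quantity, sigma = error_decomposition(G, R)
    product = rho * quantity * sigma
    print(f"{label:>9}: error={error:+.5f}  product={product:+.5f}  rho={rho:+.5f}")
```

In each row the actual error and the three-term product should agree up to floating-point rounding, whatever mechanism generated $R$. The two rows differ only in $\rho_{R,G}$, which is the sense in which data quality drives the error once $f$ and $\sigma_G$ are fixed.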