FDR Control for Byzantine Machines
In a distributed learning system, a small fraction of worker machines may send arbitrary information, due to malicious attacks on worker machines and communication channels, or to variation and contamination in the data sources. Such abnormal machines are called Byzantine machines.
Byzantine robust learning
This line of work focuses on aggregating workers' messages, e.g., averages of gradients, via robust estimation algorithms in the master machine.
One popular choice is to take the median instead of the sample mean. However, such median-type algorithms suffer a dimension-dependent bias under Byzantine failures.
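As a toy illustration (not the paper's setup), the sketch below compares mean and coordinate-wise median aggregation of simulated worker gradients; all names and numbers are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 20, 50                          # workers, gradient dimension

# good workers report noisy copies of the true gradient (zeros here)
grads = rng.normal(0.0, 1.0, size=(m, p))
grads[:3] += 10.0                      # 3 Byzantine workers send shifted values

mean_agg = grads.mean(axis=0)          # ruined by the Byzantine shift
median_agg = np.median(grads, axis=0)  # coordinate-wise median is robust

print(np.linalg.norm(mean_agg), np.linalg.norm(median_agg))
```

Even here, though, the coordinate-wise median carries a small per-coordinate bias, so, roughly speaking, its error in Euclidean norm still grows with the dimension $p$.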
Another direction is to detect and delete Byzantine machines, and then estimate based on reputation scores that measure the trustworthiness of worker machines.
Outlier detection methods
A Byzantine machine behaves like an outlier.
Traditionally, an outlier detection procedure first obtains a robust center estimate and then computes p-values, based on some efficient test, to evaluate whether each point is an outlier.
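A minimal univariate sketch of this classical recipe, assuming one summary statistic per machine and Gaussian calibration (both simplifying assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
stats = rng.normal(0.0, 1.0, size=20)   # one summary statistic per machine
stats[:3] += 5.0                        # 3 outlying (Byzantine) machines

center = np.median(stats)
scale = 1.4826 * np.median(np.abs(stats - center))  # MAD, scaled for normality

z = (stats - center) / scale
pvals = 2 * norm.sf(np.abs(z))          # two-sided p-values per machine
print(np.round(pvals, 4))
```

Calibrating such p-values becomes delicate in high dimensions, which is exactly what a p-value-free procedure avoids.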
Contributions
The paper proposes a p-value-free and dimension-insensitive detection procedure, named Byzantine Machines Identification (ByMI).
Problem Formulation
Assume $N$ independent samples $\{s_i\}_{i=1}^N$ are evenly stored on $m + 1$ machines $\cM_0, \cM_1,\ldots, \cM_m$, each of which contains $n$ observations, so that $N = (m+1)n$.
- $s$ could be either a $p$-variate random vector $x$ or a pair $(y, x)$
- $\cM_0$: the master machine that is in charge of integrating information from worker machines $\cM_1,\ldots,\cM_m$
- $\cB$: Byzantine machines set
- $\cG$: good/normal machines set
Assume the normal data are i.i.d. draws from $P_0$, while the data on Byzantine machines are not from $P_0$.
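To fix ideas, here is a small simulation of this setup; the Gaussian $P_0$ and the mean-shift contamination are assumptions made for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 20, 100, 5            # workers, samples per machine, dimension
byz = {0, 1, 2}                 # indices of Byzantine worker machines

machines = []                   # machines[j] holds the data on worker j
for j in range(m):
    x = rng.normal(0.0, 1.0, size=(n, p))   # normal data: i.i.d. from P_0
    if j in byz:
        x += 2.0                             # Byzantine data: not from P_0
    machines.append(x)
```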
Identifying Byzantine machines is then the following multiple testing problem:
\[H_{0j}: \cM_j \in \cG\quad \text{v.s.} \quad H_{1j}: \cM_j\in \cB,\quad j \in [m]\]

Byzantine Machines Identification Procedure
Consider a generic distributed risk minimization framework.
- $\ell(s, \theta)$: the loss function with parameter $\theta$
- $g(s, \theta)$: the gradient of the loss with respect to $\theta$
Each worker machine computes an empirical gradient at a given parameter value $\theta_0$:
\[g_j(\theta_0) = \frac 1n \sum_{i\in \cM_j} g(s_i, \theta_0)\]

Denote by $\mu^\star$ the mean of $g_j(\theta_0)$ when $\cM_j\in \cG$ is a normal machine.
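For concreteness, here is a sketch with the squared loss $\ell(s, \theta) = \tfrac12 (y - x^\top\theta)^2$, whose gradient is $g(s, \theta) = -(y - x^\top\theta)x$; the loss choice is illustrative, not from the paper:

```python
import numpy as np

def empirical_gradient(X, y, theta0):
    """g_j(theta0) = (1/n) * sum_i g(s_i, theta0) for squared loss."""
    resid = y - X @ theta0
    return -(X.T @ resid) / len(y)

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))                      # one machine's covariates
y = X @ rng.normal(size=p) + rng.normal(size=n)  # responses
print(empirical_gradient(X, y, np.zeros(p)))
```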
Then the alternative hypotheses can be written as
\[H_{1j}: \bbE[g_j(\theta_0)] \neq \mu^\star\quad\text{for } j \in [m]\]

The procedure proceeds as follows (an end-to-end sketch follows the list):
- data splitting: split each machine's data in half and compute two empirical gradients, $g_{1j}$ and $g_{2j}$
- obtain a robust mean estimator of $\mu^\star$, denoted by $\hat g(\theta_0)$
- construct a ranking score $W_j$ that measures the evidence that $\cM_j$ is a Byzantine machine
- a large positive $W_j$ indicates that $\cM_j$ is likely to be a Byzantine machine
- FDR control: flag the machines whose scores exceed a data-driven threshold chosen so that the false discovery rate is controlled at a target level
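Putting the steps together, here is an end-to-end sketch under simplifying assumptions. The score $W_j = \langle g_{1j} - \hat\mu_1,\, g_{2j} - \hat\mu_2\rangle$ and the knockoff-style threshold below are one standard way to build a p-value-free, symmetry-based procedure; they are not necessarily the paper's exact construction, and all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p, q = 50, 200, 10, 0.1          # machines, samples, dimension, FDR level
byz = set(range(5))                    # true Byzantine machines

# Simulate per-machine gradient samples at theta_0 directly:
# good machines have mean mu_star = 0, Byzantine machines are shifted.
g1, g2 = np.empty((m, p)), np.empty((m, p))
for j in range(m):
    g = rng.normal(0.0, 1.0, size=(n, p))
    if j in byz:
        g += 1.0
    g1[j] = g[: n // 2].mean(axis=0)   # data splitting: first half
    g2[j] = g[n // 2 :].mean(axis=0)   # data splitting: second half

# Robust (coordinate-wise median) estimates of mu_star from each half.
mu1_hat = np.median(g1, axis=0)
mu2_hat = np.median(g2, axis=0)

# Ranking score: large positive W_j suggests M_j is Byzantine; under the
# null the two centered halves are independent, so W_j is approximately
# symmetric about 0, which allows estimating the false discovery proportion.
W = np.sum((g1 - mu1_hat) * (g2 - mu2_hat), axis=1)

# Data-driven threshold: smallest t whose estimated FDP is below q.
T = np.inf
for t in np.sort(np.abs(W)):
    fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
    if fdp_hat <= q:
        T = t
        break

print("flagged machines:", np.where(W >= T)[0])
```

The threshold uses the negative tail $\#\{j: W_j \le -t\}$ as a proxy for the number of false positives among $\{j: W_j \ge t\}$, which is what makes the procedure p-value free.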