Surrogate Splits in Classification and Regression Trees
Posted on (Update: )
This note is for Section 5.3 of Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1998). Classification and regression trees (1st CRC Press repr.). Chapman & Hall/CRC.
At any given node $t$, let $\Delta^*$ be the best split of the node into $t_L$ and $t_R$. In the basic tree structure, $\Delta^*$ is the best univariate split; in general, it can be the best linear or Boolean combination split.
Take any variable $x_m$. Let
- $S_m$: the set of all splits on $x_m$
- $\bar S_m$: the set of complements of the splits in $S_m$ (the complement of a split sends left exactly the cases the original split sends right, and vice versa)
For any split $\Delta_m\in S_m\cup \bar S_m$ of the node $t$ into $t_L’$ and $t_R’$, let
- $N_j(LL)$: the number of class-$j$ cases in $t$ that both $\Delta^*$ and $\Delta_m$ send left, i.e., into $t_L\cap t_L’$; define $N_j(RR)$ analogously for the cases both splits send right, into $t_R\cap t_R’$
Estimate the probability that a case falls into $t_L\cap t_L’$ as
\[p(t_L\cap t_L') = \sum_j \pi(j) N_j(LL) / N_j\,.\]Then define the estimated probability $p_{LL}(\Delta^*, \Delta_m)$ that both $\Delta^*$ and $\Delta_m$ send a case in $t$ left as
\[p_{LL}(\Delta^*, \Delta_m) = p(t_L\cap t_L')/p(t)\,.\]Why $p(t)$ in the denominator? Because $p_{LL}$ is a conditional probability: it estimates the chance that both splits send a case left *given* that the case falls into $t$. The probability $p_{RR}(\Delta^*, \Delta_m)$ that both splits send a case right is estimated analogously from the counts $N_j(RR)$.
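As a small numerical check of these estimates, here is a sketch for a two-class problem. All counts and priors below are made up for illustration; the variable names (`N_j`, `N_j_t`, `N_j_LL`) are not from the book.

```python
import numpy as np

pi = np.array([0.5, 0.5])        # class priors pi(j)
N_j = np.array([100, 100])       # class-j totals in the learning sample
N_j_t = np.array([40, 30])       # class-j cases falling in node t
N_j_LL = np.array([25, 10])      # class-j cases in t_L ∩ t_L'

p_t = np.sum(pi * N_j_t / N_j)       # resubstitution estimate of p(t)
p_joint = np.sum(pi * N_j_LL / N_j)  # estimate of p(t_L ∩ t_L')
p_LL = p_joint / p_t                 # conditional on the case falling in t
print(p_t, p_joint, p_LL)
```

With these numbers, $p(t) = 0.35$, $p(t_L\cap t_L') = 0.175$, so $p_{LL} = 0.5$: half the cases in $t$ are sent left by both splits.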
Suppose we tried to predict the results of the split $\Delta^*$ knowing only the results of the split $\Delta_m$. If $\Delta_m$ sends a case to $t_L’$, predict that $\Delta^*$ sends it to $t_L$, and for cases sent to $t_R’$, predict that they are sent to $t_R$. Then the probability that $\Delta_m$ predicts $\Delta^*$ correctly is estimated by
\[p(\Delta^*, \Delta_m) = p_{LL}(\Delta^*, \Delta_m) + p_{RR}(\Delta^*, \Delta_m)\,.\]A split $\tilde \Delta_m\in S_m \cup \bar S_m$ is called a surrogate split on $x_m$ for $\Delta^*$ if \(p(\Delta^*, \tilde \Delta_m) = \max_{\Delta_m}p(\Delta^*, \Delta_m)\,,\) where the maximum is over $S_m\cup \bar S_m$.
Usually, the surrogate split $\tilde \Delta_m$ is unique and can be interpreted as the split on $x_m$ that most accurately predicts the action of $\Delta^*$.
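The search for $\tilde\Delta_m$ can be sketched as follows. For simplicity the sketch assumes priors proportional to the class counts in $t$, so the agreement probability reduces to a plain case proportion within the node; the function name and data are illustrative, not from the book.

```python
import numpy as np

def best_surrogate(x, goes_left):
    """Find the split on variable x (among 'x <= c' and its complement)
    that best predicts the left/right action of the primary split.
    `goes_left` records which side the primary split sends each case."""
    best = (None, False, -1.0)           # (cutpoint, use_complement, agreement)
    for c in np.unique(x)[:-1]:          # candidate cut points 'x <= c'
        left = x <= c
        agree = np.mean(left == goes_left)   # agreement of the split in S_m
        comp = 1.0 - agree                   # its complement agrees elsewhere
        if agree > best[2]:
            best = (c, False, agree)
        if comp > best[2]:
            best = (c, True, comp)
    return best

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
goes_left = np.array([True, True, True, False, False, True])
print(best_surrogate(x, goes_left))   # (3.0, False, 0.8333...)
```

Here the surrogate `x <= 3.0` reproduces the primary split on five of the six cases, so the estimated $p(\Delta^*, \tilde\Delta_m) = 5/6$.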
Suppose that $\Delta^*$ sends the cases in $t$ left with relative probability $p_L = p(t_L)/p(t)$ and right with relative probability $p_R = p(t_R)/p(t)$, so that $p_L+p_R=1$. If a new case falls into $t$, a natural way to predict whether $\Delta^*$ sends it left or right is the rule:
Predict $t_L$ if $p_L=\max(p_L, p_R)$; otherwise $t_R$.
Given that a case falls into $t$, this rule has error probability $\min(p_L,p_R)$. Consequently, to measure how good a predictor $\tilde \Delta_m$ is, its error probability $1 - p(\Delta^*, \tilde\Delta_m)$ should be compared with $\min(p_L,p_R)$.
Define the predictive measure of association $\lambda(\Delta^*\mid \tilde \Delta_m)$ between $\Delta^*$ and $\tilde \Delta_m$ as \(\lambda(\Delta^*\mid \tilde\Delta_m) = \frac{\min(p_L, p_R)-(1-p(\Delta^*, \tilde\Delta_m))}{\min(p_L, p_R)}\,.\)
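In code, $\lambda$ is a one-liner; the numbers below are illustrative. Note from the formula that $\lambda = 1$ when the surrogate reproduces $\Delta^*$ exactly, and $\lambda \le 0$ when the surrogate predicts no better than the naive left/right rule, in which case it is of no use.

```python
def assoc(p_L, p_R, p_agree):
    """Predictive measure of association lambda(primary | surrogate):
    the relative reduction in error over the naive min(p_L, p_R) rule."""
    baseline = min(p_L, p_R)             # error of the naive rule
    return (baseline - (1.0 - p_agree)) / baseline

# A surrogate agreeing with the primary split on 90% of the cases,
# in a node where the primary split sends 60% of the cases left:
lam = assoc(0.6, 0.4, 0.9)
print(lam)   # (0.4 - 0.1) / 0.4 = 0.75
```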
The surrogate splits and their measures of predictive association have three major uses:
- handling missing data
- variable importance ranking
- detection of masking
If there are missing values, the best split $\Delta_m^*$ on each $x_m$ is computed using all cases for which $x_m$ is nonmissing, and $\Delta^*$ is then selected as the split $\Delta_m^*$ that maximizes the decrease in impurity $\Delta I$. If a case is missing the variable used by $\Delta^*$, so that $\Delta^*$ is not defined for it, proceed as follows. Among all variables nonmissing in the case, find the one, say $x_m$, whose surrogate $\tilde \Delta_m$ has the highest measure of predictive association with $\Delta^*$. Then split the case using $\tilde \Delta_m$.
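The routing step above can be sketched as follows, trying surrogates in decreasing order of $\lambda$. The data layout (a case as a dict with `NaN` for missing values, surrogates as `(variable, cutpoint, lambda)` triples) is an assumption made for the sketch, not the book's notation.

```python
import math

def send_left(case, primary_var, primary_cut, surrogates):
    """Decide left/right for `case` at a node split 'primary_var <= primary_cut',
    falling back on surrogate splits when the primary variable is missing."""
    if not math.isnan(case[primary_var]):
        return case[primary_var] <= primary_cut
    # primary split undefined for this case: try surrogates, best lambda first
    for var, cut, lam in sorted(surrogates, key=lambda s: -s[2]):
        if not math.isnan(case[var]):
            return case[var] <= cut
    raise ValueError("no usable surrogate for this case")

surrogates = [("x2", 1.5, 0.75), ("x3", 0.0, 0.40)]
case = {"x1": float("nan"), "x2": 2.0, "x3": -1.0}
print(send_left(case, "x1", 3.0, surrogates))   # x2 = 2.0 > 1.5, so: False
```

If the best surrogate's variable is itself missing, the loop falls through to the next-best surrogate, which matches the intent of ranking surrogates by their predictive association.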