# Surrogate Splits in Classification and Regression Trees

##### Posted on Jan 08, 2020 (Update: Jan 10, 2022)

This note is for Section 5.3 of Breiman, L. (Ed.). (1998). Classification and regression trees (1. CRC Press repr). Chapman & Hall/CRC.

At any given node $t$, let $\Delta t$ be the best split of the node into $t_L$ and $t_R$. In the basic tree structure, $\Delta^*$ is the best univariate split. In general, $\Delta^*$ can be the best linear or Boolean combination split.

Take any variable $x_m$. Let

• $S_m$: the set of all splits on $x_m$
• $\bar S_m$: the set of splits complementary to $S_m$

For any split $\Delta_m\in S_m\cup \bar S_m$ of the node $t$ into $t_L’$ and $t_R’$, let

• $N_j(LL)$: the number of cases in $t$ that both $\Delta^*$ and $\Delta_m$ send left, i.e., $t_L\cap t_L’$

Estimate the probability that a case falls into $t_L\cap t_L’$ as

$p(t_L\cap t_L') = \sum_j \pi(j) N_j(LL) / N_j\,.$

Then define the estimated probability $p_{LL}(\Delta^*, \Delta_m)$ that both $\Delta^*$ and $\Delta_m$ send a case in $t$ left as

$p_{LL}(\Delta^*, \Delta_m) = p(t_L\cap t_L')/p(t)\,.$

why $p(t)$ in the denominator?

Suppose we tried to predict the results of the split $\Delta^*$ knowing only the results of the split $\Delta_m$. If $\Delta_m$ sends a case to $t_L’$, predict that $\Delta^*$ sends it to $t_L$, and for cases sent to $t_R’$, predict that they are sent to $t_R$. Then the probability that $\Delta_m$ predicts $\Delta^*$ correctly is estimated by

$p(\Delta^*, \Delta_m) = p_{LL}(\Delta^*, \Delta_m) + p_{RR}(\Delta^*, \Delta_m)\,.$

A split $\tilde \Delta_m\in S_m \cap \bar S_m$ is called a surrogate split on $x_m$ for $\Delta^*$ if $$p(\Delta^*, \tilde \Delta_m) = \max_{\Delta_m}p(\Delta^\*, \Delta_m)$$ where the maximum is over $S_m\cap \bar S_m$.

Usually, the surrogate split $\tilde \Delta_m$ is unique and can be interpreted as the split on $x_m$ that most accurately predicts the action of $\Delta^*$.

Suppose that $\Delta^*$ sends the cases in $t$ left with relative probability $p_L$ and right with relative probability $p_R$, and we have $p_L+p_R=1$. If a new case falls into $t$, then a natural way to predict whether it will send it right or left is the rule:

Predict $t_L$ if $p_L=\max(p_L, p_R)$; otherwise $t_R$.

Given that a case falls into $t$, this rule has error probability $\min(p_L,p_R)$. In consequence, to get a measure of how good a predictor $\tilde \Delta_m$ is, its error probability should be compared with $\min(p_L,p_R)$.

Define the predictive measure of association $\lambda(\Delta^*\mid \tilde \Delta_m)$ between $\Delta^*$ and $\tilde \Delta_m$ as $$\lambda(\Delta^*\mid \tilde\Delta_m) = \frac{\min(p_L, p_R)-(1-p(\Delta^*, \tilde\Delta_m))}{\min(p_L, p_R)}\,.$$

The surrogate splits and their measures of predictive association have three major uses:

• handling missing data
• variable importance ranking
If there are missing values, the best split $\Delta_m^*$ on $x_m$ is computed using all cases containing a value of $x_m$ and then $\Delta^*$ selected as that split $\Delta_m^*$ which maximizes $\Delta I$. If a case has missing values so that $\Delta^*$ is not defined for that case, proceed as follows. Among all nonmissing variables in the case, find that one, say, $x_m$, with $\tilde \Delta_m$ having the highest measure of predictive association with $\Delta^*$. Then split the case using $\tilde \Delta_m$.