
Surrogate Splits in Classification and Regression Trees

Tags: Classification Trees, Regression Trees

This note is for Section 5.3 of Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1998). Classification and Regression Trees (CRC Press reprint). Chapman & Hall/CRC.

At any given node $t$, let $\Delta^*$ be the best split of $t$ into $t_L$ and $t_R$. In the basic tree structure, $\Delta^*$ is the best univariate split. In general, $\Delta^*$ can be the best linear combination or Boolean combination split.

Take any variable $x_m$. Let

  • $S_m$: the set of all splits on $x_m$
  • $\bar S_m$: the set of complementary splits, i.e., each split in $S_m$ with left and right interchanged

For any split $\Delta_m\in S_m\cup \bar S_m$ of the node $t$ into $t_L'$ and $t_R'$, let

  • $N_j(LL)$: the number of class-$j$ cases in $t$ that both $\Delta^*$ and $\Delta_m$ send left, i.e., that fall into $t_L\cap t_L'$
  • $N_j$: the total number of class-$j$ cases in the learning sample

Estimate the probability that a case falls into $t_L\cap t_L'$ as

\[p(t_L\cap t_L') = \sum_j \pi(j) N_j(LL) / N_j\,,\]

where $\pi(j)$ is the prior probability of class $j$.

Then define the estimated probability $p_{LL}(\Delta^*, \Delta_m)$ that both $\Delta^*$ and $\Delta_m$ send a case in $t$ left as

\[p_{LL}(\Delta^*, \Delta_m) = p(t_L\cap t_L')/p(t)\,.\]

Why is $p(t)$ in the denominator? Because $p_{LL}$ is a conditional probability given that a case falls into $t$: dividing the joint estimate $p(t_L\cap t_L')$ by $p(t)=\sum_j \pi(j) N_j(t)/N_j$ conditions on landing in $t$.

Suppose we tried to predict the results of the split $\Delta^*$ knowing only the results of the split $\Delta_m$. If $\Delta_m$ sends a case to $t_L'$, predict that $\Delta^*$ sends it to $t_L$; for cases sent to $t_R'$, predict that they are sent to $t_R$. Then the probability that $\Delta_m$ predicts $\Delta^*$ correctly is estimated by

\[p(\Delta^*, \Delta_m) = p_{LL}(\Delta^*, \Delta_m) + p_{RR}(\Delta^*, \Delta_m)\,.\]
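To make the bookkeeping concrete, here is a minimal Python sketch (not from the book) that estimates $p(\Delta^*, \Delta_m)$ from the class counts defined above; the function name `predictive_prob` and its argument layout are my own assumptions.

```python
import numpy as np

def predictive_prob(goes_left_star, goes_left_m, y, priors, n_per_class, p_t):
    """Estimate p(Delta*, Delta_m): the probability that the split Delta_m
    predicts the action of the best split Delta* on cases in node t.

    goes_left_star, goes_left_m : boolean arrays over the cases in t,
        True where Delta* (resp. Delta_m) sends the case left.
    y            : class label of each case in t (0, ..., J-1).
    priors       : pi(j), the prior probability of each class.
    n_per_class  : N_j, the number of class-j cases in the learning sample.
    p_t          : the resubstitution estimate of p(t).
    """
    J = len(priors)
    p_ll = p_rr = 0.0
    for j in range(J):
        in_j = (y == j)
        # N_j(LL): class-j cases in t that both splits send left
        n_ll = np.sum(in_j & goes_left_star & goes_left_m)
        # N_j(RR): class-j cases in t that both splits send right
        n_rr = np.sum(in_j & ~goes_left_star & ~goes_left_m)
        p_ll += priors[j] * n_ll / n_per_class[j]
        p_rr += priors[j] * n_rr / n_per_class[j]
    # divide by p(t) to condition on falling into node t
    return (p_ll + p_rr) / p_t
```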

A split $\tilde \Delta_m\in S_m \cup \bar S_m$ is called a surrogate split on $x_m$ for $\Delta^*$ if \(p(\Delta^*, \tilde \Delta_m) = \max_{\Delta_m}p(\Delta^*, \Delta_m)\), where the maximum is over $S_m\cup \bar S_m$.

Usually, the surrogate split $\tilde \Delta_m$ is unique and can be interpreted as the split on $x_m$ that most accurately predicts the action of $\Delta^*$.
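A brute-force way to locate $\tilde\Delta_m$ on an ordered variable is to scan the candidate cut points on $x_m$, together with their complements, and keep whichever split maximizes $p(\Delta^*, \Delta_m)$. A rough sketch, building on the hypothetical `predictive_prob` above:

```python
import numpy as np

def surrogate_split(x_m, goes_left_star, y, priors, n_per_class, p_t):
    """Search S_m and its complement for the surrogate split on x_m:
    the cut point and direction whose split best predicts Delta*.
    Returns (threshold, send_low_left, p_best)."""
    best = (None, True, -np.inf)
    for c in np.unique(x_m):                 # candidate cut points
        for send_low_left in (True, False):  # a split in S_m or its complement
            goes_left_m = (x_m <= c) if send_low_left else (x_m > c)
            p = predictive_prob(goes_left_star, goes_left_m, y,
                                priors, n_per_class, p_t)
            if p > best[2]:
                best = (c, send_low_left, p)
    return best
```

Trying each observed value of $x_m$ as a cut point mirrors the exhaustive search used for the primary splits; for a categorical variable the candidate set would instead range over subsets of its levels.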

Suppose that $\Delta^*$ sends the cases in $t$ left with relative probability $p_L$ and right with relative probability $p_R$, so that $p_L+p_R=1$. If a new case falls into $t$, a natural way to predict whether $\Delta^*$ sends it left or right is the rule:

Predict $t_L$ if $p_L=\max(p_L, p_R)$; otherwise $t_R$.

Given that a case falls into $t$, this rule has error probability $\min(p_L,p_R)$. Consequently, to get a measure of how good a predictor $\tilde \Delta_m$ is, its error probability $1-p(\Delta^*, \tilde\Delta_m)$ should be compared with $\min(p_L,p_R)$.

Define the predictive measure of association $\lambda(\Delta^*\mid \tilde \Delta_m)$ between $\Delta^*$ and $\tilde \Delta_m$ as

\[\lambda(\Delta^*\mid \tilde\Delta_m) = \frac{\min(p_L, p_R)-(1-p(\Delta^*, \tilde\Delta_m))}{\min(p_L, p_R)}\,.\]
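Numerically this is a one-liner; a small sketch (my own naming), reusing $p(\Delta^*, \tilde\Delta_m)$ as estimated above:

```python
def predictive_association(p_l, p_r, p_pred):
    """lambda(Delta* | Delta_m~): relative reduction in prediction error
    over the naive rule 'always predict the more probable side'.
    p_pred is the estimate of p(Delta*, Delta_m~)."""
    baseline = min(p_l, p_r)      # error probability of the naive rule
    return (baseline - (1.0 - p_pred)) / baseline
```

By construction, $\lambda$ close to 1 means $\tilde\Delta_m$ reproduces the action of $\Delta^*$ almost exactly, while $\lambda \le 0$ means it does no better than the naive rule.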

The surrogate splits and their measures of predictive association have three major uses:

  • handling missing data
  • variable importance ranking
  • detection of masking

If there are missing values, the best split $\Delta_m^*$ on $x_m$ is computed using all cases containing a value of $x_m$, and $\Delta^*$ is then selected as the split $\Delta_m^*$ that maximizes the decrease in impurity $\Delta I$. If a case has missing values so that $\Delta^*$ is not defined for it, proceed as follows: among all variables that are not missing in the case, find the one, say $x_m$, whose surrogate $\tilde \Delta_m$ has the highest measure of predictive association with $\Delta^*$, and then split the case using $\tilde \Delta_m$.
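A sketch of how a case might be dropped through the node when values are missing; the dictionary-style `case`, the ordering of surrogates by $\lambda$, and the `default_left` fallback are illustrative assumptions rather than the book's notation:

```python
def send_left(case, primary, surrogates, default_left):
    """Decide whether a case goes left at node t.

    primary      : (variable, split_function) for Delta*.
    surrogates   : list of (variable, split_function), ordered by decreasing
                   predictive association lambda with Delta*.
    default_left : fallback direction (e.g. the more probable side) when
                   every variable needed is missing.
    A split_function maps a value to True (left) / False (right).
    """
    var, split = primary
    if case.get(var) is not None:
        return split(case[var])
    for var, split in surrogates:        # try surrogates in order of lambda
        if case.get(var) is not None:
            return split(case[var])
    return default_left                  # nothing usable: fall back
```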

