# Conditional Quantile Regression Forests


This note is based on the slides of the seminar talk *Conditional Quantile Random Forest* by Dr. Huichen ZHU.

## Motivation

### REactions to Acute Care and Hospitalization (REACH) study

- patients who suffer from acute coronary syndrome (ACS) are at high risk for many adverse outcomes, including recurrent cardiac events, re-hospitalizations, major mental disorders, and mortality. There is an urgent need for post-discharge chronic care for ACS patients.
- developing accurate predictive models for those adverse events is the first step in responding to this need.
- The Columbia University Medical Center has maintained an ongoing cohort since 2013, called the REactions to Acute Care and Hospitalization (REACH) study.
- It includes rich and comprehensive electronic medical records (EMRs) of 1744 ACS patients.
- It provides a unique and ideal platform for developing high-dimensional statistical predictive analyses for ACS patients.

### Scientific/Clinical Objectives

- one of the adverse events is post-traumatic stress disorder (PTSD), which leads to a much higher risk of recurrent cardiovascular events and mortality.
- about 20% of the patients in the REACH cohort scored above the threshold on a PTSD checklist within one month after emergency room (ER) discharge.
**Goal**: develop a statistical approach to enhance the prediction of PTSD at the time of discharge.

### Traditional Approaches to Predict Continuous Outcome

- Linear regression
  - with a set of pre-specified key predictors
  - with variable screening and selection to include a large number of potential predictors
    - LASSO
    - SCAD
    - elastic net
- Machine learning approaches
  - random forest
  - support vector machine
  - linear discriminant analysis
  - k-means clustering

### Quantile Regression

- Conditional quantile: $Q_\tau(Y\mid X)=\inf\{y:\Pr(Y\le y\mid X)\ge \tau\}$
- Quantile regression (Koenker & Bassett Jr, 1978) is an extension of traditional linear regression
- Quantile regression allows us to study the impact of predictors on different quantiles of the response distribution, and thus provides a complete picture of the relationship between $Y$ and $X$.
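A minimal numerical sketch (my own illustration, not from the slides) of the check (pinball) loss that underlies quantile regression: minimizing the average check loss over a constant recovers the empirical $\tau$-quantile.

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check (pinball) loss: rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

rng = np.random.default_rng(0)
y = rng.normal(size=2000)

# Minimizing the average check loss over a constant c recovers the tau-quantile of y.
tau = 0.75
grid = np.linspace(-3, 3, 1201)
losses = [check_loss(y - c, tau).mean() for c in grid]
c_hat = grid[int(np.argmin(losses))]

print(c_hat, np.quantile(y, tau))  # the two values should be close
```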

### Combine the Domain Knowledge

- Mental health theory for ACS survivors indicates that PTSD due to a medical event is often driven by the fear of death and recurrence
- The impact of this fear also varies across individuals
- The REACH study has assessed the fear level using multiple measures during emergency room (ER) visits
- Incorporating such domain knowledge into a data-driven machine learning approach is potentially another way to improve prediction

*(?? Is the comparison fair, given that the compared methods above do not incorporate such domain knowledge?)*

### Existing Works

Only a few previous works involve quantile-related splitting criteria or predictions in regression trees or random forests.

- Chaudhuri and Loh (2002) and Bhat et al. (2015) proposed to partition the sample using conditional quantile loss functions at a fixed quantile level
- Meinshausen (2006) proposed to use empirical distribution functions for predictions, but each individual tree is still grown with CART

## Proposed Methods

### A Non-parametric Interactive Quantile Model for PTSD prediction

Assume a non-parametric interactive quantile model

\[Q_{y_i}(\tau, x_i, z_i) = \beta_0(\tau, x_i) + z_i^T\beta_1(\tau, x_i)\,,\]where

- $y_i$: individual post-ACS PTSD scores
- $z_i$: key predictors (level of fear during ER visit, PTSD score at baseline)
- $x_i$: a $p$-dimensional patient profile from EMR
- $\tau\in (0,1)$: the quantile level

### Splitting Criterion

- For any recursive partition approach, a splitting criterion is the most crucial part.
- Classical regression tree/forest choose the optimal split that leads to the greatest reduction in Residual Sum of Squares (RSS).
- It partitions the sample into subsets with distinctive means.

### Conditional Quantile-based Splitting Criterion I

#### Step 1

First choose a $K$ and set $\tau_k = k/(K+1),k=1,2,\ldots,K$. Build a quantile regression model for each $\tau_k$ at node $N$ as

\[Q_y(\tau_k,z,x\in N) = z^T\beta_N(\tau_k, 0)\,,\]where $\beta_N(\tau_k, 0)$ is the quantile coefficient at quantile level $\tau_k$, with the second argument $0$ denoting "no split" *(?? confused about the notation at first; why take 0 as the second argument — the idea becomes clear in the following context)*. Estimate each $\beta_N(\tau_k, 0)$ by

\[\hat \beta_N(\tau_k, 0) = \argmin_\beta L_N(0, \tau_k, \beta) = \argmin_\beta \frac{1}{\vert N\vert} \sum_i\rho_{\tau_k}\left(y_i - z_i^T\beta\right)I\left\{x_i\in N\right\}\,.\]

#### Step 2

Re-model the data in $N$ following the split $s$ as

\[Q_Y(\tau, z_i, s) = z_i^T\beta_1(\tau, s) + \delta_i(s)z_i^T\beta_2(\tau, s)\,,\]where $\beta_N(\tau, s) = (\beta_1(\tau, s)^T, \beta_2(\tau, s)^T)^T$ represents the quantile coefficients following the partition $s$, and

\[\delta_i(s) = \begin{cases} 1 & \text{$x_i$ belongs to the resulting left child node $N_L^s$}\\ 0 & \text{o.w.} \end{cases}\]Estimate each $\beta_N(\tau_k, s)$ by

\[\hat \beta_N(\tau_k, s) = \argmin_\beta L_N(s,\tau_k, \beta) = \argmin_{\beta_1,\beta_2} \frac{1}{\vert N\vert} \sum_i[\rho_{\tau_k}\left\{y_i - z_i^T\beta_1 - \delta_i(s)z_i^T\beta_2\right\}I\left\{x_i\in N\right\}]\,.\]So $\beta_1+\beta_2$ gives the coefficients for the left child node, while $\beta_1$ alone gives those for the right child node.

#### Step 3

Propose the first splitting criterion, extended from the concept of RSS, in the form,

\[\Delta_N(s) = \sum_{k=1}^K\omega(\tau_k) \left\{ \frac{\hat L_N(0, \tau_k) - \hat L_N(s, \tau_k)}{\hat L_N(0, \tau_k)} \right\}\,,\]where $\hat L_N(s,\tau_k) = L_N(s, \tau_k, \hat\beta_N(\tau_k, s))$. The term $\omega(\tau_k)$ is a predefined weight function that reflects the relative importance across quantile levels.
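A minimal numerical sketch of this criterion (my own illustration, not from the slides), in the simplified case where $z$ contains only an intercept, so each node-wise quantile fit reduces to an empirical quantile:

```python
import numpy as np

def check_loss(u, tau):
    return u * (tau - (u < 0))

def node_loss(y, tau):
    # With z = intercept only, the quantile fit at a node is the empirical tau-quantile.
    return check_loss(y - np.quantile(y, tau), tau).mean()

def delta(y, x, cut, taus, weights):
    """Loss-reduction criterion Delta_N(s) for the candidate split x <= cut."""
    left, right = y[x <= cut], y[x > cut]
    total = 0.0
    for tau, w in zip(taus, weights):
        l0 = node_loss(y, tau)                              # unsplit loss L_N(0, tau)
        ls = (len(left) * node_loss(left, tau)
              + len(right) * node_loss(right, tau)) / len(y)  # split loss L_N(s, tau)
        total += w * (l0 - ls) / l0
    return total

rng = np.random.default_rng(1)
x = rng.uniform(size=500)
y = np.where(x <= 0.5, rng.normal(0, 1, 500), rng.normal(3, 1, 500))

K = 5
taus = [(k + 1) / (K + 1) for k in range(K)]
weights = [1.0 / K] * K
# The true change point at x = 0.5 should score higher than an off-target cut.
print(delta(y, x, 0.5, taus, weights), delta(y, x, 0.9, taus, weights))
```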

#### Step 4

Choose the optimal split by maximizing $\Delta_N(s)$, set

\[s_N^1 = \argmax_{s} \Delta_N(s)\,.\]

### Splitting Criterion II

Let

\[\hat \beta_N(\tau_k, 0) = \argmin_\beta L_N(0, \tau_k, \beta)\,.\]Denote that

\[\hat R_i = \sum_{k=1}^K I\left\{ y_i-z_i^T\hat \beta_N(\tau_k, 0) \le 0\right\}\,,\]which takes values between $0$ and $K$ and identifies the rank of $y_i$ with respect to the estimated conditional quantile process $z_i^T\hat\beta_N(\tau_k, 0)$ (a process because it varies over the quantile levels $\tau_k$).

A contingency table can be constructed based on $\hat R_i$ for each split $s$.
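A small sketch of computing $\hat R_i$ and the contingency table for one candidate split (my own illustration, again simplified so that $z$ is an intercept only and the fitted node quantiles are empirical quantiles):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=200)
x = rng.uniform(size=200)

K = 4
taus = [(k + 1) / (K + 1) for k in range(K)]
qs = np.quantile(y, taus)              # stands in for z_i^T beta_hat_N(tau_k, 0)

# R_i as written in the formula: number of fitted quantiles at or above y_i.
# Note it takes K + 1 possible values (0..K), one per quantile bin.
R = (y[:, None] <= qs[None, :]).sum(axis=1)

# 2 x (K+1) contingency table of rank bin vs. side of the candidate split x <= 0.5.
left = x <= 0.5
table = np.array([[np.sum((R == r) & side) for r in range(K + 1)]
                  for side in (left, ~left)])
print(table)
```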

### Comparing the Splitting Criteria

*(?? confused about the two comparison figures in the slides; they are not reproduced here)*

### Algorithm for Conditional Quantile Random Forest

Let $(y_i, x_i, z_i), i=1,\ldots, n_t$ be the training data.

#### Step 1: Sub-sampling the training data

randomly draw a sub-sample of size $m$ ($m\le n_t$) from the training data without replacement *(note: not a bootstrap sample!)*

#### Step 2: Grow a tree

based on the sub-sample, at each split:

- randomly draw $p^*\le p$ out of the $p$ predictors $x$ as potential splitting variables ($p^*$ is a pre-specified number)
- select the optimal splitting variable using rank score test statistics
- select an optimal cut-off value of the splitting variable using the proposed criterion

#### Step 3: Assemble a random forest

repeat Steps 1 and 2 $B$ times and collect the resulting $B$ trees into a random forest
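The algorithm can be sketched as the following skeleton. This is my own illustration: a plain variance-reduction score stands in for the quantile-based criterion and the rank score test, and the helper names (`grow_tree`, `grow_forest`) are not from the slides.

```python
import numpy as np

def grow_tree(y, x, idx, p_star, rng, min_node=20, criterion=None):
    """Grow one tree on the index set `idx`; `criterion` scores a candidate split.
    Variance reduction stands in for the quantile-based splitting criterion."""
    if criterion is None:
        def criterion(yn, col, cut):
            left, right = yn[col <= cut], yn[col > cut]
            if len(left) == 0 or len(right) == 0:
                return -np.inf
            return yn.var() - (len(left) * left.var() + len(right) * right.var()) / len(yn)
    if len(idx) < min_node:
        return {"leaf": idx}
    cand = rng.choice(x.shape[1], size=p_star, replace=False)  # Step 2: p* of p predictors
    best = None
    for j in cand:
        col = x[idx, j]
        for cut in np.quantile(col, [0.25, 0.5, 0.75]):        # coarse cut grid (a sketch)
            score = criterion(y[idx], col, cut)
            if best is None or score > best[0]:
                best = (score, j, cut)
    _, j, cut = best
    mask = x[idx, j] <= cut
    if mask.all() or not mask.any():
        return {"leaf": idx}
    return {"var": int(j), "cut": float(cut),
            "left": grow_tree(y, x, idx[mask], p_star, rng, min_node, criterion),
            "right": grow_tree(y, x, idx[~mask], p_star, rng, min_node, criterion)}

def grow_forest(y, x, B, m, p_star, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(B):                                          # Step 3: assemble B trees
        sub = rng.choice(len(y), size=m, replace=False)         # Step 1: sub-sample, no replacement
        forest.append(grow_tree(y, x, sub, p_star, rng))
    return forest

rng = np.random.default_rng(4)
x = rng.uniform(size=(300, 3))
y = np.where(x[:, 0] <= 0.5, 0.0, 3.0) + rng.normal(0, 0.5, 300)
forest = grow_forest(y, x, B=5, m=200, p_star=2)
```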

## The Applications of CQRF

### Applications

- Prediction: for a new patient with covariate profile $(x^*, z^*)$
  - first estimate the individualized conditional quantile effects $\beta(\tau, x^*)$ based on the constructed CQRF
  - then construct the estimated conditional quantile process ${z^*}^T\hat\beta(\tau, x^*)$
  - from the constructed conditional quantile function we can obtain the prediction mean, prediction median, a prediction interval, and a predicted risk assessment
- Feature selection: evaluate and rank the importance of the splitting variables
- Precision medicine: assess individualized treatment effects

### Estimating Algorithm of $\beta(\tau, x^*)$

#### Step 1

drop the vector $x^*$ down each tree in the forest $\cT$, and let $N_b(x^*)$ be the terminal node of tree $T_b$ in which $x^*$ lands

#### Step 2

for each observation in the training sample, define a tree-based weight

\[\omega_i(x^*, b) = \frac{I(x_i\in N_b(x^*))}{\vert N_b(x^*)\vert}\]and aggregate $\omega_i(x^*, b)$ over the entire forest by

\[\omega_i(x^*) = \frac 1B \sum_{b=1}^B \omega_i(x^*, b)\,,\]where $\omega_i(x^*)$ measures how important the $i$-th observation is in estimating $\beta(\tau, x^*)$.
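A small sketch of the weight aggregation (my own illustration), using hypothetical leaf memberships $N_b(x^*)$ for a toy forest of $B=3$ trees over $n=6$ training observations:

```python
import numpy as np

n, B = 6, 3
# Hypothetical leaf memberships: for each tree b, the training indices that fall
# in the same terminal node N_b(x*) as the query point x*.
leaves = [np.array([0, 1, 2]), np.array([1, 2]), np.array([2, 3, 4])]

w = np.zeros(n)
for N_b in leaves:
    w[N_b] += 1.0 / len(N_b)   # per-tree weight I(x_i in N_b(x*)) / |N_b(x*)|
w /= B                          # aggregate over the forest

# The weights sum to one, and observation 2 (which shares a leaf with x* in
# every tree) gets the largest weight.
print(w, w.sum())
```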

#### Step 3

construct weighted quantile regression objective function by

\[L_{\cT, \tau_t} (\beta, x^*) = \sum_{i=1}^n \omega_i(x^*)\rho_{\tau_t}(y_i-z_i^T\beta), t=1,\ldots, t_n\,,\]and estimate $\beta(\tau_t, x^*)$ on a sequence of quantile levels $\tau_t = t/(t_n+1)$ by

\[\hat \beta_{\cT, \tau_t}(x^*) = \argmin_\beta L_{\cT, \tau_t}(\beta, x^*)\,.\]

#### Step 4

construct the coefficient function $\hat \beta_\cT(\tau, x^*)$ by a natural linear spline over the grid estimates $\hat\beta_{\cT,\tau_t}(x^*)$:

\[\hat\beta_\cT(\tau, x^*) = \begin{cases} \hat \beta_{\cT, \tau_1}(x^*) & \tau < \tau_1\\ \hat \beta_{\cT, \tau_{t_n}}(x^*) & \tau > \tau_{t_n}\\ \hat \beta_{\cT, \tau_t}(x^*) + \frac{\hat \beta_{\cT, \tau_{t+1}}(x^*) - \hat \beta_{\cT,\tau_t}(x^*)}{1/(t_n+1)} \left(\tau - \tau_t\right) & \tau_t \le \tau \le \tau_{t+1},\ t = \lfloor \tau (t_n+1)\rfloor \end{cases}\]

### Theoretical Properties

Under certain conditions, for fixed $x^*$, \(\sup_{\tau \in[1/(t_n+1), t_n/(t_n+1)]} \Vert \hat\beta_{\cT}(\tau, x^*)-\beta(\tau, x^*)\Vert = o_p(1)\) as $n\rightarrow \infty$.

### Conditional Quantile Based Prediction

Conditional quantile process of $y$ given $(x^*, z^*)$:

\[\hat Q_y(\tau, x^*, z^*) = (z^*)^T\hat \beta_{\cT}(\tau, x^*), \quad \tau\in (0, 1)\]

- Prediction mean: $E_\cT(Y\mid x^*, z^*) = \int_0^1 {z^*}^T\hat\beta_\cT(u, x^*)\,du$

Note that $EX=\int x\,p(x)\,dx = \int x\,dF(x) = \int_0^1 Q(\alpha)\,d\alpha$, by the substitution $x = Q(\alpha)$, $\alpha = F(x)$.

- Prediction median: $\hat Q_y(0.5, x^*, z^*)$
- The $100(1-\alpha)\%$ prediction interval: $\left[\hat Q_y(\alpha/2, x^*, z^*),\ \hat Q_y(1-\alpha/2, x^*, z^*)\right]$
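The identity behind the prediction mean can be checked numerically (my own sketch): averaging the quantile function over an even grid of levels approximates the mean.

```python
import numpy as np

# Numerical check of E[X] = integral_0^1 Q(alpha) d alpha: averaging the
# empirical quantile function over a uniform grid of levels recovers the mean.
rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100000)

t_n = 999
taus = np.arange(1, t_n + 1) / (t_n + 1)
mean_from_quantiles = np.quantile(x, taus).mean()

print(mean_from_quantiles, x.mean())  # both close to 2.0
```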

### Conditional Quantile Based Risk Assessment

### REACH Result

Compare prediction accuracy among the following approaches

- Marginal Quantile RF: treating both $(z, x)$ as splitting variables
- Linear Regression with LASSO: using linear regression, regress one-month PTSD against $(x, z)$, and use a LASSO penalty to select the predictors.
- Random Forest: build the classical mean based random forest using both $x$ and $z$ as splitting variables. The prediction interval is constructed based on predicted mean and standard deviation
- Quantile Regression Forest: The prediction interval is based on the empirical distribution.

### Conclusion for CQRF

CQRF is a robust and efficient approach for improving screening and intervention strategies.

- it complements mean-based approaches and fully takes population heterogeneity into account
- the use of conditional quantile regression at each split also provides a convenient way to incorporate domain knowledge and improve prediction accuracy *(?? why bother with $\beta$? directly feeding $x, z$ into the forest could also use the domain knowledge)*
- incorporating the treatment assignment allows directly estimating individualized treatment effects for precision care *(?? no related material was presented)*

## Ongoing work and future extensions

- explore applications in gene-environment interactions, and develop a new splitting criterion for large-scale but highly correlated genetics data
- the approach could be extended to a longitudinal outcome, where the PTSD score is measured at multiple time points *(any details?!)*
- the approach could be extended to a survival outcome *(any details?)*

## Q & A

Here are the questions and the speaker's answers; only keywords were recorded.

- Q: How do you deal with missing data?
- A: For a missing response, delete the observation directly. For missing covariates, use surrogates.
- Q: What if the model form is misspecified?
- A: The non-parametric form is actually quite flexible.
- Q: Are there theories about variable selection, such as accuracy bounds?
- A: No.
- Q: Have you compared with other machine learning methods?
- A: Not enough comparisons yet.