# Conditional Quantile Regression Forests

##### Posted on Dec 12, 2019

This note is based on the slides of the seminar, Dr. ZHU, Huichen. Conditional Quantile Random Forest.

## Motivation

### REactions to Acute Care and Hospitalization (REACH) study

• patients who suffer from acute coronary syndrome (ACS, 急性冠心症) are at high risk for many adverse outcomes, including recurrent cardiac (心脏的；心脏病的；心脏病患者) events, re-hospitalizations, major mental disorders, and mortality. There is an urgent need for post-discharge (出院后) chronic (长期的，慢性的) care for ACS patients.
• developing accurate predictive models for those adverse events are the first step in response to such need.
• The Columbia University Medical Center has established an on-going cohort since 2013, called the REactions to Acute Care and Hospitalization (REACH)
• Includes rich and comprehensive electronic medical records (EMRs) of 1744 ACS patients
• Provides a unique and ideal platform to develop high-dimensional statistical predictive analysis for ACS patients.

### Scientific/Clinical Objectives

• one of the adverse event is post-traumatic stress disorder (PTSD), which leads to a much higher risk of recurrent cardiovascular (心血管的) events and mortality.
• about 20% of the patients in REACH cohort achieved a higher score based on a PTSD checklist within one month after emergency room (ER) discharge
• goal: to develop a statistical approach to enhance the prediction of PTSD at the time of discharge

### Traditional Approaches to Predict Continuous Outcome

1. Linear Regression
• with a set of pre-specified key predictors
• with variable screening and selection to include a large number of potential predictors
• LASSO
• elastic net
2. Machine learning approach
• random forest
• support vector machine
• linear discriminant analysis
• k-means clustering

### Quantile Regression

• Conditional quantile: $Q_\tau(Y\mid X)=\inf\{y:\Pr(Y\le y\mid X)\ge \alpha\}$
• Quantile regression (Koenker & Bassett Jr, 1978) is an extension of traditional linear regression
• Quantile regression allows us to study the impact of predictors on different quantiles of response distribution, and thus provides a complete picture of relationship between $Y$ and $X$.

### Combine the Domain Knowledge

• Theoretical mental health for ACS survivor indicates that PTSD due to a medical event is often driven by the fear of death and recurrence
• The impact of the fear also depends on individuals
• REACH study have assessed the fear level using multiple measures during emergency room (ER) visits
• Adopting such domain knowledge into the data-driven machine learning approach is potentially another direction to improve the prediction

(??Is that fair for the previous compared methods do not combine such domain knowledge)

### Existing Works

Only a few previous work involving quantile related splitting criterion or prediction in regression tree or random forest.

• Chaudhuri and Loh (2002) and Bhat et al. (2015) proposed to partition the sample using conditional quantile loss functions at a fixed quantile level
• Meinshausen (2006) proposed to use empirical functions for predictions, but individual tree is still based on CART

## Proposed Methods

### A Non-parametric Interactive Quantile Model for PTSD prediction

Assume a non-parametric interactive quantile model

where

• $y_i$: individual post-ACS PTSD scores
• $z_i$: key predictors (level of fear during ER visit, PTSD score at baseline)
• $x_i$: a $p$-dimensional patient profile from EMR
• $\tau\in (0,1)$: the quantile level

### Splitting Criterion

• For any recursive partition approach, a splitting criterion is the most crucial part.
• Classical regression tree/forest choose the optimal split that leads to the greatest reduction in Residual Sum of Squares (RSS).
• It partitions the sample into subsets with distinctive means.

### Conditional Quantile-based Splitting Criterion I

#### Step 1

First determine a $K$ that $\tau_k = k/(K+1),k=1,2,\ldots,K$. Build a quantile regression model for each $\tau_k$ at node $N$ as

where $\beta_N(\tau_k, 0)$ is the quantile coefficient at quantile level $\tau_k$. (?? confused about the notation, why take 0 as the second argument, got the idea in the following context). Estimate each $\beta_N(\tau_k, 0)$ by

#### Step 2

Re-model the data in $N$ following the split $s$ as

where $\beta_N(\tau, s) = (\beta_1(\tau, s)^T, \beta_2(\tau, s)^T)^T$ represents the quantile coefficients following the partition $s$, and

Estimate each $\beta_N(\tau_k, s)$ by

So, it seems that $\beta_1+\beta_2$ for the left child node, while $\beta_1$ for the right child node.

#### Step 3

Propose the first splitting criterion, extended from the concept of RSS, in the form,

where $\hat L_N(s,\tau_k) = L_N(s, \tau_k, \hat\beta(\tau_k, s))$. The term $\omega(\tau_k)$ is a predefined weight function and reflects the relative importance across quantile levels.

#### Step 4

Choose the optimal split by maximizing $\Delta_N(s)$, set

### Splitting Criterion II

Let

Denote that

which ranges between 1 and $K$ and identifies the rank of $y_i$ with respect to the estimated conditional quantile process (?? due to $k$?) $z_i^T\hat\beta(\tau_k, 0)$.

A contingency table can be constructed based on $\hat R_i$ for each split $s$.

### Comparing the Splitting Criteria

confused about the above two figures.

### Algorithm for Conditional Quantile Random Forest

Let $(y_i, x_i, z_i), i=1,\ldots, n_t$ be a training data.

#### Step 1: Sub-sampling the training data

randomly draw a sub-sample ($m, m\le n_t$ out of $n_t$ samples without replacement) from the training data (not a bootstrap sample!？)

#### Step 2: Grow a tree

based on the sub-sample, at each split

1. randomly draw $p^*\le p$ out of $p$ predictors $x$ as the potential splitting variables ($p^*$ is pre-specified number)
2. select the optimal splitting variable using the rank score test statistics
3. select an optimal cut-off value of the splitting variable using the proposed criterion

## The Applications of CQRF

### Applications

• Prediction: a new patient with covariates profile $(x^*, z^*)$
• first estimate the individualized conditional quantile effects $\beta(\tau, x^*)$ based on the constructed CQRF
• construct the estimated conditional quantile process ${z^*}^T\hat\beta(\tau, x^*)$
• based on the constructed conditional quantile function, we could obtain: prediction mean, prediction median, prediction interval and predicted risk assessment
• Feature selection: evaluate and rank the importance of the splitting variables
• Precision medicine: assessing individualized treatment effect

### Estimating Algorithm of $\beta(\tau, x^*)$

#### Step 1

drop the vector $x^*$ into each of the tree in the forest $\cT$, and let $N_b(x^*)$ be the terminal node in the tree $T_b$ where $x^*$ lands in

#### Step 2

for each observation in the training sample, define a tree-based weight

and aggregate $\omega_i(x^*, b)$ over the entire forest by

where $\omega_i(x^*)$ measures how important the $i$-th observation is in estimating $\beta(\tau, x^*)$.

#### Step 3

construct weighted quantile regression objective function by

and estimate $\beta(\tau_t, x^*)$ on a sequence of quantile levels $\tau_t = t/(t_n+1)$ by

#### Step 4

construct the coefficient function $\hat \beta_\cT(\tau, x^*)$ by a nature linear spline over $\hat\beta_{\cT,\tau_t}(x^*)$.

### Theoretical Properties

Under certain conditions, for fixed $x^*$, $\sup_{\tau \in[1/(t_n+1), t_n/(t_n+1)]} \Vert \hat\beta_{\cT}(\tau, x^*)-\beta(\tau, x^*)\Vert = o_p(1)$ as $n\rightarrow \infty$.

### Conditional Quantile Based Prediction

Conditional quantile process of $y$ given $(x^*, z^*)$:

• Prediction mean: $E_\cT(Y\mid x^*, z^*) = \int_0^1 {z^*}^T\hat\beta_\cT(u, x^*)du$

Note that $EX=\int xp(x)dx = \int xdF(x) = \int Q_\alpha dF(Q_\alpha) = \int Q_\alpha d\alpha$.

• Prediction median: $\hat Q_y(0.5, x^*, z^*)$
• The $100(1-\alpha)\%$ prediction interval

### REACH Result

Compare prediction accuracy among the following approaches

• Marginal Quantile RF: treating both $(z, x)$ as splitting variables
• Linear Regression with LASSO: using linear regression, regress one-month PTSD against $(x, z)$, and use a LASSO penalty to select the predictors.
• Random Forest: build the classical mean based random forest using both $x$ and $z$ as splitting variables. The prediction interval is constructed based on predicted mean and standard deviation
• Quantile Regression Forest: The prediction interval is based on the empirical distribution.

### Conclusion for CQRF

a robust and efficient approach for improving the screening and intervention strategies.

• it complements the mean-based approaches and fully takes the population heterogeneity into account.
• the use of conditional quantile regression at each split also provides a convenient way to incorporate domain knowledge to improve prediction accuracy (why bother use $\beta$, directly input $x, z$ into the forest can also combine the domain knowledge)
• incorporate with treatment assignment allow to directly estimate individualized treatment effect for precision care (?? not say related materials)

## Ongoing work and future extensions

• explore its applications in gene-environment interactions, and develop a new splitting criterion for large-scale but highly correlated genetics data
• the approach could be extended to a longitudinal outcome, where PTSD score might be measured at different time points. (any details?!)
• the approach could be extended to a survival outcome (any details)

## Q & A

Here are the questions and the speaker’s answers, only keywords are recorded.

• Q: How to deal with missing data?
• A: For missed response, directly delete. For covariates, surrogate.
• Q: What if misspecified model in the form?
• A: The non-parametric actually is quite flexible.
• Q: Theories about variable selection, such as some accuracy bounds?
• A: No.
• Q: Compared with other machine methods?
• A: Not enough.

Published in categories Note