WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

XGBoost for IPF Biomarker

Posted on
Tags: XGBoost, IPF

This post is for Fanidis, D., Pezoulas, V. C., Fotiadis, D. Ι., & Aidinis, V. (2023). An explainable machine learning-driven proposal of pulmonary fibrosis biomarkers. Computational and Structural Biotechnology Journal, 21, 2305–2315.

Use Shapley values to explain the decisions of an ensemble learning model trained to classify samples as either pulmonary fibrosis or steady state, based on the expression values of deregulated genes.

  • the process resulted in a full and a laconic set of features that separate phenotypes at least as well as previously published marker sets.
  • indicatively, a maximum increase of 6% in specificity and 5% in Matthews correlation coefficient was achieved.
  • evaluation on an additional independent dataset showed that the proposed feature set has greater generalization potential than the other sets
  • ultimately, the proposed gene lists are expected not only to serve as new sets of diagnostic marker elements, but also as a target pool for future research initiatives

Currently, as SARS-CoV-2 infection has been suggested to stimulate the expression of pro-fibrotic targets and interstitial lung disease patients present an increased risk of poor COVID-19 outcome, the proposal of a robust set of disease biomarkers and potential new targets has become more crucial.

  • Fibromine: a collection of manually curated and consistently processed IPF-related omics datasets
  1. ensemble learning to classify IPF vs. steady-state samples according to expression data from Fibromine
  2. Shapley (Shapley Additive exPlanations; SHAP) values were used to explain model decisions and rank/select 76 features as the most diagnostically valuable.
  3. text mining and functional characterization of the selected genes revealed features that are well established in the IPF context as well as features that have been little studied there.
  4. different ranking aggregation methods were used to integrate results across multiple models and select a shorter, lite version of the 76 Shapley-prioritized features.
  5. compare with previously published biomarker lists (correlation analysis, etc.)


Semantics similarity prioritization

  • fetch consensus IPF_vs_Ctrl differentially expressed genes
  • top 100 up and top 100 down-regulated genes were isolated

Preprocessing of Fibromine gene expression data

Genes common to all aforementioned datasets were isolated and then intersected with the 200 genes prioritized by semantic similarity.

A final list of 172 features was used as the training set of machine learning models
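The intersection step can be sketched with plain Python sets. This is only an illustration: the dataset contents, the gene symbols and the helper names `common_genes` and `select_features` are made up for the example, not taken from the paper.

```python
from functools import reduce

def common_genes(datasets):
    """Return the genes present in every dataset (each dataset is an iterable of gene symbols)."""
    return reduce(set.intersection, (set(d) for d in datasets))

def select_features(datasets, prioritized):
    """Keep the prioritized genes that are measured in all datasets, preserving priority order."""
    shared = common_genes(datasets)
    return [g for g in prioritized if g in shared]

# Toy input (hypothetical): three datasets and a short prioritized list
datasets = [
    ["COL1A1", "MMP7", "SPP1", "AGER"],
    ["MMP7", "SPP1", "AGER", "CXCL14"],
    ["SPP1", "MMP7", "AGER", "COL1A1"],
]
prioritized = ["SPP1", "COL1A1", "MMP7", "CXCL14"]

features = select_features(datasets, prioritized)
print(features)  # ['SPP1', 'MMP7']
```

With the real data, the same logic would reduce the 200 prioritized genes to the 172 that are available in every dataset.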

Assessment of sex-specific feature expression

Hierarchical clustering of samples was performed using Euclidean distance and complete linkage.
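To recall what complete linkage does, here is a minimal pure-Python sketch of agglomerative clustering. The toy samples and the function names are assumptions for illustration; the naive search is far too slow for real expression matrices.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(samples, n_clusters):
    """Merge clusters bottom-up until n_clusters remain.

    Under complete linkage, the distance between two clusters is the
    largest pairwise distance between their members.
    """
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = max(euclidean(samples[i], samples[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy expression profiles (hypothetical): two obvious groups
samples = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
groups = complete_linkage(samples, n_clusters=2)
print(groups)  # [[0, 1], [2, 3]]
```

In practice one would call `scipy.cluster.hierarchy.linkage(X, method="complete", metric="euclidean")` instead of hand-rolling the loop.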


Machine learning models tuning, training and evaluation


  • XGBoost
  • trained/tested using a Monte Carlo cross validation (MCCV) approach with a 75:25 train:test split iterated ten times to account for differences during data splits
  • evaluated with the F1 score and the Matthews correlation coefficient (MCC)
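A minimal sketch of the evaluation scheme in the list above: repeated random 75:25 splits plus the two metrics computed from confusion counts. The function names, the seed and the toy labels are assumptions. Whether the paper stratified its splits is not stated in these notes, so the sketch uses plain random shuffling.

```python
import math
import random

def mccv_splits(n_samples, n_iter=10, test_fraction=0.25, seed=0):
    """Yield (train_idx, test_idx) for each Monte Carlo cross-validation iteration."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_iter):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random split on every iteration
        yield sorted(idx[n_test:]), sorted(idx[:n_test])

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(y_true, y_pred):
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc_score(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Ten 75:25 splits of 20 hypothetical samples
splits = list(mccv_splits(n_samples=20))

# Toy predictions (hypothetical) for one test fold
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
f1 = f1_score(y_true, y_pred)
mcc = mcc_score(y_true, y_pred)
print(f1, mcc)  # 0.75 0.5
```

In the real pipeline each `(train_idx, test_idx)` pair would be used to fit and score one XGBoost model, and the metrics would be summarized over the ten iterations.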

Comparison of biomarker lists

XGBoost was employed to compare the capability of different biomarker lists for phenotype separation.

Machine learning model explanation

Game theory inspired Shapley values: https://github.com/slundberg/shap

SHAP values reflect the contribution of each feature to the final prediction of the model. They are obtained by considering all possible feature combinations (coalitions), measuring the marginal contribution of the feature when it is added to each coalition, and averaging these marginal contributions in a weighted manner.
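The weighted average over coalitions can be made concrete with a brute-force computation on a toy "model". Everything here is illustrative: the feature names `g1`, `g2`, `g3` and the value function `toy_value` are made up. The shap package uses much faster algorithms for tree ensembles, so this is not how the paper's values were computed.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by enumerating every coalition.

    value(S) must return the model output when only the features in the
    frozenset S are "present". Cost grows as 2^n, so toy sizes only.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # weight of a coalition of size k: k! (n - k - 1)! / n!
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

def toy_value(s):
    """Hypothetical model: g1 adds 2, g2 adds 1, g1 and g2 together add 1 more."""
    out = 0.0
    if "g1" in s:
        out += 2.0
    if "g2" in s:
        out += 1.0
    if "g1" in s and "g2" in s:
        out += 1.0
    return out

phi = shapley_values(["g1", "g2", "g3"], toy_value)
print(phi)  # g1 = 2.5, g2 = 1.5, g3 = 0.0 (up to floating-point rounding)
```

The interaction term is split equally between `g1` and `g2`, the unused `g3` gets zero, and the values sum to the full-model output of 4.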

Pathway analysis


Text mining

PubMed 2022 baseline was used to form an abstract corpus.

Ranking aggregation methods

SHAP-weighted majority voting

two well established competitors:

  • MAIC

the results of the above three methods were compared by means of Kendall correlation
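For reference, a small sketch of Kendall's tau between two rankings of the same genes. It implements the tau-a variant and assumes no ties. Which variant the paper used is not recorded in these notes, and the gene labels are placeholders.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two orderings of the same items (no ties assumed)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # the pair is concordant if both rankings order x and y the same way
        s = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Two hypothetical aggregated rankings that differ by one swap
method_1 = ["g1", "g2", "g3", "g4"]
method_2 = ["g1", "g3", "g2", "g4"]
tau = kendall_tau(method_1, method_2)
print(round(tau, 3))  # 0.667
```

A value of 1 means identical orderings and -1 means exactly reversed orderings.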

Correlation analysis


normalized gene expression values of selected machine learning prioritized features


XGBoost separates IPF from control individuals

SHAP values-based selection and characterization of ML used genes

why a different number of genomic features??

SHAP-based ranking aggregation of prioritized features

To obtain a single robust set of important features that shares information across the ten XGBoost models, the SHAP-ordered lists were integrated using three ranking aggregation methods.
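To get a feel for what aggregating SHAP-ordered lists involves, here is a generic importance-weighted voting sketch. It is a stand-in and not necessarily the paper's SHAP-weighted majority voting: the normalization, the tie-breaking rule, the function name and the toy importances are all assumptions.

```python
from collections import defaultdict

def weighted_vote(model_importances, top_k=None):
    """Combine per-model importance dicts into one ranking.

    Each model's importances (e.g. mean |SHAP| per gene) are normalized to
    sum to 1, so every model casts the same total vote. Genes are ranked by
    summed vote, with ties broken alphabetically.
    """
    scores = defaultdict(float)
    for importances in model_importances:
        total = sum(importances.values())
        if not total:
            continue  # a model with no signal casts no vote
        for gene, value in importances.items():
            scores[gene] += value / total
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:top_k] if top_k else ranked

# Hypothetical importances from three models; the third never used g3
models = [
    {"g1": 0.5, "g2": 0.3, "g3": 0.2},
    {"g1": 0.4, "g3": 0.4, "g2": 0.2},
    {"g2": 0.6, "g1": 0.4},
]
consensus = weighted_vote(models)
print(consensus)  # ['g1', 'g2', 'g3']
```

A gene missing from a model's list simply receives no vote from that model, which is one simple way to handle lists of different lengths.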


Comparison with previous biomarker lists


Published in categories Note