XGBoost for IPF Biomarker
use Shapley values to explain the decisions of an ensemble learning model trained to classify samples as either pulmonary fibrosis or steady state based on the expression values of deregulated genes
- the process resulted in a full and a laconic set of features capable of separating phenotypes at least as well as previously published marker sets
- indicatively, a maximum increase of 6% in specificity and 5% in Matthews correlation coefficient was achieved
- evaluation on an additional independent dataset showed the proposed feature set to have greater generalization potential than the rest
- ultimately, the proposed gene lists are expected not only to serve as new sets of diagnostic markers, but also as a target pool for future research initiatives
Currently, as SARS-CoV-2 infection has been suggested to stimulate the expression of pro-fibrotic targets and interstitial lung disease patients face an increased risk of poor COVID-19 outcomes, proposing a robust set of disease biomarkers and potential new targets has become all the more crucial.
- Fibromine: a collection of manually curated and consistently processed IPF-related omics datasets
- ensemble learning to separate IPF from steady-state samples based on expression data from Fibromine
- Shapley (Shapley Additive exPlanations; SHAP) values were used to explain model decisions and rank/select 76 features as the most diagnostically valuable.
- text mining and functional characterization of the selected genes revealed both well-established features and ones only scarcely researched in the IPF context.
- different ranking aggregation methods were used to integrate results across multiple models and select a shorter, lite version of the 76 Shapley-prioritized features.
- compare the proposed lists with previously published ones (e.g. via correlation analysis)
Methods
Semantic similarity prioritization
- fetch consensus IPF_vs_Ctrl differentially expressed genes
- top 100 up and top 100 down-regulated genes were isolated
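The up/down selection above can be sketched as follows; the gene names and fold changes here are hypothetical stand-ins, as the real consensus table comes from Fibromine:

```python
# Sketch: isolating the top up- and top down-regulated genes from a
# consensus IPF_vs_Ctrl differential-expression table.
# Gene names and log2 fold changes below are illustrative only.
consensus = {  # gene -> consensus log2 fold change (IPF vs Ctrl)
    "MMP7": 3.1, "COL1A1": 2.4, "SFTPC": -2.8, "AGER": -1.9, "SPP1": 2.9,
}

# Rank genes by fold change, descending.
ranked = sorted(consensus, key=consensus.get, reverse=True)

n = 2  # the study isolated the top 100 in each direction
top_up = ranked[:n]     # most up-regulated
top_down = ranked[-n:]  # most down-regulated
```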
Preprocessing of Fibromine gene expression data
common genes across all aforementioned datasets were isolated and then intersected with the 200 genes prioritized by semantic similarity.
A final list of 172 features was used to train the machine learning models.
Assessment of sex-specific feature expression
hierarchical clustering of samples using Euclidean distance and complete linkage was performed.
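A minimal sketch of that clustering step with SciPy; the random matrix stands in for the normalized expression matrix of candidate sex-linked features:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Placeholder data: 20 samples x 5 features. The real input would be
# the normalized Fibromine expression matrix restricted to the
# features being checked for sex-specific expression.
rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 5))

# Euclidean distance with complete linkage, as stated in the Methods.
Z = linkage(expr, method="complete", metric="euclidean")

# Cutting the dendrogram into two groups lets one check whether the
# split tracks sample sex rather than phenotype.
clusters = fcluster(Z, t=2, criterion="maxclust")
```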
Machine learning models tuning, training and evaluation
- XGBoost
- trained/tested using a Monte Carlo cross validation (MCCV) approach with a 75:25 train:test split, iterated ten times to account for variability across data splits
- F1 score and Matthews correlation coefficient (MCC) as evaluation metrics
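The MCCV loop above can be sketched as below. Synthetic data stands in for the 172-gene expression matrix, and sklearn's `GradientBoostingClassifier` is used as a dependency-free stand-in for XGBoost (the API usage is analogous):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the IPF vs control expression matrix.
X, y = make_classification(n_samples=150, n_features=20, random_state=0)

f1s, mccs = [], []
for seed in range(10):  # ten Monte Carlo splits
    # 75:25 train:test split with a fresh random state each iteration.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    f1s.append(f1_score(y_te, pred))
    mccs.append(matthews_corrcoef(y_te, pred))

# Performance is summarized across the ten splits.
mean_f1, mean_mcc = np.mean(f1s), np.mean(mccs)
```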
Comparison of biomarker lists
XGBoost was employed to compare the capability of different biomarker lists for phenotype separation.
Machine learning model explanation
Game theory inspired Shapley values: https://github.com/slundberg/shap
SHAP values reflect the contribution of each feature to the final model prediction: all possible feature combinations (coalitions) are considered, built up one feature at a time, and the marginal contributions of each feature are averaged in a weighted manner.
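The weighted coalition averaging described above can be made concrete with an exact from-scratch Shapley computation (the SHAP library's TreeExplainer computes these efficiently for tree ensembles; this brute-force version is just for illustration on a toy value function):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: for each feature, average its marginal
    contribution over every coalition S of the other features,
    weighted by |S|! * (n - |S| - 1)! / n!."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value_fn(set(S) | {f}) - value_fn(set(S)))
        phi[f] = total
    return phi

# Toy additive "model": the prediction is the sum of per-feature
# effects, so each feature's Shapley value equals its own effect.
# Gene names and effects are hypothetical.
effects = {"geneA": 0.5, "geneB": -0.2, "geneC": 0.1}
value = lambda coalition: sum(effects[g] for g in coalition)
phi = shapley_values(list(effects), value)
```

For an additive model the decomposition is exact, which is a quick sanity check that the weighting scheme is implemented correctly.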
Pathway analysis
clusterProfiler
Text mining
PubMed 2022 baseline was used to form an abstract corpus.
Ranking aggregation methods
SHAP-weighted majority voting
two well established competitors:
- MAIC
- BIRRA
the results of the above three methods were compared by means of Kendall correlation
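A sketch of the SHAP-weighted majority voting idea and the Kendall comparison; the per-model SHAP weights are illustrative, and MAIC/BIRRA are external methods not re-implemented here:

```python
from scipy.stats import kendalltau

# One dict per XGBoost model: gene -> mean |SHAP| importance.
# Values below are made up for illustration.
per_model_shap = [
    {"MMP7": 0.9, "SPP1": 0.4, "AGER": 0.2},
    {"MMP7": 0.7, "SPP1": 0.6, "AGER": 0.1},
    {"MMP7": 0.8, "SPP1": 0.3, "AGER": 0.5},
]

# SHAP-weighted majority voting: each model votes for a gene with a
# weight equal to its SHAP importance; genes are re-ranked by the sum.
votes = {}
for model_shap in per_model_shap:
    for gene, weight in model_shap.items():
        votes[gene] = votes.get(gene, 0.0) + weight
aggregated = sorted(votes, key=votes.get, reverse=True)

# Agreement between two aggregated orderings (e.g. voting vs MAIC,
# expressed as rank vectors) can then be quantified with Kendall's tau.
tau, _ = kendalltau([1, 2, 3], [1, 3, 2])
```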
Correlation analysis
?
normalized gene expression values of selected machine learning prioritized features
Results
XGBoost separates IPF from control individuals
SHAP values-based selection and characterization of ML used genes
why a different number of genomic features??
SHAP-based ranking aggregation of prioritized features
To obtain a single robust set of important features sharing information across the ten XGBoost models, the SHAP-ordered lists were integrated using three ranking aggregation methods.