WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Consistent Probabilities along GO Structure

Posted on
Tags: Shape-constrained, Smoothing Splines, Penalized Splines

This note is for Obozinski, Guillaume, Gert Lanckriet, Charles Grant, Michael I. Jordan, and William Stafford Noble. “Consistent Probabilistic Outputs for Protein Function Prediction.” Genome Biology 9, Suppl 1 (2008): S6.

the paper focuses on methods for calibrating and combining independent predictions to obtain a set of probabilistic predictions that are consistent with the topology of the ontology.

consider 11 distinct reconciliation methods

  • 3 heuristic methods
  • 4 variants of a Bayesian network
  • 1 extension of logistic regression to the structured case
  • 1 isotonic regression
  • 2 variants of a KL projection method

evaluate each method in three different modes (per term, per protein, and joint), and also consider the interpretability of the reconciled outputs

it is important to assess whether interpretability comes at a cost in terms of precision and recall.

  • many apparently reasonable reconciliation methods yield reconciled probabilities with significantly lower precision than the original, unreconciled estimates.
  • isotonic regression usually performs better than the underlying, unreconciled method, and almost never performs worse
  • an exception is the high precision regime for joint evaluation, where KL projection yields the best performance.

to measure the accuracy, consider

  • precision: measures the fraction of predictions made that are correct
  • recall: measures the fraction of the correct answers that are predicted
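
a toy sketch (mine, not from the paper) of computing these two quantities from binary prediction vectors:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """precision: fraction of predictions made that are correct;
    recall: fraction of the correct answers that are predicted."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)              # true positives
    precision = tp / max(y_pred.sum(), 1)
    recall = tp / max(y_true.sum(), 1)
    return float(precision), float(recall)

# 4 predictions, 3 of them correct, 5 true annotations in total
print(precision_recall([1, 1, 1, 0, 1, 1], [1, 1, 1, 1, 0, 0]))  # (0.75, 0.6)
```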

three prediction tasks and three corresponding modes of evaluation

  • per protein: e.g., a developmental biologist has determined a few genes that are activated by a particular regulator, and the biologist wants to understand which biological process is regulated and how it relates to the phenotype observed. Given a certain protein, a prediction for its function is needed.
  • per term: e.g., a drug designer has determined which biological process is involved in a pathological condition and is now looking for a drug target in that pathway. Given a function, a prediction for which proteins have that function is needed.
  • joint annotation: e.g., a bioinformatician is annotating a new genome and wants to guarantee a high level of accuracy in the predictions made. Given protein-function pairs, correct associations have to be predicted.

consider separately the three ontologies that comprise GO

  • biological process
  • cellular component
  • molecular function

subdivide predictions into four groups on the basis of the number of proteins assigned to a GO term

Reconciliation methods

generally, the problem of reconciliation arises in any structured prediction problem in which predictions for a set of interdependent labels are made independently

all methods except the Bayesian networks and the cascaded logistic regression take as input the logistic regression estimates of the posterior probability for GO term $i$:

\[\hat p_i = P(Y_i = 1\mid X_i = x_i)\]

where

  • $Y_i$: a binary variable indicating whether the protein has the function corresponding to GO term $i$
  • $x_i$: contains the outputs of 35 SVMs trained for term $i$

the Bayesian networks explicitly model the likelihood $P(X_i=x_i\mid Y_i = y_i)$

the cascaded logistic regression uses individual regressions that estimate $P(Y_i\mid Y_{\pi_i}, X_i=x_i)$ or $P(Y_i\mid Y_{c_i}, X_i=x_i)$, where $\pi_i$ and $c_i$ are the sets of parents and children of term $i$, respectively.
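
a rough sketch (not the authors' pipeline) of producing the unreconciled per-term estimates $\hat p_i$, with hypothetical data and scikit-learn's logistic regression standing in for the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# hypothetical data: n proteins, 35 SVM outputs per GO term, T terms
n, n_svm, T = 500, 35, 10
X = rng.normal(size=(T, n, n_svm))               # X[i]: SVM outputs for term i
Y = (rng.random(size=(T, n)) < 0.3).astype(int)  # Y[i]: binary labels for term i

# one independent logistic regression per term gives the unreconciled \hat p_i
p_hat = np.empty((T, n))
for i in range(T):
    clf = LogisticRegression(max_iter=1000).fit(X[i], Y[i])
    p_hat[i] = clf.predict_proba(X[i])[:, 1]
```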

Calibration

for all edges $i\rightarrow j$, the implication $(Y_j=1)\Rightarrow (Y_i=1)$ translates, in probabilistic terms, into $P(Y_i=0, Y_j=1)=0$.

The set of distributions satisfying the implication (or inclusion) constraints of the graph is then

\[\cP^\Rightarrow = \{P\in \cP\mid \forall (i, j)\in E, P(Y_i=0, Y_j=1) = 0\},\]

which is a subset of

\[\cP^\le = \{P\in \cP \mid \forall (i, j)\in E, p_j\le p_i\}\]

the distributions that factorize according to $G$ can be defined formally as

\[\cP^G = \left\{P\in \cP\mid P(Y=y) = \prod_{i\in I} P(Y_i=y_i\mid Y_{\pi_i} = y_{\pi_i}); \forall i, P(Y_i=1\mid Y_{\pi_i}=0) = 0\right\}\]

an element of $\cP^G$ is completely characterized by its set of conditional distributions $q_i=P(Y_i=1\mid Y_{\pi_i}=1)$. The marginal probabilities can then be computed immediately from the conditionals

\[p_i = P(Y_i=1) = q_ip_{\pi_i},\qquad p_{\pi_i} = P(Y_{\pi_i}=1) = \prod_{j\in A_i} q_j,\]

where $A_i$ denotes the set of ancestors of term $i$.
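
a minimal sketch of this recursion, assuming for simplicity a tree-structured ontology encoded as a parent map (toy example, not the paper's code):

```python
def marginals_from_conditionals(parent, q):
    """parent[i]: parent of term i (None for the root); q[i] = P(Y_i=1 | Y_{pi_i}=1).
    Returns the marginals p_i = q_i * p_{pi_i}, computed recursively from the root."""
    p = {}
    def marginal(i):
        if i not in p:
            p[i] = q[i] * (1.0 if parent[i] is None else marginal(parent[i]))
        return p[i]
    for i in parent:
        marginal(i)
    return p

# toy chain: root -> a -> b
parent = {"root": None, "a": "root", "b": "a"}
q = {"root": 0.9, "a": 0.5, "b": 0.4}
print(marginals_from_conditionals(parent, q))  # p_root=0.9, p_a=0.45, p_b=0.18 (up to floating point)
```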

For distributions in $\cP^G$, the entropy has a simple analytical expression:

\[H(P) = H(Y) = \sum_{i\in I} H(Y_i\mid Y_{\pi_i}) = \sum_{i\in I} H(Y_i\mid Y_{\pi_i}=1)P(Y_{\pi_i} = 1)=\sum_{i\in I} h(q_i) p_{\pi_i},\]

where $h$ is the binary entropy function; the middle step uses $H(Y_i\mid Y_{\pi_i}=0)=0$, since $P(Y_i=1\mid Y_{\pi_i}=0)=0$.
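
continuing the toy chain above, a quick numerical check of this expression, with $h$ the binary entropy in nats (my own sketch):

```python
import numpy as np

def binary_entropy(q):
    """h(q) in nats, with the convention h(0) = h(1) = 0"""
    q = np.clip(np.asarray(q, dtype=float), 1e-12, 1 - 1e-12)
    return -(q * np.log(q) + (1 - q) * np.log(1 - q))

# toy chain root -> a -> b: conditionals q_i and parent marginals p_{pi_i} (1 for the root)
q     = np.array([0.9, 0.5, 0.4])
p_par = np.array([1.0, 0.9, 0.45])
H = np.sum(binary_entropy(q) * p_par)   # H(P) = sum_i h(q_i) p_{pi_i}
print(H)
```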

consider also the graph $G^{-1}$ that inverts the parent-child relationships in $G$

Bayesian approach

Does protein $j$ have function $i$?

  • initially, a prior joint distribution for the binary variables $Y_i$ is chosen
  • assume that given $Y_i=y_i$, some evidence $x_i$ is observed independently for each GO term according to $p(x_i\mid Y_i=y_i) = L_i(y_i)$, $y_i\in\{0, 1\}$.
  • using Bayes’ rule, calculate $P(Y\mid X=x)$ as the calibrated value.

it is natural to choose the prior $P_0$ in $\cP^G$
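
a brute-force sketch of this calibration on a tiny toy tree, enumerating all joint configurations; the likelihood values $L_i(y_i)$ are hypothetical, and the paper's Bayesian-network machinery is replaced by direct enumeration:

```python
import itertools
import numpy as np

parent = {0: None, 1: 0, 2: 0}            # toy ontology: 0 -> 1, 0 -> 2
q = {0: 0.9, 1: 0.5, 2: 0.4}              # prior conditionals, so that P_0 lies in P^G
lik = {                                   # L_i(y_i) = p(x_i | Y_i = y_i), hypothetical values
    0: {0: 0.2, 1: 0.8},
    1: {0: 0.7, 1: 0.3},
    2: {0: 0.1, 1: 0.9},
}

def prior(y):
    """P_0(Y = y); zero whenever a child is on while its parent is off"""
    p = 1.0
    for i, pi in parent.items():
        parent_on = 1 if pi is None else y[pi]
        if parent_on == 0:
            p *= 0.0 if y[i] == 1 else 1.0
        else:
            p *= q[i] if y[i] == 1 else 1.0 - q[i]
    return p

# posterior P(Y = y | X = x) is proportional to P_0(y) * prod_i L_i(y_i)
configs = list(itertools.product([0, 1], repeat=len(parent)))
w = np.array([prior(y) * np.prod([lik[i][y[i]] for i in parent]) for y in configs])
w /= w.sum()
calibrated = {i: float(sum(w[k] for k, y in enumerate(configs) if y[i] == 1)) for i in parent}
print(calibrated)   # calibrated marginals, consistent with the hierarchy (child <= parent)
```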

Variational Inference in $\cP^G$

if the inference involves finding the marginals, say $m$, of a distribution $Q\in \cP$, where $\cP$ is some exponential family, then the unnormalized log-likelihood is a dot product $\langle \theta, \phi(y)\rangle$ such that $m = E[\phi(Y)]$

To the set $\cP$ there corresponds a set of possible marginals $\cM$. The variational inference problem can then be formulated as the optimization problem

\[\max_{m\in \cM} H(m) + \langle m, \theta(x)\rangle\]
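
a rough numerical sketch of this problem on a toy chain, re-parameterizing the marginals by the conditionals $q_i$ so that the entropy takes the closed form above; the generic solver and the values of $\theta$ are my own assumptions, not the paper's algorithm:

```python
import numpy as np
from scipy.optimize import minimize

parent = [-1, 0, 1]                          # toy chain: parent index, -1 for the root
theta = np.array([1.0, 0.2, -0.5])           # hypothetical evidence terms theta_i(x)

def h(q):
    q = np.clip(q, 1e-9, 1 - 1e-9)
    return -(q * np.log(q) + (1 - q) * np.log(1 - q))

def neg_objective(q):
    p = np.empty_like(q)
    for i, pi in enumerate(parent):          # marginals p_i = q_i * p_{pi_i}
        p[i] = q[i] * (1.0 if pi < 0 else p[pi])
    p_par = np.array([1.0 if pi < 0 else p[pi] for pi in parent])
    return -(np.sum(h(q) * p_par) + theta @ p)   # -(H(m) + <m, theta>)

res = minimize(neg_objective, x0=np.full(3, 0.5), bounds=[(0.0, 1.0)] * 3)
print(res.x)                                 # optimal conditionals q_i
```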

Cascaded logistic regression

weakness of the Bayesian approach: it requires the likelihood $p(x_i\mid Y_i=y_i)$ to be nearly correct, which is in general unlikely to be the case. The cascaded approach instead factorizes the posterior directly:

\[P(Y=y\mid X=x) = \prod_i P(Y_i = y_i\mid Y_{\pi_i} = y_{\pi_i}, X_i = x_i)\]

then consider a logistic regression for each conditional $P(Y_i = 1\mid Y_{\pi_i}=1, X_i=x_i)$

weakness: some training sets have very few negative examples
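
a sketch of the cascade on a toy chain with synthetic data (scikit-learn logistic regressions, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

parent = [-1, 0, 1]                                   # toy chain: root -> 1 -> 2
n, d = 1000, 5
X = rng.normal(size=(3, n, d))                        # X[i]: features for term i
Y = np.zeros((3, n), dtype=int)
Y[0] = rng.random(n) < 0.6
Y[1] = Y[0] & (rng.random(n) < 0.5)                   # a child can be on only if its parent is on
Y[2] = Y[1] & (rng.random(n) < 0.5)

# fit P(Y_i = 1 | Y_{pi_i} = 1, X_i) on the examples whose parent is positive
models = []
for i, pi in enumerate(parent):
    mask = np.ones(n, dtype=bool) if pi < 0 else Y[pi] == 1
    models.append(LogisticRegression(max_iter=1000).fit(X[i][mask], Y[i][mask]))

# chain the conditionals: p_i = q_i * p_{pi_i}, consistent with the hierarchy by construction
p = np.empty((3, n))
for i, pi in enumerate(parent):
    q_i = models[i].predict_proba(X[i])[:, 1]
    p[i] = q_i * (1.0 if pi < 0 else p[pi])
```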

Projection Methods

Isotonic regression

isotonic regression projects the unreconciled marginals onto $\cP^\le$ under the Euclidean distance; instead of the Euclidean distance, one can consider the KL divergence, which is more natural for probability distributions
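
a small sketch of both projections onto $\cP^\le$ on a toy DAG, using a generic constrained solver and one possible direction of the KL divergence (my own setup; the paper uses dedicated algorithms):

```python
import numpy as np
from scipy.optimize import minimize

edges = [(0, 1), (0, 2), (1, 3), (2, 3)]      # edge i -> j imposes p_j <= p_i
p_hat = np.array([0.6, 0.7, 0.3, 0.5])        # unreconciled estimates \hat p_i

def euclidean(p):                             # isotonic regression objective
    return np.sum((p - p_hat) ** 2)

def kl_bernoulli(p):                          # sum of KL divergences between Bernoulli marginals
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return np.sum(p_hat * np.log(p_hat / p) + (1 - p_hat) * np.log((1 - p_hat) / (1 - p)))

cons = [{"type": "ineq", "fun": lambda p, i=i, j=j: p[i] - p[j]} for i, j in edges]
for objective in (euclidean, kl_bernoulli):
    res = minimize(objective, x0=np.full(4, 0.5), bounds=[(1e-6, 1 - 1e-6)] * 4,
                   constraints=cons, method="SLSQP")
    print(objective.__name__, np.round(res.x, 3))   # reconciled marginals with p_j <= p_i
```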

Projections on $\cP^G$

for isotonic regression, even though we minimize a sum of KL divergences between pairs of marginals, we are not minimizing a KL-divergence between joint distributions.

Heuristic methods

  • Max: report the largest logistic regression (LR) value of self and all descendants: $p_i=\max_{j\in D_i}\hat p_j$
  • And: report the product of LR values of all ancestors and self: $p_i=\prod_{j\in A_i}\hat p_j$
  • Or: at least one of the descendant GO terms is “on”: $1-p_i=\prod_{j\in D_i}(1-\hat p_j)$
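
a compact sketch of the three heuristics on a toy chain, with the ancestor and descendant sets taken to include the term itself, matching the "self and ..." wording above:

```python
import numpy as np

p_hat = {"root": 0.6, "a": 0.7, "b": 0.3}                                    # unreconciled LR values
ancestors   = {"root": {"root"}, "a": {"a", "root"}, "b": {"b", "a", "root"}}
descendants = {"root": {"root", "a", "b"}, "a": {"a", "b"}, "b": {"b"}}

p_max = {i: max(p_hat[j] for j in descendants[i]) for i in p_hat}            # Max
p_and = {i: float(np.prod([p_hat[j] for j in ancestors[i]])) for i in p_hat} # And
p_or  = {i: 1 - float(np.prod([1 - p_hat[j] for j in descendants[i]])) for i in p_hat}  # Or
print(p_max)
print(p_and)
print(p_or)
```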

Published in categories Note