WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Gibbs Sampler for Finding Motif

Posted on (Update: )
Tags: Gibbs

This post is the online version of my report for the Project 2 of STAT 5050 taught by Prof. Wei.

Posterior distribution

Now we have a total of $N$ DNA sequences, $\R=(R_1,\ldots,R_N)$ with common length $L$. Let $\bftheta_0=(\theta_{01},\theta_{02},\theta_{03},\theta_{04})$ be the probability vector describing the residue frequencies outside a motif, or say these residues follow a common multinomial model with parameter $\bftheta_0$, where the number 1, 2, 3, 4 represent four types of nucleotide, A, G, C, T for simplicity. Let $\bftheta_j,j=1,\ldots,W$ represent the frequency of each base at the $j$-th position of the motif, and $\bfTheta = [\bftheta_1,\ldots,\bftheta_W]$. By treating the alignment data $A$ as missing data, we can write the complete-data likelihood of the parameters as

where the notation is adopted from Liu et al. (1995). Specifically, ${\A}=\{(k, A_k+j-1):k=1,\ldots,N,j=1,\ldots,W\}$ represent the set of residue indices occupied by the motif elements with alignment variable $A$, in which $A_k$ is the starting position of the element of the $k$-th sequence. $\R_{{\A}} = \{r_{k,A_k+j-1}:k=1,\ldots,N;j=1,\ldots,W\}$ denote the collection of the residues, and $\A(j) = \{(k,A_k+j-1):k=1,\ldots,N\}$ is the set of the $j$-th positions of all elements. Also, $\h$ is the counting function, which counts how many of each type (A, G, C, T) in a DNA sequence. For general vectors $\v = (v_1,\ldots,v_p)^T$ and $\bftheta = (\theta_1,\ldots,\theta_p)^T$, we define $\bftheta^\v =\theta_1^{v_1}\cdots\theta_p^{v_p}$.

Collapsing $\bfTheta$

Let the prior of $\bftheta_0$, $f(\bftheta_0)$ be a Dirichlet distribution with parameter $\bfalpha = (\alpha_1,\ldots,\alpha_4)$, and let the prior of $\bfTheta$, $g(\bfTheta)$, be a product Dirichlet distribution with parameter $\bfbeta = (\bfbeta_1,\ldots,\bfbeta_W)$ and $\bfbeta_j=(\beta_{1j}, \ldots,\beta_{4j})^T$. Then by using the Bayes theorem and integrating out the $\bfTheta$,

Then we have

Let $\A_{[-k]}$ denote the set ${(l,a_l),l\neq k}$, then

where the equality is due to

and the fact that the components of $\h(r_{k,A_k+j-1})$ only take 0 or 1. Then

where

Shift Mode

Liu (1994) discussed the shift mode problem, which means that the Gibbs sampler will not enable a global change for all random components simultaneously. He proposed the following transition to encourage a global shifting.

For any integer $\delta$, let $\A+\delta = (A_1+\delta, \ldots,A_N+\delta)$. If one of the $A_k$ equals 1, we assume that $\A-1\equiv \A$; if one of the $A_k$ equals $L-W+1$, then $\A+1=\A$. These two cases can be treated as the termination of shifting. Suppose that the current state is $\A^{(k)}=\A$; first choose $\delta=+1$ or $-1$ with probability 1/2 each, and compute

where

and $n_{jA}(\R)$ is the number of counts of nucleotide type $A$ among the $(A_k+j-1)$-th base of sequence $R_k$ for all $k=1,\ldots,N$, then update $\A^{(k+1)}=\A^{(k)}+\delta$ with probability $\min{p, 1}$, otherwise set $\A^{(k+1)}=\A^{(k)}$.

Gibbs sampler with Shifting

The implementation can be found in szcf-weiya/Gibbs-Motif.

Estimation

$\widehat \A$

Use the samples’ posterior mode $\widehat\A$ as the estimator of $\A$, and get the following sequence logo by submitting the result to http://weblogo.berkeley.edu/logo.cgi.

$\widehat\bftheta_0$

Use the sample’s posterior mean $\widehat\bftheta_0$ as the estimator of $\bftheta_0$, and we have

$\widehat\bfTheta$

Base on the estimator $\widehat \A$, calculate the frequencies of each type of nucleotide on base $j, j=1,\ldots,W$, which can be the estimator of $\hat\bftheta_j$, and hence $\hat\bfTheta = [\hat\bftheta_1,\ldots,\hat\bftheta_W]$.

References


Published in categories Note