# Gibbs Sampler for Finding Motif

##### Posted on Dec 10, 2018 (Update: Dec 25, 2019)
Tags: Gibbs

This post is the online version of my report for the Project 2 of STAT 5050 taught by Prof. Wei.

## Posterior distribution

Now we have a total of $N$ DNA sequences, $\R=(R_1,\ldots,R_N)$ with common length $L$. Let $\bftheta_0=(\theta_{01},\theta_{02},\theta_{03},\theta_{04})$ be the probability vector describing the residue frequencies outside a motif, or say these residues follow a common multinomial model with parameter $\bftheta_0$, where the number 1, 2, 3, 4 represent four types of nucleotide, A, G, C, T for simplicity. Let $\bftheta_j,j=1,\ldots,W$ represent the frequency of each base at the $j$-th position of the motif, and $\bfTheta = [\bftheta_1,\ldots,\bftheta_W]$. By treating the alignment data $A$ as missing data, we can write the complete-data likelihood of the parameters as

where the notation is adopted from Liu et al. (1995). Specifically, ${\A}=\{(k, A_k+j-1):k=1,\ldots,N,j=1,\ldots,W\}$ represent the set of residue indices occupied by the motif elements with alignment variable $A$, in which $A_k$ is the starting position of the element of the $k$-th sequence. $\R_{{\A}} = \{r_{k,A_k+j-1}:k=1,\ldots,N;j=1,\ldots,W\}$ denote the collection of the residues, and $\A(j) = \{(k,A_k+j-1):k=1,\ldots,N\}$ is the set of the $j$-th positions of all elements. Also, $\h$ is the counting function, which counts how many of each type (A, G, C, T) in a DNA sequence. For general vectors $\v = (v_1,\ldots,v_p)^T$ and $\bftheta = (\theta_1,\ldots,\theta_p)^T$, we define $\bftheta^\v =\theta_1^{v_1}\cdots\theta_p^{v_p}$.

## Collapsing $\bfTheta$

Let the prior of $\bftheta_0$, $f(\bftheta_0)$ be a Dirichlet distribution with parameter $\bfalpha = (\alpha_1,\ldots,\alpha_4)$, and let the prior of $\bfTheta$, $g(\bfTheta)$, be a product Dirichlet distribution with parameter $\bfbeta = (\bfbeta_1,\ldots,\bfbeta_W)$ and $\bfbeta_j=(\beta_{1j}, \ldots,\beta_{4j})^T$. Then by using the Bayes theorem and integrating out the $\bfTheta$,

Then we have $\bftheta_0 \mid \A,\R \sim \mathrm{Dir}(\h(\R_{\{\A\}^c})+\bfalpha)\,.$

Let $\A_{[-k]}$ denote the set ${(l,a_l),l\neq k}$, then

where the equality is due to

and the fact that the components of $\h(r_{k,A_k+j-1})$ only take 0 or 1. Then

where

## Shift Mode

Liu (1994) discussed the shift mode problem, which means that the Gibbs sampler will not enable a global change for all random components simultaneously. He proposed the following transition to encourage a global shifting.

For any integer $\delta$, let $\A+\delta = (A_1+\delta, \ldots,A_N+\delta)$. If one of the $A_k$ equals 1, we assume that $\A-1\equiv \A$; if one of the $A_k$ equals $L-W+1$, then $\A+1=\A$. These two cases can be treated as the termination of shifting. Suppose that the current state is $\A^{(k)}=\A$; first choose $\delta=+1$ or $-1$ with probability 1/2 each, and compute

where

and $n_{jA}(\R)$ is the number of counts of nucleotide type $A$ among the $(A_k+j-1)$-th base of sequence $R_k$ for all $k=1,\ldots,N$, then update $\A^{(k+1)}=\A^{(k)}+\delta$ with probability $\min{p, 1}$, otherwise set $\A^{(k+1)}=\A^{(k)}$.

## Gibbs sampler with Shifting The implementation can be found in szcf-weiya/Gibbs-Motif.

## Estimation

### $\widehat \A$

Use the samples’ posterior mode $\widehat\A$ as the estimator of $\A$, and get the following sequence logo by submitting the result to http://weblogo.berkeley.edu/logo.cgi. ### $\widehat\bftheta_0$

Use the sample’s posterior mean $\widehat\bftheta_0$ as the estimator of $\bftheta_0$, and we have $\bftheta_0 = [0.208, 0.307, 0.314, 0.171].$

### $\widehat\bfTheta$

Base on the estimator $\widehat \A$, calculate the frequencies of each type of nucleotide on base $j, j=1,\ldots,W$, which can be the estimator of $\hat\bftheta_j$, and hence $\hat\bfTheta = [\hat\bftheta_1,\ldots,\hat\bftheta_W]$.