WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Biclustering on Gene Expression Data

Posted on 0 Comments
Tags: Biclustering

The note is based on Padilha, V. A., & Campello, R. J. G. B. (2017). A systematic comparative evaluation of biclustering techniques. BMC Bioinformatics, 18(1), 55.

Biclustering (simultaneous clustering of rows and columns) is quite popular and widely-used in gene expression data. (Simple googling with keywords “review biclustering” returns many papers on gene expression data)

The drawbacks of partitional or hierarchical clustering algorithm:

  • each gene or experimental condition is assigned to only one cluster
  • each gene or experimental condition must be assigned to some cluster

Greedy Algorithm:

  • Cheng and Church’s Algorithm (CCA): add/remove rows and columns to minimize the mean squared residue measure.
  • OPSM: search biclusters with rows following the same linear order under the same columns
  • xMOTIFs: nondeterministic algorithm for discrete data matrix
  • Iterative Signature Algorithm (ISA): starts with randomly selected genes and experimental conditions, evaluating and updating them through iterative steps until convergence.
  • Minimum Sum-Squared Residue Coclustering (MSSRCC): finding checkerboard patterns

  • Qualitative Biclustering: discretizes the input, and build a graph where each node corresponds to a gene, each edge has a weight equal to the number of experimental conditions for which two genes have the same nonzero integer values, then searches for biclusters corresponding to heavy subgraphs
  • Combinatorial Algorithm for Expression and Sequence-Based Cluster Extraction: initiate a bicluster with two maximally correlated genes across all experimental conditions
  • Correlated Pattern Biclusters (CPB): starts with randomly selected subsets of rows and columns, and searches for biclusters with highly correlated rows regarding some reference genes
  • Large Average Submatrices (LAS): search for submatrices by locally maximizing a Bonferroni-based significance score.

Divide-and-conquer algorithms:

Split a problem into smaller instances, solving each one recursively and combining their solutions into the solution for the original problem.

  • Binary Inclusion-Maximal Biclustering Algorithm (Bimax): recursively divides a binary matrix for searching submatrices formed with entries whose values ae all equal to one.

Exhaustive enumeration algorithms:

Assume that the best submatrices can only be identified by generating all possible row and column combinations of the dataset.

  • Statistical-Algorithmic Method for Bicluster Analysis (SAMBA): model the input datasets as a bipartite graph, where one set of nodes corresponds to the genes and the other is related to the experimental conditions
  • Bit-Pattern Biclustering Algorithm: search for maximal biclusters in binary datasets by applying the logical AND operator over all possible gene pairs.
  • Differentially Expressed Biclusters (DeBi): based on a frequent itemset approach

Distribution parameter identification algorithms:

Assume some statistical model related to the structure of the biclusters and then apply some iterative procedure to adapt its parameters.

  • Plaid: the expression value in a bicluster is the sum of the main effect $\mu$, the gene effect $\alpha_i$, the condition effect $\beta_j$, and the noise term \(y_{ij}=(\mu_0+\alpha_{i0}+\beta_{j0})+\sum_{k=1}^K(\mu_k+\alpha_{ik}+\beta_{jk})\rho_{ik}\kappa_{jk} + \varepsilon_{ij}\)
  • Bayesian BiClustering (BBC)
    • for a single bicluster, same model as in Plaid model
    • for multiple biclusters, constrain the overlapping of biclusters to only one direction: either gene $\sum_{k=1}^K\rho_{ik}\le 1$ or condition $\sum_{k=1}^K\kappa_{jk}\le 1$.

  • Spectral: use SVD to simultaneously cluster genes and experimental conditions to find checkerboard patterns
  • Factor Analysis for Bicluster Acquisition (FABIA): assume a multiplicative model, and uses a factor analysis approach

Published in categories Note