Biclustering on Gene Expression Data
Posted on 0 Comments
The note is based on Padilha, V. A., & Campello, R. J. G. B. (2017). A systematic comparative evaluation of biclustering techniques. BMC Bioinformatics, 18(1), 55.
Biclustering (simultaneous clustering of rows and columns) is quite popular and widely-used in gene expression data. (Simple googling with keywords “review biclustering” returns many papers on gene expression data)
The drawbacks of partitional or hierarchical clustering algorithm:
- each gene or experimental condition is assigned to only one cluster
- each gene or experimental condition must be assigned to some cluster
Greedy Algorithm:
- Cheng and Church’s Algorithm (CCA): add/remove rows and columns to minimize the mean squared residue measure.
- OPSM: search biclusters with rows following the same linear order under the same columns
- xMOTIFs: nondeterministic algorithm for discrete data matrix
- Iterative Signature Algorithm (ISA): starts with randomly selected genes and experimental conditions, evaluating and updating them through iterative steps until convergence.
- Minimum Sum-Squared Residue Coclustering (MSSRCC): finding checkerboard patterns
- Qualitative Biclustering: discretizes the input, and build a graph where each node corresponds to a gene, each edge has a weight equal to the number of experimental conditions for which two genes have the same nonzero integer values, then searches for biclusters corresponding to heavy subgraphs
- Combinatorial Algorithm for Expression and Sequence-Based Cluster Extraction: initiate a bicluster with two maximally correlated genes across all experimental conditions
- Correlated Pattern Biclusters (CPB): starts with randomly selected subsets of rows and columns, and searches for biclusters with highly correlated rows regarding some reference genes
- Large Average Submatrices (LAS): search for submatrices by locally maximizing a Bonferroni-based significance score.
Divide-and-conquer algorithms:
Split a problem into smaller instances, solving each one recursively and combining their solutions into the solution for the original problem.
- Binary Inclusion-Maximal Biclustering Algorithm (Bimax): recursively divides a binary matrix for searching submatrices formed with entries whose values ae all equal to one.
Exhaustive enumeration algorithms:
Assume that the best submatrices can only be identified by generating all possible row and column combinations of the dataset.
- Statistical-Algorithmic Method for Bicluster Analysis (SAMBA): model the input datasets as a bipartite graph, where one set of nodes corresponds to the genes and the other is related to the experimental conditions
- Bit-Pattern Biclustering Algorithm: search for maximal biclusters in binary datasets by applying the logical AND operator over all possible gene pairs.
- Differentially Expressed Biclusters (DeBi): based on a frequent itemset approach
Distribution parameter identification algorithms:
Assume some statistical model related to the structure of the biclusters and then apply some iterative procedure to adapt its parameters.
- Plaid: the expression value in a bicluster is the sum of the main effect $\mu$, the gene effect $\alpha_i$, the condition effect $\beta_j$, and the noise term \(y_{ij}=(\mu_0+\alpha_{i0}+\beta_{j0})+\sum_{k=1}^K(\mu_k+\alpha_{ik}+\beta_{jk})\rho_{ik}\kappa_{jk} + \varepsilon_{ij}\)
- Bayesian BiClustering (BBC)
- for a single bicluster, same model as in Plaid model
- for multiple biclusters, constrain the overlapping of biclusters to only one direction: either gene $\sum_{k=1}^K\rho_{ik}\le 1$ or condition $\sum_{k=1}^K\kappa_{jk}\le 1$.
- Spectral: use SVD to simultaneously cluster genes and experimental conditions to find checkerboard patterns
- Factor Analysis for Bicluster Acquisition (FABIA): assume a multiplicative model, and uses a factor analysis approach