A pHMM Algorithm for Correcting Long Reads
Posted on 0 Comments
second or third generation sequencing platforms: trade-offs between accuracy and read length
to get long and accurate reads: combine both technologies and the errorneous long reads are corrected using the short reads
- current approaches rely on various graph or alignment based techniques and do not take the error profile of the underlying technilogy into account.
- ? efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of these two technologies
The paper proposed Hercules, the first (?) machine learning-based long read error correction algorithm.
- it models every long read as a profile HMM w.r.t. the underlying platform’s error profile.
- it learns a posterior transition/emission probability distribution for each long read to correct errors in these reads
Hight Throughput Sequencing (HTS) technologies suffer from two fundamental limitations
- no platform is yet able to generate a chromosome-long read. Average read length ranges from 100bp to 20kb depending on the platform.
- reads are not error-free.
Illumina: the most ubitiquous platform. produces the most accurate (~0.1% error rate), yet the shortest (100-150bp) reads. Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing technology is capable of producing > 10kb-long reads on average, though with a substantially higher (~15%) error rate Oxford Nanopore Technologies (ONT) platform: can generate longer reads (up to ~900kb), but their error rate is also higher (>15%)
The strengths and weaknesses of the platforms makes it attractive to combine them.
Two major categories for hybrid error correction methods
- align short reads onto long reads generated from the same sample, implemented by several tools. These algorithms fix the errors in long reads by calculating a consensus of the short reads over the same segment of the nucleic acid sequence.
- drawback: alignment based approaches are highly dependent on the performance of the aligner. Hence, accuracy, run time, and memory usage of the aligner will directly affect the performance of the downstream correction tool.
- align long reads over a de Bruijn graph constructed using short reads, and the $k$-mers on the de Bruijn graph that are connected with a long read are then merged into a new, corrected form of the long read.
- drawback: the resulting graph may contain bulges and tips, that are typically treated as errors and removed. Accurate removal of such graph elements relies on the availability of high coverage data to be able to confidently discriminate real edges from the ones caused by erroneous $k$-mers.
Herclues models each long and erroneous read as a template profile HMM. It uses short reads as observations to train the model via forward-backward algorithm, and learns posterior transition and emission probabilities. Finally, Hercules decodes the most probable sequence for each profile HMM using Viterbi algorithm.
main advantages of Hercules:
- reduce dependency on aligner performance - alignment-based methods are dependent on the aligner’s full CIGAR string for each basepair to correct. They perform majority voting to resolve discrepancies among short reads with the assumption that all error types are equally likely to happen during correction. - Hercules uses the starting positions obtained from an aligner, it does not depend on any other information provided by the aligner. It sequentially and probabilistically accounts for the evidence provided per short read, instead of just independently using majority voting per base-pair.
- directly incorporate experimentally observed error profiles of long reads that can be updated when the underlying sequencing technology improves, or adapted to new long read sequencing platforms.
Hercules correct errors (insertions, deletions, and substitutions) present in long read sequencing platforms such as PacBio SMRT and Oxford Nanopore Technologies, using reads from a more accurate orthogonal sequencing platform, such as Illumina.
Goal: calculate the most likely (consensus) sequence the model would produce, among all possible sequences that can be produced using the letters in the alphabet $\Sigma$.