Effective Gene Expression Prediction
Effective gene expression prediction from sequence by integrating long-range interactions
- how noncoding DNA determines gene expression in different cell types is a major unsolved problem
- the paper reports substantially improved gene expression prediction accuracy from DNA sequence via a deep learning architecture called Enformer
- Enformer can integrate information from long-range interactions (up to 100 kb away) in the genome
- Enformer learned to predict enhancer-promoter interactions directly from the DNA sequence, performing competitively with methods that take direct experimental data as input
Introduction
- increasing information flow between distal elements is a promising path to increase predictive accuracy
- the authors introduce a neural network architecture based on self-attention towards this goal
- they frame the machine learning problem as predicting thousands of epigenetic and transcriptional datasets in a multitask setting across long DNA sequences
Enformer attends to cell-type-specific enhancers
Methods
Model architecture
The Enformer architecture consists of three parts:
- 7 convolutional blocks with pooling
- 11 transformer blocks
- a cropping layer followed by final pointwise convolutions branching into 2 organism-specific network heads
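To make the three parts above concrete, the sketch below tracks how the sequence length changes through them for the 196,608 bp input these notes mention. It is only shape bookkeeping, not a model. Two numbers are assumptions that do not appear in these notes: that each of the 7 pooling steps halves the length, and that the cropping layer trims 320 positions from each side. The function and constant names are hypothetical.

```python
# Shape bookkeeping for the three architecture parts (no deep learning library).
# ASSUMPTIONS (not stated in these notes): every pooling step halves the
# sequence length, and the cropping layer removes 320 positions per side.

INPUT_BP = 196_608      # input length in base pairs
NUM_CONV_BLOCKS = 7     # convolutional blocks with pooling
POOL_FACTOR = 2         # assumed pooling factor per block
CROP_PER_SIDE = 320     # assumed number of positions cropped on each side


def trace_lengths(input_bp=INPUT_BP):
    """Return the sequence length after each of the three parts."""
    length = input_bp
    for _ in range(NUM_CONV_BLOCKS):
        length //= POOL_FACTOR          # pooling shortens the sequence
    after_conv = length

    # Self-attention mixes information across positions but keeps the length.
    after_transformer = after_conv

    # Cropping drops positions near the edges of the input window.
    after_crop = after_transformer - 2 * CROP_PER_SIDE

    return {
        "after_conv": after_conv,
        "after_transformer": after_transformer,
        "after_crop": after_crop,
        "bp_per_position": input_bp // after_conv,
    }


if __name__ == "__main__":
    for name, value in trace_lengths().items():
        print(f"{name}: {value}")
```

Under these assumptions each position entering the transformer blocks summarizes 128 bp, so attention runs over 1,536 positions rather than 196,608 bases, which is what makes attending across the whole window affordable.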
The model takes as input a one-hot-encoded DNA sequence of length 196,608 bp and predicts 5,313 genomic tracks for the human genome and 1,643 tracks for the mouse genome.
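As an illustration of the input format, here is a minimal pure-Python sketch of one-hot encoding a DNA string. The A, C, G, T channel order and the choice to encode unknown bases such as N as all zeros are assumptions made for this example, and `one_hot_encode` is a hypothetical helper, not a function from the Enformer code.

```python
# Minimal one-hot encoder for DNA, using plain Python lists.
# ASSUMPTIONS: channel order A, C, G, T; any other symbol (e.g. N) becomes
# an all-zero row. Neither choice is specified in these notes.

BASES = "ACGT"
BASE_TO_INDEX = {base: i for i, base in enumerate(BASES)}

HUMAN_TRACKS = 5313   # tracks predicted by the human head
MOUSE_TRACKS = 1643   # tracks predicted by the mouse head


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    encoded = []
    for base in sequence.upper():
        row = [0.0, 0.0, 0.0, 0.0]
        index = BASE_TO_INDEX.get(base)
        if index is not None:
            row[index] = 1.0        # mark the channel for this base
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    example = one_hot_encode("ACGTN")
    for base, row in zip("ACGTN", example):
        print(base, row)
```

A full-length input would be a sequence of 196,608 such rows, that is, a 196,608 x 4 array, with the human and mouse heads producing 5,313 and 1,643 output channels respectively.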