One of the major challenges in biology is to understand how the information encoded in the genome is translated into the observed characteristics of an organism. That is, we would like to have mappings between an organism’s genotype and its phenotype under defined environmental conditions. As living cells are the basis of all life forms, the first step is to understand how changes in DNA sequence affect a cell’s molecular composition. The abundance and state of central molecular components, such as DNA, RNA, proteins, and metabolites, can be measured with high coverage by high-throughput methods, and recently even at the single-cell level. However, finding causal relationships between sequence information and the composition of molecular components is very difficult due to the combinatorial complexity of sequence information, the incomplete coverage of the molecular components that can be measured (e.g. DNA, RNA, and proteins can take different modification states), and the corruption of data by biological and technical noise.
Our approach is to combine different data sets in a large convolutional network whose architecture is determined to a significant extent by physicochemical constraints. The input typically consists of sequence data, and the targets are functionally relevant molecular states, such as the binding affinities of transcription factors to DNA regions or the probability of finding a ribosome at a given position on an mRNA. To prevent overfitting, we typically rely on Bayesian sampling using Markov chain Monte Carlo (MCMC) methods. We also make heavy use of algorithmic developments from the field of deep learning. However, we emphasise that biological data typically contains highly structured noise, so the problem lies more in finding models that generalise well and less in capturing strong non-linearities.
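To make the ingredients above concrete, the following is a minimal sketch and not our actual model. It assumes a single convolutional filter that acts like a position weight matrix on one-hot encoded DNA, a thermodynamically motivated readout (the log of summed Boltzmann weights over binding positions), Gaussian noise and prior, and a plain random-walk Metropolis sampler. All function names, the readout, and the parameters `noise_sd`, `prior_sd`, and `step_sd` are illustrative assumptions.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an array of shape (len(seq), 4)."""
    idx = np.array([ALPHABET.index(c) for c in seq])
    return np.eye(4)[idx]

def conv_scores(x, w):
    """Slide a (k, 4) filter over an (L, 4) one-hot sequence.

    Returns one score per binding position (L - k + 1 values).
    """
    k = w.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    return np.einsum("pkc,kc->p", windows, w)

def predict(x, w):
    """Assumed readout: log of summed Boltzmann weights over all positions."""
    return np.log(np.sum(np.exp(conv_scores(x, w))))

def log_posterior(w, xs, ys, noise_sd=0.5, prior_sd=1.0):
    """Gaussian likelihood on the targets plus a Gaussian prior on the filter."""
    preds = np.array([predict(x, w) for x in xs])
    log_lik = -0.5 * np.sum((ys - preds) ** 2) / noise_sd ** 2
    log_prior = -0.5 * np.sum(w ** 2) / prior_sd ** 2
    return log_lik + log_prior

def metropolis(xs, ys, k=3, n_steps=2000, step_sd=0.05, seed=0):
    """Random-walk Metropolis sampling of the filter weights."""
    rng = np.random.default_rng(seed)
    w = np.zeros((k, 4))
    lp = log_posterior(w, xs, ys)
    samples = []
    for _ in range(n_steps):
        proposal = w + step_sd * rng.normal(size=w.shape)
        lp_prop = log_posterior(proposal, xs, ys)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < lp_prop - lp:
            w, lp = proposal, lp_prop
        samples.append(w.copy())
    return np.array(samples)
```

In this sketch, predictions for a new sequence would be averaged over the sampled filters instead of being taken from a single point estimate, which is how sampling guards against overfitting to structured noise. A realistic model would stack many such filters and would likely need a more efficient sampler than random-walk Metropolis.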