Deep Learning boosts biology research
Biology researchers are increasingly using Deep Learning models to advance two rapidly growing fields of biology: genomics and proteomics.
- Genomics is a discipline of modern biology that studies how an organism, an organ, a cancer, etc. functions at the scale of the whole genome, rather than at the scale of a single gene. The genome is the complete genetic material of an organism. Studying the genome makes it possible to improve diagnostics, to detect predispositions to disease, and to develop treatments based on each individual's genetic information, thereby advancing personalized medicine.
- Proteomics is the science that studies proteomes: all the proteins of a cell, tissue, organ or organism. Proteomics could help unravel the mystery of giant viruses and discover new drugs.
As evidence of the growing importance of Deep Learning in biology, MIT offers a course, “Computational Systems Biology: Deep Learning in the Life Sciences”, led by Manolis Kellis, which shows how Deep Learning algorithms can be used effectively in the life sciences and compares them with traditional research methods in the field.
Machine Learning and Genomics
DeepVariant to predict variants
Many models have been developed in the field of genomics in recent years, based on CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GAN (Generative Adversarial Network) and Autoencoder (AE) architectures: DeepTarget, DeepChrome, DeepVariant.
DeepVariant, identifying variants via a convolutional neural network (CNN)
DeepVariant, developed by Google researchers in 2017, is an example of the use of Deep Learning in genomics. Using a CNN architecture, the model identifies variations relative to an individual’s reference genome, such as mutations or polymorphisms, from sequencing data.
The term “mutation” refers to any change in the DNA sequence at the gene or chromosome level, without prejudging its pathogenicity; mutations are also called “variants”. The consequence of a mutation depends on its functional effect, which can be neutral, improve a function (diversity, evolution) or alter a function (pathogenic effect).
Instead of directly using the nucleotides of the sequenced DNA fragments (the symbols A, C, G, T), the Google researchers converted the sequences into images and then applied convolutional neural networks (CNNs) to those images.
- Model inputs: three alternative alleles ‘A’, ‘ATATTT’, ‘ATATTTT’, plus the reference allele ‘AT’.
- Model outputs: DeepVariant generates examples for all possible combinations of two different alleles (6 combinations).
- Analysis: from the model’s predictions, the most likely alleles at this location are the reference allele ‘AT’ and the allele ‘ATATTT’.
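The pairing step above can be sketched in a few lines. This is an illustrative enumeration only, using the alleles from the example, not DeepVariant’s actual code:

```python
from itertools import combinations

# Alleles from the example: the reference plus three alternatives.
reference = "AT"
alternatives = ["A", "ATATTT", "ATATTTT"]
alleles = [reference] + alternatives

# A diploid genotype hypothesis is an unordered pair of distinct alleles;
# with 4 candidate alleles there are C(4, 2) = 6 such pairs to score.
genotype_pairs = list(combinations(alleles, 2))

for pair in genotype_pairs:
    print(pair)

print(len(genotype_pairs))  # 6
```

Each pair is then scored by the CNN from the pileup image, and the highest-scoring pair gives the called genotype.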
In September 2020, Google released DeepVariant 1.0, which improves on the performance of earlier versions. DeepVariant can be trained directly on Google Cloud.
In a cross-sectional study published in JAMA of 2,367 prostate cancer and melanoma patients in the United States and Europe, DeepVariant 1.0 found pathogenic variants in 14% more people compared to previous state-of-the-art methods.
Machine Learning and Proteomics
BERTology to discover protein structure
In June 2020, researchers at Salesforce Research published a paper, “BERTology Meets Biology: Interpreting Attention in Protein Language Models”, which demonstrates the use of the Natural Language Processing model BERT for protein structure analysis. BERTology makes it possible to study the three levels of protein structure:
- The primary structure: the amino acid sequence.
- The secondary structure: local protein shapes (alpha helix, beta sheet).
- The tertiary structure: spatial folding (3D structure, contact between amino acids, binding sites).
It is this unique native three-dimensional structure that gives proteins their biological properties, and these properties make it possible to design new drugs.
The idea is to feed the model an amino acid sequence and have it predict the missing elements of the sequence, just as BERT does with the words of a sentence. The model can also determine whether amino acids are in close contact and where the binding sites are located.
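The masking step of this BERT-style training can be sketched as follows. The protein sequence, mask rate and masking logic are simplified and illustrative (BERT’s full recipe also substitutes random tokens); a real protein language model would then be trained to recover the hidden residues:

```python
import random

# Illustrative amino acid sequence, treated as a list of tokens.
sequence = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")

def mask_sequence(tokens, mask_rate=0.15, seed=0):
    """Hide a fraction of residues behind a [MASK] token, BERT-style."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original residue the model must recover
    for i in range(len(masked)):
        if rng.random() < mask_rate:
            targets[i] = masked[i]
            masked[i] = "[MASK]"
    return masked, targets

masked, targets = mask_sequence(sequence)
print(" ".join(masked))
print(targets)
```

The model receives `masked` as input and is scored on how well it predicts the residues stored in `targets`, exactly as BERT is scored on hidden words.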
AlphaFold 2, a revolution in the field of biology
At the end of December 2020, DeepMind announced that its AlphaFold 2 algorithm had solved one of the most important problems in biology, open for 50 years: protein folding.
There is a large but seemingly finite number of protein folds observed in nature: about 1,400, depending on the classification methods and databases used.
Antibodies, for example, have a Y-shaped structure that lets them attach to pathogens and trigger an immune response, while Alzheimer’s and Parkinson’s disease are thought to be related to proteins not folding into the right configuration.
Many proteins function as receptors: the protein is activated when a protein of complementary shape binds to it. This mechanism is taken into account in the design of many drugs, as shown in this video.
The CASP (Critical Assessment of protein Structure Prediction) experiments aim to establish the current state of knowledge in protein structure prediction. A committee chooses proteins whose amino acid sequence is known.
On the one hand, experimenters determine the structure of the proteins using X-rays; on the other hand, researchers predict the structures with algorithms.
To evaluate the results of the competition, the experimental structures (in green) are compared with the algorithms’ predictions (in blue). Competitors’ performance is measured with the GDT (Global Distance Test) score.
In 2018 and 2020, DeepMind’s algorithms achieved remarkable results. Over the entire competition, across all categories, AlphaFold 2’s median GDT score is 92.4 (specialists consider the protein folding problem solved at GDT > 90). On average, the error in amino acid placement is 1.6 angstroms, barely more than the size of an atom.
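The GDT score can be sketched numerically. This is a simplified version of the GDT_TS variant (it assumes the two structures are already superimposed and skips CASP’s search over superpositions); the coordinates are toy values:

```python
import numpy as np

def gdt_ts(pred, true):
    """Simplified GDT Total Score: the mean, over the thresholds 1, 2, 4
    and 8 angstroms, of the percentage of residues whose predicted position
    lies within that distance of the experimental position."""
    dists = np.linalg.norm(pred - true, axis=1)
    return np.mean([np.mean(dists <= t) for t in (1.0, 2.0, 4.0, 8.0)]) * 100

# Toy example: 4 residues with errors of 0.5, 1.5, 3.0 and 9.0 angstroms.
true = np.zeros((4, 3))
pred = np.array([[0.5, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [9.0, 0, 0]])
print(gdt_ts(pred, true))  # 56.25
```

A perfect prediction scores 100; the 9.0 Å residue here misses every threshold, which is why the score drops well below the 90 mark that specialists treat as “solved”.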
DeepMind divided the problem into two steps:
- Step 1: create a distance matrix from an amino acid sequence.
- Step 2: reconstruct the protein structure from the obtained distance matrix (via gradient descent).
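Step 2 can be sketched with a toy optimization. This is an illustrative reconstruction under simplified assumptions (5 points in 3D, a target matrix built from known coordinates, plain gradient descent starting from a noisy guess), not DeepMind’s actual optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "true" structure: 5 points in 3D. Its pairwise distance matrix stands
# in for the matrix produced in step 1.
true = rng.normal(size=(5, 3))
target = np.linalg.norm(true[:, None] - true[None, :], axis=-1)

# Step 2 sketch: recover coordinates whose pairwise distances match the
# target, by gradient descent on the squared distance error.
x = true + 0.3 * rng.normal(size=(5, 3))  # noisy initial guess
lr = 0.01
for _ in range(5000):
    diff = x[:, None] - x[None, :]            # pairwise difference vectors
    d = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(d, 1.0)                  # avoid division by zero
    err = d - target
    np.fill_diagonal(err, 0.0)
    grad = 4.0 * np.sum((err / d)[..., None] * diff, axis=1)
    x -= lr * grad

d_final = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
print(float(np.max(np.abs(d_final - target))))  # residual, close to zero
```

The recovered coordinates may be translated, rotated or mirrored relative to the originals, since distances are unchanged by those transformations; what matters is that the distance matrix is reproduced.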
In AlphaFold (2018), step 1 was performed by a CNN (a convolutional neural network of 220 blocks); in the more powerful AlphaFold 2, step 1 is performed by attention-based neural networks.
Attention is all you need
BERTology and AlphaFold 2 are both based on neural networks with attention mechanisms (Transformers), which NLP models such as GPT-3 and BERT use to capture, for example, the link between a pronoun and the noun it refers to in a sentence being translated.
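The core of this mechanism, the scaled dot-product attention from “Attention Is All You Need”, fits in a few lines. The dimensions and random inputs below are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each output row is a weighted average
    of the rows of V, with weights given by how strongly the corresponding
    query matches each key."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

# Toy example: 3 tokens (words, or amino acid residues) with
# 4-dimensional embeddings.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, weights = attention(Q, K, V)
print(weights.sum(axis=1))  # each row of weights sums to 1
```

In BERT the tokens are words and the weights link, say, a pronoun to its noun; in a protein language model the tokens are residues and the same weights can reflect which amino acids are in contact.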
AlphaFold 2 : from the competition to the fight against COVID-19
In 2020, DeepMind teams used AlphaFold to predict the structure of proteins associated with SARS-CoV-2, the virus that causes Covid-19.
This analysis of protein structure is essential for understanding the evolution (mutations) of the SARS-CoV-2 virus and for developing vaccines, as explained by Etienne Decroly, research director at the CNRS, in the following video.
Appendix
More on DeepVariant
More on BERTology
More on AlphaFold 2