Deep Learning boosts biology research
Biology researchers are increasingly using Deep Learning models to advance two rapidly growing fields of biology: genomics and proteomics.
- Genomics is a discipline of modern biology that studies how an organism, an organ, a cancer, etc. functions at the scale of the whole genome, rather than at the scale of a single gene. The genome is the complete genetic material of an organism. Studying the genome makes it possible to improve diagnostics, to detect predispositions to disease, and to develop treatments based on each individual's genetic information, thereby advancing personalized medicine.
- Proteomics is the science that studies proteomes: all the proteins of a cell, tissue, organ or organism. Proteomics could help unravel the mystery of giant viruses and discover new drugs.
As evidence of the growing importance of Deep Learning in biology, MIT offers a course, “Computational Systems Biology: Deep Learning in the Life Sciences”, led by Manolis Kellis, which shows how Deep Learning algorithms can be used effectively in the life sciences and compares them with traditional research methods in the field.
Machine Learning and Genomics
DeepVariant to predict variants
Many models have been developed in the field of genomics in recent years, based on CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GAN (Generative Adversarial Network) and Autoencoder (AE) architectures: DeepTarget, DeepChrome, DeepVariant.
DeepVariant, identifying variants via a convolutional neural network (CNN)
DeepVariant, developed by Google researchers in 2017, is an example of the use of Deep Learning in genomics. Using a CNN architecture, the model identifies variations relative to an individual’s reference genome, such as mutations or polymorphisms, from sequencing data.
The term “mutation” refers to any change in the DNA sequence at the gene or chromosome level, without prejudging its pathogenicity; mutations are also called “variants”. The consequence of a mutation depends on its functional effect, which can be neutral, improve a function (diversity, evolution) or alter a function (pathogenic effect).
Instead of directly using the nucleotides of the sequenced DNA fragments (the symbols A, C, G, T), the Google researchers converted the sequences into images and then applied convolutional neural networks (CNNs) to those images.
- Model inputs: three alternative alleles ‘A’, ‘ATATTT’, ‘ATATTTT’, plus the reference allele ‘AT’.
- Model outputs: DeepVariant generates examples for all possible combinations of two different alleles (6 combinations).
- Analysis: from the model’s predictions, the most likely alleles at this location are the reference allele ‘AT’ and the allele ‘ATATTT’.
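The pairing step above can be sketched in a few lines. This is an illustrative enumeration only, using the alleles from the example, not DeepVariant’s actual code:

```python
from itertools import combinations

# Alleles from the example: the reference plus three alternatives.
reference = "AT"
alternatives = ["A", "ATATTT", "ATATTTT"]
alleles = [reference] + alternatives

# A diploid genotype hypothesis is an unordered pair of distinct alleles;
# with 4 candidate alleles there are C(4, 2) = 6 such pairs to score.
genotype_pairs = list(combinations(alleles, 2))

for pair in genotype_pairs:
    print(pair)

print(len(genotype_pairs))  # 6
```

Each pair is then scored by the CNN from the pileup image, and the highest-scoring pair gives the called genotype.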
In September 2020, Google released DeepVariant 1.0, which improves on the performance of earlier versions. DeepVariant can be trained directly on Google Cloud.
In a cross-sectional study published in JAMA of 2,367 prostate cancer and melanoma patients in the United States and Europe, DeepVariant 1.0 found pathogenic variants in 14% more people compared to previous state-of-the-art methods.
Machine Learning and Proteomics
BERTology to discover protein structure
In June 2020, researchers at Salesforce Research published a paper, “BERTology Meets Biology: Interpreting Attention in Protein Language Models”, which demonstrates the use of the Natural Language Processing model BERT for protein structure analysis. BERTology makes it possible to study the three levels of protein structure:
- The primary structure: the amino acid sequence.
- The secondary structure: local protein shapes (alpha helix, beta sheet).
- The tertiary structure: spatial folding (3D structure, contact between amino acids, binding sites).
It is this unique native three-dimensional structure that gives proteins their biological properties, and these properties make it possible to design new drugs.
The idea is to feed the model an amino acid sequence and have it predict the missing elements of the sequence, just as BERT does with the words of a sentence. The model can also determine whether amino acids are in close contact and where the binding sites are located.
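The masking step of this BERT-style training can be sketched as follows. The protein sequence, mask rate and masking logic are simplified and illustrative (BERT’s full recipe also substitutes random tokens); a real protein language model would then be trained to recover the hidden residues:

```python
import random

# Illustrative amino acid sequence, treated as a list of tokens.
sequence = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")

def mask_sequence(tokens, mask_rate=0.15, seed=0):
    """Hide a fraction of residues behind a [MASK] token, BERT-style."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original residue the model must recover
    for i in range(len(masked)):
        if rng.random() < mask_rate:
            targets[i] = masked[i]
            masked[i] = "[MASK]"
    return masked, targets

masked, targets = mask_sequence(sequence)
print(" ".join(masked))
print(targets)
```

The model receives `masked` as input and is scored on how well it predicts the residues stored in `targets`, exactly as BERT is scored on hidden words.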
AlphaFold 2, a revolution in the field of biology
At the end of December 2020, DeepMind announced that its AlphaFold 2 algorithm had solved one of the most important problems in biology, open for 50 years: protein folding.
There is a large but seemingly finite number of protein folds observed in nature: about 1,400, depending on the classification methods and databases used.
Antibodies, for example, have a Y-shaped structure that lets them attach to pathogens and trigger an immune response, while Alzheimer’s and Parkinson’s disease are thought to be related to proteins not folding into the right configuration.
Many proteins function as receptors: the protein is activated when a protein of complementary shape binds to it. This mechanism is taken into account in the design of many drugs, as shown in this video.
The CASP (Critical Assessment of protein Structure Prediction) experiments aim to establish the current state of knowledge in protein structure prediction. A committee chooses proteins whose amino acid sequence is known.
On the one hand, experimenters determine the structure of the proteins using X-rays; on the other hand, researchers predict the structures with algorithms.
To evaluate the results of the competition, the experimental structures (in green) are compared with the algorithms’ predictions (in blue). Competitors’ performance is measured with the GDT (Global Distance Test) score.
In 2018 and 2020, DeepMind’s algorithms achieved remarkable results. Over the entire competition, across all categories, AlphaFold 2’s median GDT score is 92.4 (specialists consider the protein folding problem solved at GDT > 90). On average, the error in amino acid placement is 1.6 angstroms, barely more than the size of an atom.
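The GDT score can be sketched numerically. This is a simplified version of the GDT_TS variant (it assumes the two structures are already superimposed and skips CASP’s search over superpositions); the coordinates are toy values:

```python
import numpy as np

def gdt_ts(pred, true):
    """Simplified GDT Total Score: the mean, over the thresholds 1, 2, 4
    and 8 angstroms, of the percentage of residues whose predicted position
    lies within that distance of the experimental position."""
    dists = np.linalg.norm(pred - true, axis=1)
    return np.mean([np.mean(dists <= t) for t in (1.0, 2.0, 4.0, 8.0)]) * 100

# Toy example: 4 residues with errors of 0.5, 1.5, 3.0 and 9.0 angstroms.
true = np.zeros((4, 3))
pred = np.array([[0.5, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [9.0, 0, 0]])
print(gdt_ts(pred, true))  # 56.25
```

A perfect prediction scores 100; the 9.0 Å residue here misses every threshold, which is why the score drops well below the 90 mark that specialists treat as “solved”.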
DeepMind divided the problem into two steps:
- Step 1: create a distance matrix from an amino acid sequence.
- Step 2: reconstruct the protein structure from the obtained distance matrix (via gradient descent).
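Step 2 can be sketched with a toy optimization. This is an illustrative reconstruction under simplified assumptions (5 points in 3D, a target matrix built from known coordinates, plain gradient descent starting from a noisy guess), not DeepMind’s actual optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "true" structure: 5 points in 3D. Its pairwise distance matrix stands
# in for the matrix produced in step 1.
true = rng.normal(size=(5, 3))
target = np.linalg.norm(true[:, None] - true[None, :], axis=-1)

# Step 2 sketch: recover coordinates whose pairwise distances match the
# target, by gradient descent on the squared distance error.
x = true + 0.3 * rng.normal(size=(5, 3))  # noisy initial guess
lr = 0.01
for _ in range(5000):
    diff = x[:, None] - x[None, :]            # pairwise difference vectors
    d = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(d, 1.0)                  # avoid division by zero
    err = d - target
    np.fill_diagonal(err, 0.0)
    grad = 4.0 * np.sum((err / d)[..., None] * diff, axis=1)
    x -= lr * grad

d_final = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
print(float(np.max(np.abs(d_final - target))))  # residual, close to zero
```

The recovered coordinates may be translated, rotated or mirrored relative to the originals, since distances are unchanged by those transformations; what matters is that the distance matrix is reproduced.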
In AlphaFold (2018), step 1 was performed by a CNN (a convolutional neural network of 220 blocks); in the more powerful AlphaFold 2, step 1 is performed by attention-based neural networks.
Attention is all you need
BERTology and AlphaFold 2 are both based on neural networks with attention mechanisms (Transformers), which NLP models such as GPT-3 and BERT use to capture, for example, the link between a pronoun and the noun it refers to in a sentence being translated.
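The core of this mechanism, the scaled dot-product attention from “Attention Is All You Need”, fits in a few lines. The dimensions and random inputs below are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each output row is a weighted average
    of the rows of V, with weights given by how strongly the corresponding
    query matches each key."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

# Toy example: 3 tokens (words, or amino acid residues) with
# 4-dimensional embeddings.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, weights = attention(Q, K, V)
print(weights.sum(axis=1))  # each row of weights sums to 1
```

In BERT the tokens are words and the weights link, say, a pronoun to its noun; in a protein language model the tokens are residues and the same weights can reflect which amino acids are in contact.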
AlphaFold 2 : from the competition to the fight against COVID-19
In 2020, DeepMind teams used AlphaFold to predict the structure of proteins associated with SARS-CoV-2, the virus that causes Covid-19.
This analysis of protein structure is essential for understanding the evolution (mutations) of the SARS-CoV-2 virus and for developing vaccines, as explained by Etienne Decroly, research director at the CNRS, in the following video.
Appendix
More on DeepVariant
More on BERTology
More on AlphaFold 2