AI is able to identify the structure of proteins to create new drugs

Deep Learning boosts biology research

Biology researchers are increasingly using Deep Learning models to advance two rapidly growing fields of biology: genomics and proteomics.

As evidence of the growing importance of Deep Learning in biology, MIT offers a course, “Computational Systems Biology: Deep Learning in the Life Sciences”, led by Manolis Kellis, which shows how Deep Learning algorithms can be used effectively in the life sciences and compares them with traditional research methods in the field.

Machine Learning and Genomics

DeepVariant to predict variants

Many models have been developed in the field of genomics in recent years, among them DeepTarget, DeepChrome and DeepVariant. They are based on architectures such as CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), LSTMs (Long Short-Term Memory networks), GANs (Generative Adversarial Networks) and Autoencoders (AE).

DeepVariant, identifying variants via a convolutional neural network (CNN)
DeepVariant, developed by Google researchers in 2017, is an example of the use of Deep Learning in the field of genomics. Using a CNN architecture, the model analyzes sequencing data and identifies where an individual’s genome differs from the reference genome, such as mutations or polymorphisms.

The term “mutation” refers to any change in the DNA sequence, at the gene or chromosome level, without prejudging its pathogenicity. Mutations are also called “variants”. The consequence of a mutation depends on its functional effect, which can be neutral, improve a function (diversity, evolution) or impair a function (pathogenic effect).

Instead of directly using the nucleotides of the sequenced DNA fragments (in the form of the symbols A, C, G, T), the Google researchers converted the sequences into images and then applied convolutional neural networks (CNN) to these images.
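To make the idea of turning sequences into images more concrete, here is a minimal sketch of a pileup-style encoding. It is not DeepVariant's actual format, which is richer than what is shown here: the function name encode_pileup, the single-channel layout and the intensity assigned to each base are all assumptions made for illustration, and the reads are assumed to be already aligned to the reference without gaps.

```python
# Hypothetical, simplified pileup encoding; not DeepVariant's actual format.
# The intensity assigned to each base is an arbitrary choice for this sketch.
BASE_TO_INTENSITY = {"A": 250, "G": 180, "T": 100, "C": 30}


def encode_pileup(reference, reads):
    """Return a 2D grid of pixel intensities.

    Row 0 encodes the reference sequence; each following row encodes one read.
    `reads` is a list of (start_offset, sequence) pairs, assumed to be aligned
    to the reference without insertions or deletions.
    """
    width = len(reference)
    rows = [[BASE_TO_INTENSITY.get(base, 0) for base in reference]]
    for start, sequence in reads:
        row = [0] * width  # 0 means "no coverage at this position"
        for i, base in enumerate(sequence):
            pos = start + i
            if 0 <= pos < width:
                row[pos] = BASE_TO_INTENSITY.get(base, 0)
        rows.append(row)
    return rows


reference = "ACGTAC"
# The second read carries an A where the reference has a T (position 3).
reads = [(0, "ACGT"), (2, "GAAC")]
image = encode_pileup(reference, reads)

for row in image:
    print(row)
```

In such a grid, a variant shows up as a column where the read rows differ from the reference row, which is the kind of visual pattern a CNN can learn to recognize.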

In September 2020, Google unveiled DeepVariant 1.0, which improves on the performance of earlier versions. DeepVariant can be trained directly through Google Cloud.

In a cross-sectional study published in JAMA of 2,367 prostate cancer and melanoma patients in the United States and Europe, DeepVariant 1.0 found pathogenic variants in 14% more people than previous state-of-the-art methods.

Machine Learning and Proteomics

BERTology to discover protein structure

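In June 2020, researchers at Salesforce Research published a paper, “BERTology Meets Biology: Interpreting Attention in Protein Language Models”, that shows the use of the Natural Language Processing model BERT in protein structure analysis. BERTology makes it possible to study the three levels of protein structure: primary (the amino acid sequence), secondary (local motifs such as alpha helices and beta sheets) and tertiary (the overall three-dimensional fold).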

It is this unique native three-dimensional structure that gives proteins their biological properties. Understanding these properties helps researchers design new drugs.

The idea is to give the model an amino acid sequence as input and have it predict the missing elements of this sequence, as BERT does with the words of a sentence. The model also makes it possible to determine whether amino acids are in close contact and where the binding sites are located.
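The masking step described above can be sketched in a few lines. This is only an illustration of how a training example might be prepared, not the code used in the paper: the mask token name, the 15% masking rate and the example sequence are assumptions, and BERT's full masking scheme is more elaborate than simply hiding tokens.

```python
import random

MASK = "<mask>"  # placeholder token; the name is an assumption of this sketch


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Hide a random subset of residues and return (tokens, targets).

    `targets` maps each masked position to the residue that the model
    would be trained to recover at that position.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    n_masked = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets


# Arbitrary amino acid sequence in one-letter code, used only as an example.
original = "MKTAYIAKQRQISFVKSHFSRQ"
tokens, targets = mask_sequence(original)

print(" ".join(tokens))
print(targets)
```

During training, the model sees the tokens list and is scored on how well it recovers the residues stored in targets, exactly as BERT is scored on hidden words.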

AlphaFold 2, a revolution in the field of biology

At the end of November 2020, DeepMind announced that its AlphaFold 2 algorithm had solved one of the most important problems in biology, open for 50 years: protein folding.

There is a large, but seemingly finite, number of protein folds observed in nature: about 1,400, depending on the classification method and database used.

The shape of a protein determines its function. Antibodies, for example, have a Y-shaped structure that lets them attach to pathogens and trigger an immune response. Alzheimer’s and Parkinson’s diseases are thought to be linked to proteins that are not folded into the right configuration.

Many proteins function as receptors: the protein is activated when a protein with a complementary shape binds to it. This mechanism is taken into account in the design of many drugs, as shown in this video.

The CASP (Critical Assessment of protein Structure Prediction) experiments aim to establish the current state of knowledge in protein structure prediction. A committee chooses proteins whose amino acid sequence is known.

On the one hand, experimentalists determine the structure of the proteins in the laboratory, using methods such as X-ray crystallography. On the other hand, researchers predict the structure with algorithms.

To evaluate the results of the competition, the experimental results (in green) and the predictions of the algorithms (in blue) are compared. The performance of each competitor is measured with the GDT (Global Distance Test) score.
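As an illustration of what such a score measures, here is a minimal sketch of the commonly used GDT_TS variant: the average, over cutoffs of 1, 2, 4 and 8 angstroms, of the percentage of residues whose predicted position falls within the cutoff of the experimental position. The sketch assumes the two structures are already superimposed, whereas the real evaluation also searches for the best superposition. The function name and the toy coordinates are made up for the example.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS on a 0-100 scale.

    For each cutoff (in angstroms), compute the fraction of residues whose
    predicted position lies within that distance of the experimental one,
    then average the fractions. Coordinates are assumed to be already
    superimposed, which is a simplification of the real procedure.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy coordinates (one point per residue), invented for this example.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

score = gdt_ts(predicted, experimental)
print(score)  # 56.25 for these toy coordinates
```

A perfect prediction scores 100, and a score falls as more residues drift away from their experimental positions.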

CASP competition: comparison of protein structure

In 2018 and 2020, DeepMind’s algorithms achieved significant results. In the 2020 competition, across all categories, AlphaFold 2’s median GDT score was 92.4 (for specialists, the protein folding problem is solved with a GDT above 90). On average, its distance error on amino acid placement is 1.6 angstroms, barely more than the size of an atom.

2018-AlphaFold / 2020-AlphaFold 2

DeepMind divided the problem into two steps:

In AlphaFold (2018), step 1 was performed by a CNN (a convolutional neural network of 220 blocks); in the more powerful AlphaFold 2, step 1 was performed by attention-based neural networks.

AlphaFold 2, winner of the CASP 2020 competition

Attention is all you need

BERTology and AlphaFold 2 are both based on neural networks with attention mechanisms (Transformers). NLP models such as GPT-3 and BERT use these mechanisms to capture, for example, the relationship between a pronoun and the noun it refers to in a sentence to be translated.
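The core operation of an attention mechanism, scaled dot-product attention, can be sketched in plain Python. This is a toy version for intuition only, not the code of BERT or AlphaFold 2: real models add learned projections, multiple heads and many stacked layers, and the three two-dimensional token vectors below are invented for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Each query is compared with every key; the resulting weights say how
    much of each value vector goes into the output for that query.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy self-attention: tokens 0 and 2 are identical, token 1 is different.
tokens = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
outputs, weights = attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

In this toy case, token 0 gives most of its attention to itself and to the identical token 2, and very little to token 1. The same mechanism lets a model link a pronoun to its noun, or two amino acids that are far apart in the sequence but close in the folded protein.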

AlphaFold 2: from the competition to the fight against COVID-19

In 2020, DeepMind teams used AlphaFold to predict the structure of proteins associated with SARS-CoV-2, the virus that causes COVID-19.

This analysis of protein structure is essential for understanding the evolution of the SARS-CoV-2 virus (its mutations) and for developing vaccines, as explained by Etienne Decroly, director at CNRS, in the following video.


More on DeepVariant

More on BERTology

More on AlphaFold 2

