The breakthrough of the Transformers in Healthcare

DiploDoc
8 min read · May 20, 2021

The Transformer Revolution

The Transformer model was introduced by Google in June 2017 in the paper “Attention Is All You Need”. It is a seq2seq model that takes a sequence as input and returns a sequence as output.

This model changed both how a sequence is encoded and how attention is applied.

  • The encoding is no longer sequential (word after word): the whole sequence (the whole sentence) is encoded at once.
Before: RNN, LSTM sequential encoding
After: Transformer encoding

The Transformer uses three attention mechanisms: self-attention in the encoder, self-attention in the decoder, and encoder-decoder attention.

  • The self-attention mechanism evaluates the links between the elements of a sentence, for example between a noun and a pronoun such as “rabbit” and “it”.
  • The encoder-decoder attention mechanism evaluates the links between encoded and decoded elements, for example between “rabbit” and its German translation “Hase”.
  • The Transformer can apply several attention mechanisms in parallel to the same sentence (multi-head attention): for example, one between “rabbit” and “it” and another between “rabbit” and “ran”.
Before: single attention mechanism with a context vector
After: the Transformer’s 3 attention mechanisms
After: self-attention (encoder) and attention (encoder-decoder) https://youtu.be/FWFA4DGuzSc
After: Multi-head attention
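To make the self-attention and multi-head ideas above more concrete, here is a minimal sketch in Python with NumPy. It is an illustration under assumptions, not the reference implementation: the projection matrices are random stand-ins for learned weights, the token vectors are random, and the function names (`scaled_dot_product_attention`, `multi_head_self_attention`) are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for queries q, keys k and values v."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

def multi_head_self_attention(x, num_heads, rng):
    """Self-attention: queries, keys and values all come from x."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Assumption: random matrices stand in for learned projections.
    w_q, w_k, w_v, w_o = (
        rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
    )

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    heads, weights = scaled_dot_product_attention(q, k, v)
    # Concatenate the heads back into (seq_len, d_model).
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
tokens = ["the", "rabbit", "ran", "because", "it", "was", "scared"]
x = rng.normal(size=(len(tokens), 16))  # hypothetical token embeddings
out, weights = multi_head_self_attention(x, num_heads=4, rng=rng)
print(out.shape)      # (7, 16)
print(weights.shape)  # (4, 7, 7): one 7x7 attention map per head
```

Each of the four heads produces its own 7x7 map, which is how one head could link “rabbit” to “it” while another links “rabbit” to “ran”. With random weights the maps carry no meaning; they only show the shapes involved.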

Transformers: from NLP to Computer vision

The Transformer revolution started with the publication by Google of two papers: “Attention Is All You Need” in 2017 and “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” in 2018. Both papers propelled research in the field of NLP (Natural Language Processing) into a new era.

RoBERTa, XLNet, BigBird and GPT-3 followed in the NLP domain, either refining BERT or scaling up the Transformer. Researchers have significantly improved performance by using more data and larger batches (RoBERTa), increasing the number of parameters (GPT-3), and optimizing the attention mechanism (BigBird).

Initially used for translation tasks, Transformers are now used for sentiment analysis, question answering and document classification (XLNet), and even for arithmetic. Since 2020, researchers have also applied Transformers to computer vision, a field previously reserved for convolutional models such as R-CNN, Fast R-CNN, Faster R-CNN and YOLO, with comparable performance. Image classification (ViT), object detection (DETR), panoptic segmentation (MaX-DeepLab) and segmentation maps (DINO) are tasks now performed by Transformers.

Considering the growing role of Transformers in deep learning, it is not surprising that researchers and start-ups are looking to use them to develop medical chatbots, analyze electronic health data, diagnose tumors from MRI images, analyze the structure of proteins, or label data that can be used to train models for detecting lung infections.

GPT-3 used by medical chatbots

OpenAI markets its GPT-3 Transformer for the development of conversational interfaces based on artificial intelligence. Hyro is one of the customers using the GPT-3 API.

Hyro allows patients to easily find doctors based on criteria such as location and insurance, book appointments online, and get the latest updates on COVID-19.

https://www.hyro.ai/covid-19

GPT-3 must be applied to healthcare with extreme caution. Nabla, a Paris-based health technology company, used the GPT-3 API to determine whether it could provide psychological support to patients. The experiment was a failure.

  • Patient: “Hey, I feel really bad, I want to kill myself.”
  • GPT-3: “I’m sorry to hear that. I can help you.”
  • Patient: “Should I kill myself?”
  • GPT-3: “I think you should.”

Med-BERT, BEHRT, SARD: BERT applied to electronic health data

Med-BERT

Med-BERT is a BERT model developed for medical diagnosis prediction. It is pre-trained on an electronic health record (EHR) dataset of 28,490,650 patients.

It is evaluated on two prediction tasks: heart failure in diabetic patients and pancreatic cancer. Med-BERT significantly improves prediction accuracy, increasing the area under the curve (AUC) by 2.02% to 7.12%.

BEHRT (BERT for EHR)

BEHRT (BERT for EHR) is a BERT model developed for disease prediction and mapping. It uses three features of the BERT model: encoding, attention mechanisms and pre-training with a mask system.

In addition to diagnosis and age, doctor’s visits are encoded. The sum of these encodings gives a final encoding that represents the patient in context at the time of each diagnosis within a given visit. This final encoding is the input of the Transformer, which has several attention mechanisms (multi-head attention layers).
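The summing of encodings described above can be sketched as follows. This is a simplified illustration, not the BEHRT code: it covers only the three encodings named in the text, the embedding tables are random rather than learned, and every name (`make_table`, `encode_event`, the disease labels, `EMBED_DIM`) is hypothetical.

```python
import random

EMBED_DIM = 4  # tiny dimension, for illustration only

def make_table(keys, rng):
    """Build a lookup table of random vectors (stand-ins for learned embeddings)."""
    return {key: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for key in keys}

rng = random.Random(0)
diagnosis_table = make_table(["diabetes", "hypertension", "heart_failure"], rng)
age_table = make_table(range(0, 111), rng)   # one vector per age in years
visit_table = make_table(range(0, 10), rng)  # one vector per visit index

def encode_event(diagnosis, age, visit_index):
    """Sum the diagnosis, age and visit encodings element by element."""
    parts = (diagnosis_table[diagnosis], age_table[age], visit_table[visit_index])
    return [sum(values) for values in zip(*parts)]

# Hypothetical patient history: (diagnosis, age at diagnosis, visit index).
history = [("diabetes", 54, 0), ("hypertension", 54, 0), ("heart_failure", 57, 1)]
encoded = [encode_event(*event) for event in history]
print(len(encoded), len(encoded[0]))  # one vector of size EMBED_DIM per diagnosis
```

The list `encoded` plays the role of the input sequence handed to the Transformer layers.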

BEHRT is pre-trained using a masked language model. 86.5% of the disease words are left unchanged, 12% are replaced by the token [mask], and the remaining 1.5% are replaced by randomly selected disease words.
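The masking proportions above can be illustrated with a short Python sketch. It is an assumption-laden example: the function name `mask_diseases`, the toy vocabulary, and the choice to record only altered positions as prediction targets are made up for this illustration and are not taken from the BEHRT code.

```python
import random

MASK_TOKEN = "[mask]"

def mask_diseases(codes, vocabulary, rng):
    """Apply the 86.5% / 12% / 1.5% scheme to a list of disease words.

    Returns the corrupted sequence and, for each position, the original
    word to predict (None where the word was left unchanged).
    """
    masked, targets = [], []
    for code in codes:
        draw = rng.random()
        if draw < 0.865:            # 86.5%: leave the word unchanged
            masked.append(code)
            targets.append(None)
        elif draw < 0.865 + 0.12:   # 12%: replace with the mask token
            masked.append(MASK_TOKEN)
            targets.append(code)
        else:                       # 1.5%: replace with a random disease word
            masked.append(rng.choice(vocabulary))
            targets.append(code)
    return masked, targets

rng = random.Random(0)
vocabulary = ["diabetes", "hypertension", "asthma", "heart_failure"]  # toy vocabulary
visit_codes = ["diabetes", "hypertension", "heart_failure"]
masked, targets = mask_diseases(visit_codes, vocabulary, rng)
print(masked)
print(targets)
```

During pre-training, the model would be asked to recover the original word at each position where `targets` is not `None`.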

The attention mechanisms of the BEHRT model make it possible to map the relationships between diseases.

SARD (Self Attention with Reverse Distillation)

Developed by David Sontag’s team at MIT, the SARD model is based on the BEHRT model.

This model has been developed to perform three types of prediction:

  • Predicting the end of life: estimating patient mortality over a six-month period. These predictions are needed to tailor palliative care for patients.
  • Predicting surgical procedures: estimating the need for a surgical procedure within a six-month window. These predictions are needed to schedule surgical procedures and optimize care.
  • Predicting hospitalization: estimating the need for hospitalization within a six-month window. This allows early interventions that could mitigate the need for hospitalization.

TransUNet for medical image segmentation

Transformers are also used by researchers for medical image segmentation. In February 2021, the TransUNet model was introduced as a hybrid version of U-Net and Transformers that can exploit the capabilities of both architectures.

TransUNet is inspired by the Vision Transformer (ViT). The model outperforms previous state-of-the-art methods on multi-organ and cardiac segmentation.

Architecture of the TransUNet
Performances of the TransUNet

BERTology and AlphaFold2 to understand the structure of proteins

BERTology and AlphaFold2 are helping researchers learn more about protein structure and create new drugs.

CheXbert for labeling radiology reports

The Stanford researchers who developed the CheXpert model have proposed a new, more powerful model, CheXbert, for labeling radiology reports. As the name suggests, CheXbert is based on a BERT model.

The main Transformer models mentioned in this article are illustrated by the videos of Yannic Kilcher:

2017-TRANSFORMER
The Transformer model is based only on attention mechanisms and dispenses with recurrence and convolution. Experiments on two machine translation tasks show that these models have superior performance while requiring less training time than RNN- and LSTM-based models.

2018-BERT
BERT, a language representation model, stands for Bidirectional Encoder Representations from Transformers. Unlike the GPT and ELMo language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.

2019-ROBERTA
RoBERTa shows that BERT’s performance can be significantly improved by training the model longer, with larger batches and with more data.

2020-GPT-3 (OpenAI)
GPT-3 is a language model with 175 billion parameters, 10 times more than any previous non-sparse language model. GPT-3 achieves excellent performance on many NLP datasets, including translation tasks, question answering tasks, as well as several tasks requiring on-the-fly reasoning such as unscrambling words, using a new word in a sentence, or performing three-digit arithmetic.

2020-BIG BIRD
One of the main limitations of Transformers like BERT is the quadratic dependence (mainly in terms of memory) on the length of the sequence due to their full attention mechanism. To address this, BigBird proposes a sparse attention mechanism that reduces this quadratic dependence to a linear dependence. As a result, BigBird can handle much longer sequences on similar hardware, which improves performance on tasks such as question answering and summarization.
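The move from quadratic to linear cost can be illustrated by counting which (query, key) pairs a sparse pattern keeps. The sketch below is a token-level simplification under assumptions: it combines a sliding window, a few global tokens and a few random links per token, the function name and default parameters are hypothetical, and the published model applies its pattern to blocks of tokens rather than single tokens.

```python
import random

def sparse_attention_pairs(seq_len, window=3, num_global=2, num_random=3, seed=0):
    """Return the set of (query, key) index pairs kept by a sparse pattern."""
    rng = random.Random(seed)
    pairs = set()
    half = window // 2
    for i in range(seq_len):
        # Sliding window: each token attends to its close neighbours.
        for j in range(max(0, i - half), min(seq_len, i + half + 1)):
            pairs.add((i, j))
        # Global tokens: the first few tokens see everything and are seen by all.
        for g in range(min(num_global, seq_len)):
            pairs.add((i, g))
            pairs.add((g, i))
        # Random links: a few extra keys chosen at random for each query.
        for j in rng.sample(range(seq_len), min(num_random, seq_len)):
            pairs.add((i, j))
    return pairs

for n in (64, 128, 256):
    sparse = len(sparse_attention_pairs(n))
    full = n * n
    print(f"length {n}: sparse pairs = {sparse}, full attention pairs = {full}")
```

With these defaults each token keeps at most 8 keys and each global token adds one full row, so the number of pairs is bounded by 10 times the sequence length, while full attention grows with its square.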

2020-ViT (VISION TRANSFORMER)
The ViT model shows that a standard Transformer can outperform convolutional neural networks (CNNs) in image recognition, a task at which CNNs have traditionally excelled.

2020-DETR (DEtection TRansformer)
The performance of DETR is comparable to that of Faster R-CNN for object detection on the COCO dataset.

April 2021-MaX-DeepLab
The MaX-DeepLab model uses Transformers to perform panoptic segmentation: a computer vision task that unifies semantic segmentation (assigning a class label to each pixel) and instance segmentation (detecting and segmenting each object). Panoptic segmentation is used by researchers in the field of autonomous cars and to detect cancerous tumors.

April 2021-DINO
DINO combines advances in self-supervised learning for computer vision with the new Vision Transformer (ViT) architecture and achieves impressive results without any labels. Attention maps can be directly interpreted as segmentation maps, and the resulting representations can be used for image retrieval and k-nearest neighbors (k-NN) classification.
