How NLP is used to create labeled datasets
To advance research and train algorithms such as CheXNeXt, researchers need labeled datasets of chest X-rays. CheXpert, MIMIC-CXR, PadChest, ChestX-ray14, and IU X-Ray are the most commonly used datasets.
Manually assigning labels to the images in these datasets is tedious, so researchers build automatic labelers. To create CheXpert, Stanford researchers developed an NLP model that extracts image labels from the de-identified radiology report associated with each X-ray.
This work resulted in a labeled dataset of 224,316 chest X-rays from 65,240 patients who underwent radiographic examinations at Stanford University Medical Center between October 2002 and July 2017. The dataset is open source and lets researchers propose new models to detect lung pathologies such as COVID-19.
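To give a sense of how such a rule-based report labeler works, here is a minimal sketch in Python. The observation keywords, negation cues, and sample report are purely illustrative and far simpler than the rules used by the actual CheXpert labeler.

```python
import re

# Illustrative keyword rules -- NOT the actual CheXpert vocabulary.
OBSERVATIONS = {
    "Cardiomegaly": ["cardiomegaly", "enlarged heart"],
    "Pleural Effusion": ["pleural effusion", "effusion"],
    "Pneumonia": ["pneumonia"],
}
NEGATIONS = ["no ", "without ", "free of "]

def label_report(report: str) -> dict:
    """Assign a positive (1), negative (0), or unmentioned (None) label per observation."""
    text = report.lower()
    labels = {}
    for obs, keywords in OBSERVATIONS.items():
        for kw in keywords:
            idx = text.find(kw)
            if idx == -1:
                continue
            # Very naive negation check: look a few characters back from the mention.
            window = text[max(0, idx - 30):idx]
            labels[obs] = 0 if any(neg in window for neg in NEGATIONS) else 1
            break
        else:
            labels[obs] = None  # observation not mentioned in the report
    return labels

print(label_report("Heart size is enlarged, consistent with cardiomegaly. No pleural effusion."))
# {'Cardiomegaly': 1, 'Pleural Effusion': 0, 'Pneumonia': None}
```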
CheXpert and ChestX-ray14 to diagnose COVID-19
The CheXpert and ChestX-ray14 datasets, enriched with COVID-19 data, allowed IEEE researchers to train two models, CMTNet and ReCoNet, capable of classifying chest X-rays as COVID-19 or non-COVID-19 and of providing visual segmentation of the X-ray to locate anomalies.
- CMTNet
- ReCoNet
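As a rough illustration of the classification half of this task, the sketch below builds a generic binary COVID-19 classifier on a pretrained ResNet-18 backbone with PyTorch. It is not the CMTNet or ReCoNet architecture and omits the segmentation component entirely.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal binary classifier for COVID-19 vs. non-COVID-19 chest X-rays.
# NOT the CMTNet or ReCoNet architecture -- just a generic baseline
# built on an ImageNet-pretrained ResNet-18 backbone.
class CovidXrayClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = models.resnet18(weights="IMAGENET1K_V1")  # downloads pretrained weights
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, x):
        # x: batch of X-ray images resized to 224x224, replicated to 3 channels
        return torch.sigmoid(self.backbone(x))

model = CovidXrayClassifier()
dummy_batch = torch.randn(4, 3, 224, 224)  # 4 fake X-rays for a shape check
print(model(dummy_batch).shape)  # torch.Size([4, 1]) -- probability of COVID-19 per image
```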
BERT: a new era for NLP
The CheXpert research paper indicates that the Stanford researchers used the following NLP libraries to develop this labeler:
- NLTK (Bird, Klein, and Loper 2009), the Natural Language Toolkit, a leading platform for building NLP programs in Python
- the BLLIP parser (Charniak and Johnson 2005; McClosky 2010), a statistical natural language parser available as a Python package
- Stanford CoreNLP (De Marneffe et al. 2014), a natural language processing library in Java.
To carry out the full natural language processing pipeline, researchers also have other NLP libraries at their disposal: spaCy, Gensim, Spark NLP, PyTorch-NLP, scikit-learn, TensorFlow, and Transformers.
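As a small example of the preprocessing these libraries handle, the snippet below uses NLTK to split a report into sentences and word tokens, the kind of step a report labeler performs before applying its extraction rules. The sample report text is invented.

```python
import nltk

# Download the sentence tokenizer models on first run
# (assumes network access; may also require "punkt_tab" on recent NLTK versions).
nltk.download("punkt", quiet=True)

report = (
    "The cardiac silhouette is enlarged. "
    "No focal consolidation, pleural effusion, or pneumothorax is seen."
)

# Split the report into sentences, then into word tokens.
for sentence in nltk.sent_tokenize(report):
    print(nltk.word_tokenize(sentence))
```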
The publication of the BERT model (Bidirectional Encoder Representations from Transformers) by Google in 2018 marked, for many experts, the beginning of a new era in Natural Language Processing (NLP).
BERT builds on the strengths of two earlier models: ELMo (which takes into account the context of each word within the sentence) and OpenAI GPT (whose attention mechanism identifies the most important words in the sentence).
BERT was pre-trained on BookCorpus (800M words) and English-language Wikipedia (2,500M words); pre-training took about 3 days on 16 TPUs. Its two pre-training tasks are predicting masked words and predicting whether one sentence follows another. As an aside, there are two French versions of BERT: CamemBERT and FlauBERT.
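The masked-word prediction task is easy to try with the Hugging Face Transformers library and a pretrained bert-base-uncased checkpoint; the sentence below is just an illustrative example.

```python
from transformers import pipeline

# Masked-word prediction with a pretrained BERT model.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT suggests the most likely tokens for the [MASK] position, with scores.
for prediction in unmasker("The chest X-ray shows a small pleural [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```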
CheXbert: improving the CheXpert labeler
The Stanford researchers who developed the CheXpert labeler have proposed a new, more powerful model, CheXbert, for labeling radiology reports. As its name suggests, CheXbert is based on training a BERT model.
BERT is first trained on the rule-based annotations produced by the CheXpert labeler, and then fine-tuned on a small set of radiologist annotations augmented with automated back-translation.
Thanks to BERT, CheXbert outperforms CheXpert, setting a new state of the art for report labeling on one of the largest chest X-ray datasets.
Researchers can compare the performance of the different labelers (T-auto, CheXpert, and CheXbert) and understand how each interprets the words in a sentence and how it generates a label.
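Conceptually, a CheXbert-style labeler is a BERT encoder topped with one small classification head per observation. The sketch below illustrates that idea only; the observation list, number of classes, and head layout are assumptions made for the example, not the authors' released implementation.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Sketch of a CheXbert-style labeler: a shared BERT encoder with one
# classification head per observation. Hyperparameters are illustrative.
OBSERVATIONS = ["Cardiomegaly", "Pleural Effusion", "Pneumonia"]
NUM_CLASSES = 4  # e.g. positive / negative / uncertain / not mentioned

class ReportLabeler(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.heads = nn.ModuleList(
            [nn.Linear(self.bert.config.hidden_size, NUM_CLASSES) for _ in OBSERVATIONS]
        )

    def forward(self, input_ids, attention_mask):
        # Use the pooled [CLS] representation of the report as shared features.
        cls = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return [head(cls) for head in self.heads]  # one logit vector per observation

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Heart size is enlarged. No effusion."], return_tensors="pt")
logits = ReportLabeler()(batch["input_ids"], batch["attention_mask"])
print([l.shape for l in logits])  # three tensors of shape (1, 4)
```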
BigBird for more efficient genome sequencing
At the end of July 2020, Google researchers published a new research paper on arXiv presenting BigBird, a new NLP model that outperforms BERT. BigBird uses a sparse attention mechanism that allows it to process sequences up to 8 times longer than BERT can handle with the same computing power.
One of the applications of BigBird identified by the researchers is DNA sequencing. DNA sequence analysis can be used to identify, diagnose, and potentially find treatments for genetic diseases. It is also used to analyze viruses and develop vaccines.
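BigBird's block-sparse attention is available in the Hugging Face Transformers library. The sketch below feeds a long, DNA-like dummy string through the google/bigbird-roberta-base checkpoint simply to show the longer context window; real genomics work would use a tokenizer and checkpoint trained on DNA data, so the input here is only a stand-in.

```python
from transformers import BigBirdTokenizer, BigBirdModel

# Load BigBird with its block-sparse attention, which is what lets it
# handle much longer sequences than BERT at comparable cost.
tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-base", attention_type="block_sparse"
)

long_text = "ATGCGTAC " * 400  # dummy stand-in for a long document or DNA-like sequence
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```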
Going further… Natural Language Processing (NLP) Zero to Hero with TensorFlow
Going further… with BigBird