Can AI enable disabled people to communicate despite their hearing and speech impairments? - Automatic Speech Recognition

Projects initiated by Google to fight against speech disorders

Speech disorders can be linked to many conditions: deafness, Down syndrome (trisomy 21), ALS (Charcot’s disease) and stroke. Google has launched several projects to give people with speech difficulties access to applications that simplify their daily lives or improve their ability to communicate with others. The Parrotron and Live Transcribe projects are based on speech recognition and text-to-speech algorithms.


To fight against the speech disorders that affect 7.5 million Americans, Google engineers have improved AI-driven speech recognition and text-to-speech models. They have developed Parrotron, which allows people with atypical speech to make themselves understood.

Section 3.2, “Normalization of hearing-impaired speech”, of the page “Audio samples from ‘Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications…’” shows how Parrotron converts the speech of a hearing-impaired person (Input) into speech without impairment (Output).

Parrotron is based on convolutional neural networks (CNNs) and long short-term memory (LSTM) networks.

Parrotron Architecture

The advantage of an LSTM over a conventional RNN is its memory cell. LSTMs were created to solve the short-term memory problem of RNNs. They have internal mechanisms called gates that regulate the flow of information. These gates learn which data in a sequence is important to keep and which can be thrown away. In this way, an LSTM can pass relevant information down a long chain of sequence steps.
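To make the role of the gates concrete, here is a minimal sketch of a single LSTM step in plain Python. It assumes the standard LSTM formulation with one scalar unit; the function name `lstm_step`, the weight layout and the example weights are hypothetical and are not taken from Parrotron.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM (scalar input, hidden and cell state).

    w maps a gate name to a (weight_on_x, weight_on_h, bias) tuple.
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old memory to keep
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    o = sigmoid(pre("output"))       # how much of the memory to expose
    g = math.tanh(pre("candidate"))  # candidate value to store

    c = f * c_prev + i * g           # updated cell memory
    h = o * math.tanh(c)             # new hidden state
    return h, c

# Hypothetical weights: forget gate wide open, input gate almost closed,
# so the cell should carry its previous memory forward nearly unchanged.
keep_weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=0.5, h_prev=0.0, c_prev=0.8, w=keep_weights)
```

With these weights the cell state `c` stays very close to its previous value of 0.8, which is the "keep" behaviour described above. In a trained network the gate weights are learned rather than set by hand.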

Dimitri Kanevsky, an engineer at Google, recorded 20,000 utterances to train his Parrotron model. He then deployed the model on two devices: Google Home and his cell phone, equipped with the Google Assistant application (the equivalent of Apple’s Siri). Voice commands from Dimitri Kanevsky that a standard speech recognition model fails to understand are now understood thanks to Parrotron.

Parrotron demo with Google Home and Google Assistant
Dimitri Kanevsky explains his research

Google Live Transcribe

Google Live Transcribe is an application that has been available since February 2019 on the 1.8 billion Android phones in circulation worldwide. It works with 70 languages and dialects. The application is based on the Google Cloud Speech API, which uses machine learning algorithms capable of transforming audio into written text in real time (automatic speech recognition, ASR).

To train an ASR model, researchers have to record thousands of sentences, corresponding to hundreds of hours of audio. There are also open-source ASR corpora that can be used to train models, such as LibriSpeech: a corpus of about 1,000 hours of read English speech sampled at 16 kHz.

Once the sentences have been recorded, the sound waves are transformed into a spectrogram via a mathematical operation called the Fourier transform, which breaks a complex sound wave down into the simple waves that compose it. Applied to short successive slices of the recording, it shows which frequencies are present at each moment.

Transformation of a recorded sentence into a spectrogram
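As a rough illustration of this step, the sketch below computes a small spectrogram with a naive discrete Fourier transform written in plain Python. The frame size, hop size, Hann window and the synthetic 200 Hz tone are illustrative assumptions, not the settings used by Live Transcribe, and a real pipeline would use an optimized FFT library instead.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude of the discrete Fourier transform for bins 0..N/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags

def spectrogram(signal, frame_size=64, hop=32):
    """Slide a window over the signal and transform each frame.

    Returns a list of rows: one row per time frame, one column per frequency bin.
    """
    rows = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # Hann window: fades the frame edges to reduce spectral leakage.
        windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * t / (frame_size - 1)))
                    for t, s in enumerate(frame)]
        rows.append(dft_magnitudes(windowed))
    return rows

# Toy input: a pure 200 Hz tone sampled at 1600 Hz (both values are made up).
sample_rate = 1600
tone_hz = 200
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64  # each bin is sample_rate / frame_size wide
```

For this toy tone, the strongest bin of the first frame corresponds to 200 Hz. Real speech produces many simultaneous frequency components that change from frame to frame, which is what gives a spectrogram its characteristic texture.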

Models composed of convolutional neural networks (CNNs) and/or recurrent neural networks (LSTM, GRU) are then fed pieces of the spectrogram (Input) and produce as output the letter corresponding to the emitted sound.
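The sketch below illustrates only the last step: turning per-frame letter scores into text. The article does not say how Live Transcribe does this; the code assumes a CTC-style greedy decoding (pick the best letter per frame, merge repeats, drop a "no letter" symbol), and the alphabet, scores and function names are all hypothetical stand-ins for what a trained network would produce.

```python
# Hypothetical three-symbol alphabet; "_" stands for "no letter in this frame".
ALPHABET = ["_", "h", "i"]

def most_likely_letters(frame_scores):
    """Pick the highest-scoring symbol for each spectrogram frame."""
    return [ALPHABET[max(range(len(scores)), key=lambda k: scores[k])]
            for scores in frame_scores]

def collapse(letters):
    """Merge consecutive repeats and drop the "no letter" symbol."""
    out = []
    previous = None
    for letter in letters:
        if letter != previous and letter != "_":
            out.append(letter)
        previous = letter
    return "".join(out)

# Made-up scores for five consecutive frames (one row per frame).
frame_scores = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
    [0.10, 0.20, 0.70],
]
letters = most_likely_letters(frame_scores)
text = collapse(letters)
```

Here the five frames are read as `h, h, _, i, i` and collapse to the word "hi". Several frames usually cover the same sound, which is why repeated letters are merged.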

In 2020, Google researchers developed prototype glasses that display the text transcription of audio on the lenses. They also compared the use of Live Transcribe on cell phones and on the glasses. The result: the glasses allow the disabled person to move around safely, to better perceive their environment and to better follow a conversation with several people.

From “Speech to text” to “Lip to speech”

After speech to text, deaf and hearing-impaired people, as well as people with aphasia, may soon benefit from lip to speech. Generating sound or text will no longer necessarily start from audio; it can start from images.

In May 2020, researchers published a paper on the Lip2Wav model, which can synthesize speech from a “silent” video of a person speaking.

Generation of a sound from a video

Like Parrotron and Google Live Transcribe, the model is based on convolutional neural networks (CNNs) and long short-term memory (LSTM) networks.

Complex Lip2Wav architecture composed of an encoder built with 3D Conv and a decoder built with LSTM

Diplodocus interested in the applications of artificial intelligence to healthcare. Twitter : @