
Speech Recognition

10.0    Overview

As Wall Street analyst Mary Meeker noted in 2016, people can speak at a rate of 150 words per minute but can only type 20-80 words per minute. As a result, speech recognition capabilities can have a huge impact on productivity. Just as Captain Kirk ran the Starship Enterprise using voice commands, much of the world uses speech recognition daily.

When we talk to Siri on iPhones or Google Assistant on Android devices, our speech is converted to text almost as accurately as if we had typed the text ourselves. Moreover, speaking is roughly three times faster than typing and produces fewer errors. These advantages are even more pronounced for non-Roman character sets such as Mandarin Chinese (Ruan et al, 2016).

People also use speech recognition to create documents, emails, medical notes, and much more. Call centers, voice mail systems, and even auto manufacturers are incorporating speech recognition to enable users to give commands by voice. Speech recognition systems take an audio signal as input and produce a string of characters as output.

The input might be a spoken query to Siri on an iPhone, it might be a customer on a telephone stating what they need to an automated voice response system, or it might be a traveler in a foreign country who wants to speak into their smartphone and have the smartphone respond by saying the same thing in the foreign language.

Speech recognition handles the front end of each of these scenarios. For the Siri query, speech recognition is responsible for transforming the audio signal into a string of text. In a chatbot, that string of text is interpreted into a response or action. The same is true for the customer service example.

For the foreign traveler, the actual translation of the string of characters will be handled by a machine translation system. Recently, systems like Google Translate and Dragon Dictate have been able to do a good job of speaker-independent, unlimited-vocabulary, conversational speech recognition. In fact, Microsoft and Baidu have published papers on speech recognition systems that transcribe better than humans in controlled environments with unaccented, clear speech (though no system is yet as good as a human in noisy environments, with accented speakers, or when people don’t speak clearly).

But it’s been a long road to get to this point.  In fact, the Watson DeepQA system that beat two Jeopardy! champions just seven years ago received its questions as a string of text because using speech recognition technology would likely have put the system at a significant disadvantage as a result of incorrectly converting the spoken questions into text strings.

Interestingly, in just a few short years, speech recognition technology has improved to the point where providing the questions verbally wouldn’t put the computer system at a disadvantage. Let’s talk about how we got this far so quickly after so many years of minimal success and also talk about the limitations of what has been achieved.

10.1    The speech signal

Human speech starts with the larynx, which produces sound energy. The actual sounds are produced by the vocal cords repeatedly coming together and blocking the airflow from the lungs. The sounds of speech are shaped by movements of the lips, tongue, and teeth that route the flow of air to different parts of the oral and nasal passages. This blocks certain sound frequencies, and the frequencies that emerge form the speech sounds we hear.

The human ear can hear frequencies between 20Hz and 20kHz (20,000Hz), but most speech occurs at frequencies under 8kHz. If we take samples of an acoustic signal over time and apply a mathematical algorithm named a fast Fourier transform, the result will be a spectrogram that shows the frequencies that make up a sound from low to high on the y-axis and how they change over time on the x-axis. The amplitude or intensity is a third dimension in a spectrogram that is often indicated by color intensity.  The image below is a spectrogram of me speaking the words “Speech recognition is hard”:

Speech recognition: spectrogram of speech with pauses

In creating this spectrogram, I paused between words. One can see a clear segmentation between words.
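
For readers who want to reproduce this kind of plot, the short Python sketch below computes a spectrogram with SciPy's short-time Fourier transform. The file name, window length, and hop size are illustrative assumptions, not the settings used for the figures in this chapter.

```python
# A minimal sketch of computing and plotting a spectrogram from a WAV file.
import numpy as np
from scipy.io import wavfile
from scipy import signal
import matplotlib.pyplot as plt

sample_rate, audio = wavfile.read("speech_recognition_is_hard.wav")  # hypothetical file
if audio.ndim > 1:                      # mix down to mono if the file is stereo
    audio = audio.mean(axis=1)

# 25 ms windows with a 10 ms hop, a common choice for speech analysis.
nperseg = int(0.025 * sample_rate)
noverlap = nperseg - int(0.010 * sample_rate)
freqs, times, power = signal.spectrogram(audio, fs=sample_rate,
                                         nperseg=nperseg, noverlap=noverlap)

plt.pcolormesh(times, freqs, 10 * np.log10(power + 1e-10), shading="auto")  # dB scale
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```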

10.2    Automatic speech recognition challenges

In addition to the different pronunciations and dialects and the vast number of words in the world’s languages, there are several dimensions that impact the performance and ease of development of speech recognition systems:

Continuous speech vs isolated words: Speech recognition is much easier if the speaker pauses between words. The reason is straightforward – it is much easier to identify word boundaries if there are pauses. This is the reason that early commercial dictation systems required pauses between words.  To create the spectrogram below, I spoke at a conversational pace without any forced pauses:

Speech recognition: A spectrogram of "speech recognition is hard" without pauses between words

Here, it is difficult to discern the word segmentation. Early speech recognition systems required the speaker to pause between words to make word segmentation easier.

Vocabulary size: It is far easier to build a speech system for a small limited vocabulary. As will be discussed below, the first speech recognition system developed in 1952 was limited to 10 words – one for each digit. This effectively turned it into a machine learning classification problem with 10 categories.

As vocabulary size increases, the classification problem becomes more and more difficult. For recognition of everyday speech, even with a 170,000-word vocabulary, there will still be numerous out-of-vocabulary words such as proper names and technical terms.

External knowledge: In many cases, speech can’t be accurately transcribed without reference to information that is not contained in the speech signal. For example, homophones are words that sound alike but differ in spelling or meaning. Examples are “to”, “two”, and “too” or “write”, “right”, and “Rite” (as in the proper name Rite Aid). The information contained in the speech signal is simply insufficient to determine which word was spoken.

Human listeners rely on knowledge that is independent of the speech signal. For example, assuming they are familiar with the pharmacy chain Rite Aid, they know that if a word that sounds like “right” is followed by the word “aid”, then the two words together reference the pharmacy chain.

Another example is the recognition of words not known to the speech system. For example, if I say “print document xxxxy25”, I can use grammatical rules and world knowledge to help determine that xxxxy25 is a document name.

External knowledge about the type of speech is also used. If I know the speech input is broadcast news in English, I can use English grammar rules to help figure out what is being spoken when my speech algorithm offers up alternative interpretations of the input. In contrast, if the speech signal is a telephone conversation, ungrammatical speech segments are commonplace and grammar is not as reliable a cue.

Similarly, if the speaker vocabulary is restricted, for example, to a set of commands, the selection of the correct word is made easier by reference to the allowed set of words. Earlier commercial speech recognition systems were often limited to a small set of vocabulary words in order to provide adequate performance.

Last and perhaps most important, the speech recognition system will not usually recognize all phonemes (and words) with perfect accuracy. In many cases, the speech system may deem multiple phonemes to be possible interpretations of a segment of the speech signal, and external knowledge must be used to determine the most likely word. Similarly, in converting sequences of phonemes to words, multiple words might be possible interpretations, and external knowledge about which words are allowed and/or are likely to occur in sequence is required to pick the most likely word.

Individual speaker differences: Different speakers have different vocal tract physiologies and may have non-native or regional accents. Some speak slowly and clearly. Others speak fast and are sometimes difficult even for humans to understand.

Read speech vs conversational speech: It is easier to understand the speech of someone reading a book or other printed text than it is to understand conversational speech which has a lot of added complexity such as

    • Hesitations, stutters, and words cut off mid-utterance
    • Laughter, sneezing, and coughing
    • Non-lexical utterances (e.g. “huh”, “umm”)
    • Confirmations (“uhhuh” is 1% of the words in a conversation (Matsoukas et al, 2006))
    • Mispronunciations (and speaker corrections)
    • Poor articulation especially of function words (i.e. prepositions, articles, and pronouns)
    • Emotion

Environment: There are several environmental conditions that effectively degrade the speech signal including

    • Background conversations
    • Other types of background noise
    • Handset and/or microphone quality
    • Landline, VoIP, or cellular connection quality
    • Additionally, if the audio is recorded in stereo, the two channels need to be synchronized
    • Multiple speakers

Language and dialect identification: When someone uses a smartphone to dictate, request a translation, ask Siri, or do any other voice-initiated task, the underlying system must either be trained on all languages and dialects or have separate systems for each language and perhaps for each dialect. The latter is a far more tractable approach; however, it requires a pre-processing step of determining which language and dialect is being spoken. This is typically implemented as a machine learning classification problem in which the languages are the categories, as sketched below.
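
As a toy illustration of framing language identification as classification, the sketch below averages MFCC features over each utterance and trains a classifier whose classes are the languages. The file names, the feature choice, and the classifier are hypothetical simplifications; production systems use far richer models.

```python
# A toy sketch of language identification as a classification problem.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def utterance_features(path):
    audio, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # (13, frames)
    return mfcc.mean(axis=1)                                 # one fixed-size vector per utterance

# Hypothetical labeled corpus: audio files plus the language spoken in each.
train_files = ["english_001.wav", "spanish_001.wav", "mandarin_001.wav"]
train_languages = ["en", "es", "cmn"]

X = np.stack([utterance_features(f) for f in train_files])
classifier = LogisticRegression(max_iter=1000).fit(X, train_languages)

# The predicted category tells the system which recognizer to route the audio to.
print(classifier.predict([utterance_features("unknown_utterance.wav")]))
```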

Prosody: Human speech is full of differences in loudness, vowel length, and pitch placed on syllables and/or words. For example, in English, a higher pitch at the end of a sentence often conveys a question. In tonal languages like Mandarin Chinese, the same syllable spoken with a different pitch has a different meaning. In many languages, pitch indicates sentence boundaries and/or emphasis. The duration and/or rhythm of pitch changes (i.e. the intonation) can also signal surprise, anger, or other emotions.

Language and dialect training data: No matter what algorithm is used to train a speech system, the amount of labeled data available is key. There are huge amounts of labeled data available for major languages such as English and Spanish though less so for regional dialects of those languages. For example, the Google Cloud speech recognition system only offers speech recognition for 120 of the world’s 7000+ languages as of September 2019.

Code switching: There are many bilingual speakers in the world, and these speakers often switch languages back and forth when they communicate. Speech recognition systems need to detect these switches and be able to process both languages.

Codecs:  Audio data is stored in many different formats.  Numerous codecs (programs that encode and decode audio) are used to compress audio streams.  Each format has different characteristics such as the amount of loss that occurred during compression.

 

10.3   Speech processing tasks

Automatic speech recognition (ASR) has had a major impact on our everyday lives.  However, there are many other speech processing tasks that have been studied that have the potential to impact our everyday lives.  These include:

  • Speech translation:  Translating input speech into a different language
  • Speaker identification:  Identifying the speaker in an audio stream.  Each utterance is classified as having been produced by one of a set of registered speakers.
  • Speaker verification: Determining whether an utterance matches a model of a specific speaker.
  • Language identification: Identifying the language of an utterance
  • Emotion recognition:  Classifying the emotions in an utterance.
  • Voice conversion:  Converting the voice of one speaker to sound like another speaker
  • Text-to-speech:  Producing speech from text
  • Speech diarization and separation:  Separation aims at extracting each speaker from overlapped speech, while diarization identifies time boundaries of speech segments produced by the same speaker.
  • Phone and/or phoneme classification:  Identifying the phones and/or phonemes in an audio stream

Different speech tasks require different types of representational learning.  For example, speaker identification presumably requires learning only low-level acoustic representations, whereas speech translation requires learning a higher-level semantic representation.

10.4   Speech recognition datasets

Several datasets of speech recognition data plus correct transcripts have been created including the following datasets mentioned in this chapter:

  • Switchboard (Godfrey et al, 1992) A dataset of 260 hours of speech consisting of 2400 telephone conversations among 543 speakers.
  • Librispeech (Panayotov et al, 2015) A corpus of 1000 hours of speech derived from audiobooks.  There are three training datasets of 100 hours, 360 hours, and 500 hours of speech.  There are two dev datasets:  A dataset with clean data and a noisy dataset with challenging speech.  There are also two test datasets.
  • Multilingual LibriSpeech (Pratap et al, 2020) A multilingual speech corpus derived from audiobooks.  It contains speech in 8 languages, including 44.5K hours of English and 6K hours of other languages.
  • CallHome (Canavan et al, 1997) A dataset consisting of 120 30-minute telephone conversations.
  • 2000 Hub5 (Linguistic Data Consortium, 2002) A dataset consisting of 40 English telephone conversations that were used in the 2000 HUB5 evaluation sponsored by NIST.
  • TIMIT (Garofolo et al, 1993a) A corpus containing recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences. The corpus includes time-aligned orthographic, phonetic and word transcriptions.
  • WSJ (Garofolo et al, 1993b) This corpus contains read speech from the Wall Street Journal.  There are two datasets known as WSJ0 and WSJ1.
  • CoVoST 2 (Wang et al, 2020) A multilingual corpus with speech-to-text translations into English from 21 languages and from English into 15 languages.
  • CVSS (Jia et al, 2022) A multilingual speech-to-speech corpus based on the CoVoST 2 dataset that includes sentence-level translation pairs from 21 languages into English.
  • Babel (Gales et al, 2014) A multilingual corpus with speech recognition and keyword spotting datasets in 45 languages.
  • VoxLingua107 (Valk et al, 2020) A multilingual language recognition corpus with an average of 62 hours of audio for 107 languages.
  • VoxPopuli (Wang et al, 2021) A multilingual corpus with 400K hours of unlabeled speech in 23 languages plus 1.8K hours of transcribed speech in 15 languages.
  • CommonVoice (Ardila et al, 2020) A multilingual corpus with 7,335 hours of transcribed speech in 60 languages.
  • FLEURS (Conneau et al, 2022) A 102-language parallel speech dataset with 12 hours of supervised speech per language.

10.5   Speech recognition prior to ~2010

As discussed below, deep learning technology revolutionized speech recognition.  This section provides a historical perspective on the techniques used prior to the advent of deep learning.

10.5.1    Consonant- and vowel-based methods

The first published attempt at speech recognition was developed in 1952 by a group of researchers at the Bell Laboratories facility in Murray Hill, New Jersey (Davis et al, 1952). Computers had been invented but weren’t in widespread use and at the time it was probably as easy to develop custom circuitry as it was to program a computer (never mind getting access to one of the few available computers).

This group of researchers developed an ingenious custom circuit that ended with 10 gas tubes. A single male speaker would utter a random string of digits – pausing at least 1/3 of a second between digits – and the correct gas tube would light up for each digit with 97% accuracy. The discrimination algorithm built into the circuit essentially identified vowel patterns and mapped them to digits.

This approach was only feasible by limiting the speaker vocabulary to 10 digits. If one tried to recognize all 170,000+ English words in this fashion, it wouldn’t work because vowels alone don’t provide enough information to discriminate among a large set of words. At a minimum, an algorithm would need to consider consonants as well as vowels.

Unfortunately, even basing a speech recognition algorithm on consonants and vowels is problematic. Though there are only twenty-six letters in the English alphabet, individual letters do not always have the same sounds. For example, in the phrase “speech recognition,” there are three e’s, and each has a different sound. The first ‘e’ is hard. The second ‘e’ is silent, and the third ‘e’ is soft. Worse, these letters will sound different when pronounced by various speakers. Regional pronunciation patterns and the many dialects of English exacerbate this issue.

10.5.2   Word-based methods

In the late 1970’s, Bell Laboratory researchers (e.g. Rabiner and Wilpon, 1979) experimented with word-based methods.  They started with a dataset of words with many examples of each word.  They used a clustering algorithm to cluster the examples of the same word together.  For each test word, they used a distance measure to find the closest cluster.

The word-based approach is problematic for two reasons.  First, full words have different regional pronunciations. I pronounce tomato and potato with a hard ‘a.’ As a child, I was surprised to learn that in other parts of the US, people pronounce these words with a soft ‘a.’

Second, there are just too many words.  The Oxford dictionary contains over 170,000 English words, and that does not include proper names. To create a training table based on words would require numerous spoken examples of each of the 170,000 words.

One reason is that different speakers pronounce words differently. Another reason is that the audio waveform for a word will change based on the word spoken before the word and the one that came after. Also, some words (like “there” and “their”) have different spellings but sound the same.

Another issue is that the supervised learning algorithm would need 170,000 output categories. The higher the number of output categories, the harder it is to train a supervised learning system.

10.5.3  Phonemes

Instead, many speech recognition systems use sub-word units that have much lower numbers of possibilities. The most commonly used sub-word unit is the phoneme (others are syllables and sequences of multiple phonemes). Phonemes represent the distinctive sound units that make up the words of a language.

There are 13 to 21 vowel phonemes and 22 to 26 consonant phonemes in the English language depending upon dialect. Standard American English has 44 phonemes. The International Phonetic Association Handbook (IPA) is generally recognized as the definitive source on the phonemes that represent the world’s languages. If we include all languages, depending on which expert you ask, the total number of phonemes could run into the several hundreds or even thousands.

The discussion below will occasionally refer to phones as opposed to phonemes. A phone refers to an acoustic sound. A phoneme refers to any phone that is part of a word and which, if swapped with another phone, would change the meaning of the word. Essentially, you can think of a phone as the raw sound and a phoneme as a conceptual categorization of the sound. Part of the speech recognition process (the acoustic model) involves processing the phones that comprise the audio signal and determining which phonemes best represent them.

The IPA contains a good description of how the human anatomy produces each phone/phoneme. The IPA classifies consonant phonemes based on the place of articulation (e.g. bilabial, dental, …), the manner of articulation (e.g. plosive, nasal, …), and the phonation type (voiced vs voiceless). It classifies vowel phonemes based on tongue height/advancement, nasality, and lip rounding. Using phonemes (or sub-words) instead of words has several advantages:

  • Training is much easier.  There are far fewer phonemes (e.g. 40 or so in English) than words (e.g. 170K for English) so far fewer labeled acoustic samples are required.
  • The system will do much better on out-of-vocabulary words and proper names.
  • The computational load is far less.

10.5.4  Stages of speech recognition systems

Prior to the advent of machine learning, most speech recognition systems had the following stages:

Stages of traditional speech recognition systems

10.5.4.1    Preprocessing of the acoustic signal

The first stage of speech recognition was to break the acoustic signal down into small windows or frames of about 20 to 25 milliseconds. Because the speech signal for a phoneme is typically relatively constant for only about 10 to 20 milliseconds, a window placed arbitrarily may straddle a phoneme boundary and fail to capture a phoneme cleanly.

To make sure some windows capture phonemes, researchers use overlapping windows. For example, they will start a new window every ten milliseconds. Then a mathematical algorithm (usually a fast Fourier transform) is used to convert the signal in each frame to the frequencies and amplitudes over time that can be seen in a spectrogram. Additionally, some filtering is used to reduce noise and to remove low and high frequencies that don’t typically convey critical information.
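
A minimal Python sketch of this framing step is shown below, assuming a 16 kHz signal, 25 ms windows, and a 10 ms hop; the Hamming window and the synthetic audio are illustrative choices.

```python
# A minimal sketch of splitting a signal into overlapping windows and
# computing the FFT magnitude of each window.
import numpy as np

def frame_signal(audio, sample_rate, frame_ms=25, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop_len)
    frames = np.stack([audio[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)        # taper each frame

def frame_spectra(frames):
    return np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrum per frame

# Example with one second of synthetic audio at 16 kHz.
sample_rate = 16000
audio = np.random.randn(sample_rate)
spectra = frame_spectra(frame_signal(audio, sample_rate))
print(spectra.shape)    # (number of frames, frame_len // 2 + 1)
```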

The speech recognition task starts with a training table that has one row per spoken sentence in the training table. The columns are the windows (i.e., the audio signal in each window) and the words in the spoken sentence. The task is to learn a function that will translate the windows into words.

10.5.4.2    Feature extraction

The next step was to take the continuous frequency and amplitude information in each window and transform it into a fixed number of features, e.g. between 40 and 100 features, that can be analyzed. To the human ear, the difference between a sound with a frequency of 300 Hz and a sound with a frequency of 400 Hz is perceived about the same as the difference between a 400 Hz and a 500 Hz sound. However, above about 500 Hz, it takes a larger and larger difference in frequency to be perceived the same way a 100 Hz difference is perceived at frequencies under 500 Hz. As a result, algorithms that extract these features take this into account. These algorithms include:

  • Linear predictive analysis (LPC)
  • Linear predictive cepstral coefficients (LPCC)
  • Perceptual linear predictive coefficients (PLP)
  • Mel-frequency cepstral coefficients (MFCC)
  • Power spectral analysis (FFT)
  • Mel scale cepstral analysis (MEL)
  • Relative spectra filtering of log domain coefficients (RASTA)
  • First-order derivative (DELTA)

For the purposes of this chapter, it is not important to understand how these algorithms work or differ, just to recognize their names and acronyms as feature extraction algorithms. Speech systems often combine features using different feature extraction algorithms. These algorithms also normalize the speech signal in several ways such as compensating for the effects of different types of microphones.

For further information and the details of how these algorithms work see Shrawankar and Thakare (2013) and Potamianos et al (2009). The result of the feature extraction stage is a set of feature values (i.e. a vector of features) for each window. For example, 80 features per window is common, 81 features if an overall energy level is included as a feature.
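
As a concrete illustration, the sketch below extracts MFCC features with the librosa library; the number of coefficients, the window settings, and the addition of first-order (delta) features are illustrative assumptions, and the audio file name is hypothetical.

```python
# A minimal sketch of MFCC feature extraction: one feature vector per window.
import librosa
import numpy as np

audio, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical file
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)   # 25 ms window, 10 ms hop
delta = librosa.feature.delta(mfcc)                      # first-order derivatives
features = np.concatenate([mfcc, delta], axis=0)         # one column per window
print(features.shape)    # (26, number of windows)
```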

10.5.4.3    Decoding

Before the use of deep learning, most speech recognition systems used an acoustic model (see below) combined with a pronunciation model (see below) to calculate the most likely word strings for a set of observed features and a language model (see below) to estimate the probability of each word string in the language. The word string with the highest value of acoustic model probability multiplied by the language model probability is selected as the best answer.

Acoustic model:  The acoustic model is essentially an algorithm for identifying the phoneme in each window. During the training phase, data is acquired to drive the acoustic model. In particular, the system learns typical values of each feature for each phoneme. It would be nice if the feature values were identical for every instance of a given phoneme in the training data, but that is far from the case. In fact, if there is enough training data, the system will observe a distribution of values for each feature for a given phoneme.

A common assumption is that these values take on a normal (or Gaussian) distribution and the training is used to estimate the parameters of this Gaussian distribution. The combination of estimated Gaussian distributions is termed a Gaussian mixture model (GMM). One can mathematically compute a distance from a new feature vector to each of the learned feature vectors and thereby determine the relative likelihood that each phoneme is a match for the window.
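
A minimal sketch of this idea using scikit-learn is shown below: one Gaussian mixture is fit per phoneme over that phoneme's training feature vectors, and a new frame is assigned to the phoneme whose mixture gives it the highest log-likelihood. The random training data and the number of mixture components are illustrative assumptions.

```python
# A minimal sketch of a GMM-based acoustic model: one mixture per phoneme.
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical training data: feature vectors (e.g. MFCCs) grouped by phoneme.
training_frames = {
    "aa": np.random.randn(500, 13),
    "s":  np.random.randn(500, 13),
    "t":  np.random.randn(500, 13),
}

models = {phoneme: GaussianMixture(n_components=4).fit(frames)
          for phoneme, frames in training_frames.items()}

def most_likely_phoneme(frame):
    # score_samples returns the log-likelihood of the frame under each GMM.
    scores = {p: m.score_samples(frame.reshape(1, -1))[0] for p, m in models.items()}
    return max(scores, key=scores.get)

print(most_likely_phoneme(np.random.randn(13)))
```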

If one used only the GMMs to determine the correct phoneme for each window, recognition accuracy wouldn’t be very good for larger vocabulary tasks. To augment the GMMs, most acoustic models incorporate a model of the likelihood of the transition from one phoneme to another.

The most common type of likelihood models are hidden Markov models (HMMs). HMMs are not specific to acoustic models and are used in many different applications. Deng et al (1990) showed that HMMs reduce recognition errors by 40% for a 2500-word vocabulary speech recognition task. For a detailed technical exposition of the use of HMMs in speech recognition see Gales and Young (2007).
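
The sketch below shows the core of HMM decoding, a Viterbi search that combines per-frame emission log-likelihoods (which a GMM acoustic model would supply) with phoneme transition log-probabilities. The toy transition matrix and emission scores are illustrative assumptions.

```python
# A minimal sketch of Viterbi decoding over an HMM whose states are phonemes.
import numpy as np

def viterbi(log_emissions, log_transitions, log_start):
    """log_emissions: (frames, states); log_transitions, log_start: log-probabilities."""
    n_frames, n_states = log_emissions.shape
    score = log_start + log_emissions[0]
    backpointers = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        # For each current state, pick the best predecessor state.
        candidates = score[:, None] + log_transitions      # (previous, current)
        backpointers[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0) + log_emissions[t]
    # Trace back the best state sequence.
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backpointers[t][path[-1]]))
    return path[::-1]

# Toy example with 3 phoneme states and 5 frames.
rng = np.random.default_rng(0)
log_em = np.log(rng.dirichlet(np.ones(3), size=5))
log_tr = np.log(np.full((3, 3), 1 / 3))
print(viterbi(log_em, log_tr, np.log(np.full(3, 1 / 3))))
```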

Pronunciation model:  The pronunciation model contains the phoneme sequence of each word. For example, a pronunciation model for English would have an entry for the word “house” composed of three phonemes: hh, aw, and s. If the speech recognition task involves a limited vocabulary (e.g. the 10 digits or 200 words), there will be relatively little confusability in the dictionary and converting a sequence of phonemes (or even a sequence of phoneme possibilities) into a word can be easily done with high accuracy. The larger the number of words, the more difficult it is to achieve high accuracy. Unlimited vocabularies are the most difficult.

Language model:  Using an acoustic model without a language model can lead to significant errors. The acoustic model makes hypotheses about the phonemes in each window and then about the word strings for the sequences of phonemes, and there will typically be several hypotheses for some or all of the words. Some hypotheses can be eliminated by the use of a language model. For example, the acoustic model might come up with two possible word sequences:

(1) John made the right guess

(2) John made the write guess

Both sequences sound the same to the speech system (and to people). The language model contains the likelihood of different sequences of words. In the example above, the language model will probably include a reasonably high count for the bigram “right guess” but will almost certainly have a zero count for “write guess”.

The language model can, therefore, be used to determine that (1) is correct and (2) is incorrect. A language model is not required for limited vocabularies (e.g. digit recognition) but is essential for large and unlimited vocabulary speech tasks. For tasks in the middle such as IVRs and command and control systems, the language model can be a very small rule-based grammar.
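
The toy sketch below shows how a pronunciation model and a bigram language model resolve the example above: “right” and “write” map to the same phoneme sequence, so the bigram counts decide. The tiny dictionary and the counts are hypothetical.

```python
# A toy sketch of pronunciation lookup plus bigram language model rescoring.
import math

pronunciations = {
    "right": ["r", "ay", "t"],
    "write": ["r", "ay", "t"],     # same phonemes: the acoustic model cannot decide
    "guess": ["g", "eh", "s"],
}

bigram_counts = {("right", "guess"): 120, ("write", "guess"): 0}   # hypothetical counts
total_bigrams = 1_000_000

def bigram_logprob(w1, w2, vocab_size=170_000):
    # Add-one smoothing so unseen bigrams get a small, non-zero probability.
    count = bigram_counts.get((w1, w2), 0)
    return math.log((count + 1) / (total_bigrams + vocab_size))

hypotheses = [["john", "made", "the", "right", "guess"],
              ["john", "made", "the", "write", "guess"]]
scores = [sum(bigram_logprob(a, b) for a, b in zip(h, h[1:])) for h in hypotheses]
print(hypotheses[scores.index(max(scores))])   # prefers "right guess"
```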

10.6    Speech recognition using deep learning

Commercial companies such as Google, Microsoft, Apple, and Baidu have invested heavily in speech recognition based on deep learning because speech recognition is a critical capability for web searches, intelligent assistants, translation, dictation, and numerous other functions that take place on smartphones and PC’s.

These heavyweights also brought huge financial resources to the table. For example, when Google wanted to study speech recognition for Arabic dialect differences, there were no adequate corpora of conversational voice data for the different dialects. Google went out and hired speakers of the different dialects to create 15,000 utterances for each dialect in both quiet and noisy environments. For the most part, these companies focused on solutions based on feedforward networks (DNNs), convolutional networks (CNNs), and RNNs.

Over the last decade, they have driven speech recognition performance almost to the level of human transcriptionists as discussed below. The GMM-HMM models discussed above dominated research efforts for almost 30 years even though they had well-known limitations (Mohamed et al, 2009). Many researchers had proposed neural network approaches as alternatives but, given the focus on demonstrable results, neural networks always seemed incapable of producing better results – at least partly because the computational demands were just too high for computers of that era.

10.6.1  Hybrid systems

That all started to change when researchers at the University of Toronto (Mohamed et al, 2009) in the lab of Geoffrey Hinton used a DNN to create an acoustic model for phone recognition that outperformed the previous state of the art by 1.3%. The network output the probabilities of each phone, and a bigram language model was used to choose the final output phones. They then (Mohamed et al, 2010) paired the DNN (pre-trained as a deep belief network, or DBN) with an HMM to further improve performance.

In 2012, a joint paper by a group of well-known researchers at Google, Microsoft, IBM, and the University of Toronto (Hinton et al, 2012) showed a 1.4% improvement over the state of the art GMM-HMM acoustic model using a DNN-HMM acoustic model. IBM researchers (Sainath et al, 2013) further improved on this idea by using a convolutional neural network instead of a DNN.  NTT researchers (Asami et al, 2017) found that using knowledge distillation techniques in a teacher-student architecture improved performance over using a CNN. 

Other deep learning architectures have also been pursued (e.g. Chorowski et al, 2014; Chorowski et al, 2015). Hybrid models are still in wide use in commercial speech systems (Li, 2022) primarily because they have undergone decades of production optimization.  In contrast, end-to-end systems (see below), which perform better, are relatively new. Researchers from Google (Rao et al, 2015) and Microsoft (Yao et al, 2015) have also used bi-directional LSTMs to replace pronunciation models in hybrid systems.

10.6.2   End-to-end systems

Historically, speech recognition systems were created by combining acoustic, pronunciation, and language models that were trained independently with different training objectives.  End-to-end systems use a single deep neural network.  They start with an acoustic signal as input and produce words as output.  By using a single objective function, they are more likely to find a global minimum than a conglomeration of modules that are trained with different objectives (Li, 2022).

In 2015, a group of CMU researchers developed an end-to-end speech recognition model that didn’t require an HMM, a pronunciation dictionary, or a language model (Chan et al, 2015) named listen, attend, and spell (LAS).  The LAS architecture is illustrated below:

Listen, Attend, Spell network for speech recognition

It has two RNNs and an attention mechanism. The first RNN is named the listener and converts speech signals into high-level features. The second is named the speller. The speller converts the high-level features into an output string. The speller RNN uses an attention mechanism to determine which high-level features to focus on.  
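
A heavily simplified PyTorch sketch of this listener/speller structure is shown below. The layer sizes, the plain dot-product attention, and the omission of the pyramidal downsampling used in the published model are all illustrative simplifications, not the configuration from Chan et al (2015).

```python
# A heavily simplified sketch of a Listen-Attend-Spell style model.
import torch
import torch.nn as nn


class Listener(nn.Module):
    """Encodes a sequence of acoustic feature vectors into high-level features."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        outputs, _ = self.rnn(feats)           # (batch, time, 2 * hidden)
        return outputs


class Speller(nn.Module):
    """Decodes characters one at a time, attending over the listener outputs."""
    def __init__(self, vocab_size=30, hidden=512, enc_dim=512):
        super().__init__()
        self.hidden = hidden                   # must equal enc_dim for dot-product attention
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(hidden + enc_dim, hidden)
        self.out = nn.Linear(hidden + enc_dim, vocab_size)

    def forward(self, enc, prev_chars):        # enc: (batch, time, enc_dim)
        batch, enc_dim = enc.size(0), enc.size(2)
        h = enc.new_zeros(batch, self.hidden)
        c = enc.new_zeros(batch, self.hidden)
        context = enc.new_zeros(batch, enc_dim)
        logits = []
        for t in range(prev_chars.size(1)):
            emb = self.embed(prev_chars[:, t])                    # (batch, hidden)
            h, c = self.cell(torch.cat([emb, context], dim=1), (h, c))
            # Simple dot-product attention over the listener outputs.
            scores = torch.bmm(enc, h.unsqueeze(2)).squeeze(2)    # (batch, time)
            weights = torch.softmax(scores, dim=1)
            context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)
            logits.append(self.out(torch.cat([h, context], dim=1)))
        return torch.stack(logits, dim=1)      # (batch, output_len, vocab_size)


listener, speller = Listener(), Speller()
feats = torch.randn(2, 300, 80)                 # (batch, frames, features)
prev_chars = torch.randint(0, 30, (2, 12))      # teacher-forced previous characters
logits = speller(listener(feats), prev_chars)   # (2, 12, 30)
```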

Google Brain researchers (Park et al, 2019) improved on the LAS architecture by using a data augmentation technique that they named SpecAugment.  This technique adds observations that are created by warping input features, masking blocks of frequency channels, and masking blocks of time steps.  By augmenting the data in this fashion, these researchers were able to achieve state of the art results on the LibriSpeech 960h and Switchboard 300h datasets.
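
A minimal sketch of the two masking operations is shown below (the time-warping component of SpecAugment is omitted, and the mask sizes are illustrative assumptions).

```python
# A minimal sketch of SpecAugment-style frequency and time masking.
import numpy as np

def spec_augment(spectrogram, max_freq_mask=8, max_time_mask=20):
    augmented = spectrogram.copy()            # (mel_channels, time_steps)
    n_mels, n_steps = augmented.shape

    f = np.random.randint(0, max_freq_mask + 1)      # frequency mask width
    f0 = np.random.randint(0, n_mels - f + 1)
    augmented[f0:f0 + f, :] = 0.0                    # mask a block of frequency channels

    t = np.random.randint(0, max_time_mask + 1)      # time mask width
    t0 = np.random.randint(0, n_steps - t + 1)
    augmented[:, t0:t0 + t] = 0.0                    # mask a block of time steps
    return augmented

augmented = spec_augment(np.random.randn(80, 300))
```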

Another successful end-to-end architecture is the Connectionist Temporal Classification (CTC) architecture.  CTC was used as the basis for several systems including the Deep Speech 2 system created by Baidu researchers (Amodei et al, 2015). ESPnet (Watanabe, 2017) is an open source end-to-end speech processing toolkit.
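
The sketch below shows the shape of CTC training in PyTorch: a recurrent network emits per-frame log-probabilities over characters plus a blank symbol, and the CTC loss sums over all alignments of the target character sequence. The tiny model and the random tensors are illustrative placeholders.

```python
# A minimal sketch of training a network with the CTC objective.
import torch
import torch.nn as nn

vocab_size = 29                    # e.g. 26 letters + space + apostrophe + blank (index 0)
model = nn.LSTM(input_size=80, hidden_size=128, batch_first=True)
proj = nn.Linear(128, vocab_size)
ctc = nn.CTCLoss(blank=0)

features = torch.randn(4, 200, 80)                 # (batch, frames, features)
hidden, _ = model(features)
log_probs = proj(hidden).log_softmax(dim=-1)       # (batch, frames, vocab)

targets = torch.randint(1, vocab_size, (4, 30))    # character indices (no blanks)
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

# CTCLoss expects (frames, batch, vocab) ordering for the log-probabilities.
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```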

10.6.3   Self-supervised learning

As discussed in Chapter 4, self-supervised learning is used to learn representations that enhance performance of downstream natural language tasks and reduce the number of labeled observations required.

10.6.3.1  Contrastive predictive coding

The same is now happening for speech recognition.  Meta AI researchers created wav2vec (Schneider et al, 2019), which improved performance while reducing the amount of labeled data required.

Wav2vec uses a convolutional neural network architecture in which each 30 msec audio segment is either left unmodified or corrupted with a 10 msec segment from another audio stream.  The self-supervised objective is a contrastive loss that requires determining whether the 30 msec segment is the original or the corrupted version.  The model was trained with 1000 hours of unlabeled speech.
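
A minimal sketch of this style of contrastive (InfoNCE-like) objective is shown below: the model must score the true sample higher than a set of distractor samples. The cosine similarity, temperature, and tensor shapes are illustrative assumptions rather than the published wav2vec formulation.

```python
# A minimal sketch of a contrastive loss over positive and distractor samples.
import torch
import torch.nn.functional as F

def contrastive_loss(context, positives, negatives, temperature=0.1):
    """context, positives: (batch, dim); negatives: (batch, n_neg, dim)."""
    pos_score = F.cosine_similarity(context, positives, dim=-1)               # (batch,)
    neg_score = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1)  # (batch, n_neg)
    logits = torch.cat([pos_score.unsqueeze(1), neg_score], dim=1) / temperature
    # The "correct" class is always index 0: the true (uncorrupted) sample.
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

ctx = torch.randn(8, 256)          # context representations
pos = torch.randn(8, 256)          # true samples
neg = torch.randn(8, 10, 256)      # distractor samples drawn from other audio
print(contrastive_loss(ctx, pos, neg))
```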

These representations were then used as a starting point for supervised learning of speech recognition.  The resulting system produced comparable performance to Deep Speech 2 with two orders of magnitude less labeled supervised training data.

In vq-wav2vec (Baevski et al, 2020a) and wav2vec 2.0 (Baevski et al, 2020b), the contrastive loss approach from wav2vec is used to identify a set of discretized speech units.  These speech units are used as targets in a masked prediction task that is similar to the masked language modeling task in BERT for learning contextualized speech representations.  The result was state of the art performance on TIMIT phoneme recognition and on WSJ speech recognition.

w2v-BERT (Chung et al, 2021) builds on vq-wav2vec but rather than performing the contrastive loss and mask prediction tasks sequentially, both are done simultaneously in an end-to-end fashion.  The result was state of the art performance on the LibriSpeech dataset.

wav2vec-Switch (Wang et al, 2022) was developed by starting with wav2vec 2.0 and training it to be more robust in a noisy background.  In addition to the contrastive loss used to develop wav2vec 2.0, the quantized representations of the original and noisy speech were added as additional prediction targets of each other. 

Quantization (van den Oord et al, 2018) creates a compact representation that allows for high fidelity reconstruction. Meta researchers (Défossez et al, 2022) have even used wav2vec 2.0 to turn brain signals into text.  The signals were obtained non-invasively from magneto-encephalography and electro-encephalography.

10.6.3.2  Autoregressive predictive coding

Another self-supervised learning approach is to use autoregressive predictive coding. Chung et al (2019) transformed each frame of a spectrogram into a set of feature vectors and trained a model to learn to predict the next frame.  The goal was to preserve information about the original signals to enable downstream tasks to select which knowledge to use.  The result was both a reduction in the required downstream model size and a reduction in the amount of labeled data required for speech recognition, speech translation, and speaker identification.

They also showed that different levels of speech information are captured at different layers.  Lower layers are more discriminative for speaker identification.  Upper layers provide more phonetic content.

Amazon researchers (Ling et al, 2020) created DeCoAR (deep contextualized acoustic representations) using a strategy inspired by ELMo.  They pre-trained the system to reconstruct a frame by using both a forward language model and a backward language model and using a training objective that minimized the loss on both models.  Then they used the DeCoAR features to train the system for speech recognition using the LibriSpeech benchmark (Panayotov et al, 2015).  With the DeCoAR features, they were able to train a system with 100 hours of labeled data that was comparable to supervised training with 960 hours of labeled data using the original feature set.

They (Ling and Liu, 2020) then built on this framework by putting a vector quantization layer (for data compression) between the encoder and decoder.  They used two objective functions: a reconstruction loss and a diversity loss in the quantization layer (i.e. the ability to stochastically generate different reconstructions).  Using this architecture, they needed only 10 hours of labeled data for comparable performance.

10.6.3.3   Masked predictive encoding

HuBERT (Hsu et al, 2021) improved on the wav2vec 2.0 performance by adding an unsupervised clustering algorithm that identifies the acoustic units. Mockingjay (Liu et al, 2020) was trained to predict the current frame based on both past and future context frames.  Fine-tuning the pre-trained model improved the state of the art on phoneme classification, speaker recognition, and sentiment classification.

Taiwanese researchers (Liu et al, 2021) created TERA, which alters masked frames by stochastically varying time, frequency, and magnitude.  They found that time alteration led to more accurate phoneme prediction, keyword detection, and speech recognition, as it leads the model to learn richer phonetic content. The frequency alteration effectively improved speaker prediction accuracy, as it leads the model to encode speaker identity. The magnitude alteration improved performance for all tasks, as it potentially increases data diversity for pre-training.

Amazon researchers (Xu et al, 2022) showed that a second decoder pass using a bidirectional BERT model to rescore the candidate outputs significantly improves performance on the LibriSpeech datasets.

10.6.3.4   Self-training

Noisy student training (aka self-training) has been used as far back as 1998 for speech recognition (Zavaliagkos and Colthurst, 1998). Meta researchers (Kahn et al, 2020) used this approach to achieve 33.9% better performance on the noisy LibriSpeech dataset than a baseline model trained solely with supervised data.  The model achieved 59.3% better performance on the clean LibriSpeech dataset.

Self-training has also been shown to be complementary to the use of pre-trained models such as wav2vec 2.0.  The combination of self-training and pre-training leads to higher performance than pre-training alone (Xu et al, 2020).

10.6.3.5   Conformer models

Transformer-based models excel at modeling long-range global context such as lengthy natural language and/or speech inputs.  Convolutional neural networks excel at extracting local feature patterns, especially in image processing and more recently in speech processing. Conformer models developed by Google (Gulati et al, 2020) combine convolutional and transformer networks to get the best of both worlds.  The Google researchers showed that conformer models outperform both stand-alone transformer and CNN models.

Google Brain researchers (Zhang et al, 2020) achieved state of the art performance on the LibriSpeech dataset using conformer models pre-trained with wav2vec 2.0.  They used SpecAugment to add observations to the dataset and trained the model using noisy student training.  In a subsequent paper (Zhang et al, 2021), they showed that, by pre-training an 8 billion parameter conformer model on a million hours of unlabeled audio, they were able to match the state of the art by fine-tuning with only 3% (34K hours) of the labeled training data.

10.7  Multilingual models

Multilingual models are trained on multiple languages simultaneously.  Google researchers (Li et al, 2021) showed that multilingual models achieve parity with monolingual models without degrading performance on high-resource languages as long as there is sufficient scale. They also showed that size matters for speech just like it does for text-based NLP. 

They found that increasing model size reduces training time.  Their one billion parameter model reached the accuracy of their 500 million parameter model with 34% less training time.  Additionally, they found that adding depth to a network works better than adding width and that large encoders do better than large decoders.

Another group of researchers (Babu et al, 2021) found a similar result.  They developed XLS-R, which is a large-scale version of wav2vec 2.0.  It contains two billion parameters and was trained on almost 500K hours of speech in 128 languages, which is an order of magnitude more data than any previous model. XLS-R improved on the state of the art performance on the BABEL, Multilingual LibriSpeech, CommonVoice, and VoxPopuli speech recognition benchmarks by 14-34%.  It also improved the state of the art for speech translation on the CoVoST-2 benchmark.  It can translate between English speech and 21 other languages, including some low-resource languages.  It also showed improved performance on translation of speech in low-resource languages to English.  Finally, it improved the state of the art on the VoxLingua107 language identification benchmark.

This research also showed that a large enough model can perform as well with cross-lingual pre-training as with English-only pre-training when translating English speech into other languages. Rather than doing pre-training followed by supervised learning, the JUST system (Bai et al, 2021) trains both at once end-to-end.  The result is state of the art performance on the Multilingual LibriSpeech dataset, showing that this technique is a useful approach to speech recognition for low-resource languages.

Instead of pre-training on unlabeled data, OpenAI researchers (Radford et al, 2022) scraped 680K hours of multilingual audio paired with transcripts from the web.  Using a plain vanilla transformer model, they applied supervised learning to this large dataset to create the Whisper model.  Whisper performs speech recognition in multiple languages as well as translation from many languages into English.  This research also showed that the use of such a large and diverse dataset produced robustness to accents, background noise, and technical terminology.  Whisper is available as open source.
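
Because Whisper is open source, it can be tried in a few lines of Python (pip install openai-whisper). The file names below are hypothetical, and the exact API may change between releases.

```python
# A minimal sketch of transcription and translation with the open source Whisper package.
import whisper

model = whisper.load_model("base")    # model sizes range from "tiny" to "large"

# Transcribe in the spoken language.
result = model.transcribe("meeting_recording.mp3")
print(result["text"])

# Translate speech in another language into English text.
translated = model.transcribe("french_interview.mp3", task="translate")
print(translated["text"])
```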

Meta researchers are exploring text-to-speech and speech-to-text in the lab for 1,100 languages.  To facilitate research on these systems, they created a parallel dataset of readings from the New Testament in all 1,100 languages, which provides 32 hours of data per language, and added unlabeled recordings of over 2,900 other languages.  One of these systems, SeamlessM4T, performs “speech-to-text, speech-to-speech, text-to-speech, and text-to-text translations for up to 100 languages depending on the task”.

In 2022, Google announced an initiative to build language technology for 1,000 languages.  In 2023, Google introduced its Universal Speech Model (Zhang et al, 2023), which supports automatic speech recognition in over 100 languages.

See also Yadav and Sitaram (2022) for a more comprehensive review of multilingual speech models circa 2022. For a more in-depth review of research into self-supervised speech representation learning, see Mohammed et al (2022), also circa 2022.  For an open source self-supervised learning toolkit for speech see S3PRL.

10.8  Other speech tasks

Besides speech recognition, considerable research has been done on other speech tasks.

10.8.1  Speech-to-speech translation

Historically, speech-to-speech translation was done by a sequence of models.  First, the speech in one language was converted to text, then the text was translated to the target language, and then speech in the target language was generated from the text.

More recently, research has focused on direct translation without an intermediate text representation.  Direct translation has four main benefits:

(1) It avoids the cascading of errors that can occur by feeding the results of the speech-to-text system to the text-to-text translation system and then to the text-to-speech system.

(2) It preserves the non-textual information from the speech input such as voice, emotion, and tone.

(3) It has the potential to reduce translation latency.

(4) It can be used for the thousands of spoken languages that do not have a writing system and/or are in danger of extinction.

Google researchers (Jia et al, 2019) developed Translatotron, which was trained end-to-end on Spanish-to-English speech translation without an intermediate text representation.  The follow-on Translatotron 2 system (Jia et al, 2021) performed only slightly worse than a benchmark system that used an intermediate text representation. Meta has two ongoing language translation initiatives.

The No Language Left Behind project (Costa-jussà et al, 2022), discussed in the chapter on machine translation, provides text-based machine translation for 200 languages.  A companion speech project, named the Universal Speech Translator, translates speech-to-speech.  Together, the two projects have the long-term goal of addressing all the world’s languages, even those languages that are only spoken and have no written form.

Meta researchers (Lee et al, 2021) showed that it is possible to do direct speech translation without the benefit of any text training materials and followed that up with impressive results using their S2ST system (Lee et al, 2022).  S2ST applies a self-supervised discrete speech encoder to the target speech to identify discrete representations of it.  It then trains a sequence-to-sequence speech-to-unit translation (S2UT) model to predict those discrete representations, and the resulting system performs at a level comparable to Translatotron.  They have demonstrated this technology by producing a system that translates between Hokkien (a primarily spoken Chinese language) and English.

Another group of Meta researchers (Tang et al, 2022) combined information from a self-supervised learning speech subtask and a text-to-text task to improve speech translation (and speech recognition).

10.8.2  Language identification

Language identification is an important issue for companies like Google, which supports speech recognition in 72 languages and hundreds of dialects as of September 2022.  Each language has a different recognition system, and the language of each utterance must be identified in order to pass the utterance to the correct system.

Dialect identification is an especially tricky problem. For some languages like English and Spanish, speakers of different dialects can generally understand one another. For others, this is not the case.

There have been many classical approaches to language identification (Zissman and Berkling, 2001; Brümmer et al, 2009; Martinez et al, 2011). However, they have been superseded by neural network approaches.

In 2014, Google researchers (Lopez-Moreno et al, 2014) showed that use of a DNN for language identification improved performance over earlier methods by up to 70% provided there was enough training data and then showed another 28% increase by using an LSTM-based DNN (Gonzalez-Dominguez et al, 2014).

One approach to the problem is to develop a technique for dialect recognition based on features such as intonational cues, rhythmic differences, and phonotactic constraints (i.e. syllable structure, consonant clusters, and vowel sequences) (Elfeky et al 2015). Another approach is to train a single recognition system on multiple dialects.

However, historically, this tends to reduce performance. For example, in 2012, Google researchers (Biadsy et al, 2012) studied speech recognition for five different Arabic dialects. Not surprisingly, they found that a system trained on one dialect performed far better on that dialect than it did on any of the other Arabic dialects.

Google researchers (B. Li et al, 2017) also found a way around this issue. As part of each training sample, they included a token identifying the dialect for the utterance. The somewhat surprising result is that a single system trained on seven English dialects slightly outperformed the individual dialect systems.

More recently, state-of-the-art performance on language identification was obtained by a group of Chinese researchers (Fan et al, 2021) using self-supervised learning. State of the art results have also been achieved using self-supervised learning for speaker verification (Fan et al, 2021), emotion recognition (Pascual et al, 2019), voice conversion (Lin et al, 2021), text-to-speech (Alvarez et al, 2019), and phone classification (Chung et al, 2019). Meta’s Massively Multilingual Speech models can now identify 4,000 spoken languages using multilingual training.

10.9  Performance of machines compared to humans

The primary measure of accuracy on speech recognition tasks is the word error rate (WER): the number of substituted words plus the number of deleted words plus the number of inserted words, divided by the number of words in the reference transcript.  Researchers compare human and speech system WERs to gauge progress, though there has been some debate on how human WERs should be measured.
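
A minimal Python implementation of WER is shown below; it computes the word-level edit distance between a reference transcript and a hypothesis and divides by the length of the reference.

```python
# A minimal sketch of computing word error rate (WER) via edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("john made the right guess", "john made the write guess"))  # 0.2
```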

In 2017, a group of Microsoft researchers (Xiong et al, 2017) were able to surpass human WER performance on the Switchboard dataset (5.8% for the Microsoft system vs 5.9% for humans) and on the CallHome dataset (11.0% for Microsoft vs 11.3% for humans). The Switchboard dataset has extremely clean audio quality. The Microsoft system used both LSTM-based acoustic and language models.

However, a group of IBM researchers (Saon et al, 2017) disputed Microsoft’s claim of better-than-human performance arguing that the Switchboard human benchmark should be 5.1% and the CallHome benchmark should be 6.8%. IBM’s own system which was a combination of LSTM and CNNs performed at 5.5% on the Switchboard dataset and 10.3% on the CallHome dataset.

These neural networks for speech recognition have been shown to be vulnerable to adversarial attacks. Researchers created mathematical perturbations of audio streams that have no impact on the ability of people to identify the words in the audio stream. Yet, these perturbations cause deep learning systems that previously identified the words with high accuracy to fail to recognize any words (Carlini and Wagner, 2018).  This is an indication that these deep learning networks function very differently than humans.  
