Automatically determining the words present in a piece of audio - not deriving meaning from the words, converting text to speech, or recognizing who is speaking. Speech is the fastest way of communicating - 150-200 words per minute as opposed to around 60 for typing. It also requires no specialized training and can be done hands-free. Text is also easier to process and takes up much less data than raw audio.
Transcribing audio rapidly - making records quickly by speaking into a computer. If speech recognition really worked well, in many cases it would be a better way to communicate with a computer, especially if you're in a car or in front of your TV. It would also be helpful for people who can't type or type slowly. A real-world problem, with a huge market, not easy but not impossible, lots of available data, and techniques which are applicable to a lot of other domains - signal processing, probability and statistics, phonetics/linguistics, NLP, machine learning…
Early methods were developed largely independently of one another - spectral analysis with statistical training, or language modeling with a small vocabulary and a small number of users. DARPA funded a program to integrate the individual ideas from linguistics and artificial intelligence into a system with a larger vocabulary (1000 words) and reasonable speed (a few times real time). Systems can be knowledge driven, where we try to encode our knowledge about speech and linguistics in a computer algorithm, or data driven, where we build a big dumb system which is fed lots of data. Even early on, the data-driven approaches proved superior. The data-driven approach ignores most of what we know about phonetics and linguistics and instead finds the most probable word sequence given the audio, training models on data. Later on, the E-M algorithm, n-gram models, Gaussian mixtures, HMMs, and Viterbi decoding were developed to find that most probable word sequence, all the while computing power was catching up to the algorithms. Since the 1990s, the basic algorithms have stayed the same, but with much, much more training data (1000x more), which is the main reason for performance gains.
The typical system models feature vectors with GMMs, which are strung together over time using an HMM, and the result is converted to word sequences using an n-gram language model. Advances have mostly been in adaptation (dealing with anything other than normal speech) and discriminative training. New technologies like Deep Belief Networks are on the cusp of adoption. Systems may be speaker independent or dependent - they may need “enrollment” to learn each person's speech; be small or large vocabulary; transcribe isolated words (speaking slowly) or continuous speech; differ in their domain - air travel vs. e-mail dictation, read vs. spontaneous speech; differ in their language; differ in how noisy the speech can be; differ in bandwidth. Current systems are still not very close to human recognition. Some systems do the processing locally, others upload the audio to a server. There are still many distortions that may trip up a system - especially when the training data does not match the testing data.
Create many examples of words and train a system on those examples. Audio is converted into quantized samples at some rate - typically chosen according to what information we expect to be available. Try to come up with a metric that measures the distance between the query audio and the learned audio, and pick the closest match. A basic system could use Euclidean distance on the samples, which would be very bad. Instead, we should find a representation of the audio (features) and a good distance metric - which is a very difficult problem. Very simple changes in the audio can trip up the representation - volume, rate, pitch, accent, dialect, voice quality, coarticulation, microphone position, background noise, reverberation. The feature extraction and distance metric should be independent of these distortions.
Many systems do not ignore everything we know about speech - words and phonemes, a bit of human perception, but not nouns, vowels, etc. We need to focus first on what features we should use - air pressure at time t, loudness, energy/phase of a frequency, estimated position of a speaker's lips… Look at human speech production for ideas. To produce speech, air comes out of the speaker's lungs and the speaker's vocal cords vibrate to produce sound, which is shaped by the vocal tract - the jaw, tongue, lips, etc. The actual information about how the human articulates is not useful; instead we use derived features, modeling speech with a source-filter model - glottal excitation, vocal tract, and lip radiation filters. It's generally better to model speech with perception-based features (MFCC, PLP) than production-based ones.
We can characterize the sounds generated as phonemes, of which there are about 40 or 50 for English, with differences based on the speaker and their dialect. A phoneme's realization doesn't depend solely on the phoneme itself but also on its context - the different allophones of the phoneme. Sounds may be voiced or unvoiced (F unvoiced, V voiced) - all vowels are voiced - and there are stops/plosives, where the oral cavity is blocked then opened. Each phoneme has a different characteristic frequency shape, with a specific pitch and formants (resonances of the vocal tract). Phonemes can be classified into different classes - vowels, diphthongs, consonants…
Sound waves travel into the ear and are bandpass filtered into different overlapping bands. Human physiology (based on experiments) is used as the justification for the features used. Humans do not hear frequencies linearly - our sensitivity varies with frequency, which may be characterized using equal loudness contours. Humans' perception of the distance between frequencies can be modeled by a warping curve, which models how well humans differentiate between frequencies, such as the Mel scale.
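As a concrete reference, one common formula for the mel warping curve can be sketched in Python (an assumption - several variants of the formula exist, and the function names here are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    """Warp frequency (Hz) onto the mel scale: roughly linear below
    1 kHz and logarithmic above it."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

With this variant, 1000 Hz maps to roughly 1000 mels, which is how the scale is anchored.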
Have an audio sample for each word in your vocabulary, and for each test sample, find the entry in the training data which is the closest to the signal and retrieve its label. To use a simple distance measure, we should extract features from the audio. Features should capture the essential information of a signal in a compact way and ignore everything irrelevant (noise, etc). Early on, analog filter banks were used, then linear predictive coding and its spectra. Most recently, many systems use mel frequency cepstral coefficients and perceptual linear prediction which may be augmented to perform better.
Because sample rates are high and a single sample has very little information, audio is grouped into frames which extend over a few tens of milliseconds, over which features are extracted. So, we go from a representation where we have a single sample every <math>1/f_s</math> seconds to a feature vector of about forty or so values every few milliseconds.
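The framing step above can be sketched as follows (a minimal sketch; the function name and the choice to drop the trailing partial frame are assumptions):

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Slice a 1-D signal into overlapping frames.  frame_len and
    hop_len are in samples; any trailing partial frame is dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    # index matrix: row t holds sample indices for frame t
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

# e.g. 1 s of 16 kHz audio, 23 ms frames (368 samples) every 10 ms (160 samples)
x = np.arange(16000, dtype=float)
frames = frame_signal(x, frame_len=368, hop_len=160)
```

Each row of `frames` is then windowed and converted to a feature vector.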
To calculate any cepstral domain feature, we compute the short term spectrum (over each frame, to look at the frequency distribution) and convert it to the cepstrum. Each of LPC, MFCC, and PLP does these steps differently - LPC is inspired by human production, MFCC and PLP by human perception. The short-term spectrum is used because it models what humans perceive - we want a small enough region so we don't miss out on anything, but we still capture all the information we need. The cepstrum is the inverse Fourier transform of the log magnitude spectrum. The cepstral transform is homomorphic - it obeys superposition, turning the convolution of source and filter into a sum. The cepstrum is intended to separate the source of the excitation from the filter (the shape of the vocal tract). The cepstrum captures the spectral envelope and pitch; the low coefficients correspond to the envelope.
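The real cepstrum of a single frame can be sketched directly from that definition (a minimal sketch; the log floor constant is an assumption to avoid log of zero):

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one frame: inverse DFT of the log magnitude
    spectrum.  Low quefrencies capture the spectral envelope; a peak
    at a higher quefrency reveals the pitch period."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # floor avoids log(0)
    return np.real(np.fft.ifft(log_mag))

c = real_cepstrum(np.cos(2 * np.pi * np.arange(512) * 10 / 512))
```

Keeping only the first dozen or so coefficients of `c` retains the envelope and discards the excitation detail.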
When taking the spectrum over frames, we must choose a window, a frame size, and a hop size, each of which is an important choice. Windowing multiplies the signal by a sequence of values. Empirically the frame spacing (hop size) is typically chosen to be ten milliseconds. The window length (frame size) is typically chosen to be short enough not to smear transients and long enough that the spectrum output does not depend heavily on what part of the signal you are windowing - a trade-off between time and frequency resolution. It's typically chosen to be about 23 milliseconds. Windowing is multiplication in the time domain, which corresponds to convolution in the frequency domain. If the signal is periodic, ideally the spectrum will be unchanged by the window function. This change can be quantified by taking the window's DFT, which reveals the main lobe width and side lobe suppression. If the main lobe is wide, frequency resolution will be lower; if the side lobe suppression is not good, lower energy frequencies will be distorted.
Want to model the vocal tract as a filter with some transfer function <math>H(z)</math>, assuming the output is a linear combination of previous samples and some excitation. The excitation is an impulse train for pitched speech, and noise for fricatives/plosives. We solve for the filter coefficients that minimize the prediction error vs. the observed signal. To minimize the error, the derivative is taken and set to zero. We can take the Z transform of the linear prediction to get the linear prediction spectrum. The LPC spectrum follows the peaks of the spectrum - it achieves a spectral envelope. The error signal is “whitened” - it has the spectral envelope removed. Empirically you set the number of coefficients to <math>f_s/1000 + 3 \pm 1</math>. The LPC coefficients are very sensitive to input signal frequency and are highly intercorrelated in a nonlinear fashion. We can't use the LPC spectrum because it's not compact; instead we use the LPC cepstrum - the inverse DFT of the log of the LPC spectrum. Given the LPC coefficients, it's easy to calculate the LPC cepstrum. 12-20 cepstrum coefficients are typically taken.
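Solving for the filter coefficients can be sketched with the autocorrelation method (a minimal sketch: a direct linear solve of the normal equations; real systems use the Levinson-Durbin recursion, and the function name is illustrative):

```python
import numpy as np

def lpc(x, order):
    """Linear prediction coefficients by the autocorrelation method:
    minimizing the squared prediction error leads to the Toeplitz
    normal equations R a = r, solved here directly."""
    x = x - np.mean(x)
    # autocorrelation at lags 0..order
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    return a  # prediction: x[n] ~ sum_k a[k] * x[n-1-k]

# demo: a first-order autoregressive signal x[n] = 0.9 x[n-1] + noise
rng = np.random.default_rng(0)
x = np.zeros(5000)
e = rng.standard_normal(5000)
for n in range(1, 5000):
    x[n] = 0.9 * x[n - 1] + e[n]
a = lpc(x, 2)
```

For the demo signal, the first coefficient recovered in `a` should land near the true value 0.9.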
Humans don't hear all frequencies equally; the mel scale approximates the spacing in human perception. The mel binning treats the cochlea as a filterbank, where the filterbank spacing mimics the Mel scale instead of having equally spaced frequencies as in the normal spectrum. The mel scale divides frequencies into triangular filters, where under 1 kHz the spacing is linear and above it's logarithmically spaced. The triangular filters are a crude approximation to the response of the bands in our ear. Once the spectrum is mel-binned, the DCT (DFT of symmetrized signal) of the log magnitude is taken to obtain the cepstrum.
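The binning-plus-DCT pipeline above can be sketched as follows (a simplified construction with illustrative function names; toolkits differ in the exact edge frequencies, normalization, and number of filters):

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frame, fb, n_coeffs=13):
    """Mel-bin the power spectrum, take the log, then the DCT-II."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    log_mel = np.log(fb @ power + 1e-10)
    n = len(log_mel)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), np.arange(n) + 0.5) / n)
    return basis @ log_mel
```

Keeping only the first ~13 DCT coefficients is what makes the representation compact.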
PLP also calculates the mel frequency binning, but instead of taking the logarithm it takes the cube root of the power spectrum, and then takes the inverse DFT to produce “fake” autocorrelation coefficients, on which an LPC-style analysis is performed.
Typically, some additional steps are taken. Pre-emphasis applies a filter to compensate for the lowpass rolloff caused by the combined glottal and lip responses. This filtering corresponds to a multiplication in the frequency domain which boosts higher frequencies, with the amount controlled by the filter's single coefficient. “Delta” features are also computed - the difference of the features from frame to frame, modeling the trajectories of features by taking first and second derivatives over time. These values are appended to the feature vector, increasing its dimension. The difference is typically taken over one or two frames.
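Both steps can be sketched briefly (a minimal sketch; the coefficient 0.97 and the edge-padded regression-style delta formula are common conventions, not something fixed by the text):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """One-coefficient highpass y[n] = x[n] - alpha * x[n-1],
    boosting the higher frequencies."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def delta(features, width=2):
    """Delta features: slope of each coefficient over +/-width
    neighboring frames (features has shape n_frames x n_dims)."""
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    n = len(features)
    num = sum(k * (padded[width + k:n + width + k]
                   - padded[width - k:n + width - k])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))
```

Applying `delta` once gives first derivatives; applying it to those gives the second derivatives ("delta-deltas"), and both are appended to the static features.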
No one uses LPC cepstra anymore, people sometimes use PLP, sometimes MFCC. Often they will create a system using both PLP and MFCC and use whichever works better. These techniques seem arbitrary but thousands of feature extraction methods have been tried and these seem to work best or at least as well as everything else. Some people have tried to learn the transformations from data, but this hasn't helped yet. The holy grail is to match the signal processing humans do. Other features have included articulatory features (position of mouth, lips), creating transformations on the features which try to reduce redundancy, normalizing the vocal tract's effect, adapting transforms to the speaker, and features which try to maximally separate certain sounds from others.
In general, test/train data examples (sequences of feature vectors) won't be the same length or be aligned. We want to warp our test data to be the same length as our train data, and we want to find the “endpoints” of the speech so that the words are aligned. The alignment scheme needs to be nonlinear because the phoneme spacing may be different too. This time warping should be constrained - we want to start at the beginning, continue forwards without any large jumps, and end at the end. At any time, we can either align a segment to the next segment or the same segment. What we want to do is compute the distance for all possible alignments and find the smallest distance. The alignment is done with dynamic programming.
We want to solve the dynamic time warping (shortest path) problem by caching subproblem solutions instead of recomputing them. An alignment is just some path from the beginning to the end, without going backwards. Each step has some cost according to the distance between the feature vectors. To minimize the computation, we cache these costs as they are calculated to avoid redundant work. A diagonal step advances through both sequences while incurring only one frame distance, so paths are biased toward the diagonal; to compensate, the different step types are sometimes weighted - all heuristics. You can also prune paths whose accumulated cost rises above a certain threshold.
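The cached dynamic program can be sketched as follows (a minimal sketch with the classic unweighted step pattern and Euclidean frame distance; real systems add the step weights and pruning mentioned above):

```python
import numpy as np

def dtw(A, B):
    """DTW cost between feature sequences A (m x d) and B (n x d).
    D[i, j] caches the best cost of aligning the first i frames of A
    with the first j frames of B."""
    m, n = len(A), len(B)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            # three allowed predecessors: vertical, horizontal, diagonal
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```

Two identical sequences cost zero, and a sequence aligned against a stretched copy of itself also costs zero, since repeated frames can be absorbed by the horizontal/vertical steps.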
Dynamic time warping is easy to implement and can model arbitrary time warping, but the distance measures are heuristic - there's no reason we should use a Euclidean distance metric with equal vector dimension weightings. The penalties for different paths are also heuristic. There's also no guarantee of optimality or convergence. Because of the freedom of paths, you can have problems where a very small word (or a small portion of a word) is matched with a longer word because they happen to match well locally. When the size of the training set and the number of templates per word increase, the computation time increases a great deal too. Improvements may come from learning these heuristics from the data itself - the distance measure, graph weights, and graph structure - using theories and models from probability, statistics, etc.
GMMs provide a principled way to create a probabilistic model of the feature vectors associated with a speech sound and the distance between the vectors. In DTW, once the training and test feature vectors have been warped by some function, we can calculate the total distance for a word as the total distance between each feature vector for each frame. What if instead of only having a single training sample we have many training samples? You could take the mean of the samples and compute that distance, or compute the distance to the closest sample… Instead of computing a distance, we can find the word whose group of training vectors the test utterance most probably belongs to. If we can formulate the probability that a set of feature vectors is a certain word, you can bring all kinds of optimality results to bear. If we had an infinite amount of data, we could estimate the probabilities perfectly; with only a finite amount of data we can only approximate them, but these principles still guide the design.
We want to estimate the probability that a point in the training data lies at some arbitrary location, and we model this with a Gaussian distribution, which is characterized by a mean (the center of the data) and a variance (how widely the data is spread). Gaussians are “valid” in the sense that the area under them (taken over the whole real line) is one. If you generate many sums of random numbers with the same distribution, their distribution will approach a Gaussian (the central limit theorem), and any linear operation on a Gaussian random variable produces a Gaussian random variable. When you compute the negative logarithm of a Gaussian, you get something that looks like a Euclidean distance. In many dimensions, if the dimensions are uncorrelated (knowing something about one dimension of a variable doesn't tell you anything about another dimension - even if the variances differ), the Gaussian distribution is just the product of one-dimensional distributions. In higher dimensions, you can define a covariance matrix which describes the correlation between different dimensions. This matrix has on the order of the number of dimensions squared parameters to estimate. If you assume the covariance matrix is diagonal (the features are uncorrelated), then you only have to estimate a single variance per dimension. When the covariance matrix is diagonal, the log likelihood looks like a weighted Euclidean distance and is quick to compute.
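The diagonal-covariance log likelihood, and its resemblance to a Euclidean distance, can be sketched as (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian.  Dropping the
    constant terms, the negative of this is a per-dimension weighted
    squared Euclidean distance from the mean."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
```

With unit variances, the difference in log likelihood between two points is exactly half the difference of their squared Euclidean distances from the mean.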
Given a set of training data, we want to estimate the means and covariance matrix such that the resulting distribution “matches” the data as well as possible. The notion of “matching” can be defined as how high a likelihood the distribution assigns to the data. The likelihood is just computed as the product of the individual likelihoods. To achieve the maximum likelihood, you choose the means and covariances such that the log likelihood of the data itself is maximized. With enough data, the maximum likelihood estimate approaches the true parameter values. When you have diagonal covariance, you can just compute the mean and variance of each dimension directly - in speech recognition, you usually use diagonal covariance matrices for this reason.
In general, your data will not necessarily be characterized by a single distribution - more likely it will be the sum of multiple Gaussian distributions. So, we'd like to model the data as the linear combination of an arbitrary number of Gaussians, where each Gaussian has a weight and all weights add up to one. In this way, distributions that don't look Gaussian can be described by a few Gaussian distributions. With a mixture we need to estimate each distribution's mean and covariance matrix as well as its weight. There is no longer a closed-form expression, so we need to use some optimization technique. Normally the Expectation-Maximization (E-M) Algorithm is used.
The E-M algorithm finds maximum likelihood estimates for models with hidden variables using a hill-climbing method: at each iteration it adjusts the parameter estimates so that the likelihood increases (weakly), converging to a locally optimal likelihood. A hidden variable is a random variable which isn't observed - for a mixture model, which component generated a given data point. In order to compute the probability of an observed variable, you need to sum over all possible values of the hidden variables. The Gaussian Mixture Model can be viewed as a hidden-variable model where the hidden variable is which (weighted) Gaussian generated each data point. In each iteration, you can either hard-assign a single hidden value to each data point, or compute the posterior probability of each hidden value - the probability of the data point under one component divided by the sum of its probabilities under all components. At the beginning of the algorithm, you initialize the parameter values somehow. Each iteration then has two steps: compute the posterior probability of each hidden value for each data point (expectation), then update the parameters as if the data were non-hidden, with each data point fractionally assigned to the components according to those posteriors (maximization). Convergence to a local optimum is guaranteed (the likelihood never decreases), although it might not be fast, and the result depends a good deal on the seeding of the model.
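The two-step iteration can be sketched for a one-dimensional, two-component GMM (a minimal sketch under assumed simplifications: crude min/max seeding of the means, a fixed iteration count, and illustrative names; real systems use many components and dimensions):

```python
import numpy as np

def em_gmm(x, n_iter=50):
    """Fit a 1-D two-component GMM to data x by E-M."""
    mu = np.array([np.min(x), np.max(x)], dtype=float)  # crude seeding
    var = np.full(2, np.var(x))
    w = np.full(2, 0.5)
    for _ in range(n_iter):
        # E step: posterior probability of each component per point
        ll = (np.log(w) - 0.5 * np.log(2 * np.pi * var)
              - 0.5 * (x[:, None] - mu) ** 2 / var)
        post = np.exp(ll - ll.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M step: weighted "non-hidden" maximum likelihood estimates
        nk = post.sum(axis=0)
        w = nk / len(x)
        mu = (post * x[:, None]).sum(axis=0) / nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```

On well-separated data the means settle onto the two cluster centers; on harder data the local-optimum caveat above applies.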
Before using E-M, you need to choose the number of Gaussians. One method is to increase the number of Gaussians until adding more makes performance worse. You can instead penalize the likelihood according to the number of parameters (which increases with the number of Gaussians), as in the Bayesian Information Criterion. To initialize the parameters, you can set the mixture weights to the inverse of the number of Gaussians, pick data points at random to seed the initial means, and use the global variance for each model's variance. Many starting configurations can also be tried, keeping the one with the highest likelihood. You can also train a model with a single Gaussian and perturb the means slightly as you add more Gaussians. You can decide when to stop according to when the log likelihood stops improving by a certain amount. You can also hold out some data as you train and stop when performance on it no longer improves.
We want to be able to model how sounds (primitive units of sounds, phonemes or parts of phonemes, labeled using GMMs) evolve over time - a probabilistic counterpart to DTW. Early HMM systems were discrete, and quantized each feature to some symbol which was fed into the HMM. Symbols were assigned some centroid according to training, and each incoming feature vector was quantized according to the closest centroid.
Random variables are independent if each instance of the variable in a sequence does not depend on a previous value in the sequence - for example, fair coin flips. In this case, we only need to build a model over a single instance of the variable, which describes the (nonnegative) probability of each possible outcome, with the probabilities over all outcomes summing to one. The probability of a sequence is then just the product of the probabilities of each outcome in the sequence. We can take the log of the product to get the log likelihood, which is a sum of log probabilities. To estimate the model parameters for a sequence of data, we find the parameters which maximize the likelihood of the observed outcomes; maximizing the log likelihood is equivalent. If the variables are independent, the maximum likelihood estimate of each outcome's probability is just the number of times the outcome occurs divided by the total number of outcomes in the sequence.
When the value of the sequence of random variables at any point in time depends on past variables, the probability of a given value is conditional on previous outcomes. The Markov property of order <math>n</math> stipulates that the probability of a given value can depend on at most the <math>n</math> previous variables. For a model which is Markov order 1, each variable has a separate probability for each outcome conditional on the previous value - a probability for each pair of outcomes. This leaves the initial probability, which we treat as a special case. There is still the requirement that the probabilities for each conditioning value sum to one. Now, the probability of a sequence is the product of the probabilities of each outcome given the previous outcome. This reduces maximum likelihood estimation to a separate maximum likelihood estimation for each conditioning outcome.
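The order-1 maximum likelihood estimate reduces to pair counting, which can be sketched as (a minimal sketch; the function name is illustrative and the initial probability is ignored here):

```python
from collections import Counter

def markov_mle(sequence):
    """Order-1 Markov ML estimates from one sequence:
    P(b | a) = count(a followed by b) / count(a as a predecessor)."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    prev_counts = Counter(sequence[:-1])
    return {(a, b): c / prev_counts[a] for (a, b), c in pair_counts.items()}

probs = markov_mle("abaabbab")
```

For the demo sequence, "a" is followed by "b" in 3 of its 4 non-final occurrences, so P(b|a) = 3/4.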
The state may not just depend on the previous states; it may also depend on “hidden” states which alter the transition probabilities. This additional state makes it so that there are different probabilities to go from one state to another, depending on the value of the hidden state. This allows the model to capture the same amount of information as the non-hidden version with fewer states - we can model “over-arching” changes without a high-order Markov model. To use an HMM, we need to be able to find the best path through it given an observed sequence, compute the total likelihood of an observed sequence, and estimate its parameters.
Given an observed sequence, we want to find the state sequence which is most likely - which involves finding the probabilities of the hidden states. Determining the state sequence requires finding the probability of the sequence for each possible hidden state, and then finding the maximum one. The total number of possible state/outcome sequences is exponential, so finding the probability of each sequence should be done using dynamic programming. Dynamic programming takes advantage of the fact that the probability of each term in the sequence depends on the previous term in the sequence and the hidden states. (Viterbi)
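The Viterbi dynamic program can be sketched as follows (a minimal sketch working in log probabilities, with illustrative parameter names; `log_obs[t, s]` is the log likelihood of observation t under state s):

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence.  log_init: (S,), log_trans: (S, S)
    with log_trans[i, j] = log P(j | i), log_obs: (T, S).
    Dynamic programming over the trellis, O(T * S^2)."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)     # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: arrive at j from i
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_obs[t]
    # trace the best path backwards through the cached predecessors
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Caching `delta` is what collapses the exponential number of state sequences into a per-frame maximization.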
We want to find the total likelihood of observations, which is the sum over all hidden sequences corresponding to the observed sequence. There's an exponential number of hidden sequences, so we need to use dynamic programming again. (Forward)
We want to estimate the parameters of the HMM to maximize the likelihood of some given training data using the EM algorithm. (Forward-Backward)
Instead of assigning probabilities to discrete sequences, we want to assign probabilities to sequences of 40-dimensional continuous feature vectors. Instead of having a discrete output probability distribution for each state, we have a continuous probability density function - a GMM. The output probability of an observation is then described by the parameters of the output distribution and calculated using the Gaussian probability density function. Now, instead of just having an arc probability, we have an arc probability times the output probability. We also have to reestimate the GMM parameters as we estimate the HMM parameters. People often use “linear” HMMs where each state just has a probability of staying in place or going to the next state, and each state represents a third of a phoneme, with a GMM per state. The GMMs used typically have diagonal covariance. For each arc from one state to another, you have an arc probability, a vector of mixture weights, and the parameters of each mixture component.
One problem is that even for a small amount of data, you will be multiplying many thousands of likelihoods, which can underflow the smallest positive double-precision number, giving a probability of zero - a meaningless answer. Instead, log probabilities are generally used. In the log domain, the Viterbi algorithm computes the max of sums of log probabilities instead of the product of probabilities. The forward algorithm requires sums of probabilities, so exponentiation is needed, which can also underflow; a constant factor is pulled out of the exponent and reapplied outside it.
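The "constant pulled out of the exponent" is the log-sum-exp trick; with it, the whole forward algorithm can stay in the log domain. A minimal sketch (parameter names as in the Viterbi convention: `log_trans[i, j]` = log P(j | i), `log_obs[t, s]` the per-state observation log likelihood):

```python
import numpy as np

def logsumexp(a):
    """log(sum(exp(a))), with the max factored out so the
    exponentials cannot all underflow to zero."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def forward_loglik(log_init, log_trans, log_obs):
    """Total log likelihood of an observation sequence under an HMM,
    summing over all hidden paths entirely in the log domain."""
    alpha = log_init + log_obs[0]
    for t in range(1, len(log_obs)):
        alpha = np.array([logsumexp(alpha + log_trans[:, j])
                          for j in range(len(alpha))]) + log_obs[t]
    return logsumexp(alpha)
```

Note that `logsumexp([-1000, -1000])` returns about -999.3 where naive exponentiation would return log(0).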
In constructing the HMM, various decisions must be made. Empirical data suggests certain techniques, which are domain-specific.
In a word, we cannot move “backwards” across different sounds and have it be the same word - we have to continue linearly through the different sounds, potentially staying on each sound indefinitely. This suggests a model where only arcs from one state to the next or one state to itself are possible. However, in certain words, sounds can be skipped to still give the same word. This suggests a model with “skip arcs” which do not have an output and move from one state to another. In order to account for skip arcs, the Forward and Viterbi algorithms must be modified.
Empirically, phonemes are all roughly the same length, so you might want the same number of states per phoneme. Three states per phoneme are typically used. There is no guarantee, however, that each state is actually modeling a third of a phoneme - this is just a rough guess which is used to construct the HMM system, and in training the states can be used in whatever way produces a maximum likelihood.
There is some theory, e.g. the Bayesian Information Criterion, which can indicate the number of mixture components to use. However, normally the number of components is varied systematically until the accuracy appears good. Having too many Gaussians can lead to overfitting; having too few produces a model which can't really express different sounds and won't be able to differentiate words well.
A simple flat start is normally used - transition probabilities and mixture weights uniform, all mixture means 0, and all variances 1. Then, a single Gaussian is used for each mixture, until the FB algorithm reaches some accuracy, then you split the Gaussians into two mixtures, and continue splitting in this way until you get the number of Gaussians you want. As you go, you can also prune Gaussians which only describe a few points.
In the “one big HMM” paradigm, instead of using a separate HMM for each word in the dictionary, you make one big HMM where there is a start state which transitions to the HMM for one of the words, and an end state which all HMMs transition to. Making this extension allows for clever things to be done like representing multiple words with the same pre/suffixes as one HMM with branches to other HMMs which characterize how the words are different. In addition, it makes it easier to model multiple words.
To model a fixed-length sequence of words, you can chain together multiple word HMMs, where the start state of one HMM is the end state of the previous one. If you want to model word sequences of any length, you can just add an arc from the end state back to the start state. The words that are present can be retrieved if the HMMs in the big graph are labeled. The training data needs to be multiple-word utterances instead of isolated words. This makes training more complicated, because your single big HMM contains many sub-HMMs which are actually the same, just repeated. So, you use “parameter tying”, which forces the repeated HMMs to share the same parameters - and all the training algorithms still work.
The silence or quiet noise between words can trip up a speech recognizer. One way to deal with this is to add an additional HMM for silence, empirically also with three states. Silence can occur before or after any word and for any amount of time. Using an additional HMM between words typically works better than adding a recursive arc to a state between word HMMs. Adding a silence HMM before and after a word is also useful for single word identification, but this is usually built in as the word models are built - if there is silence in the training data, states will be expanded in the HMM to account for the silence.
Given a spoken sentence which is acoustically the same as another, a speech recognition system should determine which sentence was spoken based on the likelihood that the sentence would occur in the language. For example, given some audio, the probability that the utterance is “sample” may be about the same as the probability that it is “sam pull”, but the first is a much more likely word sequence. That is, rather than calculating only the likelihood of a word given some audio, we want to calculate (via Bayes' rule) both the likelihood of the word sequence in the language and the likelihood of the audio given the word sequence. The “language” is domain-specific - in some circumstances, certain words are completely impossible, while in others, words may just be somewhat more likely than others. To create a language model, we need to compute a probability distribution over word sequences. This is typically done using (hidden) Markov models trained on lots of text. For simpler domains, where there is only a small set of possible word sequences (name dialer, digit strings), a hand-tuned constrained language model is usually best. For larger domains with lots of training data, order-n Markov models are used, called n-gram models.
If we have lots of text data to construct our language model, we can construct it automatically. HMMs are not as applicable because it is hard to decide on a good topology to model the language - the model would need to be fully interconnected with tons of states. So, we use “non-hidden” Markov models with a higher order n - the output at a given point in time is only dependent on the previous n - 1 outputs. The probability of a sentence is then calculated as the product of the probabilities of each word given the previous n - 1 words. In order to cope with the beginnings and ends of sentences, special tokens are used (to ensure the probability distribution sums to one). The maximum likelihood estimate for one of these probabilities is calculated as the number of times the whole n-gram appears divided by the number of times the initial (n-1)-gram appears (e.g. #“I hate to”/#“I hate”).
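The counting estimate can be sketched for trigrams, using `<s>`/`</s>` boundary tokens (a minimal sketch; the function name is illustrative and no smoothing is applied):

```python
from collections import Counter

def trigram_mle(corpus):
    """Unsmoothed ML trigram estimates from a list of sentences:
    P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2), with <s>/</s>
    padding so sentence edges are handled consistently."""
    tri, bi = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(len(words) - 2):
            tri[tuple(words[i:i + 3])] += 1
            bi[tuple(words[i:i + 2])] += 1   # counted at trigram starts
    return {g: c / bi[g[:2]] for g, c in tri.items()}

p = trigram_mle(["i hate to code", "i hate bugs"])
```

In the demo corpus, "i hate" occurs twice and is followed by "to" once, so P(to | i hate) = 1/2.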
In the big HMM, an arc carrying the language model probability is placed before each individual word HMM, scaling the probability of that word. This probability is small in magnitude compared to the acoustic scores accumulated over many frames, so a language model weight is applied to make the two comparable.
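The scaling is usually done in the log domain. The weight and penalty values below are assumptions for illustration (they are normally tuned on held-out data); the sketch just shows why a weight is needed at all:

```python
# Sketch of combining acoustic and language model scores in log space.
# Acoustic log-likelihoods accumulate over hundreds of frames and dwarf
# a single LM log-probability, so the LM term is scaled up. A word
# insertion penalty is often added too. Both values here are made up.
LM_WEIGHT = 12.0
WORD_PENALTY = -0.5

def path_score(acoustic_logp, lm_logp, n_words):
    return acoustic_logp + LM_WEIGHT * lm_logp + WORD_PENALTY * n_words

# With the weight, the LM shifts the total by an amount comparable to
# differences between acoustic scores, instead of being negligible.
print(path_score(-3000.0, -20.0, 10))  # -3245.0
```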
A simple way to evaluate a language model is to calculate the perplexity of a test set: the probability the model assigns to a test sentence, raised to the power -1/N, where N is the number of words (equivalently, the inverse of the per-word geometric mean probability). This roughly encodes how hard it is to predict the next word, and varies across domains and languages. Perplexity can predict word error rate within the same language model type.
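A minimal perplexity computation, using made-up per-word probabilities for a hypothetical 4-word test sentence:

```python
import math

# Hypothetical per-word model probabilities for a 4-word sentence.
word_probs = [0.1, 0.25, 0.05, 0.2]

n = len(word_probs)
log_prob = sum(math.log(p) for p in word_probs)
# perplexity = P(sentence) ** (-1/N), computed in log space for stability
perplexity = math.exp(-log_prob / n)
print(perplexity)
```

The more probable the model finds the test data, the lower the perplexity; a uniform model over a V-word vocabulary would give perplexity V.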
The probability of an n-gram in some training set can be very different from its probability in a test set, particularly when the data is sparse (there are many n-grams which occur rarely or not at all). It is very common (maybe 15% of trigrams) for an n-gram with zero count in the training set to appear in the test set - maximum likelihood would then assign probability 0 to a valid sentence. To avoid this, we “smooth” the maximum likelihood estimates so that zero-count n-grams are allocated some probability, moved from elsewhere, such that the probabilities still sum to 1. One way to free up probability mass is the Good-Turing estimate, which lowers the counts of observed n-grams and is reliable for low counts. The previously unseen n-grams can then be given probability via Katz back-off smoothing, which assigns probability in proportion to the corresponding (n-1)-gram estimate. There are many other options, and with a good smoothing method you can make the n-grams longer.
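A heavily simplified back-off sketch: absolute discounting stands in for Good-Turing, and the renormalization of the back-off distribution is glossed over, so this is not full Katz smoothing - just the shape of the idea that discounted mass flows to unseen n-grams via the lower-order estimate:

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = ("<s> i hate to wait </s> "
          "<s> i hate to say </s> "
          "<s> we like to wait </s>").split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
total = sum(unigrams.values())

DISCOUNT = 0.5  # absolute discount, a crude stand-in for Good-Turing

def p_backoff(w2, w1):
    # Seen bigram: discounted maximum likelihood estimate.
    if bigrams[(w1, w2)] > 0:
        return (bigrams[(w1, w2)] - DISCOUNT) / unigrams[w1]
    # Unseen bigram: the mass freed by discounting (DISCOUNT per seen
    # bigram type starting with w1) is spread according to P(w2).
    seen_types = sum(1 for (a, _b) in bigrams if a == w1)
    alpha = DISCOUNT * seen_types / unigrams[w1]
    return alpha * unigrams[w2] / total

print(p_backoff("to", "hate"))    # seen bigram: discounted estimate
print(p_backoff("wait", "hate"))  # unseen bigram: nonzero via back-off
```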
The HMM model discussed so far, which has multiple states per phoneme, is trained on some data, and models word sequences in the “one big HMM” manner, is called the whole-word model. Issues with the whole-word model include that it cannot model unseen words, can't share data among the models for various words, and has a parameter count that grows linearly with the vocabulary size.
To reduce the number of parameters, instead of using a word model for each HMM, you can use subword models - phones, diphones, or syllables. This requires breaking down each word into its subword units, which can be hard because the pronunciation is not always obvious from the spelling - you have to create an exhaustive dictionary with all possible pronunciations of each word. The dictionary needs to be very complete and error-free, which is very hard to achieve. The various pronunciations can be represented as a graph, where different possible pronunciations of any part of a word are represented as parallel arcs. The phonemes in the graph are then replaced by Markov models - most commonly three states with self-loops.
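A pronunciation dictionary with alternatives can be sketched as a mapping from words to lists of phoneme sequences; each alternative sequence corresponds to a parallel path through the pronunciation graph. The entries and ARPAbet-style phone symbols below are illustrative, not from a real lexicon:

```python
# Toy pronunciation lexicon: each word maps to one or more phoneme
# sequences. Multiple sequences = parallel arcs in the pronunciation
# graph. Phone symbols are illustrative ARPAbet-style labels.
lexicon = {
    "either": [["IY", "DH", "ER"], ["AY", "DH", "ER"]],
    "the":    [["DH", "AH"], ["DH", "IY"]],
    "cat":    [["K", "AE", "T"]],
}

def pronunciations(word):
    # Returns all pronunciation variants, or [] for out-of-dictionary words.
    return lexicon.get(word, [])

print(len(pronunciations("either")))  # 2 parallel paths through the graph
```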
The probabilities of the arcs within a given phoneme HMM can be constrained to be equal to limit the number of parameters. We can extend this further by “tying” the parameters across all HMM instances of the same phoneme. The arcs corresponding to a given state can also be tied. However, a phoneme sounds quite different when it appears in different words (next to different phonemes). More specifically, phonemes have context-dependent variants known as allophones.
Triphones represent each phoneme in the context of its neighbors. This cubes the number of phonetic elements (N phones yield N^3 possible triphones), but not all of them occur, and some occur rarely.
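The blow-up in model count is easy to see with a typical inventory size (40 is an assumed, round figure for English):

```python
# With N base phones, context-dependent triphones number N**3 in
# principle, though far fewer actually occur in real speech.
N_PHONES = 40  # assumed typical English phone inventory size
print(N_PHONES ** 3)  # 64000 possible triphones vs. 40 monophones
```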
Because some words will not show up in the training data, or will not have many examples, for larger vocabularies we use phoneme GMM-HMMs instead of word HMMs. That way, you can pool data for each phone over all words. If the word spoken doesn't appear in the training data, you can still recognize its phonemes just like other words. As long as the word is in the decoder vocabulary, it can still be recognized.
The context of a phone (the phones on either side of it) changes its acoustic characteristics, so “context-independent” phonetic modeling does not work well. A huge system might model every possible phone context, but this is impractical. Instead, a decision tree is used to choose the triphone (context dependent on the phonemes before and after) models, clustering similar contexts together. Because there are so many parameters, a flat start with forward-backward training doesn't help; there are also local maxima caused by the hidden variables.