|Context||Columbia Neural Network Reading Group/Seminar Series|
Acoustic modeling typically involves converting an acoustic feature representation into a representation of the phonetic state. Phonemes are the smallest units of sound that distinguish meaning in a language. A phoneme can be characterized by a set of distinctive features: its manner of articulation (plosive, fricative, nasal, vowel, etc.), where in the vocal tract it is articulated, and whether it is voiced or unvoiced (whether the vocal folds vibrate when producing it). Phonemes are an abstract idea, whereas a physical instance of a phoneme in an actual utterance is called a phone.
DNNs are state-of-the-art for acoustic modeling in speech recognition, but their actual behavior is not well understood. A better understanding of the transformations they perform could help improve the models. A typical speech recognition system first converts spectro-temporal features (including their derivatives) into phoneme probabilities; this step is called acoustic modeling. Recently, this has been done with deep neural networks. In this study, a network with 5 hidden layers was used, whose input was a log-magnitude mel spectrogram with context frames, together with its first- and second-order differences. The probability of each phoneme was computed using a softmax output layer. The network was trained on the Wall Street Journal corpus.
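The input construction and output layer described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual front end: the simple frame-to-frame difference used for the derivatives, the context width, the edge clamping, and all toy values are assumptions, and the function names are hypothetical.

```python
import math

def deltas(frames):
    """First-order difference along time: d[t] = x[t] - x[t-1], with d[0] = 0."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out

def add_deltas(frames):
    """Append first- and second-order differences to every frame."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]

def stack_context(frames, context):
    """Concatenate each frame with `context` neighbours per side (edges clamped)."""
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(frames[min(max(k, 0), last)])
        stacked.append(window)
    return stacked

def softmax(logits):
    """Numerically stable softmax: output activations -> class probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "log-mel spectrogram": 4 frames x 2 mel bands (made-up values).
log_mel = [[0.0, 1.0], [1.0, 1.0], [3.0, 0.0], [3.0, 2.0]]

# Assumed context of 1 frame on each side; the study's context width is not given here.
features = stack_context(add_deltas(log_mel), context=1)

# Made-up output-layer activations for three phoneme classes.
probs = softmax([2.0, 1.0, 0.1])
```

Each row of `features` is what a single network input would look like under these assumptions: the current frame and its neighbours, each carrying static, first-difference, and second-difference values.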
Because TIMIT is labeled at the phoneme level, running it through the network makes it possible to demarcate the network's response to each phoneme. One can then see how individual units in different layers respond to different phonemes and phoneme types. As a simple measure of how a given unit responds to different phonemes, the number of phonemes for which its response is statistically significant is computed as the “phoneme selectivity index” (PSI). Unsupervised clustering of the resulting PSI vectors makes different phonetic features apparent, and the clustering was similar across hidden layers. Interestingly, individual nodes did not represent individual phonemes; they represented phonetic features, which can be combined to classify phonemes. Phonetic feature encoding also became more explicit, and phonemes more separable, in higher layers. Manner of articulation was the most dominant feature represented. In addition, information that did not matter for the task, such as speaker gender, became less discriminable in higher layers.
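One possible reading of the PSI computation is sketched below for a single unit. Several things here are assumptions rather than details from the study: the PSI entry for a phoneme is taken to be the number of other phonemes whose responses are significantly weaker, significance is approximated by a Welch t-statistic exceeding a fixed threshold, and the threshold, function names, and toy responses are all hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for mean(a) > mean(b); each sample needs >= 2 values."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Degenerate case (no variance at all): fall back to the sign of the difference.
        return math.inf if diff > 0 else (-math.inf if diff < 0 else 0.0)
    return diff / se

def psi_vector(responses, t_threshold=3.0):
    """PSI vector of one unit.

    `responses` maps a phoneme label to that unit's activations over all
    instances of the phoneme. The entry for phoneme p counts the other phonemes
    whose responses are significantly weaker than the response to p
    (assumed criterion: Welch t-statistic above `t_threshold`).
    """
    phonemes = sorted(responses)
    return {
        p: sum(
            1
            for q in phonemes
            if q != p and welch_t(responses[p], responses[q]) > t_threshold
        )
        for p in phonemes
    }

# Made-up activations of a single hidden unit, grouped by phoneme label.
unit_responses = {
    "s": [0.9, 0.8, 1.0, 0.9],
    "z": [0.8, 0.9, 0.9, 1.0],
    "a": [0.1, 0.0, 0.2, 0.1],
    "m": [0.1, 0.2, 0.0, 0.1],
}

psi = psi_vector(unit_responses)
# psi == {"a": 0, "m": 0, "s": 2, "z": 2}
```

In this toy case the unit responds equally strongly to the two fricatives and weakly to the vowel and the nasal, so it looks selective to a phonetic feature rather than to one phoneme. Collecting one such vector per unit gives the PSI vectors that are then clustered.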
To determine whether different units were selective to different subclasses of a phoneme, all units in a given layer that were highly selective to that phoneme were selected, and their responses to all instances of the phoneme were computed and clustered. Studying the clusters revealed that which instances a unit was most selective to depended largely on the context of the phoneme; that is, units were selective to particular subcategories of the stereotypical phoneme.
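The instance-level analysis might look something like the sketch below. The notes do not name the clustering algorithm or the selectivity cutoff, so k-means, the `min_psi` threshold, the unit names, and the context tags on the toy instances are all assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: recompute centroids, keeping the old one if a cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels

def select_units(psi_by_unit, phoneme, min_psi):
    """Units whose PSI for `phoneme` is at least `min_psi` (hypothetical cutoff)."""
    return [u for u, psi in sorted(psi_by_unit.items()) if psi[phoneme] >= min_psi]

# Made-up PSI values of three units in one layer for the phoneme /t/.
psi_by_unit = {"u0": {"t": 30}, "u1": {"t": 5}, "u2": {"t": 28}}
selected = select_units(psi_by_unit, "t", min_psi=20)

# Made-up responses to four instances of /t/, with hypothetical context tags.
instances = [
    {"context": "word-initial", "u0": 0.9, "u1": 0.1, "u2": 0.2},
    {"context": "word-final", "u0": 0.2, "u1": 0.3, "u2": 0.8},
    {"context": "word-initial", "u0": 0.8, "u1": 0.2, "u2": 0.1},
    {"context": "word-final", "u0": 0.1, "u1": 0.1, "u2": 0.9},
]

# One point per phoneme instance: the responses of the selected units only.
points = [[inst[u] for u in selected] for inst in instances]
labels = kmeans(points, k=2)

# Compare cluster membership against the context of each instance.
contexts_by_cluster = {}
for inst, lab in zip(instances, labels):
    contexts_by_cluster.setdefault(lab, set()).add(inst["context"])
```

With this toy data each cluster ends up containing instances from a single context, which is the kind of pattern the analysis is looking for; real responses would of course be noisier.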