|Context||Columbia Neural Network Reading Group/Seminar Series|
Mobile phones are ever more numerous and their computing power keeps improving, yet the audio quality of phone calls is not improving with them. Furthermore, speech recognition on phones, especially in very noisy environments, is not a solved problem. Because speech is a rich signal, it requires rich models; synthesis models are rich enough to represent most speech, so this work focuses on using them to help build powerful models of speech audio.
Typically, removing noise from a noisy recording involves modifying the recording in a way that retains the speech but removes the noise. This work instead focuses on extracting enough information from the noisy recording to re-synthesize equivalent speech, which is potentially noise-free. It is feasible to obtain a great deal of “clean” speech from mobile phones, because most of the time you are the person talking on your phone, in close proximity and with little noise. To reconstruct a newly synthesized recording of the speech, a large dictionary of clean speech is used; a DNN learns an affinity between noisy/reverberated/compressed speech and entries in the dictionary, and the chosen dictionary atoms are then concatenated for synthesis.
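As a rough sketch of the dictionary-and-concatenation pipeline described above: the atom length, hop size, and windowed overlap-add are assumptions, and cosine similarity stands in for the learned DNN affinity (`make_dictionary`, `affinity`, and `synthesize` are hypothetical names, not the authors' code):

```python
import numpy as np

def make_dictionary(clean_signal, atom_len=320, hop=160):
    """Slice a clean recording into overlapping fixed-length atoms."""
    starts = range(0, len(clean_signal) - atom_len + 1, hop)
    return np.stack([clean_signal[s:s + atom_len] for s in starts])

def affinity(noisy_atom, dictionary):
    """Stand-in for the learned DNN affinity: plain cosine similarity."""
    d = dictionary / (np.linalg.norm(dictionary, axis=1, keepdims=True) + 1e-8)
    q = noisy_atom / (np.linalg.norm(noisy_atom) + 1e-8)
    return d @ q  # one score per dictionary atom

def synthesize(chosen_indices, dictionary, hop=160):
    """Concatenate chosen clean atoms by windowed overlap-add."""
    atom_len = dictionary.shape[1]
    out = np.zeros(hop * (len(chosen_indices) - 1) + atom_len)
    win = np.hanning(atom_len)
    for i, idx in enumerate(chosen_indices):
        out[i * hop : i * hop + atom_len] += win * dictionary[idx]
    return out
```

In the actual system the affinity comes from a trained network rather than a fixed similarity, but the overall shape — score each noisy frame against every clean atom, then stitch the selected atoms back together — is the same.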
DNNs are used in a variety of ways in speech enhancement tasks, including predicting clean speech directly from a noisy mixture, predicting a mask from the noisy mixture, or learning a scalar similarity value from an input of clean and mixed speech. The latter approach requires less training data and is more adaptable. Training this kind of network uses matching clean/mixture pairs (e.g. “this clean speech corresponds to this noisy mixture”) and non-matching pairs (obtained by sampling from other portions of the speech recording). To synthesize a new recording, the most probable sequence of audio atoms from the clean dictionary is found by maximizing the probability of the sequence, which incorporates a transition matrix giving the likelihood that one atom follows another; this can be achieved with the Viterbi algorithm. This is very effective, but the results depend on the representation fed into the network: for example, if MFCCs are used as the input representation of the speech, the resulting pitch may differ from the original.
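The Viterbi decoding step can be sketched as follows, assuming per-frame log-affinity scores from the similarity network and a log-domain atom-transition matrix (both inputs are hypothetical here; the real scores come from the trained DNN):

```python
import numpy as np

def viterbi(log_affinity, log_trans):
    """Most probable atom sequence.

    log_affinity: (T, K) log-score of each of K dictionary atoms per frame.
    log_trans:    (K, K) log-likelihood that atom j follows atom i.
    Returns a length-T array of chosen atom indices.
    """
    T, K = log_affinity.shape
    delta = np.empty((T, K))          # best cumulative score ending in each atom
    psi = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = log_affinity[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (prev atom, next atom)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(K)] + log_affinity[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

With a uniform transition matrix this reduces to a per-frame argmax; the transition term is what biases the decoder toward sequences of atoms that plausibly follow one another in clean speech.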
For a denoising task, in a listening test, this model resulted in high subjective quality with strong perceived noise suppression. However, intelligibility decreased when using concatenative synthesis. In a dereverberation/compression/packet-loss setting, subjective quality was also high and intelligibility was reasonably good. A simple extension would be to label the dictionary elements ahead of time (with pitch, phone label, speaker ID), which would enable noise-robust versions of those tasks (pitch tracking, phone recognition, speaker identification) by retrieving the label corresponding to each chosen dictionary element.
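The labelled-dictionary extension amounts to a lookup: once the decoder has chosen a sequence of clean atoms, pre-computed labels are simply read off each chosen atom. A minimal sketch, with hypothetical label fields:

```python
# Hypothetical per-atom labels, computed once on the clean dictionary.
atom_labels = [
    {"pitch_hz": 120.0, "phone": "aa", "speaker": "spk01"},
    {"pitch_hz": 180.0, "phone": "iy", "speaker": "spk01"},
    {"pitch_hz": 150.0, "phone": "eh", "speaker": "spk02"},
]

def labels_for_path(chosen_indices, atom_labels):
    """Noise-robust labelling: read stored labels off the chosen clean atoms."""
    return [atom_labels[i] for i in chosen_indices]
```

Because the labels were computed on clean speech, the retrieval inherits whatever noise robustness the atom-selection step has.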