Context: Columbia Neural Network Reading Group/Seminar Series
The goal of source separation is to analyze a complex audio scene and separate it into its individual components. In the most general case, sounds may be overlapping and obscure one another, the number and types of the sources may be unknown, and there may be multiple instances of one source type (e.g. same speaker). Applications for source separation include speech enhancement, music remixing, improving speech recognition, automated auditory scene analysis for robots, understanding how human speech perception works…
Source separation is made more difficult by the fact that there is a huge variety of different sound types that we may want to separate - human voices, music (including different instruments), natural sounds, man-made sounds, and countless novel sounds. So, it's impossible to make a model for each individual sound. It's also difficult to make a huge general model for all sound types. Sounds can also obscure each other in a state-dependent way. Knowing which sound dominates makes it easy to determine states, and knowing the states makes it easy to know which sound dominates, so there's a chicken-and-egg problem.
A variety of methods have been proposed to mitigate these issues. CASA (computational auditory scene analysis) segments a spectrogram using “grouping cues”; usually there is no explicit model of the sources. This allows for flexible generalization, but the approach is rule-based rather than learned or model-based. More recently, model-based systems have been used, which encode assumptions about the data (even though those assumptions usually don't hold exactly). These models are usually not discriminative, and joint inference can be intractable. There are also trade-offs between speed and accuracy, and the models are limited in their ability to separate similar classes.
Most recently, neural networks have been used, which provide a discriminative method for source separation. These work very well in some tasks, such as “across-type separation” (speech enhancement or vocal vs. background music separation). In these tasks, the objective is usually to predict a mask which can be multiplied element-wise against a spectrogram to achieve separation. A big limitation of this kind of model is scaling to a different number of sources: a model trained on a fixed number of sources will not generalize to other numbers of sources. In addition, when separating two sources of the same type (e.g. two speakers), it's not clear which source the network should mask out in an unseen (testing) example - the “permutation problem”. To mitigate this problem, the “oracle” permutation can be given to the network, but this doesn't solve the issue, since the oracle permutation requires knowing the ground truth.
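To make the masking idea concrete, here is a minimal NumPy sketch. The toy magnitude values and the helper name `ideal_binary_mask` are invented for illustration, and the sketch assumes that source magnitudes simply add in the mixture, which is only approximately true for real audio. In a real system the mask would be predicted by the network rather than computed from the known sources.

```python
import numpy as np

def ideal_binary_mask(source_mag, other_mag):
    """Return 1.0 in bins where `source_mag` dominates, 0.0 elsewhere."""
    return (source_mag > other_mag).astype(float)

# Toy magnitude "spectrograms" (frequency x time); values are made up.
speech = np.array([[4.0, 0.5],
                   [0.2, 3.0]])
noise = np.array([[1.0, 2.0],
                  [1.5, 0.5]])
mixture = speech + noise  # assumption: magnitudes add (approximate in practice)

# An oracle mask built from the known sources; a network would predict this.
mask = ideal_binary_mask(speech, noise)

# Separation is an element-wise product of the mask and the mixture.
speech_estimate = mask * mixture
```

Note that the mask has one fixed output per source, which is why this formulation runs into trouble when the number of sources changes or when the sources are interchangeable.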
Finally, clustering-based methods have been proposed which can handle the permutation problem. They attempt to cluster features according to similarities in local characteristics, but this requires context, and as the context size grows, it becomes more likely that multiple sources are present within it.
In the class-based objective functions used by neural networks, we are trying to make the network's output look like a known target, so the model is purely supervised. In clustering (partition)-based objective functions, we are trying to estimate what belongs together, and there is no need to know object class labels - just whether two objects are the same or not. This makes it a sort of semi-supervised approach. If we know the affinity (for example, which time-frequency bins come from the same speaker), we can try to predict the affinity matrix instead. However, estimating the full affinity matrix is intractable because it can be huge: it has one entry for every pair of time-frequency bins. A common way to mitigate this is to estimate a low-rank approximation of the affinity matrix. Deep clustering attempts to predict this low-rank approximation using neural networks.
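The following sketch illustrates why an affinity matrix sidesteps the need for class labels. It assumes each time-frequency bin is labeled with its dominant source and builds the affinity as the product of a one-hot indicator matrix with its transpose; the names `one_hot`, `Y`, and `A` and the toy labels are illustrative, not taken from the talk.

```python
import numpy as np

def one_hot(labels, num_sources):
    """Indicator matrix with one row per bin and one column per source."""
    return np.eye(num_sources)[labels]

# Hypothetical dominant-source label for each of five time-frequency bins.
labels = np.array([0, 0, 1, 1, 0])

Y = one_hot(labels, num_sources=2)
A = Y @ Y.T  # A[i, j] == 1 exactly when bins i and j share a source

# Renaming the sources (swapping 0 and 1) leaves the affinity unchanged,
# so there is no permutation to resolve.
Y_swapped = one_hot(1 - labels, num_sources=2)
A_swapped = Y_swapped @ Y_swapped.T

# A is N x N but its rank is at most the number of sources.
rank = np.linalg.matrix_rank(A)
```

Here `A` has 25 entries for 5 bins; for a real spectrogram with tens of thousands of bins the full matrix becomes impractical, while its rank stays as small as the number of sources.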
As an objective, assuming we know the ground-truth label of each time-frequency bin, we can construct the ideal affinity matrix and then attempt to produce an approximation of it. The approximation is computed by estimating a high-dimensional embedding for each time-frequency bin of the spectrogram, such that paired bins are close together and unpaired bins are further apart. The objective then simply attempts to make the pairwise similarity matrix of the embeddings look like the affinity matrix. A network is optimized so that, given a spectrogram, it produces embeddings which satisfy this behavior. A nice property of this approach is that the structure of the objective function and its derivative makes it unnecessary to compute or estimate the full affinity matrix.
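One way to see how the full affinity matrix can be avoided is sketched below. It assumes the objective is the squared Frobenius norm ||V V^T - Y Y^T||^2, where V holds one embedding per bin (N x D) and Y holds one-hot source labels (N x C); this form, the unit-norm embeddings, and the function names are assumptions of the sketch rather than details given in the talk. Expanding the norm gives ||V^T V||^2 - 2 ||V^T Y||^2 + ||Y^T Y||^2, which involves only small D x D, D x C, and C x C matrices.

```python
import numpy as np

def affinity_loss_naive(V, Y):
    """Build both N x N matrices explicitly; only viable for small N."""
    return np.sum((V @ V.T - Y @ Y.T) ** 2)

def affinity_loss_low_rank(V, Y):
    """Same value computed from D x D, D x C and C x C products only."""
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))

rng = np.random.default_rng(0)
N, D, C = 200, 40, 2  # bins, embedding dimension, sources (toy sizes)

# Random stand-ins for network outputs and ground-truth labels.
V = rng.standard_normal((N, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm rows (assumption)
Y = np.eye(C)[rng.integers(0, C, size=N)]

naive = affinity_loss_naive(V, Y)
fast = affinity_loss_low_rank(V, Y)  # agrees with `naive` up to rounding
```

The low-rank version costs on the order of N times D squared operations instead of N squared times D, which is what makes training on full spectrograms feasible.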
This approach was tested on the task of separating mixtures of randomly sampled utterances from the Wall Street Journal dataset, with a variety of different mixing ratios. The embedding was computed using a network with two bidirectional LSTM layers. Deep clustering was compared against several baseline methods, including state-of-the-art NMF, CASA, and BLSTM networks with various permutation strategies. The proposed approach was better than all of the other strategies by a substantial margin (according to signal-to-interference ratios and subjective evaluation), even beating NMF with an oracle permutation. Various embedding dimensions were tested, and the best performance was found for roughly 40-dimensional embeddings.
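To separate a test mixture, the embeddings have to be turned into masks. A plausible way to do this, sketched below, is to cluster the embeddings and use each cluster as a binary mask. The use of k-means, the deterministic farthest-point initialization, the 2-D toy embeddings, and the helper name `kmeans` are all assumptions made for illustration and are not details confirmed by the talk.

```python
import numpy as np

def kmeans(X, k, num_iters=10):
    """Plain Lloyd's algorithm with deterministic farthest-point init."""
    centers = [X[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(num_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

F, T = 2, 3  # toy spectrogram size (frequency x time)
mixture = np.arange(1.0, F * T + 1.0).reshape(F, T)

# Stand-in embeddings: bins from the same source sit near the same prototype.
true_labels = np.array([0, 1, 1, 0, 0, 1])
rng = np.random.default_rng(0)
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
V = prototypes[true_labels] + 0.05 * rng.standard_normal((F * T, 2))

assign = kmeans(V, k=2)
masks = np.eye(2)[assign].T.reshape(2, F, T)  # one binary mask per cluster
estimates = masks * mixture[None, :, :]       # one masked spectrogram each
```

Because the number of clusters is chosen at test time, the same embedding network can in principle be applied to mixtures with a different number of sources than it was trained on.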
In addition, the network trained on two-speaker mixtures still works reasonably well on three-speaker mixtures. When trained and tested on three-speaker mixtures, its performance was not quite as good as that of oracle NMF. When attempting to separate overlapping mixtures of the same speaker, the results are also not as good, although deep clustering still beats all of the other methods.
Future work will try different network architectures (e.g. convolutional front-end), joint training of the network and the clustering, and will test the approach on different tasks.