Proposes a polyphonic pitch tracking technique which learns spectrum/pitch associations and determines the pitches based on solving a nearest subspace search by transforming it to an overcomplete sparse coding function.
Most pitch tracking approaches try to discover periodicities in a signal, but this technique does not match the human ability to learn and recognize signals as musical notes. Instead, it may be effective for the algorithm to learn a mapping between sounds and notes. A naive approach could, given a frame of spectral data, find the pitch label corresponding to the closest (given some distance metric) spectral frame in its training set, resulting in a nearest-neighbor search. This assumes that the correct labeling exists in the training set. Using the KL-divergence on training/testing data consisting of recordings of the same instrument, the results are comparable to existing pitch trackers, suggesting that learning pitch labels from spectral vectors could be effective.
If we assume there is more than once source for a spectrum, we will need to perform a nearest subspace search instead, which attempts to match the testing point to all combinations of the training points. Because there will be many training spectra and potentially many sources, there is no efficient algorithm to compute the nearest subspace. Using a slow nearest subspace search is still effective when combining notes played by different instruments.
The nearest subspace search can be made faster by approximating the input spectrum as a sparse combination from each source training data, each used sparsely - treating each input point as a sparse combination of all of the training data. The training data is then concatenated into a single set of training spectral vectors, and the input is assumed to be approximated as a linear superposition of that data with some weight vector, with is ensured to be sparse by maximizing its l2 norm. The problem is then to minimize the distance between the test spectrum and the set of training spectra weighted by a vector whose l2 norm must simultaneously be maximized: Minimize distance(test spectrum, training spectra*weighting vector) - (sparsity importance scalar)*l2(weighting vector). The amount of sparsity can be controlled by adjusting the sparsity importance scalar. To iteratively find the weighting vector, the weighting can be estimated based on the previous weighting vector, then sparsity can be imposed and the new estimate can be normalized. Once the weighting vector is found, the labels with the largest weight for each source are used to determine the pitch. If all weights for some source are close to zero, it can be assumed that that source is not present. This approach runs many orders of magnitude faster than the brute nearest subspace search. Warped spectrums (similar to constant-Q) were used with 46 ms wide, 11ms hop Hann windows. Unpitched training data is discarded.
To test this approach, about 10,000 training spectra were calculated for each source in a woodwind quintet and the labels for 6,000 testing spectra for each of two, three, four, and all five woodwinds were calculated. As more sources were added, the tracking quality degraded. Octave errors were apparent but could be solved with median filtering and by quantizing the frequency estimates to semitones; more sophisticated approaches could be used. The amount of error was small enough to suggest that learning pitch labels for spectra can be an effective approach.