User Tools

Site Tools


Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription

Authors Nicolas Boulanger-Lewandowski, Yoshua Bengio and Pascal Vincent
Publication Info Retrieved here
Retrieval date 7/20/13

Proposes a method of modeling polyphonic music sequences using a deep recurrent neural network.


Many naturally occurring data sequences have long-term dependencies. RNNs have the capability to summarize the sequence history; they are hard to train efficiently using gradient-based optimization but Hessian-free optimization can help. A model of high-dimensional sequences should predict a conditional distribution of the next time step, which involves computing the probability for a huge number of outputs. This is impractical in general but is made feasible by RBMs, for example. Many attempts have been made to combine the properties of RNNs and RBMs including recurrent temporal RBMs. This paper applies a variant of RT-RBMs to the task of modeling a full, “piano-roll” version of polyphonic music. Most existing models are either frame-based or use a rudimentary smoothing post processing step (HMM, for example).



RBMs are energy-based models with the probability of observing (binary) hidden and visible vectors given by

$P(v, h) = \mathrm{exp}(-b_v^\top v - b^\top_h h - h^\top W v)/Z$

where $b_v, b_h$ are bias vectors for the visible and hidden units, $W$ is the mixing matrix and $Z$ is the partition function. When either $v$ or $h$ is given, the other is conditionally independent:

$P(h_i = 1 | v) = \frac{1}{1 + \mathrm{exp}(-b_h - Wv)}$

$P(v_j = 1 | h) = \frac{1}{1 + \mathrm{exp}(-b_v - W^\top v)}$

If we compute the free energy

$F(v) = -b_v^\top v - \sum_i \log(1 + \mathrm{exp}(b_h + Wv))_i$

then $P(v) = \mathrm{exp}(-F(v))/Z$. Sampling an RBM can be achieved by performing $k$ steps of sampling $h|v$ and $v|h$ (Gibbs sampling). The negative log likelihood of an input vector $v^{(l)}$ with respect to the model parameters $\Theta = \{b_v, b_h, W\}$ can be estimated by a single sample $v^{(l)\ast}$ optioned from a $k$-step Gibbs chain starting at $v^{(l)}$ (contrastive divergence):

$\frac{ \partial(-\log P(v^{(l)})) }{\partial \Theta} \approx \frac{ \partial F(v^{(l)}) }{\partial \Theta} - \frac{ \partial F(v^{(l)\ast}) }{\partial \Theta}$

An neural autoregressive distribution estimator (NADE) can be substituted for an RBM by substituting this equation with an exact gradient. Mean-field samples can be obtained from an RBM by drawing $h^\ast \sim P(h)$ and computing $P(v_j = 1|h^\star)$.


An RTRBM is a sequence of RBMs whose parameters $b_v^{(t)}, b_h^{(t)}, W^{(t)}$ depend on the sequence history $A^{(t)} = \{v^{(\tau)}, \hat{h}^{(\tau)} | \tau < t \}$ where $\hat{h}^{(t)}$ is the mean-field value of $h^{(t)}$. An RTRBM's joint probability distribution is

$P(\{v^{(t)}, h^{(t)} \}) = \prod_{t = 1}^T P(v^{(t)}, h^{(t)} | A^{(t)} )$

(what is $T$?). When only the biases depend on previous time steps we have

$b_h^{(t)} = b_h + W^\prime \hat{h}^{(t-1)}$

$b_v^{(t)} = b_v + W^{\prime \prime} \hat{h}^{(t-1)}$

which results in six parameters: $W, b_v, b_h, W^\prime, W^{\prime \prime}, \hat{h}^{(0)}$. The initial $h^{(0)}$ is used (through $W^\prime$ and $W^{\prime \prime}$) to compute the biases for the first hidden and visible units, which are used to compute the second, and so on. Hidden units must describe the conditional distributions and convey temporal information. The mean field value of the hidden units is what is transmitted over time, which makes exact inference easy given that

$\hat{h}^{(t)} = \frac{1}{1 + \mathrm{exp}(W v^{(t)} + W^\prime \hat{h}^{(t-1)} + b_h)}$


Instead of using the hidden units both for the conditional distribution and temporal information, the RNN-RBM is combines the RT-RBM with an RNN so that hidden units are not used to compute biases for subsequent hidden units. The RNN hidden units $\hat{h}^{(t)}$ are then only determined by their predecessor and the visible units at each time step, by

$\hat{h}^{(t)} = \sigma( W_2 v^{(t)} + W_3\hat{h}^{(t-1)} + b_{\hat{h}} )$

giving parameters $W, b_v, b_h, W^\prime, W^{\prime \prime}, h^{(0)}, W_2, W_3, b_{\hat{h}}$. Training is done iteratively by computing the current values of $\hat{h}^{(t)}$, calculating RBM parameters which depend on $\hat{h}^{(t)}$ and training the RBMs as usual, then computing the estimated gradient $b_v^{(t)}, b_h^{(t)}$ using BPTT to get the estimated gradient with respect to the RNN parameters. In the RNN-RBM, parameters can be tied to make $\hat{h}$ equivalent to $h$ as in the RTRBM. However, having them separate, the RNN hidden units can be much smaller than the RBM hidden units. The parameters $W, b_v$, and $b_h$ can be initialized by unsupervised pre training of the RBMs, and the parameters of the RNN can be initialized from an RNN trained with cross-entropy cost.



An RNN-RBM and an RT-RBM with binary RBM units were trained on a dataset of videos of bouncing balls. The mean frame-level squared prediction error was for the RNN-RBM was half of that for the RT-RBM. Models were also trained using Gaussian RBM units on a sequence of motion capture data, (joint angles, translations and rotations); the mean-squared error was also lower for the RNN-RBM.


Four MIDI datasets consisting of 67 hours of music were sampled at a sub-beat level to generate the data. The log-likelihood and expected frame-level accuracy was computed for the proposed models and a variety of simple models. The RNN-RBM or RNN-NADE models performed best, with an improvement in performance when using HF optimization to train the RNNs. Generated sample sequences revealed that the models were able to learn harmony rules, note correlations, and local temporal coherence, but not long-term structure.


A pre-existing acoustic model was used which outputs note probabilities for a short each frame of an audio signal, giving $P_a(v^{(t)})$. The cost of all combinations of the top $k=7$ candidates is computed by

$C = -\log (P_a(v^{(t)})) - \alpha P_s( v^{(t)} | \tilde{A}^{(t)} )$

where $P_s( v^{(t)} | \tilde{A}^{(t)} )$ is the symbolic model probability and $\tilde{A}^{(t)}$ is the history of f0 estimates. This resulted in an improvement on all datasets against an HMM model. The HMM model, however, seeks a global best smoothing, whereas the presented model is greedy.

tag/modeling_temporal_dependencies_in_high-dimensional_sequences_application_to_polyphonic_music_generation_and_transcription.txt · Last modified: 2015/12/17 14:59 (external edit)