# http://colinraffel.com/wiki/


# A Vocabulary-Free Infinity-Gram Model for Nonparametric Bayesian Chord Progression Analysis

- **Authors:** Kazuyoshi Yoshii, Masataka Goto
- **Publication Info:** Retrieved here
- **Retrieval date:** 3/24/12

Proposes a probabilistic, vocabulary-free infinity-gram model for chord progression analysis using Bayesian nonparametrics.

## Notation

Chords can be notated as one of 17 variations on each of the 12 pitch classes, such as C:maj or F#:dim7, giving 205 labels ($17 \times 12 + 1$, including "no chord"), or explicitly by their components, such as C:100010010000 for C major, giving 49153 labels ($2^{12} \times 12 + 1$).
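As a concrete illustration, the component-based notation can be generated from a root and a set of semitone intervals; the helper below is a hypothetical sketch (the function name and interval representation are mine, not the paper's).

```python
# Hypothetical helper for the paper's component-based chord notation:
# a root pitch class followed by a 12-bit template over semitone degrees.
def component_notation(root, intervals):
    """Encode a chord as 'root:b1...b12', where bit i is set when the
    chord contains the note i semitones above the root (0 = the root)."""
    bits = ["0"] * 12
    for iv in intervals:  # intervals in semitones above the root
        bits[iv] = "1"
    return f"{root}:{''.join(bits)}"

# C major = root C plus intervals {0, 4, 7} (root, major third, perfect fifth)
print(component_notation("C", [0, 4, 7]))  # -> C:100010010000
```

With 12 possible roots, $2^{12}$ possible templates, and one "no chord" symbol, this yields the $2^{12} \times 12 + 1 = 49153$ labels mentioned above.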

## N-gram models

Chord patterns are closely related to composer style and musical genre, and modeling them can improve automatic chord recognition algorithms (for example, by fusing the model with a joint probabilistic model of keys, chords, and bass notes) or automatic transcription. N-gram models are typically used to predict chords; they treat the $n$-th chord in a subsequence as dependent on the previous $n-1$ chords, using a Markov model trained on observed data. Not all n-grams are observed, so heuristic smoothing must be used to assign unobserved n-grams non-zero probability, which has no solid theoretical foundation. N-gram models also fix the context length at $n-1$ and require a limited set of chord labels to be defined in advance.

More specifically, given a chord vocabulary $W$ with size $V$, $w \in W$ is a chord and $u \in W^{n-1}$ is a context of $n-1$ chords. Our observed data is $X$, a single sequence of $M$ chords $x_m \in W$, with the assumption that $x_m$ only depends on the past $n-1$ chords. The goal is to estimate $P_u(w|X)$, the probability of chord $w$ given context $u$. Letting $c_{uw}$ be the number of occurrences of chord $w$ following context $u$, the naive maximum likelihood estimate is given by $P^{ML}_u(w|X) = \frac{c_{uw}}{c_{u\cdot}}$ where $c_{u\cdot}$ is the total number of chords which follow context $u$ in the observed data; this makes $P_u(w|X) = 0$ whenever $c_{uw} = 0$. Smoothing techniques used to prevent zero probabilities include Kneser-Ney and its variants.
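A minimal sketch of the naive maximum-likelihood estimate above, using hypothetical chord labels; it makes the zero-probability problem for unseen n-grams concrete.

```python
from collections import defaultdict

def ml_ngram(chords, n):
    """Naive maximum-likelihood n-gram estimate P_u(w) = c_uw / c_u.
    Returns a function mapping (context, chord) to a probability;
    unseen (context, chord) pairs get probability zero."""
    c_uw = defaultdict(int)  # count of chord w after context u
    c_u = defaultdict(int)   # total count of chords after context u
    for m in range(n - 1, len(chords)):
        u = tuple(chords[m - n + 1:m])
        c_uw[(u, chords[m])] += 1
        c_u[u] += 1
    return lambda u, w: (c_uw[(tuple(u), w)] / c_u[tuple(u)]
                         if c_u[tuple(u)] else 0.0)

# Toy sequence of chord labels (made up for illustration)
seq = ["C:maj", "F:maj", "G:maj", "C:maj", "F:maj", "A:min"]
p = ml_ngram(seq, 2)
print(p(["F:maj"], "G:maj"))  # -> 0.5 (F:maj -> G:maj once, -> A:min once)
print(p(["F:maj"], "E:min"))  # -> 0.0 -- the unseen bigram gets zero mass
```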

The hierarchical Pitman-Yor language model (HPYLM) uses Pitman-Yor (PY) processes, which are distributions over distributions over a sample space, written as $G \sim PY(d, \theta, G_0)$, where $d, \theta$ are positive real numbers (discount and strength) and $G_0$ is a base measure distribution over the sample space. The HPYLM is formulated by layering PY processes in a hierarchical Bayesian manner. Let $G_\phi$ be the unigram distribution over $W$, where $\phi$ is the empty context and $G_\phi(w)$ is the unigram probability of chord $w$. An n-gram distribution $G_u$ is drawn from a PY process as $G_u \sim PY(d_{|u|}, \theta_{|u|}, G_{\pi(u)})$, where $\pi(u)$ is $u$ with its earliest chord removed and $d_{|u|}$ and $\theta_{|u|}$ are discount and strength parameters which depend on $|u|$; as such, $G_u$ is generated recursively, bottoming out at the uniform base measure $G_0(w) = 1/V$.

Once the HPYLM is defined, observed data $X$ is generated according to the Chinese restaurant franchise (CRF) stochastic process, where contexts are like restaurants, the $M$ observed variables $X$ are like customers, and the $V$ chord types $W$ are like dishes; each restaurant has an unbounded number of tables, each of which is served a dish. Supposing that $x_1,\cdots,x_M$ are generated sequentially, customer $x_m$ enters restaurant $u = x_{m-(n-1)} \cdots x_{m-1}$ at depth $n-1$. Let $t_{uw}$ be the number of tables serving dish $w$ in restaurant $u$, and $c_{uwk}$ be the number of customers sitting at table $k$ and eating dish $w$. Customer $x_m$ either sits at an existing table $k$ and eats $w$ with probability proportional to $c_{uwk} - d_{|u|}$, setting $x_m = w$ and incrementing $c_{uwk}$, or sits at a new table $k = t_{u\cdot} + 1$ with probability proportional to $d_{|u|}t_{u\cdot} + \theta_{|u|}$, sending a proxy customer to $\pi(u)$ where he behaves in a recursive manner - if he eventually eats $w$ in restaurant $\pi(u)$, then $w$ is also served at the new table $k$ in restaurant $u$ and $x_m$ eats $w$, incrementing $t_{uw}$ and $c_{uwk}$ and setting $x_m = w$. If a proxy customer is sent to restaurant $\phi$, a dish is served according to the global base measure $G_0$. Specifically, given a seating arrangement $S$, a chord $w$ following context $u$ is generated according to $P_u^{HPY}(w|S) = \frac{c_{uw\cdot} - d_{|u|}t_{uw}}{c_{u\cdot}+\theta_{|u|}} + \frac{d_{|u|}t_{u\cdot} + \theta_{|u|}}{c_{u\cdot} + \theta_{|u|}}P_{\pi(u)}^{HPY}(w|S)$, where $P_{\pi(u)}^{HPY}(w|S)$ is given by substituting $\pi(u)$ for $u$.
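The predictive equation above can be sketched as a recursion over contexts. This is an illustrative implementation only, assuming the count dictionaries ($c$, $t$) have been produced by a CRF sampler that is not shown here.

```python
def p_hpy(u, w, c, t, d, theta, V):
    """Predictive probability of chord w after context u under one fixed
    seating arrangement S (a sketch -- the counts are assumed given).
    c[u][w]: customers eating dish w in restaurant u (c_{uw.})
    t[u][w]: tables serving dish w in restaurant u (t_{uw})
    d, theta: per-depth discount and strength, indexed by |u|."""
    if u is None:                      # backed off past the root: use G_0
        return 1.0 / V
    n = len(u)
    c_u = sum(c.get(u, {}).values())   # c_{u.}
    t_u = sum(t.get(u, {}).values())   # t_{u.}
    parent = u[1:] if n > 0 else None  # pi(u) drops the earliest chord
    p_parent = p_hpy(parent, w, c, t, d, theta, V)
    if c_u == 0:                       # empty restaurant: pure backoff
        return p_parent
    c_uw = c.get(u, {}).get(w, 0)
    t_uw = t.get(u, {}).get(w, 0)
    return ((c_uw - d[n] * t_uw) / (c_u + theta[n])
            + (d[n] * t_u + theta[n]) / (c_u + theta[n]) * p_parent)

# Toy seating arrangement for the empty context (unigram restaurant) only
c = {(): {"C:maj": 2, "F:maj": 1}}
t = {(): {"C:maj": 1, "F:maj": 1}}
d, theta = [0.5, 0.5], [1.0, 1.0]
total = sum(p_hpy((), w, c, t, d, theta, 3)
            for w in ["C:maj", "F:maj", "G:maj"])
print(round(total, 6))  # -> 1.0: the predictive distribution is normalized
```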

In order to estimate $P_u(w|X)$ in a Bayesian manner, the expected value of $P_u^{HPY}(w|S)$ is calculated under the CRF posterior $P(S|X)$ as $P_u^{HPY}(w|X) = \sum_{S} P_u^{HPY}(w|S)P(S|X) \approx \frac{1}{L}\sum_{l=1}^{L}P_u^{HPY}(w|S_l)$, where $S_1, \cdots, S_L$ are iid seating arrangements sampled from $P(S|X)$ via Gibbs sampling. The Gibbs sampling is initialized by adding all customers according to a posterior CRF, in which each customer $x_m = w$ sits at a new or existing table; then, repeatedly, a customer $x_m$ is selected at random and removed from the tree (along with any proxies and tables which become empty), and $x_m$ is added to the tree again according to the posterior CRF. The parameters $d_0, \cdots, d_{n-1}$ and $\theta_0, \cdots, \theta_{n-1}$ are given beta and gamma prior distributions, respectively, and their values are sampled from the corresponding posterior distributions.

HPYLM forces all customers to enter restaurants at depth $n-1$; the variable-order PY language model (VPYLM) allows each customer to enter a restaurant at variable depth. Each chord $x_m$ is associated with a latent variable $z_m$ that indicates the value of $n$; all possible values of $z_m$ are considered, making it an infinity-gram model. To determine $z_m$, $x_m$ descends the tree following $\phi \rightarrow x_{m-1} \rightarrow x_{m-2} \rightarrow \cdots$ by backtracking the context $u$, stopping at restaurant $u_i$ with probability $\eta_{u_i}$, giving $z_m = n$ with probability $P_u(n|\eta) = \eta_{u_{n-1}} \prod_{i=0}^{n-2}(1-\eta_{u_i})$. A beta prior with parameters $\alpha$ and $\beta$ is placed on each $\eta_u$, so that $p(\eta) = \prod_{u\in \text{tree}}\mathrm{Beta}(\eta_u|\alpha,\beta)$. Given $z_m$, $x_m$ is determined according to a CRF with seating arrangement $S$, predicting chord $w$ following context $u$ by $P_u^{VPY}(w|S) = \sum_n P_u^{VPY}(w|n, S)P_u(n|S)$, where $P_u^{VPY}(w|n, S)$ is obtained from the $P_u^{HPY}(w|S)$ equation above and $P_u(n|S) = \int P_u(n|\eta)p(\eta|S)d\eta$. The predictive distribution is found as in HPYLM, except that VPYLM needs to sample $z_m$ from its posterior distribution before adding customer $x_m$ to the tree, using $P_u(n|S, w) \propto P_u(w, n|S) = P_u^{VPY}(w|n,S)P_u(n|S)$.
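The stopping process that defines $P_u(n|\eta)$ can be sketched as follows; the function name is mine, and the example assumes a finite context path whose last stop probability is 1, so the depth distribution sums to one.

```python
def depth_distribution(etas):
    """Probability of stopping at each restaurant along a context path,
    given stop probabilities eta_{u_0}, eta_{u_1}, ...:
    P(n) = eta_{u_{n-1}} * prod_{i=0}^{n-2} (1 - eta_{u_i}).
    Sketch only; in VPYLM each eta_u has a Beta(alpha, beta) prior and
    this distribution is integrated over its posterior."""
    probs = []
    survive = 1.0                 # probability of not having stopped yet
    for eta in etas:
        probs.append(survive * eta)
        survive *= (1.0 - eta)
    return probs

# Stop with probability 0.5 at the first two restaurants, always at the third
print(depth_distribution([0.5, 0.5, 1.0]))  # -> [0.5, 0.25, 0.25]
```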

To avoid fixing a finite vocabulary in advance even though the vocabulary grows steadily, the NPYLM was formulated using a global base measure $G_0$ over a countably infinite number of variable-length words. In language modeling, a spelling model based on a letter-level VPYLM is given as the global base measure $G_0$ of a word-level VPYLM, regarding each word as a sequence of letters following a letter-level CRF, where the word length is assumed to follow a Poisson distribution.

## Vocabulary-free infinity-gram model

A vocabulary-free infinity-gram model can be created by extending modern nonparametric n-gram models. In this way, the distribution of the next chord can be formalized with a probabilistic generative model of chord sequences, each chord sequence can have variable length, and any combination of notes can constitute a chord. A chord sequence model can be proposed similarly to the NPYLM, except that chords are simultaneous combinations of notes whereas words are temporal sequences of letters.

Chords can be handled by formulating a probabilistic model based on the component notation as the global base measure $G_0$ of a chord-level VPYLM, with each chord $w$ written as $w_0:w_1 \cdots w_{12}$, where the root $w_0$ follows a 13-dimensional discrete distribution and the components follow Bernoulli distributions: $G_0(w) = p(w|\pi, \tau) = \pi_{w_0} \prod_{i=1}^{12}\tau_i^{w_i}(1-\tau_i)^{1-w_i}$, where $\pi = \{\pi_C, \pi_{C\#}, \cdots, \pi_N\}$ represents the probabilities of each pitch class and "N" (no chord) and $\tau = \{\tau_1, \cdots, \tau_{12}\}$ indicates the existence probabilities of the respective degrees, with prior distributions $p(\pi, \tau) = \mathrm{Dir}(\pi|a_0)\prod_{i=1}^{12}\mathrm{Beta}(\tau_i|b_0, c_0)$ and parameters $a_0 = b_0 = c_0 = 0.5$.
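A sketch of evaluating the base measure $G_0$ for one chord in component notation; the uniform values of $\pi$ and $\tau$ below are made up for illustration (in the model they would be sampled from their posteriors).

```python
# Sketch of the vocabulary-free base measure G0 over component-notation
# chords: a discrete root distribution times 12 Bernoulli components.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def g0(chord, pi, tau):
    """G0(w) = pi_{w0} * prod_i tau_i^{w_i} * (1 - tau_i)^{1 - w_i}.
    chord = (root, bits): root in PITCH_CLASSES + ['N'], bits a 12-tuple
    of 0/1 component indicators."""
    root, bits = chord
    p = pi[root]
    for b, t_i in zip(bits, tau):
        p *= t_i if b else (1.0 - t_i)
    return p

# Illustrative (not learned) parameters: uniform root, tau_i = 0.5 for all i
pi = {s: 1.0 / 13 for s in PITCH_CLASSES + ["N"]}
tau = [0.5] * 12
cmaj = ("C", (1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0))  # C:100010010000
print(g0(cmaj, pi, tau))  # (1/13) * 0.5**12 -- every chord shape has mass
```

Because every root and every 12-bit template receives non-zero probability, any combination of notes can be served as a dish in the root restaurant, which is what makes the model vocabulary-free.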

Given a seating arrangement $S$, the posterior distribution of $\pi$ and $\tau$ can be calculated as $p(\pi, \tau|S) = \mathrm{Dir}(\pi|a_0 + n)\prod_{i=1}^{12}\mathrm{Beta}(\tau_i|b_0 + n_i, c_0 + \bar{n}_i)$, where $v$ is one of the pitch classes or "N", $n_v$ is the number of tables serving dishes with root note $v$ in restaurant $\phi$, $n_i$ is the number of tables serving dishes containing the $i$-th note in $\phi$, and $\bar{n}_i$ is the number of tables serving dishes without the $i$-th note in $\phi$. The predictive distribution of the next chord and the Gibbs sampling algorithm are as in VPYLM, except that when a customer sits at a new table in the root restaurant, the values of $n_v$ and $n_i$ or $\bar{n}_i$ are incremented according to the components of the target chord, and are decremented when such a table is removed.

To test this model, chord sequences from 137 major-scale Beatles songs were used after transposition to the key of C major, comprising 10761 chords with 103 chord types in label-based notation and 149 chord types in component-based notation. The infinity-gram model was compared to a variety of existing methods, and to HPYLM and VPYLM models incorporating the vocabulary-free base measure $G_0$, using label-based notation. The effectiveness of vocabulary-free modeling in component-based notation was also measured. All tests used 10-fold cross-validation with perplexity (the average number of next-chord candidates) as the metric. VPYLM yielded the lowest perplexity among the label-based models, performing best when $n$ was marginalized out, taking all possibilities into account. For the component-based models, vocabulary-free VPYLM yielded perplexity substantially lower than the other models, suggesting robustness to sparseness.
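For reference, perplexity is the exponentiated negative average log-probability of the held-out chords; a quick sketch under the convention of base-2 logarithms:

```python
import math

def perplexity(log2_probs):
    """Perplexity = 2^(-mean log2 predictive probability), interpretable
    as the effective number of equally likely next-chord candidates."""
    return 2.0 ** (-sum(log2_probs) / len(log2_probs))

# A model that always assigns the next chord probability 1/8 behaves as if
# choosing uniformly among 8 candidates:
print(perplexity([math.log2(1 / 8)] * 10))  # -> 8.0
```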