music_256a

# Music 256a Course Notes

Notes from CCRMA's Music 256a, “Music, Computing and Design I: Software Design and Implementation for Computer Music”. More information is at the course website. These notes are from Fall 2009 and may or may not reflect future years; they're also in chronological order (as opposed to logical).

## Music Computing

Music computing uses the same algorithms and data structures but is more IO based and has a constant discrete time structure. You can analyze sound and come up with data which represents the timbre, pitch, level, etc of the sound. It's important to remember that musical computing has the final goal of design, because it's meant to be used and interacted with by humans. It's like cooking, you have certain ingredients (algorithms, data) and you cook to create something. You can either take in sound to analyze, or generate a waveform to output, all of which is done in real-time. Audio is shuttled around and processed with buffers. Regardless of platform, you tend to be dealing with the same paradigms, even though the API's are different. Design is important because it is our other sense with which we can interact with the computer. If programming is done with real-time in mind, doing it “offline” is trivial. Making audio software is like cooking yourself - if you go to a restaurant and eat, you are eating but you haven't made the food.

## Programming & Software

Two answers, “I don't know” (encouraging trying new things) and “it depends”. Depends on platform, context. EG desktop vs laptop vs iphone, or goto statements which should be used rarely except in nested loops. Levels of abstraction: hardware, hardware abstraction layer (HAL), operating system (OS), and the operating system's interface. We will work in between OS and the operating system interface. The operating system is a key part - what's happening under the hood is important when dealing with audio because the OS is what determines how you deal with audio. A 1st system program is terrified of failure, simplified to its bare bones, and successful beyond its intended life span. 2nd system is ambitious, conceived by academics, with many good ideas, ahead of its time, and doomed to failure. 3rd system picks and chooses essentials from 2nd system, made by good hackers, emphasizes elegance over performance, and becomes widely adopted. 4th system is maturation.

## Abstraction

Certain things in audio, graphics, networking, you can't abstract, like Windows pretending a networked file is local. It should be taken on a case-by-case-basis. “Design locally, hack locally”. Set yourself up so that it's really easy to tweak. For most programs people will spend 90% of the time on 10% of the code - don't focus on the extra 90%. Polymorphism means that you have the same algorithm that can deal and do different things to different data types. They have behavioral abstraction but the way you talk to it is the same. In this way we can have abstraction with something that is on a very basic level the same.

## Convolution

Imparting properties of one signal on another - digital filtering. Imparting eg a filter related to its response. With a linear time invariant filter, you can characterize the filter by its impulse response. EG speaking in a room; the room has an impulse response and if you apply that impulse response to another sound it will be as if the sound is in the room. Convolution is the process of playing the sound through the filter or in the space - applying the IR. In the digital domain, the code involves taking two signals, passed in as arrays of numbers (audio buffers representing the entire sound file). The size of the output buffer will be the size of the passed arrays summed; minus one. Then the convolution itself is two nested for loops. At each iteration of the loop the i+j'th member of buffy is incremented by one sample of the first sample times one sample of the other.

1
for ( i < size1 ){
for ( j < size2 ){
output[i+j] += sample1[i] * sample2[j];
}
}

What convolution is doing is basically applying the impulse signal to each sample of the effected signal and summing all of the resulting samples. Convolving is commutative. If you convolve a signal with itself repeatedly over time it will converge to a gaussian/bell curve. This process is roughly O(n^2). This process can be made faster by loop unrolling the for loops - rather than doing it one member of the buffer at a time, you do it 4, or n samples at a time. This reduces the amount of branching and allows for parallel processing. Via pipelining - overlapping tasks in the processor. Branches reduce the ability to pipeline. When dealing with streams of data, loop unrolling is not uncommon. Loop unrolling is only optimization, however, it will only improve time 50% maybe max if you're lucky. If you need it to go a whole lot faster you need to come up with a much better algorithm: using the fourier transform to do the operation in the frequency domain instead of in the time domain. This is optimization vs. asymptotic superiority. Loop unrolled is still O(n^2).

## In the Frequency Domain

To get a signal into the frequency domain, we take the FFT. The output of the FFT is a bunch of complex values in a buffer; then you do the multiplication (as complex mult) to all of the samples. But you only need one for loop:

1
for ( i < fftsize/2) {
real = sample1real * sample2real - sample1i * sample2i;
i = sample1real * sample2i + sample1i * sample2real;
}

The FFT's are O(nlog(n)) so the resulting programming is O(nlog(n)). Note - DFT is a mathematical algorithm, while FFT is a fast algorithm for calculating the Fourier transform. The only real difference in output comes from rounding in the floats used in the FFT algorithm.

## Basic Audio Programming

Audio processing at any level involves buffers and some sort of callback. Buffers are really just an array of contiguous audio samples. In general they have size with power 2. In general filled with floats, sometimes unsigned ints but in general it's single precision floats. In order to connect a buffer of audio samples to actual sound, normally you deal with an API as part of the operating system which are then sent out to hardware by the OS. The whole point is to not care about anything after the API. A library (traudio) can be added in between the application and the API so that you get the same interface between Windows, Mac, and Linux, etc. In a typical implementation, you create a pointer for RtAudio, set up a buffer size, then set up RtAudio by passing it device number, number of output channels, device number for input, number of input channels, tell it what format you want to deal with, what sample rate, buffer size, and the number of buffers. Then, there's the callback, which gets audio to and from the OS - via blocking or asynchronous/callback (the latter is more typical). You tell RtAudio what the callback method you made is, which has a certain structure expected by RtAudio. Then you go into an idle loop, with a sleep to prevent yourself from taking up CPU power. In the callback function, the entirety of the sound processing happens. It never gets called in the program - you told RtAudio that it's the callback function, so it is called by RtAudio at a rate depending on the buffer size.

## Big O's

FFT: nlog(n), bogosort: n!, FFT-based convolution: nlog(n), time-domain convolution: O(n^2), bubble sort: O(n^2)

## Karplus Strong

“Physical models” traveling waves in a string - via feedback and low pass filtering over a delay line. The delay line is creating a comb filter, with peaks determined by the length of the delay line. The comb filter essentially imposes periodicity on the signal. The comb filter is like taking bites out of a cake that is the spectrum of the sound. If you start with a really rich spectrum, like white noise, you can get “more” out of it. With the lowpass there rather than a simple attenuator, you get a more natural decay, because higher harmonics dissapate faster in an actual string.

## Moving average filter

A moving average filter just averages the value of the current input and the input one sample ago. No feedback; just an averaging between the signal and the signal one sample ago.

## Polyphony

First we must define how many voices we can have, max. Then you have to decide on how to deal with extra notes - above the max value. We use “note stealing”, FIFO style polyphony. In the case of Karplus-Strong, we need basically one delay line for each voice.

## STL

Queue is first in first out, stack is last in first out. With a queue you put stuff in one end and take them out the other; in a stack you put things on top and then take them off. You push and pop things on and off the stack. To include something from STL, you do an include without an h. <iostream>, <fstream>, <algorithm>, etc. #include <queue> for queues. In the STL, you have templated classes, which is nice because we can ask for a queue of strings, floats, whatever

1
#include <queue>
queue<int> q;
q.push(1);
q.push(2);
q.push(3);

while( !q.empty() )
{
cerr << q.front() << endl;
q.pop;
}

Outputs 1 2 3.

Note that you get elements out of the queue with a pop, but read the front element with front. Stack is similar except with top and not front.

1
#include <stack>
stack<int> q;
q.push(1);
q.push(2);
q.push(3);

while( !q.empty() )
{
cerr << q.top() << endl;
q.pop;
}

Outputs 3 2 1. A map is some form of a hash table, which essentially associates a pair together - the “key” and the “value”. An array is a map from an integer to whatever type the array is.

1
#include <map>
map<string, int> m;
m["foo"] = 24;
m["bar"] = 100;
cerr << m["foo"] << endl;

// declare and iterator
map<string, int>::iterator iter;

// iterate over the map
for( iter = m.begin(); iter != m.end; iter++)
{
// print out key and value pairs - first and second
cerr << (*iter).first << " " (*iter).second << endl;
}

Objects A class/struct a mold you can instantiate stuff out of.

1
class Foo
{

public: // Accessible by anyone

static int ourStaticVar;
// Everything static in the class is shared by all classes.
// You don't need an instance to use statics.
// This allows us to only have one instance
static Foo * getInstance();
{
if( ourFoo = NULL)
ourFoo = new Foo();
return ourFoo;
}

void sayHello()
{
cerr << "hello, I am a Foo" << endl;
cerr << m_privateVar << endl; // this is OK
}

int m_publicVar;

protected:  // Accessible by children/subclasses of Foo

int m_protectedVar;

private:

static Foo * ourFoo;

int m_privateVar;
//Constructor, which is called whenever a FOo is instantiated
Foo()
{
m_publicVar = 0;
m_protectedVar = 0;
m_privateVar = 0;
}

// Destructor, which is run when the object goes out of scope or we delete it
~Foo()
{
cerr << "see ya" << endl;
}

}; // Do not forget this semicolon

Foo * Foo::ourFoo = NULL;
int Foo::ourStaticVar = 0;

int main()
{
// Access static var without instance
Foo::ourStaticVar = 50;

// instantiation
Foo * foo = Foo::getInstance();

// Change foo. to foo=> when it's a pointer

foo->m_publicVar = 10;

foo->sayHello();

// Can't do this
// foo.m_protectedVar = 20;

// Access static var through instance
foo->ourStaticVar = 10;

return 0;
}

Say we want to put our class in a .h file. Then you instantiate everything but you don't initialize anything or provide any syntax for functions. In the cpp file, you would create functions, initialize variables, by static Foo * foo::getInstance. This way you segregate the way to use a class, and the actual implementation of it. You want to say if the function or variable is static in the .h file, not the .cpp file. You can also differentiate between functions by having void SayHello() vs voidSayHello(int m); You can also overload constructors, but not destructors. You can create classes that inherit, and extend, other classes. Eg class FooBar : public Foo, which inherits everything in Foo. FooBar is a full fledged foo with some added/different stuff. Inherited classes can override functions from their parent class, eg if the FooBar had its own sayHello. But you should then make SayHello virtual in both cases - polymorphism.

## ChucK

### Main tenants

Chucking, time, synchronizing via shreds, and on-the-fly enabling virtual machine. The input to the system is code (text). This goes through a “lexical analysis phase”, where the code is dissected into tokens. Then the tokens are sent to a syntax verifying structure. Then we send through type checking, and get out type verification (ie can't cast an int to a string). Depending on how strong the type checking is, if it makes it to this point it is not going to crash (not true in C++). Then it gets sent to the emitter from which runnable code comes out, and the compiling phase ends. The compiled code goes to a run-time platform; in ChucK this is a virtual machine.

### Lexical analysis

Done in Lex, which essentially defines a token based on a regular expression. These are the things that turn blue (when they aren't operators). EG define '*“ as MULT.

### Parsing

Takes the output of Lex (stream of tokens) and tries to make syntactical sense). It has a nondeterministic nature; so it's able to recognize more things. EG with a finite state machine (lexical analysis) you couldn't tell if a string was a palindrome, but with an automata (parsing) you can. For a while loop: WHILE LPAREN expression NPAREN statement, where expression and statement resolve to something else elsewhere.

### Syntax Verification

Creates a syntax/execution tree.

### Type Checker

Needs to deal with scope, namespaces, environments. Differs depending on language. Recursively checks the code, piece by piece (ie makes sure the while condition is a boolean first).

### Emitter

Emitter turns the resulting code into virtual machine code. Includes ChucK specific instructions like advancing time - requires the virtual machine approach to allow instructions to be simple or complicated.

### Virtual Machine

The VM actually interprets byte code separately from the audio engine. The audio engine keeps track of the ugen connectivity, but the byte code interpreter connects and disconnects. Whenever time advances (which actually happens when the shred in the byte code interpreter says it's OK) the dac polls its inputs to get audio data.

## Graphics

Soundpeek uses openGL. OpenGL skeleton code is highly copy-pastable. Uses libxtract, an open source library which contains 50+ audio features (time and frequency domain). CLAM is another library to look at, does spectral modeling synthesis. Marsyas and CLAM are more frameworks; libraries you integrate in, frameworks you are assimilated by. OpenGL also hast GLU, which has a lot of utility functions (drawing common shapes, moving the camera) and GLUT which is a utility toolkit; nor part of the OpenGL standard but widely used, a cross-platform glu (like RtAudio). GLUT also provides basic text rendering and interaction, like keyboard interaction. Audio is read into a global buffer, then soon after on a different thread the graphics program reads from this buffer and draws graphics. OpenGL has its own callback for rendering graphics, separate from the two. mutex.lock/mutex.unlock is used to prevent the audio from writing to the buffer while the graphics is reading from it, but this is really not necessary because the graphics callback copies that buffer before using it. Only one thing can lock a single mutex at once, but other non-mutex code can run.

To implement graphics, start by copying and pasting code from a working program with graphics. You also need to include the openGL library #include </GLUT/glut.h>. You also need to add the appropriate libraries/frameworks to your makefile, eg -framework GLUT. Normally there is a separate main loop for GLUT that you call once you initialize on the main thread. You then need a global, shared buffer and buffer size, and each time the callback happens you load your audio buffer into the global buffer. Then to plot, we can create a GL_LINE_STRIP via glBegin(GL_LINE_STRIP) and then glEnd(). Then we loop through the global buffer, and for each element in the buffer we call glVertex2f( x, g_buffer[i] ). We need to make x as well, and increment it some small amount.

## FFT

Key parameters: Window size (also window type), FFT size, “hop size”. What we are doing is “STFT”, short time fourier transform, where we basically take FFTs in small chunks. Whatever chunk we take the FFT of is the window. If we chop the signal directly, we add unwanted artifacts to it (multiplied it by a rectangular window). The FFT size must be a power of two - window size doesn't need to be, but should. FFT size must be bigger than window size; if it's not, you zero pad. When you do zero padding, you don't increase accuracy so much as just interpolate the spectral view. The hop size is related to the size of successive FFTs. The difference between successive FFT's is the hop size; 1/2 overlap means your hop size is 1/2 of the window size, 1/8 and 1/16 are also not uncommon, to really closely observe how the spectrum is changing. Requires a complex type; define with a struct (real and imag component). Also define modulus and magnitude of complex number. If you want to write C code within C++, check if __cplusplus is defined and if it is put extern “C” {. N samples go into the FFT; N/2 complex bins come out. When you do the FFT, the location of samples of the argument become the locations of the bins, so afterwards you cast those values as complex. You'd expect a single peak for a sine wave, but also a smearing effect tapering off the peak. Under this curve we are seeing the contour of the side lobes; the more you zero pad the more apparent the lobes are. A Hamming window is a raised cosine window, a Hann window is a direct cosine, these help remove the side lobes. They are each better for certain spots. 1/2 window size is normally a good hop size for hamming, hann, blackman, etc windows. In a spectrogram, you see three dimensions : time (x), frequency (y) and amplitude (color).

## Libsndfile

A library for reading and writing soundfiles; efficient, bug free, accepted in open-source community. Can read stereo, signed, PCM of pretty much any format (no lossy). You open a file, get a pointer back, and use it to seek, read, or write, in terms of short, int, or double. Wav files are layed out with a header, then data. In stereo, you get left, right, left, right, left, right… all of which is specified in the header file, supposedly. The header is what's different between aiff and wav, etc.

## Secret Rabbit Code

Really high quality sample rate conversion in code. Very advanced, robust, and artifact-free sample rate conversion.

## Networking

7 layer network stack model: Physical, data link, network, transport, session, presentation, application. We work network and up. IP is a way to address things on the internet, and to get from some host to another. TCP and UDP are on the transport level. TCP offers reliability; abstracts the packet-based thing into a stateful stream. Takes things like packet loss into consideration. Most of our interactions is done over TCP. Ports are like mailboxes on hosts, TCP has certain ports and so does UDP. UDP is not connection oriented; you can lose packets, but you do get a checksum, but does not rely on a lot of overhead. For music we use a combination for both of these. Soundwire for example uses their own version of UDP, which accounts for what happens when you're missing a packet. Somewhere beyond that is the protocol for soundwire, open sound control, etc.

## Open Sound Control

Sends generic, arbitrarily addressed, arbitrary size packets most often over UDP for the purpose of control.

## Sockets

Sockets are abstractions of network connections. TCP and UDP sockets are different; with a TCP socket there is a notion of state, while with UDP you are just sending data and hoping it comes in.

## OSC

At a higher level, osc: liblo is a lightweight OSC implementation in C, oscpack is simple C++ packet manipulation library. In ChucK, you can use OscSend objects, eg OscSend xmit, and to send a message, xmit.startMsg( ”/test“, “f” ) then 168 ⇒ xmit.addFloat. ChucK waits until you have added a sufficient number of floats or other arguments before sending out, then it sends out. To receive a message, use OscRecv; eg OscRecv recv, 8000 ⇒ recv.port, then recv.listen(), then recv.event ( ”/test, f“ ) @⇒ OscEvent oe then oe⇒now (wait for event to arrive), then while( oe.nextMsg) then oe.getFloat ⇒ something. Calories is a good example of the use of OscPack. You include some files from OscPack, then use g_send and g_transmit. Where in ChucK we have to wait until an event associated with receipt of a message. In c++ code, we create threads, which we have to do ourself in this case. In a C api, we call functions and get back handles to files, threads, etc which are stored externally. Sending happens whenever you want, receiving is event driven so it needs to be on another thread.

## Program System vs Program Product

You can go from a program to a program product, or a program system. The program is the core engine; the part that actually does the LPC analysis for example. The program system makes the system work in some way. The product is more how it looks, what documentation does it have. It is somewhat true that it takes the same amount of time to make a program as it does to make a program into a program product, or program system. Inherently all software developers are optimistic. Any software project will take 3 times longer than you'd think, even if you consider this fact.