Audio Analysis, Part 1: Digital Audio
16 November 2015

Today I'm going to try and explain how digital audio works. Most people have a vague idea about how sound works. "Sound is a wave!", they might say. "Sound bounces off things!", adds another. "Sound increases as you get closer to the stacks!", relates one subject matter expert. "What?", asks someone else. "I SAID SOUND INCREASES AS YOU GET CLOSER TO THE STACKS!" "Sorry, I couldn't hear you over my crippling tinnitus! Dear Jesus, if only I had known more about sound!"
This will be a really simple primer about digital sound waves and how they contain musical notes. To follow along at home, you'll need the excellent free audio editor Audacity. Download it if you don't have it already, then open it up to an empty window.
Right, so what does a single note "look" like? Let's start with a sine wave. Sine waves have an easily recognisable and pure-sounding tone. In our Audacity window, we'll click Generate -> Tone.

Oh look, they've helpfully picked a Sine wave as the default, with a frequency of 440 Hz and an amplitude of 0.8! (I changed the duration to only 5 seconds, as the default of 30 is a bit too long.) Let's hit OK.

That's a sine wave? Looks more like a giant blue slab to me. Let's zoom in a little bit.

Flashbacks of early high-school maths should be running through your head. Let's figure out what's going on in this picture. The graph you see here is basically tracing the path that the speaker cone will take to reproduce the sound, with +1.0 meaning pushed all the way forward and -1.0 pulled all the way back. If we think of sound output (vertical axis) as y, and time (horizontal axis) as t, here's the formula:
y = A*sin( w*t + ph )
- A is the amplitude of the wave; in our case, we picked 0.8, and sure enough we can see on the graph that the wave has peaks and troughs at +/- 0.8.
- w is a little tricky, but it corresponds to the frequency we picked: 440 Hz, which we'll call f. Basically, we want sin() to repeat itself every 1/f seconds (also known as the period of the wave), and the function sin() takes radians as an input (so one full repetition of sin() happens every 2*pi radians). Combining these, we get w = 2*pi/(1/f) = 2*pi*f. This means the wave should repeat itself every 1/440 ≈ 0.00227 seconds, which matches up with what we see on the graph.
- ph is the phase of the wave: basically, the point along the wave pattern at which we want to start our signal. Right now it's 0. (There's a little code sketch of all three parameters just below.)
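If it helps to see the formula in action, here's a minimal C sketch of my own that traces the first millisecond of speaker cone positions. (Names like SAMPLE_RATE are mine; the 44100 samples-per-second business is explained further down.)
#include <math.h>
#include <stdio.h>

#define SAMPLE_RATE 44100   /* samples per second; more on this below */
#define PI 3.14159265358979f

int main( void )
{
    float A  = 0.8f;        /* amplitude, matching the Audacity default */
    float f  = 440.0f;      /* frequency in Hz (concert A) */
    float ph = 0.0f;        /* phase offset in radians */
    float w  = 2.0f*PI*f;   /* angular frequency: 2*pi*f */

    /* trace the first millisecond of speaker cone positions */
    for ( int i = 0; i < SAMPLE_RATE/1000; i++ ) {
        float t = (float)i/SAMPLE_RATE;   /* time in seconds */
        float y = A*sinf( w*t + ph );     /* the formula from above */
        printf( "t = %f  y = %+f\n", t, y );
    }
    return 0;
}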
If you also suffer flashbacks to high-school music lessons, playing back the 440 Hz sine wave might also remind you of "concert A", a.k.a. the favourite tuning note for string musicians and orchestras. Musical notes from instruments (as we'll find out later) are a mix of a whole pile of different frequency waves, but usually there's one frequency that's loudest, and that's the one we assign the note value to. 440 Hz (a.k.a. the A wot comes after middle C on a keyboard) is referred to as A4 in scientific pitch notation, or 69 on the MIDI scale. There are 12 notes in an octave (great choice of name, guys), and interestingly an octave jump between notes is a factor of 2 in the frequency domain. So that means that 57 (A3) is 220 Hz, while 81 (A5) is 880 Hz! Try these out in the tone generator if you like; don't they sound real similar?
Anyway, you're probably wondering if there's a nice and easy formula for converting note values in the MIDI domain to tones in the frequency domain for your tone generator. To which my response would be "define easy":
#include <math.h>
/* convert between frequency in Hz and MIDI note number (A4 = 440 Hz = note 69) */
#define FREQ_TO_MIDI( freq ) (69.0f+12.0f*logf( (freq)/440.0f )/logf( 2.0f ))
#define MIDI_TO_FREQ( midi ) (expf( logf( 2.0f )*((midi)-69.0f)/12.0f )*440.0f)
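As a quick sanity check, here's a little test harness of my own (assuming the include and macros above are in scope):
#include <stdio.h>

int main( void )
{
    printf( "MIDI 57 (A3) = %.2f Hz\n", MIDI_TO_FREQ( 57.0f ) );   /* 220.00 */
    printf( "MIDI 69 (A4) = %.2f Hz\n", MIDI_TO_FREQ( 69.0f ) );   /* 440.00 */
    printf( "MIDI 81 (A5) = %.2f Hz\n", MIDI_TO_FREQ( 81.0f ) );   /* 880.00 */
    printf( "440 Hz = MIDI %.0f\n", FREQ_TO_MIDI( 440.0f ) );      /* 69 */
    return 0;
}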
Sometimes it's easier to use a lookup table instead. Like this one! (thanks J. Wolfe of UNSW)

One final thing: digital audio is based on the principle of sampling. Unlike an analogue audio signal, which is a continuous sound output from an electrical or mechanical process (e.g. a piezo cartridge running through the grooves of a record, or an electric guitar plugged into a tube amp), a digital audio signal operates at a fixed number of "samples" per second. You see that number, 44100 Hz, down in the bottom left corner? That's known as the sampling rate. Every second of audio can be thought of as a big two-column list, with 44100 timecodes down the left-hand side and 44100 speaker positions on the right. 44100 Hz is the standard sampling rate of audio CDs and most downloadable music.
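If you think in C, one way to picture that two-column list is a plain array where the timecode column is implicit in the index. A little sketch (assuming 32-bit float samples, which is Audacity's default format):
#include <stdio.h>

#define SAMPLE_RATE 44100

int main( void )
{
    /* one second of mono audio: 44100 speaker positions in -1.0..+1.0;
       the "timecode" column is implicit: sample n plays at n/44100 seconds */
    static float second_of_audio[SAMPLE_RATE];   /* zero-initialised, i.e. silence */

    int n = 22050;
    printf( "sample %d plays at t = %f s, speaker at %+f\n",
            n, (float)n/SAMPLE_RATE, second_of_audio[n] );
    return 0;
}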
Wouldn't having a fixed number of samples per second limit the quality of sound you would get? Absolutely! If you've wondered why voices on a landline telephone sound so lousy, it's because Telstra can't maintain their copper for beans and the pit outside is probably full of water. It's also because the sound is processed with G.711, an audio codec from 1972 that has an abysmal sampling rate of 8000 Hz.
In digital audio, the highest frequency sound you can physically reproduce is half the sampling rate, known as the Nyquist frequency. Which makes sense, because the tightest you'd be able to pack sound into your digital signal is with a +1.0, followed by a -1.0, followed by another +1.0… and so on. Now 44100 Hz has a Nyquist frequency of 22050 Hz, which conveniently is juuuust above the range of human hearing. Telephones have a Nyquist frequency of 4000 Hz, which is not. It's extremely audible. Try loading some music into Audacity and cranking the Project Rate down to 8000 Hz; notice how it sounds all muddy and not crisp? That's what happens when you can't play frequencies higher than 4000 Hz. It stinks, right?
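You can see that tightest-possible packing for yourself with a quick sketch of my own: sample a sine at exactly the Nyquist frequency and the values just flip between plus and minus the amplitude. (One wrinkle: with a phase of 0 you'd hit sin() exactly at its zero crossings every time, so the sketch shifts the phase by pi/2.)
#include <math.h>
#include <stdio.h>

#define SAMPLE_RATE 44100
#define PI 3.14159265358979f

int main( void )
{
    float nyquist = SAMPLE_RATE/2.0f;   /* 22050 Hz */

    /* w*t advances by pi every sample, so successive samples alternate in
       sign; the pi/2 phase offset keeps us off the zero crossings */
    for ( int i = 0; i < 8; i++ ) {
        float t = (float)i/SAMPLE_RATE;
        float y = 0.8f*sinf( 2.0f*PI*nyquist*t + PI/2.0f );
        printf( "sample %d: %+f\n", i, y );   /* roughly +0.8, -0.8, +0.8, ... */
    }
    return 0;
}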
Alright, enough minutiae. Time to get to the most mind-bending part about how sound works. So we're looking at our sine wave here, thinking about how it corresponds to the movement of the speaker cone, but also how nice and symmetrical it is. One question springs to mind, then: how is it possible for a good set of loudspeakers to reproduce… well, any damn sound you can think of? How do you combine multiple sounds? Wouldn't the movement of the cone look all jagged and weird?
Let's generate another tone in our project: a complementary one to A4. Let's go for C#4 (61), as that produces a nice-sounding A major third. Add a new tone of a C#4 sine wave at 277.18 Hz.

When you play that back, there's going to be a sharp edge to it. This is because the two tones we picked have a very high amplitude, enough that combining them produces a result outside the range of -1.0 to +1.0, which you can hear as clipping. Lower the volume of the two waves by setting the little -/+ slider on the left of both tracks down to -10 dB. Listen again; you'll have a much cleaner sound.
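For the curious, that dB fader maps to a plain multiplier: gain = 10^(dB/20), so -10 dB scales each wave by about 0.316, and 0.316*(0.8 + 0.8) ≈ 0.51 stays comfortably inside the legal range. Here's a sketch of my own of the whole situation (the DB_TO_GAIN helper is mine, not an Audacity API):
#include <math.h>
#include <stdio.h>

#define SAMPLE_RATE 44100
#define PI 3.14159265358979f

/* convert a fader position in dB to a linear gain factor */
#define DB_TO_GAIN( db ) (powf( 10.0f, (db)/20.0f ))

int main( void )
{
    float gain = DB_TO_GAIN( -10.0f );   /* about 0.316 */

    for ( int i = 0; i < 100; i++ ) {
        float t   = (float)i/SAMPLE_RATE;
        float a4  = 0.8f*sinf( 2.0f*PI*440.0f*t );
        float cs4 = 0.8f*sinf( 2.0f*PI*277.18f*t );
        float raw = a4 + cs4;   /* can reach +/-1.6: audible clipping */

        if ( raw > 1.0f || raw < -1.0f )
            printf( "sample %3d clips at %+f (with -10 dB: %+f)\n",
                    i, raw, gain*raw );
    }
    return 0;
}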
But they're still both separate! To see the resulting signal that drives the speaker, we need to merge them. Press Ctrl-A to select both tracks, then pick Tracks -> Mix and Render to New Track.

Will you look at that? The samples from each of the two waves are just added together. The resulting signal, being simple, has a few obvious features that match up with the two parent sine waves. Still, looking at it by itself, it's difficult to say what frequencies are in this new combination, and it gets even harder to examine by eye as more tones are combined. And yet, if you listen closely, it's obvious it contains the two tones we made earlier! How do we separate tones back out again? If our ears can do it, why can't maths?
If you haven't guessed already, a large part of audio analysis is going backwards from a complete digital audio signal to its component parts. We'll go into that more in the next thrilling instalment!