In a previous post, I described the problem of identifying pauses in a video. Given some audio data, how can a computer determine whether someone is speaking?
My first approach was to take the average amplitude of the sound wave over an interval. Unfortunately, this cannot reliably distinguish pausing from speaking, because the white noise during a pause often has a greater average amplitude than speech.
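To make that first attempt concrete, here is a minimal sketch of an average-amplitude measure. The function names, the window size, and the sample values are all made up for illustration; this is not the code from the original experiment.

```python
def mean_amplitude(samples):
    """Average absolute amplitude of a window of audio samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(abs(s) for s in samples) / len(samples)


def windowed_amplitudes(samples, window_size):
    """Mean amplitude of each consecutive, non-overlapping full window."""
    return [
        mean_amplitude(samples[i:i + window_size])
        for i in range(0, len(samples) - window_size + 1, window_size)
    ]


# Hypothetical samples: quiet speech versus a louder, noisy pause.
speech = [0.1, -0.2, 0.15, -0.1]
noisy_pause = [0.3, -0.3, 0.3, -0.3]

# The pause has the larger mean amplitude, so a loudness threshold
# would label it incorrectly.
print(mean_amplitude(speech) < mean_amplitude(noisy_pause))
```

A threshold on this number only works when pauses are quieter than speech, which is exactly what fails in noisy recordings.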
| Speaking: | ![]() |
| Pause: | ![]() |
And yet there is something different about the pausing sound-wave that humans can see immediately. It looks “rougher” — if you touched the wave, you might feel little spikes or bumps.
How can you describe “roughness” to a computer? I had no idea.
Fortunately, I had just discovered Andrew Ng’s excellent Machine Learning course from Stanford’s online education initiative. He spends the first few lectures discussing supervised learning problems. I realized that this might be the way to go: rather than trying to explicitly identify “roughness” in sound waves, I could select examples of “pause” and “speaking” sound waves. A learning algorithm could then be trained to classify other sound waves.
I chose to model the classification hypothesis using a Sigmoid function:
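For reference, the standard sigmoid (logistic) function, which I am assuming is the form meant here, is:

$$
g(x) = \frac{1}{1 + e^{-x}}
$$

At $x = 0$ it equals exactly $\tfrac{1}{2}$, which is the natural dividing line between the two labels.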
This function has the incredible property that as x decreases, the function approaches 0, and as x increases, the function approaches 1. The challenge is to somehow map features of the audio wave onto the Sigmoid function. Certain features (having a lot of high frequencies, for example) might push you towards the left, whereas other features might push you to the right. If you end up closer to 1 than to 0, then label the audio as “speaking”; otherwise, label the audio as a “pause.”
Once you figure out how to map audio features onto the Sigmoid function, you can very quickly classify new bits of audio. Just apply the mapping function, figure out the y-value of the Sigmoid, and apply the appropriate label.
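The classification step can be sketched in a few lines. This is only an illustration: the two features (low-frequency and high-frequency energy) and the weight values are hypothetical, chosen so that high frequencies push towards “pause”.

```python
import math


def sigmoid(x):
    """Logistic function: near 0 for very negative x, near 1 for very positive x."""
    # Two branches so that math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def classify(features, weights):
    """Map features onto the sigmoid and pick the label closer to its output."""
    x = sum(w * f for w, f in zip(weights, features))
    return "speaking" if sigmoid(x) > 0.5 else "pause"


# Hypothetical weights: low-frequency energy pushes right, high-frequency left.
weights = [2.0, -3.0]

print(classify([1.0, 0.1], weights))  # mostly low-frequency energy
print(classify([0.2, 1.0], weights))  # mostly high-frequency energy
```

Once the weights are known, classifying a new clip costs one dot product and one call to `sigmoid`, which is why this step is so fast.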
There are two problems here, though. First, what “features” of an audio wave should we use? Second, how do we find the mapping function?
In my solution, I computed a frequency spectrum using Welch’s method. Roughly speaking, this splits the wave into overlapping segments and averages their spectra, giving an “average” amount of each frequency component of the wave. This was a natural choice for the feature set, since the “roughness” of a wave comes from its high-frequency components. Source code is available here.
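To show the idea behind Welch’s method, here is a simplified NumPy sketch. It is not the linked source code: the function name, segment length, overlap, and normalization are my own assumptions for illustration. In practice a tested library routine such as `scipy.signal.welch` is the safer choice.

```python
import numpy as np


def welch_spectrum(samples, segment_length=256, overlap=0.5):
    """Average the windowed periodograms of overlapping segments."""
    x = np.asarray(samples, dtype=float)
    if len(x) < segment_length:
        raise ValueError("need at least one full segment")
    step = max(1, int(segment_length * (1.0 - overlap)))
    window = np.hanning(segment_length)
    scale = np.sum(window ** 2)  # normalizes for the energy of the window
    periodograms = []
    for start in range(0, len(x) - segment_length + 1, step):
        segment = x[start:start + segment_length]
        segment = (segment - segment.mean()) * window  # remove offset, taper
        spectrum = np.fft.rfft(segment)
        periodograms.append(np.abs(spectrum) ** 2 / scale)
    # One value per frequency bin; these become the classifier's features.
    return np.mean(periodograms, axis=0)


# Hypothetical input: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
t = np.arange(4096) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

features = welch_spectrum(tone)
peak_bin = int(np.argmax(features))
print(len(features), peak_bin * sample_rate / 256)
```

Averaging over segments trades frequency resolution for a smoother, less noisy estimate, which suits a feature vector that has to be compared across many short clips.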
For the mapping function, I took a linear combination of the frequency components. The parameters of the model are just the amount of each component in the combination. To determine these parameters, I used Batch Gradient Descent to minimize the error when the algorithm is applied to the training set. Source code is available here.
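Here is a minimal sketch of batch gradient descent for this kind of model. It assumes the logistic (cross-entropy) cost used in the course, since the error measure is not spelled out above, and the training data, learning rate, and iteration count are invented for illustration. It is not the linked source code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, features):
    """Sigmoid of the linear combination of the features."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)))


def batch_gradient_descent(examples, labels, learning_rate=0.5, iterations=2000):
    """Fit weights; labels are 1 for speaking and 0 for pause."""
    n_features = len(examples[0])
    n_examples = len(examples)
    weights = [0.0] * n_features
    for _ in range(iterations):
        # "Batch": the gradient is summed over the whole training set
        # before the weights are updated.
        gradient = [0.0] * n_features
        for x, y in zip(examples, labels):
            error = predict(weights, x) - y
            for j in range(n_features):
                gradient[j] += error * x[j]
        weights = [w - learning_rate * g / n_examples
                   for w, g in zip(weights, gradient)]
    return weights


# Hypothetical training set: [bias term, low-frequency power, high-frequency power].
examples = [[1.0, 0.9, 0.1], [1.0, 0.8, 0.2], [1.0, 0.2, 0.9], [1.0, 0.1, 0.8]]
labels = [1, 1, 0, 0]

weights = batch_gradient_descent(examples, labels)
print([round(predict(weights, x), 3) for x in examples])
```

With real data each example would be a full Welch spectrum rather than two numbers, but the update rule is the same.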
Here’s a graph of the parameters that the algorithm found, with just 50 training examples:
Orange indicates a parameter value > 0, which pushes us to the right of the Sigmoid function. Blue indicates the opposite, pushing us to the left of the Sigmoid function.
The algorithm discovered what a human can see easily: high-frequency components (the “roughness” of the wave) tend to indicate white noise. But the algorithm uses other frequencies as well. Some of these are probably due to over-fitting the training examples. However, there’s an interesting pattern with the low-frequency parameters: apparently white noise in the videos also tends to have low-frequency components.



