This is a problem that introduced me to Fourier analysis and machine learning, with several productive detours along the way.
The company I work for produces training videos, each about 350 minutes long. We needed to package the videos for a new distribution platform, which required splitting each video into smaller segments of 5 to 15 minutes.
Fortunately, we had written down time-stamps for when instructors switch to new topics. I knew that the ffmpeg tool on Linux could segment MP4 videos, and we had the time-stamps stored in an XML file. So I wrote a Python script to load the time-stamps, split the videos using ffmpeg, and package the videos (and other metadata) for the distributor.
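In rough outline, the script looked something like the sketch below. The XML schema here is hypothetical (the real file follows the distributor's format), and I've left out the packaging step:

```python
import subprocess
import xml.etree.ElementTree as ET


def load_timestamps(xml_path):
    # Hypothetical schema: <topic start="1234.5"> elements holding the
    # offset (in seconds) where the instructor switches topics.
    tree = ET.parse(xml_path)
    return sorted(float(t.get("start")) for t in tree.iter("topic"))


def split_video(video_path, timestamps):
    # Pair each time-stamp with the next one to get (start, end) ranges,
    # then cut each range out with ffmpeg. "-c copy" avoids re-encoding,
    # at the cost of cutting on the nearest keyframe.
    for i, (start, end) in enumerate(zip(timestamps, timestamps[1:])):
        subprocess.run(
            ["ffmpeg", "-y",
             "-ss", str(start),       # seek to the segment start
             "-i", video_path,
             "-t", str(end - start),  # segment duration
             "-c", "copy",
             f"segment_{i:03d}.mp4"],
            check=True,
        )
```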
Everything seemed to work, until I watched the split videos.
As it turned out, the time-stamps were accurate to within 2 seconds. That’s a very long time when you’re listening to someone speak — enough time to say several words. So the script was splitting the videos mid-word. It sounded terrible.
Unfortunately, there were hundreds of time-stamps to check (the XML file is currently 7,774 lines long). So I wanted to run a program that would check the time-stamps automatically. If the instructor was speaking at a time-stamp, then I'd need to fix it. Otherwise, I'd assume that the time-stamp was okay.
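Whatever the detection method, the first step is getting at the audio around each time-stamp. Something like the sketch below (the one-second window width is my own guess) pulls a short mono WAV clip out of the video with ffmpeg:

```python
import subprocess


def extract_window(video_path, timestamp, out_wav, width=1.0):
    # Extract a short mono WAV clip centered on the time-stamp, so we
    # can ask whether anyone is speaking right at the cut point.
    start = max(timestamp - width / 2, 0)
    subprocess.run(
        ["ffmpeg", "-y",
         "-ss", str(start),
         "-i", video_path,
         "-t", str(width),
         "-ac", "1",        # mix down to mono
         "-ar", "16000",    # 16 kHz is plenty for speech
         out_wav],
        check=True,
    )
```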
My first thought was: check the amplitude of the sound wave. If it's loud, then the instructor is probably talking. Otherwise, it's probably a pause.
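As a sketch, assuming the one-second clips from above and 16-bit samples, that check is only a few lines (the threshold here is made up and would need tuning):

```python
import wave

import numpy as np


def mean_amplitude(wav_path):
    # Average absolute sample value of a 16-bit mono WAV clip.
    with wave.open(wav_path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float64)
    return np.mean(np.abs(samples))


# Naive rule: a loud window means someone is talking.
SILENCE_THRESHOLD = 500  # made-up cutoff; would need tuning per video


def looks_like_speech(wav_path):
    return mean_amplitude(wav_path) > SILENCE_THRESHOLD
```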
Here's the problem, though: a pause and a stretch of speech can produce sound waves with similar amplitude, even though one sounds like white noise and the other is a person speaking. Many of the videos had so much white noise in the background that the pause sound waves often had a greater average amplitude than the speaking sound waves.
In the next post, I’ll describe the solution I eventually found.