Cat translator (classifying spectrograms)
Cornell’s birding app, Merlin, uses spectrograms to identify bird song. See how to build a simple cat translator using a similar approach.
Download the complete sample here: Cat translator on GitHub
What’s a spectrogram?
A spectrogram is a picture of a sound: a visual representation of the frequencies in an audio signal as they change over time.
Here’s how to interpret the visualization:
- Time moves forward from left to right
- Pitch gets higher from bottom to top
- Loudness increases with the brightness of the color
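To see all three dimensions at once, here's a small illustrative sketch (my own, not part of the sample) that synthesizes a two-second tone that rises in pitch while fading in volume, then plots its spectrogram. In the saved image, a bright band climbs from bottom-left to top-right, growing dimmer as it goes.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

sr = 22050
# A two-second tone that rises in pitch and fades out in volume
tone = librosa.chirp( fmin=220, fmax=3520, sr=sr, duration=2.0 )
tone = tone * np.linspace( 1.0, 0.1, tone.size )

# Same spectrogram pipeline used in Step 2 below
values_db = librosa.amplitude_to_db( np.abs( librosa.stft( tone ) ), ref=np.max )
librosa.display.specshow( values_db, sr=sr, x_axis='time', y_axis='hz' )
plt.colorbar( format='%+2.0f dB' )
plt.savefig( 'chirp.spec.png' )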
Cat translation process
Step 1: Extract meow sound from audio
Python code sample:
from moviepy.editor import *
import re

fps = 44100

def getTrimmedAudio( video_filename ):
    audio = VideoFileClip( video_filename ).audio
    soundarr = audio.to_soundarray( fps=fps, nbytes=4 )
    start = getSoundStart( soundarr )
    end = getSoundEnd( soundarr, start )
    clip = audio.subclip( start, end )
    clip.duration = ( end - start )
    filename_new = re.sub( r"\.mp4$", ".wav", video_filename )
    clip.write_audiofile( filename_new, fps=fps, nbytes=4 )
In getTrimmedAudio, you can see the process:
- Audio is extracted from the video using moviepy
- The samples are converted to a NumPy array, soundarr, by to_soundarray
- The start and end of the meow sound are found (see below)
- The audio is clipped from start to end
- The new subclip is saved to a .wav file
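For example, with a hypothetical recording named meow.mp4:

# Hypothetical usage: reads meow.mp4, writes the trimmed audio to meow.wav
getTrimmedAudio( "meow.mp4" )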
The start and end points are found using loudness as a guide:
def getSoundStart( soundarray ):
    # Peak amplitude across both stereo channels
    soundmax = max( abs( soundarray[:,0].min() ), abs( soundarray[:,0].max() ),
                    abs( soundarray[:,1].min() ), abs( soundarray[:,1].max() ) )
    threshold = 0.3 * soundmax
    # Scan forward until either channel gets louder than the threshold
    i = 0
    while ( i < soundarray.shape[0] ) \
            and ( abs( soundarray[i,0] ) < threshold ) \
            and ( abs( soundarray[i,1] ) < threshold ):
        i += 1
    # Back up 2000 samples of padding, then convert samples to seconds
    return max( i - 2000, 0 ) / fps

def getSoundEnd( soundarray, start ):
    # Peak amplitude across both stereo channels
    soundmax = max( abs( soundarray[:,0].min() ), abs( soundarray[:,0].max() ),
                    abs( soundarray[:,1].min() ), abs( soundarray[:,1].max() ) )
    threshold = 0.3 * soundmax
    # start is in seconds; convert it to a sample index for the scan bound
    i = int( start * fps )
    # Scan backward until either channel gets louder than the threshold
    j = soundarray.shape[0] - 1
    while ( j > i ) \
            and ( abs( soundarray[j,0] ) < threshold ) \
            and ( abs( soundarray[j,1] ) < threshold ):
        j -= 1
    # Add 2000 samples of padding, then convert samples to seconds
    return min( j + 2000, soundarray.shape[0] - 1 ) / fps
- Both functions start by determining the maximum loudness (soundmax)
- getSoundStart starts at the beginning of the audio clip and then moves forward until the meow sound begins
- getSoundEnd starts at the end of the audio clip and then moves backward until it reaches the point where the meow sound is still happening
- A loudness threshold is used to determine when the meow sound is happening: when the volume is louder than 30% of the maximum
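If you prefer vectorized NumPy over an explicit scan, an equivalent sketch (my rewrite, not the sample's code, assuming the same module-level fps) finds both bounds at once:

import numpy as np

def getSoundBounds( soundarray, pad=2000 ):
    # Per-sample loudness: the louder of the two channels
    loudness = np.abs( soundarray ).max( axis=1 )
    threshold = 0.3 * loudness.max()
    # Indices of every sample above the threshold
    loud = np.where( loudness >= threshold )[0]
    start = max( loud[0] - pad, 0 )
    end = min( loud[-1] + pad, len( loudness ) - 1 )
    # Convert sample indices to seconds
    return start / fps, end / fps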
The following image shows an example of identifying the meow sound in audio:
Step 2: Generate a spectrogram of the meow sound
The following Python code is based on a sample from the librosa documentation.
import re

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def saveAsSpectrogram( wav_filename ):
    # Load the trimmed meow as a time series array
    ts_arr, sr = librosa.load( wav_filename )
    # Short-time Fourier transform, then convert amplitude to decibels
    values_amp = np.abs( librosa.stft( ts_arr ) )
    values_db = librosa.amplitude_to_db( values_amp, ref=np.max )
    # Render the spectrogram and save it as an image
    plt.figure( figsize=( 5, 5 ) )
    librosa.display.specshow( values_db )
    filename_new = re.sub( r"\.wav$", ".spec.png", wav_filename )
    plt.savefig( filename_new, bbox_inches='tight', pad_inches=0.0 )
- librosa.load reads the short meow sound from a file and returns a time series array, ts_arr, along with its sampling rate, sr
- Work with the data in ts_arr to get it into the correct format (see: librosa.stft and librosa.amplitude_to_db)
- Display the data as a spectrogram in a matplotlib figure (see: librosa.display.specshow)
- Save the figure to a file
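One design note: the figure is saved with no axis labels, ticks, or colorbar, and with bbox_inches='tight' and pad_inches=0.0, presumably so the saved image contains only spectrogram pixels for the classifier to look at. Calling it on a hypothetical file:

# Hypothetical usage: reads meow.wav, writes the image to meow.spec.png
saveAsSpectrogram( "meow.wav" )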
Step 3: Classify spectrogram
If you have a model* trained to classify spectrograms of the cat meowing with different intents, you can use the model to classify a spectrogram like this:
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model( "trained-model" )
class_names = [ "feedme", "opendoor" ]

def classifySpectrogram( spec ):
    # Load the spectrogram image at the size the model expects
    img = tf.keras.utils.load_img( spec, target_size=(224,224) )
    img_arr = tf.keras.utils.img_to_array( img )
    # Add a batch dimension: (224, 224, 3) -> (1, 224, 224, 3)
    img_arr = tf.expand_dims( img_arr, 0 )
    predictions = model.predict( img_arr )
    # Softmax turns the raw outputs into class probabilities
    scores = tf.nn.softmax( predictions[0] )
    top_class = class_names[ np.argmax( scores ) ]
    top_score = np.max( scores )
    confidence = str( round( 100 * top_score, 2 ) ) + "%"
    return top_class, confidence
Put that all together, and you can classify videos of the cat meowing:
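Here's a minimal end-to-end sketch of that pipeline, assuming the imports and functions from the three steps above are in scope and using a hypothetical filename:

def translateCatVideo( video_filename ):
    # Step 1: extract and trim the meow audio to a .wav file
    getTrimmedAudio( video_filename )
    wav_filename = re.sub( r"\.mp4$", ".wav", video_filename )
    # Step 2: render the meow as a spectrogram image
    saveAsSpectrogram( wav_filename )
    spec_filename = re.sub( r"\.wav$", ".spec.png", wav_filename )
    # Step 3: classify the spectrogram
    return classifySpectrogram( spec_filename )

intent, confidence = translateCatVideo( "meow.mp4" )
print( intent, confidence )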
*Where’d that classification model come from?
The complete sample on GitHub includes training data and code for building and training a model that can classify sample cat meow spectrograms: Complete sample on GitHub
Conclusion
The use of spectrograms for visualizing bird song is common among birders (even without AI applications). What sounds in your life could you build a model to recognize, classify, or analyze?