Cat translator (classifying spectrograms)

Sarah Packowski
4 min read · Apr 25, 2022


Cornell’s birding app, Merlin, uses spectrograms to identify bird song. See how to build a simple cat translator using a similar approach.

Download the complete sample here: Cat translator on GitHub

Watch an overview of the cat translator project

What’s a spectrogram?

A spectrogram is a picture of a sound.

Spectrogram of a cat’s meow

Here’s how to interpret the visualization:

  • Time moves forward from left to right
  • Pitch gets higher from bottom to top
  • Loudness increases with the brightness of the color
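This convention can even be checked numerically: for a sound whose pitch rises over time, the loudest frequency bin of each short-time Fourier transform frame climbs from the first column (earliest time) to the last (latest time). Here is a minimal NumPy-only sketch (no librosa needed; the chirp signal and frame sizes are illustrative, not from the sample):

```python
import numpy as np

sr = 22050
t = np.linspace( 0, 1, sr, endpoint=False )
# Chirp: instantaneous frequency sweeps upward from 300 Hz to ~1100 Hz
y = np.sin( 2 * np.pi * ( 300 + 400 * t ) * t )

def stft_mag( y, n_fft=1024, hop=256 ):
    # Rows = frequency bins (low at the bottom), columns = time frames (left to right)
    frames = [ y[i : i + n_fft] * np.hanning( n_fft )
               for i in range( 0, len(y) - n_fft, hop ) ]
    return np.abs( np.fft.rfft( np.array( frames ), axis=1 ) ).T

S = stft_mag( y )
first_peak = S[:, 0].argmax()   # loudest bin in the first (earliest) frame
last_peak = S[:, -1].argmax()   # loudest bin in the last (latest) frame
print( first_peak < last_peak ) # True: rising pitch moves the peak up the y-axis
```

Plotting `S` in decibels would give a picture like the spectrogram above, with the bright ridge sweeping up and to the right.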

Cat translation process

Step 1: Extract meow sound from audio

Python code sample:

from moviepy.editor import *
import re

fps = 44100

def getTrimmedAudio( video_filename ):
    # Pull the audio track out of the video
    audio = VideoFileClip( video_filename ).audio
    soundarr = audio.to_soundarray( fps=fps, nbytes=4 )
    # Find where the meow starts and ends (see below)
    start = getSoundStart( soundarr )
    end = getSoundEnd( soundarr, start )
    clip = audio.subclip( start, end )
    clip.duration = ( end - start )
    filename_new = re.sub( r"\.mp4$", ".wav", video_filename )
    clip.write_audiofile( filename_new, fps=fps, nbytes=4 )

In getTrimmedAudio, you can see the process:

  1. Audio is extracted from the video using moviepy
  2. The audio is converted to a NumPy array of samples (the soundarray)
  3. The start and end of the meow sound are found (see below)
  4. The audio is clipped from start to end
  5. The new subclip is saved to a .wav file

The start and end points are found using loudness as a guide:

def getSoundStart( soundarray ):
    soundmax = max( [ abs(min(soundarray[:,0])),
                      abs(max(soundarray[:,1])) ] )
    threshold = 0.3 * soundmax
    i = 0
    # Move forward from the beginning until either channel crosses the threshold
    while ( i < soundarray.shape[0] ) \
    and ( abs( soundarray[i,0] ) < threshold ) \
    and ( abs( soundarray[i,1] ) < threshold ):
        i += 1
    # Back up 2000 samples (~45 ms at 44100 Hz) of padding, then convert to seconds
    return ( i - 2000 ) / fps

def getSoundEnd( soundarray, start ):
    soundmax = max( [ abs(min(soundarray[:,0])),
                      abs(max(soundarray[:,1])) ] )
    threshold = 0.3 * soundmax
    i = int( start * fps )  # convert the start time (seconds) back to a sample index
    j = ( soundarray.shape[0] - 1 )
    # Move backwards from the end until either channel crosses the threshold
    while ( j > i ) \
    and ( abs( soundarray[j,0] ) < threshold ) \
    and ( abs( soundarray[j,1] ) < threshold ):
        j -= 1
    # Add 2000 samples (~45 ms) of padding, then convert to seconds
    return ( j + 2000 ) / fps
  • Both functions start by determining the maximum loudness (soundmax)
  • getSoundStart begins at the start of the audio clip and moves forward until the meow sound begins
  • getSoundEnd begins at the end of the audio clip and moves backwards until it reaches the point where the meow sound ends
  • A loudness threshold determines when the meow sound is happening: whenever the volume exceeds 30% of the maximum
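The same threshold logic can be sketched and checked on synthetic stereo audio, using NumPy's vectorized operations instead of the sample-by-sample loop (a sketch; the 0.3 ratio and 2000-sample padding match the functions above, and the function name is illustrative):

```python
import numpy as np

fps = 44100  # sample rate used throughout the article

def get_sound_start( soundarray, threshold_ratio=0.3, pad=2000 ):
    # Loudest absolute sample across both stereo channels
    soundmax = np.abs( soundarray ).max()
    threshold = threshold_ratio * soundmax
    # Index of the first sample where either channel crosses the threshold
    loud = np.where( np.abs( soundarray ).max( axis=1 ) >= threshold )[0]
    i = loud[0] if loud.size else soundarray.shape[0]
    # Back up 2000 samples of padding, clamped at 0, then convert to seconds
    return max( i - pad, 0 ) / fps

# Two seconds of near-silence with a loud burst starting at the 1-second mark
audio = np.full( ( 2 * fps, 2 ), 0.01 )
audio[ fps : fps + 4410, : ] = 1.0
print( get_sound_start( audio ) )  # just under 1.0 second
```

The clamp at 0 also covers the edge case where the meow starts within the first 2000 samples.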

The following image shows an example of identifying the meow sound in audio:

Identifying the meow sound in audio

Step 2: Generate a spectrogram of the meow sound

The following Python code is based on the sample at the bottom of this page from the librosa documentation.

import re
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def saveAsSpectrogram( wav_filename ):
    # Load the short meow sound as a time series array
    ts_arr, sr = librosa.load( wav_filename )
    # Short-time Fourier transform, then convert amplitude to decibels
    values_amp = np.abs( librosa.stft( ts_arr ) )
    values_db = librosa.amplitude_to_db( values_amp, ref=np.max )
    plt.figure( figsize=( 5, 5 ) )
    librosa.display.specshow( values_db )
    filename_new = re.sub( r"\.wav$", ".spec.png", wav_filename )
    plt.savefig( filename_new, bbox_inches='tight', pad_inches=0.0 )
  1. librosa.load reads the short meow sound from a file and returns a time series array, ts_arr
  2. Work with the data in ts_arr to get it into the correct format (see: librosa.stft and librosa.amplitude_to_db)
  3. Display the data as a spectrogram in a matplotlib figure (see: librosa.display.specshow)
  4. Save the figure to a file

Step 3: Classify spectrogram

If you have a model* trained to classify spectrograms of the cat meowing with different intents, you can use the model to classify a spectrogram like this:

import tensorflow as tf
import numpy as np

model = tf.keras.models.load_model( "trained-model" )
class_names = [ "feedme", "opendoor" ]

def classifySpectrogram( spec ):
    # Load the spectrogram image at the size the model expects
    img = tf.keras.utils.load_img( spec, target_size=(224,224) )
    img_arr = tf.keras.utils.img_to_array( img )
    img_arr = tf.expand_dims( img_arr, 0 )  # add a batch dimension
    predictions = model.predict( img_arr )
    # Softmax turns the raw outputs into probabilities
    scores = tf.nn.softmax( predictions[0] )
    top_class = class_names[ np.argmax( scores ) ]
    top_score = np.max( scores )
    confidence = str( round( 100 * top_score, 2 ) ) + "%"
    return top_class, confidence

Put that all together, and you can classify videos of the cat meowing:

Output from running sample code in a Python notebook
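The three steps can be chained into a single helper. The snippets above write files rather than returning filenames, so this sketch passes each step in as a function (the helper and wrapper names are illustrative, not from the sample):

```python
def translateMeow( video_filename, trim_audio, make_spectrogram, classify ):
    # Step 1: extract and trim the meow audio, returning the .wav filename
    wav_filename = trim_audio( video_filename )
    # Step 2: render the spectrogram, returning the .png filename
    spec_filename = make_spectrogram( wav_filename )
    # Step 3: classify the spectrogram image
    return classify( spec_filename )

# Example with stand-in steps (the real ones come from the snippets above):
label, confidence = translateMeow( "meow.mp4",
    lambda v: v.replace( ".mp4", ".wav" ),
    lambda w: w.replace( ".wav", ".spec.png" ),
    lambda s: ( "feedme", "97.5%" ) )
print( label, confidence )  # feedme 97.5%
```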

*Where’d that classification model come from?

The complete sample in GitHub includes training data and code for building and training a model that can classify sample cat meow spectrograms: Complete sample in GitHub

Cat translator web app

Conclusion

The use of spectrograms to visualize bird song is common among birders (even without AI applications). What sounds in your life could you build a model to recognize, classify, or analyze?
