Cat translator (classifying spectrograms)
Cornell’s birding app, Merlin, uses spectrograms to identify bird song. See how to build a simple cat translator using a similar approach.
Download the complete sample here: Cat translator on GitHub
What’s a spectrogram?
A spectrogram is a picture of a sound: a visual representation of the frequencies in an audio signal as they change over time.
Here’s how to interpret the visualization:
- Time moves forward from left to right
- Pitch gets higher from bottom to top
- Loudness increases with the brightness of the color
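To see all three dimensions at once, here's a small illustrative sketch (my own, not part of the sample) that synthesizes a two-second tone that rises in pitch while fading in volume, then plots its spectrogram. In the saved image, a bright band climbs from bottom-left to top-right, growing dimmer as it goes.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

sr = 22050
# A two-second tone that rises in pitch and fades out in volume
tone = librosa.chirp( fmin=220, fmax=3520, sr=sr, duration=2.0 )
tone = tone * np.linspace( 1.0, 0.1, tone.size )

# Same spectrogram pipeline used in Step 2 below
values_db = librosa.amplitude_to_db( np.abs( librosa.stft( tone ) ), ref=np.max )
librosa.display.specshow( values_db, sr=sr, x_axis='time', y_axis='hz' )
plt.colorbar( format='%+2.0f dB' )
plt.savefig( 'chirp.spec.png' )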
Cat translation process
Step 1: Extract meow sound from audio
Python code sample:
from moviepy.editor import *
import re

fps = 44100

def getTrimmedAudio( video_filename ):
    audio = VideoFileClip( video_filename ).audio
    soundarr = audio.to_soundarray( fps=fps, nbytes=4 )
    start = getSoundStart( soundarr )
    end = getSoundEnd( soundarr, start )
    clip = audio.subclip( start, end )
    clip.duration = ( end - start )
    filename_new = re.sub( r"\.mp4$", ".wav", video_filename )
    clip.write_audiofile( filename_new, fps=fps, nbytes=4 )
In getTrimmedAudio, you can see the process:
- Audio is extracted from the video using moviepy
- The samples are converted to a NumPy array, soundarr, by to_soundarray
- The start and end of the meow sound are found (see below)
- The audio is clipped from start to end
- The new subclip is saved to a .wav file
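For example, with a hypothetical recording named meow.mp4:

# Hypothetical usage: reads meow.mp4, writes the trimmed audio to meow.wav
getTrimmedAudio( "meow.mp4" )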
The start and end points are found using loudness as a guide:
def getSoundStart( soundarray ):
    # Peak amplitude across both stereo channels
    soundmax = max( abs( soundarray[:,0].min() ), abs( soundarray[:,0].max() ),
                    abs( soundarray[:,1].min() ), abs( soundarray[:,1].max() ) )
    threshold = 0.3 * soundmax
    # Scan forward until either channel gets louder than the threshold
    i = 0
    while ( i < soundarray.shape[0] ) \
            and ( abs( soundarray[i,0] ) < threshold ) \
            and ( abs( soundarray[i,1] ) < threshold ):
        i += 1
    # Back up 2000 samples of padding, then convert samples to seconds
    return max( i - 2000, 0 ) / fps

def getSoundEnd( soundarray, start ):
    # Peak amplitude across both stereo channels
    soundmax = max( abs( soundarray[:,0].min() ), abs( soundarray[:,0].max() ),
                    abs( soundarray[:,1].min() ), abs( soundarray[:,1].max() ) )
    threshold = 0.3 * soundmax
    # start is in seconds; convert it to a sample index for the scan bound
    i = int( start * fps )
    # Scan backward until either channel gets louder than the threshold
    j = soundarray.shape[0] - 1
    while ( j > i ) \
            and ( abs( soundarray[j,0] ) < threshold ) \
            and ( abs( soundarray[j,1] ) < threshold ):
        j -= 1
    # Add 2000 samples of padding, then convert samples to seconds
    return min( j + 2000, soundarray.shape[0] - 1 ) / fps
- Both functions start by determining the maximum loudness (soundmax)
- getSoundStart starts at the beginning of the audio clip and then moves forward until the meow sound begins
- getSoundEnd starts at the end of the audio clip and then moves backward until it reaches the point where the meow sound is still happening
- A loudness threshold is used to determine when the meow sound is happening: when the volume is louder than 30% of the maximum
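If you prefer vectorized NumPy over an explicit scan, an equivalent sketch (my rewrite, not the sample's code, assuming the same module-level fps) finds both bounds at once:

import numpy as np

def getSoundBounds( soundarray, pad=2000 ):
    # Per-sample loudness: the louder of the two channels
    loudness = np.abs( soundarray ).max( axis=1 )
    threshold = 0.3 * loudness.max()
    # Indices of every sample above the threshold
    loud = np.where( loudness >= threshold )[0]
    start = max( loud[0] - pad, 0 )
    end = min( loud[-1] + pad, len( loudness ) - 1 )
    # Convert sample indices to seconds
    return start / fps, end / fps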
The following image shows an example of identifying the meow sound in audio:
Step 2: Generate a spectrogram of the meow sound
The following Python code is based on a sample from the librosa documentation.
import re

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def saveAsSpectrogram( wav_filename ):
    # Load the trimmed meow as a time series array
    ts_arr, sr = librosa.load( wav_filename )
    # Short-time Fourier transform, then convert amplitude to decibels
    values_amp = np.abs( librosa.stft( ts_arr ) )
    values_db = librosa.amplitude_to_db( values_amp, ref=np.max )
    # Render the spectrogram and save it as an image
    plt.figure( figsize=( 5, 5 ) )
    librosa.display.specshow( values_db )
    filename_new = re.sub( r"\.wav$", ".spec.png", wav_filename )
    plt.savefig( filename_new, bbox_inches='tight', pad_inches=0.0 )
- librosa.load reads the short meow sound from a file and returns a time series array, ts_arr, along with its sampling rate, sr
- Work with the data in ts_arr to get it into the correct format (see: librosa.stft and librosa.amplitude_to_db)
- Display the data as a spectrogram in a matplotlib figure (see: librosa.display.specshow)
- Save the figure to a file
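One design note: the figure is saved with no axis labels, ticks, or colorbar, and with bbox_inches='tight' and pad_inches=0.0, presumably so the saved image contains only spectrogram pixels for the classifier to look at. Calling it on a hypothetical file:

# Hypothetical usage: reads meow.wav, writes the image to meow.spec.png
saveAsSpectrogram( "meow.wav" )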
Step 3: Classify spectrogram
If you have a model* trained to classify spectrograms of the cat meowing with different intents, you can use the model to classify a spectrogram like this:
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model( "trained-model" )
class_names = [ "feedme", "opendoor" ]

def classifySpectrogram( spec ):
    # Load the spectrogram image at the size the model expects
    img = tf.keras.utils.load_img( spec, target_size=(224,224) )
    img_arr = tf.keras.utils.img_to_array( img )
    # Add a batch dimension: (224, 224, 3) -> (1, 224, 224, 3)
    img_arr = tf.expand_dims( img_arr, 0 )
    predictions = model.predict( img_arr )
    # Softmax turns the raw outputs into class probabilities
    scores = tf.nn.softmax( predictions[0] )
    top_class = class_names[ np.argmax( scores ) ]
    top_score = np.max( scores )
    confidence = str( round( 100 * top_score, 2 ) ) + "%"
    return top_class, confidence
Put that all together, and you can classify videos of the cat meowing:
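Here's a minimal end-to-end sketch of that pipeline, assuming the imports and functions from the three steps above are in scope and using a hypothetical filename:

def translateCatVideo( video_filename ):
    # Step 1: extract and trim the meow audio to a .wav file
    getTrimmedAudio( video_filename )
    wav_filename = re.sub( r"\.mp4$", ".wav", video_filename )
    # Step 2: render the meow as a spectrogram image
    saveAsSpectrogram( wav_filename )
    spec_filename = re.sub( r"\.wav$", ".spec.png", wav_filename )
    # Step 3: classify the spectrogram
    return classifySpectrogram( spec_filename )

intent, confidence = translateCatVideo( "meow.mp4" )
print( intent, confidence )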
*Where’d that classification model come from?
The complete sample on GitHub includes training data and code for building and training a model that can classify sample cat meow spectrograms: Complete sample on GitHub
Conclusion
The use of spectrograms for visualizing bird song is common among birders (even without AI applications). What sounds in your life could you build a model to recognize, classify, or analyze?