Automatic Speech Recognition

By: Kevin Vegda
Posted: May 09, 2024

This is the first time I’ve worked on audio data for deep learning - here’s what I learned.

Deep learning has made some audio tasks that once required deep domain expertise almost trivial. To build an automatic speech recognition system with ML a little over a decade ago, you'd have needed a linguistics expert who understood things like phonemes, so you could procure manually labelled data with alignments between the audio and its transcription. That aligned data would then be fed to the ML model to train it to predict transcriptions.

Now we have algorithms like CTC (Connectionist Temporal Classification), which can automatically learn the alignment between an input sequence, such as the frames of an audio mel spectrogram or a line of handwriting, and the output text. This massively decreases the effort it takes to train ASR systems.
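
To make that concrete, here's a minimal sketch of how PyTorch's `nn.CTCLoss` is fed: frame-level log-probabilities on one side, unaligned integer-encoded transcripts on the other, and the loss marginalises over every possible alignment between them. The shapes and vocabulary size below are placeholders, not the ones from my model.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the ones used in this post.
T, N, C = 100, 4, 29   # time steps, batch size, classes (28 characters + blank)
S = 20                 # max transcript length in the batch

ctc_loss = nn.CTCLoss(blank=0)  # index 0 reserved for the CTC "blank" token

# Frame-level log-probabilities from the acoustic model: (T, N, C)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Unaligned integer-encoded transcripts: (N, S), values in 1..C-1
targets = torch.randint(1, C, (N, S), dtype=torch.long)

input_lengths = torch.full((N,), T, dtype=torch.long)               # frames per utterance
target_lengths = torch.randint(10, S + 1, (N,), dtype=torch.long)   # chars per transcript

# CTC sums over all valid alignments between the T frames and each transcript
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```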

I trained one such system in PyTorch using two networks: a CNN that encodes the audio, represented as mel spectrograms (a time-frequency picture of the signal), and an RNN that decodes that encoded representation into text. CTC loss gives us combined decoding and alignment. It's quite the feat.
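
For reference, here's roughly what that kind of CNN + RNN stack looks like in PyTorch. This is a simplified sketch with layer counts and sizes of my own choosing, not the exact architecture I trained: a couple of convolutions over the spectrogram, a bidirectional GRU over time, and a linear layer producing per-frame character log-probabilities for CTC.

```python
import torch
import torch.nn as nn

class SpeechRecognizer(nn.Module):
    """Simplified CNN + RNN acoustic model over mel spectrograms.

    An illustrative sketch, not the exact model trained in this post;
    layer counts and sizes are placeholders.
    """

    def __init__(self, n_mels=128, n_classes=29, hidden=256):
        super().__init__()
        # CNN over the (frequency, time) spectrogram "image"
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(32 * (n_mels // 2), hidden)
        # Bidirectional GRU over the time dimension
        self.rnn = nn.GRU(hidden, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):            # spec: (batch, 1, n_mels, time)
        x = self.cnn(spec)              # (batch, 32, n_mels//2, ~time//2)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time, features)
        x = self.proj(x)
        x, _ = self.rnn(x)
        x = self.classifier(x)          # (batch, time, n_classes)
        return x.log_softmax(dim=2)     # frame-level log-probs for CTC
```

Note that this returns (batch, time, classes), while `nn.CTCLoss` expects (time, batch, classes), so the output gets permuted before computing the loss.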

I trained my model for ~1.5 hours on an RTX 3090 with a fraction of the LibriSpeech data available through Torchaudio. The model went through 6 epochs in that time and was already producing something resembling the correct words in the transcriptions. A few more epochs and we'd have a pretty decent system. And all this with greedy decoding, which isn't even the best approach, quality-wise. Beam search over candidate transcriptions is supposed to give much better results and synergises well with CTC loss. That's where we're going next!
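
For completeness, greedy (best-path) decoding is only a few lines: take the argmax class at every frame, collapse consecutive repeats, and drop blanks. The `charmap` below is a hypothetical index-to-character table, not the one from my training code. Beam search, by contrast, keeps the top-k partial transcriptions at each frame and merges prefixes that collapse to the same string, which is why it pairs so well with CTC.

```python
import torch

def greedy_ctc_decode(log_probs, charmap, blank=0):
    """Greedy (best-path) CTC decoding.

    log_probs: (time, n_classes) frame-level log-probabilities.
    charmap:   list mapping class index -> character (hypothetical ordering).
    """
    best_path = log_probs.argmax(dim=1).tolist()   # most likely class per frame
    decoded, prev = [], blank
    for idx in best_path:
        # collapse repeated predictions, then drop blanks
        if idx != prev and idx != blank:
            decoded.append(charmap[idx])
        prev = idx
    return "".join(decoded)
```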