
Building a Speech-to-Text Model: A Step-by-Step Tutorial for Beginners
Introduction
In this tutorial, we’ll walk you through building a speech-to-text (STT) model using Python and deep learning libraries. This guide is designed for beginners, so no prior experience with machine learning is required. We’ll use accessible tools and datasets to ensure you can follow along on your own computer.
What You’ll Learn
- Setting up the development environment
- Downloading and preparing a speech dataset
- Preprocessing audio and text data
- Building a neural network model for speech recognition
- Training the model
- Evaluating model performance
- Running inference on new audio files
Prerequisites
- Basic knowledge of Python programming
- A computer with internet access
Setting Up the Environment
1. Install Python
Ensure you have Python 3.7 or higher installed. You can download it from the official website.
2. Install Required Libraries
We’ll use the following Python libraries:
- numpy
- pandas
- matplotlib
- librosa (for audio processing)
- scikit-learn (for splitting the data into training and validation sets)
- PyTorch and torchaudio (for building and training the model)
Open your command prompt or terminal and run:
pip install numpy pandas matplotlib librosa torch torchaudio scikit-learn
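If you want to confirm that everything installed correctly, a quick import check like the following (a minimal sketch) will do:
# Sanity check: import each library and print its version
import numpy, pandas, matplotlib, librosa, sklearn, torch, torchaudio

for module in (numpy, pandas, matplotlib, librosa, sklearn, torch, torchaudio):
    print(f"{module.__name__}: {module.__version__}")

# If you plan to train on a GPU, this should print True
print("CUDA available:", torch.cuda.is_available())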
Downloading and Preparing the Dataset
1. Download the Dataset
We’ll use a subset of the LibriSpeech dataset, which contains English speech and corresponding text transcripts.
For this tutorial, we’ll use the “train-clean-100” subset (approximately 6 GB).
Download Link: LibriSpeech train-clean-100 (available from the OpenSLR website)
2. Extract the Dataset
Extract the downloaded .tar.gz file to a directory on your computer. Note the path; we’ll need it later.
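Alternatively, torchaudio ships a LibriSpeech dataset helper that can download and extract the subset for you. This is an optional shortcut (the download is still several gigabytes, and the root directory "./data" below is just an example):
import torchaudio

# Downloads and extracts train-clean-100 into ./data/LibriSpeech/
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)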
3. Directory Structure
After extraction, the dataset directory structure should look like this:
LibriSpeech
└── train-clean-100
    ├── 19
    │   └── 198
    │       ├── 19-198-0000.flac
    │       ├── 19-198-0001.flac
    │       ├── ...
    │       └── 19-198.trans.txt
    ├── 26
    ├── 27
    └── ...
Each chapter subdirectory contains the audio files (.flac) and a corresponding transcript file (.trans.txt).
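Each .trans.txt file lists one utterance per line: the utterance ID, a space, and the transcript in uppercase. The lines below are illustrative placeholders rather than real LibriSpeech content:
19-198-0000 TEXT OF THE FIRST UTTERANCE IN UPPERCASE
19-198-0001 TEXT OF THE SECOND UTTERANCE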
Data Preprocessing
Create a new Python script or Jupyter notebook and follow along.
1. Import Libraries
import os
import pandas as pd
import numpy as np
import librosa
import torch
import torchaudio
from torch.utils.data import Dataset, DataLoader
from torchaudio.transforms import MelSpectrogram
from sklearn.model_selection import train_test_split
2. Collect File Paths and Transcripts
We need to create a DataFrame that contains the paths to the audio files and their corresponding transcripts.
def create_manifest(data_path):
    transcripts = []
    for root, dirs, files in os.walk(data_path):
        for file in files:
            if file.endswith('.trans.txt'):
                with open(os.path.join(root, file), 'r') as f:
                    lines = f.readlines()
                    for line in lines:
                        parts = line.strip().split(' ')
                        transcript = ' '.join(parts[1:]).lower()
                        audio_file = os.path.join(root, parts[0] + '.flac')
                        transcripts.append({'audio_path': audio_file, 'transcript': transcript})
    return pd.DataFrame(transcripts)
# Replace 'your_dataset_path' with the actual path
data_path = 'path/to/LibriSpeech/train-clean-100'
manifest_df = create_manifest(data_path)
3. Inspect the Data
print(manifest_df.head())
print(f"Total samples: {len(manifest_df)}")
4. Split into Training and Validation Sets
train_df, val_df = train_test_split(manifest_df, test_size=0.1, random_state=42)
print(f"Training samples: {len(train_df)}, Validation samples: {len(val_df)}")
5. Define Character Mapping
We’ll work at the character level for simplicity. Index 0 is reserved for the CTC blank token (used by the loss function later), so characters start at index 1. The <SPACE> token stands in for the space character so that it survives the whitespace-based parsing below.
char_map_str = """
' 1
<SPACE> 2
a 3
b 4
c 5
d 6
e 7
f 8
g 9
h 10
i 11
j 12
k 13
l 14
m 15
n 16
o 17
p 18
q 19
r 20
s 21
t 22
u 23
v 24
w 25
x 26
y 27
z 28
"""
char_map = {}
index_map = {}
for line in char_map_str.strip().split('\n'):
    ch, index = line.split()
    if ch == '<SPACE>':
        ch = ' '
    char_map[ch] = int(index)
    index_map[int(index)] = ch
index_map[0] = ''  # the CTC blank decodes to nothing
6. Data Preprocessing Functions
def text_to_int_sequence(text):
    # Map each character to its integer index; unknown characters fall back to the space index
    return [char_map.get(c, char_map[' ']) for c in text]

def int_sequence_to_text(seq):
    # Map integer indices back to characters (index 0, the CTC blank, maps to an empty string)
    return ''.join([index_map[int(i)] for i in seq])
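A quick round trip confirms the mapping behaves as expected (the exact indices in the comment follow from the mapping defined above):
seq = text_to_int_sequence("hello world")
print(seq)                        # [10, 7, 14, 14, 17, 2, 25, 17, 20, 14, 6]
print(int_sequence_to_text(seq))  # hello world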
Creating a Custom Dataset
1. Define the Dataset Class
class SpeechDataset(Dataset):
    def __init__(self, df, char_map, transform=None):
        self.df = df
        self.char_map = char_map
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        audio_path = self.df.iloc[idx]['audio_path']
        transcript = self.df.iloc[idx]['transcript']
        # Load audio
        waveform, sample_rate = torchaudio.load(audio_path)
        # Resample to 16kHz
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)
            sample_rate = 16000
        # Get spectrogram
        if self.transform:
            spectrogram = self.transform(waveform)
        else:
            spectrogram = waveform
        # Convert transcript to int sequence
        transcript_seq = text_to_int_sequence(transcript)
        transcript_seq = torch.Tensor(transcript_seq).int()
        return spectrogram.squeeze(0).transpose(0, 1), transcript_seq
2. Define Collate Function for DataLoader
Because audio and transcript lengths vary, we need to pad them in batches.
def collate_fn(batch):
    spectrograms = []
    transcript_seqs = []
    input_lengths = []
    target_lengths = []
    for (spectrogram, transcript_seq) in batch:
        spectrograms.append(spectrogram)
        transcript_seqs.append(transcript_seq)
        input_lengths.append(spectrogram.shape[0])
        target_lengths.append(len(transcript_seq))
    spectrograms = torch.nn.utils.rnn.pad_sequence(spectrograms, batch_first=True)
    transcript_seqs = torch.nn.utils.rnn.pad_sequence(transcript_seqs, batch_first=True)
    return spectrograms, transcript_seqs, input_lengths, target_lengths
3. Create Data Loaders
transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
train_dataset = SpeechDataset(train_df.reset_index(drop=True), char_map, transform=transform)
val_dataset = SpeechDataset(val_df.reset_index(drop=True), char_map, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False, collate_fn=collate_fn)
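Before moving on, it is worth pulling one batch to verify the shapes (a quick sanity check; the exact sizes depend on the clip lengths in the batch):
spectrograms, transcripts, input_lengths, target_lengths = next(iter(train_loader))
print(spectrograms.shape)   # (batch, max_time, 128 mel bins)
print(transcripts.shape)    # (batch, max_transcript_length)
print(input_lengths[:3], target_lengths[:3])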
Building the Model
We’ll build a simple Recurrent Neural Network (RNN) with Connectionist Temporal Classification (CTC) loss.
1. Define the Model
import torch.nn as nn
import torch.nn.functional as F

class SpeechRecognitionModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SpeechRecognitionModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, output_size)

    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.fc(x)
        x = F.log_softmax(x, dim=2)
        return x
2. Instantiate the Model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
input_size = 128 # Number of Mel features
hidden_size = 256
output_size = len(char_map) + 1 # Number of characters plus the CTC blank
model = SpeechRecognitionModel(input_size, hidden_size, output_size).to(device)
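A dummy forward pass is a cheap way to confirm the model wiring before committing to a long training run (the batch size and number of time steps below are arbitrary):
dummy = torch.randn(2, 100, input_size).to(device)  # (batch, time, mel bins)
with torch.no_grad():
    out = model(dummy)
print(out.shape)  # (2, 100, 29) with the character set above: (batch, time, characters + blank)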
Training the Model
1. Define Loss Function and Optimizer
We’ll use CTC loss, which is suitable for sequence-to-sequence models without alignment.
criterion = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
2. Training Loop
num_epochs = 5 # Increase this number for better performance
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for i, (inputs, targets, input_lengths, target_lengths) in enumerate(train_loader):
        inputs = inputs.to(device)
        targets = targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        # CTC Loss expects (T, N, C)
        outputs = outputs.permute(1, 0, 2)
        loss = criterion(outputs, targets, input_lengths, target_lengths)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 10 == 0:
            print(f"Epoch {epoch+1}/{num_epochs}, Step {i+1}/{len(train_loader)}, Loss: {loss.item():.4f}")
    print(f"Epoch {epoch+1} completed with average loss: {running_loss/len(train_loader):.4f}")
Evaluating the Model
1. Validation Loop
def cer(prediction, reference):
    # Character Error Rate: edit (Levenshtein) distance divided by the reference length
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(prediction, 1):
        curr = [i]
        for j, r in enumerate(reference, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (p != r)))
        prev = curr
    return prev[-1] / max(len(reference), 1)
model.eval()
total_cer = 0.0
num_samples = 0
with torch.no_grad():
    for inputs, targets, input_lengths, target_lengths in val_loader:
        inputs = inputs.to(device)
        outputs = model(inputs)  # (N, T, C)
        # Greedy decoding: take the most likely character at each time step,
        # then collapse repeated labels and drop CTC blanks (index 0)
        best_path = torch.argmax(outputs, dim=2)
        for j in range(best_path.size(0)):
            pred_indices = []
            prev = 0
            for idx in best_path[j].cpu().numpy().tolist():
                if idx != prev and idx != 0:
                    pred_indices.append(idx)
                prev = idx
            pred_text = int_sequence_to_text(pred_indices)
            target_text = int_sequence_to_text(targets[j][:int(target_lengths[j])].cpu().numpy().tolist())
            total_cer += cer(pred_text, target_text)
            num_samples += 1
print(f"Average CER: {total_cer / num_samples:.4f}")
Running Inference on New Audio Files
1. Load and Preprocess Audio
def predict(audio_path, model, transform, device):
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
    spectrogram = transform(waveform)
    spectrogram = spectrogram.squeeze(0).transpose(0, 1)  # (time, n_mels)
    spectrogram = spectrogram.unsqueeze(0).to(device)     # add a batch dimension
    model.eval()
    with torch.no_grad():
        outputs = model(spectrogram)  # (1, T, C)
    # Greedy decoding: collapse repeated labels and drop CTC blanks (index 0)
    best_path = torch.argmax(outputs, dim=2)[0].cpu().numpy().tolist()
    pred_indices = []
    prev = 0
    for idx in best_path:
        if idx != prev and idx != 0:
            pred_indices.append(idx)
        prev = idx
    return int_sequence_to_text(pred_indices)
2. Test the Model
test_audio_path = 'path/to/your/test_audio.flac' # Provide path to a .flac audio file
predicted_text = predict(test_audio_path, model, transform, device)
print(f"Predicted Transcript: {predicted_text}")
Conclusion
Congratulations! You’ve built a simple speech-to-text model from scratch. While this model may not achieve state-of-the-art performance due to limited data and training time, it serves as a foundational step into the world of speech recognition.
Next Steps
- Increase Epochs: Train the model for more epochs to improve performance.
- More Data: Use larger datasets for training.
- Advanced Models: Explore more complex architectures like Transformers.
- Fine-Tuning: Fine-tune pre-trained models available in libraries like Hugging Face Transformers (a minimal example of loading one of these models is sketched below).
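As a taste of that last point, the sketch below runs a publicly available pre-trained English ASR model through the Hugging Face pipeline API. It assumes you have installed the transformers package (pip install transformers), possibly along with ffmpeg for decoding the audio file; the checkpoint name is one commonly used model and is downloaded on first use.
from transformers import pipeline

# Load a pre-trained wav2vec2 model fine-tuned for English speech recognition
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

result = asr("path/to/your/test_audio.flac")
print(result["text"])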
Troubleshooting Tips
- CUDA Errors: If you’re using a GPU, ensure that PyTorch is installed with CUDA support and your GPU drivers are up to date.
- Memory Issues: Reduce the batch size if you encounter out-of-memory errors.
- Accuracy Issues: Make sure your data preprocessing steps are correct, and consider tuning hyperparameters.