[Header image: sound waves transforming into readable text, symbolizing AI-powered speech-to-text conversion.]

Building a High-Accuracy Speech-to-Text Model: A Step-by-Step Tutorial for Beginners

Introduction

In this tutorial, we’ll walk you through building a speech-to-text (STT) model using Python and deep learning libraries. This guide is designed for beginners, so no prior experience with machine learning is required. We’ll use accessible tools and datasets to ensure you can follow along on your own computer.

What You’ll Learn
  • Setting up the development environment
  • Downloading and preparing a speech dataset
  • Preprocessing audio and text data
  • Building a neural network model for speech recognition
  • Training the model
  • Evaluating model performance
  • Running inference on new audio files
Prerequisites
  • Basic knowledge of Python programming
  • A computer with internet access
Setting Up the Environment
1. Install Python

Ensure you have Python 3.7 or higher installed. You can download it from the official website (python.org).

2. Install Required Libraries

We’ll use the following Python libraries:

  • numpy
  • pandas
  • matplotlib
  • scikit-learn (for splitting the dataset)
  • librosa (for audio processing)
  • PyTorch and torchaudio (for building and training the model)

Open your command prompt or terminal and run:

pip install numpy pandas matplotlib scikit-learn librosa torch torchaudio
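
To verify the installation, check that the libraries import and report their versions (the exact version numbers you see will depend on when you install):

import torch, torchaudio, librosa, sklearn
print(torch.__version__, torchaudio.__version__, librosa.__version__)
print("CUDA available:", torch.cuda.is_available())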
Downloading and Preparing the Dataset
1. Download the Dataset

We’ll use a subset of the LibriSpeech dataset, which contains English speech and corresponding text transcripts.

For this tutorial, we’ll use the “train-clean-100” subset (approximately 6 GB).

Download Link: LibriSpeech train-clean-100 (the train-clean-100.tar.gz archive from https://www.openslr.org/12)

2. Extract the Dataset

Extract the downloaded .tar.gz file to a directory on your computer. Note the path; we’ll need it later.
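
If you would rather extract the archive from Python than with a separate tool, the standard-library tarfile module works; the file name and destination below are placeholders, so adjust them to wherever you downloaded the archive:

import tarfile

# Extracts into ./data/LibriSpeech/train-clean-100/...
with tarfile.open('train-clean-100.tar.gz', 'r:gz') as tar:
    tar.extractall('data')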

3. Directory Structure

After extraction, the dataset directory structure should look like this:

LibriSpeech
└── train-clean-100
    ├── 19
    │   └── 198
    │       ├── 19-198-0000.flac
    │       ├── 19-198-0001.flac
    │       └── ...
    ├── 26
    ├── 27
    └── ...

Each chapter directory contains the .flac audio files together with a single .trans.txt file that lists the transcript for every utterance in that chapter.
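
Each line of a .trans.txt file pairs an utterance ID with its uppercase transcript. The wording below is illustrative, not the actual dataset content; only the format matters:

19-198-0000 EXAMPLE TRANSCRIPT OF THE FIRST UTTERANCE
19-198-0001 EXAMPLE TRANSCRIPT OF THE SECOND UTTERANCE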

Data Preprocessing

Create a new Python script or Jupyter notebook and follow along.

1. Import Libraries
import os
import pandas as pd
import numpy as np
import librosa
import torch
import torchaudio
from torch.utils.data import Dataset, DataLoader
from torchaudio.transforms import MelSpectrogram
from sklearn.model_selection import train_test_split
2. Collect File Paths and Transcripts

We need to create a DataFrame that contains the paths to the audio files and their corresponding transcripts.

def create_manifest(data_path):
    transcripts = []
    for root, dirs, files in os.walk(data_path):
        for file in files:
            if file.endswith('.trans.txt'):
                with open(os.path.join(root, file), 'r') as f:
                    lines = f.readlines()
                    for line in lines:
                        parts = line.strip().split(' ')
                        transcript = ' '.join(parts[1:]).lower()
                        audio_file = os.path.join(root, parts[0] + '.flac')
                        transcripts.append({'audio_path': audio_file, 'transcript': transcript})
    return pd.DataFrame(transcripts)

# Replace this with the actual path to your extracted dataset
data_path = 'path/to/LibriSpeech/train-clean-100'
manifest_df = create_manifest(data_path)
3. Inspect the Data
print(manifest_df.head())
print(f"Total samples: {len(manifest_df)}")
4. Split into Training and Validation Sets
train_df, val_df = train_test_split(manifest_df, test_size=0.1, random_state=42)
print(f"Training samples: {len(train_df)}, Validation samples: {len(val_df)}")
5. Define Character Mapping

We’ll work at the character level for simplicity. Index 0 is reserved for the CTC blank token (used by the loss function later), and a special <SPACE> token stands in for the space character.

char_map_str = """
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
j 10
k 11
l 12
m 13
n 14
o 15
p 16
q 17
r 18
s 19
t 20
u 21
v 22
w 23
x 24
y 25
z 26
' 27
<SPACE> 28
"""
char_map = {}
index_map = {}
for line in char_map_str.strip().split('\n'):
    ch, index = line.split()
    char_map[ch] = int(index)
    index_map[int(index)] = ch
index_map[char_map['<SPACE>']] = ' '  # decode <SPACE> back to a real space
6. Data Preprocessing Functions
def text_to_int_sequence(text):
    # Spaces map to the <SPACE> token; any character not in the map also falls back to <SPACE>
    return [char_map['<SPACE>'] if c == ' ' else char_map.get(c, char_map['<SPACE>']) for c in text]

def int_sequence_to_text(seq):
    # Skip index 0 (the CTC blank / padding) and map the remaining indices back to characters
    return ''.join(index_map[i] for i in seq if i != 0)
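
A quick round trip shows the mapping at work (encode to indices, then decode back):

seq = text_to_int_sequence("hello world")
print(seq)                        # list of character indices
print(int_sequence_to_text(seq))  # hello world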
Creating a Custom Dataset
1. Define the Dataset Class
class SpeechDataset(Dataset):
    def __init__(self, df, char_map, transform=None):
        self.df = df
        self.char_map = char_map
        self.transform = transform
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        audio_path = self.df.iloc[idx]['audio_path']
        transcript = self.df.iloc[idx]['transcript']
        
        # Load audio
        waveform, sample_rate = torchaudio.load(audio_path)
        
        # Resample to 16kHz
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)
            sample_rate = 16000
        
        # Get spectrogram
        if self.transform:
            spectrogram = self.transform(waveform)
        else:
            spectrogram = waveform
        
        # Convert transcript to int sequence
        transcript_seq = text_to_int_sequence(transcript)
        transcript_seq = torch.Tensor(transcript_seq).int()
        
        return spectrogram.squeeze(0).transpose(0, 1), transcript_seq

2. Define Collate Function for DataLoader

Because audio and transcript lengths vary, we need to pad them in batches.

def collate_fn(batch):
    spectrograms = []
    transcript_seqs = []
    input_lengths = []
    target_lengths = []
    
    for (spectrogram, transcript_seq) in batch:
        spectrograms.append(spectrogram)
        transcript_seqs.append(transcript_seq)
        input_lengths.append(spectrogram.shape[0])
        target_lengths.append(len(transcript_seq))
    
    spectrograms = torch.nn.utils.rnn.pad_sequence(spectrograms, batch_first=True)
    transcript_seqs = torch.nn.utils.rnn.pad_sequence(transcript_seqs, batch_first=True)
    
    return spectrograms, transcript_seqs, input_lengths, target_lengths
3. Create Data Loaders
transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)

train_dataset = SpeechDataset(train_df.reset_index(drop=True), char_map, transform=transform)
val_dataset = SpeechDataset(val_df.reset_index(drop=True), char_map, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False, collate_fn=collate_fn)
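
It’s worth pulling one batch to confirm the shapes before training (the time dimension varies from batch to batch):

spectrograms, transcripts, input_lengths, target_lengths = next(iter(train_loader))
print(spectrograms.shape)   # (batch, time, n_mels), e.g. torch.Size([8, T, 128])
print(transcripts.shape)    # (batch, longest transcript in the batch)
print(input_lengths[:3], target_lengths[:3])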
Building the Model

We’ll build a simple recurrent neural network (a bidirectional LSTM) and train it with Connectionist Temporal Classification (CTC) loss.

1. Define the Model
import torch.nn as nn
import torch.nn.functional as F

class SpeechRecognitionModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SpeechRecognitionModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, output_size)
    
    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.fc(x)
        x = F.log_softmax(x, dim=2)
        return x
2. Instantiate the Model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

input_size = 128  # Number of Mel features
hidden_size = 256
output_size = len(char_map) + 1  # Number of characters plus the CTC blank (index 0)

model = SpeechRecognitionModel(input_size, hidden_size, output_size).to(device)
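
Before training, a quick forward pass with random data confirms the output shape; with the character map above, output_size is 29 (28 characters plus the blank):

dummy = torch.randn(2, 50, input_size).to(device)  # a fake batch: 2 clips, 50 frames, 128 Mel bins
print(model(dummy).shape)                          # torch.Size([2, 50, 29])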
Training the Model
1. Define Loss Function and Optimizer

We’ll use CTC loss, which is suitable for sequence-to-sequence models without alignment.

criterion = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
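
To see the tensor shapes nn.CTCLoss expects, here is a toy call on random values (the numbers carry no meaning; note the (T, N, C) layout and that label 0 is reserved for the blank):

log_probs = torch.randn(50, 2, output_size).log_softmax(2)              # (T, N, C)
toy_targets = torch.randint(1, output_size, (2, 20), dtype=torch.long)  # (N, S), labels 1..28
print(criterion(log_probs, toy_targets, (50, 50), (20, 20)).item())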
2. Training Loop
num_epochs = 5  # Increase this number for better performance

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for i, (inputs, targets, input_lengths, target_lengths) in enumerate(train_loader):
        inputs = inputs.to(device)
        targets = targets.to(device)
        
        optimizer.zero_grad()
        outputs = model(inputs)
        
        # CTC Loss expects (T, N, C)
        outputs = outputs.permute(1, 0, 2)
        
        loss = criterion(outputs, targets, input_lengths, target_lengths)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
        if i % 10 == 0:
            print(f"Epoch {epoch+1}/{num_epochs}, Step {i+1}/{len(train_loader)}, Loss: {loss.item():.4f}")
    
    print(f"Epoch {epoch+1} completed with average loss: {running_loss/len(train_loader):.4f}")
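
Training takes a while even for five epochs, so it’s worth saving the learned weights once the loop finishes (the file name here is just a placeholder):

torch.save(model.state_dict(), 'stt_model.pth')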
Evaluating the Model
1. Validation Loop
def cer(prediction, reference):
    # Character Error Rate: Levenshtein (edit) distance divided by the reference length
    prediction, reference = prediction.replace(' ', ''), reference.replace(' ', '')
    if not reference:
        return 0.0
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(prediction, 1):
        curr = [i]
        for j, r in enumerate(reference, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (p != r)))
        prev = curr
    return prev[-1] / len(reference)
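
A quick sanity check of the metric (spaces are ignored, so the two missing letters are the only errors):

print(cer("helo wrld", "hello world"))  # 2 edits over 10 reference characters = 0.2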

def greedy_decode(indices, blank=0):
    # Standard CTC greedy decoding: collapse repeated indices, then drop blanks
    decoded = []
    prev = None
    for idx in indices:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded

model.eval()
total_cer = 0.0
num_samples = 0

with torch.no_grad():
    for inputs, targets, input_lengths, target_lengths in val_loader:
        inputs = inputs.to(device)
        
        outputs = model(inputs)                      # (N, T, C)
        
        # Get the best path: most likely class at each time step, then CTC-decode it
        pred_indices = torch.argmax(outputs, dim=2)  # (N, T)
        for j in range(pred_indices.size(0)):
            pred_text = int_sequence_to_text(greedy_decode(pred_indices[j].tolist()))
            target_text = int_sequence_to_text(targets[j][:target_lengths[j]].tolist())
            total_cer += cer(pred_text, target_text)
            num_samples += 1

print(f"Average CER: {total_cer / num_samples:.4f}")
Running Inference on New Audio Files
1. Load and Preprocess Audio
def predict(audio_path, model, transform, device):
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)
    
    spectrogram = transform(waveform)
    spectrogram = spectrogram.squeeze(0).transpose(0, 1)  # (time, n_mels)
    spectrogram = spectrogram.unsqueeze(0).to(device)     # add batch dimension
    
    model.eval()
    with torch.no_grad():
        outputs = model(spectrogram)                 # (1, T, C)
        pred_indices = torch.argmax(outputs, dim=2)  # best class at each time step
        decoded = greedy_decode(pred_indices[0].tolist())
        return int_sequence_to_text(decoded)
2. Test the Model
test_audio_path = 'path/to/your/test_audio.flac'  # Provide path to a .flac audio file
predicted_text = predict(test_audio_path, model, transform, device)
print(f"Predicted Transcript: {predicted_text}")
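
If you run inference in a later session, reload the weights you saved after training (using the placeholder file name from the training section) before calling predict:

model.load_state_dict(torch.load('stt_model.pth', map_location=device))
model.eval()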
Conclusion

Congratulations! You’ve built a simple speech-to-text model from scratch. While this model may not achieve state-of-the-art performance due to limited data and training time, it serves as a foundational step into the world of speech recognition.

Next Steps
  • Increase Epochs: Train the model for more epochs to improve performance.
  • More Data: Use larger datasets for training.
  • Advanced Models: Explore more complex architectures like Transformers.
  • Fine-Tuning: Fine-tune pre-trained models available in libraries like Hugging Face Transformers (a minimal sketch follows this list).
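
For the last point, here is a minimal sketch of what using a pre-trained model looks like with the Hugging Face Transformers library. It assumes pip install transformers (plus soundfile or ffmpeg for audio decoding), and facebook/wav2vec2-base-960h is just one publicly available English checkpoint:

from transformers import pipeline

# Load a pre-trained English ASR model and transcribe one file
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
result = asr("path/to/your/test_audio.flac")  # same placeholder path as earlier
print(result["text"])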
Troubleshooting Tips
  • CUDA Errors: If you’re using a GPU, ensure that PyTorch is installed with CUDA support and your GPU drivers are up to date.
  • Memory Issues: Reduce the batch size if you encounter out-of-memory errors.
  • Accuracy Issues: Make sure your data preprocessing steps are correct, and consider tuning hyperparameters.