Audio Classification

This example shows how to build an audio classifier with Ludwig, mapping raw audio recordings to discrete categories such as emotion, sound event, or spoken intent.

Audio classification covers a wide range of real-world applications: detecting the emotional state of a speaker, recognizing environmental sounds (glass breaking, dog barking, sirens), identifying music genres, or classifying spoken commands in a voice interface. Ludwig handles the full pipeline — loading audio files, computing spectral features, training an encoder, and predicting the output category — from a single config file.

We'll use the EMO-DB (Berlin Database of Emotional Speech) dataset. EMO-DB contains 535 German-language utterances recorded by 10 professional actors under controlled studio conditions. Each utterance is labeled with one of seven emotions: anger, boredom, disgust, fear, happiness, neutral, or sadness. Despite its small size, EMO-DB is a widely used benchmark for speech emotion recognition and is large enough to demonstrate the full Ludwig workflow.

Dataset

The dataset contains the following columns:

| column | type | description |
|---|---|---|
| audio | audio | Path to the WAV audio file of the utterance |
| emotion | category | Target emotion: anger, boredom, disgust, fear, happiness, neutral, sadness |
| gender | category | Speaker gender: male or female |
| age | number | Speaker age in years |

Sample rows:

| audio | emotion | gender | age |
|---|---|---|---|
| 03a01Wa.wav | anger | male | 31 |
| 08b10Lb.wav | boredom | female | 34 |
| 09a07Eb.wav | disgust | female | 21 |
| 14b06Aa.wav | anger | male | 26 |
| 12a01Fb.wav | happiness | female | 32 |

Download Dataset

Downloads the dataset and writes emodb.csv plus the audio files to the current directory.

ludwig datasets download emodb

Downloads the EMO-DB dataset into a pandas DataFrame.

from ludwig.datasets import emodb

# With split=True, load() returns separate train, test, and validation DataFrames
train_df, test_df, val_df = emodb.load(split=True)

Called without split=True, emodb.load() returns a single DataFrame containing all columns listed above plus a split column (0 = train, 1 = validation, 2 = test).
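As an illustration of how a split column partitions the data, here is a minimal pandas sketch. It assumes Ludwig's usual encoding (0 = train, 1 = validation, 2 = test). The toy DataFrame and the helper name split_by_column are hypothetical and stand in for the real EMO-DB frame.

```python
import pandas as pd

# Assumed Ludwig split encoding: 0 = train, 1 = validation, 2 = test.
SPLIT_CODES = {"train": 0, "validation": 1, "test": 2}


def split_by_column(df: pd.DataFrame, split_col: str = "split") -> dict:
    """Return one DataFrame per split, with the split column dropped."""
    return {
        name: df[df[split_col] == code].drop(columns=[split_col]).reset_index(drop=True)
        for name, code in SPLIT_CODES.items()
    }


# Toy stand-in for the EMO-DB DataFrame (file names and labels from the sample rows).
toy_df = pd.DataFrame(
    {
        "audio": ["03a01Wa.wav", "08b10Lb.wav", "09a07Eb.wav", "14b06Aa.wav", "12a01Fb.wav"],
        "emotion": ["anger", "boredom", "disgust", "anger", "happiness"],
        "split": [0, 0, 1, 2, 0],  # invented assignment, for illustration only
    }
)

splits = split_by_column(toy_df)
print({name: len(part) for name, part in splits.items()})
```

Checking the per-split sizes this way is a quick guard against an empty validation or test set before training starts.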

Train

Define Ludwig Config

The Ludwig config declares the machine learning task. It tells Ludwig which columns to use as input and what to predict.

For this example we predict emotion from the raw audio signal using a stacked CNN encoder, which applies a series of 1-D convolutional layers over the mel-spectrogram representation of the audio.

With config.yaml:

input_features:
  - name: audio
    type: audio
    encoder:
      type: stacked_cnn
output_features:
  - name: emotion
    type: category
trainer:
  epochs: 20

With config defined in a Python dict:

config = {
    "input_features": [
        {
            "name": "audio",
            "type": "audio",
            "encoder": {
                "type": "stacked_cnn",  # 1-D CNN over mel-spectrogram frames
            },
        }
    ],
    "output_features": [
        {
            "name": "emotion",
            "type": "category",  # 7-class classification
        }
    ],
    "trainer": {
        "epochs": 20,
    },
}

Create and Train a Model

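With the CLI:

ludwig train --config config.yaml --dataset "ludwig://emodb"

With the Python API:

import logging
from ludwig.api import LudwigModel
from ludwig.datasets import emodb

# split=True returns separate train, test, and validation DataFrames
train_df, test_df, val_df = emodb.load(split=True)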

# Construct Ludwig model from the config dictionary
model = LudwigModel(config, logging_level=logging.INFO)

# Train the model
results = model.train(
    training_set=train_df,
    validation_set=val_df,
    test_set=test_df,
)
train_stats = results.train_stats
output_directory = results.output_directory

Training produces a model checkpoint directory under results/. Ludwig automatically computes a mel-spectrogram from each WAV file, runs the stacked CNN encoder, and optimizes the cross-entropy loss for the 7-way emotion classification.
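To make the training objective concrete, the following sketch computes softmax cross-entropy for a single 7-class example in plain Python. It illustrates the loss itself and is not Ludwig's implementation; the logit values are made up. One sanity check follows from it: a model that assigns equal probability to all seven emotions has a loss of ln 7 ≈ 1.946, so the training loss should start near that value and fall below it.

```python
import math

EMOTIONS = ["anger", "boredom", "disgust", "fear", "happiness", "neutral", "sadness"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


target = EMOTIONS.index("fear")

# Uniform scores: the model has no preference among the 7 classes.
uniform_loss = cross_entropy([0.0] * 7, target)

# Made-up scores that favor the correct class.
confident_logits = [0.1, -0.3, 0.0, 2.5, 0.2, -0.1, 0.4]
confident_loss = cross_entropy(confident_logits, target)

print(f"uniform loss:   {uniform_loss:.4f}")
print(f"confident loss: {confident_loss:.4f}")
```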

Evaluate

Generates predictions and performance statistics for the held-out test set.

With the CLI:

ludwig evaluate \
    --model_path results/experiment_run/model \
    --dataset "ludwig://emodb" \
    --split test \
    --output_directory test_results

With the Python API:

# Generates predictions and performance statistics for the test set
test_stats, predictions, output_directory = model.evaluate(
    test_df,
    collect_predictions=True,
    collect_overall_stats=True,
)

Visualize Metrics

Visualize the confusion matrix to see which emotions are most often confused with each other.

With the CLI:

ludwig visualize \
    --visualization confusion_matrix \
    --ground_truth_metadata results/experiment_run/model/training_set_metadata.json \
    --test_statistics test_results/test_statistics.json \
    --output_directory visualizations \
    --file_format png

With the Python API:

from ludwig.visualize import confusion_matrix

confusion_matrix(
    [test_stats],
    model.training_set_metadata,
    "emotion",
    top_n_classes=[7],
    model_names=[""],
    normalize=True,
)
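
For intuition about what a normalized confusion matrix shows, here is a small pure-Python sketch that counts (true, predicted) pairs and divides each row by the number of true examples of that class. The labels are invented for illustration, and treating normalize=True as row normalization is an assumption here rather than a description of Ludwig's internals.

```python
from collections import Counter


def confusion_matrix_rows(y_true, y_pred, labels):
    """Count matrix: rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def normalize_rows(matrix):
    """Divide each row by its sum so entries read as per-class rates."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


labels = ["anger", "fear", "happiness"]
# Invented labels, for illustration only.
y_true = ["anger", "anger", "anger", "anger", "fear", "fear", "happiness", "happiness"]
y_pred = ["anger", "anger", "anger", "happiness", "fear", "anger", "happiness", "anger"]

cm = confusion_matrix_rows(y_true, y_pred, labels)
cm_norm = normalize_rows(cm)

for label, row in zip(labels, cm_norm):
    print(f"{label:>10}: " + "  ".join(f"{v:.2f}" for v in row))
```

The diagonal of the normalized matrix is the per-class recall; large off-diagonal entries point at the emotion pairs the model confuses most.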

Visualize learning curves to see how loss and accuracy evolved during training.

With the CLI:

ludwig visualize \
    --visualization learning_curves \
    --ground_truth_metadata results/experiment_run/model/training_set_metadata.json \
    --training_statistics results/experiment_run/training_statistics.json \
    --file_format png \
    --output_directory visualizations

With the Python API:

from ludwig.visualize import learning_curves

learning_curves(train_stats, output_feature_name="emotion")

Make Predictions on New Audio

Once the model is trained, generate predictions for new audio files.

Create a new_audio.csv file containing the paths to the audio files you want to classify:

audio
/path/to/recording1.wav
/path/to/recording2.wav
/path/to/recording3.wav

With the CLI:

ludwig predict \
    --model_path results/experiment_run/model \
    --dataset new_audio.csv \
    --output_directory predictions

With the Python API:

import pandas as pd

new_audio = pd.DataFrame(
    {
        "audio": [
            "/path/to/recording1.wav",
            "/path/to/recording2.wav",
            "/path/to/recording3.wav",
        ]
    }
)

predictions, output_directory = model.predict(new_audio)
print(predictions[["emotion_predictions", "emotion_probability"]])

Prediction outputs are written to predictions/predictions.parquet. For each audio file Ludwig outputs the predicted emotion label and the probability for each of the 7 classes.
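
If the recordings live in one directory, the new_audio.csv file described above can be generated rather than written by hand. The sketch below uses only the standard library; the function name write_audio_csv and the directory layout are hypothetical.

```python
import csv
from pathlib import Path


def write_audio_csv(audio_dir, csv_path, pattern="*.wav"):
    """Write a one-column CSV (header: audio) listing matching files.

    Returns the number of audio paths written.
    """
    paths = sorted(Path(audio_dir).glob(pattern))
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["audio"])  # must match the input feature name in the config
        for p in paths:
            writer.writerow([str(p.resolve())])  # absolute paths avoid working-directory surprises
    return len(paths)
```

Calling write_audio_csv("/path/to/recordings", "new_audio.csv") produces a file in the same shape as the example above, ready to pass to ludwig predict or to load with pandas.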

Tips

Audio Preprocessing

Ludwig's audio feature automatically computes spectrogram features before passing them to the encoder. You can control the feature type and parameters explicitly:

input_features:
  - name: audio
    type: audio
    encoder:
      type: stacked_cnn
    preprocessing:
      audio_feature:
        type: fbank         # mel-filterbank energies
        num_filter_bands: 80
        window_length_in_s: 0.025
        window_shift_in_s: 0.010
      audio_file_length_limit_in_s: 5.0  # clip/pad all files to 5 seconds
      norm: per_file

Available audio_feature types:

  • fbank: mel-filterbank energies (recommended for speech)
  • stft: short-time Fourier transform magnitude
  • stft_phase: STFT magnitude + phase
  • group_delay: group delay features
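
The window and shift settings determine how many time frames the encoder sees. As a rough sketch, assuming 16 kHz audio and simple framing without edge padding (Ludwig's exact framing may differ by a frame or two), the feature matrix size for the preprocessing config above can be estimated as follows. The function name and the sample rate are assumptions for illustration.

```python
def estimate_feature_shape(
    length_limit_s=5.0,
    window_length_s=0.025,
    window_shift_s=0.010,
    num_filter_bands=80,
    sample_rate=16000,  # assumed; use your recordings' actual rate
):
    """Estimate (num_frames, num_filter_bands) for fixed-length audio."""
    num_samples = round(length_limit_s * sample_rate)
    window = round(window_length_s * sample_rate)
    shift = round(window_shift_s * sample_rate)
    if num_samples < window:
        return 0, num_filter_bands
    # One frame for the first full window, then one per additional shift.
    num_frames = 1 + (num_samples - window) // shift
    return num_frames, num_filter_bands


frames, bands = estimate_feature_shape()
print(frames, bands)
```

Under these assumptions a 5-second clip becomes 498 frames of 80 values each. Doubling window_shift_in_s to 0.020 roughly halves the sequence length, which reduces the memory the encoder needs.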

Encoder Options

| Encoder | Description | When to use |
|---|---|---|
| stacked_cnn | 1-D CNNs over time frames | Fast baseline, works well with mel-spectrogram |
| rnn | Bidirectional LSTM | Good for variable-length sequences |
| transformer | Self-attention over frames | Best accuracy when data is sufficient |
| parallel_cnn | Multiple CNN filter sizes in parallel | Captures multi-scale temporal patterns |

For small datasets like EMO-DB, stacked_cnn is a practical choice. For larger datasets, consider the transformer encoder:

input_features:
  - name: audio
    type: audio
    encoder:
      type: transformer
      num_heads: 4
      num_layers: 2
      hidden_size: 256

Using Additional Speaker Features

EMO-DB also includes speaker metadata. Adding gender and age as additional inputs can improve accuracy on small datasets:

input_features:
  - name: audio
    type: audio
    encoder:
      type: stacked_cnn
  - name: gender
    type: category
  - name: age
    type: number
output_features:
  - name: emotion
    type: category
trainer:
  epochs: 20
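
The same change can be expressed against the Python config dict by appending two input features. This is a sketch: the helper name with_speaker_features is hypothetical, and base_config simply mirrors the audio-only config shown earlier.

```python
import copy

# Mirrors the audio-only config from the "Define Ludwig Config" section.
base_config = {
    "input_features": [
        {"name": "audio", "type": "audio", "encoder": {"type": "stacked_cnn"}},
    ],
    "output_features": [{"name": "emotion", "type": "category"}],
    "trainer": {"epochs": 20},
}


def with_speaker_features(config):
    """Return a copy of config with gender and age added as inputs."""
    extended = copy.deepcopy(config)  # leave the original config untouched
    extended["input_features"] += [
        {"name": "gender", "type": "category"},
        {"name": "age", "type": "number"},
    ]
    return extended


speaker_config = with_speaker_features(base_config)
print([f["name"] for f in speaker_config["input_features"]])
```

Keeping both dicts around makes it easy to train the two variants and compare them. A model trained with the extra inputs will also expect gender and age columns, alongside audio, in any data passed to it for prediction.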

Hyperparameters to Tune

  • preprocessing.audio_file_length_limit_in_s — set to the 95th-percentile duration in your dataset to avoid excessive zero-padding
  • trainer.learning_rate — try 1e-3 (default) down to 1e-4 for deeper encoders
  • trainer.batch_size — audio tensors are large; reduce to 16 or 8 if you run out of GPU memory
  • encoder.num_filters in CNN layers — increase for larger datasets (e.g., 64, 128, 256)
  • trainer.epochs with trainer.early_stop — set early_stop: 5 to stop when validation accuracy plateaus
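
One way to pick audio_file_length_limit_in_s is to measure the clip durations directly. The sketch below reads WAV headers with the standard-library wave module and takes a nearest-rank percentile. The helper names are hypothetical, and it assumes uncompressed PCM WAV files that the wave module can open.

```python
import math
import wave


def wav_duration_s(path):
    """Duration in seconds, read from the WAV header (PCM files only)."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_length_limit(paths, pct=95):
    """Suggested audio_file_length_limit_in_s for a collection of WAV files."""
    return percentile_nearest_rank([wav_duration_s(p) for p in paths], pct)
```

Passing the list of WAV paths from the dataset CSV to suggest_length_limit gives a candidate value for the config. Clips longer than the limit are cut off, so it is worth listening to a few of the longest recordings before settling on a number.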

Other Ludwig Datasets for Audio Classification

Ludwig provides several additional audio classification datasets you can use directly:

| Dataset | Ludwig name | Description | Size |
|---|---|---|---|
| ESC-50 | esc50 | Environmental sound classification, 50 classes (sirens, animals, nature, etc.) | 2,000 clips |
| MINDS-14 | minds14 | Multilingual banking intent recognition from spoken audio, 14 intents × 14 languages | ~14,000 clips |
| Speech MASSIVE | speech_massive | Spoken intent and slot detection across 51 languages, 60 intent classes | ~115,000 clips |

ESC-50: Environmental Sound Classification

ESC-50 is a benchmark dataset for environmental sound classification containing 2,000 5-second recordings organized into 50 semantic classes (dog, rain, clock, chainsaw, etc.).

ludwig datasets download esc50

The ESC-50 config predicts target (50-class category) from audio_path:

input_features:
  - name: audio_path
    type: audio
    encoder:
      type: stacked_cnn
    preprocessing:
      audio_file_length_limit_in_s: 5.0
output_features:
  - name: target
    type: category
trainer:
  epochs: 50

MINDS-14: Multilingual Intent from Speech

MINDS-14 tests whether a model can recognize banking-domain intents (e.g., "transfer funds", "check balance") from spoken audio across 14 languages and dialects.

ludwig datasets download minds14