↑ Audio Features

Preprocessing

Example of a preprocessing specification (assuming the audio files have a sample rate of 16000 Hz):

preprocessing:
    type: fbank
    missing_value_strategy: bfill
    audio_file_length_limit_in_s: 7.5
    fill_value: null
    norm: null
    window_length_in_s: 0.04
    window_shift_in_s: 0.02
    window_type: hamming
    in_memory: true
    padding_value: 0.0
    num_fft_points: null
    num_filter_bands: 80
    mode: lazy
    prefetch_size: null
    lazy_cache_dir: null

Ludwig supports reading audio files using PyTorch's Torchaudio library. This library supports WAV, AMB, MP3, FLAC, OGG/VORBIS, OPUS, SPHERE, and AMR-NB formats.

Parameters:

  • type (default: fbank): Defines the type of audio feature to be used. Options: fbank, group_delay, raw, stft, stft_phase. See the Input Features section below for an explanation of each type.
  • missing_value_strategy (default: bfill): The strategy to follow when there is a missing value in an audio column. Options: fill_with_const, fill_with_mode, bfill, ffill, drop_row. See Missing Value Strategy for details.
  • audio_file_length_limit_in_s (default: 7.5): Float value that defines the maximum length of an audio file in seconds. Files longer than this limit are cut off. Files shorter than this limit are padded with padding_value.
  • fill_value (default: null): The value used to replace missing values when missing_value_strategy is fill_with_const.
  • norm (default: null): Normalization strategy for the audio files. If null, no normalization is performed. If per_file, z-normalization is applied to each file individually. Options: per_file, null.
  • window_length_in_s (default: 0.04): Defines the window length used for the short-time Fourier transform. Only used by the STFT-based types (fbank, group_delay, stft, stft_phase); it has no effect when type is raw.
  • window_shift_in_s (default: 0.02): Defines the window shift used for the short-time Fourier transform (also called hop_length). Only used by the STFT-based types (fbank, group_delay, stft, stft_phase); it has no effect when type is raw.
  • window_type (default: hamming): Defines the window function applied to the signal before the short-time Fourier transform. Options: bartlett, blackman, hamming, hann.
  • in_memory (default: true): Defines whether the audio dataset will reside in memory during the training process or will be dynamically fetched from disk (useful for large datasets). In the latter case a training batch of input audio will be fetched from disk each training iteration.
  • padding_value (default: 0.0): Float value that is used for padding.
  • num_fft_points (default: null): Defines the number of FFT points used for the short-time Fourier transform.
  • num_filter_bands (default: 80): Defines the number of filters used in the filterbank. Only needed if type is fbank.

  • mode (default: lazy): Preprocessing mode for audio features. 'eager' decodes all files during preprocessing and stores tensors in the Parquet cache. 'lazy' stores file paths and decodes per batch during training, keeping peak memory bounded to batch_size × clip_size. 'lazy_cached' behaves like 'lazy' on the first training epoch but writes decoded tensors to a numpy memmap alongside the Parquet cache; subsequent epochs read from the memmap directly, eliminating decode overhead. Options: eager, lazy, lazy_cached.

  • prefetch_size (default: null): Number of batches to prefetch in a background thread while the GPU processes the current batch. None (default) selects automatically: 0 for 'eager' mode, 4 for 'lazy' and 'lazy_cached' (epoch 1). After the first epoch in 'lazy_cached' mode, prefetch is automatically disabled since memmap reads are fast enough. Set to 0 to disable prefetch entirely, or to a positive integer to override the automatic selection.
  • lazy_cache_dir (default: null): Directory in which to cache audio files when the source data is in-memory (e.g. a HuggingFace dataset). Only used when mode is 'lazy' or 'lazy_cached' and the input entries are not already paths to existing files. When None, defaults to ~/.cache/ludwig/lazy_media/<feature_name>/. Has no effect when the input column already contains local file paths. Note: this controls the file cache for in-memory sources; the decoded memmap for 'lazy_cached' mode is placed next to the Parquet cache instead.
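
The automatic prefetch_size selection described in the list above can be sketched as a small function. This is an illustration only: resolve_prefetch_size is a hypothetical helper, not part of Ludwig's API, and it assumes that the automatic shutdown after the first lazy_cached epoch takes precedence over an explicit value.

```python
from typing import Optional


def resolve_prefetch_size(mode: str, prefetch_size: Optional[int], epoch: int = 1) -> int:
    """Hypothetical helper mirroring the documented prefetch_size rules."""
    if mode not in ("eager", "lazy", "lazy_cached"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: from epoch 2 on, lazy_cached reads from the memmap,
    # so prefetch is disabled regardless of the configured value.
    if mode == "lazy_cached" and epoch > 1:
        return 0
    if prefetch_size is not None:
        if prefetch_size < 0:
            raise ValueError("prefetch_size must be null, 0, or a positive integer")
        return prefetch_size  # explicit override (0 disables prefetch)
    # Automatic selection: 0 for eager, 4 for lazy and lazy_cached.
    return 0 if mode == "eager" else 4


auto_eager = resolve_prefetch_size("eager", None)
auto_lazy = resolve_prefetch_size("lazy", None)
cached_epoch2 = resolve_prefetch_size("lazy_cached", None, epoch=2)
override = resolve_prefetch_size("lazy", 8)
print(auto_eager, auto_lazy, cached_epoch2, override)
```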

Preprocessing parameters can also be defined once and applied to all audio input features using the Type-Global Preprocessing section.

Preprocessing Modes

Ludwig supports three preprocessing modes for audio features, controlled by the mode parameter:

| Mode | Preprocessing memory | Training epoch 1 | Training epoch 2+ | Best for |
| --- | --- | --- | --- | --- |
| eager | High (O(N×tensor)) | Fast | Fast | Small datasets that fit in RAM |
| lazy (default) | Low (O(batch)) | Slower (decode-bound) | Slower | Large datasets |
| lazy_cached | Low (O(batch)) | Fast (GPU pipelined) | Very fast (memmap) | Large datasets, any GPU speed |

mode: lazy (default)

Ludwig stores file paths in the processed dataset and decodes audio clips on-the-fly, one batch at a time, during training. Decoding runs in a ThreadPoolExecutor that overlaps with the GPU forward pass. For FBANK features, the thread-pool size is automatically capped to avoid CPU over-subscription (each decode already uses PyTorch's internal thread pool).
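
As a rough illustration of per-batch decoding in a thread pool, the sketch below decodes a batch of WAV files concurrently using only the Python standard library. It is not Ludwig's implementation: Ludwig decodes through Torchaudio, whereas this example uses the stdlib wave module, and write_test_tone, decode_wav, and decode_batch are hypothetical names.

```python
import math
import os
import struct
import tempfile
import wave
from concurrent.futures import ThreadPoolExecutor


def write_test_tone(path, sample_rate=16000, num_samples=1600, freq=440.0):
    """Write a short mono 16-bit sine tone (stand-in for a real audio file)."""
    frames = b"".join(
        struct.pack("<h", int(16383 * math.sin(2 * math.pi * freq * i / sample_rate)))
        for i in range(num_samples)
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(frames)


def decode_wav(path):
    """Decode a mono 16-bit WAV file into floats in [-1, 1)."""
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in samples]


def decode_batch(paths, max_workers=4):
    """Decode one batch of file paths concurrently; input order is preserved."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decode_wav, paths))


tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, f"clip_{i}.wav") for i in range(4)]
for p in paths:
    write_test_tone(p)

batch = decode_batch(paths)
print(len(batch), len(batch[0]))
```

Only the file paths are held in memory up front; decoded samples exist only for the batch currently being processed, which is what keeps peak memory proportional to the batch size.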

mode: lazy_cached

On the first training epoch, audio is decoded per batch (same as lazy) and written to a numpy memmap alongside the Parquet cache. From epoch 2 onward, the memmap is read directly (~0.1 ms/batch), eliminating decode overhead entirely.
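
The decode-once, read-many pattern can be sketched with a memory-mapped file. To stay dependency-free this example uses Python's stdlib mmap and struct modules rather than numpy.memmap, so it shows the idea, not Ludwig's actual cache format. ClipCache and fake_decode are hypothetical names, and every clip is assumed to have the same fixed length.

```python
import mmap
import os
import struct
import tempfile


class ClipCache:
    """Fixed-size float32 clip store backed by a memory-mapped file."""

    def __init__(self, path, num_clips, clip_len):
        self.clip_len = clip_len
        self.clip_bytes = clip_len * 4  # 4 bytes per float32 sample
        self.filled = [False] * num_clips
        with open(path, "wb") as f:
            f.truncate(num_clips * self.clip_bytes)  # preallocate the file
        self._file = open(path, "r+b")
        self._mm = mmap.mmap(self._file.fileno(), 0)

    def get(self, index, decode):
        offset = index * self.clip_bytes
        fmt = f"<{self.clip_len}f"
        if not self.filled[index]:
            # First access (epoch 1): decode once and write into the mapping.
            struct.pack_into(fmt, self._mm, offset, *decode(index))
            self.filled[index] = True
        # Later accesses (epoch 2+): read straight from the mapping.
        return list(struct.unpack_from(fmt, self._mm, offset))

    def close(self):
        self._mm.close()
        self._file.close()


decode_calls = []


def fake_decode(index):
    """Stand-in for an expensive audio decode; records how often it runs."""
    decode_calls.append(index)
    return [float(index)] * 4


cache_path = os.path.join(tempfile.mkdtemp(), "clips.bin")
cache = ClipCache(cache_path, num_clips=3, clip_len=4)
epoch1 = [cache.get(i, fake_decode) for i in range(3)]
epoch2 = [cache.get(i, fake_decode) for i in range(3)]
cache.close()
print(len(decode_calls))
```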

mode: eager

All audio files are decoded during preprocessing and stored as tensors in the Parquet cache. Use this only when the full decoded dataset fits comfortably in memory.

Configuration Examples

input_features:
  - name: audio
    type: audio
    preprocessing:
      mode: lazy               # default
      prefetch_size: null      # auto (4 for lazy/lazy_cached, 0 for eager)
      lazy_cache_dir: null     # default: ~/.cache/ludwig/lazy_media/<feature_name>/
      audio_file_length_limit_in_s: 7.5
      type: fbank
      num_filter_bands: 80

input_features:
  - name: audio
    type: audio
    preprocessing:
      mode: lazy_cached        # decode+cache on epoch 1; memmap from epoch 2+
      lazy_cache_dir: /fast/nvme/audio_cache

input_features:
  - name: audio
    type: audio
    preprocessing:
      mode: eager              # decode everything upfront

See Choosing a Preprocessing Mode for a full comparison.

Lazy Preprocessing with HuggingFace Datasets

When loading a HuggingFace dataset (e.g. datasets.load_dataset(...)), audio columns are delivered as Python dicts — not file paths:

{
    "array": np.ndarray,  # decoded waveform, shape (samples,)
    "sampling_rate": 16000,  # sample rate in Hz
    "path": "/path/to/cache.wav",  # optional: HF's local cache path
}

Ludwig handles this transparently:

  1. If path points to an existing file on disk (HuggingFace's local cache), Ludwig reuses that file directly — no copy is made.
  2. Otherwise, Ludwig writes the waveform to a WAV file in lazy_cache_dir and uses that path.

The cache is persistent and idempotent — subsequent runs skip the write step entirely.
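
The two-step resolution above can be sketched as follows. This is a simplified illustration, not Ludwig's code: resolve_audio_path is a hypothetical helper, the waveform is taken as a plain sequence of floats in [-1, 1] written as 16-bit mono PCM, and the content-hash file name is an assumption made here to get idempotent writes.

```python
import hashlib
import os
import struct
import tempfile
import wave


def resolve_audio_path(entry, cache_dir):
    """Return a file path for a HuggingFace-style audio dict (hypothetical helper)."""
    path = entry.get("path")
    if path and os.path.isfile(path):
        return path  # step 1: reuse the existing file, no copy is made

    # Step 2: convert the waveform to 16-bit PCM and write it to the cache.
    pcm = struct.pack(
        f"<{len(entry['array'])}h",
        *(int(max(-1.0, min(1.0, x)) * 32767) for x in entry["array"]),
    )
    # Assumption: naming the file by content hash makes repeated runs idempotent.
    digest = hashlib.sha1(str(entry["sampling_rate"]).encode() + pcm).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    out = os.path.join(cache_dir, digest + ".wav")
    if not os.path.exists(out):  # later runs skip the write
        with wave.open(out, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(entry["sampling_rate"])
            w.writeframes(pcm)
    return out


cache_dir = os.path.join(tempfile.mkdtemp(), "speech")
entry = {"array": [0.0, 0.5, -0.5, 1.0], "sampling_rate": 16000, "path": None}
first = resolve_audio_path(entry, cache_dir)
second = resolve_audio_path(entry, cache_dir)
reused = resolve_audio_path({"array": [], "sampling_rate": 16000, "path": first}, cache_dir)
print(first == second, reused == first)
```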

Controlling the Cache Directory

lazy_cache_dir controls where WAV files are written for in-memory sources (HuggingFace datasets). The decoded memmap for lazy_cached mode is placed next to the Parquet cache, not inside lazy_cache_dir.

input_features:
  - name: speech
    type: audio
    preprocessing:
      mode: lazy_cached
      lazy_cache_dir: /fast/nvme/my_project/audio_cache

The per-feature subdirectory is created automatically.

Bare Tensor Inputs

If your dataset delivers bare torch.Tensor objects (shape (channels, samples) or (samples,)) instead of dicts, Ludwig treats them the same as the in-memory dict case: tensors are written to WAV files in lazy_cache_dir using the sample rate recorded in the feature metadata.

Input Features

Audio files are transformed into one of the following types according to type under the preprocessing configuration.

  • raw: Audio file is transformed into a float valued tensor of size N x L x W (where N is the size of the dataset and L corresponds to audio_file_length_limit_in_s * sample_rate and W = 1).
  • stft: Audio is transformed to the stft magnitude. Audio file is transformed into a float valued tensor of size N x L x W (where N is the size of the dataset, L corresponds to ceil(audio_file_length_limit_in_s * sample_rate - window_length_in_s * sample_rate + 1/ window_shift_in_s * sample_rate) + 1 and W corresponds to num_fft_points / 2).
  • fbank: Audio file is transformed to FBANK features (also called log Mel-filter bank values). FBANK features are implemented according to their definition in the HTK Book: Raw Signal -> Preemphasis -> DC mean removal -> stft magnitude -> Power spectrum: stft^2 -> mel-filter bank values: triangular filters equally spaced on a Mel-scale are applied -> log-compression: log(). Overall the audio file is transformed into a float valued tensor of size N x L x W with N,L being equal to the ones in stft and W being equal to num_filter_bands.
  • stft_phase: The phase information for each stft bin is appended to the stft magnitude so that the audio file is transformed into a float valued tensor of size N x L x 2W with N,L,W being equal to the ones in stft.
  • group_delay: Audio is transformed to group delay features according to Equation (23) in this paper. Group delay features have the same tensor size as stft.
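
To make the sizes concrete, the sketch below works through the default settings at a 16000 Hz sample rate: the raw length L, the window and shift in samples, and cutting or padding a waveform to L. All function names are hypothetical. Padding at the end of the clip is an assumption, and textbook_frame_count uses the common no-padding convention 1 + floor((N - window) / shift), which may differ from Ludwig's exact rounding.

```python
def to_samples(seconds, sample_rate):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * sample_rate))


def fit_to_length(waveform, length, padding_value=0.0):
    """Cut off or right-pad a 1-D waveform so it has exactly `length` samples."""
    clipped = list(waveform[:length])
    return clipped + [padding_value] * (length - len(clipped))


def textbook_frame_count(num_samples, window, shift):
    """Frames that fit without padding: 1 + floor((N - window) / shift)."""
    return 1 + (num_samples - window) // shift


sample_rate = 16000
clip_len = to_samples(7.5, sample_rate)   # L for type: raw
window = to_samples(0.04, sample_rate)    # window_length_in_s in samples
shift = to_samples(0.02, sample_rate)     # window_shift_in_s in samples
frames = textbook_frame_count(clip_len, window, shift)

padded = fit_to_length([0.1, 0.2, 0.3], 5)
cut = fit_to_length([0.1, 0.2, 0.3], 2)
print(clip_len, window, shift, frames)
print(padded, cut)
```

With num_filter_bands: 80, an fbank feature for one clip would then be a matrix of roughly frames × 80 values under these assumptions.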

The encoder parameters specified at the feature level are:

  • tied (default: null): Name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.

Example audio feature entry in the input features list:

name: audio_column_name
type: audio
tied: null
encoder:
    type: parallel_cnn

Encoders

Audio feature encoders include all Sequence Features encoders as well as the pretrained audio encoders described below.

Encoder type and encoder parameters can also be defined once and applied to all audio input features using the Type-Global Encoder section.

Wav2Vec2 Encoder

The Wav2Vec2 encoder (Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations", NeurIPS 2020) processes raw audio waveforms using self-supervised contrastive learning over masked latent representations. It produces contextualized speech features suitable for speech recognition, audio classification, and speaker identification.

Wav2Vec2 expects raw waveform input at a 16 kHz sample rate.

Default pretrained model: facebook/wav2vec2-base

encoder:
    type: wav2vec2
    skip: false
    adapter: null
    use_pretrained: true
    saved_weights_in_checkpoint: false
    trainable: true
    reduce_output: mean
    pretrained_model_name_or_path: facebook/wav2vec2-base

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • reduce_output (default: mean):
  • pretrained_model_name_or_path (default: facebook/wav2vec2-base): Name or path of the pretrained model. Can be a HuggingFace model hub identifier (e.g. 'facebook/wav2vec2-base', 'facebook/wav2vec2-large-xlsr-53') or a local path to a saved model directory.
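
A complete configuration can also be assembled as a Python dictionary. The sketch below pairs the wav2vec2 encoder with type: raw preprocessing, which is an inference from the statement that the encoder expects raw waveform input, not a documented requirement. The column names audio_path and label, the category output feature, and the build_audio_config helper are all hypothetical.

```python
import json


def build_audio_config(encoder_type, model_name, preprocessing_type):
    """Assemble a config dict with one audio input (hypothetical helper)."""
    return {
        "input_features": [
            {
                "name": "audio_path",  # hypothetical column name
                "type": "audio",
                "preprocessing": {
                    "type": preprocessing_type,
                    "audio_file_length_limit_in_s": 7.5,
                },
                "encoder": {
                    "type": encoder_type,
                    "use_pretrained": True,
                    "trainable": True,
                    "reduce_output": "mean",
                    "pretrained_model_name_or_path": model_name,
                },
            }
        ],
        # Hypothetical target: a categorical label per clip.
        "output_features": [{"name": "label", "type": "category"}],
    }


config = build_audio_config("wav2vec2", "facebook/wav2vec2-base", "raw")
print(json.dumps(config, indent=2))
```

A dictionary like this is typically passed to Ludwig's Python API or dumped to YAML for the command line. The same helper could be called with the other encoder names on this page, provided the preprocessing type matches what that encoder expects.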

Whisper Encoder

The Whisper encoder (Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision", ICML 2023) is the encoder portion of OpenAI's Whisper model, trained on 680,000 hours of multilingual audio data. It excels at noisy and multilingual speech tasks.

Whisper expects log-mel spectrogram input (80 mel bins).

Default pretrained model: openai/whisper-base

encoder:
    type: whisper
    skip: false
    adapter: null
    use_pretrained: true
    saved_weights_in_checkpoint: false
    trainable: true
    reduce_output: mean
    pretrained_model_name_or_path: openai/whisper-base

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • reduce_output (default: mean):
  • pretrained_model_name_or_path (default: openai/whisper-base): Name or path of the pretrained model. Can be a HuggingFace model hub identifier (e.g. 'openai/whisper-base', 'openai/whisper-small', 'openai/whisper-medium', 'openai/whisper-large-v3') or a local path to a saved model directory.

HuBERT Encoder

The HuBERT encoder (Hsu et al., "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units", IEEE/ACM TASLP 2021) uses self-supervised masked prediction to learn speech representations. It is particularly effective for speaker verification, emotion recognition, and audio classification tasks.

HuBERT expects raw waveform input at a 16 kHz sample rate.

Default pretrained model: facebook/hubert-base-ls960

encoder:
    type: hubert
    skip: false
    adapter: null
    use_pretrained: true
    saved_weights_in_checkpoint: false
    trainable: true
    reduce_output: mean
    pretrained_model_name_or_path: facebook/hubert-base-ls960

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • reduce_output (default: mean):
  • pretrained_model_name_or_path (default: facebook/hubert-base-ls960): Name or path of the pretrained model. Can be a HuggingFace model hub identifier (e.g. 'facebook/hubert-base-ls960', 'facebook/hubert-large-ls960-ft') or a local path to a saved model directory.
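
Because HuBERT expects 16 kHz input, it can be worth checking the sample rate of a dataset before training. The sketch below is one possible pre-flight check and is not part of Ludwig. It uses the stdlib wave module, so it only covers WAV files, and find_rate_mismatches and write_silence are hypothetical names.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Read the sample rate (Hz) from a WAV file header."""
    with wave.open(path, "rb") as w:
        return w.getframerate()


def find_rate_mismatches(paths, expected_hz=16000):
    """Return (path, rate) pairs for files whose sample rate is not expected_hz."""
    mismatches = []
    for path in paths:
        rate = wav_sample_rate(path)
        if rate != expected_hz:
            mismatches.append((path, rate))
    return mismatches


def write_silence(path, sample_rate, num_samples=160):
    """Write a short silent mono 16-bit WAV file for the demo."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * num_samples)


tmpdir = tempfile.mkdtemp()
ok_path = os.path.join(tmpdir, "ok.wav")
bad_path = os.path.join(tmpdir, "bad.wav")
write_silence(ok_path, 16000)
write_silence(bad_path, 44100)

mismatches = find_rate_mismatches([ok_path, bad_path])
print(mismatches)
```

Files reported here would need to be resampled to 16 kHz before being fed to the encoder.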

Output Features

There are no audio decoders at the moment.

If this unlocks an interesting use case for your application, please file a GitHub Issue or ping the Community Discord.