Audio Features
Preprocessing
Example of a preprocessing specification (assuming the audio files have a sample rate of 16000 Hz):
preprocessing:
    type: fbank
    missing_value_strategy: bfill
    audio_file_length_limit_in_s: 7.5
    norm: null
    window_length_in_s: 0.04
    window_shift_in_s: 0.02
    window_type: hamming
    fill_value: null
    in_memory: true
    padding_value: 0.0
    num_fft_points: null
    num_filter_bands: 80
Ludwig supports reading audio files using PyTorch's Torchaudio library. This library supports WAV, AMB, MP3, FLAC, OGG/VORBIS, OPUS, SPHERE, and AMR-NB formats.
Parameters:
- type (default: fbank): Defines the type of audio feature to be used. Options: fbank, group_delay, raw, stft, stft_phase. Each type is explained under Input Features below.
- missing_value_strategy (default: bfill): The strategy to follow when there is a missing value in an audio column. Options: fill_with_const, fill_with_mode, bfill, ffill, drop_row. See Missing Value Strategy for details.
- audio_file_length_limit_in_s (default: 7.5): Float value that defines the maximum length of an audio file in seconds. Files longer than this limit are cut off; files shorter than this limit are padded with padding_value.
- norm (default: null): Normalization strategy for the audio files. If null, no normalization is performed. If per_file, z-normalization is applied to each file separately. Options: per_file, null.
- window_length_in_s (default: 0.04): Defines the window length, in seconds, used for the short-time Fourier transform. It applies to the STFT-based types (stft, stft_phase, group_delay, fbank) and is not needed when type is raw.
- window_shift_in_s (default: 0.02): Defines the window shift, in seconds, used for the short-time Fourier transform (also called hop_length). It applies to the STFT-based types (stft, stft_phase, group_delay, fbank) and is not needed when type is raw.
- window_type (default: hamming): Defines the window function applied to each frame of the signal before the short-time Fourier transform. Options: bartlett, blackman, hamming, hann.
- fill_value (default: null): The value used to replace missing values when missing_value_strategy is fill_with_const.
- in_memory (default: true): Defines whether the audio dataset resides in memory during training or is fetched dynamically from disk (useful for large datasets). In the latter case, a training batch of input audio is fetched from disk at each training iteration. Options: true, false.
- padding_value (default: 0.0): Float value used for padding.
- num_fft_points (default: null): Defines the number of FFT points used for the short-time Fourier transform.
- num_filter_bands (default: 80): Defines the number of filters used in the filterbank. Only needed if type is fbank.
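To make the effect of audio_file_length_limit_in_s, padding_value, and norm concrete, here is a minimal sketch in plain Python. The function names fit_to_length and z_norm are hypothetical and are not part of Ludwig. The sketch only illustrates the behavior described in the parameter list; it makes no claim about how, or in which order, Ludwig applies these steps internally.

```python
import math


def fit_to_length(samples, limit_in_s, sample_rate, padding_value=0.0):
    """Cut off or pad a list of samples so it spans exactly limit_in_s seconds."""
    target = round(limit_in_s * sample_rate)  # length limit expressed in samples
    if len(samples) >= target:
        return samples[:target]  # longer files are cut off
    # shorter files are padded at the end with padding_value
    return samples + [padding_value] * (target - len(samples))


def z_norm(samples):
    """Per-file z-normalization: zero mean and unit variance within one file."""
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
    if std == 0.0:
        return [0.0] * len(samples)  # constant signal: avoid dividing by zero
    return [(x - mean) / std for x in samples]


# A toy 3-sample clip, limited to 1 ms at a hypothetical 8000 Hz sample rate.
short_clip = [0.5, -0.5, 0.25]
padded = fit_to_length(short_clip, limit_in_s=0.001, sample_rate=8000)
print(len(padded))  # 8 samples: 3 original values followed by 5 padding values
```

With the defaults from the specification above (7.5 seconds at 16000 Hz), the same function would bring every file to 120000 samples.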
Preprocessing parameters can also be defined once and applied to all audio input features using the Type-Global Preprocessing section.
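As a rough illustration of type-global preprocessing, the sketch below writes a configuration as a Python dictionary. It assumes that the Type-Global Preprocessing section uses a top-level defaults key organized by feature type, and that feature-level settings take precedence over type-global ones; both points should be checked against the Ludwig version in use. The feature names and the helper effective_preprocessing are hypothetical.

```python
import copy

# Hypothetical configuration: two audio inputs and one category output.
config = {
    "input_features": [
        {"name": "audio_a", "type": "audio"},
        {
            "name": "audio_b",
            "type": "audio",
            # Feature-level override (assumed to win over the defaults).
            "preprocessing": {"type": "stft", "num_fft_points": 1024},
        },
    ],
    "output_features": [{"name": "label", "type": "category"}],
    # Assumed layout of the type-global section.
    "defaults": {
        "audio": {
            "preprocessing": {
                "type": "fbank",
                "num_filter_bands": 80,
                "audio_file_length_limit_in_s": 7.5,
            }
        }
    },
}


def effective_preprocessing(config, feature):
    """Merge type-global preprocessing with a feature's own overrides."""
    type_defaults = config.get("defaults", {}).get(feature["type"], {})
    merged = copy.deepcopy(type_defaults.get("preprocessing", {}))
    merged.update(feature.get("preprocessing", {}))  # feature-level keys win
    return merged


for feature in config["input_features"]:
    print(feature["name"], effective_preprocessing(config, feature))
```

In this sketch audio_a inherits the fbank settings unchanged, while audio_b keeps the shared length limit but switches to stft.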
Input Features
Audio files are transformed into one of the following types, selected by the type parameter of the preprocessing configuration.
- raw: The audio file is transformed into a float-valued tensor of size N x L x W (where N is the size of the dataset, L corresponds to audio_file_length_limit_in_s * sample_rate, and W = 1).
- stft: The audio file is transformed to the stft magnitude, a float-valued tensor of size N x L x W (where N is the size of the dataset, L corresponds to ceil((audio_file_length_limit_in_s * sample_rate - window_length_in_s * sample_rate + 1) / (window_shift_in_s * sample_rate)) + 1, and W corresponds to num_fft_points / 2).
- fbank: The audio file is transformed to FBANK features (also called log Mel-filter bank values). FBANK features are implemented according to their definition in the HTK Book: Raw Signal -> Preemphasis -> DC mean removal -> stft magnitude -> Power spectrum: stft^2 -> mel-filter bank values: triangular filters equally spaced on a Mel-scale are applied -> log-compression: log(). Overall, the audio file is transformed into a float-valued tensor of size N x L x W, with N and L being equal to the ones in stft and W being equal to num_filter_bands.
- stft_phase: The phase information for each stft bin is appended to the stft magnitude, so that the audio file is transformed into a float-valued tensor of size N x L x 2W, with N, L, and W being equal to the ones in stft.
- group_delay: The audio file is transformed to group delay features according to Equation (23) in this paper. Group delay features have the same tensor size as stft.
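The tensor sizes above can be worked through numerically. The sketch below computes the per-file shape (L, W) for each type, reading the stft length formula as ceil((samples - window + 1) / shift) + 1 with all quantities in samples. The function feature_shape is hypothetical and is not a Ludwig API, and the value 1024 for num_fft_points is an assumption chosen for illustration, since the default is null. Because it follows the formulas as stated here, the numbers may differ from what a particular Ludwig version produces.

```python
import math


def feature_shape(feature_type, limit_in_s=7.5, sample_rate=16000,
                  window_length_in_s=0.04, window_shift_in_s=0.02,
                  num_fft_points=None, num_filter_bands=80):
    """Return (L, W) for one audio file, following the formulas in this section."""
    num_samples = round(limit_in_s * sample_rate)
    if feature_type == "raw":
        return (num_samples, 1)

    window = round(window_length_in_s * sample_rate)  # window length in samples
    shift = round(window_shift_in_s * sample_rate)    # window shift in samples
    length = math.ceil((num_samples - window + 1) / shift) + 1

    if feature_type == "fbank":
        return (length, num_filter_bands)
    if num_fft_points is None:
        raise ValueError("this sketch needs an explicit num_fft_points for STFT types")
    width = num_fft_points // 2
    if feature_type in ("stft", "group_delay"):
        return (length, width)
    if feature_type == "stft_phase":
        return (length, 2 * width)  # phase is appended to the magnitude
    raise ValueError(f"unknown audio feature type: {feature_type}")


print(feature_shape("raw"))                              # (120000, 1)
print(feature_shape("fbank"))                            # (375, 80)
print(feature_shape("stft", num_fft_points=1024))        # (375, 512)
print(feature_shape("stft_phase", num_fft_points=1024))  # (375, 1024)
```

The full tensor has one more leading dimension N, the number of rows in the dataset.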
The encoder parameters specified at the feature level are:
- tied (default: null): Name of another input feature to tie the encoder weights with. It must be the name of a feature of the same type and with the same encoder parameters.
Example audio feature entry in the input features list:
name: audio_column_name
type: audio
tied: null
encoder: 
    type: parallel_cnn
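The tied parameter is easier to see with two features. The sketch below writes a configuration as a Python dictionary in which a second audio feature ties its encoder weights to the first, and adds a small check of the constraints stated above (same type, same encoder parameters). The feature names, the output feature, and the helper tied_problems are hypothetical; the check is an illustration and is not Ludwig's own validation.

```python
# Hypothetical configuration with two audio inputs sharing encoder weights.
config = {
    "input_features": [
        {"name": "audio_left", "type": "audio", "encoder": {"type": "parallel_cnn"}},
        {
            "name": "audio_right",
            "type": "audio",
            "tied": "audio_left",  # reuse the encoder weights of audio_left
            "encoder": {"type": "parallel_cnn"},
        },
    ],
    "output_features": [{"name": "label", "type": "category"}],
}


def tied_problems(input_features):
    """List violations of the tied constraints described in this section."""
    by_name = {feature["name"]: feature for feature in input_features}
    problems = []
    for feature in input_features:
        target_name = feature.get("tied")
        if target_name is None:
            continue  # this feature does not tie its weights
        target = by_name.get(target_name)
        if target is None:
            problems.append(f"{feature['name']}: unknown feature {target_name}")
        elif target["type"] != feature["type"]:
            problems.append(f"{feature['name']}: type differs from {target_name}")
        elif target.get("encoder") != feature.get("encoder"):
            problems.append(f"{feature['name']}: encoder differs from {target_name}")
    return problems


print(tied_problems(config["input_features"]))  # [] when the constraints hold
```

A dictionary like this could then be passed to Ludwig's Python API in place of a YAML file, assuming the installed version accepts the same keys.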
Encoders
Audio feature encoders are the same as for Sequence Features.
Encoder type and encoder parameters can also be defined once and applied to all audio input features using the Type-Global Encoder section.
Output Features
There are no audio decoders at the moment.
If this unlocks an interesting use case for your application, please file a GitHub Issue or ping the Ludwig Slack.