
Spoken Digit Speech Recognition

This is a complete example of training a spoken digit speech recognition model on the Free Spoken Digit Dataset, the "MNIST dataset of speech recognition".

Download the free spoken digit dataset

git clone https://github.com/Jakobovski/free-spoken-digit-dataset.git
mkdir speech_recog_digit_data
cp -r free-spoken-digit-dataset/recordings speech_recog_digit_data
cd speech_recog_digit_data

Create a CSV dataset

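# Write the header (use > so that re-running the script does not append a second header)
echo "audio_path,label" > "spoken_digit.csv"
cd "recordings"
for file_name in *.wav; do
   # Absolute path to the recording (readlink -m requires GNU coreutils)
   audio_path=$(readlink -m "${file_name}")
   # The first character of the file name is the spoken digit
   label=$(echo "${file_name}" | cut -c1)
   echo "${audio_path},${label}" >> "../spoken_digit.csv"
done
cd "../"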
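
Taking the first character as the label works because the recordings are named {digit}_{speaker}_{index}.wav, for example 0_jackson_10.wav. If the speaker or index is also needed, for instance to hold out one speaker for evaluation, the name can be split with shell parameter expansion. The following is a minimal sketch; parse_recording_name is a hypothetical helper and is not part of the dataset or of Ludwig.

```shell
# Hypothetical helper: split a recording file name of the form
# {digit}_{speaker}_{index}.wav into its three fields.
parse_recording_name() {
    # Strip the directory and the .wav extension, e.g. 0_jackson_10
    base=$(basename "$1" .wav)
    # Everything before the first underscore is the digit label
    digit=${base%%_*}
    # Drop the digit, leaving e.g. jackson_10
    rest=${base#*_}
    # Everything before the last underscore is the speaker name
    speaker=${rest%_*}
    # Everything after the last underscore is the recording index
    index=${rest##*_}
    echo "${digit} ${speaker} ${index}"
}

# Example call:
# parse_recording_name recordings/0_jackson_10.wav
```

The fields are printed separated by spaces, so they can be read back with `read -r digit speaker index`.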

You should now have a file spoken_digit.csv containing 2000 examples in the following format:

audio_path,label
.../speech_recog_digit_data/recordings/0_jackson_0.wav,0
.../speech_recog_digit_data/recordings/0_jackson_10.wav,0
.../speech_recog_digit_data/recordings/0_jackson_11.wav,0
...
.../speech_recog_digit_data/recordings/1_jackson_0.wav,1
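
Before training, it can help to confirm that the CSV matches this description. The sketch below counts the data rows and the number of examples per label. summarize_csv is a hypothetical helper name, and the sketch assumes that the file has a single header line and that the label is the last comma-separated field.

```shell
# Hypothetical helper: print the number of data rows and the
# number of examples per label for a CSV with one header line.
summarize_csv() {
    csv_file="$1"
    # Count data rows, skipping the header line
    rows=$(tail -n +2 "${csv_file}" | wc -l | tr -d ' ')
    echo "rows: ${rows}"
    # Count examples per label (the last comma-separated field)
    tail -n +2 "${csv_file}" \
        | awk -F',' '{ count[$NF]++ } END { for (l in count) print l, count[l] }' \
        | sort
}

# Example call:
# summarize_csv spoken_digit.csv
```

A roughly equal count for each of the ten digits suggests the labels were extracted correctly. A label that is not a digit usually means a file that is not a recording ended up in the recordings directory.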

Train a model

With the virtual environment in which Ludwig is installed activated, run:

ludwig experiment \
  --dataset <PATH_TO_SPOKEN_DIGIT_CSV> \
  --config config_file.yaml

With config_file.yaml:

input_features:
    -
        name: audio_path
        type: audio
        encoder: 
            type: stacked_cnn
            reduce_output: concat
            conv_layers:
                -
                    num_filters: 16
                    filter_size: 6
                    pool_size: 4
                    pool_stride: 4
                    dropout: 0.4
                -
                    num_filters: 32
                    filter_size: 3
                    pool_size: 2
                    pool_stride: 2
                    dropout: 0.4
            fc_layers:
                -
                    output_size: 64
                    dropout: 0.4
        preprocessing:
            audio_feature:
                type: fbank
                window_length_in_s: 0.025
                window_shift_in_s: 0.01
                num_filter_bands: 80
            audio_file_length_limit_in_s: 1.0
            norm: per_file

output_features:
    -
        name: label
        type: category

training:
    early_stop: 10
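
The preprocessing settings determine the size of the input that the encoder sees. As a rough sketch, a 1.0 s clip cut into 0.025 s windows every 0.01 s gives about 98 frames of 80 filter bank values each under the usual framing formula without padding. This is an assumption for illustration: Ludwig's own framing may pad or round differently, and frame_count is a hypothetical helper.

```shell
# Hypothetical helper: number of analysis frames for a clip, assuming
# no padding: floor((length - window) / hop) + 1.
# Ludwig's internal computation may differ from this estimate.
frame_count() {
    awk -v len="$1" -v win="$2" -v hop="$3" 'BEGIN {
        # A clip shorter than one window yields no full frame
        if (len < win) { print 0; exit }
        # The small epsilon guards against floating-point round-off
        print int((len - win) / hop + 1e-9) + 1
    }'
}

# Values taken from the config above:
# audio_file_length_limit_in_s, window_length_in_s, window_shift_in_s
frame_count 1.0 0.025 0.01
```

With these values the estimate is 98 frames, so each clip becomes a matrix of roughly 98 x 80 features before it reaches the convolutional layers.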