⇅ Image Features

Input image features are transformed into float-valued tensors of size N x C x H x W, where N is the size of the dataset, C is the number of channels, and H x W is the height and width of the image (which can be specified by the user). These tensors are added to the HDF5 file with a key that reflects the name of the column in the dataset.

The column name is added to the JSON file, with an associated dictionary containing preprocessing information about the sizes used for resizing.

Supported Image Formats

The number of channels in the image is determined by the image format. The following table lists the supported image formats and the number of channels.

| Format               | Number of channels |
|----------------------|--------------------|
| Grayscale            | 1                  |
| Grayscale with Alpha | 2                  |
| RGB                  | 3                  |
| RGB with Alpha       | 4                  |
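To make the relationship between image format and tensor shape concrete, here is a minimal sketch. The dictionary keys and the helper name dataset_tensor_shape are hypothetical and not part of Ludwig; only the channel counts come from the table above.

```python
# Channel counts from the supported-formats table; the key names are illustrative.
CHANNELS_BY_FORMAT = {
    "grayscale": 1,
    "grayscale_alpha": 2,
    "rgb": 3,
    "rgba": 4,
}


def dataset_tensor_shape(num_images, image_format, height, width):
    """Return the (N, C, H, W) shape of the preprocessed image tensor."""
    channels = CHANNELS_BY_FORMAT[image_format]
    return (num_images, channels, height, width)


print(dataset_tensor_shape(1000, "rgb", 224, 224))  # (1000, 3, 224, 224)
```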

Preprocessing

During preprocessing, raw image files are transformed into NumPy arrays and saved in the HDF5 format.

Note

Images passed to an image encoder are expected to have the same size. If images are different sizes, by default they will be resized to the dimensions of the first image in the dataset. Optionally, a resize_method together with a target width and height can be specified in the feature preprocessing parameters, in which case all images will be resized to the specified target size.

preprocessing:
    missing_value_strategy: bfill
    fill_value: null
    height: null
    width: null
    num_channels: null
    num_processes: 1
    num_classes: null
    resize_method: interpolate
    infer_image_num_channels: true
    infer_image_dimensions: true
    infer_image_max_height: 256
    infer_image_max_width: 256
    infer_image_sample_size: 100
    standardize_image: null
    in_memory: true
    requires_equal_dimensions: false
    infer_image_num_classes: false
    mode: lazy
    prefetch_size: null
    lazy_cache_dir: null

Parameters:

  • missing_value_strategy (default: bfill): What strategy to follow when there's a missing value in an image column. Options: fill_with_const, fill_with_mode, bfill, ffill, drop_row.
  • fill_value (default: null): The value to replace missing values with when missing_value_strategy is fill_with_const.
  • height (default: null): The image height in pixels. If this parameter is set, images will be resized to the specified height using the resize_method parameter. If None, images will be resized to the size of the first image in the dataset.
  • width (default: null): The image width in pixels. If this parameter is set, images will be resized to the specified width using the resize_method parameter. If None, images will be resized to the size of the first image in the dataset.
  • num_channels (default: null): Number of channels in the images. If specified, images will be read in the mode specified by the number of channels. If not specified, the number of channels will be inferred from the image format of the first valid image in the dataset.
  • num_processes (default: 1): Specifies the number of processes to run for preprocessing images.
  • num_classes (default: null): Number of channel classes in the images. If specified, this value will be validated against the inferred number of classes. Use 2 to convert grayscale images to binary images.
  • resize_method (default: interpolate): The method to use for resizing images. Options: crop_or_pad, interpolate.
  • infer_image_num_channels (default: true): If true, then the number of channels in the dataset is inferred from a sample of the first images in the dataset.
  • infer_image_dimensions (default: true): If true, then the height and width of images in the dataset will be inferred from a sample of the first images in the dataset. Each image that doesn't conform to these dimensions will be resized according to resize_method. If set to false, then the height and width of images in the dataset must be specified by the user.
  • infer_image_max_height (default: 256): If infer_image_dimensions is set, this is used as the maximum height of the images in the dataset.
  • infer_image_max_width (default: 256): If infer_image_dimensions is set, this is used as the maximum width of the images in the dataset.
  • infer_image_sample_size (default: 100): The sample size used for inferring dimensions of images in infer_image_dimensions.
  • standardize_image (default: null): Standardize images by per-channel mean centering and standard deviation scaling. Options: imagenet1k, null.
  • in_memory (default: true): Defines whether image dataset will reside in memory during the training process or will be dynamically fetched from disk (useful for large datasets). In the latter case a training batch of input images will be fetched from disk each training iteration.
  • requires_equal_dimensions (default: false): If true, then width and height must be equal.
  • infer_image_num_classes (default: false): If true, then the number of channel classes in the dataset will be inferred from a sample of the first images in the dataset. Each unique channel value will be mapped to a class and preprocessing will create a masked image based on the channel classes.

  • mode (default: lazy): Preprocessing mode for image features. 'eager' decodes all files during preprocessing and stores tensors in the Parquet cache. 'lazy' stores file paths and decodes per batch during training, keeping peak memory bounded to batch_size × image_size. 'lazy_cached' behaves like 'lazy' on the first training epoch but writes decoded tensors to a numpy memmap alongside the Parquet cache; subsequent epochs read from the memmap directly, eliminating decode overhead. Lazy mode is disabled automatically when a torchvision pretrained encoder is used. Options: eager, lazy, lazy_cached.

  • prefetch_size (default: null): Number of batches to prefetch in a background thread while the GPU processes the current batch. None (default) selects automatically: 0 for 'eager' mode, 4 for 'lazy' and 'lazy_cached' (epoch 1). After the first epoch in 'lazy_cached' mode, prefetch is automatically disabled since memmap reads are fast enough. Set to 0 to disable prefetch entirely, or to a positive integer to override the automatic selection.
  • lazy_cache_dir (default: null): Directory in which to cache image files when the source data is in-memory (e.g. a HuggingFace dataset). Only used when mode is 'lazy' or 'lazy_cached' and the input entries are not already paths to existing files. When None, defaults to ~/.cache/ludwig/lazy_media/<feature_name>/. Has no effect when the input column already contains local file paths.
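As a rough illustration of what standardize_image: imagenet1k implies, the sketch below applies per-channel mean centering and standard deviation scaling to a single RGB pixel. The constants are the commonly published ImageNet-1k statistics for values scaled to [0, 1]. That Ludwig uses exactly these numbers is an assumption, and the helper name standardize_pixel is hypothetical.

```python
# Commonly published ImageNet-1k channel statistics for RGB values in [0, 1].
# Assumption: these are the values behind the imagenet1k option.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def standardize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))


# A pixel equal to the channel means maps to zero in every channel.
print(standardize_pixel(IMAGENET_MEAN))  # (0.0, 0.0, 0.0)
```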

Preprocessing parameters can also be defined once and applied to all image input features using the Type-Global Preprocessing section.

Preprocessing Modes

Ludwig supports three preprocessing modes for image features, controlled by the mode parameter:

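| Mode           | Preprocessing memory | Training epoch 1      | Training epoch 2+  | Best for                       |
|----------------|----------------------|-----------------------|--------------------|--------------------------------|
| eager          | High (O(N×tensor))   | Fast                  | Fast               | Small datasets that fit in RAM |
| lazy (default) | Low (O(batch))       | Slower (decode-bound) | Slower             | Large datasets                 |
| lazy_cached    | Low (O(batch))       | Fast (GPU pipelined)  | Very fast (memmap) | Large datasets, any GPU speed  |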

mode: lazy (default)

Ludwig stores file paths in the processed dataset and decodes images on-the-fly, one batch at a time, during training. Decoding runs in a ThreadPoolExecutor that overlaps with the GPU forward pass, matching the throughput of the eager decode path.
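The overlap between decoding and training can be pictured with a small producer/consumer sketch. This is not Ludwig's implementation: decode, prefetching_batches, and num_workers are hypothetical names, and the decode step is a placeholder. It only shows the general pattern of a thread pool decoding one batch while a bounded queue holds up to prefetch_size finished batches.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def decode(path):
    # Placeholder for real image decoding (hypothetical).
    return f"decoded:{path}"


def prefetching_batches(paths, batch_size, prefetch_size=4, num_workers=4):
    """Yield decoded batches while later batches are decoded in the background."""
    batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
    # A bounded queue caps how many decoded batches wait in memory.
    buffer = queue.Queue(maxsize=max(1, prefetch_size))
    sentinel = object()

    def producer():
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            for batch in batches:
                # put() blocks once the buffer is full, so decoding never runs far ahead.
                buffer.put(list(pool.map(decode, batch)))
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


for batch in prefetching_batches(["a.png", "b.png", "c.png"], batch_size=2):
    print(batch)
```

The consumer loop stands in for the training step: while it processes one batch, the producer thread is already decoding the next ones.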

mode: lazy_cached

On the first training epoch, images are decoded per batch (same as lazy) and written to a numpy memmap alongside the Parquet cache. From epoch 2 onward, the memmap is read directly (~0.1 ms/batch), eliminating decode overhead entirely.
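The decode-once, read-many idea can be sketched with the standard library alone. Note the substitution: this example uses Python's mmap and array modules as a stand-in for a numpy memmap, and fake_decode, write_cache, and read_image are hypothetical helpers. It is an illustration of the pattern under those assumptions, not Ludwig's cache format.

```python
import mmap
import os
import tempfile
from array import array

VALUES_PER_IMAGE = 4  # tiny stand-in for C * H * W
ITEM_SIZE = array("f").itemsize  # bytes per single-precision float


def fake_decode(index):
    # Placeholder for an expensive image decode (hypothetical).
    return array("f", [float(index)] * VALUES_PER_IMAGE)


def write_cache(path, num_images):
    # "Epoch 1": decode each image once and append it to a flat binary file.
    with open(path, "wb") as f:
        for i in range(num_images):
            fake_decode(i).tofile(f)


def read_image(mm, index):
    # "Epoch 2+": slice the memory-mapped file instead of decoding again.
    start = index * VALUES_PER_IMAGE * ITEM_SIZE
    end = start + VALUES_PER_IMAGE * ITEM_SIZE
    values = array("f")
    values.frombytes(mm[start:end])
    return values


cache_path = os.path.join(tempfile.mkdtemp(), "images.bin")
write_cache(cache_path, num_images=3)
with open(cache_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        image = read_image(mm, 2)
print(image.tolist())  # [2.0, 2.0, 2.0, 2.0]
```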

mode: eager

All images are decoded during preprocessing and stored as tensors in the Parquet cache. Use this only when the full decoded dataset fits comfortably in memory.
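A back-of-the-envelope estimate helps decide whether eager mode is feasible. The sketch below assumes 4 bytes per value (32-bit floats), which is an assumption rather than something stated above, and the dataset and batch sizes are made-up examples.

```python
def decoded_bytes(num_images, channels, height, width, bytes_per_value=4):
    """Size of a decoded image tensor, assuming 4-byte float values."""
    return num_images * channels * height * width * bytes_per_value


n, c, h, w, batch_size = 100_000, 3, 224, 224, 128
eager = decoded_bytes(n, c, h, w)          # O(N x tensor): whole dataset decoded
lazy = decoded_bytes(batch_size, c, h, w)  # O(batch): one batch decoded at a time

print(f"eager: {eager / 1e9:.1f} GB")  # eager: 60.2 GB
print(f"lazy:  {lazy / 1e6:.1f} MB")   # lazy:  77.1 MB
```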

Configuration Examples

input_features:
  - name: image
    type: image
    preprocessing:
      mode: lazy              # default
      prefetch_size: null     # auto (4 for lazy/lazy_cached, 0 for eager)
      lazy_cache_dir: null    # default: ~/.cache/ludwig/lazy_media/<feature_name>/
      height: 224
      width: 224
      num_channels: 3
      resize_method: interpolate

input_features:
  - name: image
    type: image
    preprocessing:
      mode: lazy_cached       # decode+cache on epoch 1; memmap from epoch 2+
      lazy_cache_dir: /fast/nvme/image_cache

input_features:
  - name: image
    type: image
    preprocessing:
      mode: eager             # decode everything upfront
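The automatic prefetch_size selection described in the parameter list can be summarized as a small function. This is a reading of the documented behavior, not Ludwig's code: resolve_prefetch_size is a hypothetical name, and the choice to disable prefetch after epoch 1 in lazy_cached mode even when a value was set explicitly is an assumption.

```python
def resolve_prefetch_size(mode, prefetch_size=None, epoch=1):
    """Return the number of batches to prefetch for a given mode and epoch."""
    if mode not in ("eager", "lazy", "lazy_cached"):
        raise ValueError(f"unknown mode: {mode}")
    # After the first epoch, lazy_cached reads from the memmap, so prefetch is off.
    # Assumption: this also applies when prefetch_size was set explicitly.
    if mode == "lazy_cached" and epoch > 1:
        return 0
    # An explicit value (including 0) overrides the automatic selection.
    if prefetch_size is not None:
        return prefetch_size
    # Automatic selection: 0 for eager, 4 for lazy and lazy_cached.
    return 0 if mode == "eager" else 4


print(resolve_prefetch_size("eager"))                 # 0
print(resolve_prefetch_size("lazy"))                  # 4
print(resolve_prefetch_size("lazy_cached", epoch=2))  # 0
print(resolve_prefetch_size("lazy", prefetch_size=8)) # 8
```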

Note

Lazy preprocessing is automatically disabled when using a TorchVision pretrained encoder (e.g. resnet, efficientnet, vit). Those encoders apply their own normalization pipeline which requires images to be decoded upfront.

See Choosing a Preprocessing Mode for a full comparison.

Lazy Preprocessing with HuggingFace Datasets

When loading a HuggingFace dataset, image columns are delivered as PIL.Image.Image objects — not file paths. Ludwig handles this transparently based on what the PIL Image carries:

  1. PIL Image opened from disk — PIL sets a .filename attribute pointing to the source file. Ludwig detects this and reuses that path directly (no copy).
  2. In-memory PIL Image (no .filename) — Ludwig saves the image as a PNG file in lazy_cache_dir and uses that path going forward.

HuggingFace may also deliver images as dicts:

{"bytes": b"...", "path": "/path/to/cached.jpg"}  # HF Image column format

Ludwig will reuse "path" if the file exists, otherwise decode "bytes" and save to cache.

Raw bytes and numpy.ndarray inputs (both HWC and CHW channel orderings) are also supported.

The cache is persistent and idempotent: subsequent runs with the same dataset skip the write step entirely.
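The resolution rules above can be sketched as follows. This is a simplified illustration, not Ludwig's implementation: resolve_image_path is a hypothetical helper, it handles only paths, HF-style dicts, and raw bytes, and it writes the bytes as-is rather than decoding and re-encoding them as PNG. Naming cached files by content hash is one assumed way to make the write step idempotent.

```python
import hashlib
import os
import tempfile


def resolve_image_path(entry, cache_dir):
    """Return a file path for an image entry, writing to the cache only if needed."""
    # Case 1: the entry is already a path to an existing file.
    if isinstance(entry, str) and os.path.isfile(entry):
        return entry
    # Case 2: HF-style dict; reuse "path" if the file exists, else fall back to "bytes".
    if isinstance(entry, dict):
        path = entry.get("path")
        if path and os.path.isfile(path):
            return path
        entry = entry.get("bytes")
    if not isinstance(entry, (bytes, bytearray)):
        raise TypeError("unsupported image entry in this sketch")
    # Case 3: raw bytes; a content-hash file name maps identical data to the same file.
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, hashlib.sha256(entry).hexdigest() + ".bin")
    if not os.path.exists(target):  # later runs skip the write entirely
        with open(target, "wb") as f:
            f.write(entry)
    return target


cache_dir = tempfile.mkdtemp()
first = resolve_image_path({"bytes": b"fake-image", "path": "/no/such/file.jpg"}, cache_dir)
second = resolve_image_path(b"fake-image", cache_dir)
print(first == second)  # True
```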

Controlling the Cache Directory

lazy_cache_dir controls where PNG files are written for in-memory sources (HuggingFace datasets). The decoded memmap for lazy_cached mode is placed next to the Parquet cache, not inside lazy_cache_dir.

input_features:
  - name: photo
    type: image
    preprocessing:
      mode: lazy_cached
      lazy_cache_dir: /fast/nvme/my_project/image_cache

The per-feature subdirectory is created automatically. Multiple image features each get their own subdirectory named after the feature, even if they share the same lazy_cache_dir.

Input Features

The encoder parameters specified at the feature level are:

  • tied (default null): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.
  • augmentation (default False): specifies image data augmentation operations to generate synthetic training data. More details on image augmentation can be found here.

Example image feature entry in the input features list:

name: image_column_name
type: image
tied: null
encoder:
    type: stacked_cnn

The available encoder parameters are:

Encoder type and encoder parameters can also be defined once and applied to all image input features using the Type-Global Encoder section.

Encoders

Convolutional Stack Encoder (stacked_cnn)

Stack of 2D convolutional layers with optional normalization, dropout, and down-sampling pooling layers, followed by an optional stack of fully connected layers.

Convolutional Stack Encoder takes the following optional parameters:

encoder:
    type: stacked_cnn
    conv_dropout: 0.0
    output_size: 128
    num_conv_layers: null
    out_channels: 32
    conv_norm: null
    fc_norm: null
    fc_norm_params: null
    conv_activation: relu
    kernel_size: 3
    stride: 1
    padding_mode: zeros
    padding: valid
    dilation: 1
    groups: 1
    pool_function: max
    pool_kernel_size: 2
    pool_stride: null
    pool_padding: 0
    pool_dilation: 1
    conv_norm_params: null
    conv_layers: null
    fc_dropout: 0.0
    fc_activation: relu
    fc_use_bias: true
    fc_bias_initializer: zeros
    fc_weights_initializer: xavier_uniform
    num_fc_layers: 1
    fc_layers: null
    skip: false
    adapter: null
    num_channels: null
    conv_use_bias: true

Parameters:

  • conv_dropout (default: 0.0) : Dropout rate
  • output_size (default: 128) : If output_size is not already specified in fc_layers this is the default output_size that will be used for each layer. It indicates the size of the output of a fully connected layer.
  • num_conv_layers (default: null) : Number of convolutional layers to use in the encoder.
  • out_channels (default: 32): Indicates the number of filters, and by consequence the output channels of the 2d convolution. If out_channels is not already specified in conv_layers this is the default out_channels that will be used for each layer.
  • conv_norm (default: null): If a norm is not already specified in conv_layers this is the default norm that will be used for each layer. It indicates the normalization applied to the activations and can be null, batch or layer. Options: batch, layer, null.
  • fc_norm (default: null): If a norm is not already specified in fc_layers this is the default norm that will be used for each layer. It indicates the norm of the output and can be null, batch or layer. Options: batch, layer, null.
  • fc_norm_params (default: null): Parameters used if norm is either batch or layer. For information on parameters used with batch see Torch's documentation on batch normalization or for layer see Torch's documentation on layer normalization.
  • conv_activation (default: relu): If an activation is not already specified in conv_layers this is the default activation that will be used for each layer. It indicates the activation function applied to the output. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • kernel_size (default: 3): An integer or pair of integers specifying the kernel size. A single integer specifies a square kernel, while a pair of integers specifies the height and width of the kernel in that order (h, w). If a kernel_size is not specified in conv_layers, this is the default kernel_size that will be used for each layer.
  • stride (default: 1): An integer or pair of integers specifying the stride of the convolution along the height and width. If a stride is not already specified in conv_layers, specifies the default stride of the 2D convolutional kernel that will be used for each layer.
  • padding_mode (default: zeros): If padding_mode is not already specified in conv_layers, specifies the default padding_mode of the 2D convolutional kernel that will be used for each layer. Options: zeros, reflect, replicate, circular.
  • padding (default: valid): An int, pair of ints (h, w), or one of ['valid', 'same'] specifying the padding used for convolution kernels.
  • dilation (default: 1): An int or pair of ints specifying the dilation rate to use for dilated convolution. If dilation is not already specified in conv_layers, specifies the default dilation of the 2D convolutional kernel that will be used for each layer.
  • groups (default: 1): Controls the connectivity between convolution inputs and outputs. When groups = 1, each output channel depends on every input channel. When groups > 1, input and output channels are divided into that many separate groups, where each output channel depends only on the inputs in its respective input channel group. in_channels and out_channels must both be divisible by groups.
  • pool_function (default: max): Pooling function to use. Options: max, average, avg, mean.
  • pool_kernel_size (default: 2): An integer or pair of integers specifying the pooling size. If pool_kernel_size is not specified in conv_layers this is the default value that will be used for each layer.
  • pool_stride (default: null): An integer or pair of integers specifying the pooling stride, which is the factor by which the pooling layer downsamples the feature map. Defaults to pool_kernel_size.
  • pool_padding (default: 0): An integer or pair of ints specifying pooling padding (h, w).
  • pool_dilation (default: 1): An integer or pair of ints specifying pooling dilation rate (h, w).
  • conv_norm_params (default: null): Parameters used if conv_norm is either batch or layer.
  • conv_layers (default: null): List of convolutional layers to use in the encoder.
  • fc_dropout (default: 0.0): Dropout rate
  • fc_activation (default: relu): If an activation is not already specified in fc_layers this is the default activation that will be used for each layer. It indicates the activation function applied to the output. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, gelu, silu, swish, mish, selu, prelu, relu6, hardswish, hardsigmoid, softplus, celu, swiglu, geglu, reglu, sparsemax, entmax15, null.
  • fc_use_bias (default: true): Whether the layer uses a bias vector.
  • fc_bias_initializer (default: zeros): Initializer for the bias vector. Options: constant, dirac, eye, identity, kaiming_normal, kaiming_uniform, normal, ones, orthogonal, sparse, uniform, xavier_normal, xavier_uniform, zeros.
  • fc_weights_initializer (default: xavier_uniform): Initializer for the weights matrix. Options: constant, dirac, eye, identity, kaiming_normal, kaiming_uniform, normal, ones, orthogonal, sparse, uniform, xavier_normal, xavier_uniform, zeros.
  • num_fc_layers (default: 1): The number of stacked fully connected layers.
  • fc_layers (default: null): A list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead.
  • skip (default: false):
  • adapter (default: null):

  • num_channels (default: null): Number of channels to use in the encoder.

  • conv_use_bias (default: true): If bias not already specified in conv_layers, specifies if the 2D convolutional kernel should have a bias term.
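To see how kernel_size, stride, padding, dilation, and groups interact, the sketch below uses the standard output-size formula for 2D convolution and pooling layers and counts the weights of one convolution with a square kernel. The helper names are hypothetical, and padding: valid is treated as a padding of 0.

```python
def conv2d_out_size(size, kernel_size=3, stride=1, padding=0, dilation=1):
    """Output height or width of a 2D convolution or pooling layer."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def conv2d_weight_count(in_channels, out_channels, kernel_size=3, groups=1, bias=True):
    """Number of trainable values in one 2D convolution with a square kernel."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    weights = out_channels * (in_channels // groups) * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


# One layer with the defaults above on a 224-pixel side:
after_conv = conv2d_out_size(224, kernel_size=3, stride=1, padding=0)
after_pool = conv2d_out_size(after_conv, kernel_size=2, stride=2)
print(after_conv, after_pool)          # 222 111
print(conv2d_weight_count(3, 32))      # 896
```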

MLP-Mixer Encoder

Encodes images using MLP-Mixer, as described in MLP-Mixer: An all-MLP Architecture for Vision. MLP-Mixer divides the image into equal-sized patches, applying fully connected layers to each patch to compute per-patch representations (tokens) and combining the representations with fully-connected mixer layers.

The MLP-Mixer Encoder takes the following optional parameters:

encoder:
    type: mlp_mixer
    dropout: 0.0
    num_layers: 8
    patch_size: 16
    skip: false
    adapter: null
    num_channels: null
    embed_size: 512
    token_size: 2048
    channel_dim: 256
    avg_pool: true

Parameters:

  • dropout (default: 0.0) : Dropout rate.
  • num_layers (default: 8) : The depth of the network (the number of Mixer blocks).
  • patch_size (default: 16): The image patch size. Each patch is patch_size² pixels. Must evenly divide the image width and height.
  • skip (default: false):
  • adapter (default: null):

  • num_channels (default: null): Number of channels to use in the encoder.

  • embed_size (default: 512): The patch embedding size, the output size of the mixer if avg_pool is true.
  • token_size (default: 2048): The per-patch embedding size.
  • channel_dim (default: 256): Number of channels in hidden layer.
  • avg_pool (default: true): If true, pools output over patch dimension, outputs a vector of shape (embed_size). If false, the output tensor is of shape (n_patches, embed_size), where n_patches is img_height x img_width / patch_size².
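The output shape rule for avg_pool can be checked with a few lines of arithmetic. The function name mixer_output_shape is hypothetical; the formula for n_patches and the divisibility requirement come from the parameter descriptions above.

```python
def mixer_output_shape(img_height, img_width, patch_size=16, embed_size=512, avg_pool=True):
    """Shape of the MLP-Mixer encoder output for a single image."""
    if img_height % patch_size or img_width % patch_size:
        raise ValueError("patch_size must evenly divide the image width and height")
    n_patches = (img_height * img_width) // patch_size**2
    return (embed_size,) if avg_pool else (n_patches, embed_size)


print(mixer_output_shape(224, 224))                  # (512,)
print(mixer_output_shape(224, 224, avg_pool=False))  # (196, 512)
```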

TorchVision Pretrained Model Encoders

Twenty TorchVision pretrained image classification models are available as Ludwig image encoders. The available models are:

  • AlexNet
  • ConvNeXt
  • DenseNet
  • EfficientNet
  • EfficientNetV2
  • GoogLeNet
  • Inception V3
  • MaxVit
  • MNASNet
  • MobileNet V2
  • MobileNet V3
  • RegNet
  • ResNet
  • ResNeXt
  • ShuffleNet V2
  • SqueezeNet
  • SwinTransformer
  • VGG
  • VisionTransformer
  • Wide ResNet

See TorchVision documentation for more details.

Ludwig encoder parameters for TorchVision pretrained models:

AlexNet
encoder:
    type: alexnet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: base

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: base): Pretrained model variant to use. Options: base.
ConvNeXt
encoder:
    type: convnext
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: base

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: base): Pretrained model variant to use. Options: tiny, small, base, large.
DenseNet
encoder:
    type: densenet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: 121

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: 121): Pretrained model variant to use. Options: 121, 161, 169, 201.
EfficientNet
encoder:
    type: efficientnet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: b0

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: b0): Pretrained model variant to use. Options: b0, b1, b2, b3, b4, b5, b6, b7, v2_s, v2_m, v2_l.
GoogLeNet
encoder:
    type: googlenet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: base

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: base): Pretrained model variant to use. Options: base.
Inception V3
encoder:
    type: inceptionv3
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: base

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: base): Pretrained model variant to use. Options: base.
MaxVit
encoder:
    type: maxvit
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: t

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: t): Pretrained model variant to use. Options: t.
MNASNet
encoder:
    type: mnasnet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: '0_5'

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: 0_5): Pretrained model variant to use. Options: 0_5, 0_75, 1_0, 1_3.
MobileNet V2
encoder:
    type: mobilenetv2
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: base

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: base): Pretrained model variant to use. Options: base.
MobileNet V3
encoder:
    type: mobilenetv3
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: small

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: small): Pretrained model variant to use. Options: small, large.
RegNet
encoder:
    type: regnet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: x_1_6gf

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: x_1_6gf): Pretrained model variant to use. Options: x_1_6gf, x_16gf, x_32gf, x_3_2gf, x_400mf, x_800mf, x_8gf, y_128gf, y_16gf, y_1_6gf, y_32gf, y_3_2gf, y_400mf, y_800mf, y_8gf.
ResNet
encoder:
    type: resnet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: 50

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: 50): Pretrained model variant to use. Options: 18, 34, 50, 101, 152.
ResNeXt
encoder:
    type: resnext
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: 50_32x4d

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: 50_32x4d): Pretrained model variant to use. Options: 50_32x4d, 101_32x8d, 101_64x4d.
ShuffleNet V2
encoder:
    type: shufflenet_v2
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: x0_5

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: x0_5): Pretrained model variant to use. Options: x0_5, x1_0, x1_5, x2_0.
SqueezeNet
encoder:
    type: squeezenet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: '1_0'

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: 1_0): Pretrained model variant to use. Options: 1_0, 1_1.
SwinTransformer
encoder:
    type: swin_transformer
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: t

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: t): Pretrained model variant to use. Options: t, s, b.
VGG
encoder:
    type: vgg
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: 11

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: 11): Pretrained model variant to use.
VisionTransformer
encoder:
    type: vit
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: b_16

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: b_16): Pretrained model variant to use. Options: b_16, b_32, l_16, l_32, h_14.
Wide ResNet
encoder:
    type: wide_resnet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: '50_2'

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: 50_2): Pretrained model variant to use. Options: 50_2, 101_2.

Note:

  • At this time Ludwig supports only the DEFAULT pretrained weights, which are the best available weights for a specific model. More details on DEFAULT weights can be found in this blog post.
  • Some TorchVision pretrained models consume large amounts of memory. These model_variant values require more than 12GB of memory:
      • efficientnet: b7
      • regnet: y_128gf
      • vit: h_14

U-Net Encoder

The U-Net encoder is based on U-Net: Convolutional Networks for Biomedical Image Segmentation. The encoder implements the contracting downsampling path of the U-Net stack.

U-Net Encoder takes the following optional parameters:

encoder:
    type: unet
    conv_norm: batch
    skip: false
    adapter: null

Parameters:

  • conv_norm (default: batch): This is the default norm that will be used for each double conv layer. It can be null or batch. Options: batch, null.
  • skip (default: false):
  • adapter (default: null):

CLIP Encoder

The CLIP image encoder (Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021) encodes images using CLIP's vision transformer. The resulting embeddings are aligned with text in a shared latent space, enabling zero-shot classification and multimodal tasks.

Use CLIP when you need visual features that are semantically aligned with text -- for example, when combining image and text inputs for multimodal classification, or when you want zero-shot image classification without task-specific fine-tuning.

Default pretrained model: openai/clip-vit-base-patch32

encoder:
    type: clip
    skip: false
    adapter: null
    use_pretrained: true
    trainable: true
    saved_weights_in_checkpoint: false
    pretrained_model_name_or_path: openai/clip-vit-base-patch32

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • trainable (default: true):
  • saved_weights_in_checkpoint (default: false):
  • pretrained_model_name_or_path (default: openai/clip-vit-base-patch32): HuggingFace model path or name for the CLIP vision model.

DINOv2 Encoder

The DINOv2 encoder (Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision", TMLR 2024) produces self-supervised visual features that work well as frozen backbones. Unlike CLIP, DINOv2 does not require text alignment -- it learns visual features purely from images using self-distillation.

Use DINOv2 when you want a general-purpose frozen feature extractor, especially for dense prediction tasks (segmentation, depth estimation) or when you want to avoid fine-tuning the vision backbone.

Default pretrained model: facebook/dinov2-base

encoder:
    type: dinov2
    skip: false
    adapter: null
    use_pretrained: true
    trainable: true
    saved_weights_in_checkpoint: false
    pretrained_model_name_or_path: facebook/dinov2-base

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • trainable (default: true):
  • saved_weights_in_checkpoint (default: false):
  • pretrained_model_name_or_path (default: facebook/dinov2-base): HuggingFace model path or name for the DINOv2 model.

SigLIP Encoder

The SigLIP encoder (Zhai et al., "Sigmoid Loss for Language Image Pre-Training", ICCV 2023) uses sigmoid loss instead of softmax for image-text pre-training. This enables better scaling to larger batch sizes and more efficient training compared to CLIP, while maintaining similar zero-shot capabilities.
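The difference between the two objectives can be sketched in a few lines of plain Python. This is a minimal illustration, not Ludwig's or HuggingFace's implementation: the function names are hypothetical, `sim` is assumed to be a square matrix of already temperature-scaled image-text similarity logits with matching pairs on the diagonal, and only the image-to-text half of the softmax loss is shown.

```python
import math

def softmax_contrastive_loss(sim):
    """Softmax (CLIP-style) image-to-text loss over a similarity matrix."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        # The normalizer couples every pair in row i, so the whole batch is needed.
        log_norm = math.log(sum(math.exp(s) for s in sim[i]))
        total += log_norm - sim[i][i]
    return total / n

def sigmoid_pairwise_loss(sim, bias=0.0):
    """Sigmoid (SigLIP-style) loss: each pair is an independent binary problem."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            z = label * (sim[i][j] + bias)
            total += math.log1p(math.exp(-z))  # equals -log(sigmoid(z))
    return total / n

aligned = [[5.0, -5.0], [-5.0, 5.0]]   # matching pairs score highest
swapped = [[-5.0, 5.0], [5.0, -5.0]]   # matching pairs score lowest
print(softmax_contrastive_loss(aligned), softmax_contrastive_loss(swapped))
print(sigmoid_pairwise_loss(aligned), sigmoid_pairwise_loss(swapped))
```

Because the sigmoid loss has no row-wise normalizer, each term depends on a single pair, which is what makes it easier to split across devices at large batch sizes.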

Default pretrained model: google/siglip-base-patch16-224

encoder:
    type: siglip
    skip: false
    adapter: null
    use_pretrained: true
    trainable: true
    saved_weights_in_checkpoint: false
    pretrained_model_name_or_path: google/siglip-base-patch16-224

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • trainable (default: true):
  • saved_weights_in_checkpoint (default: false):
  • pretrained_model_name_or_path (default: google/siglip-base-patch16-224): HuggingFace model path or name for the SigLIP vision model.

ConvNeXt V2 Encoder

The ConvNeXt V2 encoder (Woo et al., "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR 2023) improves on ConvNeXt with Global Response Normalization (GRN) and fully convolutional masked autoencoder (FCMAE) pre-training. It is a pure-CNN architecture that matches or exceeds vision transformers on ImageNet.

Available via TIMM with model variants from atto (3.7M params) to huge (660M params).

encoder:
    type: convnextv2
    skip: false
    adapter: null
    use_pretrained: true
    saved_weights_in_checkpoint: false
    trainable: true
    model_name: convnextv2_base

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_name (default: convnextv2_base): ConvNeXt V2 model variant. Improved ConvNeXt with Global Response Normalization (GRN) and FCMAE pre-training. Variants with '.fcmae_ft_in1k' are fine-tuned on ImageNet-1K. Variants with '.fcmae_ft_in22k_in1k' are pre-trained on ImageNet-22K and fine-tuned on ImageNet-1K. Variants with '_384' use 384x384 input resolution. Options: convnextv2_atto, convnextv2_femto, convnextv2_pico, convnextv2_nano, convnextv2_tiny, convnextv2_base, convnextv2_large, convnextv2_huge, convnextv2_atto.fcmae_ft_in1k, convnextv2_femto.fcmae_ft_in1k, convnextv2_pico.fcmae_ft_in1k, convnextv2_nano.fcmae_ft_in1k, convnextv2_tiny.fcmae_ft_in1k, convnextv2_base.fcmae_ft_in1k, convnextv2_large.fcmae_ft_in1k, convnextv2_huge.fcmae_ft_in1k, convnextv2_base.fcmae_ft_in22k_in1k, convnextv2_large.fcmae_ft_in22k_in1k, convnextv2_huge.fcmae_ft_in22k_in1k, convnextv2_base.fcmae_ft_in22k_in1k_384, convnextv2_large.fcmae_ft_in22k_in1k_384, convnextv2_huge.fcmae_ft_in22k_in1k_384.

Generic TIMM Encoder

The timm encoder exposes the full pytorch-image-models library as a single configurable encoder. Over 1,000 pretrained vision models are available, including all MetaFormer variants, EfficientFormer V2, DaViT, FastViT, and many more.

Install TIMM before using this encoder:

pip install timm

encoder:
  type: timm
  model_name: caformer_s18.sail_in22k_ft_in1k
  use_pretrained: true
  trainable: true

Browse available model names at timm.fast.ai.

encoder:
    type: timm
    skip: false
    adapter: null
    use_pretrained: true
    saved_weights_in_checkpoint: false
    trainable: true
    model_name: caformer_s18

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_name (default: caformer_s18): Name of the timm model to use. Any model from the timm library is supported. See https://huggingface.co/docs/timm for available models.

CAFormer Encoder

CAFormer (Yu et al., "MetaFormer Baselines for Vision", TPAMI 2024) is a hybrid MetaFormer that uses depthwise separable convolutions in the lower stages and self-attention in the upper stages. It achieves state-of-the-art accuracy/efficiency trade-offs on ImageNet.

Variant       Params  ImageNet top-1
caformer_s18  26 M    83.6 %
caformer_s36  39 M    84.5 %
caformer_m36  56 M    85.2 %
caformer_b36  99 M    85.5 %

encoder:
  type: caformer
  model_name: caformer_s18.sail_in22k_ft_in1k
  use_pretrained: true

encoder:
    type: caformer
    skip: false
    adapter: null
    use_pretrained: true
    saved_weights_in_checkpoint: false
    trainable: true
    model_name: caformer_s18

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_name (default: caformer_s18): CAFormer model variant. Hybrid Conv+Attention MetaFormer achieving SOTA accuracy. Variants with '.sail_in22k_ft_in1k' are pretrained on ImageNet-21K and finetuned on ImageNet-1K. Variants with '_384' use 384x384 input resolution. Options: caformer_s18, caformer_s36, caformer_m36, caformer_b36, caformer_s18.sail_in22k_ft_in1k, caformer_s18.sail_in22k_ft_in1k_384, caformer_s36.sail_in22k_ft_in1k, caformer_s36.sail_in22k_ft_in1k_384, caformer_m36.sail_in22k_ft_in1k, caformer_m36.sail_in22k_ft_in1k_384, caformer_b36.sail_in22k_ft_in1k, caformer_b36.sail_in22k_ft_in1k_384.

ConvFormer Encoder

ConvFormer replaces the attention token-mixer with a large-kernel depthwise convolution, making it a pure-CNN MetaFormer that outperforms ConvNeXt while being fully convolutional (no positional embeddings, any resolution input).

Variant         Params  ImageNet top-1
convformer_s18  27 M    83.0 %
convformer_s36  40 M    84.1 %
convformer_m36  57 M    84.5 %
convformer_b36  100 M   84.8 %

encoder:
  type: convformer
  model_name: convformer_s18.sail_in22k_ft_in1k
  use_pretrained: true

encoder:
    type: convformer
    skip: false
    adapter: null
    use_pretrained: true
    saved_weights_in_checkpoint: false
    trainable: true
    model_name: convformer_s18

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_name (default: convformer_s18): ConvFormer model variant. Pure CNN MetaFormer that outperforms ConvNeXt. Variants with '.sail_in22k_ft_in1k' are pretrained on ImageNet-21K and finetuned on ImageNet-1K. Options: convformer_s18, convformer_s36, convformer_m36, convformer_b36, convformer_s18.sail_in22k_ft_in1k, convformer_s18.sail_in22k_ft_in1k_384, convformer_s36.sail_in22k_ft_in1k, convformer_s36.sail_in22k_ft_in1k_384, convformer_m36.sail_in22k_ft_in1k, convformer_m36.sail_in22k_ft_in1k_384, convformer_b36.sail_in22k_ft_in1k, convformer_b36.sail_in22k_ft_in1k_384.

PoolFormer Encoder

PoolFormer uses simple average pooling as the token mixer, which suggests that the MetaFormer architecture itself (not the specific mixer) is responsible for much of the strong performance of modern vision transformers. PoolFormerV2 adds grouped-normalization and extra depth to further close the gap with attention-based models.
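As a rough illustration of how little the token mixer does, the sketch below applies average pooling to a single-channel feature map stored as nested lists. It is an assumption-laden sketch rather than the timm implementation: the function name is hypothetical, and both the subtraction of the input (to account for the surrounding residual connection) and the exclusion of padding from the mean follow the PoolFormer reference design as described in its paper.

```python
def pool_token_mixer(feature_map, pool_size=3):
    """Average each position over its neighborhood, then subtract the input."""
    height, width = len(feature_map), len(feature_map[0])
    radius = pool_size // 2
    mixed = []
    for y in range(height):
        row = []
        for x in range(width):
            # Only in-bounds neighbors contribute (padding is excluded from the mean).
            neighbors = [
                feature_map[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(sum(neighbors) / len(neighbors) - feature_map[y][x])
        mixed.append(row)
    return mixed

spike = [[0.0, 0.0, 0.0],
         [0.0, 9.0, 0.0],
         [0.0, 0.0, 0.0]]
for row in pool_token_mixer(spike):
    print(row)
```

The mixer has no learnable parameters; all learned capacity sits in the channel MLPs and normalization layers around it.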

Variant           Params  ImageNet top-1
poolformerv2_s12  12 M    80.3 %
poolformerv2_s24  21 M    82.0 %
poolformerv2_s36  31 M    82.7 %
poolformerv2_m36  56 M    83.5 %
poolformerv2_m48  73 M    83.8 %

encoder:
  type: poolformer
  model_name: poolformerv2_s12
  use_pretrained: true

encoder:
    type: poolformer
    skip: false
    adapter: null
    use_pretrained: true
    saved_weights_in_checkpoint: false
    trainable: true
    model_name: poolformerv2_s12

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_name (default: poolformerv2_s12): PoolFormer model variant. MetaFormer using simple average pooling as token mixer. V2 variants use StarReLU activation and improved training recipe. Options: poolformerv2_s12, poolformerv2_s24, poolformerv2_s36, poolformerv2_m36, poolformerv2_m48, poolformer_s12, poolformer_s24, poolformer_s36, poolformer_m36, poolformer_m48.

Deprecated Encoders (planned to remove in v0.8)

Legacy ResNet Encoder

DEPRECATED: This encoder is deprecated and will be removed in a future release. Please use the equivalent TorchVision ResNet encoder instead.

Implements ResNet V2 as described in Identity Mappings in Deep Residual Networks.

The ResNet encoder takes the following optional parameters:

encoder:
    type: resnet
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: 50

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: 50): Pretrained model variant to use. Options: 18, 34, 50, 101, 152.

Legacy Vision Transformer Encoder

DEPRECATED: This encoder is deprecated and will be removed in a future release. Please use the equivalent TorchVision VisionTransformer encoder instead.

Encodes images using a Vision Transformer as described in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

Vision Transformer divides the image into equal-sized patches, uses a linear transformation to encode each flattened patch, then applies a deep transformer architecture to the sequence of encoded patches.
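The patch-splitting and linear-embedding steps can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the encoder's actual code: the image is a single-channel nested list, the function names are hypothetical, and the class token, position embeddings, and transformer layers are omitted.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ])
    return patches

def linear_embed(patches, weights):
    """Project each flattened patch with a (patch_dim x embed_dim) weight matrix."""
    embed_dim = len(weights[0])
    return [
        [sum(p * w_row[k] for p, w_row in zip(patch, weights)) for k in range(embed_dim)]
        for patch in patches
    ]

image = [[4 * r + c for c in range(4)] for r in range(4)]  # 4 x 4 image, values 0..15
patches = split_into_patches(image, patch_size=2)          # four patches of length 4
weights = [[1, 0], [0, 1], [1, 0], [0, 1]]                 # toy 4 x 2 projection
tokens = linear_embed(patches, weights)                    # sequence fed to the transformer
print(len(tokens), tokens[0])
```

For the b_16 variant the same idea applies with 16 x 16 patches, so the sequence length grows with the square of the image side divided by 16.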

The Vision Transformer Encoder takes the following optional parameters:

encoder:
    type: vit
    skip: false
    adapter: null
    use_pretrained: true
    model_cache_dir: null
    saved_weights_in_checkpoint: false
    trainable: true
    model_variant: b_16

Parameters:

  • skip (default: false):
  • adapter (default: null):
  • use_pretrained (default: true):
  • model_cache_dir (default: null):
  • saved_weights_in_checkpoint (default: false):
  • trainable (default: true):
  • model_variant (default: b_16): Pretrained model variant to use. Options: b_16, b_32, l_16, l_32, h_14.

Image Augmentation

Image augmentation is a technique used to increase the diversity of a training dataset by applying random transformations to the images. The goal is to train a model that is robust to the variations in the training data.

Augmentation is specified by the augmentation section in the image feature configuration and can be specified in one of the following ways:

Boolean: False (Default). No augmentation is applied to the images.

augmentation: False

Boolean: True. The following augmentation methods are applied to the images: random_horizontal_flip and random_rotate.

augmentation: True

List of Augmentation Methods. One or more of the following augmentation methods are applied to the images in the order specified by the user: random_horizontal_flip, random_vertical_flip, random_rotate, random_blur, random_brightness, and random_contrast. The following is an illustrative example.

augmentation:
    - type: random_horizontal_flip
    - type: random_vertical_flip
    - type: random_rotate
      degree: 10
    - type: random_blur
      kernel_size: 3
    - type: random_brightness
      min: 0.5
      max: 2.0
    - type: random_contrast
      min: 0.5
      max: 2.0

Augmentation is applied to the batch of images in the training set only. The validation and test sets are not augmented.
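The three ways of specifying augmentation can be thought of as resolving to one ordered list of steps. The sketch below shows that resolution in Python; it is an illustration only, and the function names and the split labels ("training", "validation", "test") are hypothetical rather than Ludwig internals.

```python
# Steps applied when `augmentation: True`, as listed above.
DEFAULT_AUGMENTATION = [
    {"type": "random_horizontal_flip"},
    {"type": "random_rotate"},
]

def normalize_augmentation(spec):
    """Resolve False, True, or an explicit list into an ordered list of steps."""
    if spec is False:
        return []
    if spec is True:
        return [dict(step) for step in DEFAULT_AUGMENTATION]
    if isinstance(spec, list):
        return [dict(step) for step in spec]  # user-specified order is preserved
    raise TypeError("augmentation must be a boolean or a list of steps")

def steps_for_split(spec, split):
    """Augmentation is applied to the training split only."""
    return normalize_augmentation(spec) if split == "training" else []

print(steps_for_split(True, "training"))
print(steps_for_split(True, "validation"))
```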

The following illustrates how augmentation affects an image:

[Image: Original Image]

Horizontal Flip: Image is randomly flipped horizontally.

type: random_horizontal_flip

[Image: Horizontal Flip]

Vertical Flip: Image is randomly flipped vertically.

type: random_vertical_flip

[Image: Vertical Flip]

Rotate: Image is randomly rotated by an amount in the range [-degree, +degree]. degree must be a positive integer.

type: random_rotate
degree: 15

Parameters:

  • degree (default: 15): Range of angle for random rotation, i.e., [-degree, +degree].

The following shows the effect of rotating an image:

[Image: Rotate Image]

Blur: Image is randomly blurred using a Gaussian filter with kernel size specified by the user. The kernel_size must be a positive, odd integer.

type: random_blur
kernel_size: 3

Parameters:

  • kernel_size (default: 3): Kernel size for random blur.

The following shows the effect of blurring an image with various kernel sizes:

[Image: Blur Image]

Adjust Brightness: Image brightness is adjusted by a factor randomly selected in the range [min, max]. Both min and max must be floats greater than 0, with min less than max.

type: random_brightness
min: 0.5
max: 2.0

Parameters:

  • min (default: 0.5): Minimum factor for random brightness.
  • max (default: 2.0): Maximum factor for random brightness.

The following shows the effect of brightness adjustment with various factors:

[Image: Adjust Brightness]

Adjust Contrast: Image contrast is adjusted by a factor randomly selected in the range [min, max]. Both min and max must be floats greater than 0, with min less than max.

type: random_contrast
min: 0.5
max: 2.0

Parameters:

  • min (default: 0.5): Minimum factor for random contrast.
  • max (default: 2.0): Maximum factor for random contrast.

The following shows the effect of contrast adjustment with various factors:

[Image: Adjust Contrast]
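To make the meaning of the brightness and contrast factors concrete, here is a small Python sketch on a grayscale image stored as nested lists. It is a simplified model under stated assumptions, not Ludwig's implementation: pixel values are assumed to be floats in [0, 1], contrast is assumed to scale each pixel's distance from the image mean, and all function names are hypothetical.

```python
import random

def clip01(value):
    """Keep a pixel value inside the assumed [0, 1] range."""
    return min(1.0, max(0.0, value))

def adjust_brightness(image, factor):
    """Scale every pixel by factor: values below 1 darken, above 1 brighten."""
    return [[clip01(p * factor) for p in row] for row in image]

def adjust_contrast(image, factor):
    """Scale each pixel's distance from the image mean by factor."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[clip01(mean + factor * (p - mean)) for p in row] for row in image]

def sample_factor(min_value=0.5, max_value=2.0, rng=random):
    """Draw a factor uniformly from [min, max], as the two methods describe."""
    return rng.uniform(min_value, max_value)

image = [[0.25, 0.75]]
factor = sample_factor(0.5, 2.0, rng=random.Random(0))
print(adjust_brightness(image, factor))
print(adjust_contrast(image, factor))
```

A factor of 1.0 leaves the image unchanged in both cases, which is why the default range [0.5, 2.0] straddles 1.0.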

Illustrative Examples of Image Feature Configuration with Augmentation

name: image_column_name
type: image
encoder:
    type: resnet
    model_variant: 18
    use_pretrained: true
    model_cache_dir: null
    trainable: true
augmentation: false

name: image_column_name
type: image
encoder:
    type: stacked_cnn
augmentation: true

name: image_column_name
type: image
encoder:
    type: alexnet
augmentation:
    - type: random_horizontal_flip
    - type: random_rotate
      degree: 10
    - type: random_blur
      kernel_size: 3
    - type: random_brightness
      min: 0.5
      max: 2.0
    - type: random_contrast
      min: 0.5
      max: 2.0
    - type: random_vertical_flip
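The parameter constraints stated for each augmentation method can be checked before training starts. The following Python sketch is one way to do that; the function names are hypothetical, the defaults mirror the documented ones, and Ludwig's own schema validation may behave differently.

```python
def _is_positive_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0

def validate_augmentation_step(step):
    """Raise ValueError if a step violates the documented constraints."""
    kind = step["type"]
    if kind in ("random_horizontal_flip", "random_vertical_flip"):
        return  # flips take no parameters
    if kind == "random_rotate":
        if not _is_positive_int(step.get("degree", 15)):
            raise ValueError("degree must be a positive integer")
    elif kind == "random_blur":
        kernel_size = step.get("kernel_size", 3)
        if not _is_positive_int(kernel_size) or kernel_size % 2 == 0:
            raise ValueError("kernel_size must be a positive, odd integer")
    elif kind in ("random_brightness", "random_contrast"):
        low, high = step.get("min", 0.5), step.get("max", 2.0)
        if not 0 < low < high:
            raise ValueError("min and max must be greater than 0, with min less than max")
    else:
        raise ValueError(f"unknown augmentation type: {kind}")

def validate_augmentation(steps):
    """Validate every step of an augmentation list, in order."""
    for step in steps:
        validate_augmentation_step(step)

validate_augmentation([
    {"type": "random_horizontal_flip"},
    {"type": "random_rotate", "degree": 10},
    {"type": "random_blur", "kernel_size": 3},
    {"type": "random_brightness", "min": 0.5, "max": 2.0},
])
```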

Output Features

Image features can be used when semantic segmentation needs to be performed. Ludwig 0.15 exposes three segmentation decoders: unet, segformer, and fpn.

Example image output feature using default parameters:

name: image_column_name
type: image
reduce_input: sum
dependencies: []
reduce_dependencies: sum
loss:
    type: softmax_cross_entropy
decoder:
    type: unet

Parameters:

  • reduce_input (default sum): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension).
  • dependencies (default []): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies.
  • reduce_dependencies (default sum): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension).
  • loss (default {type: softmax_cross_entropy}): is a dictionary containing a loss type. softmax_cross_entropy is the only supported loss type for image output features. See Loss for details.
  • decoder (default: {"type": "unet"}): Decoder for the desired task. Options: unet, segformer, fpn. See Decoder for details.

Decoders

U-Net Decoder

The U-Net decoder is based on U-Net: Convolutional Networks for Biomedical Image Segmentation. The decoder implements the expansive upsampling path of the U-Net stack. Semantic segmentation supports one input and one output feature. The num_fc_layers in the decoder and combiner sections must be set to 0 as U-Net does not have any fully connected layers.

U-Net Decoder takes the following optional parameters:

decoder:
    type: unet
    conv_norm: batch
    num_stages: 4
    fc_layers: null
    num_fc_layers: 0
    fc_output_size: 256
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm: null
    fc_norm_params: null
    fc_activation: relu
    fc_dropout: 0.0

Parameters:

  • conv_norm (default: batch): Norm used for each double conv layer. It can be null or batch. Options: batch, null.
  • num_stages (default: 4): Number of encoder/decoder stage pairs in the U-Net. The input image dimensions must be divisible by 2^num_stages. Increasing this value lets the model capture features at more spatial scales.
  • fc_layers (default: null):
  • num_fc_layers (default: 0):
  • fc_output_size (default: 256):
  • fc_use_bias (default: true):
  • fc_weights_initializer (default: xavier_uniform):
  • fc_bias_initializer (default: zeros):
  • fc_norm (default: null):
  • fc_norm_params (default: null):
  • fc_activation (default: relu):
  • fc_dropout (default: 0.0):

The decoder depth is configurable via num_stages (default 4). Each stage doubles the spatial resolution during upsampling, so the combiner output height and width must be divisible by 2 ** num_stages. Deeper stacks capture longer-range spatial context at the cost of more parameters.
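The divisibility requirement can be checked ahead of time with a few lines of Python. This sketch assumes that each stage changes the resolution by exactly a factor of 2, as described above; the function name is hypothetical and not part of Ludwig.

```python
def unet_stage_sizes(height, width, num_stages=4):
    """Return the (height, width) at each U-Net scale, from full size to bottleneck."""
    factor = 2 ** num_stages
    if height % factor or width % factor:
        raise ValueError(
            f"{height}x{width} is not divisible by 2 ** num_stages = {factor}"
        )
    # Each stage halves the resolution on the way down (and doubles it on the way up).
    return [(height // 2 ** s, width // 2 ** s) for s in range(num_stages + 1)]

print(unet_stage_sizes(128, 96, num_stages=4))
```

With the default num_stages of 4, height and width must be multiples of 16, so a 100 x 100 image would need to be resized (for example to 96 x 96 or 112 x 112) in preprocessing.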

SegFormer Decoder

segformer implements the lightweight all-MLP decoder from Xie et al., NeurIPS 2021. Instead of transposed convolutions it projects all encoder stages into a shared hidden size and fuses them with a single MLP — much cheaper than U-Net while remaining competitive on standard segmentation benchmarks when paired with a pretrained hierarchical encoder such as Swin or ConvNeXt V2.

output_features:
  - name: mask
    type: image
    decoder:
      type: segformer
      hidden_size: 256
      dropout: 0.1
      num_classes: 21

decoder:
    type: segformer
    hidden_size: 256
    dropout: 0.1
    fc_layers: null
    num_fc_layers: 0
    fc_output_size: 256
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm: null
    fc_norm_params: null
    fc_activation: relu
    fc_dropout: 0.0

Parameters:

  • hidden_size (default: 256): Width of the hidden MLP projection applied to the feature map before upsampling. Larger values increase capacity but also compute cost.
  • dropout (default: 0.1): Dropout probability applied after the hidden MLP projection.
  • fc_layers (default: null):
  • num_fc_layers (default: 0):
  • fc_output_size (default: 256):
  • fc_use_bias (default: true):
  • fc_weights_initializer (default: xavier_uniform):
  • fc_bias_initializer (default: zeros):
  • fc_norm (default: null):
  • fc_norm_params (default: null):
  • fc_activation (default: relu):
  • fc_dropout (default: 0.0):

FPN Decoder

fpn implements the Feature Pyramid Network decoder from Lin et al., CVPR 2017. It builds a top-down pyramid with lateral connections across multiple encoder stages, producing multi-scale features that are useful for dense prediction tasks and for segmenting objects at very different scales.
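The top-down merge can be sketched on one-dimensional, single-channel feature lists. This is a deliberately reduced illustration, not Ludwig's decoder: the function names are hypothetical, the inputs are assumed to be lateral features that have already been projected to a common width, each level is assumed to be half the length of the previous one, and the smoothing convolutions applied after merging in the original paper are omitted.

```python
def upsample_nearest(values, factor=2):
    """Nearest-neighbor upsampling: repeat every value `factor` times."""
    return [v for v in values for _ in range(factor)]

def fpn_top_down(laterals):
    """Merge lateral features, ordered fine to coarse, along the top-down pathway."""
    merged = [list(laterals[-1])]  # the coarsest level starts the pyramid
    for lateral in reversed(laterals[:-1]):
        upsampled = upsample_nearest(merged[0])
        if len(upsampled) != len(lateral):
            raise ValueError("each level must be half the length of the previous one")
        # Lateral connection: add same-resolution encoder features to the upsampled context.
        merged.insert(0, [a + b for a, b in zip(lateral, upsampled)])
    return merged

levels = [[1, 1, 1, 1], [10, 10], [100]]  # fine to coarse
print(fpn_top_down(levels))
```

Every output level ends up carrying information from all coarser levels, which is what makes the pyramid useful for objects at very different scales.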

output_features:
  - name: mask
    type: image
    decoder:
      type: fpn
      num_channels: 256
      num_levels: 4
      num_classes: 80

decoder:
    type: fpn
    num_channels: 256
    num_levels: 4
    fc_layers: null
    num_fc_layers: 0
    fc_output_size: 256
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm: null
    fc_norm_params: null
    fc_activation: relu
    fc_dropout: 0.0

Parameters:

  • num_channels (default: 256): Number of channels in each FPN level after the lateral 1x1 projection. All pyramid levels are projected to this width before the top-down merge.
  • num_levels (default: 4): Number of pyramid levels to build in the top-down pathway. More levels capture coarser context; typical range is 2-5.
  • fc_layers (default: null):
  • num_fc_layers (default: 0):
  • fc_output_size (default: 256):
  • fc_use_bias (default: true):
  • fc_weights_initializer (default: xavier_uniform):
  • fc_bias_initializer (default: zeros):
  • fc_norm (default: null):
  • fc_norm_params (default: null):
  • fc_activation (default: relu):
  • fc_dropout (default: 0.0):

Decoder type and decoder parameters can also be defined once and applied to all image output features using the Type-Global Decoder section.

Loss

Softmax Cross Entropy

loss:
    type: softmax_cross_entropy
    class_weights: null
    weight: 1.0
    robust_lambda: 0
    confidence_penalty: 0
    class_similarities: null
    class_similarities_temperature: 0

Parameters:

  • class_weights (default: null) : Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied to the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like {class_a: 0.5, class_b: 0.7, ...}.
  • weight (default: 1.0): Weight of the loss.
  • robust_lambda (default: 0): Replaces the loss with (1 - robust_lambda) * loss + robust_lambda / c where c is the number of classes. Useful in case of noisy labels.
  • confidence_penalty (default: 0): Penalizes overconfident (low-entropy) predictions by adding an a * (max_entropy - entropy) / max_entropy term to the loss, where a is the value of this parameter. Useful in case of noisy labels.
  • class_similarities (default: null): If not null, a c x c matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if class_similarities_temperature is greater than 0. The ordering of rows and columns follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too).
  • class_similarities_temperature (default: 0): The temperature parameter of the softmax that is performed on each row of class_similarities. The output of that softmax is used to determine the supervision vector to provide instead of the one hot vector that would be provided otherwise for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.
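The formulas in the parameter descriptions above can be written out for a single prediction. The Python sketch below is an illustration of those formulas only, not Ludwig's loss code: the function names are hypothetical, the order in which robust_lambda and confidence_penalty are combined is an assumption, and so is dividing the similarity row by the temperature before the softmax.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, target, robust_lambda=0.0, confidence_penalty=0.0):
    """Cross entropy for one datapoint with the two optional adjustments."""
    probs = softmax(logits)
    c = len(logits)  # number of classes
    loss = -math.log(probs[target])
    if robust_lambda > 0:
        loss = (1 - robust_lambda) * loss + robust_lambda / c
    if confidence_penalty > 0:
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        max_entropy = math.log(c)  # entropy of the uniform distribution
        loss += confidence_penalty * (max_entropy - entropy) / max_entropy
    return loss

def similarity_targets(class_similarities, target, temperature):
    """Soft supervision vector from the target's row of class_similarities."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    # Assumption: the row is divided by the temperature before the softmax.
    return softmax([s / temperature for s in class_similarities[target]])

print(softmax_cross_entropy([2.0, 0.5, 0.1], target=0))
print(softmax_cross_entropy([2.0, 0.5, 0.1], target=0, robust_lambda=0.1))
print(softmax_cross_entropy([2.0, 0.5, 0.1], target=0, confidence_penalty=0.2))
print(similarity_targets([[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]], 0, 0.5))
```

Under that temperature assumption, a lower value sharpens the soft target toward the one-hot vector and a higher value spreads more weight onto similar classes.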

Loss and loss related parameters can also be defined once and applied to all image output features using the Type-Global Loss section.

Metrics

The metrics calculated every epoch and available for image features are accuracy and loss. You can set either of them as validation_metric in the training section of the configuration if you set validation_field to the name of an image output feature.