⇅ Image Features

Input image features are transformed into a float valued tensors of size N x C x H x W (where N is the size of the dataset, C is the number of channels, and H x W is the height and width of the image (can be specified by the user). These tensors are added to HDF5 with a key that reflects the name of column in the dataset.

The column name is added to the JSON file, with an associated dictionary containing preprocessing information about the sizes of the resizing.

Supported Image Formats¶

The number of channels in the image is determined by the image format. The following table lists the supported image formats and the number of channels.

Format	Number of channels
Grayscale	1
Grayscale with Alpha	2
RGB	3
RGB with Alpha	4

Preprocessing¶

During preprocessing, raw image files are transformed into numpy arrays and saved in the hdf5 format.

Note

Images passed to an image encoder are expected to have the same size. If images are different sizes, by default they will be resized to the dimensions of the first image in the dataset. Optionally, a resize_method together with a target width and height can be specified in the feature preprocessing parameters, in which case all images will be resized to the specified target size.

preprocessing:
    missing_value_strategy: bfill
    fill_value: null
    height: null
    width: null
    num_channels: null
    num_processes: 1
    num_classes: null
    resize_method: interpolate
    infer_image_num_channels: true
    infer_image_dimensions: true
    infer_image_max_height: 256
    infer_image_max_width: 256
    infer_image_sample_size: 100
    standardize_image: null
    in_memory: true
    requires_equal_dimensions: false
    infer_image_num_classes: false

Parameters:

missing_value_strategy (default: bfill) : What strategy to follow when there's a missing value in an image column Options: fill_with_const, fill_with_mode, bfill, ffill, drop_row.
fill_value (default: null): The maximum number of most common tokens to be considered. If the data contains more than this amount, the most infrequent tokens will be treated as unknown.
height (default: null): The image height in pixels. If this parameter is set, images will be resized to the specified height using the resize_method parameter. If None, images will be resized to the size of the first image in the dataset.
width (default: null): The image width in pixels. If this parameter is set, images will be resized to the specified width using the resize_method parameter. If None, images will be resized to the size of the first image in the dataset.
num_channels (default: null): Number of channels in the images. If specified, images will be read in the mode specified by the number of channels. If not specified, the number of channels will be inferred from the image format of the first valid image in the dataset.
num_processes (default: 1): Specifies the number of processes to run for preprocessing images.
num_classes (default: null): Number of channel classes in the images. If specified, this value will be validated against the inferred number of classes. Use 2 to convert grayscale images to binary images.
resize_method (default: interpolate): The method to use for resizing images. Options: crop_or_pad, interpolate.
infer_image_num_channels (default: true): If true, then the number of channels in the dataset is inferred from a sample of the first image in the dataset. Options: true, false.
infer_image_dimensions (default: true): If true, then the height and width of images in the dataset will be inferred from a sample of the first image in the dataset. Each image that doesn't conform to these dimensions will be resized according to resize_method. If set to false, then the height and width of images in the dataset will be specified by the user. Options: true, false.
infer_image_max_height (default: 256): If infer_image_dimensions is set, this is used as the maximum height of the images in the dataset.
infer_image_max_width (default: 256): If infer_image_dimensions is set, this is used as the maximum width of the images in the dataset.
infer_image_sample_size (default: 100): The sample size used for inferring dimensions of images in infer_image_dimensions.
standardize_image (default: null): Standardize image by per channel mean centering and standard deviation scaling . Options: imagenet1k, null.
in_memory (default: true): Defines whether image dataset will reside in memory during the training process or will be dynamically fetched from disk (useful for large datasets). In the latter case a training batch of input images will be fetched from disk each training iteration. Options: true, false.
requires_equal_dimensions (default: false): If true, then width and height must be equal. Options: true, false.
infer_image_num_classes (default: false): If true, then the number of channel classes in the dataset will be inferred from a sample of the first image in the dataset. Each unique channel value will be mapped to a class and preprocessing will create a masked image based on the channel classes. Options: true, false.

Preprocessing parameters can also be defined once and applied to all image input features using the Type-Global Preprocessing section.

Input Features¶

The encoder parameters specified at the feature level are:

tied (default null): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.
augmentation (default False): specifies image data augmentation operations to generate synthetic training data. More details on image augmentation can be found here.

Example image feature entry in the input features list:

name: image_column_name
type: image
tied: null
encoder: 
    type: stacked_cnn

The available encoder parameters are:

type (default stacked_cnn): the possible values are stacked_cnn, resnet, mlp_mixer, vit, and TorchVision Pretrained Image Classification models.

Encoder type and encoder parameters can also be defined once and applied to all image input features using the Type-Global Encoder section.

Encoders¶

Convolutional Stack Encoder (`stacked_cnn`)¶

Stack of 2D convolutional layers with optional normalization, dropout, and down-sampling pooling layers, followed by an optional stack of fully connected layers.

Convolutional Stack Encoder takes the following optional parameters:

encoder:
    type: stacked_cnn
    conv_dropout: 0.0
    output_size: 128
    num_conv_layers: null
    out_channels: 32
    conv_norm: null
    fc_norm: null
    fc_norm_params: null
    conv_activation: relu
    kernel_size: 3
    stride: 1
    padding_mode: zeros
    padding: valid
    dilation: 1
    groups: 1
    pool_function: max
    pool_kernel_size: 2
    pool_stride: null
    pool_padding: 0
    pool_dilation: 1
    conv_norm_params: null
    conv_layers: null
    fc_dropout: 0.0
    fc_activation: relu
    fc_use_bias: true
    fc_bias_initializer: zeros
    fc_weights_initializer: xavier_uniform
    num_fc_layers: 1
    fc_layers: null
    num_channels: null
    conv_use_bias: true

Parameters:

conv_dropout (default: 0.0) : Dropout rate
output_size (default: 128) : If output_size is not already specified in fc_layers this is the default output_size that will be used for each layer. It indicates the size of the output of a fully connected layer.
num_conv_layers (default: null) : Number of convolutional layers to use in the encoder.
out_channels (default: 32): Indicates the number of filters, and by consequence the output channels of the 2d convolution. If out_channels is not already specified in conv_layers this is the default out_channels that will be used for each layer.
conv_norm (default: null): If a norm is not already specified in conv_layers this is the default norm that will be used for each layer. It indicates the normalization applied to the activations and can be null, batch or layer. Options: batch, layer, null.
fc_norm (default: null): If a norm is not already specified in fc_layers this is the default norm that will be used for each layer. It indicates the norm of the output and can be null, batch or layer. Options: batch, layer, null.
fc_norm_params (default: null): Parameters used if norm is either batch or layer. For information on parameters used with batch see Torch's documentation on batch normalization or for layer see Torch's documentation on layer normalization.
conv_activation (default: relu): If an activation is not already specified in conv_layers this is the default activation that will be used for each layer. It indicates the activation function applied to the output. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
kernel_size (default: 3): An integer or pair of integers specifying the kernel size. A single integer specifies a square kernel, while a pair of integers specifies the height and width of the kernel in that order (h, w). If a kernel_size is not specified in conv_layers this kernel_size that will be used for each layer.
stride (default: 1): An integer or pair of integers specifying the stride of the convolution along the height and width. If a stride is not already specified in conv_layers, specifies the default stride of the 2D convolutional kernel that will be used for each layer.
padding_mode (default: zeros): If padding_mode is not already specified in conv_layers, specifies the default padding_mode of the 2D convolutional kernel that will be used for each layer. Options: zeros, reflect, replicate, circular.
padding (default: valid): An int, pair of ints (h, w), or one of ['valid', 'same'] specifying the padding used forconvolution kernels.
dilation (default: 1): An int or pair of ints specifying the dilation rate to use for dilated convolution. If dilation is not already specified in conv_layers, specifies the default dilation of the 2D convolutional kernel that will be used for each layer.
groups (default: 1): Groups controls the connectivity between convolution inputs and outputs. When groups = 1, each output channel depends on every input channel. When groups > 1, input and output channels are divided into groups separate groups, where each output channel depends only on the inputs in its respective input channel group. in_channels and out_channels must both be divisible by groups.
pool_function (default: max): Pooling function to use. Options: max, average, avg, mean.
pool_kernel_size (default: 2): An integer or pair of integers specifying the pooling size. If pool_kernel_size is not specified in conv_layers this is the default value that will be used for each layer.
pool_stride (default: null): An integer or pair of integers specifying the pooling stride, which is the factor by which the pooling layer downsamples the feature map. Defaults to pool_kernel_size.
pool_padding (default: 0): An integer or pair of ints specifying pooling padding (h, w).
pool_dilation (default: 1): An integer or pair of ints specifying pooling dilation rate (h, w).
conv_norm_params (default: null): Parameters used if conv_norm is either batch or layer.
conv_layers (default: null): List of convolutional layers to use in the encoder.
fc_dropout (default: 0.0): Dropout rate
fc_activation (default: relu): If an activation is not already specified in fc_layers this is the default activation that will be used for each layer. It indicates the activation function applied to the output. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
fc_use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
fc_bias_initializer (default: zeros): Initializer for the bias vector. Options: constant, dirac, eye, identity, kaiming_normal, kaiming_uniform, normal, ones, orthogonal, sparse, uniform, xavier_normal, xavier_uniform, zeros.
fc_weights_initializer (default: xavier_uniform): Initializer for the weights matrix. Options: constant, dirac, eye, identity, kaiming_normal, kaiming_uniform, normal, ones, orthogonal, sparse, uniform, xavier_normal, xavier_uniform, zeros.
num_fc_layers (default: 1): The number of stacked fully connected layers.
fc_layers (default: null): A list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead.
num_channels (default: null): Number of channels to use in the encoder.
conv_use_bias (default: true): If bias not already specified in conv_layers, specifies if the 2D convolutional kernel should have a bias term. Options: true, false.

MLP-Mixer Encoder¶

Encodes images using MLP-Mixer, as described in MLP-Mixer: An all-MLP Architecture for Vision. MLP-Mixer divides the image into equal-sized patches, applying fully connected layers to each patch to compute per-patch representations (tokens) and combining the representations with fully-connected mixer layers.

The MLP-Mixer Encoder takes the following optional parameters:

encoder:
    type: mlp_mixer
    dropout: 0.0
    num_layers: 8
    patch_size: 16
    num_channels: null
    embed_size: 512
    token_size: 2048
    channel_dim: 256
    avg_pool: true

Parameters:

dropout (default: 0.0) : Dropout rate.
num_layers (default: 8) : The depth of the network (the number of Mixer blocks).
patch_size (default: 16): The image patch size. Each patch is patch_size² pixels. Must evenly divide the image width and height.
num_channels (default: null): Number of channels to use in the encoder.
embed_size (default: 512): The patch embedding size, the output size of the mixer if avg_pool is true.
token_size (default: 2048): The per-patch embedding size.
channel_dim (default: 256): Number of channels in hidden layer.
avg_pool (default: true): If true, pools output over patch dimension, outputs a vector of shape (embed_size). If false, the output tensor is of shape (n_patches, embed_size), where n_patches is img_height x img_width / patch_size². Options: true, false.

TorchVision Pretrained Model Encoders¶

Twenty TorchVision pretrained image classification models are available as Ludwig image encoders. The available models are:

AlexNet
ConvNeXt
DenseNet
EfficientNet
EfficientNetV2
GoogLeNet
Inception V3
MaxVit
MNASNet
MobileNet V2
MobileNet V3
RegNet
ResNet
ResNeXt
ShuffleNet V2
SqueezeNet
SwinTransformer
VGG
VisionTransformer
Wide ResNet

See TorchVison documentation for more details.

Ludwig encoders parameters for TorchVision pretrained models:

AlexNet¶

encoder:
    type: alexnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: base): Pretrained model variant to use. Options: base.

ConvNeXt¶

encoder:
    type: convnext
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: base): Pretrained model variant to use. Options: tiny, small, base, large.

DenseNet¶

encoder:
    type: densenet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: 121

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: 121): Pretrained model variant to use. Options: 121, 161, 169, 201.

EfficientNet¶

encoder:
    type: efficientnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: b0

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: b0): Pretrained model variant to use. Options: b0, b1, b2, b3, b4, b5, b6, b7, v2_s, v2_m, v2_l.

GoogLeNet¶

encoder:
    type: googlenet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: base): Pretrained model variant to use. Options: base.

Inception V3¶

encoder:
    type: inceptionv3
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: base): Pretrained model variant to use. Options: base.

MaxVit¶

encoder:
    type: maxvit
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: t

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: t): Pretrained model variant to use. Options: t.

MNASNet¶

encoder:
    type: mnasnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: '0_5'

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: 0_5): Pretrained model variant to use. Options: 0_5, 0_75, 1_0, 1_3.

MobileNet V2¶

encoder:
    type: mobilenetv2
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: base): Pretrained model variant to use. Options: base.

MobileNet V3¶

encoder:
    type: mobilenetv3
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: small

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: small): Pretrained model variant to use. Options: small, large.

RegNet¶

encoder:
    type: regnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: x_1_6gf

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: x_1_6gf): Pretrained model variant to use. Options: x_1_6gf, x_16gf, x_32gf, x_3_2gf, x_400mf, x_800mf, x_8gf, y_128gf, y_16gf, y_1_6gf, y_32gf, y_3_2gf, y_400mf, y_800mf, y_8gf.

ResNet¶

encoder:
    type: resnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: 50

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: 50): Pretrained model variant to use. Options: 18, 34, 50, 101, 152.

ResNeXt¶

encoder:
    type: resnext
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: 50_32x4d

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: 50_32x4d): Pretrained model variant to use. Options: 50_32x4d, 101_32x8d, 101_64x4d.

ShuffleNet V2¶

encoder:
    type: shufflenet_v2
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: x0_5

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: x0_5): Pretrained model variant to use. Options: x0_5, x1_0, x1_5, x2_0.

SqueezeNet¶

encoder:
    type: squeezenet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: '1_0'

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: 1_0): Pretrained model variant to use. Options: 1_0, 1_1.

SwinTransformer¶

encoder:
    type: swin_transformer
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: t

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: t): Pretrained model variant to use. Options: t, s, b.

VGG¶

encoder:
    type: vgg
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: 11

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: 11): Pretrained model variant to use.

VisionTransformer¶

encoder:
    type: vit
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: b_16

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: b_16): Pretrained model variant to use. Options: b_16, b_32, l_16, l_32, h_14.

Wide ResNet¶

encoder:
    type: wide_resnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: '50_2'

Parameters:

use_pretrained (default: true) : Download model weights from pre-trained model. Options: true, false.
trainable (default: true) : Is the encoder trainable. Options: true, false.
model_cache_dir (default: null): Directory path to cache pretrained model weights.
model_variant (default: 50_2): Pretrained model variant to use. Options: 50_2, 101_2.

Note:

At this time Ludwig supports only the DEFAULT pretrained weights, which are the best available weights for a specific model. More details on DEFAULT weights can be found in this blog post.
Some TorchVision pretrained models consume large amounts of memory. These model_variant required more than 12GB of memory:
efficientnet_torch: b7
regnet_torch: y_128gf
vit_torch: h_14

U-Net Encoder¶

The U-Net encoder is based on U-Net: Convolutional Networks for Biomedical Image Segmentation. The encoder implements the contracting downsampling path of the U-Net stack.

U-Net Encoder takes the following optional parameters:

encoder:
    type: unet
    conv_norm: batch

Parameters:

conv_norm (default: batch): This is the default norm that will be used for each double conv layer.It can be null or batch. Options: batch, null.

Deprecated Encoders (planned to remove in v0.8)¶

Legacy ResNet Encoder¶

DEPRECATED: This encoder is deprecated and will be removed in a future release. Please use the equivalent TorchVision ResNet encoder instead.

Implements ResNet V2 as described in Identity Mappings in Deep Residual Networks.

The ResNet encoder takes the following optional parameters:

encoder:
    type: _resnet_legacy
    dropout: 0.0
    output_size: 128
    activation: relu
    norm: null
    first_pool_kernel_size: null
    first_pool_stride: null
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    num_fc_layers: 1
    fc_layers: null
    resnet_size: 50
    num_channels: null
    out_channels: 32
    kernel_size: 3
    conv_stride: 1
    batch_norm_momentum: 0.9
    batch_norm_epsilon: 0.001

Parameters:

dropout (default: 0.0) : Dropout rate
output_size (default: 128) : if output_size is not already specified in fc_layers this is the default output_size that will be used for each layer. It indicates the size of the output of a fully connected layer.
activation (default: relu): if an activation is not already specified in fc_layers this is the default activation that will be used for each layer. It indicates the activation function applied to the output. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
norm (default: null): if a norm is not already specified in fc_layers this is the default norm that will be used for each layer. It indicates the norm of the output and can be null, batch or layer. Options: batch, layer, null.
first_pool_kernel_size (default: null): Pool size to be used for the first pooling layer. If none, the first pooling layer is skipped.
first_pool_stride (default: null): Stride for first pooling layer. If null, defaults to first_pool_kernel_size.
use_bias (default: true): Whether the layer uses a bias vector. Options: true, false.
bias_initializer (default: zeros): initializer for the bias vector. Options: constant, dirac, eye, identity, kaiming_normal, kaiming_uniform, normal, ones, orthogonal, sparse, uniform, xavier_normal, xavier_uniform, zeros.
weights_initializer (default: xavier_uniform): Initializer for the weights matrix. Options: constant, dirac, eye, identity, kaiming_normal, kaiming_uniform, normal, ones, orthogonal, sparse, uniform, xavier_normal, xavier_uniform, zeros.
norm_params (default: null): parameters used if norm is either batch or layer. For information on parameters used with batch see Torch's documentation on batch normalization or for layer see Torch's documentation on layer normalization.
num_fc_layers (default: 1): The number of stacked fully connected layers.
fc_layers (default: null): A list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead.
resnet_size (default: 50): The size of the ResNet model to use.
num_channels (default: null): Number of channels to use in the encoder.
out_channels (default: 32): Indicates the number of filters, and by consequence the output channels of the 2d convolution. If out_channels is not already specified in conv_layers this is the default out_channels that will be used for each layer.
kernel_size (default: 3): An integer or pair of integers specifying the kernel size. A single integer specifies a square kernel, while a pair of integers specifies the height and width of the kernel in that order (h, w). If a kernel_size is not specified in conv_layers this kernel_size that will be used for each layer.
conv_stride (default: 1): An integer or pair of integers specifying the stride of the initial convolutional layer.
batch_norm_momentum (default: 0.9): Momentum of the batch norm running statistics.
batch_norm_epsilon (default: 0.001): Epsilon of the batch norm.

Legacy Vision Transformer Encoder¶

DEPRECATED: This encoder is deprecated and will be removed in a future release. Please use the equivalent TorchVision VisionTransformer encoder instead.

Encodes images using a Vision Transformer as described in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

Vision Transformer divides the image into equal-sized patches, uses a linear transformation to encode each flattened patch, then applies a deep transformer architecture to the sequence of encoded patches.

The Vision Transformer Encoder takes the following optional parameters:

encoder:
    type: _vit_legacy
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    trainable: true
    use_pretrained: true
    pretrained_model: google/vit-base-patch16-224
    hidden_size: 768
    hidden_act: gelu
    patch_size: 16
    initializer_range: 0.02
    num_hidden_layers: 12
    num_attention_heads: 12
    intermediate_size: 3072
    layer_norm_eps: 1.0e-12
    gradient_checkpointing: false

Parameters:

hidden_dropout_prob (default: 0.1) : The dropout rate for all fully connected layers in the embeddings, encoder, and pooling.
attention_probs_dropout_prob (default: 0.1) : The dropout rate for the attention probabilities.
trainable (default: true) : Is the encoder trainable. Options: true, false.
use_pretrained (default: true) : Use pre-trained model weights from Hugging Face. Options: true, false.
pretrained_model (default: google/vit-base-patch16-224) : The name of the pre-trained model to use.
hidden_size (default: 768): Dimensionality of the encoder layers and the pooling layer.
hidden_act (default: gelu): Hidden layer activation, one of gelu, relu, selu or gelu_new. Options: relu, gelu, selu, gelu_new.
patch_size (default: 16): The image patch size. Each patch is patch_size² pixels. Must evenly divide the image width and height.
initializer_range (default: 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
num_hidden_layers (default: 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (default: 12): Number of attention heads in each attention layer.
intermediate_size (default: 3072): Dimensionality of the intermediate (i.e., feed-forward) layer in the Transformer encoder.
layer_norm_eps (default: 1e-12): The epsilon used by the layer normalization layers.
gradient_checkpointing (default: false): Options: true, false.

Image Augmentation¶

Image augmentation is a technique used to increase the diversity of a training dataset by applying random transformations to the images. The goal is to train a model that is robust to the variations in the training data.

Augmentation is specified by the augmentation section in the image feature configuration and can be specified in one of the following ways:

Boolean: False (Default) No augmentation is applied to the images.

augmentation: False

Boolean: True The following augmentation methods are applied to the images: random_horizontal_flip and random_rotate.

augmentation: True

List of Augmentation Methods One or more of the following augmentation methods are applied to the images in the order specified by the user: random_horizontal_flip, random_vertical_flip, random_rotate, random_blur, random_brightness, and random_contrast. The following is an illustrative example.

augmentation:
    - type: random_horizontal_flip
    - type: random_vertical_flip
    - type: random_rotate
      degree: 10
    - type: random_blur
      kernel_size: 3
    - type: random_brightness
      min: 0.5
      max: 2.0
    - type: random_contrast
      min: 0.5
      max: 2.0

Augmentation is applied to the batch of images in the training set only. The validation and test sets are not augmented.

Following illustrates how augmentation affects an image:

Original Image

Horizontal Flip: Image is randomly flipped horizontally.

type: random_horizontal_flip

Horizontal Flip

Vertical Flip: Image is randomly flipped vertically.

type: random_vertical_flip

Vertical Flip

Rotate: Image is randomly rotated by an amount in the range [-degree, +degree]. degree must be a positive integer.

type: random_rotate
degree: 15

Parameters:

degree (default: 15): Range of angle for random rotation, i.e., [-degree, +degree].

Following shows the effect of rotating an image:

Rotate Image

Blur: Image is randomly blurred using a Gaussian filter with kernel size specified by the user. The kernel_size must be a positive, odd integer.

type: random_blur
kernel_size: 3

Parameters:

kernel_size (default: 3): Kernel size for random blur.

Following shows the effect of blurring an image with various kernel sizes:

Blur Image

Adjust Brightness: Image brightness is adjusted by a factor randomly selected in the range [min, max]. Both min and max must be a float greater than 0, with min less than max.

type: random_brightness
min: 0.5
max: 2.0

Parameters:

min (default: 0.5): Minimum factor for random brightness.
max (default: 2.0): Maximum factor for random brightness.

Following shows the effect of brightness adjustment with various factors:

Adjust Brightness

Adjust Contrast: Image contrast is adjusted by a factor randomly selected in the range [min, max]. Both min and max must be a float greater than 0, with min less than max.

type: random_contrast
min: 0.5
max: 2.0

Parameters:

min (default: 0.5): Minimum factor for random brightness.
max (default: 2.0): Maximum factor for random brightness.

Following shows the effect of contrast adjustment with various factors:

Adjust Contrast

Illustrative Examples of Image Feature Configuration with Augmentation

name: image_column_name
type: image
encoder: 
    type: resnet
    model_variant: 18
    use_pretrained: true
    pretrained_cache_dir: None
    trainable: true
augmentation: false

name: image_column_name
type: image
encoder: 
    type: stacked_cnn
augmentation: true

name: image_column_name
type: image
encoder: 
    type: alexnet
augmentation:
    - type: random_horizontal_flip
    - type: random_rotate
      degree: 10
    - type: random_blur
      kernel_size: 3
    - type: random_brightness
      min: 0.5
      max: 2.0
    - type: random_contrast
      min: 0.5
      max: 2.0
    - type: random_vertical_flip

Output Features¶

Image features can be used when semantic segmentation needs to be performed. There is only one decoder available for image features: unet.

Example image output feature using default parameters:

name: image_column_name
type: image
reduce_input: sum
dependencies: []
reduce_dependencies: sum
loss:
    type: softmax_cross_entropy
decoder:
    type: unet

Parameters:

reduce_input (default sum): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension).
dependencies (default []): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies.
reduce_dependencies (default sum): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: sum, mean or avg, max, concat (concatenates along the first dimension), last (returns the last vector of the first dimension).
loss (default {type: softmax_cross_entropy}): is a dictionary containing a loss type. softmax_cross_entropy is the only supported loss type for image output features. See Loss for details.
decoder (default: {"type": "unet"}): Decoder for the desired task. Options: unet. See Decoder for details.

Decoders¶

U-Net Decoder¶

The U-Net decoder is based on U-Net: Convolutional Networks for Biomedical Image Segmentation. The decoder implements the expansive upsampling path of the U-Net stack. Semantic segmentation supports one input and one output feature. The num_fc_layers in the decoder and combiner sections must be set to 0 as U-Net does not have any fully connected layers.

U-Net Decoder takes the following optional parameters:

decoder:
    type: unet
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    fc_activation: relu
    conv_norm: batch
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null

Parameters:

num_fc_layers (default: 0) : Number of fully-connected layers if fc_layers not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
fc_output_size (default: 256) : Output size of fully connected stack.
fc_norm (default: null) : Default normalization applied at the beginnging of fully connected layers. Options: batch, layer, ghost, null.
fc_dropout (default: 0.0) : Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
fc_activation (default: relu): Default activation function applied to the output of the fully connected layers. Options: elu, leakyRelu, logSigmoid, relu, sigmoid, tanh, softmax, null.
conv_norm (default: batch): This is the default norm that will be used for each double conv layer.It can be null or batch. Options: batch, null.
fc_layers (default: null): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: activation, dropout, norm, norm_params, output_size, use_bias, bias_initializer and weights_initializer. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
fc_use_bias (default: true): Whether the layer uses a bias vector in the fc_stack. Options: true, false.
fc_weights_initializer (default: xavier_uniform): The weights initializer to use for the layers in the fc_stack
fc_bias_initializer (default: zeros): The bias initializer to use for the layers in the fc_stack
fc_norm_params (default: null): Default parameters passed to the norm module.

Decoder type and decoder parameters can also be defined once and applied to all image output features using the Type-Global Decoder section.

Loss¶

Softmax Cross Entropy¶

loss:
    type: softmax_cross_entropy
    class_weights: null
    weight: 1.0
    robust_lambda: 0
    confidence_penalty: 0
    class_similarities: null
    class_similarities_temperature: 0

Parameters:

class_weights (default: null) : Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied to the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like {class_a: 0.5, class_b: 0.7, ...}.
weight (default: 1.0): Weight of the loss.
robust_lambda (default: 0): Replaces the loss with (1 - robust_lambda) * loss + robust_lambda / c where c is the number of classes. Useful in case of noisy labels.
confidence_penalty (default: 0): Penalizes overconfident predictions (low entropy) by adding an additional term that penalizes too confident predictions by adding a a * (max_entropy - entropy) / max_entropy term to the loss, where a is the value of this parameter. Useful in case of noisy labels.
class_similarities (default: null): If not null it is a c x c matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if class_similarities_temperature is greater than 0. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the <UNK> class needs to be included too).
class_similarities_temperature (default: 0): The temperature parameter of the softmax that is performed on each row of class_similarities. The output of that softmax is used to determine the supervision vector to provide instead of the one hot vector that would be provided otherwise for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.

Loss and loss related parameters can also be defined once and applied to all image output features using the Type-Global Loss section.

Metrics¶

The measures that are calculated every epoch and are available for image features are the accuracy and loss. You can set either of them as validation_metric in the training section of the configuration if you set the validation_field to be the name of a category feature.

⇅ Image Features

Supported Image Formats¶

Preprocessing¶

Input Features¶

Encoders¶

Convolutional Stack Encoder (stacked_cnn)¶

MLP-Mixer Encoder¶

TorchVision Pretrained Model Encoders¶

AlexNet¶

ConvNeXt¶

DenseNet¶

EfficientNet¶

GoogLeNet¶

Inception V3¶

MaxVit¶

MNASNet¶

MobileNet V2¶

MobileNet V3¶

RegNet¶

ResNet¶

ResNeXt¶

ShuffleNet V2¶

SqueezeNet¶

SwinTransformer¶

VGG¶

VisionTransformer¶

Wide ResNet¶

U-Net Encoder¶

Deprecated Encoders (planned to remove in v0.8)¶

Legacy ResNet Encoder¶

Legacy Vision Transformer Encoder¶

Image Augmentation¶

Output Features¶

Decoders¶

U-Net Decoder¶

Loss¶

Softmax Cross Entropy¶

Metrics¶

Convolutional Stack Encoder (`stacked_cnn`)¶