# Image Features
Input image features are transformed into float-valued tensors of size `N x C x H x W` (where `N` is the size of the dataset, `C` is the number of channels, and `H x W` are the height and width of the image, which can be specified by the user). These tensors are added to HDF5 with a key that reflects the name of the column in the dataset.
The column name is added to the JSON file, with an associated dictionary containing preprocessing information, including the resizing dimensions.
## Supported Image Formats
The number of channels in the image is determined by the image format. The following table lists the supported image formats and the number of channels.
| Format | Number of channels |
| --- | --- |
| Grayscale | 1 |
| Grayscale with Alpha | 2 |
| RGB | 3 |
| RGB with Alpha | 4 |
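
For example, to force images to be read as single-channel grayscale regardless of their on-disk format, you can set the `num_channels` preprocessing parameter documented below (a minimal sketch; the column name is illustrative):

```yaml
name: image_column_name
type: image
preprocessing:
    num_channels: 1
```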
## Preprocessing
During preprocessing, raw image files are transformed into NumPy arrays and saved in HDF5 format.
Note

Images passed to an image encoder are expected to have the same size. If images are different sizes, by default they will be resized to the dimensions of the first image in the dataset. Optionally, a `resize_method` together with a target `width` and `height` can be specified in the feature preprocessing parameters, in which case all images will be resized to the specified target size.
```yaml
preprocessing:
    missing_value_strategy: bfill
    fill_value: null
    height: null
    width: null
    num_channels: null
    num_processes: 1
    num_classes: null
    resize_method: interpolate
    infer_image_num_channels: true
    infer_image_dimensions: true
    infer_image_max_height: 256
    infer_image_max_width: 256
    infer_image_sample_size: 100
    standardize_image: null
    in_memory: true
    requires_equal_dimensions: false
    infer_image_num_classes: false
```
Parameters:

- `missing_value_strategy` (default: `bfill`): What strategy to follow when there's a missing value in an image column. Options: `fill_with_const`, `fill_with_mode`, `bfill`, `ffill`, `drop_row`.
- `fill_value` (default: `null`): The value to replace missing values with when `missing_value_strategy` is `fill_with_const`.
- `height` (default: `null`): The image height in pixels. If this parameter is set, images will be resized to the specified height using the `resize_method` parameter. If `null`, images will be resized to the size of the first image in the dataset.
- `width` (default: `null`): The image width in pixels. If this parameter is set, images will be resized to the specified width using the `resize_method` parameter. If `null`, images will be resized to the size of the first image in the dataset.
- `num_channels` (default: `null`): Number of channels in the images. If specified, images will be read in the mode specified by the number of channels. If not specified, the number of channels will be inferred from the image format of the first valid image in the dataset.
- `num_processes` (default: `1`): Specifies the number of processes to run for preprocessing images.
- `num_classes` (default: `null`): Number of channel classes in the images. If specified, this value will be validated against the inferred number of classes. Use 2 to convert grayscale images to binary images.
- `resize_method` (default: `interpolate`): The method to use for resizing images. Options: `crop_or_pad`, `interpolate`.
- `infer_image_num_channels` (default: `true`): If true, then the number of channels in the dataset is inferred from a sample of the first image in the dataset. Options: `true`, `false`.
- `infer_image_dimensions` (default: `true`): If true, then the height and width of images in the dataset will be inferred from a sample of the first image in the dataset. Each image that doesn't conform to these dimensions will be resized according to `resize_method`. If set to false, then the height and width of images in the dataset will be specified by the user. Options: `true`, `false`.
- `infer_image_max_height` (default: `256`): If `infer_image_dimensions` is set, this is used as the maximum height of the images in the dataset.
- `infer_image_max_width` (default: `256`): If `infer_image_dimensions` is set, this is used as the maximum width of the images in the dataset.
- `infer_image_sample_size` (default: `100`): The sample size used for inferring dimensions of images in `infer_image_dimensions`.
- `standardize_image` (default: `null`): Standardize image by per-channel mean centering and standard deviation scaling. Options: `imagenet1k`, `null`.
- `in_memory` (default: `true`): Defines whether the image dataset will reside in memory during the training process or will be dynamically fetched from disk (useful for large datasets). In the latter case a training batch of input images will be fetched from disk each training iteration. Options: `true`, `false`.
- `requires_equal_dimensions` (default: `false`): If true, then width and height must be equal. Options: `true`, `false`.
- `infer_image_num_classes` (default: `false`): If true, then the number of channel classes in the dataset will be inferred from a sample of the first image in the dataset. Each unique channel value will be mapped to a class and preprocessing will create a masked image based on the channel classes. Options: `true`, `false`.
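
For example, to resize every image to 128 x 128 with crop-or-pad resizing (a minimal sketch; the column name and target size are illustrative):

```yaml
name: image_column_name
type: image
preprocessing:
    height: 128
    width: 128
    resize_method: crop_or_pad
```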
Preprocessing parameters can also be defined once and applied to all image input features using the Type-Global Preprocessing section.
## Input Features
The encoder parameters specified at the feature level are:

- `tied` (default: `null`): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.
- `augmentation` (default: `false`): specifies image data augmentation operations to generate synthetic training data. More details can be found in the Image Augmentation section below.
Example image feature entry in the input features list:

```yaml
name: image_column_name
type: image
tied: null
encoder:
    type: stacked_cnn
```
The available encoder parameters are:

- `type` (default: `stacked_cnn`): the possible values are `stacked_cnn`, `resnet`, `mlp_mixer`, `vit`, and TorchVision Pretrained Image Classification models.
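
For example, a feature entry that swaps the default encoder for a pretrained TorchVision ResNet (a minimal sketch using the `resnet` parameters documented below):

```yaml
name: image_column_name
type: image
encoder:
    type: resnet
    model_variant: 18
    use_pretrained: true
```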
Encoder type and encoder parameters can also be defined once and applied to all image input features using the Type-Global Encoder section.
## Encoders
### Convolutional Stack Encoder (`stacked_cnn`)
Stack of 2D convolutional layers with optional normalization, dropout, and down-sampling pooling layers, followed by an optional stack of fully connected layers.
The convolutional stack encoder takes the following optional parameters:

```yaml
encoder:
    type: stacked_cnn
    conv_dropout: 0.0
    output_size: 128
    num_conv_layers: null
    out_channels: 32
    conv_norm: null
    fc_norm: null
    fc_norm_params: null
    conv_activation: relu
    kernel_size: 3
    stride: 1
    padding_mode: zeros
    padding: valid
    dilation: 1
    groups: 1
    pool_function: max
    pool_kernel_size: 2
    pool_stride: null
    pool_padding: 0
    pool_dilation: 1
    conv_norm_params: null
    conv_layers: null
    fc_dropout: 0.0
    fc_activation: relu
    fc_use_bias: true
    fc_bias_initializer: zeros
    fc_weights_initializer: xavier_uniform
    num_fc_layers: 1
    fc_layers: null
    num_channels: null
    conv_use_bias: true
```
Parameters:

- `conv_dropout` (default: `0.0`): Dropout rate.
- `output_size` (default: `128`): If `output_size` is not already specified in `fc_layers`, this is the default `output_size` that will be used for each layer. It indicates the size of the output of a fully connected layer.
- `num_conv_layers` (default: `null`): Number of convolutional layers to use in the encoder.
- `out_channels` (default: `32`): Indicates the number of filters, and by consequence the output channels of the 2D convolution. If `out_channels` is not already specified in `conv_layers`, this is the default `out_channels` that will be used for each layer.
- `conv_norm` (default: `null`): If a `norm` is not already specified in `conv_layers`, this is the default `norm` that will be used for each layer. It indicates the normalization applied to the activations and can be `null`, `batch` or `layer`. Options: `batch`, `layer`, `null`.
- `fc_norm` (default: `null`): If a `norm` is not already specified in `fc_layers`, this is the default `norm` that will be used for each layer. It indicates the norm of the output and can be `null`, `batch` or `layer`. Options: `batch`, `layer`, `null`.
- `fc_norm_params` (default: `null`): Parameters used if `norm` is either `batch` or `layer`. For information on parameters used with `batch` see Torch's documentation on batch normalization, or for `layer` see Torch's documentation on layer normalization.
- `conv_activation` (default: `relu`): If an `activation` is not already specified in `conv_layers`, this is the default `activation` that will be used for each layer. It indicates the activation function applied to the output. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `kernel_size` (default: `3`): An integer or pair of integers specifying the kernel size. A single integer specifies a square kernel, while a pair of integers specifies the height and width of the kernel in that order (h, w). If a `kernel_size` is not specified in `conv_layers`, this is the `kernel_size` that will be used for each layer.
- `stride` (default: `1`): An integer or pair of integers specifying the stride of the convolution along the height and width. If a `stride` is not already specified in `conv_layers`, specifies the default stride of the 2D convolutional kernel that will be used for each layer.
- `padding_mode` (default: `zeros`): If `padding_mode` is not already specified in `conv_layers`, specifies the default `padding_mode` of the 2D convolutional kernel that will be used for each layer. Options: `zeros`, `reflect`, `replicate`, `circular`.
- `padding` (default: `valid`): An int, pair of ints (h, w), or one of `valid` or `same` specifying the padding used for convolution kernels.
- `dilation` (default: `1`): An int or pair of ints specifying the dilation rate to use for dilated convolution. If `dilation` is not already specified in `conv_layers`, specifies the default dilation of the 2D convolutional kernel that will be used for each layer.
- `groups` (default: `1`): Groups controls the connectivity between convolution inputs and outputs. When `groups = 1`, each output channel depends on every input channel. When `groups > 1`, input and output channels are divided into `groups` separate groups, where each output channel depends only on the inputs in its respective input channel group. `in_channels` and `out_channels` must both be divisible by `groups`.
- `pool_function` (default: `max`): Pooling function to use. Options: `max`, `average`, `avg`, `mean`.
- `pool_kernel_size` (default: `2`): An integer or pair of integers specifying the pooling size. If `pool_kernel_size` is not specified in `conv_layers`, this is the default value that will be used for each layer.
- `pool_stride` (default: `null`): An integer or pair of integers specifying the pooling stride, which is the factor by which the pooling layer downsamples the feature map. Defaults to `pool_kernel_size`.
- `pool_padding` (default: `0`): An integer or pair of ints specifying pooling padding (h, w).
- `pool_dilation` (default: `1`): An integer or pair of ints specifying pooling dilation rate (h, w).
- `conv_norm_params` (default: `null`): Parameters used if `conv_norm` is either `batch` or `layer`.
- `conv_layers` (default: `null`): List of convolutional layers to use in the encoder.
- `fc_dropout` (default: `0.0`): Dropout rate.
- `fc_activation` (default: `relu`): If an `activation` is not already specified in `fc_layers`, this is the default `activation` that will be used for each layer. It indicates the activation function applied to the output. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `fc_use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `fc_bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `constant`, `dirac`, `eye`, `identity`, `kaiming_normal`, `kaiming_uniform`, `normal`, `ones`, `orthogonal`, `sparse`, `uniform`, `xavier_normal`, `xavier_uniform`, `zeros`.
- `fc_weights_initializer` (default: `xavier_uniform`): Initializer for the weights matrix. Options: `constant`, `dirac`, `eye`, `identity`, `kaiming_normal`, `kaiming_uniform`, `normal`, `ones`, `orthogonal`, `sparse`, `uniform`, `xavier_normal`, `xavier_uniform`, `zeros`.
- `num_fc_layers` (default: `1`): The number of stacked fully connected layers.
- `fc_layers` (default: `null`): A list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead.
- `num_channels` (default: `null`): Number of channels to use in the encoder.
- `conv_use_bias` (default: `true`): If a `bias` is not already specified in `conv_layers`, specifies whether the 2D convolutional kernel should have a bias term. Options: `true`, `false`.
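
To make the interplay between `conv_layers` and the encoder-level defaults concrete, here is a sketch of a two-block stack. It assumes each `conv_layers` entry accepts the same per-layer keys that the defaults above describe (`out_channels`, `kernel_size`, `pool_kernel_size`); treat that key set as an assumption rather than an exhaustive list:

```yaml
encoder:
    type: stacked_cnn
    conv_layers:
        - out_channels: 32      # first block: 32 filters
          kernel_size: 3
          pool_kernel_size: 2   # halves height and width
        - out_channels: 64      # second block: double the filters
          kernel_size: 3
          pool_kernel_size: 2
    num_fc_layers: 1            # encoder-level defaults still apply to the fc stack
    output_size: 128
```

Any key omitted from a `conv_layers` entry falls back to the corresponding encoder-level default, mirroring the behavior documented above for `fc_layers`.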
### MLP-Mixer Encoder (`mlp_mixer`)
Encodes images using MLP-Mixer, as described in MLP-Mixer: An all-MLP Architecture for Vision. MLP-Mixer divides the image into equal-sized patches, applying fully connected layers to each patch to compute per-patch representations (tokens) and combining the representations with fully-connected mixer layers.
The MLP-Mixer encoder takes the following optional parameters:

```yaml
encoder:
    type: mlp_mixer
    dropout: 0.0
    num_layers: 8
    patch_size: 16
    num_channels: null
    embed_size: 512
    token_size: 2048
    channel_dim: 256
    avg_pool: true
```
Parameters:

- `dropout` (default: `0.0`): Dropout rate.
- `num_layers` (default: `8`): The depth of the network (the number of Mixer blocks).
- `patch_size` (default: `16`): The image patch size. Each patch is `patch_size²` pixels. Must evenly divide the image width and height.
- `num_channels` (default: `null`): Number of channels to use in the encoder.
- `embed_size` (default: `512`): The patch embedding size, the output size of the mixer if `avg_pool` is true.
- `token_size` (default: `2048`): The per-patch embedding size.
- `channel_dim` (default: `256`): Number of channels in hidden layer.
- `avg_pool` (default: `true`): If true, pools output over the patch dimension, outputting a vector of shape `(embed_size)`. If false, the output tensor is of shape `(n_patches, embed_size)`, where `n_patches` is `img_height x img_width / patch_size²`. Options: `true`, `false`.
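
Because `patch_size` must evenly divide the image width and height, it is often convenient to pin the input size in preprocessing. A minimal sketch (the 224 x 224 target is illustrative):

```yaml
name: image_column_name
type: image
preprocessing:
    height: 224
    width: 224
encoder:
    type: mlp_mixer
    patch_size: 16   # 224 / 16 = 14 patches per side
```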
### TorchVision Pretrained Model Encoders
Twenty TorchVision pretrained image classification models are available as Ludwig image encoders. The available models are:
- AlexNet
- ConvNeXt
- DenseNet
- EfficientNet
- EfficientNetV2
- GoogLeNet
- Inception V3
- MaxVit
- MNASNet
- MobileNet V2
- MobileNet V3
- RegNet
- ResNet
- ResNeXt
- ShuffleNet V2
- SqueezeNet
- SwinTransformer
- VGG
- VisionTransformer
- Wide ResNet
See the TorchVision documentation for more details.

The Ludwig encoder parameters for TorchVision pretrained models are listed below.
#### AlexNet

```yaml
encoder:
    type: alexnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `base`): Pretrained model variant to use. Options: `base`.
#### ConvNeXt

```yaml
encoder:
    type: convnext
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `base`): Pretrained model variant to use. Options: `tiny`, `small`, `base`, `large`.
#### DenseNet

```yaml
encoder:
    type: densenet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: 121
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `121`): Pretrained model variant to use. Options: `121`, `161`, `169`, `201`.
#### EfficientNet

```yaml
encoder:
    type: efficientnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: b0
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `b0`): Pretrained model variant to use. Options: `b0`, `b1`, `b2`, `b3`, `b4`, `b5`, `b6`, `b7`, `v2_s`, `v2_m`, `v2_l`.
#### GoogLeNet

```yaml
encoder:
    type: googlenet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `base`): Pretrained model variant to use. Options: `base`.
#### Inception V3

```yaml
encoder:
    type: inceptionv3
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `base`): Pretrained model variant to use. Options: `base`.
#### MaxVit

```yaml
encoder:
    type: maxvit
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: t
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `t`): Pretrained model variant to use. Options: `t`.
#### MNASNet

```yaml
encoder:
    type: mnasnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: '0_5'
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `0_5`): Pretrained model variant to use. Options: `0_5`, `0_75`, `1_0`, `1_3`.
#### MobileNet V2

```yaml
encoder:
    type: mobilenetv2
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: base
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `base`): Pretrained model variant to use. Options: `base`.
#### MobileNet V3

```yaml
encoder:
    type: mobilenetv3
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: small
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `small`): Pretrained model variant to use. Options: `small`, `large`.
#### RegNet

```yaml
encoder:
    type: regnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: x_1_6gf
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `x_1_6gf`): Pretrained model variant to use. Options: `x_1_6gf`, `x_16gf`, `x_32gf`, `x_3_2gf`, `x_400mf`, `x_800mf`, `x_8gf`, `y_128gf`, `y_16gf`, `y_1_6gf`, `y_32gf`, `y_3_2gf`, `y_400mf`, `y_800mf`, `y_8gf`.
#### ResNet

```yaml
encoder:
    type: resnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: 50
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `50`): Pretrained model variant to use. Options: `18`, `34`, `50`, `101`, `152`.
#### ResNeXt

```yaml
encoder:
    type: resnext
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: 50_32x4d
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `50_32x4d`): Pretrained model variant to use. Options: `50_32x4d`, `101_32x8d`, `101_64x4d`.
#### ShuffleNet V2

```yaml
encoder:
    type: shufflenet_v2
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: x0_5
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `x0_5`): Pretrained model variant to use. Options: `x0_5`, `x1_0`, `x1_5`, `x2_0`.
#### SqueezeNet

```yaml
encoder:
    type: squeezenet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: '1_0'
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `1_0`): Pretrained model variant to use. Options: `1_0`, `1_1`.
#### SwinTransformer

```yaml
encoder:
    type: swin_transformer
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: t
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `t`): Pretrained model variant to use. Options: `t`, `s`, `b`.
#### VGG

```yaml
encoder:
    type: vgg
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: 11
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `11`): Pretrained model variant to use.
#### VisionTransformer

```yaml
encoder:
    type: vit
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: b_16
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `b_16`): Pretrained model variant to use. Options: `b_16`, `b_32`, `l_16`, `l_32`, `h_14`.
#### Wide ResNet

```yaml
encoder:
    type: wide_resnet
    use_pretrained: true
    trainable: true
    model_cache_dir: null
    model_variant: '50_2'
```

Parameters:

- `use_pretrained` (default: `true`): Download model weights from pre-trained model. Options: `true`, `false`.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `model_cache_dir` (default: `null`): Directory path to cache pretrained model weights.
- `model_variant` (default: `50_2`): Pretrained model variant to use. Options: `50_2`, `101_2`.
Note:

- At this time Ludwig supports only the `DEFAULT` pretrained weights, which are the best available weights for a specific model. More details on `DEFAULT` weights can be found in this blog post.
- Some TorchVision pretrained models consume large amounts of memory. These `model_variant`s require more than 12GB of memory:
    - `efficientnet_torch`: `b7`
    - `regnet_torch`: `y_128gf`
    - `vit_torch`: `h_14`
### U-Net Encoder (`unet`)

The U-Net encoder is based on U-Net: Convolutional Networks for Biomedical Image Segmentation. The encoder implements the contracting downsampling path of the U-Net stack.

The U-Net encoder takes the following optional parameters:

```yaml
encoder:
    type: unet
    conv_norm: batch
```

Parameters:

- `conv_norm` (default: `batch`): This is the default norm that will be used for each double conv layer. It can be `null` or `batch`. Options: `batch`, `null`.
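
For example, to drop the normalization in the double conv layers (a minimal sketch):

```yaml
name: image_column_name
type: image
encoder:
    type: unet
    conv_norm: null
```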
### Deprecated Encoders (planned to be removed in v0.8)
#### Legacy ResNet Encoder

DEPRECATED: This encoder is deprecated and will be removed in a future release. Please use the equivalent TorchVision ResNet encoder instead.
Implements ResNet V2 as described in Identity Mappings in Deep Residual Networks.
The ResNet encoder takes the following optional parameters:

```yaml
encoder:
    type: _resnet_legacy
    dropout: 0.0
    output_size: 128
    activation: relu
    norm: null
    first_pool_kernel_size: null
    first_pool_stride: null
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    num_fc_layers: 1
    fc_layers: null
    resnet_size: 50
    num_channels: null
    out_channels: 32
    kernel_size: 3
    conv_stride: 1
    batch_norm_momentum: 0.9
    batch_norm_epsilon: 0.001
```
Parameters:

- `dropout` (default: `0.0`): Dropout rate.
- `output_size` (default: `128`): If `output_size` is not already specified in `fc_layers`, this is the default `output_size` that will be used for each layer. It indicates the size of the output of a fully connected layer.
- `activation` (default: `relu`): If an `activation` is not already specified in `fc_layers`, this is the default `activation` that will be used for each layer. It indicates the activation function applied to the output. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `norm` (default: `null`): If a `norm` is not already specified in `fc_layers`, this is the default `norm` that will be used for each layer. It indicates the norm of the output and can be `null`, `batch` or `layer`. Options: `batch`, `layer`, `null`.
- `first_pool_kernel_size` (default: `null`): Pool size to be used for the first pooling layer. If `null`, the first pooling layer is skipped.
- `first_pool_stride` (default: `null`): Stride for the first pooling layer. If `null`, defaults to `first_pool_kernel_size`.
- `use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer for the bias vector. Options: `constant`, `dirac`, `eye`, `identity`, `kaiming_normal`, `kaiming_uniform`, `normal`, `ones`, `orthogonal`, `sparse`, `uniform`, `xavier_normal`, `xavier_uniform`, `zeros`.
- `weights_initializer` (default: `xavier_uniform`): Initializer for the weights matrix. Options: `constant`, `dirac`, `eye`, `identity`, `kaiming_normal`, `kaiming_uniform`, `normal`, `ones`, `orthogonal`, `sparse`, `uniform`, `xavier_normal`, `xavier_uniform`, `zeros`.
- `norm_params` (default: `null`): Parameters used if `norm` is either `batch` or `layer`. For information on parameters used with `batch` see Torch's documentation on batch normalization, or for `layer` see Torch's documentation on layer normalization.
- `num_fc_layers` (default: `1`): The number of stacked fully connected layers.
- `fc_layers` (default: `null`): A list of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one specified as a parameter of the encoder will be used instead.
- `resnet_size` (default: `50`): The size of the ResNet model to use.
- `num_channels` (default: `null`): Number of channels to use in the encoder.
- `out_channels` (default: `32`): Indicates the number of filters, and by consequence the output channels of the 2D convolution. If `out_channels` is not already specified in `conv_layers`, this is the default `out_channels` that will be used for each layer.
- `kernel_size` (default: `3`): An integer or pair of integers specifying the kernel size. A single integer specifies a square kernel, while a pair of integers specifies the height and width of the kernel in that order (h, w). If a `kernel_size` is not specified in `conv_layers`, this is the `kernel_size` that will be used for each layer.
- `conv_stride` (default: `1`): An integer or pair of integers specifying the stride of the initial convolutional layer.
- `batch_norm_momentum` (default: `0.9`): Momentum of the batch norm running statistics.
- `batch_norm_epsilon` (default: `0.001`): Epsilon of the batch norm.
#### Legacy Vision Transformer Encoder

DEPRECATED: This encoder is deprecated and will be removed in a future release. Please use the equivalent TorchVision VisionTransformer encoder instead.
Encodes images using a Vision Transformer as described in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
Vision Transformer divides the image into equal-sized patches, uses a linear transformation to encode each flattened patch, then applies a deep transformer architecture to the sequence of encoded patches.
The Vision Transformer encoder takes the following optional parameters:

```yaml
encoder:
    type: _vit_legacy
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    trainable: true
    use_pretrained: true
    pretrained_model: google/vit-base-patch16-224
    hidden_size: 768
    hidden_act: gelu
    patch_size: 16
    initializer_range: 0.02
    num_hidden_layers: 12
    num_attention_heads: 12
    intermediate_size: 3072
    layer_norm_eps: 1.0e-12
    gradient_checkpointing: false
```
Parameters:

- `hidden_dropout_prob` (default: `0.1`): The dropout rate for all fully connected layers in the embeddings, encoder, and pooling.
- `attention_probs_dropout_prob` (default: `0.1`): The dropout rate for the attention probabilities.
- `trainable` (default: `true`): Is the encoder trainable. Options: `true`, `false`.
- `use_pretrained` (default: `true`): Use pre-trained model weights from Hugging Face. Options: `true`, `false`.
- `pretrained_model` (default: `google/vit-base-patch16-224`): The name of the pre-trained model to use.
- `hidden_size` (default: `768`): Dimensionality of the encoder layers and the pooling layer.
- `hidden_act` (default: `gelu`): Hidden layer activation, one of `gelu`, `relu`, `selu` or `gelu_new`. Options: `relu`, `gelu`, `selu`, `gelu_new`.
- `patch_size` (default: `16`): The image patch size. Each patch is `patch_size²` pixels. Must evenly divide the image width and height.
- `initializer_range` (default: `0.02`): The standard deviation of the `truncated_normal_initializer` for initializing all weight matrices.
- `num_hidden_layers` (default: `12`): Number of hidden layers in the Transformer encoder.
- `num_attention_heads` (default: `12`): Number of attention heads in each attention layer.
- `intermediate_size` (default: `3072`): Dimensionality of the intermediate (i.e., feed-forward) layer in the Transformer encoder.
- `layer_norm_eps` (default: `1e-12`): The epsilon used by the layer normalization layers.
- `gradient_checkpointing` (default: `false`): Whether to use gradient checkpointing. Options: `true`, `false`.
## Image Augmentation
Image augmentation is a technique used to increase the diversity of a training dataset by applying random transformations to the images. The goal is to train a model that is robust to the variations in the training data.
Augmentation is specified by the `augmentation` section in the image feature configuration and can be specified in one of the following ways:

**Boolean: `False`** (default): No augmentation is applied to the images.

```yaml
augmentation: False
```

**Boolean: `True`**: The following augmentation methods are applied to the images: `random_horizontal_flip` and `random_rotate`.

```yaml
augmentation: True
```
**List of Augmentation Methods**: One or more of the following augmentation methods are applied to the images in the order specified by the user: `random_horizontal_flip`, `random_vertical_flip`, `random_rotate`, `random_blur`, `random_brightness`, and `random_contrast`. The following is an illustrative example.

```yaml
augmentation:
    - type: random_horizontal_flip
    - type: random_vertical_flip
    - type: random_rotate
      degree: 10
    - type: random_blur
      kernel_size: 3
    - type: random_brightness
      min: 0.5
      max: 2.0
    - type: random_contrast
      min: 0.5
      max: 2.0
```
Augmentation is applied to the batch of images in the training set only. The validation and test sets are not augmented.
The following illustrates how each augmentation method affects an image:
**Horizontal Flip**: The image is randomly flipped horizontally.

```yaml
type: random_horizontal_flip
```

**Vertical Flip**: The image is randomly flipped vertically.

```yaml
type: random_vertical_flip
```

**Rotate**: The image is randomly rotated by an amount in the range [-degree, +degree]. `degree` must be a positive integer.

```yaml
type: random_rotate
degree: 15
```

Parameters:

- `degree` (default: `15`): Range of angle for random rotation, i.e., [-degree, +degree].
**Blur**: The image is randomly blurred using a Gaussian filter with kernel size specified by the user. The `kernel_size` must be a positive, odd integer.

```yaml
type: random_blur
kernel_size: 3
```

Parameters:

- `kernel_size` (default: `3`): Kernel size for random blur.
**Adjust Brightness**: Image brightness is adjusted by a factor randomly selected in the range [min, max]. Both `min` and `max` must be floats greater than 0, with `min` less than `max`.

```yaml
type: random_brightness
min: 0.5
max: 2.0
```

Parameters:

- `min` (default: `0.5`): Minimum factor for random brightness.
- `max` (default: `2.0`): Maximum factor for random brightness.
**Adjust Contrast**: Image contrast is adjusted by a factor randomly selected in the range [min, max]. Both `min` and `max` must be floats greater than 0, with `min` less than `max`.

```yaml
type: random_contrast
min: 0.5
max: 2.0
```

Parameters:

- `min` (default: `0.5`): Minimum factor for random contrast.
- `max` (default: `2.0`): Maximum factor for random contrast.
**Illustrative Examples of Image Feature Configuration with Augmentation**

```yaml
name: image_column_name
type: image
encoder:
    type: resnet
    model_variant: 18
    use_pretrained: true
    model_cache_dir: null
    trainable: true
augmentation: false
```

```yaml
name: image_column_name
type: image
encoder:
    type: stacked_cnn
augmentation: true
```

```yaml
name: image_column_name
type: image
encoder:
    type: alexnet
augmentation:
    - type: random_horizontal_flip
    - type: random_rotate
      degree: 10
    - type: random_blur
      kernel_size: 3
    - type: random_brightness
      min: 0.5
      max: 2.0
    - type: random_contrast
      min: 0.5
      max: 2.0
    - type: random_vertical_flip
```
## Output Features
Image features can be used when semantic segmentation needs to be performed.
There is only one decoder available for image features: `unet`.
Example image output feature using default parameters:

```yaml
name: image_column_name
type: image
reduce_input: sum
dependencies: []
reduce_dependencies: sum
loss:
    type: softmax_cross_entropy
decoder:
    type: unet
```
Parameters:

- `reduce_input` (default: `sum`): defines how to reduce an input that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the first dimension), `last` (returns the last vector of the first dimension).
- `dependencies` (default: `[]`): the output features this one is dependent on. For a detailed explanation refer to Output Feature Dependencies.
- `reduce_dependencies` (default: `sum`): defines how to reduce the output of a dependent feature that is not a vector, but a matrix or a higher order tensor, on the first dimension (second if you count the batch dimension). Available values are: `sum`, `mean` or `avg`, `max`, `concat` (concatenates along the first dimension), `last` (returns the last vector of the first dimension).
- `loss` (default: `{type: softmax_cross_entropy}`): a dictionary containing a loss `type`. `softmax_cross_entropy` is the only supported loss type for image output features. See Loss for details.
- `decoder` (default: `{"type": "unet"}`): Decoder for the desired task. Options: `unet`. See Decoder for details.
## Decoders

### U-Net Decoder
The U-Net decoder is based on U-Net: Convolutional Networks for Biomedical Image Segmentation. The decoder implements the expansive upsampling path of the U-Net stack.

Semantic segmentation supports one input and one output feature. The `num_fc_layers` in the decoder and combiner sections must be set to 0, as U-Net does not have any fully connected layers.
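
Putting these constraints together, a semantic segmentation configuration might look like the following sketch (the feature names are illustrative and the `concat` combiner is an assumption; the essential points are the `unet` encoder/decoder pairing and the zeroed `num_fc_layers`):

```yaml
input_features:
    - name: image_column_name
      type: image
      encoder:
          type: unet
output_features:
    - name: mask_column_name
      type: image
      decoder:
          type: unet
          num_fc_layers: 0   # U-Net has no fully connected layers
combiner:
    type: concat
    num_fc_layers: 0         # must also be 0 in the combiner
```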
The U-Net decoder takes the following optional parameters:

```yaml
decoder:
    type: unet
    num_fc_layers: 0
    fc_output_size: 256
    fc_norm: null
    fc_dropout: 0.0
    fc_activation: relu
    conv_norm: batch
    fc_layers: null
    fc_use_bias: true
    fc_weights_initializer: xavier_uniform
    fc_bias_initializer: zeros
    fc_norm_params: null
```
Parameters:

- `num_fc_layers` (default: `0`): Number of fully-connected layers if `fc_layers` is not specified. Increasing layers adds capacity to the model, enabling it to learn more complex feature interactions.
- `fc_output_size` (default: `256`): Output size of the fully connected stack.
- `fc_norm` (default: `null`): Default normalization applied at the beginning of fully connected layers. Options: `batch`, `layer`, `ghost`, `null`.
- `fc_dropout` (default: `0.0`): Default dropout rate applied to fully connected layers. Increasing dropout is a common form of regularization to combat overfitting. The dropout is expressed as the probability of an element to be zeroed out (0.0 means no dropout).
- `fc_activation` (default: `relu`): Default activation function applied to the output of the fully connected layers. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `conv_norm` (default: `batch`): This is the default norm that will be used for each double conv layer. It can be `null` or `batch`. Options: `batch`, `null`.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of all the fully connected layers. The length of the list determines the number of stacked fully connected layers and the content of each dictionary determines the parameters for a specific layer. The available parameters for each layer are: `activation`, `dropout`, `norm`, `norm_params`, `output_size`, `use_bias`, `bias_initializer` and `weights_initializer`. If any of those values is missing from the dictionary, the default one provided as a standalone parameter will be used instead.
- `fc_use_bias` (default: `true`): Whether the layer uses a bias vector in the fc_stack. Options: `true`, `false`.
- `fc_weights_initializer` (default: `xavier_uniform`): The weights initializer to use for the layers in the fc_stack.
- `fc_bias_initializer` (default: `zeros`): The bias initializer to use for the layers in the fc_stack.
- `fc_norm_params` (default: `null`): Default parameters passed to the `norm` module.
Decoder type and decoder parameters can also be defined once and applied to all image output features using the Type-Global Decoder section.
## Loss

### Softmax Cross Entropy
```yaml
loss:
    type: softmax_cross_entropy
    class_weights: null
    weight: 1.0
    robust_lambda: 0
    confidence_penalty: 0
    class_similarities: null
    class_similarities_temperature: 0
```
Parameters:

- `class_weights` (default: `null`): Weights to apply to each class in the loss. If not specified, all classes are weighted equally. The value can be a vector of weights, one for each class, that is multiplied to the loss of the datapoints that have that class as ground truth. It is an alternative to oversampling in case of unbalanced class distribution. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too). Alternatively, the value can be a dictionary with class strings as keys and weights as values, like `{class_a: 0.5, class_b: 0.7, ...}`.
- `weight` (default: `1.0`): Weight of the loss.
- `robust_lambda` (default: `0`): Replaces the loss with `(1 - robust_lambda) * loss + robust_lambda / c` where `c` is the number of classes. Useful in case of noisy labels.
- `confidence_penalty` (default: `0`): Penalizes overconfident predictions (low entropy) by adding an `a * (max_entropy - entropy) / max_entropy` term to the loss, where `a` is the value of this parameter. Useful in case of noisy labels.
- `class_similarities` (default: `null`): If not `null` it is a `c x c` matrix in the form of a list of lists that contains the mutual similarity of classes. It is used if `class_similarities_temperature` is greater than 0. The ordering of the vector follows the category to integer ID mapping in the JSON metadata file (the `<UNK>` class needs to be included too).
- `class_similarities_temperature` (default: `0`): The temperature parameter of the softmax that is performed on each row of `class_similarities`. The output of that softmax is used to determine the supervision vector to provide instead of the one-hot vector that would otherwise be provided for each datapoint. The intuition behind it is that errors between similar classes are more tolerable than errors between really different classes.
Loss and loss related parameters can also be defined once and applied to all image output features using the Type-Global Loss section.
## Metrics
The metrics that are calculated every epoch and are available for image features are `accuracy` and `loss`. You can set either of them as `validation_metric` in the `trainer` section of the configuration if you set the `validation_field` to be the name of the image output feature.
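
For example, to use the segmentation loss for model selection (a minimal sketch, assuming the output feature is named `image_column_name`):

```yaml
trainer:
    validation_field: image_column_name
    validation_metric: loss
```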