Bag Features
Preprocessing
Bag features are expected to be provided as a string of elements separated by whitespace, e.g. "elem5 elem0 elem5 elem1". Bags are similar to set features, the only difference being that elements may appear multiple times. The bag feature encoder outputs a matrix, similar to a set encoder, except that each entry of the matrix is a float value representing the frequency of the respective element in the bag. Embeddings are aggregated by summation, weighted by the frequency of each element.
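As a minimal sketch of what this preprocessing produces (the function name and vocabulary here are ours, not Ludwig's), a whitespace-tokenized cell becomes a frequency vector over the vocabulary:

```python
from collections import Counter

def bag_to_frequency_vector(cell, vocab):
    # Split the raw string on whitespace (the default `space` tokenizer)
    # and count how often each element occurs in the bag.
    counts = Counter(cell.split())
    # One float per vocabulary element: its frequency in this bag.
    return [float(counts[tok]) for tok in vocab]

vocab = ["elem0", "elem1", "elem5"]
vec = bag_to_frequency_vector("elem5 elem0 elem5 elem1", vocab)
# vec == [1.0, 1.0, 2.0]
```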
```yaml
preprocessing:
    tokenizer: space
    missing_value_strategy: fill_with_const
    fill_value: <UNK>
    lowercase: false
    most_common: 10000
```
Parameters:

- `tokenizer` (default: `space`): Defines how to transform the raw text content of the dataset column into a set of elements. The default value `space` splits the string on spaces. Common options include: `underscore` (splits on underscores), `comma` (splits on commas), `json` (decodes the string into a set or a list through a JSON parser). Options: `space`, `space_punct`, `ngram`, `characters`, `underscore`, `comma`, `untokenized`, `stripped`, `english_tokenize`, `english_tokenize_filter`, `english_tokenize_remove_stopwords`, `english_lemmatize`, `english_lemmatize_filter`, `english_lemmatize_remove_stopwords`, `italian_tokenize`, `italian_tokenize_filter`, `italian_tokenize_remove_stopwords`, `italian_lemmatize`, `italian_lemmatize_filter`, `italian_lemmatize_remove_stopwords`, `spanish_tokenize`, `spanish_tokenize_filter`, `spanish_tokenize_remove_stopwords`, `spanish_lemmatize`, `spanish_lemmatize_filter`, `spanish_lemmatize_remove_stopwords`, `german_tokenize`, `german_tokenize_filter`, `german_tokenize_remove_stopwords`, `german_lemmatize`, `german_lemmatize_filter`, `german_lemmatize_remove_stopwords`, `french_tokenize`, `french_tokenize_filter`, `french_tokenize_remove_stopwords`, `french_lemmatize`, `french_lemmatize_filter`, `french_lemmatize_remove_stopwords`, `portuguese_tokenize`, `portuguese_tokenize_filter`, `portuguese_tokenize_remove_stopwords`, `portuguese_lemmatize`, `portuguese_lemmatize_filter`, `portuguese_lemmatize_remove_stopwords`, `dutch_tokenize`, `dutch_tokenize_filter`, `dutch_tokenize_remove_stopwords`, `dutch_lemmatize`, `dutch_lemmatize_filter`, `dutch_lemmatize_remove_stopwords`, `greek_tokenize`, `greek_tokenize_filter`, `greek_tokenize_remove_stopwords`, `greek_lemmatize`, `greek_lemmatize_filter`, `greek_lemmatize_remove_stopwords`, `norwegian_tokenize`, `norwegian_tokenize_filter`, `norwegian_tokenize_remove_stopwords`, `norwegian_lemmatize`, `norwegian_lemmatize_filter`, `norwegian_lemmatize_remove_stopwords`, `lithuanian_tokenize`, `lithuanian_tokenize_filter`, `lithuanian_tokenize_remove_stopwords`, `lithuanian_lemmatize`, `lithuanian_lemmatize_filter`, `lithuanian_lemmatize_remove_stopwords`, `danish_tokenize`, `danish_tokenize_filter`, `danish_tokenize_remove_stopwords`, `danish_lemmatize`, `danish_lemmatize_filter`, `danish_lemmatize_remove_stopwords`, `polish_tokenize`, `polish_tokenize_filter`, `polish_tokenize_remove_stopwords`, `polish_lemmatize`, `polish_lemmatize_filter`, `polish_lemmatize_remove_stopwords`, `romanian_tokenize`, `romanian_tokenize_filter`, `romanian_tokenize_remove_stopwords`, `romanian_lemmatize`, `romanian_lemmatize_filter`, `romanian_lemmatize_remove_stopwords`, `japanese_tokenize`, `japanese_tokenize_filter`, `japanese_tokenize_remove_stopwords`, `japanese_lemmatize`, `japanese_lemmatize_filter`, `japanese_lemmatize_remove_stopwords`, `chinese_tokenize`, `chinese_tokenize_filter`, `chinese_tokenize_remove_stopwords`, `chinese_lemmatize`, `chinese_lemmatize_filter`, `chinese_lemmatize_remove_stopwords`, `multi_tokenize`, `multi_tokenize_filter`, `multi_tokenize_remove_stopwords`, `multi_lemmatize`, `multi_lemmatize_filter`, `multi_lemmatize_remove_stopwords`, `sentencepiece`, `clip`, `gpt2bpe`, `bert`, `hf_tokenizer`.
- `missing_value_strategy` (default: `fill_with_const`): What strategy to follow when there is a missing value in a bag column. Options: `fill_with_const`, `fill_with_mode`, `bfill`, `ffill`, `drop_row`. See Missing Value Strategy for details.
- `fill_value` (default: `<UNK>`): The value to replace missing values with when `missing_value_strategy` is `fill_with_const`.
- `lowercase` (default: `false`): If true, converts the string to lowercase before tokenizing. Options: `true`, `false`.
- `most_common` (default: `10000`): The maximum number of most common tokens to be considered. If the data contains more than this number, the most infrequent tokens will be treated as unknown.
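The effect of `most_common` can be sketched as follows (an illustrative approximation, not Ludwig's actual implementation; the function name is ours): only the top-k tokens by frequency enter the vocabulary, and everything else maps to the unknown token.

```python
from collections import Counter

def build_vocab(cells, most_common=10000):
    # Count token frequencies across all cells of the column,
    # splitting on whitespace (the default `space` tokenizer).
    counts = Counter()
    for cell in cells:
        counts.update(cell.split())
    # Keep only the `most_common` most frequent tokens; the rest
    # would be mapped to <UNK> at encoding time.
    return {tok for tok, _ in counts.most_common(most_common)}

vocab = build_vocab(["elem5 elem0 elem5 elem1", "elem0 elem2"], most_common=2)
# elem5 and elem0 each occur twice, so they survive; elem1 and elem2 become unknown
```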
Input Features
Bag features have only one encoder type available: `embed`.

The encoder parameters specified at the feature level are:

- `tied` (default: `null`): name of another input feature to tie the weights of the encoder with. It needs to be the name of a feature of the same type and with the same encoder parameters.
Example bag feature entry in the input features list:

```yaml
name: bag_column_name
type: bag
tied: null
encoder:
    type: embed
```
Encoder type and encoder parameters can also be defined once and applied to all bag input features using the Type-Global Encoder section.
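For example, assuming the standard top-level `defaults` section of a Ludwig configuration, encoder settings shared by all bag features could look like this (the `embedding_size` value is illustrative):

```yaml
defaults:
    bag:
        encoder:
            type: embed
            embedding_size: 64
```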
Encoders

Embed Weighted Encoder
```mermaid
graph LR
  A["0.0\n1.0\n1.0\n0.0\n0.0\n2.0\n0.0"] --> B["0\n1\n5"];
  B --> C["emb 0\nemb 1\nemb 5"];
  C --> D["Weighted\nSum\nOperation"];
```
The embed weighted encoder first transforms the element frequency vector into sparse integer lists, which are then mapped to either dense or sparse embeddings (one-hot encodings). Finally, embeddings are aggregated as a weighted sum, where each embedding is multiplied by its respective element's frequency. Inputs are of size `b` while outputs are of size `b x h`, where `b` is the batch size and `h` is the dimensionality of the embeddings.
The parameters are the same used for set input features, except for `reduce_output`, which should not be used because the weighted sum already acts as a reducer.
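The weighted-sum aggregation described above amounts to a single matrix product between the frequency vectors and the embedding matrix. A minimal NumPy sketch (the variable names and sizes are ours, not Ludwig's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, h = 7, 4                      # h = embedding dimensionality
E = rng.normal(size=(vocab_size, h))      # dense embedding matrix, one row per element

# Batch of b = 1 bags: each row is an element-frequency vector,
# like the input column of the diagram above.
freqs = np.array([[0.0, 1.0, 1.0, 0.0, 0.0, 2.0, 0.0]])

# Weighted sum: out[k] = sum_i freqs[k, i] * E[i], shape (b, h).
out = freqs @ E
```

Because elements with zero frequency contribute nothing, this is equivalent to looking up only the embeddings of the elements present in the bag and summing them, each multiplied by its count.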
```yaml
encoder:
    type: embed
    dropout: 0.0
    embedding_size: 50
    output_size: 10
    activation: relu
    norm: null
    representation: dense
    force_embedding_size: false
    embeddings_on_cpu: false
    embeddings_trainable: true
    use_bias: true
    bias_initializer: zeros
    weights_initializer: xavier_uniform
    norm_params: null
    num_fc_layers: 0
    fc_layers: null
    pretrained_embeddings: null
```
Parameters:

- `dropout` (default: `0.0`): Dropout probability for the embedding.
- `embedding_size` (default: `50`): The maximum embedding size. The actual size will be `min(vocabulary_size, embedding_size)` for dense representations and exactly `vocabulary_size` for the sparse encoding, where `vocabulary_size` is the number of different strings appearing in the training set in the input column (plus 1 for the unknown token placeholder).
- `output_size` (default: `10`): If `output_size` is not already specified in `fc_layers`, this is the default output size that will be used for each layer. It indicates the size of the output of a fully connected layer.
- `activation` (default: `relu`): The default activation function that will be used for each layer. Options: `elu`, `leakyRelu`, `logSigmoid`, `relu`, `sigmoid`, `tanh`, `softmax`, `null`.
- `norm` (default: `null`): The default normalization that will be used for each layer. Options: `batch`, `layer`, `null`. See Normalization for details.
- `representation` (default: `dense`): The representation of the embedding, either dense or sparse. Options: `dense`, `sparse`.
- `force_embedding_size` (default: `false`): Force the embedding size to be equal to the vocabulary size. This parameter has effect only if `representation` is `dense`. Options: `true`, `false`.
- `embeddings_on_cpu` (default: `false`): By default embedding matrices are stored in GPU memory if a GPU is used, as this allows faster access, but in some cases the embedding matrix may be too large. This parameter forces the placement of the embedding matrix in regular memory and uses the CPU for embedding lookup, slightly slowing down the process as a result of data transfer between CPU and GPU memory. Options: `true`, `false`.
- `embeddings_trainable` (default: `true`): If `true`, embeddings are trained during the training process; if `false`, embeddings are fixed. This may be useful when loading pretrained embeddings, to avoid fine-tuning them. This parameter has effect only when `representation` is `dense`, as sparse one-hot encodings are not trainable. Options: `true`, `false`.
- `use_bias` (default: `true`): Whether the layer uses a bias vector. Options: `true`, `false`.
- `bias_initializer` (default: `zeros`): Initializer to use for the bias vector. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `weights_initializer` (default: `xavier_uniform`): Initializer to use for the weights matrix. Options: `uniform`, `normal`, `constant`, `ones`, `zeros`, `eye`, `dirac`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`, `kaiming_normal`, `orthogonal`, `sparse`, `identity`.
- `norm_params` (default: `null`): Parameters used if `norm` is either `batch` or `layer`.
- `num_fc_layers` (default: `0`): The number of stacked fully connected layers that the input to the feature passes through. Their output is projected into the feature's output space.
- `fc_layers` (default: `null`): List of dictionaries containing the parameters of each fully connected layer.
- `pretrained_embeddings` (default: `null`): By default dense embeddings are initialized randomly, but this parameter allows specifying a path to a file containing embeddings in the GloVe format. When the embeddings file is loaded, only the embeddings with labels present in the vocabulary are kept; the others are discarded. If the vocabulary contains strings that have no match in the embeddings file, their embeddings are initialized with the average of all the other embeddings plus some random noise to make them different from each other. This parameter has effect only if `representation` is `dense`.
Output Features
Bag types are not supported as output features at this time.