Code Intelligence
This example shows how to apply Ludwig to code intelligence tasks — using source code as text input to train classifiers, defect detectors, and documentation generators.
Code intelligence is a broad family of tasks that treat programs as structured text: detecting bugs before they reach production, generating docstrings from function bodies, searching a codebase by natural language description, predicting whether a code change introduces a vulnerability, or classifying code snippets by programming language. Because code has rich syntactic and semantic structure, pretrained code language models (CodeBERT, GraphCodeBERT, StarCoder) provide a large advantage over general-purpose text encoders.
We'll use the Code Defect Detection dataset, which contains 21,854 C/C++ functions labeled as either clean or defective. Functions were extracted from real open-source projects and the labels come from git history: functions that were later fixed in a security-related commit are labeled defective. Detecting such functions automatically can dramatically reduce the cost of code review.
Dataset¶
| column | type | description |
|---|---|---|
| func | text | Source code of the C/C++ function (one function per row) |
| target | binary | 1 = function contains a defect, 0 = function is clean |
Sample rows:
| func (truncated) | target |
|---|---|
| `static int foo(char *buf, int len) { if (len > MAX) { memcpy(dst, buf, len); } }` | 1 |
| `int clamp(int val, int lo, int hi) { return val < lo ? lo : val > hi ? hi : val; }` | 0 |
| `void process(uint8_t *data, size_t n) { for (size_t i = 0; i <= n; i++) { handle(data[i]); } }` | 1 |
Download Dataset¶
Downloads the dataset and writes code_defect_detection.csv to the current directory.
ludwig datasets download code_defect_detection
Loads the Code Defect Detection dataset into pandas DataFrames.
from ludwig.datasets import code_defect_detection
# Loads the dataset as a pandas.DataFrame
train_df, test_df, val_df = code_defect_detection.load()
The loaded DataFrame contains the above columns plus a split column (0 = train, 1 = test, 2 = validation).
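If you work with a single combined DataFrame instead, the split column lets you recover the three subsets with ordinary pandas filtering. A minimal sketch, with a three-row toy frame standing in for the real dataset:

```python
import pandas as pd

# Toy stand-in for the loaded frame; the real one has thousands of rows.
df = pd.DataFrame({
    "func": ["int a() { return 0; }", "int b() { return 1; }", "int c() { return 2; }"],
    "target": [0, 1, 0],
    "split": [0, 1, 2],  # 0 = train, 1 = test, 2 = validation
})

train_df = df[df["split"] == 0]
test_df = df[df["split"] == 1]
val_df = df[df["split"] == 2]
```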
Train¶
Define Ludwig Config¶
We use CodeBERT (microsoft/codebert-base), a transformer pretrained on paired natural language and programming language data. For code-related tasks, it captures code structure and semantics significantly better than general-purpose text encoders.
With config.yaml:
input_features:
  - name: func
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: microsoft/codebert-base
      trainable: true
output_features:
  - name: target
    type: binary
trainer:
  epochs: 10
  learning_rate: 2.0e-5
  batch_size: 32
With config defined in a Python dict:
config = {
    "input_features": [
        {
            "name": "func",
            "type": "text",
            "encoder": {
                "type": "auto_transformer",
                "pretrained_model_name_or_path": "microsoft/codebert-base",
                "trainable": True,  # Fine-tune CodeBERT weights
            },
        }
    ],
    "output_features": [
        {
            "name": "target",
            "type": "binary",  # Binary classification: defective vs. clean
        }
    ],
    "trainer": {
        "epochs": 10,
        "learning_rate": 2e-5,
        "batch_size": 32,
    },
}
Create and Train a Model¶
ludwig train --config config.yaml --dataset "ludwig://code_defect_detection"
import logging
from ludwig.api import LudwigModel
from ludwig.datasets import code_defect_detection
train_df, test_df, val_df = code_defect_detection.load()
# Construct Ludwig model from the config dictionary
model = LudwigModel(config, logging_level=logging.INFO)
# Train the model
# model.train returns a tuple: (training statistics, preprocessed data, output directory)
train_stats, preprocessed_data, output_directory = model.train(
    training_set=train_df,
    validation_set=val_df,
    test_set=test_df,
)
Ludwig tokenizes the source code using CodeBERT's byte-pair encoding (which handles identifiers, keywords, and operators correctly), passes token embeddings through the transformer, and trains a binary classifier on top of the [CLS] representation.
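Conceptually, the classifier head is just a logistic layer over the [CLS] vector. A numpy sketch of that last step, with random values standing in for CodeBERT's actual hidden states and the learned head weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: a real run would take these from the fine-tuned model.
hidden_states = rng.normal(size=(128, 768))  # transformer output, one row per token
w = rng.normal(size=768)                     # classifier head weights
b = 0.0                                      # classifier head bias

cls_vector = hidden_states[0]                # position 0 holds the [CLS] token
logit = cls_vector @ w + b
prob_defective = 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> defect probability
```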
Evaluate¶
Generates predictions and binary classification metrics (accuracy, F1, ROC-AUC) for the test set.
ludwig evaluate \
--model_path results/experiment_run/model \
--dataset "ludwig://code_defect_detection" \
--split test \
--output_directory test_results
# Generates predictions and performance statistics for the test set
test_stats, predictions, output_directory = model.evaluate(
    test_df,
    collect_predictions=True,
    collect_overall_stats=True,
)
Note
Defect detection datasets are typically class-imbalanced (clean functions outnumber defective ones). Pay attention to F1 and ROC-AUC rather than raw accuracy. Ludwig reports both.
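To see why raw accuracy misleads on imbalanced data, consider a degenerate model that always predicts "clean" on a 95/5 split. A small self-contained illustration (the numbers are made up):

```python
y_true = [0] * 95 + [1] * 5  # 95 clean functions, 5 defective
y_pred = [0] * 100           # degenerate model: always predicts "clean"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(accuracy)  # 0.95 -- looks great
print(f1)        # 0.0  -- the model never finds a single defect
```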
Visualize Metrics¶
ludwig visualize \
--visualization confusion_matrix \
--ground_truth_metadata results/experiment_run/model/training_set_metadata.json \
--test_statistics test_results/test_statistics.json \
--output_directory visualizations \
--file_format png
from ludwig.visualize import confusion_matrix
confusion_matrix(
    [test_stats],
    model.training_set_metadata,
    "target",
    top_n_classes=[2],
    model_names=[""],
    normalize=True,
)
ludwig visualize \
--visualization learning_curves \
--ground_truth_metadata results/experiment_run/model/training_set_metadata.json \
--training_statistics results/experiment_run/training_statistics.json \
--file_format png \
--output_directory visualizations
from ludwig.visualize import learning_curves
learning_curves(train_stats, output_feature_name="target")
Make Predictions on New Functions¶
Create functions_to_check.csv:
func
"int read_buffer(char *buf, int size) { return fread(buf, 1, size, stdin); }"
"void safe_copy(char *dst, const char *src, size_t n) { strncpy(dst, src, n); dst[n-1] = '\0'; }"
ludwig predict \
--model_path results/experiment_run/model \
--dataset functions_to_check.csv \
--output_directory predictions
import pandas as pd
functions_to_check = pd.DataFrame({
    "func": [
        "int read_buffer(char *buf, int size) { return fread(buf, 1, size, stdin); }",
        "void safe_copy(char *dst, const char *src, size_t n) { strncpy(dst, src, n); dst[n-1] = '\\0'; }",
    ]
})
predictions, output_directory = model.predict(functions_to_check)
print(predictions[["target_predictions", "target_probability"]])
Outputs include target_predictions (True = defective, False = clean) and target_probability (confidence score) in predictions/predictions.parquet.
Tips¶
Handling Long Functions¶
CodeBERT's maximum sequence length is 512 tokens. Functions longer than this are truncated. Since defects often appear in the middle or tail of a function, truncation can hurt recall. Options:
- Increase `max_sequence_length` up to 512 (CodeBERT's hard limit):

      encoder:
        type: auto_transformer
        pretrained_model_name_or_path: microsoft/codebert-base
        trainable: true
        max_sequence_length: 512

- Try an alternative code encoder such as `microsoft/graphcodebert-base` or `Salesforce/codet5-base`; they have similar token limits, but their different pretraining (data flow graphs, span denoising) can change how truncation affects them.
- Chunk and aggregate: split long functions into overlapping windows, run the model on each chunk, and aggregate the predictions (e.g., take the maximum defect probability across chunks).
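The chunk-and-aggregate option can be sketched in a few lines. Here `score_chunk` is a hypothetical stand-in for running the trained model on one window of tokens:

```python
def chunk_tokens(tokens, window=512, stride=256):
    """Split a token list into overlapping windows that cover every token."""
    if len(tokens) <= window:
        return [tokens]
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks

def defect_probability(tokens, score_chunk, window=512, stride=256):
    # Max-aggregation: a defect anywhere in the function should flag
    # the whole function as defective.
    return max(score_chunk(c) for c in chunk_tokens(tokens, window, stride))
```

Max-aggregation favors recall at some cost in precision; mean-aggregation is the more conservative alternative.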
Class Imbalance¶
If defective functions are rare in your dataset, adjust the loss weight:
output_features:
  - name: target
    type: binary
    loss:
      positive_class_weight: 3.0  # Weight defective examples 3× more
Or use threshold tuning on the validation set to find the probability cutoff that maximizes F1.
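Threshold tuning needs nothing beyond the validation-set probabilities. A dependency-free sketch (a real pipeline might use `sklearn.metrics.precision_recall_curve` instead):

```python
def f1_at(y_true, probs, thr):
    """F1 score when predicting 'defective' for probabilities >= thr."""
    preds = [p >= thr for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, probs):
    # Sweep a coarse grid and keep the cutoff with the highest F1.
    grid = [i / 100 for i in range(5, 100, 5)]
    return max(grid, key=lambda t: f1_at(y_true, probs, t))
```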
Choosing a Code Language Model¶
| Model | HF identifier | Languages | Notes |
|---|---|---|---|
| CodeBERT | `microsoft/codebert-base` | 6 PL + NL | Best for classification tasks |
| GraphCodeBERT | `microsoft/graphcodebert-base` | 6 PL + NL | Uses data flow graphs, better for code search |
| CodeT5 | `Salesforce/codet5-base` | 8 PL | Good for generation tasks (code → docstring) |
| StarCoder | `bigcode/starcoderbase` | 80+ PL | Largest coverage, requires more GPU memory |
Code Summarization (Code → Docstring)¶
For generating docstrings from Python functions, switch the output feature to text with a generator decoder:
input_features:
  - name: func
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: Salesforce/codet5-base
      trainable: true
output_features:
  - name: docstring
    type: text
    decoder:
      type: generator
      max_sequence_length: 64
trainer:
  epochs: 5
  learning_rate: 2.0e-5
  batch_size: 16
Hyperparameters to Tune¶
- `trainer.learning_rate`: `1e-5` to `5e-5` for fine-tuning; too large causes instability
- `trainer.batch_size`: 32 is a good default; reduce to 16 with gradient accumulation if memory is tight
- `trainer.epochs`: 5–15 epochs; use `trainer.early_stop: 3` on validation F1
- `encoder.max_sequence_length`: set to the 95th-percentile function length in tokens (measure with the CodeBERT tokenizer)
- `output_features[0].loss.positive_class_weight`: tune on a held-out set if the dataset is imbalanced
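Measuring the 95th-percentile length is a one-liner once you have per-function token counts. A faithful measurement would tokenize with `AutoTokenizer.from_pretrained("microsoft/codebert-base")`; the sketch below substitutes whitespace splitting as a crude proxy so it runs without downloading anything:

```python
def percentile(values, q):
    """Nearest-rank percentile, q in [0, 1]."""
    s = sorted(values)
    idx = min(len(s) - 1, round(q * (len(s) - 1)))
    return s[idx]

funcs = ["int f() { return 0; }", "void g(int x) { h(x); h(x + 1); }"]
token_counts = [len(f.split()) for f in funcs]  # crude proxy for BPE token counts
max_seq_len = percentile(token_counts, 0.95)
```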
Other Ludwig Datasets for Code Intelligence¶
| Dataset | Ludwig name | Task | Input | Output |
|---|---|---|---|---|
| Code Defect Detection | `code_defect_detection` | Bug detection | C/C++ function | binary (defective/clean) |
| Code Contests | `code_contests` | Competitive programming | problem description | solution code |
| Codex Thinking | `codex_thinking` | Chain-of-thought code reasoning | problem + code | reasoning trace |
Code Contests¶
The Code Contests dataset contains competitive programming problems paired with correct Python/C++/Java solutions. You can use it to train a model that generates code from a problem description:
ludwig datasets download code_contests
input_features:
  - name: description
    type: text
    encoder:
      type: auto_transformer
      pretrained_model_name_or_path: Salesforce/codet5-base
      trainable: true
      max_sequence_length: 512
output_features:
  - name: solution
    type: text
    decoder:
      type: generator
      max_sequence_length: 512
trainer:
  epochs: 3
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 3.0e-5