Text Regression - Rating Prediction
This example shows how to predict a continuous numeric score from text using Ludwig — a task known as text regression or rating prediction.
Many real-world problems require mapping a piece of text to a number rather than a discrete category: predicting the star rating of a product review, estimating the reading difficulty of an article, scoring the quality of a generated code snippet, or forecasting the engagement of a social-media post. Ludwig supports this directly by pairing a text input feature with a number output feature.
We'll use the App Reviews dataset, which contains 288,065 user reviews of mobile apps sourced from the Google Play Store. Each review is paired with a star rating between 1 and 5. This makes it an ideal benchmark for learning to predict a numeric quality score from raw text.
Dataset¶
The dataset contains the following columns:
| column | type | description |
|---|---|---|
| review | text | The full text of the user review |
| star | number | Star rating assigned by the reviewer, integer from 1 to 5 |
Sample rows:
| review | star |
|---|---|
| Total trash, waste of time. Do not download! | 1 |
| It's decent. Gets the job done but the UI could use some polish. | 3 |
| Works great and the updates keep making it better. Highly recommend! | 5 |
| Used to be good, now crashes constantly after the last update. | 2 |
| Simple, clean, does exactly what I need. No complaints. | 4 |
Download Dataset¶
Downloads the dataset and writes app_reviews.csv to the current directory.
ludwig datasets download app_reviews
Loads the App Reviews dataset into pandas DataFrames.
from ludwig.datasets import app_reviews
# Load the dataset, pre-split into train, test, and validation DataFrames
train_df, test_df, val_df = app_reviews.load(split=True)
Calling load() with no arguments instead returns a single DataFrame containing the above columns plus a split column (0 = train, 1 = validation, 2 = test).
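If you work with the unsplit DataFrame, the split column can be used to partition it manually. A minimal sketch with a toy stand-in DataFrame, assuming Ludwig's convention of 0 = train, 1 = validation, 2 = test:

```python
import pandas as pd

# Toy stand-in for the unsplit app_reviews DataFrame
df = pd.DataFrame({
    "review": ["great", "bad", "okay", "fine", "meh"],
    "star": [5, 1, 3, 4, 2],
    "split": [0, 0, 1, 2, 0],  # assumed convention: 0 = train, 1 = validation, 2 = test
})

# Partition on the split column and drop it from the feature frames
train_df = df[df["split"] == 0].drop(columns="split")
val_df = df[df["split"] == 1].drop(columns="split")
test_df = df[df["split"] == 2].drop(columns="split")

print(len(train_df), len(val_df), len(test_df))  # 3 1 1
```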
Train¶
Define Ludwig Config¶
The Ludwig config declares the machine learning task. For text regression we use a BERT-based encoder (pre-trained on a large text corpus) fine-tuned to minimize mean squared error on the star rating.
Using a pretrained transformer encoder like BERT allows the model to leverage knowledge of language learned from billions of words, dramatically improving accuracy compared to training from scratch — especially relevant when predicting subtle numeric differences between 3-star and 4-star reviews.
With config.yaml:
input_features:
- name: review
type: text
encoder:
type: bert
trainable: true
output_features:
- name: star
type: number
trainer:
epochs: 5
learning_rate: 1.0e-5
batch_size: 32
With config defined in a Python dict:
config = {
"input_features": [
{
"name": "review",
"type": "text",
"encoder": {
"type": "bert", # Use BERT for text encoding
"trainable": True, # Fine-tune BERT weights
}
}
],
"output_features": [
{
"name": "star",
"type": "number", # Predict a continuous number (regression)
}
],
"trainer": {
"epochs": 5,
"learning_rate": 1e-5, # Small LR for fine-tuning pretrained weights
"batch_size": 32,
}
}
Create and Train a Model¶
ludwig train --config config.yaml --dataset "ludwig://app_reviews"
import logging
from ludwig.api import LudwigModel
from ludwig.datasets import app_reviews
train_df, test_df, val_df = app_reviews.load(split=True)
# Construct Ludwig model from the config dictionary
model = LudwigModel(config, logging_level=logging.INFO)
# Train the model; train() returns a tuple of training statistics,
# the preprocessed data, and the output directory
train_stats, preprocessed_data, output_directory = model.train(
    training_set=train_df,
    validation_set=val_df,
    test_set=test_df,
)
Ludwig fine-tunes the BERT encoder end-to-end with mean squared error loss on the star rating. It automatically handles tokenization, padding, and attention masks.
Evaluate¶
Generates predictions and regression metrics (MAE, MSE, R²) for the held-out test set.
ludwig evaluate \
--model_path results/experiment_run/model \
--dataset "ludwig://app_reviews" \
--split test \
--output_directory test_results
# Generates predictions and regression performance statistics for the test set
test_stats, predictions, output_directory = model.evaluate(
test_df,
collect_predictions=True,
collect_overall_stats=True,
)
For number outputs Ludwig reports:

- mean absolute error (MAE): average absolute difference between predicted and true star ratings
- mean squared error (MSE): average squared difference, which penalizes large errors more heavily
- R²: coefficient of determination (1.0 is perfect)
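These metrics are easy to verify by hand from the predictions. A minimal plain-Python sketch with hypothetical true and predicted ratings (not using Ludwig's API):

```python
# Hypothetical ground-truth and predicted star ratings
y_true = [1.0, 3.0, 5.0, 2.0, 4.0]
y_pred = [1.5, 2.5, 4.5, 2.0, 4.5]

n = len(y_true)
# MAE: mean of absolute errors
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
# MSE: mean of squared errors
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
# R^2: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)  # 0.4 0.2 0.9
```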
Visualize Metrics¶
ludwig visualize \
--visualization learning_curves \
--ground_truth_metadata results/experiment_run/model/training_set_metadata.json \
--training_statistics results/experiment_run/training_statistics.json \
--file_format png \
--output_directory visualizations
from ludwig.visualize import learning_curves
learning_curves(train_stats, output_feature_name="star")
Make Predictions on New Reviews¶
Create reviews_to_rate.csv:
review
The app crashes every time I open it. Terrible.
Really useful for tracking workouts. Love the interface.
It's okay. Nothing special but it works.
ludwig predict \
--model_path results/experiment_run/model \
--dataset reviews_to_rate.csv \
--output_directory predictions
import pandas as pd
reviews_to_rate = pd.DataFrame({
"review": [
"The app crashes every time I open it. Terrible.",
"Really useful for tracking workouts. Love the interface.",
"It's okay. Nothing special but it works.",
]
})
predictions, output_directory = model.predict(reviews_to_rate)
print(predictions[["star_predictions"]])
Prediction outputs include star_predictions (the predicted numeric star rating) in predictions/predictions.parquet.
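Because the raw output is a continuous value, it can be useful to clip it to the valid range and round to the nearest whole star for display. A minimal pandas sketch with hypothetical predicted values:

```python
import pandas as pd

# Hypothetical raw model outputs, some outside the valid 1-5 range
predictions = pd.DataFrame({"star_predictions": [0.7, 3.4, 5.6, 4.6]})

# Clip to [1, 5], then round to the nearest integer star
predictions["star_rounded"] = (
    predictions["star_predictions"].clip(lower=1, upper=5).round().astype(int)
)
print(predictions["star_rounded"].tolist())  # [1, 3, 5, 5]
```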
Tips¶
Loss Function¶
By default Ludwig uses mean squared error for number outputs. You can switch to mean absolute error if you want to be less sensitive to outlier ratings:
output_features:
- name: star
type: number
loss:
type: mean_absolute_error
Clipping Predictions¶
Star ratings are bounded between 1 and 5. Adding a clip constraint prevents the model from predicting values outside this range:
output_features:
- name: star
type: number
clip: [1, 5]
Treating the Task as Classification¶
An alternative approach is to treat the star rating as a 5-class category rather than a continuous number. This often performs better when the labels are integers with limited range and the differences between adjacent classes are meaningful categorical distinctions rather than a continuous gradient.
output_features:
- name: star
type: category
trainer:
epochs: 5
learning_rate: 1.0e-5
batch_size: 32
The trade-off: classification gives you per-class probabilities but loses ordinal information (it treats 1-star and 5-star as equally distant from 3-star as 2-star).
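One way to recover a numeric score from the classification setup is to take the probability-weighted average over the classes, i.e. the expected star rating. A minimal plain-Python sketch with hypothetical class probabilities (independent of Ludwig's output format):

```python
# Hypothetical predicted probabilities for the classes 1..5 stars
probs = [0.05, 0.10, 0.20, 0.40, 0.25]
stars = [1, 2, 3, 4, 5]

# Expected rating: sum of each star value weighted by its probability
expected_rating = sum(s * p for s, p in zip(stars, probs))
print(round(expected_rating, 2))  # 3.7
```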
Encoder Options¶
| Encoder | type value | Notes |
|---|---|---|
| BERT | `bert` | Strong general-purpose baseline |
| RoBERTa | `auto_transformer` with `pretrained_model_name_or_path: roberta-base` | Often slightly better than BERT |
| DistilBERT | `auto_transformer` with `pretrained_model_name_or_path: distilbert-base-uncased` | 40% smaller, 60% faster, ~97% of BERT quality |
| Parallel CNN | `parallel_cnn` | Much faster, no pretrained weights, lower accuracy |
To use a different pretrained model:
input_features:
- name: review
type: text
encoder:
type: auto_transformer
pretrained_model_name_or_path: roberta-base
trainable: true
Hyperparameters to Tune¶
- `trainer.learning_rate`: fine-tuning a pretrained model works best with small learning rates (1e-5 to 5e-5); training from scratch needs larger rates (1e-3 to 1e-4)
- `trainer.batch_size`: larger batches stabilize fine-tuning; try 32 or 64, with gradient accumulation if GPU memory is limited
- `encoder.max_sequence_length`: default is 256 tokens; reviews longer than this are truncated. Set to 512 for BERT-compatible models at the cost of more memory
- `trainer.epochs` with `trainer.early_stop: 3`: stop fine-tuning once validation MAE stops improving
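For example, a generous epoch budget can be combined with early stopping. A sketch of the trainer section (the rest of the config stays as above; early_stop counts evaluation rounds without validation improvement):

trainer:
  epochs: 20
  learning_rate: 1.0e-5
  batch_size: 32
  early_stop: 3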
Other Ludwig Datasets for Text Regression¶
| Dataset | Ludwig name | Input | Output | Size |
|---|---|---|---|---|
| Amazon Reviews 2023 | amazon_reviews_2023 | title + text (review title and body) | rating (number, 1–5) | varies by category |
Amazon Reviews 2023¶
The Amazon Reviews 2023 dataset spans multiple product categories (electronics, books, clothing, etc.) and provides both the review title and full review text. Using both as inputs often improves accuracy:
input_features:
- name: title
type: text
encoder:
type: bert
trainable: true
- name: text
type: text
encoder:
type: bert
trainable: true
output_features:
- name: rating
type: number
trainer:
epochs: 3
learning_rate: 1.0e-5
batch_size: 32
ludwig datasets download amazon_reviews_2023