Timeseries forecasting (Weather)

This example illustrates univariate timeseries forecasting using historical temperature data for Los Angeles.

Dowload and unpack historical hourly weather data available on Kaggle https://www.kaggle.com/selfishgene/historical-hourly-weather-data

Run the following python script to prepare the training dataset:

import pandas as pd
from ludwig.utils.data_utils import add_sequence_feature_column

df = pd.read_csv(
    '<PATH_TO_FILE>/temperature.csv',
    usecols=['Los Angeles']
).rename(
    columns={"Los Angeles": "temperature"}
).fillna(method='backfill').fillna(method='ffill')

# normalize
df.temperature = ((df.temperature-df.temperature.mean()) /
                  df.temperature.std())

train_size = int(0.6 * len(df))
vali_size = int(0.2 * len(df))

# train, validation, test split
df['split'] = 0
df.loc[
    (
        (df.index.values >= train_size) &
        (df.index.values < train_size + vali_size)
    ),
    ('split')
] = 1
df.loc[
    df.index.values >= train_size + vali_size,
    ('split')
] = 2

# prepare timeseries input feature colum
# (here we are using 20 preceding values to predict the target)
add_sequence_feature_column(df, 'temperature', 20)
df.to_csv('<PATH_TO_FILE>/temperature_la.csv')
ludwig experiment \
--dataset <PATH_TO_FILE>/temperature_la.csv \
  --config_file config.yaml

With config.yaml:

input_features:
    -
        name: temperature_feature
        type: timeseries
        encoder: rnn
        embedding_size: 32
        state_size: 32

output_features:
    -
        name: temperature
        type: numerical