Defining a training dataset

A Dataset is a saved definition of a training slice: which entity type, which features, and which entity instances to include. A Dataset describes how to assemble training data; it does not hold the data itself. The records are produced when you materialize the Dataset on the next page.

On the previous page, you queried the data and confirmed that the feature values differ between label classes. Now save that selection as a Dataset so later steps can reproduce it.

Create and save the Dataset definition

Run Cell 6 to create and save the Dataset definition.

Cell 6: Create and save Dataset definition

if "require_part2" not in globals():
    raise ValueError("Missing Part 2 bootstrap helper. Re-run Part 2 Cell 1.")

require_part2([
    "DATASET_NAME", "ENTITY_TYPE", "ENTITY_ID_COL",
    "TRAIN_COLUMNS", "TRAINING_PREDICATE", "DATASET_LOCATION", "LABEL_COL"
])

decline_risk_dataset = Dataset(
    name=DATASET_NAME,
    description="Training data for trip decline risk prediction model v1",
    entity=ENTITY_TYPE,
    id_col=ENTITY_ID_COL,
    id_type="string",
    features=TRAIN_COLUMNS,      # Columns to include when materializing
    query=TRAINING_PREDICATE,    # Row filter (which entities to include)
    location=DATASET_LOCATION,
    attrs={
        "model_type": "classification",
        "target_column": LABEL_COL  # Metadata for downstream readers/tools
    },
    tags=["decline-risk", "driver", "training", "v1"]
)

decline_risk_dataset.save()
print(f"Saved dataset definition: {decline_risk_dataset.name}")

Two fields control what the Dataset includes:

features: which columns appear in the materialized DataFrame. TRAIN_COLUMNS includes three predictor columns plus label, so the materialized data contains both model inputs and the known outcome column.
query: which rows are included. TRAINING_PREDICATE is driver_id >= 'driver_001' and driver_id <= 'driver_100', selecting all 100 synthetic drivers.
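To make the roles of the two fields concrete, here is a minimal plain-Python sketch that does not use the Dataset class at all. The sample rows, the predictor names (feature_a, feature_b, feature_c), and the helpers matches_predicate and select_training_rows are hypothetical stand-ins; only driver_id, label, and the range bounds come from this tutorial. It also assumes the predicate compares IDs as strings, which in Python is lexicographic and lines up with numeric order here only because the IDs are zero-padded to three digits.

```python
# Hypothetical sample rows. Only driver_id and label are names from the tutorial;
# the three predictor names and the "extra" column are made up for illustration.
rows = [
    {"driver_id": "driver_001", "feature_a": 0.1, "feature_b": 3, "feature_c": 12.0, "label": 0, "extra": "x"},
    {"driver_id": "driver_050", "feature_a": 0.7, "feature_b": 9, "feature_c": 4.5, "label": 1, "extra": "y"},
    {"driver_id": "driver_100", "feature_a": 0.4, "feature_b": 5, "feature_c": 8.0, "label": 0, "extra": "z"},
    {"driver_id": "driver_101", "feature_a": 0.9, "feature_b": 1, "feature_c": 2.0, "label": 1, "extra": "w"},
]

# Stand-in for TRAIN_COLUMNS: three predictors plus the label.
train_columns = ["feature_a", "feature_b", "feature_c", "label"]


def matches_predicate(row):
    """Row filter mirroring the range in TRAINING_PREDICATE (string comparison)."""
    return "driver_001" <= row["driver_id"] <= "driver_100"


def select_training_rows(rows, columns):
    """Apply the row filter first, then keep only the requested columns."""
    return [{c: r[c] for c in columns} for r in rows if matches_predicate(r)]


selected = select_training_rows(rows, train_columns)
print(len(selected))        # 3: driver_101 falls outside the range
print(sorted(selected[0]))  # ['feature_a', 'feature_b', 'feature_c', 'label']
```

The filter decides which entities contribute rows, and the column list decides what each row contains. How the materialized DataFrame treats the ID column is not covered by this sketch.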

attrs["target_column"] is descriptive metadata. It documents that label is the target column, but the Dataset class does not validate or enforce it.
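Because the target column is not enforced, a downstream reader may want to check it before training. The following is a minimal sketch of such a check in plain Python; check_target_column is a hypothetical helper, not part of the Dataset class, and the predictor names are placeholders.

```python
def check_target_column(attrs, features):
    """Return the target column named in attrs, or raise if it is unusable.

    Hypothetical helper: it only inspects a plain dict and a list of column names.
    """
    target = attrs.get("target_column")
    if target is None:
        raise ValueError("attrs has no 'target_column' entry")
    if target not in features:
        raise ValueError(f"target column {target!r} is not listed in features")
    return target


# Placeholder values shaped like the attrs and features passed in Cell 6.
attrs = {"model_type": "classification", "target_column": "label"}
features = ["feature_a", "feature_b", "feature_c", "label"]

target = check_target_column(attrs, features)
predictors = [c for c in features if c != target]
print(target)      # label
print(predictors)  # ['feature_a', 'feature_b', 'feature_c']
```

Running a check like this at the start of training catches a mismatch between the metadata and the column list early, rather than as a missing-column error later.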

For a deeper explanation of Dataset semantics, refer back to Part 1: Feature Engineering.

The definition is now stored in Aerospike, separate from the actual training data. Anyone with access can load it with Dataset.load(DATASET_NAME) to see exactly which features and filters were used.

Next, you’ll materialize this definition into data you can train on.