---
title: "Defining a training dataset"
description: "Create a Dataset object that defines a reproducible training slice."
---

# Defining a training dataset

A Dataset is a saved definition of a training slice: which entity type, which features, and which entity instances to include. A Dataset describes _how_ to assemble training data, not the data itself. The actual records are created when you _materialize_ it on the next page.

On the previous page, you queried the data and confirmed that the features separate by label. Now save that selection of features and entities as a Dataset so later steps can reproduce it.

## Create and save the Dataset definition

1.  Run `Cell 6` to create and save the Dataset definition.

Cell 6: Create and save Dataset definition

```python
# Fail fast if the Part 2 bootstrap cell has not been run in this session.
if "require_part2" not in globals():
    raise ValueError("Missing Part 2 bootstrap helper. Re-run Part 2 Cell 1.")

# Confirm that every name this cell depends on is defined.
require_part2([
    "DATASET_NAME", "ENTITY_TYPE", "ENTITY_ID_COL",
    "TRAIN_COLUMNS", "TRAINING_PREDICATE", "DATASET_LOCATION", "LABEL_COL"
])

decline_risk_dataset = Dataset(
    name=DATASET_NAME,
    description="Training data for trip decline risk prediction model v1",
    entity=ENTITY_TYPE,
    id_col=ENTITY_ID_COL,
    id_type="string",
    features=TRAIN_COLUMNS,      # Columns to include when materializing
    query=TRAINING_PREDICATE,    # Row filter (which entities to include)
    location=DATASET_LOCATION,
    attrs={
        "model_type": "classification",
        "target_column": LABEL_COL  # Metadata for downstream readers/tools
    },
    tags=["decline-risk", "driver", "training", "v1"]
)

# Persist the definition only; no training records are written here.
decline_risk_dataset.save()

print(f"Saved dataset definition: {decline_risk_dataset.name}")
```

Two fields control what the Dataset includes:

-   `features`: which columns appear in the materialized DataFrame. `TRAIN_COLUMNS` includes three predictor columns plus `label`, so the materialized data contains both model inputs and the known outcome column.
-   `query`: which rows are included. `TRAINING_PREDICATE` is `driver_id >= 'driver_001' and driver_id <= 'driver_100'`, selecting all 100 synthetic drivers.
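To make the two fields concrete, here is a minimal sketch in plain Python that mimics them on in-memory rows. It does not use the feature store at all. The predictor names (`feature_a`, `feature_b`, `feature_c`), the extra `unused_col`, and the helper `in_training_range` are hypothetical, and the sketch assumes the predicate compares IDs as strings, which is why zero-padded IDs matter.

```python
# Hypothetical stand-in for the tutorial's TRAIN_COLUMNS; real predictor names may differ.
TRAIN_COLUMNS = ["feature_a", "feature_b", "feature_c", "label"]


def in_training_range(driver_id: str) -> bool:
    """Mimic the predicate using plain string comparison (an assumption)."""
    return "driver_001" <= driver_id <= "driver_100"


# 150 synthetic rows with zero-padded IDs, plus one column not listed in TRAIN_COLUMNS.
rows = [
    {
        "driver_id": f"driver_{i:03d}",
        "feature_a": i,
        "feature_b": i * 2,
        "feature_c": i % 7,
        "label": i % 2,
        "unused_col": "x",
    }
    for i in range(1, 151)
]

# query-like step: keep rows whose ID is in range.
# features-like step: keep only the listed columns.
selected = [
    {col: row[col] for col in TRAIN_COLUMNS}
    for row in rows
    if in_training_range(row["driver_id"])
]

print(len(selected))                  # 100
print(sorted(selected[0]))            # ['feature_a', 'feature_b', 'feature_c', 'label']
print(in_training_range("driver_2"))  # False: an unpadded ID sorts after 'driver_100'
```

Under string comparison, the range works because every ID has the same width. An unpadded ID such as `driver_2` would fall outside it.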

`attrs["target_column"]` is descriptive metadata. It documents that `label` is the target column, but the Dataset class does not enforce that.
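Because nothing enforces this metadata, you may want a small check of your own before training. The following is a sketch only: `check_target_column` is a hypothetical helper, not part of the Dataset class, and the feature names are placeholders for `TRAIN_COLUMNS` and `LABEL_COL`.

```python
def check_target_column(features, attrs):
    """Return the target column named in attrs, or raise if it is missing
    from attrs or not among the features. Hypothetical helper."""
    target = attrs.get("target_column")
    if target is None:
        raise ValueError("attrs has no 'target_column' entry")
    if target not in features:
        raise ValueError(f"target column {target!r} is not in features: {features}")
    return target


# Placeholder values standing in for TRAIN_COLUMNS and the attrs dict above.
features = ["feature_a", "feature_b", "feature_c", "label"]
attrs = {"model_type": "classification", "target_column": "label"}

target = check_target_column(features, attrs)
predictors = [col for col in features if col != target]

print(target)      # label
print(predictors)  # ['feature_a', 'feature_b', 'feature_c']
```

A check like this catches a renamed or dropped label column early, before it surfaces as a confusing error in the training code.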

For a deeper explanation of Dataset semantics, refer back to [Part 1: Feature Engineering](https://aerospike.com/docs/develop/feature-store/step/1/part/0/feature-store-objects/).

The definition is now stored in Aerospike, separate from the actual training data. Anyone can load it with `Dataset.load(DATASET_NAME)` to see exactly what features and filters were used.

Next, you’ll _materialize_ this definition into data you can train on.

::: undefined
-   I can create and save a Dataset definition for trip decline risk.
-   I understand how features and query control what a Dataset includes.
:::
