---
title: "Materializing training data"
description: "Transform the dataset definition into training and test splits for model training."
---

# Materializing training data


**Materializing** a dataset means transforming the definition (which features, which entities, which filters) into actual data you can train on.

The Dataset object you created describes _what_ to include. Materialization executes that specification against the current entity data in Aerospike. This page reuses shared constants from the Part 2 bootstrap cell.

::: data model context
Spark expects a tabular DataFrame (rows and columns), so this page uses that view for training. Aerospike itself uses a schemaless record model: data lives in namespaces and sets, where each key identifies a record made of bins (name/value pairs). Sets and bins do not require a fixed relational schema, and records can vary in which bins they contain. Materialization maps Aerospike records and bins into Spark rows and columns so you can use Spark ML.
:::
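As a rough illustration of that mapping, the sketch below uses plain Python dictionaries to stand in for Aerospike records; it does not call Aerospike or Spark. The sample records, the `to_rows` helper, and the choice of `None` for a missing bin are assumptions made for this example.

```python
# Hypothetical records: each key maps to a dict of bins (name/value pairs).
# Records in the same set do not have to carry the same bins.
records = {
    "driver_001": {"ds_decl_rate": 0.037, "ds_avg_rating": 4.78, "da_trips_today": 9},
    "driver_002": {"ds_decl_rate": 0.182, "ds_avg_rating": 4.21},  # no da_trips_today bin
}


def to_rows(records, columns, id_col="driver_id"):
    """Flatten keyed records into fixed-width rows, one value per column."""
    rows = []
    for key, bins in records.items():
        row = {id_col: key}
        for col in columns:
            # A bin that is absent from the record becomes None in the row.
            row[col] = bins.get(col)
        rows.append(row)
    return rows


columns = ["ds_decl_rate", "ds_avg_rating", "da_trips_today"]
rows = to_rows(records, columns)
for row in rows:
    print(row)
```

Every row ends up with the same columns even though the second record has fewer bins, which is the tabular shape Spark expects.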

## Load and materialize the dataset

Load the saved Dataset definition and materialize it in one cell.

The `materialize()` method:

1.  Reads records from the `driver-features` set in Aerospike (based on `ds.entity`)
2.  Applies the filter query (`ds.query`)
3.  Projects bins into DataFrame columns from `ds.features` plus the ID column
4.  Writes the materialized dataset to Parquet when `ds.location` is set
5.  Returns a DataFrame ready for training
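The read, filter, and project steps above can be pictured with a small plain-Python sketch. This is not the library's implementation: `materialize_sketch`, the in-memory `driver_features` dictionary, and the lambda filter are hypothetical stand-ins, and the Parquet write in step 4 is left out.

```python
def materialize_sketch(records, predicate, features, id_col):
    """Filter records, then project the ID column plus the requested features."""
    rows = []
    for key, bins in records.items():
        if not predicate(bins):          # step 2: apply the filter
            continue
        row = {id_col: key}              # step 3: ID column first
        for name in features:
            row[name] = bins.get(name)   # step 3: one column per feature bin
        rows.append(row)
    return rows                          # step 5: rows ready for training


# Hypothetical stand-in for the driver-features set (step 1 reads from it).
driver_features = {
    "driver_001": {"ds_decl_rate": 0.037, "ds_avg_rating": 4.78, "label": 0, "extra_bin": "x"},
    "driver_002": {"ds_decl_rate": 0.182, "ds_avg_rating": 4.21, "label": 1},
    "driver_003": {"ds_avg_rating": 4.65},  # no label bin, so the filter drops it
}

rows = materialize_sketch(
    driver_features,
    predicate=lambda bins: "label" in bins,  # hypothetical filter
    features=["ds_decl_rate", "ds_avg_rating", "label"],
    id_col="driver_id",
)
print(rows)
```

Bins that are not listed in `features` (here `extra_bin`) never reach the output, and records that fail the filter are skipped entirely.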

1.  Run `Cell 7` to load and materialize training data.

Cell 7: Load and materialize training data

```python
if "require_part2" not in globals():
    raise ValueError("Missing Part 2 bootstrap helper. Re-run Part 2 Cell 1.")

require_part2(["DATASET_NAME", "SCHEMA", "FEATURE_COLUMNS", "ENTITY_ID_COL", "LABEL_COL"])

ds = Dataset.load(DATASET_NAME)
if ds is None:
    raise ValueError("Dataset definition not found. Run the previous page first.")

print(f"Name: {ds.name}")
print(f"Entity: {ds.entity}")
print(f"Features: {ds.features}")

data_df = ds.materialize(SCHEMA)
print(f"Materialized {data_df.count()} records")
print(f"Columns: {data_df.columns}")

data_df.select(ENTITY_ID_COL, *FEATURE_COLUMNS, LABEL_COL).show(5)
```

Expected output

```plaintext
Name: trip-decline-risk-v1
Entity: driver
Features: ['ds_decl_rate', 'ds_avg_rating', 'da_trips_today', 'label']
Materialized 100 records
Columns: ['driver_id', 'ds_decl_rate', 'ds_avg_rating', 'da_trips_today', 'label']
+----------+------------+-------------+---------------+-----+
| driver_id|ds_decl_rate|ds_avg_rating| da_trips_today|label|
+----------+------------+-------------+---------------+-----+
|driver_001|       0.037|         4.78|              9|    0|
|driver_002|       0.182|         4.21|              3|    1|
|driver_003|       0.052|         4.65|             11|    0|
|driver_004|       0.028|         4.89|              7|    0|
|driver_005|       0.145|         4.15|              5|    1|
+----------+------------+-------------+---------------+-----+
```

## Create a feature vector

Before you can train a model with Spark ML, you need to put the input values into one column (usually named `features`). Each row’s `features` value is a single **vector**, an ordered list of numbers such as `[ds_decl_rate, ds_avg_rating, da_trips_today]`. The `VectorAssembler` transformer builds that `features` column for you.

1.  Run `Cell 8` to create feature vectors with `VectorAssembler` (no output is expected).

Cell 8: Create feature vectors with VectorAssembler

```python
from pyspark.ml.feature import VectorAssembler

# Create the assembler
# inputCols: the columns to combine
# outputCol: the name of the new vector column
assembler = VectorAssembler(
    inputCols=FEATURE_COLUMNS,
    outputCol="features"
)

# Transform the DataFrame to add the features column
vectorized_df = assembler.transform(data_df)
```

1.  Run `Cell 9` to preview assembled feature vectors.

Cell 9: Preview assembled feature vectors

```python
# View the vectorized data
# The 'features' column contains all three predictors as a single vector
vectorized_df.select(ENTITY_ID_COL, 'features', LABEL_COL).show(5, truncate=False)
```

Vectorized data

```plaintext
+----------+-------------------+-----+
|driver_id |features           |label|
+----------+-------------------+-----+
|driver_001|[0.037,4.78,9.0]   |0    |
|driver_002|[0.182,4.21,3.0]   |1    |
|driver_003|[0.052,4.65,11.0]  |0    |
|driver_004|[0.028,4.89,7.0]   |0    |
|driver_005|[0.145,4.15,5.0]   |1    |
+----------+-------------------+-----+
```

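Each `features` vector contains three values in the order of the input columns: `ds_decl_rate`, `ds_avg_rating`, `da_trips_today`. Integer inputs such as `da_trips_today` appear as floating-point values (`9` becomes `9.0`). Spark MLlib algorithms use this vector for training.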
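To make the ordering concrete, here is a minimal plain-Python sketch of what assembling a row amounts to conceptually. It is not how `VectorAssembler` is implemented; the `assemble` helper and the sample row are assumptions for illustration, with values taken from the first row of the preview.

```python
def assemble(row, input_cols):
    """Return the row's input columns as one ordered list of floats."""
    return [float(row[col]) for col in input_cols]


feature_columns = ["ds_decl_rate", "ds_avg_rating", "da_trips_today"]
row = {
    "driver_id": "driver_001",
    "ds_decl_rate": 0.037,
    "ds_avg_rating": 4.78,
    "da_trips_today": 9,
    "label": 0,
}

features = assemble(row, feature_columns)
print(features)  # [0.037, 4.78, 9.0]
```

Changing the order of `feature_columns` changes the order of values in the vector, so the same column order must be used at training and prediction time.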

## Split into training and test sets

To evaluate your model fairly, split the data into two parts:

-   **Training set**: Data the model learns from
-   **Test set**: Data held out to evaluate performance on unseen examples

A common split is 80% training, 20% test:

1.  Run `Cell 10` to split into train and test sets and print label distribution.

Cell 10: Split into train/test and print label distribution

```python
# Split into training (80%) and test (20%) sets
# The seed parameter ensures reproducible splits
train_df, test_df = vectorized_df.randomSplit([0.8, 0.2], seed=42)

# Print split sizes and class distribution
print(f"Training set: {train_df.count()} records")
print(f"Test set: {test_df.count()} records")

print("\nTraining set class distribution:")
train_df.groupBy(LABEL_COL).count().orderBy(LABEL_COL).show()

print("Test set class distribution:")
test_df.groupBy(LABEL_COL).count().orderBy(LABEL_COL).show()
```

Split results

```plaintext
Training set: 82 records
Test set: 18 records

Training set class distribution:
+-----+-----+
|label|count|
+-----+-----+
|    0|   67|
|    1|   15|
+-----+-----+

Test set class distribution:
+-----+-----+
|label|count|
+-----+-----+
|    0|   15|
|    1|    3|
+-----+-----+
```

Exact split counts can vary slightly by Spark version and partitioning, but the totals should stay near an 80/20 split with a similar class mix.

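Both sets keep roughly the original class mix of 82% label 0 to 18% label 1 (about 82/18 in the training set and 83/17 in the test set). `randomSplit` does not stratify by label, so check the distribution after splitting: if one set had no higher-risk examples (label=1), evaluation would be meaningless.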
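One way to run that check outside Spark is sketched below in plain Python. The `class_mix` and `has_all_classes` helpers are hypothetical names for this example, and the label lists are rebuilt from the counts in the split results above.

```python
from collections import Counter


def class_mix(labels):
    """Return {label: share of rows} for a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


def has_all_classes(labels, expected=(0, 1)):
    """True when every expected label appears at least once."""
    present = set(labels)
    return all(label in present for label in expected)


# Label counts taken from the split results above.
train_labels = [0] * 67 + [1] * 15
test_labels = [0] * 15 + [1] * 3

for name, labels in [("train", train_labels), ("test", test_labels)]:
    mix = {label: round(share, 3) for label, share in class_mix(labels).items()}
    print(name, mix, "all classes present:", has_all_classes(labels))
```

If `has_all_classes` returned `False` for either set, a different seed or a stratified sampling approach would be worth considering before training.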

The `seed=42` parameter makes the split reproducible: running the same code again on the same materialized data produces the same split.
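The sketch below shows, in plain Python, why a seeded random split is repeatable yet only approximately 80/20. It assigns each row with one random draw, which loosely mirrors random sampling but is not Spark's implementation; `bernoulli_split` and the generated driver IDs are assumptions for this example.

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test with one seeded random draw per row."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = [f"driver_{i:03d}" for i in range(1, 101)]

train_a, test_a = bernoulli_split(rows, 0.8, seed=42)
train_b, test_b = bernoulli_split(rows, 0.8, seed=42)

# Sizes land near 80 and 20 but are not guaranteed to be exact.
print(len(train_a), len(test_a))
# The same seed and the same input order give the same assignment.
print(train_a == train_b and test_a == test_b)
```

Because each row is assigned independently, the sizes fluctuate around the requested weights, and changing the input order or the seed changes which rows land in each set.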

The data is now ready for model training.

::: checkpoint
-   I can materialize a Dataset definition into a Spark DataFrame.
-   I have created training and test splits ready for model training.
:::
