Materializing training data

For the complete documentation index see: llms.txt

All documentation pages available in markdown.

Materializing a dataset means transforming the definition (which features, which entities, which filters) into actual data you can train on.

The Dataset object you created describes what to include. Materialization executes that specification against the current entity data in Aerospike. This page reuses shared constants from the Part 2 bootstrap cell.

Load and materialize the dataset

Load the saved Dataset definition and materialize it in one cell.

The materialize() method:

Reads records from the driver-features set in Aerospike (based on ds.entity)
Applies the filter query (ds.query)
Projects bins into DataFrame columns from ds.features plus the ID column
Writes the materialized dataset to Parquet when ds.location is set
Returns a DataFrame ready for training

Run Cell 7 to load and materialize training data.

Cell 7: Load and materialize training data

if "require_part2" not in globals():
    raise ValueError("Missing Part 2 bootstrap helper. Re-run Part 2 Cell 1.")

require_part2(["DATASET_NAME", "SCHEMA", "FEATURE_COLUMNS", "ENTITY_ID_COL", "LABEL_COL"])

ds = Dataset.load(DATASET_NAME)
if ds is None:
    raise ValueError("Dataset definition not found. Run the previous page first.")

print(f"Name: {ds.name}")
print(f"Entity: {ds.entity}")
print(f"Features: {ds.features}")
data_df = ds.materialize(SCHEMA)

print(f"Materialized {data_df.count()} records")
print(f"Columns: {data_df.columns}")
data_df.select(ENTITY_ID_COL, *FEATURE_COLUMNS, LABEL_COL).show(5)

Expected output

Name: trip-decline-risk-v1
Entity: driver
Features: ['ds_decl_rate', 'ds_avg_rating', 'da_trips_today', 'label']
Materialized 100 records
Columns: ['driver_id', 'ds_decl_rate', 'ds_avg_rating', 'da_trips_today', 'label']
+----------+------------+-------------+---------------+-----+
|driver_id |ds_decl_rate|ds_avg_rating|da_trips_today|label|
+----------+------------+-------------+---------------+-----+
|driver_001|       0.037|         4.78|              9|    0|
|driver_002|       0.182|         4.21|              3|    1|
|driver_003|       0.052|         4.65|             11|    0|
|driver_004|       0.028|         4.89|              7|    0|
|driver_005|       0.145|         4.15|              5|    1|
+----------+------------+-------------+---------------+-----+

Create a feature vector

Before you can train a model with Spark ML, you need to put the input values into one column (usually named features). Each row’s features value is a single vector, an ordered list of numbers like [decline_rate, avg_rating, trips_today]. The VectorAssembler utility builds that features column for you.

Run Cell 8 to create feature vectors with VectorAssembler (no output is expected).

Cell 8: Create feature vectors with VectorAssembler

from pyspark.ml.feature import VectorAssembler

# Create the assembler
# inputCols: the columns to combine
# outputCol: the name of the new vector column
assembler = VectorAssembler(
    inputCols=FEATURE_COLUMNS,
    outputCol="features"
)

# Transform the DataFrame to add the features column
vectorized_df = assembler.transform(data_df)

Run Cell 9 to preview assembled feature vectors.

Cell 9: Preview assembled feature vectors

# View the vectorized data
# The 'features' column contains all three predictors as a single vector
vectorized_df.select(ENTITY_ID_COL, 'features', LABEL_COL).show(5, truncate=False)

Vectorized data

+----------+-------------------+-----+
|driver_id |features           |label|
+----------+-------------------+-----+
|driver_001|[0.037,4.78,9.0]   |0    |
|driver_002|[0.182,4.21,3.0]   |1    |
|driver_003|[0.052,4.65,11.0]  |0    |
|driver_004|[0.028,4.89,7.0]   |0    |
|driver_005|[0.145,4.15,5.0]   |1    |
+----------+-------------------+-----+

Each features vector contains three values in order: decline_rate, avg_rating, trips_today. Spark MLlib algorithms use this vector for training.

Split into training and test sets

To evaluate your model fairly, split the data into two parts:

Training set: Data the model learns from
Test set: Data held out to evaluate performance on unseen examples

A common split is 80% training, 20% test:

Run Cell 10 to split into train and test sets and print label distribution.

Cell 10: Split into train/test and print label distribution

# Split into training (80%) and test (20%) sets
# The seed parameter ensures reproducible splits
train_df, test_df = vectorized_df.randomSplit([0.8, 0.2], seed=42)

# Print split sizes and class distribution
print(f"Training set: {train_df.count()} records")
print(f"Test set: {test_df.count()} records")

print("\nTraining set class distribution:")
train_df.groupBy(LABEL_COL).count().orderBy(LABEL_COL).show()

print("Test set class distribution:")
test_df.groupBy(LABEL_COL).count().orderBy(LABEL_COL).show()

Split results

Training set: 82 records
Test set: 18 records

Training set class distribution:
+-----+-----+
|label|count|
+-----+-----+
|    0|   67|
|    1|   15|
+-----+-----+

Test set class distribution:
+-----+-----+
|label|count|
+-----+-----+
|    0|   15|
|    1|    3|
+-----+-----+

Exact split counts can vary slightly by Spark version and partitioning, but the totals should stay near an 80/20 split with a similar class mix.

Both sets maintain approximately the 82/18 class ratio from the original data. If one set had no higher-risk examples (label=1), evaluation would be meaningless.

The seed=42 parameter makes the split reproducible. Running the same code again produces the same split.

The data is now ready for model training.