Materializing training data
For the complete documentation index see: llms.txt
All documentation pages available in markdown.
Materializing a dataset means transforming the definition (which features, which entities, which filters) into actual data you can train on.
The Dataset object you created describes what to include. Materialization executes that specification against the current entity data in Aerospike. This page reuses shared constants from the Part 2 bootstrap cell.
Load and materialize the dataset
Load the saved Dataset definition and materialize it in one cell.
The materialize() method:
- Reads records from the
driver-featuresset in Aerospike (based onds.entity) - Applies the filter query (
ds.query) - Projects bins into DataFrame columns from
ds.featuresplus the ID column - Writes the materialized dataset to Parquet when
ds.locationis set - Returns a DataFrame ready for training
- Run
Cell 7to load and materialize training data.
Cell 7: Load and materialize training data
if "require_part2" not in globals(): raise ValueError("Missing Part 2 bootstrap helper. Re-run Part 2 Cell 1.")
require_part2(["DATASET_NAME", "SCHEMA", "FEATURE_COLUMNS", "ENTITY_ID_COL", "LABEL_COL"])
ds = Dataset.load(DATASET_NAME)if ds is None: raise ValueError("Dataset definition not found. Run the previous page first.")
print(f"Name: {ds.name}")print(f"Entity: {ds.entity}")print(f"Features: {ds.features}")data_df = ds.materialize(SCHEMA)
print(f"Materialized {data_df.count()} records")print(f"Columns: {data_df.columns}")data_df.select(ENTITY_ID_COL, *FEATURE_COLUMNS, LABEL_COL).show(5)Name: trip-decline-risk-v1Entity: driverFeatures: ['ds_decl_rate', 'ds_avg_rating', 'da_trips_today', 'label']Materialized 100 recordsColumns: ['driver_id', 'ds_decl_rate', 'ds_avg_rating', 'da_trips_today', 'label']+----------+------------+-------------+---------------+-----+|driver_id |ds_decl_rate|ds_avg_rating|da_trips_today|label|+----------+------------+-------------+---------------+-----+|driver_001| 0.037| 4.78| 9| 0||driver_002| 0.182| 4.21| 3| 1||driver_003| 0.052| 4.65| 11| 0||driver_004| 0.028| 4.89| 7| 0||driver_005| 0.145| 4.15| 5| 1|+----------+------------+-------------+---------------+-----+Create a feature vector
Before you can train a model with Spark ML, you need to put the input values into one column (usually named features).
Each row’s features value is a single vector, an ordered list of numbers like [decline_rate, avg_rating, trips_today].
The VectorAssembler utility builds that features column for you.
- Run
Cell 8to create feature vectors withVectorAssembler(no output is expected).
Cell 8: Create feature vectors with VectorAssembler
from pyspark.ml.feature import VectorAssembler
# Create the assembler# inputCols: the columns to combine# outputCol: the name of the new vector columnassembler = VectorAssembler( inputCols=FEATURE_COLUMNS, outputCol="features")
# Transform the DataFrame to add the features columnvectorized_df = assembler.transform(data_df)- Run
Cell 9to preview assembled feature vectors.
Cell 9: Preview assembled feature vectors
# View the vectorized data# The 'features' column contains all three predictors as a single vectorvectorized_df.select(ENTITY_ID_COL, 'features', LABEL_COL).show(5, truncate=False)+----------+-------------------+-----+|driver_id |features |label|+----------+-------------------+-----+|driver_001|[0.037,4.78,9.0] |0 ||driver_002|[0.182,4.21,3.0] |1 ||driver_003|[0.052,4.65,11.0] |0 ||driver_004|[0.028,4.89,7.0] |0 ||driver_005|[0.145,4.15,5.0] |1 |+----------+-------------------+-----+Each features vector contains three values in order: decline_rate, avg_rating, trips_today.
Spark MLlib algorithms use this vector for training.
Split into training and test sets
To evaluate your model fairly, split the data into two parts:
- Training set: Data the model learns from
- Test set: Data held out to evaluate performance on unseen examples
A common split is 80% training, 20% test:
- Run
Cell 10to split into train and test sets and print label distribution.
Cell 10: Split into train/test and print label distribution
# Split into training (80%) and test (20%) sets# The seed parameter ensures reproducible splitstrain_df, test_df = vectorized_df.randomSplit([0.8, 0.2], seed=42)
# Print split sizes and class distributionprint(f"Training set: {train_df.count()} records")print(f"Test set: {test_df.count()} records")
print("\nTraining set class distribution:")train_df.groupBy(LABEL_COL).count().orderBy(LABEL_COL).show()
print("Test set class distribution:")test_df.groupBy(LABEL_COL).count().orderBy(LABEL_COL).show()Training set: 82 recordsTest set: 18 records
Training set class distribution:+-----+-----+|label|count|+-----+-----+| 0| 67|| 1| 15|+-----+-----+
Test set class distribution:+-----+-----+|label|count|+-----+-----+| 0| 15|| 1| 3|+-----+-----+Exact split counts can vary slightly by Spark version and partitioning, but the totals should stay near an 80/20 split with a similar class mix.
Both sets maintain approximately the 82/18 class ratio from the original data. If one set had no higher-risk examples (label=1), evaluation would be meaningless.
The seed=42 parameter makes the split reproducible.
Running the same code again produces the same split.
The data is now ready for model training.