Part 2: Model Training

Objectives

By the end of this tutorial, you will be able to:

Explore features and datasets stored in the Aerospike feature store.
Define and materialize training datasets using the Dataset class.
Train and evaluate machine learning models with Spark MLlib.
Save and load trained models for production use.

From features to models

In Part 1, you set up a feature store with four objects: Feature Groups, Features, Entities, and Datasets. You explored driver features for a ride-hailing app and stored them in Aerospike.

Now you’ll create sample data and use those features to train a machine learning model. This is the core value of a feature store: centralized, documented features that anyone can discover and use for training.

Part 1 used small illustrative records (for example, driver_123) and example dataset metadata. Part 2 intentionally switches to a dedicated training slice: synthetic driver IDs (driver_001 to driver_100), a dataset named trip-decline-risk-v1, and a local dataset output path (./datasets/trip-decline-risk-v1/).

Ride-hailing scenario

You’re building a trip decline risk prediction model for the ride-hailing platform. The platform’s smart dispatch system needs to predict which nearby drivers are most likely to decline a ride request, so it can assign each request to the driver most likely to accept.

| Feature | Why it might matter for decline risk |
| --- | --- |
| `ds_decl_rate` | Historical decline behavior could be a strong predictor of future declines |
| `ds_avg_rating` | Low-rated drivers may be more likely to decline |
| `da_trips_today` | Driver fatigue from many trips might influence decline risk later in a shift |

You’ll soon discover which features matter most through data analysis and model testing. You may have a good guess already. With three features, intuition goes a long way. In production, you might be dealing with dozens or hundreds of features, and at that scale you need data analysis and model evaluation, rather than guesswork, to tell which ones carry signal.
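To make the scenario concrete, here is a minimal sketch of what a synthetic training slice for these three features could look like, using only the Python standard library. The value ranges, the label column name `declined`, and the rule that generates the label are all assumptions made for illustration. They are not the tutorial's actual data generator.

```python
import random

FEATURES = ["ds_decl_rate", "ds_avg_rating", "da_trips_today"]
LABEL = "declined"  # hypothetical label column name, not taken from the tutorial


def make_synthetic_drivers(n=100, seed=42):
    """Build n synthetic driver rows with IDs driver_001 .. driver_NNN."""
    rng = random.Random(seed)  # local RNG so the rows are reproducible
    rows = []
    for i in range(1, n + 1):
        decl_rate = round(rng.uniform(0.0, 0.6), 3)    # assumed range
        avg_rating = round(rng.uniform(3.0, 5.0), 2)   # assumed range
        trips_today = rng.randint(0, 20)               # assumed range
        # Assumed label rule (illustration only): the chance of a decline
        # grows with the historical decline rate and with trips today.
        p_decline = min(1.0, decl_rate + 0.01 * trips_today)
        rows.append({
            "driver_id": f"driver_{i:03d}",
            "ds_decl_rate": decl_rate,
            "ds_avg_rating": avg_rating,
            "da_trips_today": trips_today,
            LABEL: 1 if rng.random() < p_decline else 0,
        })
    return rows


def label_mix(rows):
    """Count how many rows fall into each label class."""
    positives = sum(r[LABEL] for r in rows)
    return {"declined": positives, "accepted": len(rows) - positives}


def mean_by_label(rows, feature):
    """Average one feature per label class (None if a class is empty)."""
    means = {}
    for value in (0, 1):
        vals = [r[feature] for r in rows if r[LABEL] == value]
        means[value] = sum(vals) / len(vals) if vals else None
    return means


if __name__ == "__main__":
    rows = make_synthetic_drivers()
    print(label_mix(rows))
    for feature in FEATURES:
        print(feature, mean_by_label(rows, feature))
```

Comparing the per-class means is a rough first check on which features separate declines from accepts. It does not replace the model evaluation done later in the tutorial.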

Training workflow

Model training with a feature store can be a short, repeatable loop. In this tutorial, you’ll do one quick discovery pass, then move directly into training steps.

Discover: Run a quick metadata and data pass to identify the feature columns and label mix.
Define: Create a Dataset that specifies which features and entities to include in training. Save this definition in Aerospike for reproducibility.
Materialize: Transform the Dataset definition into actual training data and create train/test splits.
Train: Fit a baseline classification model with Spark MLlib.
Save: Persist the trained model artifact for Part 3 (Model Serving).
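The Define and Materialize steps can be sketched in plain Python. This is a stand-in under stated assumptions, not the tutorial's Dataset class or its Spark code. `DatasetDefinition`, `split_bucket`, and `materialize` are hypothetical names, and the hash-based split is one common way to get a reproducible train/test split that the tutorial may or may not use.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetDefinition:
    """Hypothetical stand-in for a saved dataset definition."""
    name: str
    entity: str       # entity key column, e.g. "driver_id"
    features: tuple   # feature column names to include
    label: str        # label column name (assumed)


def split_bucket(entity_id, test_fraction=0.2):
    """Assign an entity to 'train' or 'test' from a hash of its ID.

    The same ID always lands in the same split, so re-materializing
    the dataset reproduces the split without storing a random seed.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def materialize(definition, records, test_fraction=0.2):
    """Project records onto the defined columns and split them."""
    columns = (definition.entity, *definition.features, definition.label)
    train, test = [], []
    for record in records:
        row = {col: record[col] for col in columns}  # drop undeclared columns
        target = test if split_bucket(record[definition.entity], test_fraction) == "test" else train
        target.append(row)
    return train, test


if __name__ == "__main__":
    definition = DatasetDefinition(
        name="trip-decline-risk-v1",
        entity="driver_id",
        features=("ds_decl_rate", "ds_avg_rating", "da_trips_today"),
        label="declined",  # assumed label column name
    )
    records = [
        {"driver_id": f"driver_{i:03d}", "ds_decl_rate": 0.1, "ds_avg_rating": 4.5,
         "da_trips_today": i % 12, "declined": i % 2}
        for i in range(1, 101)
    ]
    train, test = materialize(definition, records)
    print(len(train), len(test))
```

Keeping the definition separate from the materialized rows mirrors the workflow above: the definition is small and can be stored for reproducibility, while the rows can be regenerated from it at any time.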

In the next section, you’ll continue the notebook at Cell 1 under Part 2: Model Training and run the Part 2 bootstrap cells.