---
title: "Part 2: Model Training"
description: "Train a Spark ML model using features stored in Aerospike."
---

# Part 2: Model Training


::: note
This tutorial is for data engineers and ML practitioners familiar with Python and basic Spark concepts.
:::
::: note
This tutorial should take between 10 and 15 minutes.
:::

## Objectives

By the end of this tutorial, you will be able to:

-   Explore features and datasets stored in the Aerospike feature store.
-   Define and materialize training datasets using the Dataset class.
-   Train and evaluate machine learning models with Spark MLlib.
-   Save and load trained models for production use.

## From features to models

In Part 1, you set up a feature store with four objects: Feature Groups, Features, Entities, and Datasets. You explored driver features for a ride-hailing app and stored them in Aerospike.

Now you’ll create sample data and use those features to train a machine learning model. This is the core value of a feature store: centralized, documented features that anyone can discover and use for training. Part 1 used small illustrative records (for example, `driver_123`) and example dataset metadata. Part 2 intentionally switches to a dedicated training slice: synthetic driver IDs (`driver_001` to `driver_100`), a dataset named `trip-decline-risk-v1`, and a local dataset output path (`./datasets/trip-decline-risk-v1/`).
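
As a rough illustration of that training slice, the snippet below builds the 100 zero-padded driver IDs and the local output path. The variable names (`DATASET_NAME`, `DATASET_DIR`, `driver_ids`) are assumptions made for this sketch and may not match the names used in the notebook.

```python
from pathlib import Path

# Names taken from the text above; the variable names themselves are hypothetical.
DATASET_NAME = "trip-decline-risk-v1"
DATASET_DIR = Path("./datasets") / DATASET_NAME

# Synthetic driver IDs: driver_001 through driver_100, zero-padded to three digits.
driver_ids = [f"driver_{i:03d}" for i in range(1, 101)]

print(driver_ids[0], driver_ids[-1], len(driver_ids))
print(DATASET_DIR)
```

Zero-padding keeps the IDs sortable as strings, which makes small samples easier to scan during the discovery pass.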

## Ride-hailing scenario

You’re building a **trip decline risk prediction model** for the ride-hailing platform. The platform’s smart dispatch system needs to predict which nearby drivers are most likely to decline a ride request, so it can assign each request to the driver most likely to accept.

| Feature | Why it might matter for decline risk |
| --- | --- |
| `ds_decl_rate` | Historical decline behavior could be a strong predictor of future declines |
| `ds_avg_rating` | Low-rated drivers may be more likely to decline |
| `da_trips_today` | Driver fatigue from many trips might influence decline risk later in a shift |

You’ll soon discover which features matter most through data analysis and model testing. You may have a good guess already. In production, you might be dealing with dozens or hundreds of features, which is where model-based evaluation like this becomes necessary.

## Training workflow

Model training with a feature store can be a short, repeatable loop. In this tutorial, you’ll do one quick discovery pass, then move directly into training steps.

1.  Discover: Run a quick metadata and data pass to identify the feature columns and label mix.
    
2.  Define: Create a Dataset that specifies which features and entities to include in training. Save this definition in Aerospike for reproducibility.
    
3.  Materialize: Transform the Dataset definition into actual training data and create train/test splits.
    
4.  Train: Fit a baseline classification model with Spark MLlib.
    
5.  Save: Persist the trained model artifact for Part 3 (Model Serving).
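
The Discover and Materialize steps above can be sketched in plain Python, without Spark or Aerospike, to show the shape of the loop. Everything here is a stand-in: the helper names (`make_rows`, `split_rows`), the value ranges, the 80/20 split, and especially the label rule are invented for illustration. In Spark, a comparable seeded split is typically done with `DataFrame.randomSplit`, though the notebook may do it differently.

```python
import random
from collections import Counter

def make_rows(n=100, seed=42):
    """Build hypothetical synthetic rows. The label rule is invented for this sketch."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        decl_rate = rng.random()
        rows.append({
            "driver_id": f"driver_{i:03d}",
            "ds_decl_rate": round(decl_rate, 3),
            "ds_avg_rating": round(rng.uniform(3.0, 5.0), 2),
            "da_trips_today": rng.randint(0, 20),
            # Assumption: drivers with a high historical decline rate are labeled 1.
            "label": 1 if decl_rate > 0.5 else 0,
        })
    return rows

def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy with a fixed seed, then cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Discover: check how the labels are distributed before training.
rows = make_rows()
label_mix = Counter(row["label"] for row in rows)

# Materialize: create a reproducible train/test split.
train_rows, test_rows = split_rows(rows)

print("label mix:", dict(label_mix))
print("train/test sizes:", len(train_rows), len(test_rows))
```

Fixing the seed means the same rows land in the same split on every run, which is what makes results comparable across retraining passes.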
    

In the next section, you’ll continue the notebook at **Cell 1 under Part 2: Model Training** and run the Part 2 bootstrap cells.

::: note
-   I understand how Part 2 builds on Part 1’s feature store.
-   I know the fast training workflow from feature discovery to a saved model.
:::

[Next: Prerequisites and setup](https://aerospike.com/docs/develop/model-training/step/0/part/1/prerequisites)