Part 3: Model Serving
Objectives
By the end of this tutorial, you will be able to:
- Retrieve features in real time using the Aerospike Python client.
- Build an end-to-end prediction function that serves a trained model.
- Understand how the feature store pipeline scales to production workloads.
- Measure sub-millisecond feature retrieval latency at scale.
From training to serving
In Part 1, you set up a feature store. In Part 2, you trained a model that predicts trip decline risk for drivers. Now you serve that model in real time so the platform can use its predictions.
This part shows how the feature store supports low-latency inference.
Training happens offline, where you can afford to wait seconds or minutes for batch reads. Serving happens live, so every millisecond spent retrieving features is added to the user's response time. That is why a low-latency database matters so much on the serving path.
In many real systems, model math is not the slowest step. Feature retrieval is often the bottleneck. At scale, your feature store may hold billions of records and bins, so serving paths should avoid scan-style queries and use direct key-based reads.
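To make the idea of a direct key-based read concrete, here is a minimal sketch. It assumes the Aerospike Python client's `get()` call, which takes a `(namespace, set, user_key)` tuple and returns a `(key, meta, bins)` tuple. The namespace `test`, the set `driver_features`, and the bin names are hypothetical placeholders, and `InMemoryClient` is only a stand-in so the example runs without a server. With a live cluster you would pass in a connected Aerospike client instead.

```python
from typing import Any, Dict, Tuple

# Hypothetical names for illustration; use whatever Part 1 created.
NAMESPACE = "test"
SET_NAME = "driver_features"

Key = Tuple[str, str, str]


class InMemoryClient:
    """Stand-in that mimics the (key, meta, bins) shape of client.get()."""

    def __init__(self, records: Dict[Key, Dict[str, Any]]) -> None:
        self._records = records

    def get(self, key: Key):
        # A dict lookup by full key, analogous to a primary-key read.
        bins = self._records[key]
        meta = {"gen": 1, "ttl": 0}
        return key, meta, bins


def read_driver_bins(client, driver_id: str) -> Dict[str, Any]:
    """Fetch one driver's bins with a single key-based read (no scan)."""
    key = (NAMESPACE, SET_NAME, driver_id)
    _, _, bins = client.get(key)
    return bins


client = InMemoryClient(
    {(NAMESPACE, SET_NAME, "driver_42"): {"decline_rate_7d": 0.18, "trips_7d": 57}}
)
bins = read_driver_bins(client, "driver_42")
print(bins)
```

The point of the sketch is the access pattern: the serving path knows the driver ID up front, so it builds the full key and asks for exactly one record rather than filtering through the whole set.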
Smart dispatch
The ride-hailing platform’s dispatch system uses decline risk predictions to make better driver assignments. When a ride request comes in:
- Identify nearby available drivers. This might be anywhere from 10 to 50 candidates in a busy area at a busy time.
- Score each candidate on predicted decline risk using their latest features.
- Assign the request to the driver most likely to accept the trip.
This decision must fit within the dispatch latency budget, which is a few hundred milliseconds at most. If it takes longer, riders wait longer and may turn to other transport options, costing the platform potential sales.
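The dispatch steps above can be sketched as a small selection loop. This is an illustration under stated assumptions, not the platform's dispatch code: `pick_driver` is a hypothetical helper, the scoring function is passed in as a parameter, and the risk values are made up.

```python
from typing import Callable, Iterable, Optional


def pick_driver(
    candidate_ids: Iterable[str],
    predict_decline_risk: Callable[[str], float],
) -> Optional[str]:
    """Return the candidate with the lowest predicted decline risk."""
    best_id = None
    best_risk = float("inf")
    for driver_id in candidate_ids:
        risk = predict_decline_risk(driver_id)
        if risk < best_risk:
            best_id, best_risk = driver_id, risk
    return best_id


# Made-up risk scores standing in for real model output.
fake_risk = {"driver_a": 0.40, "driver_b": 0.12, "driver_c": 0.27}
chosen = pick_driver(fake_risk, fake_risk.__getitem__)
print(chosen)  # driver_b
```

Each candidate needs its own feature read before it can be scored, so one ride request can trigger 10 to 50 reads. That is why per-read latency counts so heavily against the dispatch budget.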
What you’ll build
By the end of Part 3, you’ll have:
- A way to read from Aerospike with the Python client for real-time single-record lookups
- A practical `get_feature_vector()` serving path for fetching model-ready features by ID
- A `predict_decline_risk()` function that retrieves features and runs the model end-to-end
- Evidence that this pipeline scales: sub-millisecond feature retrieval even with larger, more complex datasets
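As a preview of how these pieces fit together, here is a rough sketch. The signatures are assumptions and may differ from the ones this tutorial builds later. `read_bins` is a hypothetical callable that returns a driver's bins as a dict, `FEATURE_ORDER` and its bin names are invented, and the model is assumed to expose a scikit-learn style `predict_proba()` in which column 1 is the decline class. `StubModel` is a toy stand-in so the example runs on its own.

```python
from typing import Any, Callable, Dict, List

# Hypothetical feature order; it must match the order used in training.
FEATURE_ORDER = ["decline_rate_7d", "trips_7d", "avg_rating"]


def get_feature_vector(
    read_bins: Callable[[str], Dict[str, Any]], driver_id: str
) -> List[float]:
    """Read one driver's bins and arrange them in training order."""
    bins = read_bins(driver_id)
    return [float(bins[name]) for name in FEATURE_ORDER]


def predict_decline_risk(model, read_bins, driver_id: str) -> float:
    """Go from driver ID to the predicted probability of a decline."""
    vector = get_feature_vector(read_bins, driver_id)
    # Assumed layout: one row per input, [P(accept), P(decline)].
    return float(model.predict_proba([vector])[0][1])


class StubModel:
    """Toy stand-in for the trained model: risk equals the first feature."""

    def predict_proba(self, rows):
        return [[1.0 - row[0], row[0]] for row in rows]


store = {"driver_42": {"decline_rate_7d": 0.25, "trips_7d": 57, "avg_rating": 4.8}}
risk = predict_decline_risk(StubModel(), store.__getitem__, "driver_42")
print(risk)  # 0.25
```

Keeping the feature order in one place is a deliberate choice in this sketch: the model only sees positions, so a vector assembled in a different order than the training data would produce wrong scores without raising an error.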
Serving workflow
- Connect: Set up the Aerospike Python client for low-latency reads.
- Retrieve: Use `get_feature_vector()` to fetch a driver’s features by ID.
- Predict: Load the trained model and build a function that goes from driver ID to a predicted decline risk based on the features attached to the ID.
- Scale: Expand the dataset and measure retrieval performance at larger scale.
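For the Scale step, one simple way to measure retrieval performance is to time each read and summarize the samples. The helper below is a hypothetical sketch, not the tutorial's benchmark. It accepts any `read` callable, and the example times lookups in a plain dict, so the numbers it prints say nothing about Aerospike itself.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(read: Callable[[str], object], ids: List[str]) -> Dict[str, float]:
    """Time each read individually and summarize in milliseconds."""
    samples = []
    for driver_id in ids:
        start = time.perf_counter()
        read(driver_id)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Simple p99: the sample at the 99% position of the sorted list.
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return {
        "count": float(len(samples)),
        "mean_ms": statistics.fmean(samples),
        "p50_ms": statistics.median(samples),
        "p99_ms": samples[p99_index],
    }


# Placeholder store: a dict lookup stands in for a real feature read.
store = {f"driver_{i}": {"trips_7d": i} for i in range(1000)}
stats = measure_latency_ms(store.__getitem__, list(store))
print(stats)
```

Reporting a tail percentile alongside the mean is useful here because a dispatch request waits on many reads, and a few slow ones can dominate the total.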
In the next section, you’ll prepare your environment and install one new dependency for your Part 3 code.