Explore the feature store

The feature store has four objects that work together:

  • Feature Groups organize features by data source and computation pipeline.
  • Features define individual computed values within a group.
  • Entities store the actual feature values for real-world instances.
  • Datasets define reproducible training slices for ML models.

On this page, you will read how each object maps to Aerospike sets and records, then run the pre-filled notebook cells listed in each step. The notebook already contains the Part 1 class definitions and usage examples; the code blocks here explain what those cells do.

Feature Groups (1/4)

Feature Groups are the organizing layer for feature engineering. They capture where a related set of features comes from and how that pipeline is managed.

A single Feature Group stores metadata for features derived from the same raw data and produced by the same computation pipeline. Feature Group metadata is stored in the Aerospike set fg-metadata, keyed by name.

| Field | Description | Example |
| --- | --- | --- |
| name | Unique identifier (primary key) | driver-stats |
| description | Human-readable summary | Driver performance metrics from ride completion events |
| source | Upstream dataset or system reference | kafka://events.rides.completed |
| attrs | Free-form metadata (owner, refresh cadence, entity type) | {owner: "ml-platform", refresh: "hourly", entity: "driver"} |
| tags | Tags for search and organization | ["driver", "performance", "core"] |
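
To make the mapping concrete, here is a minimal sketch of how one Feature Group row could translate into an Aerospike key and a set of bins. The `FeatureGroupRecord` dataclass, its `to_key` and `to_bins` helpers, and the `test` namespace are hypothetical names chosen for illustration. The notebook's pre-filled `FeatureGroup` class may be structured differently.

```python
from dataclasses import dataclass, field

NAMESPACE = "test"      # assumption: the namespace is not stated on this page
FG_SET = "fg-metadata"  # set name taken from the text above


@dataclass
class FeatureGroupRecord:
    """Hypothetical stand-in for one Feature Group metadata record."""
    name: str
    description: str
    source: str
    attrs: dict = field(default_factory=dict)
    tags: list = field(default_factory=list)

    def to_key(self) -> tuple:
        # Aerospike keys are (namespace, set, user key) tuples;
        # the group name serves as the user key.
        return (NAMESPACE, FG_SET, self.name)

    def to_bins(self) -> dict:
        # One bin per field: attrs becomes a map bin, tags a list bin.
        return {
            "name": self.name,
            "description": self.description,
            "source": self.source,
            "attrs": dict(self.attrs),
            "tags": list(self.tags),
        }


fg = FeatureGroupRecord(
    "driver-stats",
    "Driver performance metrics from ride completion events",
    "kafka://events.rides.completed",
    {"owner": "ml-platform", "refresh": "hourly", "entity": "driver"},
    ["driver", "performance", "core"],
)
print(fg.to_key())
print(sorted(fg.to_bins()))
```

With a connected Aerospike Python client, a key and bins dictionary of this shape are what a `put` call would take.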

The actual FeatureGroup definition, along with the other classes, is pre-filled in the notebook.

  1. Run Cell P1-07 in your notebook to create, save, load, and query a Feature Group.

This creates driver-stats, computed from a hypothetical ride completion Kafka stream. You’ll load sample data directly later in the tutorial instead of setting up an example Kafka pipeline.

Usage example:

# Save one Feature Group metadata record, then read it back by key and query it.
driver_stats = FeatureGroup(
    "driver-stats",
    "Driver performance metrics from ride completion events",
    "kafka://events.rides.completed",
    {"owner": "ml-platform", "refresh": "hourly", "entity": "driver"},
    ["driver", "performance", "core"]
)
driver_stats.save()

# Load the record back by its primary key (name).
loaded_fg = FeatureGroup.load("driver-stats")
print(loaded_fg, '\n')

# Filter Feature Groups on values inside the attrs map.
print("Feature groups for driver entity with hourly refresh:")
df = FeatureGroup.query("attrs.entity == 'driver' and attrs.refresh == 'hourly'")
df.show()

Expected output:
<class '__main__.FeatureGroup'>: {'name': 'driver-stats', 'description': 'Driver performance metrics from ride completion events', 'source': 'kafka://events.rides.completed', 'attrs': {'owner': 'ml-platform', 'refresh': 'hourly', 'entity': 'driver'}, 'tags': ['driver', 'performance', 'core']}
Feature groups for driver entity with hourly refresh:
+------------+--------------------+--------------------+--------------------+--------------------+
|        name|         description|              source|               attrs|                tags|
+------------+--------------------+--------------------+--------------------+--------------------+
|driver-stats|Driver performanc...|kafka://events.ri...|[owner -> ml-plat...|[driver, performa...|
+------------+--------------------+--------------------+--------------------+--------------------+

You created and saved one feature-group metadata record (driver-stats) into fg-metadata, then loaded it back by primary key (name) to confirm persistence. The table appears because the query filters by attrs.entity == 'driver' and attrs.refresh == 'hourly', and this example has exactly one record that matches.

Features (2/4)

A Feature is a single, named output produced by a pipeline within a Feature Group. Features are metadata records describing what a pipeline computes — they hold definitions, not actual values.

In the ride-hailing app, the driver-stats feature group contains features like decline_rate and avg_rating.

Feature metadata is stored in the Aerospike set feature-metadata. The primary key fid combines the group and feature names for global uniqueness, because the same feature name can appear in different groups.

| Field | Description | Example |
| --- | --- | --- |
| fid | Unique identifier (primary key), auto-generated as <fgname>_<name> | driver-stats_decline_rate |
| fgname | Parent feature group name | driver-stats |
| name | Feature name, unique within its group | decline_rate |
| ftype | Data type (integer, double, string, boolean) | double |
| description | Human-readable meaning and usage notes | Fraction of ride requests declined by driver in the last 30 days |
| attrs | Free-form metadata (baseline stats, data quality indicators) | {baseline_mean: 0.05, baseline_p99: 0.15} |
| tags | Tags for search and organization | ["driver", "decline-risk"] |
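
The composite key can be sketched in a few lines. The `make_fid` helper and the `rider-stats` group are hypothetical and used only for illustration. The notebook's `Feature` class generates `fid` itself, and its internals may differ.

```python
def make_fid(fgname: str, name: str) -> str:
    """Build the feature primary key as <fgname>_<name>."""
    return f"{fgname}_{name}"


# The same feature name in two groups yields two distinct keys.
fid_driver = make_fid("driver-stats", "avg_rating")
fid_rider = make_fid("rider-stats", "avg_rating")  # hypothetical second group

print(fid_driver)               # driver-stats_avg_rating
print(fid_rider)                # rider-stats_avg_rating
print(fid_driver != fid_rider)  # True
```

Because the group name is part of the key, two pipelines can each publish an `avg_rating` feature without overwriting each other's metadata.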

The Feature class definition is pre-filled in the notebook.

  1. Run Cell P1-09 to register and query feature metadata.

Usage example:

# Register two features in the driver-stats Feature Group, then query by tag.
FG_NAME = 'driver-stats'

decline_rate = Feature(
    FG_NAME, "decline_rate", "double",
    "Fraction of ride requests declined by driver in the last 30 days",
    {"baseline_mean": "0.05", "baseline_p99": "0.15"},
    ["driver", "decline-risk"]
)
decline_rate.save()

avg_rating = Feature(
    FG_NAME, "avg_rating", "double",
    "Average rider rating for driver over last 90 days",
    {"baseline_mean": "4.7", "baseline_p99": "4.95"},
    ["driver", "quality"]
)
avg_rating.save()

# Load one feature by its group name and feature name.
loaded_feature = Feature.load("driver-stats", "decline_rate")
print(loaded_feature, '\n')

# Return every feature whose tags list contains 'driver'.
print("Features tagged with 'driver':")
f_df = Feature.query("array_contains(tags, 'driver')")
f_df.show()

Expected output:
<class '__main__.Feature'>: {'fid': 'driver-stats_decline_rate', 'fgname': 'driver-stats', 'name': 'decline_rate', 'ftype': 'double', 'description': 'Fraction of ride requests declined by driver in the last 30 days', 'attrs': {'baseline_mean': '0.05', 'baseline_p99': '0.15'}, 'tags': ['driver', 'decline-risk']}
Features tagged with 'driver':
+--------------------+------------+----------------+------+--------------------+--------------------+--------------------+
|                 fid|      fgname|            name|  type|         description|               attrs|                tags|
+--------------------+------------+----------------+------+--------------------+--------------------+--------------------+
|driver-stats_decl...|driver-stats|    decline_rate|double|Fraction of ride ...|[baseline_mean ->...|[driver, decline-...|
|driver-stats_avg_...|driver-stats|      avg_rating|double|Average rider rat...|[baseline_mean ->...|   [driver, quality]|
+--------------------+------------+----------------+------+--------------------+--------------------+--------------------+

This output shows two things: loading a single feature by key (driver-stats_decline_rate) returns one object, while querying by tag (driver) returns both saved rows because both features were tagged for the driver entity. In the object printout, the field appears as ftype; in the Spark table output, the same value is shown under the type column.
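
The difference between the two access paths can be illustrated without Spark or Aerospike. The sketch below uses a plain dictionary as a stand-in for the feature-metadata set. `load_by_key` and `query_by_tag` are hypothetical helpers, not the notebook's `Feature.load` and `Feature.query`.

```python
# In-memory stand-in for the feature-metadata set, keyed by fid.
feature_metadata = {
    "driver-stats_decline_rate": {
        "name": "decline_rate",
        "tags": ["driver", "decline-risk"],
    },
    "driver-stats_avg_rating": {
        "name": "avg_rating",
        "tags": ["driver", "quality"],
    },
}


def load_by_key(fid):
    """Direct lookup: at most one record for a given primary key."""
    return feature_metadata.get(fid)


def query_by_tag(tag):
    """Filter: every record whose tags list contains the tag."""
    return [rec for rec in feature_metadata.values() if tag in rec["tags"]]


print(load_by_key("driver-stats_decline_rate")["name"])
print([rec["name"] for rec in query_by_tag("driver")])
print([rec["name"] for rec in query_by_tag("quality")])
```

The tag filter plays the role that `array_contains(tags, 'driver')` plays in the Spark query: it tests list membership per record, so the number of rows returned depends on how many records carry the tag.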

Entities (3/4)

Feature Groups and Features are metadata. They describe where data comes from and what values get computed. Entity records store the actual computed values.

An Entity is a record that holds a real-world instance’s ID along with all its computed feature values. The key design choice is co-location: features from multiple feature groups for the same entity get stored together in a single Aerospike record. If driver-stats computes decline_rate and avg_rating, while driver-activity computes trips_today, all three values end up in the same driver_123 record. Serving can fetch a complete feature vector with one database read.

Each entity type gets its own Aerospike set: driver-features, rider-features, trip-features, and so on.

| Field | Description | Example |
| --- | --- | --- |
| id_col | Entity instance ID (primary key) | driver_123 |
| feature values | One bin per feature, with a short prefix from the feature group name | ds_decl_rate: 0.03, ds_avg_rating: 4.82, da_trips_today: 7 |
| timestamp | Last update timestamp | 2024-01-15T14:30:00Z |
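
Co-location can be sketched as merging the bins produced by each feature group into a single record per entity. The `build_entity_record` helper and the `test` namespace below are assumptions for illustration. The notebook's `Entity` class works with Spark schemas and may assemble records differently.

```python
def build_entity_record(id_col, entity_id, group_bins, timestamp):
    """Merge bins from several feature groups into one record for one entity."""
    record = {id_col: entity_id, "timestamp": timestamp}
    for bins in group_bins:
        # Group prefixes (ds_, da_) should keep bin names unique; fail loudly if not.
        clash = record.keys() & bins.keys()
        if clash:
            raise ValueError(f"bin name collision: {sorted(clash)}")
        record.update(bins)
    return record


driver_stats_bins = {"ds_decl_rate": 0.03, "ds_avg_rating": 4.82}  # from driver-stats
driver_activity_bins = {"da_trips_today": 7}                       # from driver-activity

record = build_entity_record(
    "driver_id",
    "driver_123",
    [driver_stats_bins, driver_activity_bins],
    "2024-01-15T14:30:00Z",
)

# One key per entity instance: (namespace, set, entity ID).
key = ("test", "driver-features", record["driver_id"])
print(key)
print(sorted(record))
```

Since every group writes into the same record, one key lookup is enough to retrieve all three feature values for `driver_123`.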

Unlike the other classes, Entity has no fixed schema: its fields depend on which features are computed for that entity type. The class also includes get_feature_vector() for low-latency single-record lookups with the Aerospike Python client. Part 3 takes a deeper look at using and benchmarking it in a serving path.
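
As a rough illustration of a single-record lookup, the sketch below reads selected bins for one entity and returns them as an ordered feature vector. The function signature is an assumption, not the notebook's `get_feature_vector()` definition. `InMemoryClient` is a hypothetical stand-in that mimics the `select(key, bins)` call of the Aerospike Python client, which returns a `(key, meta, bins)` tuple, so the example runs without a database.

```python
class InMemoryClient:
    """Hypothetical stand-in with the same select(key, bins) shape as the real client."""

    def __init__(self, records):
        self._records = records

    def select(self, key, bins):
        stored = self._records[key]
        # Return only the requested bins, like a projected single-record read.
        return key, {"gen": 1, "ttl": 0}, {b: stored[b] for b in bins if b in stored}


def get_feature_vector(client, namespace, set_name, entity_id, feature_bins):
    """Fetch the requested bins for one entity with a single read."""
    key = (namespace, set_name, entity_id)
    _, _, bins = client.select(key, feature_bins)
    # Preserve the caller's feature order; a missing bin becomes None.
    return [bins.get(name) for name in feature_bins]


client = InMemoryClient({
    ("test", "driver-features", "driver_123"): {
        "driver_id": "driver_123",
        "ds_decl_rate": 0.03,
        "ds_avg_rating": 4.82,
        "da_trips_today": 7,
    }
})

vector = get_feature_vector(
    client, "test", "driver-features", "driver_123",
    ["ds_decl_rate", "ds_avg_rating", "da_trips_today"],
)
print(vector)  # [0.03, 4.82, 7]
```

Returning values in the order of `feature_bins` keeps the vector aligned with the column order a model expects, whatever order the bins are stored in.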

The Entity class definition is pre-filled in the notebook.

  1. Run Cell P1-11 to write and read a concrete driver record.

Usage example:

# Save one driver entity with three feature values, then query drivers by rating.
from pyspark.sql.types import DoubleType, LongType, StringType

# Each tuple is (bin name, Spark type, value).
features = [
    ('ds_decl_rate', DoubleType(), 0.03),
    ('ds_avg_rating', DoubleType(), 4.82),
    ('da_trips_today', LongType(), 7)
]
record = [('driver_id', StringType(), 'driver_123')] + features

driver = Entity('driver', record, 'driver_id')
schema = Entity.get_schema(record)
driver.save(schema)

# Load the record back by entity type and ID.
loaded_driver = Entity.load('driver', 'driver_123', schema, 'driver_id')
print(loaded_driver, '\n')

print("Drivers with rating above 4.5:")
instances = Entity.query('driver', 'ds_avg_rating > 4.5', schema, 'driver_id')
instances.show()

Expected output:
<class '__main__.Entity'>: {'etype': 'driver', 'record': [('driver_id', 'string', 'driver_123'), ('ds_decl_rate', 'double', 0.03), ('ds_avg_rating', 'double', 4.82), ('da_trips_today', 'long', 7)], 'id_col': 'driver_id'}
Drivers with rating above 4.5:
+----------+--------------+-------------+--------------+
| driver_id|  ds_decl_rate|ds_avg_rating|da_trips_today|
+----------+--------------+-------------+--------------+
|driver_123|          0.03|         4.82|             7|
+----------+--------------+-------------+--------------+

Here, the printed object confirms the full driver_123 record was reconstructed from Aerospike, including all feature bins. The query output then shows the same record in tabular form because it satisfies ds_avg_rating > 4.5.

Datasets (4/4)

A Dataset is a saved definition of a training slice: which entity type, which features, and which entity instances to include. The Dataset record is metadata only. The materialized training data is written to external storage, for example as Parquet files, when you call materialize.

In the ride-hailing app, you might define a dataset for trip decline risk modeling that selects driver entities with recent activity, including features like decline_rate, avg_rating, and trips_today.

Dataset metadata is stored in the Aerospike set dataset-metadata, keyed by name.

| Field | Description | Example |
| --- | --- | --- |
| name | Unique identifier (primary key) | trip-decline-risk-training-jan2024 |
| description | Human-readable summary | Training set for trip decline risk prediction model |
| entity | Entity type | driver |
| id_col | Entity ID column | driver_id |
| id_type | ID column type | string |
| features | List of entity feature bin names to include | ["ds_decl_rate", "ds_avg_rating", "da_trips_today"] |
| query | Predicate to filter entity instances | da_trips_today > 0 |
| location | External path for the materialized dataset | s3://ml-datasets/trip-decline-risk/jan2024/ |
| attrs | Free-form metadata (for example, model version) | {model_version: "v1"} |
| tags | Tags for search and organization | ["decline-risk", "driver", "training"] |
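
The way these fields combine can be sketched in plain Python: `query` decides which entity instances are kept, while `id_col` and `features` decide which columns each kept row carries. The `select_training_rows` helper and the `driver_456` sample record are hypothetical. The notebook's materialize step runs on Spark and writes to `location`, which this sketch does not attempt.

```python
def select_training_rows(records, id_col, features, predicate):
    """Keep records that pass the predicate, projected to id_col plus the feature bins."""
    columns = [id_col] + list(features)
    return [
        {col: rec.get(col) for col in columns}
        for rec in records
        if predicate(rec)
    ]


# Made-up entity records; driver_456 exists only to show the filter at work.
entity_records = [
    {"driver_id": "driver_123", "ds_decl_rate": 0.03, "ds_avg_rating": 4.82,
     "da_trips_today": 7, "timestamp": "2024-01-15T14:30:00Z"},
    {"driver_id": "driver_456", "ds_decl_rate": 0.10, "ds_avg_rating": 4.10,
     "da_trips_today": 0, "timestamp": "2024-01-15T14:30:00Z"},
]

rows = select_training_rows(
    entity_records,
    id_col="driver_id",
    features=["ds_decl_rate", "ds_avg_rating", "da_trips_today"],
    predicate=lambda rec: rec["da_trips_today"] > 0,  # mirrors "da_trips_today > 0"
)
print(rows)
```

Only `driver_123` passes the filter, and the `timestamp` bin is dropped because it is not in the feature list. Storing the definition rather than the rows is what lets the same slice be rebuilt later.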

The Dataset class definition is pre-filled in the notebook.

  1. Run Cell P1-13 to save and inspect a dataset definition.

Usage example:

# Save a reusable training dataset definition, then query datasets by tag.
decline_risk_dataset = Dataset(
    name="trip-decline-risk-training-jan2024",
    description="Training set for trip decline risk prediction model",
    entity="driver",
    id_col="driver_id",
    id_type="string",
    features=[
        "ds_decl_rate",
        "ds_avg_rating",
        "da_trips_today"
    ],
    query="da_trips_today > 0",
    location="s3://ml-datasets/trip-decline-risk/jan2024/",
    attrs={"model_version": "v1"},
    tags=["decline-risk", "driver", "training"]
)
decline_risk_dataset.save()

# Load the definition back by its primary key (name).
loaded_ds = Dataset.load("trip-decline-risk-training-jan2024")
print(loaded_ds, '\n')

print("Datasets tagged with 'driver':")
ds_df = Dataset.query_datasets("array_contains(tags, 'driver')")
ds_df.show()

Expected output:
<class '__main__.Dataset'>: {'name': 'trip-decline-risk-training-jan2024', 'description': 'Training set for trip decline risk prediction model', 'entity': 'driver', 'id_col': 'driver_id', 'id_type': 'string', 'features': ['ds_decl_rate', 'ds_avg_rating', 'da_trips_today'], 'query': 'da_trips_today > 0', 'location': 's3://ml-datasets/trip-decline-risk/jan2024/', 'attrs': {'model_version': 'v1'}, 'tags': ['decline-risk', 'driver', 'training']}
Datasets tagged with 'driver':
+--------------------+--------------------+------+---------+-------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                name|         description|entity|   id_col|id_type|            features|               query|            location|               attrs|                tags|
+--------------------+--------------------+------+---------+-------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  trip-decline-ri...|Training set for ...|driver|driver_id| string|  [ds_decl_rate, ...|  da_trips_today > 0|s3://ml-datasets/...|[model_version ->...| [decline-risk, d...|
+--------------------+--------------------+------+---------+-------+--------------------+--------------------+--------------------+--------------------+--------------------+

The loaded object confirms the dataset metadata was saved under trip-decline-risk-training-jan2024, including selected features and filter predicate. The query table appears because this dataset includes the driver tag, so it matches array_contains(tags, 'driver').

Continue to What’s next to close out Part 1 and then move into Part 2: Model Training.
