Explore the feature store

The feature store has four objects that work together:

Feature Groups organize features by data source and computation pipeline.
Features define individual computed values within a group.
Entities store the actual feature values for real-world instances.
Datasets define reproducible training slices for ML models.

On this page, you will read how each object maps to Aerospike sets and records, then run the pre-filled notebook cells listed in each step. The notebook already contains the Part 1 class definitions and usage examples; the code blocks here explain what those cells do.

Feature Groups (1/4)

Feature Groups are the organizing layer for feature engineering. They capture where a related set of features comes from and how that pipeline is managed.

A single Feature Group stores metadata for features derived from the same raw data and produced by the same computation pipeline. Feature Group metadata is stored in the Aerospike set fg-metadata, keyed by name.

Field	Description	Example
name	Unique identifier (primary key)	`driver-stats`
description	Human-readable summary	Driver performance metrics from ride completion events
source	Upstream dataset or system reference	`kafka://events.rides.completed`
attrs	Free-form metadata (owner, refresh cadence, entity type)	`{owner: "ml-platform", refresh: "hourly", entity: "driver"}`
tags	Tags for search and organization	`["driver", "performance", "core"]`

The actual FeatureGroup definition, along with the other classes, is pre-filled in the notebook.
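To make the set-and-key mapping concrete, here is a minimal in-memory sketch. It is not the notebook's FeatureGroup class: `FeatureGroupSketch` and the `FG_METADATA_SET` dictionary are hypothetical stand-ins, and a plain dict plays the role of the fg-metadata set so the example runs without a cluster.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical stand-in for the Aerospike set "fg-metadata":
# a dict keyed by each record's primary key.
FG_METADATA_SET = {}


@dataclass
class FeatureGroupSketch:
    """Simplified illustration only; the real class lives in the notebook."""
    name: str
    description: str
    source: str
    attrs: dict = field(default_factory=dict)
    tags: list = field(default_factory=list)

    def save(self):
        # One record per group: the key is `name`, each field becomes one bin.
        FG_METADATA_SET[self.name] = asdict(self)

    @classmethod
    def load(cls, name):
        # Primary-key read: fetch the record by name and rebuild the object.
        return cls(**FG_METADATA_SET[name])


fg = FeatureGroupSketch(
    "driver-stats",
    "Driver performance metrics from ride completion events",
    "kafka://events.rides.completed",
    {"owner": "ml-platform", "refresh": "hourly", "entity": "driver"},
    ["driver", "performance", "core"],
)
fg.save()
print(FeatureGroupSketch.load("driver-stats") == fg)
```

The round trip shows the idea behind keying by name: saving overwrites the one record for that group, and loading needs nothing but the name.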

Run Cell P1-07 in your notebook to create, save, load, and query a Feature Group.

This creates driver-stats, a Feature Group whose features are computed from a hypothetical Kafka stream of ride completion events. Later in the tutorial, you load sample data directly instead of setting up an example Kafka pipeline.

Usage example:

# Save one Feature Group metadata record, then read it back by key and query it.
driver_stats = FeatureGroup(
    "driver-stats",
    "Driver performance metrics from ride completion events",
    "kafka://events.rides.completed",
    {"owner": "ml-platform", "refresh": "hourly", "entity": "driver"},
    ["driver", "performance", "core"]
)
driver_stats.save()

loaded_fg = FeatureGroup.load("driver-stats")
print(loaded_fg, '\n')

print("Feature groups for driver entity with hourly refresh:")
df = FeatureGroup.query("attrs.entity == 'driver' and attrs.refresh == 'hourly'")
df.show()

Expected output

<class '__main__.FeatureGroup'>: {'name': 'driver-stats', 'description': 'Driver performance metrics from ride completion events', 'source': 'kafka://events.rides.completed', 'attrs': {'owner': 'ml-platform', 'refresh': 'hourly', 'entity': 'driver'}, 'tags': ['driver', 'performance', 'core']}

Feature groups for driver entity with hourly refresh:
+------------+--------------------+--------------------+--------------------+--------------------+
|        name|         description|              source|               attrs|                tags|
+------------+--------------------+--------------------+--------------------+--------------------+
|driver-stats|Driver performanc...|kafka://events.ri...|[owner -> ml-plat...|[driver, performa...|
+------------+--------------------+--------------------+--------------------+--------------------+

You created and saved one feature-group metadata record (driver-stats) into fg-metadata, then loaded it back by primary key (name) to confirm persistence. The table appears because the query filters by attrs.entity == 'driver' and attrs.refresh == 'hourly', and this example has exactly one record that matches.
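The filter logic can be illustrated in plain Python. This is a sketch over in-memory dictionaries, not the Spark query the notebook runs. The rider-stats record and the refresh values for the two extra records are hypothetical, added only so that the predicate has something to exclude.

```python
# Hypothetical in-memory records standing in for rows read from fg-metadata.
records = [
    {"name": "driver-stats",
     "attrs": {"owner": "ml-platform", "refresh": "hourly", "entity": "driver"}},
    {"name": "rider-stats",  # hypothetical group
     "attrs": {"owner": "ml-platform", "refresh": "daily", "entity": "rider"}},
    {"name": "driver-activity",  # refresh cadence is assumed for illustration
     "attrs": {"owner": "ml-platform", "refresh": "daily", "entity": "driver"}},
]


def matches(record):
    # Same condition as: attrs.entity == 'driver' and attrs.refresh == 'hourly'
    attrs = record["attrs"]
    return attrs.get("entity") == "driver" and attrs.get("refresh") == "hourly"


matching = [r["name"] for r in records if matches(r)]
print(matching)
```

Both conditions must hold, so a driver group with a different refresh cadence is filtered out along with the rider group.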

Features (2/4)

A Feature is a single, named output produced by a pipeline within a Feature Group. Features are metadata records describing what a pipeline computes — they hold definitions, not actual values.

In the ride-hailing app, the driver-stats feature group contains features like decline_rate and avg_rating.

Feature metadata is stored in the Aerospike set feature-metadata. The primary key fid combines the group and feature names for global uniqueness, because the same feature name can appear in different groups.

Field	Description	Example
fid	Unique identifier (primary key), auto-generated as `<fgname>_<name>`	`driver-stats_decline_rate`
fgname	Parent feature group name	`driver-stats`
name	Feature name, unique within its group	`decline_rate`
ftype	Data type (integer, double, string, boolean)	`double`
description	Human-readable meaning and usage notes	Fraction of ride requests declined by driver in the last 30 days
attrs	Free-form metadata (baseline stats, data quality indicators)	`{baseline_mean: 0.05, baseline_p99: 0.15}`
tags	Tags for search and organization	`["driver", "decline-risk"]`

The Feature class definition is pre-filled in the notebook.
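The composite key described above can be sketched in a few lines. `make_fid` is a hypothetical helper name, not necessarily what the notebook's Feature class uses internally, and rider-stats is a hypothetical second group.

```python
def make_fid(fgname: str, name: str) -> str:
    """Build the primary key in the documented <fgname>_<name> form."""
    return f"{fgname}_{name}"


print(make_fid("driver-stats", "decline_rate"))

# The same feature name in two groups yields two distinct keys.
fids = {make_fid(group, "avg_rating") for group in ("driver-stats", "rider-stats")}
print(sorted(fids))
```

Because the group name is part of the key, two pipelines can each publish an avg_rating feature without overwriting each other's metadata record.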

Run Cell P1-09 to register and query feature metadata.

Usage example:

# Register two features in the driver-stats Feature Group, then query by tag.
FG_NAME = 'driver-stats'

decline_rate = Feature(
    FG_NAME, "decline_rate", "double",
    "Fraction of ride requests declined by driver in the last 30 days",
    {"baseline_mean": "0.05", "baseline_p99": "0.15"},
    ["driver", "decline-risk"]
)
decline_rate.save()

avg_rating = Feature(
    FG_NAME, "avg_rating", "double",
    "Average rider rating for driver over last 90 days",
    {"baseline_mean": "4.7", "baseline_p99": "4.95"},
    ["driver", "quality"]
)
avg_rating.save()

loaded_feature = Feature.load("driver-stats", "decline_rate")
print(loaded_feature, '\n')

print("Features tagged with 'driver':")
f_df = Feature.query("array_contains(tags, 'driver')")
f_df.show()

Expected output

<class '__main__.Feature'>: {'fid': 'driver-stats_decline_rate', 'fgname': 'driver-stats', 'name': 'decline_rate', 'ftype': 'double', 'description': 'Fraction of ride requests declined by driver in the last 30 days', 'attrs': {'baseline_mean': '0.05', 'baseline_p99': '0.15'}, 'tags': ['driver', 'decline-risk']}

Features tagged with 'driver':
+--------------------+------------+------------+------+--------------------+--------------------+--------------------+
|                 fid|      fgname|        name|  type|         description|               attrs|                tags|
+--------------------+------------+------------+------+--------------------+--------------------+--------------------+
|driver-stats_decl...|driver-stats|decline_rate|double|Fraction of ride ...|[baseline_mean ->...|[driver, decline-...|
|driver-stats_avg_...|driver-stats|  avg_rating|double|Average rider rat...|[baseline_mean ->...|   [driver, quality]|
+--------------------+------------+------------+------+--------------------+--------------------+--------------------+

This output shows two things. Loading a single feature by key (driver-stats_decline_rate) returns one object, while querying by tag (driver) returns both saved rows because both features carry the driver tag. In the object printout, the field appears as ftype; in the Spark table output, the same value is shown under the type column.
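The difference between a key lookup and a tag query can be sketched with an in-memory dictionary. The `feature_metadata` dict is a hypothetical stand-in for the feature-metadata set, and the list comprehension is only a plain-Python counterpart of `array_contains(tags, 'driver')`, not the query path the notebook uses.

```python
# Hypothetical in-memory stand-in for the feature-metadata set, keyed by fid.
feature_metadata = {
    "driver-stats_decline_rate": {"name": "decline_rate",
                                  "tags": ["driver", "decline-risk"]},
    "driver-stats_avg_rating": {"name": "avg_rating",
                                "tags": ["driver", "quality"]},
}

# Key lookup: addresses exactly one record.
by_key = feature_metadata["driver-stats_decline_rate"]
print(by_key["name"])

# Tag filter: every record whose tags list contains the value.
by_tag = [rec["name"] for rec in feature_metadata.values() if "driver" in rec["tags"]]
print(by_tag)

# A narrower tag matches fewer records.
by_quality = [rec["name"] for rec in feature_metadata.values() if "quality" in rec["tags"]]
print(by_quality)
```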

Entities (3/4)

Feature Groups and Features are metadata. They describe where data comes from and what values get computed. Entity records store the actual computed values.

An Entity is a record that holds a real-world instance’s ID along with all its computed feature values. The key design choice is co-location: features from multiple feature groups for the same entity get stored together in a single Aerospike record. If driver-stats computes decline_rate and avg_rating, while driver-activity computes trips_today, all three values end up in the same driver_123 record. Serving can fetch a complete feature vector with one database read.

Each entity type gets its own Aerospike set: driver-features, rider-features, trip-features, and so on.
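The co-location idea and the per-type set naming can be sketched as follows. `set_name_for` and `merge_feature_values` are hypothetical helpers written for illustration; they show how values from two feature groups end up as prefixed bins in one record, without claiming to match the notebook's Entity implementation.

```python
def set_name_for(etype: str) -> str:
    """Each entity type maps to its own set, for example driver -> driver-features."""
    return f"{etype}-features"


def merge_feature_values(record: dict, prefix: str, values: dict) -> dict:
    """Return a copy of the record with one prefixed bin added per feature."""
    merged = dict(record)
    for name, value in values.items():
        merged[f"{prefix}_{name}"] = value
    return merged


# Start from the entity ID, then fold in values from two feature groups.
record = {"driver_id": "driver_123"}
record = merge_feature_values(record, "ds", {"decl_rate": 0.03, "avg_rating": 4.82})
record = merge_feature_values(record, "da", {"trips_today": 7})

print(set_name_for("driver"))
print(record)
```

The prefix keeps bins from different groups distinct inside the shared record, which is what lets one read return the whole feature vector.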

Field	Description	Example
id_col	Entity instance ID (primary key)	`driver_123`
feature values	One bin per feature, with a short prefix from the feature group name	`ds_decl_rate: 0.03`, `ds_avg_rating: 4.82`, `da_trips_today: 7`
timestamp	Last update timestamp	`2024-01-15T14:30:00Z`

Unlike the other classes, Entity has no fixed schema: its fields depend on which features are computed for that entity type. The class also includes get_feature_vector() for low-latency single-record lookups with the Aerospike Python client. Part 3 covers how to use and benchmark it in a serving path.
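A single-record lookup of this kind might look like the sketch below. It is not the notebook's get_feature_vector(): `fetch_feature_vector` is a hypothetical function, and the namespace name `test` is an assumption. The only client behavior relied on is that `get(key)` takes a `(namespace, set, key)` tuple and returns a `(key, meta, bins)` tuple, which is how the Aerospike Python client's `get` call works. A stub client is used so the example runs without a cluster.

```python
def fetch_feature_vector(client, namespace, etype, entity_id, feature_bins):
    """Read one entity record by key and return its values in the requested order."""
    key = (namespace, f"{etype}-features", entity_id)
    _key, _meta, bins = client.get(key)
    return [bins.get(name) for name in feature_bins]


class StubClient:
    """Hypothetical stand-in exposing get(key) -> (key, meta, bins)."""

    def __init__(self, data):
        self._data = data

    def get(self, key):
        return key, {"gen": 1, "ttl": 0}, self._data[key]


client = StubClient({
    ("test", "driver-features", "driver_123"): {
        "ds_decl_rate": 0.03,
        "ds_avg_rating": 4.82,
        "da_trips_today": 7,
    }
})

vector = fetch_feature_vector(
    client, "test", "driver", "driver_123",
    ["ds_decl_rate", "ds_avg_rating", "da_trips_today"],
)
print(vector)
```

With a real connection, the same function would accept a connected Aerospike client in place of the stub.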

The Entity class definition is pre-filled in the notebook.

Run Cell P1-11 to write and read a concrete driver record.

Usage example:

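# Save one driver entity with three feature values, then query drivers by rating.
# The type classes come from PySpark; the import is repeated here in case earlier notebook cells have not run.
from pyspark.sql.types import DoubleType, LongType, StringType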
features = [
    ('ds_decl_rate', DoubleType(), 0.03),
    ('ds_avg_rating', DoubleType(), 4.82),
    ('da_trips_today', LongType(), 7)
]
record = [('driver_id', StringType(), 'driver_123')] + features
driver = Entity('driver', record, 'driver_id')
schema = Entity.get_schema(record)
driver.save(schema)

loaded_driver = Entity.load('driver', 'driver_123', schema, 'driver_id')
print(loaded_driver, '\n')

print("Drivers with rating above 4.5:")
instances = Entity.query('driver', 'ds_avg_rating > 4.5', schema, 'driver_id')
instances.show()

Expected output

<class '__main__.Entity'>: {'etype': 'driver', 'record': [('driver_id', 'string', 'driver_123'), ('ds_decl_rate', 'double', 0.03), ('ds_avg_rating', 'double', 4.82), ('da_trips_today', 'long', 7)], 'id_col': 'driver_id'}

Drivers with rating above 4.5:
+----------+------------+-------------+--------------+
| driver_id|ds_decl_rate|ds_avg_rating|da_trips_today|
+----------+------------+-------------+--------------+
|driver_123|        0.03|         4.82|             7|
+----------+------------+-------------+--------------+

Here, the printed object confirms the full driver_123 record was reconstructed from Aerospike, including all feature bins. The query output then shows the same record in tabular form because it satisfies ds_avg_rating > 4.5.

Datasets (4/4)

A Dataset is a saved definition of a training slice: which entity type, which features, and which entity instances to include. The Dataset record is metadata only. The materialized training data gets written to external storage like Parquet when you call materialize.

In the ride-hailing app, you might define a dataset for trip decline risk modeling that selects driver entities with recent activity, including features like decline_rate, avg_rating, and trips_today.

Dataset metadata is stored in the Aerospike set dataset-metadata, keyed by name.

Field	Description	Example
name	Unique identifier (primary key)	`trip-decline-risk-training-jan2024`
description	Human-readable summary	Training set for trip decline risk prediction model
entity	Entity type	`driver`
id_col	Entity ID column	`driver_id`
id_type	ID column type	`string`
features	List of entity feature bin names to include	`["ds_decl_rate", "ds_avg_rating", "da_trips_today"]`
query	Predicate to filter entity instances	`da_trips_today > 0`
location	External path for the materialized dataset	`s3://ml-datasets/trip-decline-risk/jan2024/`
attrs	Free-form metadata (for example, model version)	`{model_version: "v1"}`
tags	Tags for search and organization	`["decline-risk", "driver", "training"]`

The Dataset class definition is pre-filled in the notebook.
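To show how a definition like this turns into training rows, here is a rough sketch of the selection step. It is an illustration under stated assumptions, not the notebook's materialize implementation: `parse_predicate` and `select_rows` are hypothetical helpers, the parser handles only predicates of the form `<column> <op> <number>`, driver_456 is a made-up record, and writing the result to Parquet is left out.

```python
import operator

# Comparison operators supported by this simplified parser.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


def parse_predicate(query):
    """Parse '<column> <op> <number>'; richer predicates are out of scope here."""
    column, op, value = query.split()
    return column, OPS[op], float(value)


def select_rows(definition, entity_records):
    """Keep records that satisfy the query, projected to the ID and listed features."""
    column, op, threshold = parse_predicate(definition["query"])
    columns = [definition["id_col"]] + definition["features"]
    return [
        {c: rec[c] for c in columns}
        for rec in entity_records
        if op(rec[column], threshold)
    ]


definition = {
    "entity": "driver",
    "id_col": "driver_id",
    "features": ["ds_decl_rate", "ds_avg_rating", "da_trips_today"],
    "query": "da_trips_today > 0",
}
entity_records = [
    {"driver_id": "driver_123", "ds_decl_rate": 0.03,
     "ds_avg_rating": 4.82, "da_trips_today": 7},
    {"driver_id": "driver_456", "ds_decl_rate": 0.10,  # hypothetical record
     "ds_avg_rating": 4.10, "da_trips_today": 0},
]

rows = select_rows(definition, entity_records)
print([r["driver_id"] for r in rows])
```

Because the definition stores the predicate and feature list rather than the rows, rerunning the selection against the same entity data reproduces the same slice.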

Run Cell P1-13 to save and inspect a dataset definition.

Usage example:

# Save a reusable training dataset definition, then query datasets by tag.
decline_risk_dataset = Dataset(
    name="trip-decline-risk-training-jan2024",
    description="Training set for trip decline risk prediction model",
    entity="driver",
    id_col="driver_id",
    id_type="string",
    features=[
        "ds_decl_rate",
        "ds_avg_rating",
        "da_trips_today"
    ],
    query="da_trips_today > 0",
    location="s3://ml-datasets/trip-decline-risk/jan2024/",
    attrs={"model_version": "v1"},
    tags=["decline-risk", "driver", "training"]
)
decline_risk_dataset.save()

loaded_ds = Dataset.load("trip-decline-risk-training-jan2024")
print(loaded_ds, '\n')

print("Datasets tagged with 'driver':")
ds_df = Dataset.query_datasets("array_contains(tags, 'driver')")
ds_df.show()

Expected output

<class '__main__.Dataset'>: {'name': 'trip-decline-risk-training-jan2024', 'description': 'Training set for trip decline risk prediction model', 'entity': 'driver', 'id_col': 'driver_id', 'id_type': 'string', 'features': ['ds_decl_rate', 'ds_avg_rating', 'da_trips_today'], 'query': 'da_trips_today > 0', 'location': 's3://ml-datasets/trip-decline-risk/jan2024/', 'attrs': {'model_version': 'v1'}, 'tags': ['decline-risk', 'driver', 'training']}

Datasets tagged with 'driver':
+--------------------+--------------------+------+---------+-------+--------------------+------------------+--------------------+--------------------+--------------------+
|                name|         description|entity|   id_col|id_type|            features|             query|            location|               attrs|                tags|
+--------------------+--------------------+------+---------+-------+--------------------+------------------+--------------------+--------------------+--------------------+
|trip-decline-risk...|Training set for ...|driver|driver_id| string|[ds_decl_rate, ds...|da_trips_today > 0|s3://ml-datasets/...|[model_version ->...|[decline-risk, dr...|
+--------------------+--------------------+------+---------+-------+--------------------+------------------+--------------------+--------------------+--------------------+

The loaded object confirms the dataset metadata was saved under trip-decline-risk-training-jan2024, including selected features and filter predicate. The query table appears because this dataset includes the driver tag, so it matches array_contains(tags, 'driver').

Continue to What’s next to close out Part 1 and then move into Part 2: Model Training.