---
title: "Explore the feature store"
description: "Understand the four feature store objects, run pre-filled class definitions, and execute usage examples to populate Aerospike with sample data."
---

# Explore the feature store


The feature store has four objects that work together:

-   **Feature Groups** organize features by data source and computation pipeline.
-   **Features** define individual computed values within a group.
-   **Entities** store the actual feature values for real-world instances.
-   **Datasets** define reproducible training slices for ML models.

On this page, you will read how each object maps to Aerospike sets and records, then run the pre-filled notebook cells listed in each step. The notebook already contains the Part 1 class definitions and usage examples; the code blocks here explain what those cells do.

## Feature Groups (1/4)

Feature Groups are the organizing layer for feature engineering. They capture where a related set of features comes from and how that pipeline is managed.

A single Feature Group stores metadata for features derived from the same raw data and produced by the same computation pipeline. Feature Group metadata is stored in the Aerospike set `fg-metadata`, keyed by `name`.

| Field | Description | Example |
| --- | --- | --- |
| **name** | Unique identifier (primary key) | `driver-stats` |
| **description** | Human-readable summary | Driver performance metrics from ride completion events |
| **source** | Upstream dataset or system reference | `kafka://events.rides.completed` |
| **attrs** | Free-form metadata (owner, refresh cadence, entity type) | `{owner: "ml-platform", refresh: "hourly", entity: "driver"}` |
| **tags** | Tags for search and organization | `["driver", "performance", "core"]` |
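As a rough illustration of the mapping in the table, the following sketch lays out one Feature Group as an Aerospike key plus bins. It is not the notebook's `FeatureGroup` class: the `FeatureGroupRecord` name, its `key()` and `bins()` methods, and the `test` namespace are assumptions made for this example.

```python
from dataclasses import dataclass, field, asdict

# Assumption: "test" is a placeholder namespace; use the one your cluster defines.
NAMESPACE = "test"
FG_SET = "fg-metadata"  # set name taken from this page


@dataclass
class FeatureGroupRecord:
    """Hypothetical stand-in that mirrors the Feature Group fields in the table."""
    name: str
    description: str
    source: str
    attrs: dict = field(default_factory=dict)
    tags: list = field(default_factory=list)

    def key(self):
        # An Aerospike key is a (namespace, set, user key) tuple; `name` is the primary key.
        return (NAMESPACE, FG_SET, self.name)

    def bins(self):
        # One bin per field: attrs fits a map bin, tags fits a list bin.
        return asdict(self)


fg = FeatureGroupRecord(
    "driver-stats",
    "Driver performance metrics from ride completion events",
    "kafka://events.rides.completed",
    {"owner": "ml-platform", "refresh": "hourly", "entity": "driver"},
    ["driver", "performance", "core"],
)
print(fg.key())
print(sorted(fg.bins()))
```

Keeping `attrs` as a map and `tags` as a list is what makes predicates such as `attrs.entity == 'driver'` possible later on.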

The actual `FeatureGroup` definition, along with the other classes, is pre-filled in the notebook.

1.  Run `Cell P1-07` in your notebook to create, save, load, and query a Feature Group.

This creates `driver-stats`, computed from a hypothetical ride completion Kafka stream. You’ll load sample data directly later in the tutorial instead of setting up an example Kafka pipeline.

**Usage example:**

```python
# Save one Feature Group metadata record, then read it back by key and query it.
driver_stats = FeatureGroup(
    "driver-stats",
    "Driver performance metrics from ride completion events",
    "kafka://events.rides.completed",
    {"owner": "ml-platform", "refresh": "hourly", "entity": "driver"},
    ["driver", "performance", "core"]
)
driver_stats.save()

loaded_fg = FeatureGroup.load("driver-stats")
print(loaded_fg, '\n')

print("Feature groups for driver entity with hourly refresh:")
df = FeatureGroup.query("attrs.entity == 'driver' and attrs.refresh == 'hourly'")
df.show()
```

**Expected output:**

```plaintext
<class '__main__.FeatureGroup'>: {'name': 'driver-stats', 'description': 'Driver performance metrics from ride completion events', 'source': 'kafka://events.rides.completed', 'attrs': {'owner': 'ml-platform', 'refresh': 'hourly', 'entity': 'driver'}, 'tags': ['driver', 'performance', 'core']}

Feature groups for driver entity with hourly refresh:
+------------+--------------------+--------------------+--------------------+--------------------+
|        name|         description|              source|               attrs|                tags|
+------------+--------------------+--------------------+--------------------+--------------------+
|driver-stats|Driver performance...|kafka://events.ri...|[owner -> ml-plat...|[driver, performa...|
+------------+--------------------+--------------------+--------------------+--------------------+
```

You created and saved one feature-group metadata record (`driver-stats`) into `fg-metadata`, then loaded it back by primary key (`name`) to confirm persistence. The table appears because the query filters by `attrs.entity == 'driver'` and `attrs.refresh == 'hourly'`, and this example has exactly one record that matches.

## Features (2/4)

A Feature is a single, named output produced by a pipeline within a Feature Group. Features are metadata records that describe what a pipeline computes; they hold definitions, not actual values.

In the ride-hailing app, the `driver-stats` feature group contains features like `decline_rate` and `avg_rating`.

Feature metadata is stored in the Aerospike set `feature-metadata`. The primary key `fid` combines the group and feature names for global uniqueness, because the same feature name can appear in different groups.

| Field | Description | Example |
| --- | --- | --- |
| **fid** | Unique identifier (primary key), auto-generated as `<fgname>_<name>` | `driver-stats_decline_rate` |
| **fgname** | Parent feature group name | `driver-stats` |
| **name** | Feature name, unique within its group | `decline_rate` |
| **ftype** | Data type (integer, double, string, boolean) | `double` |
| **description** | Human-readable meaning and usage notes | Fraction of ride requests declined by driver in the last 30 days |
| **attrs** | Free-form metadata (baseline stats, data quality indicators), stored as strings in this example | `{baseline_mean: "0.05", baseline_p99: "0.15"}` |
| **tags** | Tags for search and organization | `["driver", "decline-risk"]` |
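The key scheme in the table can be sketched in a couple of lines. `make_fid` is a hypothetical helper name, and `rider-stats` is an invented second group used only to show why the group name is part of the key; the notebook's `Feature` class generates `fid` for you.

```python
def make_fid(fgname: str, name: str) -> str:
    """Build the feature ID in the `<fgname>_<name>` form shown in the table."""
    return f"{fgname}_{name}"


# The same feature name in two groups yields two distinct primary keys.
fid_a = make_fid("driver-stats", "avg_rating")
fid_b = make_fid("rider-stats", "avg_rating")  # hypothetical group, for illustration

print(fid_a)  # driver-stats_avg_rating
print(fid_b)  # rider-stats_avg_rating
print(fid_a != fid_b)  # True
```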

The `Feature` class definition is pre-filled in the notebook.

1.  Run `Cell P1-09` to register and query feature metadata.

**Usage example:**

```python
# Register two features in the driver-stats Feature Group, then query by tag.
FG_NAME = 'driver-stats'

decline_rate = Feature(
    FG_NAME, "decline_rate", "double",
    "Fraction of ride requests declined by driver in the last 30 days",
    {"baseline_mean": "0.05", "baseline_p99": "0.15"},
    ["driver", "decline-risk"]
)
decline_rate.save()

avg_rating = Feature(
    FG_NAME, "avg_rating", "double",
    "Average rider rating for driver over last 90 days",
    {"baseline_mean": "4.7", "baseline_p99": "4.95"},
    ["driver", "quality"]
)
avg_rating.save()

loaded_feature = Feature.load("driver-stats", "decline_rate")
print(loaded_feature, '\n')

print("Features tagged with 'driver':")
f_df = Feature.query("array_contains(tags, 'driver')")
f_df.show()
```

**Expected output:**

```plaintext
<class '__main__.Feature'>: {'fid': 'driver-stats_decline_rate', 'fgname': 'driver-stats', 'name': 'decline_rate', 'ftype': 'double', 'description': 'Fraction of ride requests declined by driver in the last 30 days', 'attrs': {'baseline_mean': '0.05', 'baseline_p99': '0.15'}, 'tags': ['driver', 'decline-risk']}

Features tagged with 'driver':
+--------------------+------------+----------------+------+--------------------+--------------------+--------------------+
|                 fid|      fgname|            name|  type|         description|               attrs|                tags|
+--------------------+------------+----------------+------+--------------------+--------------------+--------------------+
|driver-stats_decl...|driver-stats|   decline_rate|double|Fraction of ride ...|[baseline_mean ->...|  [driver, decline-risk]|
|driver-stats_avg_...|driver-stats|      avg_rating|double|Average rider rat...|[baseline_mean ->...|    [driver, quality]|
+--------------------+------------+----------------+------+--------------------+--------------------+--------------------+
```

This output shows two things: loading a single feature by key (`driver-stats_decline_rate`) returns one object, while querying by tag (`driver`) returns both saved rows because both features were tagged for the driver entity. In the object printout, the field appears as `ftype`; in the Spark table output, the same value is shown under the `type` column.

## Entities (3/4)

Feature Groups and Features are metadata. They describe where data comes from and what values get computed. Entity records store the actual computed values.

An Entity is a record that holds a real-world instance’s ID along with all its computed feature values. The key design choice is **co-location**: features from multiple feature groups for the same entity get stored together in a single Aerospike record. If `driver-stats` computes `decline_rate` and `avg_rating`, while `driver-activity` computes `trips_today`, all three values end up in the same `driver_123` record. Serving can fetch a complete feature vector with one database read.

Each entity type gets its own Aerospike set: `driver-features`, `rider-features`, `trip-features`, and so on.

| Field | Description | Example |
| --- | --- | --- |
| **id\_col** | Entity instance ID (primary key) | `driver_123` |
| **feature values** | One bin per feature, with a short prefix from the feature group name | `ds_decl_rate: 0.03`, `ds_avg_rating: 4.82`, `da_trips_today: 7` |
| **timestamp** | Last update timestamp | `2024-01-15T14:30:00Z` |
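Co-location can be pictured as a merge of per-group outputs into one record. The sketch below uses plain dictionaries and a hypothetical `colocate` helper; it only illustrates the record shape and is not how the notebook's `Entity` class writes to Aerospike.

```python
def colocate(entity_id, id_col, *group_outputs):
    """Merge per-group feature values into one record keyed by the entity ID."""
    record = {id_col: entity_id}
    for prefix, values in group_outputs:
        for feature, value in values.items():
            # Prefix each bin with the short feature group prefix.
            record[f"{prefix}_{feature}"] = value
    return record


# Outputs of two separate pipelines for the same driver.
driver_stats = ("ds", {"decl_rate": 0.03, "avg_rating": 4.82})
driver_activity = ("da", {"trips_today": 7})

record = colocate("driver_123", "driver_id", driver_stats, driver_activity)
print(record)
# {'driver_id': 'driver_123', 'ds_decl_rate': 0.03, 'ds_avg_rating': 4.82, 'da_trips_today': 7}
```

Because all three values live in one record, a serving path needs a single key lookup rather than one read per feature group.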

::: bin name limit
Aerospike bin names are limited to 14 characters. Use short prefixes derived from the feature group name (for example, `ds` for `driver-stats`, `da` for `driver-activity`).
:::
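One way to catch over-long bin names early is to check them when they are built. This is a minimal sketch: `short_prefix` and `bin_name` are hypothetical helpers, and the first-letter prefix rule is an assumption inferred from the `ds` and `da` examples in the note.

```python
MAX_BIN_NAME_LEN = 14  # limit stated in the note on this page


def short_prefix(fgname: str) -> str:
    """Take the first letter of each hyphen-separated word: driver-stats -> ds."""
    return "".join(part[0] for part in fgname.split("-") if part)


def bin_name(fgname: str, feature: str) -> str:
    """Build a prefixed bin name and reject it if it exceeds the limit."""
    name = f"{short_prefix(fgname)}_{feature}"
    if len(name) > MAX_BIN_NAME_LEN:
        raise ValueError(f"bin name '{name}' has {len(name)} characters (max {MAX_BIN_NAME_LEN})")
    return name


print(bin_name("driver-stats", "decl_rate"))       # ds_decl_rate
print(bin_name("driver-activity", "trips_today"))  # da_trips_today

try:
    bin_name("driver-stats", "decline_rate")  # ds_decline_rate is 15 characters
except ValueError as err:
    print(err)
```

This is why the entity bin is `ds_decl_rate` rather than `ds_decline_rate`: the full feature name would not fit once the prefix is added.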

Unlike the other classes, `Entity` has no fixed schema: its fields depend on which features are computed for that entity type. The class also includes `get_feature_vector()` for low-latency single-record lookups with the Aerospike Python client. Part 3 covers how to use and benchmark it in a serving path.
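To give a feel for what a single-record lookup involves, here is a hedged sketch built around the Aerospike Python client's `select()` call, which returns a `(key, meta, bins)` tuple. The function name `fetch_feature_vector`, its parameters, and the `test` namespace are assumptions; the notebook's `get_feature_vector()` may be shaped differently. `FakeClient` is an in-memory stand-in so the example runs without a cluster.

```python
def fetch_feature_vector(client, namespace, set_name, entity_id, feature_bins):
    """Read only the requested bins from one record and return them in a fixed order."""
    key = (namespace, set_name, entity_id)
    # select() reads just the named bins and returns (key, meta, bins).
    _, _, bins = client.select(key, feature_bins)
    # Missing bins become None so the vector length stays stable.
    return [bins.get(name) for name in feature_bins]


class FakeClient:
    """In-memory stand-in that mimics select() for this sketch only."""

    def __init__(self, records):
        self._records = records

    def select(self, key, bins):
        stored = self._records[key]
        return key, {"gen": 1, "ttl": 0}, {b: stored[b] for b in bins if b in stored}


client = FakeClient({
    ("test", "driver-features", "driver_123"): {
        "ds_decl_rate": 0.03,
        "ds_avg_rating": 4.82,
        "da_trips_today": 7,
    }
})

vector = fetch_feature_vector(
    client, "test", "driver-features", "driver_123",
    ["ds_decl_rate", "ds_avg_rating", "da_trips_today"],
)
print(vector)  # [0.03, 4.82, 7]
```

With a real cluster you would pass a connected Aerospike client in place of `FakeClient`; the returned list keeps the order of `feature_bins`, which is what a model input typically expects.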

The `Entity` class definition is pre-filled in the notebook.

1.  Run `Cell P1-11` to write and read a concrete driver record.

**Usage example:**

```python
# Spark SQL types used to describe each bin (skip if already imported in the notebook).
from pyspark.sql.types import DoubleType, LongType, StringType

# Save one driver entity with three feature values, then query drivers by rating.
features = [
    ('ds_decl_rate', DoubleType(), 0.03),
    ('ds_avg_rating', DoubleType(), 4.82),
    ('da_trips_today', LongType(), 7)
]
record = [('driver_id', StringType(), 'driver_123')] + features

driver = Entity('driver', record, 'driver_id')
schema = Entity.get_schema(record)
driver.save(schema)

loaded_driver = Entity.load('driver', 'driver_123', schema, 'driver_id')
print(loaded_driver, '\n')

print("Drivers with rating above 4.5:")
instances = Entity.query('driver', 'ds_avg_rating > 4.5', schema, 'driver_id')
instances.show()
```

**Expected output:**

```plaintext
<class '__main__.Entity'>: {'etype': 'driver', 'record': [('driver_id', 'string', 'driver_123'), ('ds_decl_rate', 'double', 0.03), ('ds_avg_rating', 'double', 4.82), ('da_trips_today', 'long', 7)], 'id_col': 'driver_id'}

Drivers with rating above 4.5:
+----------+--------------+-------------+--------------+
| driver_id|  ds_decl_rate|ds_avg_rating|da_trips_today|
+----------+--------------+-------------+--------------+
|driver_123|          0.03|         4.82|             7|
+----------+--------------+-------------+--------------+
```

Here, the printed object confirms the full `driver_123` record was reconstructed from Aerospike, including all feature bins. The query output then shows the same record in tabular form because it satisfies `ds_avg_rating > 4.5`.

## Datasets (4/4)

A Dataset is a saved definition of a training slice: which entity type, which features, and which entity instances to include. The Dataset record is metadata only. The materialized training data gets written to external storage like Parquet when you call `materialize`.

In the ride-hailing app, you might define a dataset for trip decline risk modeling that selects `driver` entities with recent activity, including features like `decline_rate`, `avg_rating`, and `trips_today`.

Dataset metadata is stored in the Aerospike set `dataset-metadata`, keyed by `name`.

| Field | Description | Example |
| --- | --- | --- |
| **name** | Unique identifier (primary key) | `trip-decline-risk-training-jan2024` |
| **description** | Human-readable summary | Training set for trip decline risk prediction model |
| **entity** | Entity type | `driver` |
| **id\_col** | Entity ID column | `driver_id` |
| **id\_type** | ID column type | `string` |
| **features** | List of entity feature bin names to include | `["ds_decl_rate", "ds_avg_rating", "da_trips_today"]` |
| **query** | Predicate to filter entity instances | `da_trips_today > 0` |
| **location** | External path for the materialized dataset | `s3://ml-datasets/trip-decline-risk/jan2024/` |
| **attrs** | Free-form metadata (for example, model version) | `{model_version: "v1"}` |
| **tags** | Tags for search and organization | `["decline-risk", "driver", "training"]` |

::: bin name limit
The names in the `features` list are entity bin names, which are limited to 14 characters in Aerospike.
:::
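The slice a dataset definition describes comes down to a filter followed by a projection. The sketch below shows that logic on plain Python dictionaries. `select_training_rows` is a hypothetical helper, `driver_456` is invented sample data, and the predicate is a Python callable rather than the predicate string used by the notebook, which runs the equivalent step in Spark.

```python
def select_training_rows(entity_rows, id_col, features, predicate):
    """Keep rows that satisfy the predicate, then project the ID and chosen features."""
    columns = [id_col] + list(features)
    return [
        {col: row.get(col) for col in columns}
        for row in entity_rows
        if predicate(row)
    ]


# Two driver records; driver_456 is invented to show a row being filtered out.
rows = [
    {"driver_id": "driver_123", "ds_decl_rate": 0.03, "ds_avg_rating": 4.82, "da_trips_today": 7},
    {"driver_id": "driver_456", "ds_decl_rate": 0.10, "ds_avg_rating": 4.10, "da_trips_today": 0},
]

training = select_training_rows(
    rows,
    id_col="driver_id",
    features=["ds_decl_rate", "ds_avg_rating"],
    predicate=lambda r: r["da_trips_today"] > 0,  # same condition as `da_trips_today > 0`
)
print(training)
# [{'driver_id': 'driver_123', 'ds_decl_rate': 0.03, 'ds_avg_rating': 4.82}]
```

Storing the entity type, feature list, and predicate as metadata is what makes the slice reproducible: the same definition can be applied again later to regenerate the training set.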

The `Dataset` class definition is pre-filled in the notebook.

1.  Run `Cell P1-13` to save and inspect a dataset definition.

**Usage example:**

```python
# Save a reusable training dataset definition, then query datasets by tag.

decline_risk_dataset = Dataset(

    name="trip-decline-risk-training-jan2024",

    description="Training set for trip decline risk prediction model",

    entity="driver",

    id_col="driver_id",

    id_type="string",

    features=[

        "ds_decl_rate",

        "ds_avg_rating",

        "da_trips_today"

    ],

    query="da_trips_today > 0",

    location="s3://ml-datasets/trip-decline-risk/jan2024/",

    attrs={"model_version": "v1"},

    tags=["decline-risk", "driver", "training"]

)

decline_risk_dataset.save()

loaded_ds = Dataset.load("trip-decline-risk-training-jan2024")

print(loaded_ds, '\n')

print("Datasets tagged with 'driver':")

ds_df = Dataset.query_datasets("array_contains(tags, 'driver')")

ds_df.show()
```

**Expected output:**

```plaintext
<class '__main__.Dataset'>: {'name': 'trip-decline-risk-training-jan2024', 'description': 'Training set for trip decline risk prediction model', 'entity': 'driver', 'id_col': 'driver_id', 'id_type': 'string', 'features': ['ds_decl_rate', 'ds_avg_rating', 'da_trips_today'], 'query': 'da_trips_today > 0', 'location': 's3://ml-datasets/trip-decline-risk/jan2024/', 'attrs': {'model_version': 'v1'}, 'tags': ['decline-risk', 'driver', 'training']}

Datasets tagged with 'driver':
+--------------------+--------------------+------+---------+-------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                name|         description|entity|   id_col|id_type|            features|               query|            location|               attrs|                tags|
+--------------------+--------------------+------+---------+-------+--------------------+--------------------+--------------------+--------------------+--------------------+
|trip-decline-ri...|Training set for ...|driver|driver_id| string|[ds_decl_rate, ...|da_trips_today > 0|s3://ml-datasets/...|[model_version ->...|[decline-risk, d...|
+--------------------+--------------------+------+---------+-------+--------------------+--------------------+--------------------+--------------------+--------------------+
```

The loaded object confirms the dataset metadata was saved under `trip-decline-risk-training-jan2024`, including selected features and filter predicate. The query table appears because this dataset includes the `driver` tag, so it matches `array_contains(tags, 'driver')`.

Continue to [What’s next](https://aerospike.com/docs/develop/feature-store/step/1/part/1/whats-next/) to close out Part 1 and then move into [Part 2: Model Training](https://aerospike.com/docs/develop/model-training/).

**Checklist:**

-   I understand how Feature Groups, Features, Entities, and Datasets work together.
-   I have run the pre-filled class definitions and the Part 1 usage examples.
