---
title: "Exploring features"
description: "Discover available features and inspect the training data before defining a dataset."
---

# Exploring features

Before training, explore the feature catalog and inspect the data. You already know what’s in this feature store from Part 1, but a new teammate would run these same catalog queries to learn which driver features are available.

## Feature discovery

Run two queries to confirm what features are available. Since your goal is predicting information about drivers, look for driver-related features.

Query the `FeatureGroup` class to see which feature groups produce features for the `driver` entity type. Then query the `Feature` class to list the features in feature groups whose names start with `driver`.

1.  Run `Cell 3` to query feature groups and features for drivers.

Cell 3: Query feature groups and features for drivers

```python
# 1) Which feature groups produce driver features?
FeatureGroup.query("attrs.entity == 'driver'") \
    .select("name", "description", "source") \
    .show(truncate=50)

# 2) Which individual features are available in those groups?
Feature.query("fgname like 'driver%'") \
    .select("fid", "type", "description") \
    .show(truncate=50)
```

Feature discovery

```plaintext
+---------------+------------------------------------------+------------------------------+
|name           |description                               |source                        |
+---------------+------------------------------------------+------------------------------+
|driver-stats   |Driver performance metrics...             |kafka://events.rides.completed|
|driver-activity|Driver activity metrics from GPS pipeline |kafka://events.gps.pings      |
+---------------+------------------------------------------+------------------------------+

+-------------------------------+-------+-------------------------------------------+
|fid                            |type   |description                                |
+-------------------------------+-------+-------------------------------------------+
|driver-stats_decline_rate      |double |Fraction of ride requests declined...      |
|driver-stats_avg_rating        |double |Average rider rating for driver...         |
|driver-activity_trips_today    |integer|Number of completed trips today            |
+-------------------------------+-------+-------------------------------------------+
```

In a real environment you may have to sort through many groups and features, but here the catalog is small enough to scan at a glance.
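To make the two catalog queries concrete, here is a minimal sketch of the same filtering logic over plain Python dictionaries. It does not use the tutorial's `FeatureGroup` or `Feature` classes. The record layout (`entity` and `fgname` keys), the helper names `groups_for_entity` and `features_in_groups`, and the `rider-stats` row are assumptions for illustration; the other rows are copied from the output above.

```python
# Illustrative in-memory stand-in for the feature catalog.
# The dictionary layout is an assumption, not the feature store's real schema.
FEATURE_GROUPS = [
    {"name": "driver-stats", "entity": "driver"},
    {"name": "driver-activity", "entity": "driver"},
    # Hypothetical non-driver group, added only so the filter has something to exclude.
    {"name": "rider-stats", "entity": "rider"},
]

# The fgname values are assumed from the prefix of each feature ID.
FEATURES = [
    {"fid": "driver-stats_decline_rate", "fgname": "driver-stats", "type": "double"},
    {"fid": "driver-stats_avg_rating", "fgname": "driver-stats", "type": "double"},
    {"fid": "driver-activity_trips_today", "fgname": "driver-activity", "type": "integer"},
]


def groups_for_entity(groups, entity):
    """Return the names of feature groups whose entity type matches."""
    return [g["name"] for g in groups if g["entity"] == entity]


def features_in_groups(features, prefix):
    """Return feature IDs whose group name starts with prefix (like 'prefix%')."""
    return [f["fid"] for f in features if f["fgname"].startswith(prefix)]


driver_groups = groups_for_entity(FEATURE_GROUPS, "driver")
driver_features = features_in_groups(FEATURES, "driver")

print(driver_groups)
print(driver_features)
```

The first helper mirrors the equality predicate on the entity type, and the second mirrors the `like 'driver%'` prefix match on the feature group name.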

## Load the labeled feature data

Now load the actual records to see what the training data looks like.

1.  Run `Cell 4` to load the labeled driver data.

Cell 4: Load labeled driver data

```python
if "require_part2" not in globals():
    raise ValueError("Missing Part 2 bootstrap helper. Re-run Part 2 Cell 1.")

require_part2(["SCHEMA", "ENTITY_TYPE", "ENTITY_ID_COL", "TRAINING_PREDICATE", "TRAIN_COLUMNS"])

driver_df = Entity.query(ENTITY_TYPE, TRAINING_PREDICATE, SCHEMA, ENTITY_ID_COL)

print(f"Loaded driver records: {driver_df.count()}")

driver_df.select(ENTITY_ID_COL, *TRAIN_COLUMNS).show(5)
```

Expected output

```plaintext
Loaded driver records: 100

+----------+------------+-------------+---------------+-----+
|driver_id |ds_decl_rate|ds_avg_rating|da_trips_today |label|
+----------+------------+-------------+---------------+-----+
|driver_001|       0.037|         4.78|              9|    0|
|driver_002|       0.182|         4.21|              3|    1|
|driver_003|       0.052|         4.65|             11|    0|
|driver_004|       0.028|         4.89|              7|    0|
|driver_005|       0.145|         4.15|              5|    1|
+----------+------------+-------------+---------------+-----+
```
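Cell 4 relies on a `require_part2` helper defined in Part 2 Cell 1, which is not shown on this page. As a rough sketch of the idea only, a guard like the following could check that the expected notebook variables exist before a cell runs. The function `require_names`, its error type, and its message are hypothetical and are not the tutorial's actual helper.

```python
def require_names(names, namespace):
    """Raise NameError if any expected variable is missing from namespace.

    Hypothetical stand-in for a notebook bootstrap guard; the tutorial's
    real require_part2 helper may behave differently.
    """
    missing = [name for name in names if name not in namespace]
    if missing:
        raise NameError(
            f"Missing notebook variables: {', '.join(missing)}. Re-run the setup cell."
        )


# Example namespace with placeholder values (not the tutorial's real settings).
demo_namespace = {"ENTITY_TYPE": "driver", "ENTITY_ID_COL": "driver_id"}

# Passes silently because both names are defined.
require_names(["ENTITY_TYPE", "ENTITY_ID_COL"], demo_namespace)

# Fails fast with a readable message when a name is missing.
try:
    require_names(["ENTITY_TYPE", "SCHEMA"], demo_namespace)
except NameError as err:
    guard_message = str(err)
    print(guard_message)
```

In a notebook you would typically pass `globals()` as the namespace, so that a cell run out of order stops with a clear message instead of an unrelated error further down.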

## Check feature patterns by label

Compute a quick summary of average feature values by label to confirm that the features separate the two classes.

1.  Run `Cell 5` to compute average feature values per label.

Cell 5: Compute average feature values per label

```python
from pyspark.sql.functions import avg, round as spark_round

decline_col, rating_col, trips_col = FEATURE_COLUMNS

driver_df.groupBy(LABEL_COL).agg(
    spark_round(avg(decline_col), 3).alias('avg_decline'),
    spark_round(avg(rating_col), 2).alias('avg_rating'),
    spark_round(avg(trips_col), 1).alias('avg_trips')
).orderBy(LABEL_COL).show()
```

Expected output

```plaintext
+-----+-----------+----------+---------+
|label|avg_decline|avg_rating|avg_trips|
+-----+-----------+----------+---------+
|    0|      0.045|      4.73|      6.7|
|    1|      0.175|      4.27|      4.8|
+-----+-----------+----------+---------+
```

The higher-risk class (`label=1`) shows a higher average decline rate (0.175 vs. 0.045), a lower average rating (4.27 vs. 4.73), and fewer trips (4.8 vs. 6.7). These gaps suggest the features are useful for predicting decline risk.
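The per-label aggregation can also be sketched without Spark. The example below groups rows by label and averages each feature column in plain Python. It uses only the five sample rows shown in the Cell 4 output, so its averages differ from the 100-record summary above. The function name `mean_features_by_label` and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# (decline_rate, avg_rating, trips_today, label) for the five sample rows
# shown in the Cell 4 output; this is not the full 100-record dataset.
SAMPLE_ROWS = [
    (0.037, 4.78, 9, 0),
    (0.182, 4.21, 3, 1),
    (0.052, 4.65, 11, 0),
    (0.028, 4.89, 7, 0),
    (0.145, 4.15, 5, 1),
]


def mean_features_by_label(rows):
    """Group rows by their last element (the label) and average each feature column."""
    grouped = defaultdict(list)
    for *features, label in rows:
        grouped[label].append(features)
    return {
        label: tuple(mean(column) for column in zip(*feature_rows))
        for label, feature_rows in sorted(grouped.items())
    }


summary = mean_features_by_label(SAMPLE_ROWS)
for label, (avg_decline, avg_rating, avg_trips) in summary.items():
    print(label, round(avg_decline, 3), round(avg_rating, 2), round(avg_trips, 1))
```

Even on this small sample, the `label=1` rows have a higher mean decline rate and a lower mean rating than the `label=0` rows, which matches the direction of the full summary.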

Next, you’ll define the training Dataset: the reproducible definition of what data to train on.

-   I can find relevant feature groups and features for a training use case.
-   I can load labeled feature data and inspect individual records.
-   I can check feature patterns by label.
