Exploring features

Before training, explore the feature catalog and inspect the data. You already know what’s in this feature store from Part 1, but a new teammate would run these same catalog queries to learn which driver features are available.

Feature discovery

Run two queries to confirm what features are available. Since your goal is predicting information about drivers, look for driver-related features.

Query the FeatureGroup class to see which feature groups produce features for the driver Entity type. Then query the Feature class to list the individual features whose feature group name (fgname) starts with driver.

Run Cell 3 to query feature groups and features for drivers.

Cell 3: Query feature groups and features for drivers

# 1) Which feature groups produce driver features?
FeatureGroup.query("attrs.entity == 'driver'") \
    .select("name", "description", "source") \
    .show(truncate=50)

# 2) Which individual features are available in those groups?
Feature.query("fgname like 'driver%'") \
    .select("fid", "type", "description") \
    .show(truncate=50)

Expected output

+---------------+------------------------------------------+------------------------------+
|name           |description                               |source                        |
+---------------+------------------------------------------+------------------------------+
|driver-stats   |Driver performance metrics...             |kafka://events.rides.completed|
|driver-activity|Driver activity metrics from GPS pipeline |kafka://events.gps.pings      |
+---------------+------------------------------------------+------------------------------+

+-------------------------------+-------+-------------------------------------------+
|fid                            |type   |description                                |
+-------------------------------+-------+-------------------------------------------+
|driver-stats_decline_rate      |double |Fraction of ride requests declined...      |
|driver-stats_avg_rating        |double |Average rider rating for driver...         |
|driver-activity_trips_today    |integer|Number of completed trips today            |
+-------------------------------+-------+-------------------------------------------+

In a real environment you may have to sort through many groups and features, but here the catalog is small enough to scan with these two queries.
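To make the two filters concrete, here is a minimal plain-Python sketch of what they select. It does not use the feature store API: the dictionaries are stand-ins for the catalog rows shown above, the "entity" and "fgname" keys are assumptions about how those rows are keyed, and the rider entries are hypothetical rows added only so the filters have something to exclude.

```python
# Illustrative stand-ins for catalog rows; not the feature store's real objects.
feature_groups = [
    {"name": "driver-stats", "entity": "driver"},
    {"name": "driver-activity", "entity": "driver"},
    {"name": "rider-stats", "entity": "rider"},  # hypothetical, for contrast
]
features = [
    {"fid": "driver-stats_decline_rate", "fgname": "driver-stats"},
    {"fid": "driver-stats_avg_rating", "fgname": "driver-stats"},
    {"fid": "driver-activity_trips_today", "fgname": "driver-activity"},
    {"fid": "rider-stats_avg_tip", "fgname": "rider-stats"},  # hypothetical
]


def groups_for_entity(groups, entity):
    """Mirror of the first query: keep groups whose entity equals `entity`."""
    return [g["name"] for g in groups if g["entity"] == entity]


def features_with_group_prefix(feats, prefix):
    """Mirror of the second query: `like 'driver%'` is a prefix match on fgname."""
    return [f["fid"] for f in feats if f["fgname"].startswith(prefix)]


driver_groups = groups_for_entity(feature_groups, "driver")
driver_features = features_with_group_prefix(features, "driver")
print(driver_groups)
print(driver_features)
```

Note that the first filter is an exact match on the entity attribute while the second is a prefix match on the group name, so the two can diverge if a group for another entity happens to have a name starting with driver.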

Load the labeled feature data

Now load the actual records to see what the training data looks like.

Run Cell 4 to load the labeled driver data.

Cell 4: Load labeled driver data

if "require_part2" not in globals():
    raise ValueError("Missing Part 2 bootstrap helper. Re-run Part 2 Cell 1.")

require_part2(["SCHEMA", "ENTITY_TYPE", "ENTITY_ID_COL", "TRAINING_PREDICATE", "TRAIN_COLUMNS"])

driver_df = Entity.query(ENTITY_TYPE, TRAINING_PREDICATE, SCHEMA, ENTITY_ID_COL)

print(f"Loaded driver records: {driver_df.count()}")
driver_df.select(ENTITY_ID_COL, *TRAIN_COLUMNS).show(5)

Expected output

Loaded driver records: 100
+----------+------------+-------------+---------------+-----+
|driver_id |ds_decl_rate|ds_avg_rating|da_trips_today |label|
+----------+------------+-------------+---------------+-----+
|driver_001|       0.037|         4.78|              9|    0|
|driver_002|       0.182|         4.21|              3|    1|
|driver_003|       0.052|         4.65|             11|    0|
|driver_004|       0.028|         4.89|              7|    0|
|driver_005|       0.145|         4.15|              5|    1|
+----------+------------+-------------+---------------+-----+
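The guard at the top of Cell 4 fails fast when notebook variables from an earlier cell are missing. The real require_part2 helper is defined in Part 2 and its implementation is not shown here, so the following is only a hypothetical sketch of that pattern. The names missing_names and require_names, and the sample namespace values, are invented for illustration.

```python
def missing_names(required, namespace):
    """Return the required variable names absent from `namespace`, in order."""
    return [name for name in required if name not in namespace]


def require_names(required, namespace):
    """Raise a descriptive error if any required notebook variable is missing."""
    missing = missing_names(required, namespace)
    if missing:
        raise NameError(
            f"Missing notebook variables: {', '.join(missing)}. "
            "Re-run the setup cell."
        )


# Hypothetical namespace standing in for globals() in a notebook session.
notebook_vars = {
    "SCHEMA": "demo",
    "ENTITY_TYPE": "driver",
    "ENTITY_ID_COL": "driver_id",
}

print(missing_names(["SCHEMA", "ENTITY_TYPE", "TRAIN_COLUMNS"], notebook_vars))
# prints ['TRAIN_COLUMNS']
```

In a notebook you would pass globals() as the namespace. Reporting every missing name at once, rather than failing on the first, saves a round of re-running cells.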

Check feature patterns by label

Compute a quick summary of average feature values by label to confirm that the features separate the two classes.

Run Cell 5 to compute average feature values per label.

Cell 5: Compute average feature values per label

from pyspark.sql.functions import avg, round as spark_round

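# FEATURE_COLUMNS is expected to hold the three feature column names
# in this order: decline rate, average rating, trips today.
decline_col, rating_col, trips_col = FEATURE_COLUMNS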

driver_df.groupBy(LABEL_COL).agg(
    spark_round(avg(decline_col), 3).alias('avg_decline'),
    spark_round(avg(rating_col), 2).alias('avg_rating'),
    spark_round(avg(trips_col), 1).alias('avg_trips')
).orderBy(LABEL_COL).show()

Expected output

+-----+-----------+----------+---------+
|label|avg_decline|avg_rating|avg_trips|
+-----+-----------+----------+---------+
|    0|      0.045|      4.73|      6.7|
|    1|      0.175|      4.27|      4.8|
+-----+-----------+----------+---------+

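On average, the higher-risk class (label=1) shows a higher decline rate (0.175 vs. 0.045), a lower rating (4.27 vs. 4.73), and fewer trips today (4.8 vs. 6.7). All three features separate the two classes, so they look useful for predicting decline risk.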
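If you want to see what the grouped average does without a Spark session, the same computation can be sketched in plain Python. This example uses only the five preview rows from the Cell 4 expected output, so its averages will not match the 100-row table above. The tuple layout and the summary dictionary are choices made for this sketch, not part of the tutorial's code.

```python
from collections import defaultdict
from statistics import mean

# The five preview rows from the Cell 4 expected output:
# (driver_id, decline_rate, avg_rating, trips_today, label)
rows = [
    ("driver_001", 0.037, 4.78, 9, 0),
    ("driver_002", 0.182, 4.21, 3, 1),
    ("driver_003", 0.052, 4.65, 11, 0),
    ("driver_004", 0.028, 4.89, 7, 0),
    ("driver_005", 0.145, 4.15, 5, 1),
]

# Group feature values by label (the plain-Python analogue of groupBy).
by_label = defaultdict(list)
for _, decline, rating, trips, label in rows:
    by_label[label].append((decline, rating, trips))

# Average each feature within each label (the analogue of agg(avg(...))).
summary = {}
for label, values in sorted(by_label.items()):
    declines, ratings, trips = zip(*values)
    summary[label] = {
        "n": len(values),
        "avg_decline": mean(declines),
        "avg_rating": mean(ratings),
        "avg_trips": mean(trips),
    }

for label, stats in summary.items():
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

Even on this tiny sample the direction of each gap matches the full summary: label 1 has the higher decline rate, the lower rating, and fewer trips.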

Next, you’ll define the training Dataset: the reproducible definition of what data to train on.