Exploring features
Before training, explore the feature catalog and inspect the data. You already know what’s in this feature store from Part 1, but a new teammate would run these same catalog queries to learn which driver features are available.
Feature discovery
Run two queries to confirm what features are available. Since your goal is predicting information about drivers, look for driver-related features.
Query the FeatureGroup class to see which feature groups produce features for the driver Entity type.
Then query the Feature class to list the individual features in feature groups whose names start with driver.
- Run Cell 3 to query feature groups and features for drivers.
Cell 3: Query feature groups and features for drivers
# 1) Which feature groups produce driver features?
FeatureGroup.query("attrs.entity == 'driver'") \
    .select("name", "description", "source") \
    .show(truncate=50)

# 2) Which individual features are available in those groups?
Feature.query("fgname like 'driver%'") \
    .select("fid", "type", "description") \
    .show(truncate=50)

+---------------+------------------------------------------+------------------------------+
|name           |description                               |source                        |
+---------------+------------------------------------------+------------------------------+
|driver-stats   |Driver performance metrics...             |kafka://events.rides.completed|
|driver-activity|Driver activity metrics from GPS pipeline |kafka://events.gps.pings      |
+---------------+------------------------------------------+------------------------------+

+-------------------------------+-------+-------------------------------------------+
|fid                            |type   |description                                |
+-------------------------------+-------+-------------------------------------------+
|driver-stats_decline_rate      |double |Fraction of ride requests declined...      |
|driver-stats_avg_rating        |double |Average rider rating for driver...         |
|driver-activity_trips_today    |integer|Number of completed trips today            |
+-------------------------------+-------+-------------------------------------------+

In a real environment you may have to sort through many groups and features, but here the catalog is small enough to scan in one query.
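If you do not have the feature store at hand, the two filters can be mimicked on plain Python records. This is only a sketch: the records copy the names from the output above, while the `entity` and `fgname` values attached to each record are assumptions inferred from that output. The helper names `groups_for_entity` and `features_with_group_prefix` are hypothetical and are not part of the feature store API.

```python
# Hypothetical in-memory stand-in for the catalog; NOT the feature store API.
# Names are copied from the output above; "entity" and "fgname" are assumed.
feature_groups = [
    {"name": "driver-stats", "entity": "driver",
     "source": "kafka://events.rides.completed"},
    {"name": "driver-activity", "entity": "driver",
     "source": "kafka://events.gps.pings"},
]
features = [
    {"fid": "driver-stats_decline_rate", "fgname": "driver-stats", "type": "double"},
    {"fid": "driver-stats_avg_rating", "fgname": "driver-stats", "type": "double"},
    {"fid": "driver-activity_trips_today", "fgname": "driver-activity", "type": "integer"},
]


def groups_for_entity(groups, entity):
    """Mimic an equality predicate on the group's entity type."""
    return [g["name"] for g in groups if g["entity"] == entity]


def features_with_group_prefix(feats, prefix):
    """Mimic a prefix match on the feature group name (like 'prefix%')."""
    return [f["fid"] for f in feats if f["fgname"].startswith(prefix)]


driver_groups = groups_for_entity(feature_groups, "driver")
driver_features = features_with_group_prefix(features, "driver")
print(driver_groups)
print(driver_features)
```

The second filter matches on the feature group name, not the feature ID, so a feature in a group named `rider-stats` would be excluded even if its ID mentioned drivers.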
Load the labeled feature data
Now load the actual records to see what the training data looks like.
- Run Cell 4 to load the labeled driver data.
Cell 4: Load labeled driver data
if "require_part2" not in globals():
    raise ValueError("Missing Part 2 bootstrap helper. Re-run Part 2 Cell 1.")

require_part2(["SCHEMA", "ENTITY_TYPE", "ENTITY_ID_COL", "TRAINING_PREDICATE", "TRAIN_COLUMNS"])

driver_df = Entity.query(ENTITY_TYPE, TRAINING_PREDICATE, SCHEMA, ENTITY_ID_COL)

print(f"Loaded driver records: {driver_df.count()}")
driver_df.select(ENTITY_ID_COL, *TRAIN_COLUMNS).show(5)

Loaded driver records: 100
+----------+------------+-------------+---------------+-----+
|driver_id |ds_decl_rate|ds_avg_rating|da_trips_today |label|
+----------+------------+-------------+---------------+-----+
|driver_001|       0.037|         4.78|              9|    0|
|driver_002|       0.182|         4.21|              3|    1|
|driver_003|       0.052|         4.65|             11|    0|
|driver_004|       0.028|         4.89|              7|    0|
|driver_005|       0.145|         4.15|              5|    1|
+----------+------------+-------------+---------------+-----+

Check feature patterns by label
Compute a quick summary of average feature values by label to check whether the features separate the two classes.
- Run Cell 5 to compute average feature values per label.
Cell 5: Compute average feature values per label
from pyspark.sql.functions import avg, round as spark_round
decline_col, rating_col, trips_col = FEATURE_COLUMNS
driver_df.groupBy(LABEL_COL).agg(
    spark_round(avg(decline_col), 3).alias('avg_decline'),
    spark_round(avg(rating_col), 2).alias('avg_rating'),
    spark_round(avg(trips_col), 1).alias('avg_trips')
).orderBy(LABEL_COL).show()

+-----+-----------+----------+---------+
|label|avg_decline|avg_rating|avg_trips|
+-----+-----------+----------+---------+
|    0|      0.045|      4.73|      6.7|
|    1|      0.175|      4.27|      4.8|
+-----+-----------+----------+---------+

The higher-risk class (label=1) shows higher decline rate and lower rating on average.
The features look useful for predicting decline risk.
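The per-label averaging can also be reproduced without Spark, which may help when checking the logic on a small extract. The sketch below uses only the five sample rows printed by Cell 4, so its averages differ from the 100-record summary. The function and variable names are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

# The five sample rows from the Cell 4 output:
# (driver_id, decline_rate, avg_rating, trips_today, label)
rows = [
    ("driver_001", 0.037, 4.78, 9, 0),
    ("driver_002", 0.182, 4.21, 3, 1),
    ("driver_003", 0.052, 4.65, 11, 0),
    ("driver_004", 0.028, 4.89, 7, 0),
    ("driver_005", 0.145, 4.15, 5, 1),
]


def averages_by_label(records):
    """Group records by label and average each feature within a group."""
    grouped = defaultdict(list)
    for _, decline, rating, trips, label in records:
        grouped[label].append((decline, rating, trips))

    summary = {}
    for label, values in sorted(grouped.items()):
        declines, ratings, trips = zip(*values)
        summary[label] = {
            "avg_decline": mean(declines),
            "avg_rating": mean(ratings),
            "avg_trips": mean(trips),
        }
    return summary


summary = averages_by_label(rows)
for label, stats in summary.items():
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

On this small sample, label 1 again has the higher average decline rate and the lower average rating, which is consistent with the full summary.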
Next, you’ll define the training Dataset: the reproducible definition of what data to train on.