Training a classification model
With train_df and test_df ready, you can train a baseline classifier in a few lines.
For this tutorial, you’ll train a Logistic Regression model, which trains quickly and performs well on this type of dataset.
In plain terms, the model learns from historical examples where label is already known, then estimates the probability that a new driver belongs to the higher-risk class.
It does this by learning how strongly each input feature is associated with the final outcome.
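As a rough illustration of that idea, the sketch below computes a logistic-regression probability by hand in plain Python. The feature meanings, weights, and intercept are made up for illustration and are not taken from the tutorial's dataset or the trained model; Spark's LogisticRegression learns the real values during fitting.

```python
import math

def predict_higher_risk_probability(features, weights, intercept):
    """Return P(label = 1) for one example under a binary logistic model."""
    # Weighted sum of the inputs: a larger absolute weight means a
    # stronger association between that feature and the outcome.
    z = intercept + sum(w * x for w, x in zip(weights, features))
    # The logistic (sigmoid) function maps z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, for illustration only.
weights = [0.8, -0.5]   # e.g. a hard-braking rate and years of experience
intercept = -1.0
driver = [2.0, 1.0]     # one driver's (already scaled) feature values

p = predict_higher_risk_probability(driver, weights, intercept)
print(f"P(higher risk) = {p:.3f}")
```

A positive weight pushes the probability toward the higher-risk class as the feature grows, and a negative weight pushes it the other way.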
Train the model
- Run Cell 11 to configure and train a Logistic Regression model.
Cell 11: Configure and train a LogisticRegression model
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(
    featuresCol='features',
    labelCol='label',
    maxIter=100,
    regParam=0.01
)
lr_model = lr.fit(train_df)
print("Logistic Regression model trained")

Logistic Regression model trained

Score the test set and evaluate accuracy
Use the trained model to generate predictions on unseen data, then summarize overall quality with one metric.
- Run Cell 12 to generate predictions and evaluate accuracy.
Cell 12: Generate predictions and evaluate accuracy
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
lr_predictions = lr_model.transform(test_df)
lr_predictions.select('driver_id', 'label', 'prediction', 'probability').show(5, truncate=False)
accuracy = MulticlassClassificationEvaluator(
    labelCol='label',
    predictionCol='prediction',
    metricName='accuracy'
).evaluate(lr_predictions)
print(f"Accuracy: {accuracy:.2%}")

+----------+-----+----------+------------------------------------------+
|driver_id |label|prediction|probability                               |
+----------+-----+----------+------------------------------------------+
|driver_007|0    |0.0       |[0.9412345678901234,0.05876543210987654]  |
|driver_012|1    |1.0       |[0.12345678901234567,0.8765432109876543]  |
|driver_019|0    |0.0       |[0.8876543210987654,0.11234567890123456]  |
|driver_023|0    |0.0       |[0.9654321098765432,0.034567890123456784] |
|driver_031|1    |1.0       |[0.21456789012345678,0.7854321098765432]  |
+----------+-----+----------+------------------------------------------+
Accuracy: 94.44%

In this output, prediction is the model’s class decision (0 for typical risk, 1 for higher risk).
probability shows confidence for both classes, where index 1 is the higher-risk probability.
accuracy compares all test-set prediction values against label and returns correct_predictions / total_predictions.
This gives you a clear baseline before moving into model persistence and serving.
If accuracy is very low, it usually means either the model did not learn useful signal or there is a data/label issue upstream. Your exact values might vary slightly from these examples depending on your Spark version and partitioning. This single metric is enough for this tutorial to continue into deployment-focused steps.
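To make the accuracy calculation concrete, here is a minimal plain-Python sketch that recomputes correct_predictions / total_predictions from (label, prediction) pairs. The helper name is hypothetical, and the five pairs are copied from the sample rows in the output above, so the result covers only those rows and is not the full test-set accuracy.

```python
def accuracy_from_pairs(pairs):
    """Return correct_predictions / total_predictions for (label, prediction) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction to compute accuracy")
    # A prediction counts as correct when it equals the known label.
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)

# (label, prediction) for the five sample rows shown in the output.
sample_rows = [(0, 0.0), (1, 1.0), (0, 0.0), (0, 0.0), (1, 1.0)]

sample_accuracy = accuracy_from_pairs(sample_rows)
print(f"Accuracy on sample rows: {sample_accuracy:.2%}")
```

All five sample rows happen to be predicted correctly. The evaluator in Cell 12 applies the same ratio to every row of lr_predictions, which is why the reported test-set accuracy can be lower.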
Next, you’ll save the trained model artifact for reuse in Part 3 serving workflows.