Training a classification model

With train_df and test_df ready, you can train a baseline classifier in a few lines.

For this tutorial, you’ll train a Logistic Regression model, which trains quickly and performs well on this type of dataset. In plain terms, the model learns from historical examples where the label is already known, then estimates the probability that a new driver belongs to the higher-risk class. It does this by learning a weight for each input feature that reflects how strongly that feature is associated with the outcome.
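
To make the idea concrete, here is a minimal sketch of how a logistic regression model turns feature values into a probability. It is plain Python, not Spark code, and the weights, intercept, and feature values are hypothetical numbers chosen only for illustration; the model you train in Cell 11 learns its own values from train_df.

```python
import math

def predict_probability(features, weights, intercept):
    """Return the estimated probability of the positive class (label 1)."""
    # Weighted sum of the inputs: a larger absolute weight means the feature
    # has a stronger association with the outcome.
    z = intercept + sum(w * x for w, x in zip(weights, features))
    # The logistic (sigmoid) function maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and feature values, for illustration only.
weights = [0.8, -0.5]
intercept = -0.2
p_higher_risk = predict_probability([1.5, 0.4], weights, intercept)
print(f"P(higher risk) = {p_higher_risk:.3f}")
```

A positive weight pushes the probability toward the higher-risk class as the feature grows, and a negative weight pushes it away.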

Train the model

Run Cell 11 to configure and train a Logistic Regression model.

Cell 11: Configure and train a LogisticRegression model

from pyspark.ml.classification import LogisticRegression

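lr = LogisticRegression(
    featuresCol='features',  # vector column of input features
    labelCol='label',        # known outcome column (0 or 1)
    maxIter=100,             # upper bound on optimizer iterations
    regParam=0.01            # regularization strength (L2 with the default elasticNetParam=0.0)
)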

lr_model = lr.fit(train_df)
print("Logistic Regression model trained")

Train Logistic Regression

Logistic Regression model trained

Score the test set and evaluate accuracy

Use the trained model to generate predictions on unseen data, then summarize overall quality with one metric.

Run Cell 12 to generate predictions and evaluate accuracy.

Cell 12: Generate predictions and evaluate accuracy

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

lr_predictions = lr_model.transform(test_df)
lr_predictions.select('driver_id', 'label', 'prediction', 'probability').show(5, truncate=False)

accuracy = MulticlassClassificationEvaluator(
    labelCol='label',
    predictionCol='prediction',
    metricName='accuracy'
).evaluate(lr_predictions)

print(f"Accuracy: {accuracy:.2%}")

Predictions and accuracy

+----------+-----+----------+------------------------------------------+
|driver_id |label|prediction|probability                               |
+----------+-----+----------+------------------------------------------+
|driver_007|0    |0.0       |[0.9412345678901234,0.05876543210987654]  |
|driver_012|1    |1.0       |[0.12345678901234567,0.8765432109876543]  |
|driver_019|0    |0.0       |[0.8876543210987654,0.11234567890123456]  |
|driver_023|0    |0.0       |[0.9654321098765432,0.034567890123456784] |
|driver_031|1    |1.0       |[0.21456789012345678,0.7854321098765432]  |
+----------+-----+----------+------------------------------------------+
Accuracy: 94.44%

In this output, prediction is the model’s class decision (0 for typical risk, 1 for higher risk). probability holds the model’s estimated probability for each class: index 0 is typical risk and index 1 is higher risk. accuracy compares every test-set prediction against its label and returns correct_predictions / total_predictions. This gives you a clear baseline before you move on to model persistence and serving.
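
The accuracy calculation and the probability indexing can be sketched in plain Python. The rows below are hypothetical (label, prediction, probability) values, not the tutorial’s data, and one of them is deliberately misclassified so the result is not trivially 100%.

```python
# Hypothetical (label, prediction, probability) rows, for illustration only.
rows = [
    (0, 0.0, [0.94, 0.06]),
    (1, 1.0, [0.12, 0.88]),
    (0, 0.0, [0.89, 0.11]),
    (1, 0.0, [0.61, 0.39]),  # misclassified: label is 1, prediction is 0
    (1, 1.0, [0.21, 0.79]),
]

def accuracy(rows):
    """Return correct_predictions / total_predictions."""
    correct = sum(1 for label, prediction, _ in rows if label == prediction)
    return correct / len(rows)

# Index 1 of each probability vector is the higher-risk probability.
higher_risk_probs = [probability[1] for _, _, probability in rows]

acc = accuracy(rows)
print(f"Accuracy: {acc:.2%}")
print(f"Higher-risk probabilities: {higher_risk_probs}")
```

In these sample rows, each prediction is 1.0 exactly when the index-1 probability exceeds 0.5, which matches Spark’s default decision threshold for binary logistic regression.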

If accuracy is very low, it usually means either that the model did not learn a useful signal or that there is a data or label issue upstream. Your exact values might vary slightly from these examples depending on your Spark version and partitioning. For this tutorial, this single metric is enough to continue to the deployment-focused steps.
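
One rough way to judge whether an accuracy value is low is to compare it with the accuracy of always predicting the most common label. The sketch below is an optional sanity check in plain Python, not a step in this tutorial; the labels list is hypothetical, and in practice you would use the label values from your own test set.

```python
from collections import Counter

def majority_class_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, most_common_count = counts.most_common(1)[0]
    return most_common_count / len(labels)

# Hypothetical test-set labels, for illustration only.
labels = [0, 0, 0, 1, 0, 1, 0, 0]
baseline = majority_class_baseline(labels)
print(f"Majority-class baseline: {baseline:.2%}")
```

If the model’s accuracy is at or below this baseline, that suggests it has not learned a useful signal from the features.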

Next, you’ll save the trained model artifact for reuse in the Part 3 serving workflows.