---
title: "Training a classification model"
description: "Train and evaluate a simple baseline model with Spark MLlib."
---

# Training a classification model

With `train_df` and `test_df` ready, you can train a baseline classifier in a few lines.

For this tutorial, you’ll train a Logistic Regression model, which trains quickly and performs well on this type of dataset. In plain terms, the model learns from historical examples where `label` is already known, then estimates the probability that a new driver belongs to the higher-risk class. It does this by learning a weight for each input feature that reflects how strongly that feature is associated with the outcome.
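As a rough sketch of what "learning how strongly each feature is associated with the outcome" means, the snippet below scores one driver by hand in plain Python. The weights, intercept, and feature values are hypothetical and chosen only for illustration. A trained model would supply its own learned values.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, intercept):
    """Weighted sum of the features plus an intercept, passed through the sigmoid."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical learned parameters and one hypothetical driver (illustration only).
weights = [0.8, -0.5]
intercept = -0.2
driver_features = [1.5, 0.4]

p_higher_risk = predict_probability(driver_features, weights, intercept)
print(f"P(higher risk) = {p_higher_risk:.3f}")
```

A positive weight pushes the probability toward the higher-risk class as that feature grows, and a negative weight pushes it away.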

## Train the model

1.  Run `Cell 11` to configure and train a Logistic Regression model.

Cell 11: Configure and train a LogisticRegression model

```python
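from pyspark.ml.classification import LogisticRegression

# Configure the estimator: which columns to read and how to fit.
lr = LogisticRegression(
    featuresCol='features',  # assembled feature vector column
    labelCol='label',        # known outcome column
    maxIter=100,             # maximum number of optimizer iterations
    regParam=0.01            # regularization strength
)

# Fit on the training split only; test_df stays unseen until evaluation.
lr_model = lr.fit(train_df)

print("Logistic Regression model trained")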
```

Train Logistic Regression

```plaintext
Logistic Regression model trained
```

## Score the test set and evaluate accuracy

Use the trained model to generate predictions on unseen data, then summarize overall quality with one metric.

1.  Run `Cell 12` to generate predictions and evaluate accuracy.

Cell 12: Generate predictions and evaluate accuracy

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Score the held-out test set; transform() adds prediction and probability columns.
lr_predictions = lr_model.transform(test_df)

lr_predictions.select('driver_id', 'label', 'prediction', 'probability').show(5, truncate=False)

# Compare each prediction with its label and compute the fraction that match.
accuracy = MulticlassClassificationEvaluator(
    labelCol='label',
    predictionCol='prediction',
    metricName='accuracy'
).evaluate(lr_predictions)

print(f"Accuracy: {accuracy:.2%}")
```

Predictions and accuracy

```plaintext
+----------+-----+----------+------------------------------------------+
|driver_id |label|prediction|probability                               |
+----------+-----+----------+------------------------------------------+
|driver_007|0    |0.0       |[0.9412345678901234,0.05876543210987654]  |
|driver_012|1    |1.0       |[0.12345678901234567,0.8765432109876543]  |
|driver_019|0    |0.0       |[0.8876543210987654,0.11234567890123456]  |
|driver_023|0    |0.0       |[0.9654321098765432,0.034567890123456784] |
|driver_031|1    |1.0       |[0.21456789012345678,0.7854321098765432]  |
+----------+-----+----------+------------------------------------------+

Accuracy: 94.44%
```

In this output, `prediction` is the model’s class decision (`0` for typical risk, `1` for higher risk). `probability` holds the model’s confidence for both classes: index 0 is the typical-risk probability and index 1 is the higher-risk probability. The `accuracy` metric compares every test-set `prediction` value against `label` and returns `correct_predictions / total_predictions`. This gives you a clear baseline before moving into model persistence and serving.
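To make the relationship between `probability`, `prediction`, and accuracy concrete, here is a minimal plain-Python sketch that reproduces the calculation on the five sample rows shown above. It assumes the class decision is simply the index with the higher probability, which matches a 0.5 decision threshold for two classes. The probabilities are rounded to four decimal places.

```python
# (label, [P(class 0), P(class 1)]) for the five sample rows, rounded.
rows = [
    (0, [0.9412, 0.0588]),  # driver_007
    (1, [0.1235, 0.8765]),  # driver_012
    (0, [0.8877, 0.1123]),  # driver_019
    (0, [0.9654, 0.0346]),  # driver_023
    (1, [0.2146, 0.7854]),  # driver_031
]


def predict_class(probability):
    """Return the index of the class with the highest probability."""
    return max(range(len(probability)), key=lambda i: probability[i])


labels = [label for label, _ in rows]
predictions = [predict_class(p) for _, p in rows]

# Accuracy is the share of rows where the prediction equals the label.
correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
accuracy = correct / len(labels)
print(f"Accuracy on these five rows: {accuracy:.2%}")
```

These five rows are only the preview printed by `show(5)`, so their accuracy will not necessarily match the figure computed over the full test set.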

If accuracy is very low, it usually means either the model did not learn useful signal or there is a data/label issue upstream. Your exact values might vary slightly from these examples depending on your Spark version and partitioning. This single metric is enough for this tutorial to continue into deployment-focused steps.
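One way to judge whether an accuracy value is "very low" is to compare it with a model that always predicts the most common class. The sketch below computes that reference point in plain Python. The label list is hypothetical and is not taken from the tutorial dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical test-set labels (illustration only).
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline: {baseline:.2%}")
```

If the trained model scores close to or below this reference, that points toward the missing-signal or data/label problems described above. In the notebook, `test_df.groupBy('label').count()` should give you the class counts needed for the same comparison.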

::: serving latency reality
Model inference itself is often fast. In production, a bigger challenge is retrieving the right feature values quickly for each request. If feature retrieval is slow, end-to-end prediction latency will miss your budget even when the model is efficient. Part 3 focuses on this bottleneck by using Aerospike primary-key lookups for low-latency feature reads.
:::

Next, you’ll save the trained model artifact for reuse in Part 3 serving workflows.

::: undefined
-   I can train a Logistic Regression model with Spark MLlib.
-   I can evaluate a held-out test set with one baseline metric.
:::
