Set up Spark

This page walks through setting up an environment to use Spark and the Aerospike Spark connector from a Jupyter notebook. Jupyter is an interactive computing environment: you write and run Python code in a browser-based notebook, one cell at a time, and see the results immediately. If you are new to notebooks, see Aerospike notebook tips and Jupyter basics.

A notebook shows each step of the script separately, so you can rerun and troubleshoot a single section if anything goes wrong.

You run this feature store in a single Jupyter notebook, running the pre-filled cells in order as you follow along on these pages. In production, you’d replace the notebook with scheduled jobs.

Prepare your project directory

Create the project directory and enter it:
```
mkdir feature-store-tutorial
cd feature-store-tutorial
```
Download the Aerospike Spark connector JAR:

Download the connector from Aerospike Enterprise downloads using the Spark/Scala versions from the previous page. For this tutorial, use the Spark 3.5 / Scala 2.13 connector build:
```
aerospike-spark-5.0.1-spark3.5-scala2.13-clientunshaded.jar
```
Download the tutorial notebook into the same directory:

Download feature_store_tutorial.ipynb and place it in feature-store-tutorial/. It includes pre-filled setup and class-definition cells for Part 1.

Confirm your directory structure:
- feature-store-tutorial/
  - aerospike-spark-5.0.1-spark3.5-scala2.13-clientunshaded.jar (Spark connector)
  - feature_store_tutorial.ipynb (tutorial notebook)
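
If you prefer to check the layout programmatically, the following is a minimal sketch. The helper name `missing_project_files` is hypothetical and is not part of the tutorial notebook; it reuses the connector file-name pattern from the notebook's configuration cell and assumes you run it from inside the project directory.

```python
from pathlib import Path


def missing_project_files(project_dir):
    """Return descriptions of the expected tutorial files that are absent."""
    root = Path(project_dir)
    missing = []
    # Any connector build matching the pattern used by the configuration cell.
    if not list(root.glob("aerospike-spark-*-clientunshaded.jar")):
        missing.append("Aerospike Spark connector JAR")
    if not (root / "feature_store_tutorial.ipynb").is_file():
        missing.append("feature_store_tutorial.ipynb")
    return missing


# An empty list means both files are in place.
print(missing_project_files("."))
```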

Set up your Jupyter notebook

Start JupyterLab from your project directory:

```
cd feature-store-tutorial
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install jupyterlab
export JAVA_HOME="$(brew --prefix openjdk@17)/libexec/openjdk.jdk/Contents/Home"
export SPARK_HOME="$HOME/spark/spark-3.5.3-bin-hadoop3-scala2.13"
export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"
jupyter lab
```

This opens the Jupyter interface in your browser with Python 3.11, JDK 17, and Spark 3.5.3 available to the notebook kernel.

Open the provided notebook (feature_store_tutorial.ipynb).

You do not need to create a blank notebook or copy any cells manually.
Install the required Python packages (required one-time step):

Run the pre-filled package-install cell near the top of the notebook:
```
import sys
print(sys.version)
if not sys.version.startswith("3.11."):
    raise RuntimeError("Use a Python 3.11 notebook kernel for this tutorial.")
```
```
%pip install pyspark==3.5.3 findspark
```
- pyspark is the Python API for Apache Spark, letting you work with distributed data using familiar Python syntax.
- findspark helps Jupyter locate your Spark installation so you can import PySpark modules.
Run the cell (Shift+Enter), then restart the kernel (Kernel → Restart) so the packages are available.
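
After the restart, one way to confirm that the kernel can see both packages is a quick import check. This is a sketch rather than a cell from the tutorial notebook; `missing_packages` is a hypothetical helper built on the standard library's `importlib.util.find_spec`, and it only handles top-level module names.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that this kernel cannot import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# An empty list means both packages are importable from this kernel.
print(missing_packages(["pyspark", "findspark"]))
```

If either name is printed, the install most likely went into a different Python environment than the one backing the notebook kernel.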

Run the pre-filled configuration cell in the ## Setup section:

Centralizing configuration at the top of your notebook makes it straightforward to adjust settings without hunting through code. This cell defines three groups of settings:

```
import os
import glob
from pathlib import Path

# Auto-detect JDK 17 and Spark 3.5.3 from common install locations.
_JAVA_CANDIDATES = [
    os.environ.get('JAVA_HOME', ''),
    '/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home',
    '/usr/local/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home',
]
JAVA_HOME = next((p for p in _JAVA_CANDIDATES if p and Path(p).exists()), None)
if not JAVA_HOME:
    raise EnvironmentError(
        "JDK 17 not found. Set JAVA_HOME before starting JupyterLab."
    )
os.environ['JAVA_HOME'] = JAVA_HOME
os.environ['PATH'] = f"{Path(JAVA_HOME) / 'bin'}{os.pathsep}{os.environ['PATH']}"

_SPARK_CANDIDATES = [
    os.environ.get('SPARK_HOME', ''),
    str(Path.home() / 'spark' / 'spark-3.5.3-bin-hadoop3-scala2.13'),
    '/opt/homebrew/opt/apache-spark/libexec',  # macOS Homebrew (Apple Silicon)
    '/usr/local/opt/apache-spark/libexec',     # macOS Homebrew (Intel)
    '/opt/spark',                              # Linux common path
]
SPARK_HOME = next((p for p in _SPARK_CANDIDATES if p and Path(p).exists()), None)
if not SPARK_HOME:
    raise EnvironmentError(
        "Spark not found. Set SPARK_HOME to Spark 3.5.3 with Scala 2.13."
    )
os.environ['SPARK_HOME'] = SPARK_HOME
os.environ.setdefault('SPARK_LOCAL_IP', '127.0.0.1')

# Aerospike connection settings
AS_HOST = '127.0.0.1'
AS_PORT = 3000
AS_NAMESPACE = 'test'

# Auto-detect the Aerospike Spark connector JAR in the working directory
_jar_hits = sorted(glob.glob('aerospike-spark-*-clientunshaded.jar'))
if not _jar_hits:
    raise FileNotFoundError(
        "No Aerospike Spark connector JAR found in the working directory. "
        "Download it from https://aerospike.com/download/connector/spark/"
    )
AEROSPIKE_JAR = _jar_hits[-1]

print(f"JAVA_HOME: {JAVA_HOME}")
print(f"SPARK_HOME: {SPARK_HOME}")
print(f"Connector JAR: {AEROSPIKE_JAR}")
```

- Java and Spark paths: The cell checks common install locations automatically. If none match, it raises an error with instructions. To override the auto-detection, set the JAVA_HOME and SPARK_HOME environment variables before starting Jupyter.
- Aerospike connection: Point to your Docker container running Aerospike. Port 3000 is the default client port.
- Connector JAR: The cell finds the JAR by glob pattern. If several JARs match, it picks the one whose file name sorts last alphabetically.
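
Alphabetical order and version order can disagree once a version component reaches two digits. The sketch below illustrates the difference and shows one possible version-aware alternative. The helpers `connector_version` and `newest_jar` are hypothetical, and the two file names use made-up version numbers purely for illustration.

```python
import re


def connector_version(jar_name):
    """Extract the connector version from a JAR file name as a tuple of ints."""
    match = re.search(r"aerospike-spark-(\d+(?:\.\d+)*)", jar_name)
    if match is None:
        raise ValueError(f"No version found in {jar_name!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def newest_jar(jar_names):
    """Pick the JAR with the highest numeric version."""
    return max(jar_names, key=connector_version)


# Hypothetical file names, used only to show the ordering difference.
jars = [
    "aerospike-spark-5.0.9-spark3.5-scala2.13-clientunshaded.jar",
    "aerospike-spark-5.0.10-spark3.5-scala2.13-clientunshaded.jar",
]
print(sorted(jars)[-1])   # alphabetical pick: the 5.0.9 file
print(newest_jar(jars))   # version-aware pick: the 5.0.10 file
```

With a single connector JAR in the directory, as in this tutorial, both approaches return the same file.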

Run the pre-filled Spark initialization cell:

This cell creates a SparkSession, the entry point for all Spark functionality, and configures it to use the Aerospike connector.

```
import findspark
findspark.init(SPARK_HOME)

from pyspark.sql import SparkSession

# Build path to the connector JAR
jar_path = Path.cwd() / AEROSPIKE_JAR

# Create SparkSession with the Aerospike connector by configuring it to use the JAR file at the specified path during startup
spark = SparkSession.builder \
    .appName('FeatureStoreTutorial') \
    .config('spark.jars', str(jar_path)) \
    .config('aerospike.seed-nodes', f'{AS_HOST}:{AS_PORT}') \
    .config('aerospike.namespace', AS_NAMESPACE) \
    .config('aerospike.sindex-enable', 'false') \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
log4j = spark._jvm.org.apache.log4j
for logger_name in [
    "com.aerospike.spark",
    "CustomSIndexFilterProvider",
    "AerospikeBatchRecordWriter",
    "AerospikeConfig",
]:
    log4j.Logger.getLogger(logger_name).setLevel(log4j.Level.ERROR)

print(f"Spark version: {spark.version}")
print(f"Aerospike JAR: {jar_path}")
```

Key points:

- findspark.init() configures the Python environment so PySpark imports work correctly.
- spark.jars tells Spark to load the Aerospike connector JAR at startup.
- aerospike.seed-nodes and aerospike.namespace set default connection parameters for all Aerospike operations.
- getOrCreate() either creates a new session or reuses an existing one (useful when re-running cells).

When you run the cell, it should print the Spark version and connector JAR path. Confirm it prints Spark 3.5.x; the tutorial uses Spark 3.5.3 with Scala 2.13 and JDK 17.
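
To make the version check explicit rather than visual, you could compare the major and minor components of the version string. This is a small sketch; `is_spark_35` is a hypothetical helper, not something the notebook or PySpark provides.

```python
def is_spark_35(version):
    """Return True when a version string such as '3.5.3' is a Spark 3.5.x release."""
    return version.split(".")[:2] == ["3", "5"]


# Stand-in value; in the notebook you would pass spark.version instead.
print(is_spark_35("3.5.3"))
```

In the notebook, `assert is_spark_35(spark.version)` would stop execution early if a different Spark release had been picked up.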

Run a round-trip Spark test

Now test that Spark can communicate with Aerospike end to end: create one row, write it, then read it back. This tutorial was validated with Python 3.11. If your notebook kernel reports any other Python version, switch to a Python 3.11 kernel before continuing.
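
Before running the round trip, it can help to rule out basic connectivity problems. The following sketch uses only the Python standard library; `aerospike_port_open` is a hypothetical helper, and its defaults mirror the host and port from the configuration cell. A successful connection only shows that something is listening on the port, not that it is a healthy Aerospike node.

```python
import socket


def aerospike_port_open(host="127.0.0.1", port=3000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


print(aerospike_port_open())
```

If this prints False, check that the Aerospike Docker container is running and that port 3000 is published to the host before debugging Spark itself.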

Create a one-row DataFrame:

This creates the input row used by the write step.

```
from pyspark.sql import Row

test_df = spark.createDataFrame([
    Row(__key='test_001', message='Hello from Spark!')
])

print("Prepared one test record for Aerospike write")
```

Run the cell that writes the test record:

This saves the DataFrame to Aerospike. Aerospike is schemaless: instead of table rows and columns, data is stored as records (in a namespace and set) with named bins inside each record.
```
test_df.write \
    .mode('overwrite') \
    .format('aerospike') \
    .option('aerospike.write-set', 'spark-test') \
    .option('aerospike.write-with-key', '__key') \
    .save()

print("Successfully wrote test record to Aerospike!")
```
What’s happening here:
- Row(__key='test_001', ...) creates a record with a primary key and a data field.
- format('aerospike') tells Spark to use the Aerospike connector.
- aerospike.write-set specifies which Aerospike set to store the data in.
- aerospike.write-with-key identifies which DataFrame column contains the primary key.
Run the read-back cell:

Finally, read the data back from Aerospike to confirm the full round-trip works.
```
result_df = spark.read \
    .format('aerospike') \
    .option('aerospike.read-set', 'spark-test') \
    .load()

result_df.show()
```
This queries the spark-test set and returns the results as a DataFrame. The .show() method prints the data as a formatted table.

If the read-back shows the test_001 row with __key and message, your setup is ready for the tutorial.
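
If you want the notebook to confirm the round trip without reading the table by eye, one option is to check the collected rows in plain Python. This is a sketch: `has_record` is a hypothetical helper, and the sample data below stands in for the real read-back, which may contain additional columns that this check ignores.

```python
def has_record(rows, key, expected_bins):
    """Return True if any row has the given __key and matching bin values."""
    for row in rows:
        if row.get("__key") != key:
            continue
        if all(row.get(name) == value for name, value in expected_bins.items()):
            return True
    return False


# Stand-in for: rows = [r.asDict() for r in result_df.collect()]
rows = [{"__key": "test_001", "message": "Hello from Spark!"}]
print(has_record(rows, "test_001", {"message": "Hello from Spark!"}))
```

In the notebook, replace the sample list with the rows collected from `result_df`, as shown in the comment.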