Set up Spark

This page walks through setting up an environment to use Spark and the Aerospike Spark connector. The tutorial runs in Jupyter, an interactive computing environment in which you write and run Python code in a browser page called a notebook. Code is organized into chunks called cells, and each cell shows its results as soon as it runs. If you are new to notebooks, see Aerospike notebook tips and Jupyter basics.

A notebook lets you see each step of the script and troubleshoot a specific section if something goes wrong.

You run this feature store in a single Jupyter notebook, executing the pre-filled cells in order as you follow along on these pages. In production, you’d replace the notebook with scheduled jobs.

Prepare your project directory

  1. Create the project directory and enter it:

    mkdir feature-store-tutorial
    cd feature-store-tutorial
  2. Download the Aerospike Spark connector JAR:

    Download the connector from Aerospike Enterprise downloads using the Spark/Scala versions from the previous page. For this tutorial, use the Spark 3.5 / Scala 2.13 connector build:

    aerospike-spark-5.0.1-spark3.5-scala2.13-clientunshaded.jar
  3. Download the tutorial notebook into the same directory:

    Tutorial template notebook
    Download feature_store_tutorial.ipynb and place it in feature-store-tutorial/. It includes pre-filled setup and class-definition cells for Part 1.
  4. Confirm your directory structure:

    • feature-store-tutorial/
      • aerospike-spark-5.0.1-spark3.5-scala2.13-clientunshaded.jar (Spark connector)
      • feature_store_tutorial.ipynb (tutorial notebook)
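
The directory check above can also be done programmatically. The following is a minimal sketch, not part of the tutorial notebook: the helper name `missing_project_files` is hypothetical, and the file name and JAR pattern are taken from the listing above.

```python
from pathlib import Path

# Names taken from the directory listing above; adjust them if your
# connector build differs.
NOTEBOOK_NAME = "feature_store_tutorial.ipynb"
JAR_PATTERN = "aerospike-spark-*-clientunshaded.jar"


def missing_project_files(project_dir):
    """Return a list of problems; an empty list means the layout looks right.

    Hypothetical helper for illustration only.
    """
    root = Path(project_dir)
    problems = []
    if not (root / NOTEBOOK_NAME).is_file():
        problems.append(f"missing notebook: {NOTEBOOK_NAME}")
    if not list(root.glob(JAR_PATTERN)):
        problems.append(f"no connector JAR matching {JAR_PATTERN}")
    return problems


if __name__ == "__main__":
    # Run from inside feature-store-tutorial/
    for problem in missing_project_files("."):
        print(problem)
```

If the script prints nothing, both files are where the notebook expects them.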

Set up your Jupyter notebook

  1. Start JupyterLab from your project directory:

    cd feature-store-tutorial
    python3.11 -m venv .venv
    source .venv/bin/activate
    python -m pip install jupyterlab
    export JAVA_HOME="$(brew --prefix openjdk@17)/libexec/openjdk.jdk/Contents/Home"
    export SPARK_HOME="$HOME/spark/spark-3.5.3-bin-hadoop3-scala2.13"
    export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"
    jupyter lab

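    This opens the Jupyter interface in your browser with Python 3.11, JDK 17, and Spark 3.5.3 available to the notebook kernel. The JAVA_HOME line assumes a Homebrew-installed JDK 17 on macOS; on other systems, set JAVA_HOME to the path of your JDK 17 installation instead.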

  2. Open the provided notebook (feature_store_tutorial.ipynb).

    You do not need to create a blank notebook or copy any cells manually.

  3. Install the required Python packages (one-time step):

    Run the pre-filled package-install cell near the top of the notebook:

    import sys
    print(sys.version)
    if not sys.version.startswith("3.11."):
        raise RuntimeError("Use a Python 3.11 notebook kernel for this tutorial.")
    %pip install pyspark==3.5.3 findspark
    • pyspark is the Python API for Apache Spark, letting you work with distributed data using familiar Python syntax.
    • findspark helps Jupyter locate your Spark installation so you can import PySpark modules.

    Run the cell (Shift+Enter), then restart the kernel (Kernel → Restart) so the packages are available.

  4. Run the pre-filled configuration cell in the ## Setup section:

    Centralizing configuration at the top of your notebook makes it straightforward to adjust settings without hunting through code. This cell defines three groups of settings:

    import os
    import glob
    from pathlib import Path
    # Auto-detect JDK 17 and Spark 3.5.3 from common install locations.
    _JAVA_CANDIDATES = [
        os.environ.get('JAVA_HOME', ''),
        '/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home',
        '/usr/local/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home',
    ]
    JAVA_HOME = next((p for p in _JAVA_CANDIDATES if p and Path(p).exists()), None)
    if not JAVA_HOME:
        raise EnvironmentError(
            "JDK 17 not found. Set JAVA_HOME before starting JupyterLab."
        )
    os.environ['JAVA_HOME'] = JAVA_HOME
    os.environ['PATH'] = f"{Path(JAVA_HOME) / 'bin'}{os.pathsep}{os.environ['PATH']}"
    _SPARK_CANDIDATES = [
        os.environ.get('SPARK_HOME', ''),
        str(Path.home() / 'spark' / 'spark-3.5.3-bin-hadoop3-scala2.13'),
        '/opt/homebrew/opt/apache-spark/libexec',  # macOS Homebrew (Apple Silicon)
        '/usr/local/opt/apache-spark/libexec',     # macOS Homebrew (Intel)
        '/opt/spark',                              # Linux common path
    ]
    SPARK_HOME = next((p for p in _SPARK_CANDIDATES if p and Path(p).exists()), None)
    if not SPARK_HOME:
        raise EnvironmentError(
            "Spark not found. Set SPARK_HOME to Spark 3.5.3 with Scala 2.13."
        )
    os.environ['SPARK_HOME'] = SPARK_HOME
    os.environ.setdefault('SPARK_LOCAL_IP', '127.0.0.1')
    # Aerospike connection settings
    AS_HOST = '127.0.0.1'
    AS_PORT = 3000
    AS_NAMESPACE = 'test'
    # Auto-detect the Aerospike Spark connector JAR in the working directory
    _jar_hits = sorted(glob.glob('aerospike-spark-*-clientunshaded.jar'))
    if not _jar_hits:
        raise FileNotFoundError(
            "No Aerospike Spark connector JAR found in the working directory. "
            "Download it from https://aerospike.com/download/connector/spark/"
        )
    AEROSPIKE_JAR = _jar_hits[-1]
    print(f"JAVA_HOME: {JAVA_HOME}")
    print(f"SPARK_HOME: {SPARK_HOME}")
    print(f"Connector JAR: {AEROSPIKE_JAR}")
    • Java and Spark paths: The cell checks common install locations automatically. If none match, it raises an error with instructions. Set the JAVA_HOME and SPARK_HOME environment variables before starting Jupyter to override the auto-detection.
    • Aerospike connection: These settings point to your Docker container running Aerospike. Port 3000 is the default client port.
    • Connector JAR: The cell finds the JAR by glob pattern. If you have multiple JARs, it picks the one that sorts last alphabetically, which is not necessarily the newest version.
  5. Run the pre-filled Spark initialization cell:

    This cell creates a SparkSession, the entry point for all Spark functionality, and configures it to use the Aerospike connector.

    import findspark
    findspark.init(SPARK_HOME)
    from pyspark.sql import SparkSession
    # Build path to the connector JAR
    jar_path = Path.cwd() / AEROSPIKE_JAR
    # Create SparkSession with the Aerospike connector by configuring it to use the JAR file at the specified path during startup
    spark = SparkSession.builder \
        .appName('FeatureStoreTutorial') \
        .config('spark.jars', str(jar_path)) \
        .config('aerospike.seed-nodes', f'{AS_HOST}:{AS_PORT}') \
        .config('aerospike.namespace', AS_NAMESPACE) \
        .config('aerospike.sindex-enable', 'false') \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    log4j = spark._jvm.org.apache.log4j
    for logger_name in [
        "com.aerospike.spark",
        "CustomSIndexFilterProvider",
        "AerospikeBatchRecordWriter",
        "AerospikeConfig",
    ]:
        log4j.Logger.getLogger(logger_name).setLevel(log4j.Level.ERROR)
    print(f"Spark version: {spark.version}")
    print(f"Aerospike JAR: {jar_path}")

    Key points:

    • findspark.init() configures the Python environment so PySpark imports work correctly.
    • spark.jars tells Spark to load the Aerospike connector JAR at startup.
    • aerospike.seed-nodes and aerospike.namespace set default connection parameters for all Aerospike operations.
    • getOrCreate() either creates a new session or reuses an existing one (useful when re-running cells).

    When you run the cell, it should print the Spark version and connector JAR path. Confirm it prints Spark 3.5.x; the tutorial uses Spark 3.5.3 with Scala 2.13 and JDK 17.
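
The configuration cell in step 4 picks the connector JAR with `sorted(...)[-1]`, which compares file names as strings. With several JARs in the directory, string order and version order can disagree. The sketch below illustrates the difference. The helper `version_key` and the two version numbers are hypothetical and chosen only to show the effect; the notebook itself uses the plain alphabetical pick.

```python
import re


def version_key(jar_name):
    """Parse the version after 'aerospike-spark-' into a tuple of ints.

    Hypothetical helper; expects a bare file name, not a full path.
    """
    match = re.match(r"aerospike-spark-(\d+(?:\.\d+)*)-", jar_name)
    if match is None:
        return ()  # names that do not parse compare lowest
    return tuple(int(part) for part in match.group(1).split("."))


# Hypothetical version numbers, used only to illustrate the ordering issue.
jars = [
    "aerospike-spark-5.0.9-spark3.5-scala2.13-clientunshaded.jar",
    "aerospike-spark-5.0.10-spark3.5-scala2.13-clientunshaded.jar",
]

alphabetical_pick = sorted(jars)[-1]            # string comparison: "9" > "1"
version_pick = max(jars, key=version_key)       # numeric comparison: 10 > 9

print("alphabetical:", alphabetical_pick)
print("by version:  ", version_pick)
```

Keeping a single connector JAR in the project directory avoids the question entirely, which is what the tutorial layout assumes.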

Run a round-trip Spark test

Now test that Spark can communicate with Aerospike end to end: create one row, write it, then read it back. This tutorial was validated with Python 3.11. If your notebook kernel shows a different version, switch to a Python 3.11 kernel before continuing, because the package-install cell rejects any other version.
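
As a rough mental model for the round trip, each DataFrame row becomes one Aerospike record that is addressed by namespace, set, and key and that holds named bins. The sketch below expresses that mapping in plain Python using the same values as the test cells. The names `RecordAddress` and `row_to_record` are hypothetical, and the sketch assumes the key column is used only for addressing; whether the connector also stores it as a bin is not covered here.

```python
from typing import Any, NamedTuple


class RecordAddress(NamedTuple):
    """Where a record lives: namespace, set, and primary key."""
    namespace: str
    set_name: str
    key: Any


def row_to_record(row, namespace, set_name, key_column="__key"):
    """Split a row dict into (address, bins).

    Conceptual illustration only; this is not the connector's code.
    Assumption: the key column addresses the record and is left out of the bins.
    """
    address = RecordAddress(namespace, set_name, row[key_column])
    bins = {name: value for name, value in row.items() if name != key_column}
    return address, bins


address, bins = row_to_record(
    {"__key": "test_001", "message": "Hello from Spark!"},
    namespace="test",
    set_name="spark-test",
)
print(address)
print(bins)
```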

  1. Create a one-row DataFrame:

    This creates the input row used by the write step.

    from pyspark.sql import Row
    test_df = spark.createDataFrame([
        Row(__key='test_001', message='Hello from Spark!')
    ])
    print("Prepared one test record for Aerospike write")
  2. Run the cell that writes the test record:

    This saves the DataFrame to Aerospike. Aerospike is schemaless: instead of table rows and columns, data is stored as records (in a namespace and set) with named bins inside each record.

    test_df.write \
        .mode('overwrite') \
        .format('aerospike') \
        .option('aerospike.write-set', 'spark-test') \
        .option('aerospike.write-with-key', '__key') \
        .save()
    print("Successfully wrote test record to Aerospike!")

    What’s happening here:

    • Row(__key='test_001', ...) creates a record with a primary key and a data field.
    • format('aerospike') tells Spark to use the Aerospike connector.
    • aerospike.write-set specifies which Aerospike set to store the data in.
    • aerospike.write-with-key identifies which DataFrame column contains the primary key.
  3. Run the read-back cell:

    Finally, read the data back from Aerospike to confirm the full round-trip works.

    result_df = spark.read \
        .format('aerospike') \
        .option('aerospike.read-set', 'spark-test') \
        .load()
    result_df.show()

    This queries the spark-test set and returns the results as a DataFrame. The .show() method prints the data in a formatted table. Verify the output includes your test record with __key and message columns.
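
Instead of checking the printed table by eye, the read-back can also be checked in code. The sketch below is one possible approach and is not part of the tutorial notebook. The helper `find_by_key` is hypothetical, and the `rows` list is a stand-in so the example runs without Spark; in the notebook you could build it with `[r.asDict() for r in result_df.collect()]`.

```python
def find_by_key(rows, key, key_column="__key"):
    """Return the first row dict whose key column equals `key`, or None.

    Hypothetical helper; extra columns in a row are ignored.
    """
    for row in rows:
        if row.get(key_column) == key:
            return row
    return None


# Stand-in data so the sketch runs without a Spark session.
# In the notebook: rows = [r.asDict() for r in result_df.collect()]
rows = [{"__key": "test_001", "message": "Hello from Spark!"}]

record = find_by_key(rows, "test_001")
assert record is not None, "test_001 not found in the spark-test set"
assert record["message"] == "Hello from Spark!"
print("Round trip verified")
```

Collecting to the driver is fine for a one-row test set; for larger sets, filtering the DataFrame before collecting would be the safer choice.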

If the read-back shows the test_001 row with __key and message, your setup is ready for the tutorial.
