Set up Spark
This page walks through setting up an environment to use Spark and the Aerospike Spark connector from a Jupyter notebook. Jupyter is an interactive computing environment that runs in your browser: you write Python code in a page called a notebook, run it in chunks called cells, and see the results immediately. If you are new to notebooks, see Aerospike notebook tips and Jupyter basics.
Using a notebook lets you see the steps of the script, as well as troubleshoot specific sections if anything goes wrong.
You run this feature store in a single Jupyter notebook, running the pre-filled cells in order as you follow along on these pages. In production, you’d replace the notebook with scheduled jobs.
Prepare your project directory
- Create the project directory and enter it:

      mkdir feature-store-tutorial
      cd feature-store-tutorial

- Download the Aerospike Spark connector JAR:

  Download the connector from Aerospike Enterprise downloads using the Spark/Scala versions from the previous page. For this tutorial, use the Spark 3.5 / Scala 2.13 connector build:

      aerospike-spark-5.0.1-spark3.5-scala2.13-clientunshaded.jar

- Download the tutorial notebook into the same directory:

  Tutorial template notebook: Download feature_store_tutorial.ipynb and place it in feature-store-tutorial/. It includes pre-filled setup and class-definition cells for Part 1.

- Confirm your directory structure:

  feature-store-tutorial/
  - aerospike-spark-5.0.1-spark3.5-scala2.13-clientunshaded.jar (Spark connector)
  - feature_store_tutorial.ipynb (Tutorial notebook)
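
The directory check can also be scripted. The following is a minimal sketch, assuming the two file names used on this page; `missing_project_files` and `EXPECTED_PATTERNS` are hypothetical names introduced for illustration and are not part of the tutorial notebook.

```python
from pathlib import Path

# File patterns this page expects in the project directory (assumed from the steps above).
EXPECTED_PATTERNS = [
    "aerospike-spark-*-clientunshaded.jar",  # Spark connector
    "feature_store_tutorial.ipynb",          # Tutorial notebook
]

def missing_project_files(project_dir):
    """Return the expected patterns that match no file in project_dir."""
    root = Path(project_dir)
    return [pattern for pattern in EXPECTED_PATTERNS if not any(root.glob(pattern))]

# Check the current working directory and report anything that is absent.
missing = missing_project_files(".")
print("Missing:", missing if missing else "nothing")
```

Run it from inside feature-store-tutorial/; an empty result means both files are in place.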
Set up your Jupyter notebook
- Start JupyterLab from your project directory:

      cd feature-store-tutorial
      python3.11 -m venv .venv
      source .venv/bin/activate
      python -m pip install jupyterlab
      export JAVA_HOME="$(brew --prefix openjdk@17)/libexec/openjdk.jdk/Contents/Home"
      export SPARK_HOME="$HOME/spark/spark-3.5.3-bin-hadoop3-scala2.13"
      export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"
      jupyter lab

  This opens the Jupyter interface in your browser with Python 3.11, JDK 17, and Spark 3.5.3 available to the notebook kernel.
- Open the provided notebook (feature_store_tutorial.ipynb).

  You do not need to create a blank notebook or copy any cells manually.
- Install the required Python packages (one-time step):

  Run the pre-filled package-install cell near the top of the notebook:

      import sys
      print(sys.version)
      if not sys.version.startswith("3.11."):
          raise RuntimeError("Use a Python 3.11 notebook kernel for this tutorial.")
      %pip install pyspark==3.5.3 findspark

  - pyspark is the Python API for Apache Spark, letting you work with distributed data using familiar Python syntax.
  - findspark helps Jupyter locate your Spark installation so you can import PySpark modules.

  Run the cell (Shift+Enter), then restart the kernel (Kernel → Restart) so the packages are available.
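
The install cell guards on the version string. An equivalent guard can compare `sys.version_info` instead, which avoids string parsing. This is a minimal sketch and not the notebook's own code; `REQUIRED_PYTHON` and `is_supported_python` are hypothetical names.

```python
import sys

REQUIRED_PYTHON = (3, 11)  # (major, minor) this tutorial was validated with

def is_supported_python(version_info=sys.version_info):
    """Return True when the interpreter's (major, minor) equals REQUIRED_PYTHON."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON

# Report the running interpreter and whether it matches the tutorial's version.
print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
      f"supported: {is_supported_python()}")
```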
- Run the pre-filled configuration cell in the ## Setup section:

  Centralizing configuration at the top of your notebook makes it straightforward to adjust settings without hunting through code. This cell defines three groups of settings:

      import os
      import glob
      from pathlib import Path

      # Auto-detect JDK 17 and Spark 3.5.3 from common install locations.
      _JAVA_CANDIDATES = [
          os.environ.get('JAVA_HOME', ''),
          '/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home',
          '/usr/local/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home',
      ]
      JAVA_HOME = next((p for p in _JAVA_CANDIDATES if p and Path(p).exists()), None)
      if not JAVA_HOME:
          raise EnvironmentError("JDK 17 not found. Set JAVA_HOME before starting JupyterLab.")
      os.environ['JAVA_HOME'] = JAVA_HOME
      os.environ['PATH'] = f"{Path(JAVA_HOME) / 'bin'}{os.pathsep}{os.environ['PATH']}"

      _SPARK_CANDIDATES = [
          os.environ.get('SPARK_HOME', ''),
          str(Path.home() / 'spark' / 'spark-3.5.3-bin-hadoop3-scala2.13'),
          '/opt/homebrew/opt/apache-spark/libexec',  # macOS Homebrew (Apple Silicon)
          '/usr/local/opt/apache-spark/libexec',     # macOS Homebrew (Intel)
          '/opt/spark',                              # Linux common path
      ]
      SPARK_HOME = next((p for p in _SPARK_CANDIDATES if p and Path(p).exists()), None)
      if not SPARK_HOME:
          raise EnvironmentError("Spark not found. Set SPARK_HOME to Spark 3.5.3 with Scala 2.13.")
      os.environ['SPARK_HOME'] = SPARK_HOME
      os.environ.setdefault('SPARK_LOCAL_IP', '127.0.0.1')

      # Aerospike connection settings
      AS_HOST = '127.0.0.1'
      AS_PORT = 3000
      AS_NAMESPACE = 'test'

      # Auto-detect the Aerospike Spark connector JAR in the working directory
      _jar_hits = sorted(glob.glob('aerospike-spark-*-clientunshaded.jar'))
      if not _jar_hits:
          raise FileNotFoundError(
              "No Aerospike Spark connector JAR found in the working directory. "
              "Download it from https://aerospike.com/download/connector/spark/"
          )
      AEROSPIKE_JAR = _jar_hits[-1]

      print(f"JAVA_HOME: {JAVA_HOME}")
      print(f"SPARK_HOME: {SPARK_HOME}")
      print(f"Connector JAR: {AEROSPIKE_JAR}")

  - Java and Spark paths: The cell checks common install locations automatically, trying the JAVA_HOME and SPARK_HOME environment variables first. If no location exists, it raises an error with instructions. Set JAVA_HOME and SPARK_HOME before starting Jupyter to override the auto-detection.
  - Aerospike connection: Point to your Docker container running Aerospike. Port 3000 is the default client port.
  - Connector JAR: The cell finds the JAR by glob pattern. If several JARs match, it sorts the file names alphabetically and picks the last one.
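
Alphabetical order is not always version order: a string sort places 5.0.10 before 5.0.9. If you keep several connector builds in the directory, a version-aware pick is one option. The sketch below is illustrative only; `version_key` is a hypothetical helper, and the 5.0.9 and 5.0.10 file names are made-up examples rather than real releases.

```python
import re

def version_key(jar_name):
    """Extract the dotted version after 'aerospike-spark-' as a tuple of ints."""
    match = re.search(r"aerospike-spark-(\d+(?:\.\d+)*)", jar_name)
    return tuple(int(part) for part in match.group(1).split(".")) if match else ()

# Hypothetical file names, used only to show the ordering difference.
jars = [
    "aerospike-spark-5.0.9-spark3.5-scala2.13-clientunshaded.jar",
    "aerospike-spark-5.0.10-spark3.5-scala2.13-clientunshaded.jar",
]

lexicographic_pick = sorted(jars)[-1]      # same rule as the configuration cell
version_pick = max(jars, key=version_key)  # compares version numbers numerically

print("Alphabetical pick:", lexicographic_pick)
print("Version-aware pick:", version_pick)
```

With a single JAR in the directory, as in this tutorial, both rules select the same file.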
- Run the pre-filled Spark initialization cell:

  This cell creates a SparkSession, the entry point for all Spark functionality, and configures it to use the Aerospike connector.

      import findspark
      findspark.init(SPARK_HOME)

      from pyspark.sql import SparkSession

      # Build path to the connector JAR
      jar_path = Path.cwd() / AEROSPIKE_JAR

      # Create a SparkSession that loads the connector JAR at startup
      spark = SparkSession.builder \
          .appName('FeatureStoreTutorial') \
          .config('spark.jars', str(jar_path)) \
          .config('aerospike.seed-nodes', f'{AS_HOST}:{AS_PORT}') \
          .config('aerospike.namespace', AS_NAMESPACE) \
          .config('aerospike.sindex-enable', 'false') \
          .getOrCreate()

      # Quiet Spark and connector logging
      spark.sparkContext.setLogLevel("ERROR")
      log4j = spark._jvm.org.apache.log4j
      for logger_name in [
          "com.aerospike.spark",
          "CustomSIndexFilterProvider",
          "AerospikeBatchRecordWriter",
          "AerospikeConfig",
      ]:
          log4j.Logger.getLogger(logger_name).setLevel(log4j.Level.ERROR)

      print(f"Spark version: {spark.version}")
      print(f"Aerospike JAR: {jar_path}")

  Key points:

  - findspark.init() configures the Python environment so PySpark imports work correctly.
  - spark.jars tells Spark to load the Aerospike connector JAR at startup.
  - aerospike.seed-nodes and aerospike.namespace set default connection parameters for all Aerospike operations.
  - getOrCreate() either creates a new session or reuses an existing one (useful when re-running cells).

  When you run the cell, it should print the Spark version and connector JAR path. Confirm it prints Spark 3.5.x; the tutorial uses Spark 3.5.3 with Scala 2.13 and JDK 17.
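
If you prefer the notebook to stop on a mismatched Spark version instead of checking the printed output by eye, a small guard is one way to do it. This is a sketch under the assumption that any 3.5.x release is acceptable; `require_spark_version` is a hypothetical helper, and the literal "3.5.3" below stands in for `spark.version`.

```python
def require_spark_version(version, required_prefix="3.5."):
    """Return version unchanged, or raise if it does not start with required_prefix."""
    if not version.startswith(required_prefix):
        raise RuntimeError(f"Expected Spark {required_prefix}x, found {version}")
    return version

# In the notebook, pass spark.version instead of this stand-in string.
print("Spark version OK:", require_spark_version("3.5.3"))
```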
Run a round-trip Spark test
Now test that Spark can communicate with Aerospike end to end: create one row, write it, then read it back. This tutorial was validated with Python 3.11. If your notebook kernel shows Python 3.14 or later, switch to a Python 3.11 kernel before continuing.
- Create a one-row DataFrame:

  This creates the input row used by the write step.

      from pyspark.sql import Row

      test_df = spark.createDataFrame([Row(__key='test_001', message='Hello from Spark!')])
      print("Prepared one test record for Aerospike write")

- Run the write test record cell:

  This saves the DataFrame to Aerospike. Aerospike is schemaless: instead of table rows and columns, data is stored as records (in a namespace and set) with named bins inside each record.

      test_df.write \
          .mode('overwrite') \
          .format('aerospike') \
          .option('aerospike.write-set', 'spark-test') \
          .option('aerospike.write-with-key', '__key') \
          .save()
      print("Successfully wrote test record to Aerospike!")

  What’s happening here:

  - Row(__key='test_001', ...) creates a record with a primary key and a data field.
  - format('aerospike') tells Spark to use the Aerospike connector.
  - aerospike.write-set specifies which Aerospike set to store the data in.
  - aerospike.write-with-key identifies which DataFrame column contains the primary key.
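
To make the record model concrete, the sketch below shows how one DataFrame row can be thought of as an Aerospike record: an address (namespace, set, key) plus named bins. It is a conceptual illustration in plain Python, not the connector's implementation; `row_to_record` is a hypothetical helper, and leaving the key column out of the bins is an assumption made only for this example.

```python
def row_to_record(row, key_column, namespace, set_name):
    """Split a row dict into an Aerospike-style address and its named bins."""
    # Every column except the key column becomes a bin (assumption for illustration).
    bins = {name: value for name, value in row.items() if name != key_column}
    return {
        "namespace": namespace,
        "set": set_name,
        "key": row[key_column],
        "bins": bins,
    }

# The test row from this page, addressed with the tutorial's namespace and set.
record = row_to_record(
    {"__key": "test_001", "message": "Hello from Spark!"},
    key_column="__key",
    namespace="test",
    set_name="spark-test",
)
print(record)
```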
- Run the read-back cell:

  Finally, read the data back from Aerospike to confirm the full round-trip works.

      result_df = spark.read \
          .format('aerospike') \
          .option('aerospike.read-set', 'spark-test') \
          .load()
      result_df.show()

  This queries the spark-test set and returns the results as a DataFrame. The .show() method prints the data in a formatted table. Verify the output includes your test record with __key and message columns.
If the read-back shows the test_001 row with __key and message, your setup is ready for the tutorial.
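
The visual check can also be expressed as assertions. The following is a minimal sketch, assuming the rows have been converted to dictionaries; in the notebook that list could come from `[r.asDict() for r in result_df.collect()]`, and here a hard-coded list stands in for it. `find_record` is a hypothetical helper.

```python
def find_record(rows, key, key_column="__key"):
    """Return the first row dict whose key column equals key, or None."""
    return next((row for row in rows if row.get(key_column) == key), None)

# Stand-in for the rows read back from the spark-test set.
rows = [{"__key": "test_001", "message": "Hello from Spark!"}]

match = find_record(rows, "test_001")
assert match is not None, "test_001 was not read back from Aerospike"
assert match["message"] == "Hello from Spark!"
print("Round trip verified:", match)
```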