# Prepare AWS resources

This page guides you through creating AWS resources and configuring the bulk loading scripts for your environment.
## Create an S3 bucket

1. **Create an S3 bucket for your graph data.**

   Create a bucket to store vertex and edge CSV files, the bulk loader JAR, and configuration files. Replace `YOUR_BUCKET_NAME` with a unique bucket name and `YOUR_REGION` with your AWS region (for example, `us-east-1`):

   ```shell
   aws s3 mb s3://YOUR_BUCKET_NAME/ --region YOUR_REGION
   ```

   Example response:

   ```shell
   make_bucket: YOUR_BUCKET_NAME
   ```
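Before creating the bucket, you can optionally sanity-check the name against S3's naming rules (3-63 characters; lowercase letters, digits, dots, and hyphens; must start and end with a letter or digit). A minimal sketch, not part of the tutorial scripts, using an illustrative bucket name:

```shell
# Illustrative bucket name; substitute your own.
BUCKET="my-graph-bucket"

# S3 bucket names: 3-63 chars of lowercase letters, digits, dots, hyphens,
# beginning and ending with a letter or digit.
if echo "$BUCKET" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'; then
  echo "valid bucket name"
else
  echo "invalid bucket name"
fi
```

This catches common mistakes (uppercase letters, underscores) before the `aws s3 mb` call fails.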
## Download the bulk loader

1. **Navigate to the AWS bulk loading directory.**

   Make sure you're in the directory containing the bulk loading scripts:

   ```shell
   cd aerospike-graph/misc/distributed-bulkload-example/AWS
   ```

2. **Create the bucket files directory structure.**

   Create a directory for the bulk loader JAR:

   ```shell
   mkdir -p bucket-files/jars
   ```

3. **Download the Aerospike Graph bulk loader JAR.**

   Download the bulk loader directly to the `bucket-files/jars` directory. Replace `VERSION` with the latest version (for example, `3.1.1`):

   ```shell
   curl -L -o bucket-files/jars/aerospike-graph-bulk-loader-VERSION.jar \
     "https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/VERSION/aerospike-graph-bulk-loader-VERSION.jar"
   ```

   For example, to download version 3.1.1:

   ```shell
   curl -L -o bucket-files/jars/aerospike-graph-bulk-loader-3.1.1.jar \
     "https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/3.1.1/aerospike-graph-bulk-loader-3.1.1.jar"
   ```

4. **Verify the download.**

   Confirm the JAR file was downloaded successfully:

   ```shell
   ls -la bucket-files/jars/
   ```

   You should see the bulk loader JAR file listed.
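If you script the download, you can derive the filename and URL from a single `VERSION` variable so the version string only needs to change in one place. A small sketch; the URL pattern matches the `curl` commands above:

```shell
# Build the download URL from one VERSION variable.
VERSION="3.1.1"
JAR="aerospike-graph-bulk-loader-${VERSION}.jar"
URL="https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/${VERSION}/${JAR}"

echo "$URL"
# Uncomment to perform the actual download:
# curl -L -o "bucket-files/jars/${JAR}" "$URL"
```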
## Edit the variables script

1. **Open `set_variables.sh` in your editor.**

   This script contains configuration settings for your Aerolab cluster, S3 bucket, and EMR cluster.

2. **Update the Aerolab cluster variables.**

   - `name`: A unique name for your Aerospike cluster (for example, `my-graph-cluster`)
   - `username`: Your username or identifier (for example, `myusername`)

   Example:

   ```shell
   name="my-graph-cluster"
   username="myusername"
   ```

3. **Update the AWS region.**

   Set `AWS_REGION` to match the region you configured in Aerolab:

   ```shell
   AWS_REGION="YOUR_REGION"
   ```

   Replace `YOUR_REGION` with your AWS region (for example, `us-west-2`).

4. **Update the S3 bucket path.**

   Set the `BUCKET_PATH` variable to your bucket name:

   ```shell
   BUCKET_PATH="s3://YOUR_BUCKET_NAME"
   ```

   Replace `YOUR_BUCKET_NAME` with the bucket name you created earlier.

5. **Locate these two variables:**

   ```shell
   SUBNET_ID="YOUR_SUBNET_ID"
   SECURITY_GROUP="YOUR_SECURITY_GROUP_ID"
   ```

   Do not change these values yet. You will update them after creating your Aerospike cluster in the next section, when Aerolab outputs the subnet ID and security group ID.

6. **Save the file.**
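Before provisioning, it can help to confirm that no placeholder values remain in the script (other than the two you update later). A minimal sketch, demonstrated on a throwaway copy in `/tmp`; in practice, run the `grep` against your real `set_variables.sh`:

```shell
# Throwaway demo file standing in for set_variables.sh.
cat > /tmp/demo_variables.sh <<'EOF'
name="my-graph-cluster"
AWS_REGION="us-west-2"
SUBNET_ID="YOUR_SUBNET_ID"
EOF

# All placeholders in the script start with "YOUR_", so a grep finds
# any values that still need to be filled in.
if grep -q 'YOUR_' /tmp/demo_variables.sh; then
  echo "placeholders remain"
else
  echo "no placeholders found"
fi
```

Here the check reports `placeholders remain` because `SUBNET_ID` is still unset, which is expected at this stage of the tutorial.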
## Understanding the configuration

The `set_variables.sh` script centralizes all configuration for the bulk loading pipeline. Understanding these settings helps you customize the deployment for your specific requirements.
As described in the overview, this tutorial uses several AWS services working together:
- Amazon EC2: Virtual servers that run your Aerospike database and EMR worker nodes.
- Amazon S3: Object storage for your graph data files, bulk loader JAR, and job logs.
- Amazon EMR: Managed cluster platform for running Apache Spark jobs that process large datasets.
- Amazon VPC: Isolated network where your AWS resources communicate securely.
### Aerolab settings

These variables control the Aerospike cluster that Aerolab creates on EC2:

| Variable | Description | Default |
|---|---|---|
| `name` | Unique identifier for your cluster. Aerolab uses this to manage cluster lifecycle. | — |
| `username` | Owner tag applied to AWS resources for cost tracking and access control. | — |
| `aerospike_version` | Aerospike Database version to install on the cluster nodes. | `8.0.0.7` |
| `instance_count` | Number of Aerospike nodes in the cluster. Scale up for larger datasets. | `1` |
| `aws_instance_type` | EC2 instance type for Aerospike nodes. Choose based on data size and performance needs. | `t3.medium` |
### Bulk load settings

| Variable | Description | Default |
|---|---|---|
| `AWS_REGION` | AWS region where all resources are created. Must be consistent across all components. | `us-east-1` |
### S3 locations

These paths define where the bulk loader reads data and writes logs in Amazon S3:

| Variable | Description | Example |
|---|---|---|
| `BUCKET_PATH` | Root S3 path for all bulk loading files. | `s3://my-bucket` |
| `LOG_URI` | Directory where EMR writes Spark job logs. Useful for debugging failed jobs. | `${BUCKET_PATH}/logs/` |
| `SPARK_JAR` | S3 location of the bulk loader JAR file. | `${BUCKET_PATH}/jars/aerospike-graph-bulk-loader-3.1.1.jar` |
| `SPARK_ARGS` | Command-line arguments passed to the bulk loader, including the path to the properties file. | `-c,${BUCKET_PATH}/config/bulk-loader.properties` |
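These paths build on one another: everything is rooted at `BUCKET_PATH`, so changing the bucket name in one place updates all derived locations. With the example bucket from the table, they expand as follows (illustrative values only):

```shell
# All S3 locations derive from BUCKET_PATH.
BUCKET_PATH="s3://my-bucket"
LOG_URI="${BUCKET_PATH}/logs/"
SPARK_JAR="${BUCKET_PATH}/jars/aerospike-graph-bulk-loader-3.1.1.jar"
SPARK_ARGS="-c,${BUCKET_PATH}/config/bulk-loader.properties"

echo "$LOG_URI"
echo "$SPARK_JAR"
```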
### EMR cluster settings

Amazon EMR runs Apache Spark to execute the bulk loader. Spark distributes the data loading work across multiple nodes for faster processing:

| Variable | Description | Default |
|---|---|---|
| `CLUSTER_NAME` | Display name for the EMR cluster in the AWS console. | `Aerospike AWS Graph Cluster` |
| `EMR_RELEASE` | EMR release version, which determines the Spark and Hadoop versions. The default includes Spark 3.4 and Java 11 support. | `emr-6.15.0` |
| `emr_instance_type` | EC2 instance type for EMR nodes. Larger instances speed up bulk loading for big datasets. | `m5.xlarge` |
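The tutorial's run scripts create the EMR cluster for you. For orientation only, here is a hedged sketch of how these settings might map onto `aws emr create-cluster` flags; the `echo` keeps it a dry run, and this is not a complete, verified invocation:

```shell
# Table defaults, as they would appear in set_variables.sh.
CLUSTER_NAME="Aerospike AWS Graph Cluster"
EMR_RELEASE="emr-6.15.0"
EMR_INSTANCE_TYPE="m5.xlarge"

# Dry run: print the command instead of executing it.
echo aws emr create-cluster \
  --name "$CLUSTER_NAME" \
  --release-label "$EMR_RELEASE" \
  --applications Name=Spark \
  --instance-type "$EMR_INSTANCE_TYPE"
```

In the real pipeline the scripts also supply the subnet, security group, and Spark step arguments described elsewhere on this page.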
### Network settings

For the bulk loader to write data to Aerospike, the EMR cluster must have network access to the Aerospike nodes within the same VPC:

| Variable | Description |
|---|---|
| `SUBNET_ID` | VPC subnet where EMR launches its nodes. Must be the same subnet as your Aerospike cluster for private IP connectivity. |
| `SECURITY_GROUP` | Security group that allows inbound traffic on port 3000 (Aerospike). Aerolab creates this automatically when it provisions the cluster. |