Prepare AWS resources

This page guides you through creating AWS resources and configuring the bulk loading scripts for your environment.

Create an S3 bucket

  1. Create an S3 bucket for your graph data.

    Create a bucket to store vertex and edge CSV files, the bulk loader JAR, and configuration files. Replace YOUR_BUCKET_NAME with a unique bucket name and YOUR_REGION with your AWS region (for example, us-east-1):

    Terminal window
    aws s3 mb s3://YOUR_BUCKET_NAME/ --region YOUR_REGION

    Example response
    make_bucket: YOUR_BUCKET_NAME
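To double-check that the bucket exists before moving on, you can list it. This is a quick, non-destructive check, assuming your AWS CLI credentials are already configured; YOUR_BUCKET_NAME and YOUR_REGION are placeholders as above:

```shell
# List the bucket contents. An empty result with exit code 0 means the
# bucket exists and your credentials can access it.
aws s3 ls s3://YOUR_BUCKET_NAME/ --region YOUR_REGION
```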

Download the bulk loader

  1. Navigate to the AWS bulk loading directory.

    Make sure you’re in the directory containing the bulk loading scripts:

    Terminal window
    cd aerospike-graph/misc/distributed-bulkload-example/AWS
  2. Create the bucket files directory structure.

    Create a directory for the bulk loader JAR:

    Terminal window
    mkdir -p bucket-files/jars
  3. Download the Aerospike Graph bulk loader JAR.

    Download the bulk loader directly to the bucket-files/jars directory. Replace VERSION with the latest version (for example, 3.1.1):

    Terminal window
    curl -L -o bucket-files/jars/aerospike-graph-bulk-loader-VERSION.jar \
    "https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/VERSION/aerospike-graph-bulk-loader-VERSION.jar"

    For example, to download version 3.1.1:

    Terminal window
    curl -L -o bucket-files/jars/aerospike-graph-bulk-loader-3.1.1.jar \
    "https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/3.1.1/aerospike-graph-bulk-loader-3.1.1.jar"
  4. Verify the download.

    Confirm the JAR file was downloaded successfully:

    Terminal window
    ls -la bucket-files/jars/

    You should see the bulk loader JAR file listed.
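Because a JAR is a ZIP archive, you can go one step further than `ls` and list the archive's entries. If the download was corrupted, or the server returned an error page instead of the JAR, this command fails. (VERSION is a placeholder for the version you downloaded.)

```shell
# Listing the archive entries confirms the file is a readable ZIP/JAR,
# not a truncated download or an HTML error page.
unzip -l bucket-files/jars/aerospike-graph-bulk-loader-VERSION.jar | head
```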

Edit the variables script

  1. Open set_variables.sh in your editor.

    This script contains configuration settings for your Aerolab cluster, S3 bucket, and EMR cluster.

  2. Update these Aerolab cluster variables:

    • name: A unique name for your Aerospike cluster (for example, my-graph-cluster)
    • username: Your username or identifier (for example, myusername)

    Example:

    name="my-graph-cluster"
    username="myusername"
  3. Update the AWS region.

    Set AWS_REGION to match the region you configured in Aerolab:

    AWS_REGION="YOUR_REGION"

    Replace YOUR_REGION with your AWS region (for example, us-west-2).

  4. Update the S3 bucket path.

    Set the BUCKET_PATH variable to your bucket name:

    BUCKET_PATH="s3://YOUR_BUCKET_NAME"

    Replace YOUR_BUCKET_NAME with the bucket name you created earlier.

  5. Locate these two variables:

    SUBNET_ID="YOUR_SUBNET_ID"
    SECURITY_GROUP="YOUR_SECURITY_GROUP_ID"

    Do not change these values yet. You will update them after creating your Aerospike cluster in the next section, when Aerolab outputs the subnet ID and security group ID.

  6. Save the file.
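After steps 2 through 5, the relevant portion of set_variables.sh might look like the following. All values here are illustrative placeholders, not required values:

```shell
# Aerolab cluster settings
name="my-graph-cluster"
username="myusername"

# AWS region and S3 bucket
AWS_REGION="us-west-2"
BUCKET_PATH="s3://my-graph-bucket"

# Leave these placeholders until Aerolab reports the real IDs
SUBNET_ID="YOUR_SUBNET_ID"
SECURITY_GROUP="YOUR_SECURITY_GROUP_ID"
```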

Understanding the configuration

The set_variables.sh script centralizes all configuration for the bulk loading pipeline. Understanding these settings helps you customize the deployment for your specific requirements.

As described in the overview, this tutorial uses several AWS services working together:

  • Amazon EC2: Virtual servers that run your Aerospike database and EMR worker nodes.
  • Amazon S3: Object storage for your graph data files, bulk loader JAR, and job logs.
  • Amazon EMR: Managed cluster platform for running Apache Spark jobs that process large datasets.
  • Amazon VPC: Isolated network where your AWS resources communicate securely.

Aerolab settings

These variables control the Aerospike cluster that Aerolab creates on EC2:

  • name: Unique identifier for your cluster. Aerolab uses this to manage cluster lifecycle.
  • username: Owner tag applied to AWS resources for cost tracking and access control.
  • aerospike_version: Aerospike Database version to install on the cluster nodes. Default: 8.0.0.7
  • instance_count: Number of Aerospike nodes in the cluster. Scale up for larger datasets. Default: 1
  • aws_instance_type: EC2 instance type for Aerospike nodes. Choose based on data size and performance needs. Default: t3.medium

Bulk load settings

  • AWS_REGION: AWS region where all resources are created. Must be consistent across all components. Default: us-east-1

S3 locations

These paths define where the bulk loader reads data and writes logs in Amazon S3:

  • BUCKET_PATH: Root S3 path for all bulk loading files. Example: s3://my-bucket
  • LOG_URI: Directory where EMR writes Spark job logs. Useful for debugging failed jobs. Example: ${BUCKET_PATH}/logs/
  • SPARK_JAR: S3 location of the bulk loader JAR file. Example: ${BUCKET_PATH}/jars/aerospike-graph-bulk-loader-3.1.1.jar
  • SPARK_ARGS: Command-line arguments passed to the bulk loader, including the path to the properties file. Example: -c,${BUCKET_PATH}/config/bulk-loader.properties
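The derived paths are plain string expansions of BUCKET_PATH. A minimal sketch of how they fit together; the variable names follow set_variables.sh and the bucket name is illustrative:

```shell
# Derive the S3 locations from the bucket root, as set_variables.sh does.
BUCKET_PATH="s3://my-bucket"
LOG_URI="${BUCKET_PATH}/logs/"
SPARK_JAR="${BUCKET_PATH}/jars/aerospike-graph-bulk-loader-3.1.1.jar"
SPARK_ARGS="-c,${BUCKET_PATH}/config/bulk-loader.properties"

echo "$LOG_URI"
echo "$SPARK_JAR"
echo "$SPARK_ARGS"
```

Changing BUCKET_PATH once is enough to relocate everything, which is why only the bucket name needs editing in the steps above.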

EMR cluster settings

Amazon EMR runs Apache Spark to execute the bulk loader. Spark distributes the data loading work across multiple nodes for faster processing:

  • CLUSTER_NAME: Display name for the EMR cluster in the AWS console. Default: Aerospike AWS Graph Cluster
  • EMR_RELEASE: EMR release version, which determines the Spark and Hadoop versions. The default includes Spark 3.4 and Java 11 support. Default: emr-6.15.0
  • emr_instance_type: EC2 instance type for EMR nodes. Larger instances speed up bulk loading for big datasets. Default: m5.xlarge
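For orientation, launching an EMR cluster with these settings boils down to an `aws emr create-cluster` call along the following lines. This is an illustrative sketch, not the exact command the provisioning script runs; the instance count, bucket name, subnet ID, and region are hypothetical placeholders:

```shell
# Illustrative EMR launch using the settings above (not the script's exact command).
aws emr create-cluster \
  --name "Aerospike AWS Graph Cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --log-uri "s3://YOUR_BUCKET_NAME/logs/" \
  --ec2-attributes SubnetId=YOUR_SUBNET_ID \
  --use-default-roles \
  --region YOUR_REGION
```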

Network settings

For the bulk loader to write data to Aerospike, the EMR cluster must have network access to the Aerospike nodes within the same VPC:

  • SUBNET_ID: VPC subnet where EMR launches its nodes. Must be the same subnet as your Aerospike cluster for private IP connectivity.
  • SECURITY_GROUP: Security group that allows inbound traffic on port 3000 (Aerospike). Aerolab creates this automatically when it provisions the cluster.
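Once Aerolab has created the cluster and reported the two IDs, you can inspect the security group to confirm port 3000 is open before launching EMR. The group ID and region here are placeholders:

```shell
# Show the group's inbound rules; look for an entry covering TCP port 3000.
aws ec2 describe-security-groups \
  --group-ids YOUR_SECURITY_GROUP_ID \
  --query "SecurityGroups[0].IpPermissions" \
  --region YOUR_REGION
```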