Run bulk load job

This page guides you through uploading your files to S3, creating an EMR cluster, and running the distributed bulk load job to load graph data into your Aerospike cluster.

Upload files to S3

  1. Upload the bulk loader configuration and JAR.

    From the AWS bulk loading directory, upload the bucket files to S3. Replace YOUR_BUCKET_NAME and YOUR_REGION with your values:

    Terminal window
    aws s3 cp ./bucket-files s3://YOUR_BUCKET_NAME/ --recursive --region YOUR_REGION
    upload: bucket-files/config/bulk-loader.properties to s3://YOUR_BUCKET_NAME/config/bulk-loader.properties
    upload: bucket-files/jars/aerospike-graph-bulk-loader-3.1.1.jar to s3://YOUR_BUCKET_NAME/jars/aerospike-graph-bulk-loader-3.1.1.jar
    Example response
  2. Upload the sample vertex and edge data.

    Upload the sample graph data from the common bulkload data directory:

    Terminal window
    aws s3 cp ../../../common/bulkload-data/ s3://YOUR_BUCKET_NAME/ --recursive --region YOUR_REGION
    upload: ../../../common/bulkload-data/vertices/vertices.csv to s3://YOUR_BUCKET_NAME/vertices/vertices.csv
    upload: ../../../common/bulkload-data/edges/edges.csv to s3://YOUR_BUCKET_NAME/edges/edges.csv
    Example response

If you want to load your own graph data instead of the sample data, replace the CSV files in the common/bulkload-data directory with your own vertex and edge files before uploading. Ensure your CSV files follow the Aerospike Graph CSV format.
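For reference, a minimal pair of input files in the Gremlin-style CSV format the bulk loader expects might look like the sketch below. The labels and property columns (`name`, `since`) are illustrative; `~id` and `~label` are required for vertices, and edges additionally need `~from` and `~to` referencing vertex IDs.

```shell
# Illustrative vertex and edge files in the bulk loader's CSV format.
# ~id and ~label are required for vertices; edges also need ~from and ~to.
# Any other column (name, since) is loaded as a property.
mkdir -p bulkload-data/vertices bulkload-data/edges

cat > bulkload-data/vertices/vertices.csv <<'EOF'
~id,~label,name
v1,person,Alice
v2,person,Bob
EOF

cat > bulkload-data/edges/edges.csv <<'EOF'
~id,~from,~to,~label,since
e1,v1,v2,knows,2020
EOF
```

Keep vertices and edges in separate directories, as above, so the loader configuration can point at each path independently.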

Create EMR cluster and submit the bulk load job

  1. Create the EMR default roles (if needed).

    EMR requires IAM roles to launch clusters. Many enterprise accounts don’t have these by default. Create them:

    Terminal window
    aws emr create-default-roles --region YOUR_REGION

    If the roles already exist, you’ll see a message indicating they’re already created.

  2. Run the bulk load script.

    Execute the bulk load script to create an EMR cluster and submit the Spark job:

    Terminal window
    ./bulkload.sh
    Creating EMR Cluster...
    Cluster ID: j-XXXXXXXXXXXXX
    Adding Spark job step...
    Step ID: s-XXXXXXXXXXXXX
    Example response

    Save the Cluster ID and Step ID to monitor the job.

The script performs two operations:

  1. Creates an EMR cluster with the specified configuration:

    • EMR release 6.15.0 (includes Spark and Java 11)
    • Instance type and count from your variables script
    • Configured to use Java 11 (required for Aerospike Graph)
    • Deployed in the same subnet as your Aerospike cluster
  2. Submits a Spark step that:

    • Runs the bulk loader JAR with your configuration
    • Reads vertex and edge CSV files from S3
    • Processes data in parallel across Spark executors
    • Writes graph data to your Aerospike cluster
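In outline, these two operations amount to the AWS CLI calls below. This is a hedged sketch, not the script itself: the cluster name, instance type, and instance count shown are placeholders, and your copy of bulkload.sh reads the real values from set_variables.sh.

```shell
# Sketch of the two operations bulkload.sh performs.
# Instance type/count and the cluster name are placeholder values.
CLUSTER_ID=$(aws emr create-cluster \
  --name "aerospike-graph-bulkload" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes SubnetId=YOUR_SUBNET_ID \
  --use-default-roles \
  --log-uri s3://YOUR_BUCKET_NAME/logs/ \
  --region YOUR_REGION \
  --query ClusterId --output text)

aws emr add-steps --cluster-id "$CLUSTER_ID" --region YOUR_REGION \
  --steps "Type=Spark,Name=Aerospike Graph AWS Spark Job,ActionOnFailure=CONTINUE,Args=[--class,com.aerospike.firefly.bulkloader.SparkBulkLoaderMain,s3://YOUR_BUCKET_NAME/jars/aerospike-graph-bulk-loader-3.1.1.jar,-c,s3://YOUR_BUCKET_NAME/config/bulk-loader.properties]"
```

The `--query ClusterId --output text` flags extract the bare cluster ID so the second call can reference it directly.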

Monitor the bulk load job

  1. Check the job status.

    Use the AWS CLI to monitor the bulk load step. Replace YOUR_CLUSTER_ID, YOUR_STEP_ID, and YOUR_REGION with your values:

    Terminal window
    aws emr describe-step --cluster-id YOUR_CLUSTER_ID --step-id YOUR_STEP_ID --region YOUR_REGION
    {
    "Step": {
    "Id": "s-XXXXXXXXXXXXX",
    "Name": "Aerospike Graph AWS Spark Job",
    "Config": {
    "Jar": "command-runner.jar",
    "Args": [
    "spark-submit",
    "--class",
    "com.aerospike.firefly.bulkloader.SparkBulkLoaderMain",
    "s3://YOUR_BUCKET_NAME/jars/aerospike-graph-bulk-loader-3.1.1.jar",
    "-c",
    "s3://YOUR_BUCKET_NAME/config/bulk-loader.properties"
    ]
    },
    "ActionOnFailure": "CONTINUE",
    "Status": {
    "State": "RUNNING",
    "Timeline": {
    "CreationDateTime": "2025-12-09T14:39:18.795000-08:00",
    "StartDateTime": "2025-12-09T14:42:41.446000-08:00"
    }
    }
    }
    }
    Example response

    The State field shows the current status. Possible values include:

    • PENDING: Job is waiting to start
    • RUNNING: Job is executing
    • COMPLETED: Job finished successfully
    • FAILED: Job encountered an error
  2. View detailed job logs.

    Once the job completes, you can view the detailed logs stored in S3:

    Terminal window
    aws s3 ls s3://YOUR_BUCKET_NAME/logs/ --recursive

    Download and view the stdout log to see the bulk loader output:

    Terminal window
    aws s3 cp s3://YOUR_BUCKET_NAME/logs/YOUR_CLUSTER_ID/steps/YOUR_STEP_ID/stdout.gz . && gunzip stdout.gz && cat stdout
    INFO EdgeOperations: Execution time in seconds for Edge write task: 2
    INFO ProgressBar:
    Bulk Loader Progress:
    Preflight check complete
    Temp data writing complete
    Supernode extraction complete
    Edge cache generation complete
    Vertex writing complete
    Total of 10 vertices have been successfully written
    Vertex validation complete
    Edge writing complete
    Total of 5 edges have been successfully written
    Edge validation complete
    Example response
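Rather than re-running describe-step by hand, you can wrap the status check in a small polling loop. The `step_state` helper below is a hypothetical convenience, not part of the tutorial scripts; it assumes `python3` is on your PATH for JSON parsing.

```shell
# Hypothetical helper: pull Step.Status.State out of describe-step JSON.
step_state() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["Step"]["Status"]["State"])'
}

# Poll until the step reaches a terminal state (uncomment to use):
# while true; do
#   state=$(aws emr describe-step --cluster-id YOUR_CLUSTER_ID \
#     --step-id YOUR_STEP_ID --region YOUR_REGION | step_state)
#   echo "state: $state"
#   case "$state" in COMPLETED|FAILED|CANCELLED) break ;; esac
#   sleep 30
# done

# The parser also works on a captured response:
echo '{"Step":{"Status":{"State":"RUNNING"}}}' | step_state   # prints RUNNING
```

Alternatively, `aws emr wait step-complete --cluster-id YOUR_CLUSTER_ID --step-id YOUR_STEP_ID --region YOUR_REGION` blocks until the step finishes, though it gives no intermediate output.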

The sample dataset contains 10 vertices and 5 edges. If you loaded your own data, the counts will differ based on your CSV files.
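If you load your own data, a quick way to predict those totals is to count the data rows in your input files. The `count_rows` helper below is a hypothetical one-liner: it subtracts the header line from the file's line count.

```shell
# Hypothetical sanity check: data rows per CSV = total lines minus the header.
count_rows() { echo $(( $(wc -l < "$1") - 1 )); }

# Demonstrate on a throwaway file; point it at your real vertices.csv/edges.csv.
printf '~id,~label\nv1,person\nv2,person\n' > /tmp/sample-vertices.csv
count_rows /tmp/sample-vertices.csv   # prints 2
```

Comparing these counts against the "successfully written" totals in the stdout log is a cheap end-to-end check that nothing was dropped.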

Understanding the bulk load process

The distributed bulk loader executes several phases:

  1. Preflight check: Validates configuration and connectivity to Aerospike.
  2. Temp data writing: Writes intermediate data for processing.
  3. Supernode extraction: Identifies vertices with many edges.
  4. Edge cache generation: Prepares edge data for efficient writing.
  5. Vertex writing: Writes all vertex data to Aerospike.
  6. Vertex validation: Verifies vertices were written correctly.
  7. Edge writing: Writes all edge data to Aerospike.
  8. Edge validation: Verifies edges were written correctly.

Each phase uses Spark’s parallel processing capabilities to handle large datasets efficiently.

Clean up resources

After completing the tutorial, clean up AWS resources to avoid ongoing charges and follow best practices for resource management.

Required cleanup (stops charges)

Replace the placeholder values with your actual resource IDs from earlier in the tutorial.

  1. Terminate the EMR cluster.

    Replace YOUR_CLUSTER_ID with the cluster ID from the ./bulkload.sh output (for example, j-2ZI32VESUL9O5):

    Terminal window
    aws emr terminate-clusters --cluster-ids YOUR_CLUSTER_ID --region YOUR_REGION
  2. Delete the Aerolab cluster.

    Replace YOUR_CLUSTER_NAME with the cluster name you set in set_variables.sh:

    Terminal window
    aerolab cluster destroy --name YOUR_CLUSTER_NAME
  3. Delete the S3 bucket and its contents.

    Replace YOUR_BUCKET_NAME with your S3 bucket name:

    Terminal window
    aws s3 rb s3://YOUR_BUCKET_NAME --force --region YOUR_REGION

IAM role cleanup

EMR creates IAM roles that should be cleaned up when you’re finished. The role names below (EMR_EC2_DefaultRole and EMR_DefaultRole) are the standard AWS default names and don’t need to be replaced.

  1. Delete the EMR instance profile.

    First, remove the role from the instance profile, then delete the instance profile:

    Terminal window
    aws iam remove-role-from-instance-profile --instance-profile-name EMR_EC2_DefaultRole --role-name EMR_EC2_DefaultRole
    aws iam delete-instance-profile --instance-profile-name EMR_EC2_DefaultRole
  2. Delete the EMR EC2 role.

    Detach the policies and delete the role:

    Terminal window
    aws iam detach-role-policy --role-name EMR_EC2_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role
    aws iam delete-role --role-name EMR_EC2_DefaultRole
  3. Delete the EMR service role.

    Detach the policies and delete the role:

    Terminal window
    aws iam detach-role-policy --role-name EMR_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2
    aws iam delete-role --role-name EMR_DefaultRole
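The three IAM cleanup steps above can be combined into a single script. This is a sketch using the standard default role names; the `|| true` guards are an added convenience so that re-running it doesn't abort on resources that are already gone.

```shell
# Remove the default EMR roles created by `aws emr create-default-roles`.
# `|| true` lets the script continue past resources that no longer exist.
aws iam remove-role-from-instance-profile \
  --instance-profile-name EMR_EC2_DefaultRole \
  --role-name EMR_EC2_DefaultRole || true
aws iam delete-instance-profile --instance-profile-name EMR_EC2_DefaultRole || true

aws iam detach-role-policy --role-name EMR_EC2_DefaultRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role || true
aws iam delete-role --role-name EMR_EC2_DefaultRole || true

aws iam detach-role-policy --role-name EMR_DefaultRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2 || true
aws iam delete-role --role-name EMR_DefaultRole || true
```

Skip this script entirely if other workloads in your account still launch EMR clusters with the default roles.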

Optional cleanup (networking resources)

If you created networking resources specifically for this tutorial, you can delete them. These resources don’t incur charges, but cleaning them up keeps your account organized.

Replace the placeholder values with the resource IDs you saved from the Create an Aerolab cluster page.

  1. Delete the security group (if created for this tutorial).

    Replace YOUR_SECURITY_GROUP_ID with the security group ID (for example, sg-05df537bd2b80c09c):

    Terminal window
    aws ec2 delete-security-group --group-id YOUR_SECURITY_GROUP_ID --region YOUR_REGION
  2. Delete the subnet (if created for this tutorial).

    Replace YOUR_SUBNET_ID with the subnet ID (for example, subnet-0f8a303638eddc327):

    Terminal window
    aws ec2 delete-subnet --subnet-id YOUR_SUBNET_ID --region YOUR_REGION
  3. Delete the route table (if created for this tutorial).

    Replace YOUR_ROUTE_TABLE_ID with the route table ID (for example, rtb-0123456789abcdef0):

    Terminal window
    aws ec2 delete-route-table --route-table-id YOUR_ROUTE_TABLE_ID --region YOUR_REGION
  4. Delete the Internet Gateway (if created for this tutorial).

    Detach it from the VPC first, then delete. Replace YOUR_IGW_ID with the Internet Gateway ID (for example, igw-0123456789abcdef0) and YOUR_VPC_ID with your VPC ID:

    Terminal window
    aws ec2 detach-internet-gateway --internet-gateway-id YOUR_IGW_ID --vpc-id YOUR_VPC_ID --region YOUR_REGION
    aws ec2 delete-internet-gateway --internet-gateway-id YOUR_IGW_ID --region YOUR_REGION

If you used existing networking resources (VPC, subnet, security group) that were already in your account, you don’t need to delete them.