Run bulk load job
This page guides you through uploading your files to S3, creating an EMR cluster, and running the distributed bulk load job to load graph data into your Aerospike cluster.
Upload files to S3
- Upload the bulk loader configuration and JAR.

  From the AWS bulk loading directory, upload the bucket files to S3. Replace `YOUR_BUCKET_NAME` and `YOUR_REGION` with your values:

  ```shell
  aws s3 cp ./bucket-files s3://YOUR_BUCKET_NAME/ --recursive --region YOUR_REGION
  ```

  Example response:

  ```
  upload: bucket-files/config/bulk-loader.properties to s3://YOUR_BUCKET_NAME/config/bulk-loader.properties
  upload: bucket-files/jars/aerospike-graph-bulk-loader-3.1.1.jar to s3://YOUR_BUCKET_NAME/jars/aerospike-graph-bulk-loader-3.1.1.jar
  ```

- Upload the sample vertex and edge data.

  Upload the sample graph data from the common bulkload data directory:

  ```shell
  aws s3 cp ../../../common/bulkload-data/ s3://YOUR_BUCKET_NAME/ --recursive --region YOUR_REGION
  ```

  Example response:

  ```
  upload: ../../../common/bulkload-data/vertices/vertices.csv to s3://YOUR_BUCKET_NAME/vertices/vertices.csv
  upload: ../../../common/bulkload-data/edges/edges.csv to s3://YOUR_BUCKET_NAME/edges/edges.csv
  ```
If you want to load your own graph data instead of the sample data, replace the CSV files in the `common/bulkload-data` directory with your own vertex and edge files before uploading. Ensure your CSV files follow the Aerospike Graph CSV format.
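For reference, the following sketch writes a tiny vertex/edge pair in the same directory layout the sample data uses. The `~id`/`~label`/`~from`/`~to` headers follow the Gremlin-style CSV convention; treat the exact column names and the typed headers (`name:String`, `since:Int`) here as illustrative and confirm them against the Aerospike Graph CSV format reference:

```shell
# Illustrative only: creates minimal vertex and edge CSVs in the layout
# the bulk loader expects. The typed column headers are assumptions --
# check the Aerospike Graph CSV format reference for your version.
mkdir -p bulkload-data/vertices bulkload-data/edges

cat > bulkload-data/vertices/vertices.csv <<'EOF'
~id,~label,name:String
v1,person,Alice
v2,person,Bob
EOF

cat > bulkload-data/edges/edges.csv <<'EOF'
~from,~to,~label,since:Int
v1,v2,knows,2020
EOF
```

Keeping vertices and edges in separate directories matters because the loader is pointed at each directory independently in its configuration.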
Create EMR cluster and submit the bulk load job
- Create the EMR default roles (if needed).

  EMR requires IAM roles to launch clusters. Many enterprise accounts don't have these by default. Create them:

  ```shell
  aws emr create-default-roles --region YOUR_REGION
  ```

  If the roles already exist, you'll see a message indicating they're already created.
- Run the bulk load script.

  Execute the bulk load script to create an EMR cluster and submit the Spark job:

  ```shell
  ./bulkload.sh
  ```

  Example response:

  ```
  Creating EMR Cluster...
  Cluster ID: j-XXXXXXXXXXXXX
  Adding Spark job step...
  Step ID: s-XXXXXXXXXXXXX
  ```

  Save the Cluster ID and Step ID to monitor the job.
The script performs two operations:
- Creates an EMR cluster with the specified configuration:
  - EMR release 6.15.0 (includes Spark and Java 11)
  - Instance type and count from your variables script
  - Configured to use Java 11 (required for Aerospike Graph)
  - Deployed in the same subnet as your Aerospike cluster
- Submits a Spark step that:
  - Runs the bulk loader JAR with your configuration
  - Reads vertex and edge CSV files from S3
  - Processes data in parallel across Spark executors
  - Writes graph data to your Aerospike cluster
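The two operations can be sketched as plain AWS CLI calls. This is not the actual `bulkload.sh` — the instance type and count, subnet ID, and cluster ID below are illustrative placeholders, and the real script reads its values from your variables file — but it shows roughly what the script wraps:

```shell
# Sketch only -- NOT the actual bulkload.sh. Instance type/count, subnet,
# and cluster ID are illustrative placeholders.
run() { echo "+ $*"; }   # print the commands instead of executing them

BUCKET=YOUR_BUCKET_NAME
REGION=YOUR_REGION

# 1. Create the EMR cluster. Release 6.15.0 bundles Spark; the real
#    script also applies a configuration that selects Java 11.
run aws emr create-cluster \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --ec2-attributes SubnetId=subnet-0123456789abcdef0 \
  --region "$REGION"

# 2. Submit the Spark step that runs the bulk loader JAR against the
#    configuration file uploaded earlier.
run aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps "Type=Spark,Name=Bulk Load,ActionOnFailure=CONTINUE,Args=[--class,com.aerospike.firefly.bulkloader.SparkBulkLoaderMain,s3://$BUCKET/jars/aerospike-graph-bulk-loader-3.1.1.jar,-c,s3://$BUCKET/config/bulk-loader.properties]" \
  --region "$REGION"
```

The `run` helper only prints each command, so the sketch is safe to execute and inspect before adapting it.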
Monitor the bulk load job
- Check the job status.

  Use the AWS CLI to monitor the bulk load step. Replace `YOUR_CLUSTER_ID`, `YOUR_STEP_ID`, and `YOUR_REGION` with your values:

  ```shell
  aws emr describe-step --cluster-id YOUR_CLUSTER_ID --step-id YOUR_STEP_ID --region YOUR_REGION
  ```

  Example response:

  ```json
  {
    "Step": {
      "Id": "s-XXXXXXXXXXXXX",
      "Name": "Aerospike Graph AWS Spark Job",
      "Config": {
        "Jar": "command-runner.jar",
        "Args": [
          "spark-submit",
          "--class",
          "com.aerospike.firefly.bulkloader.SparkBulkLoaderMain",
          "s3://YOUR_BUCKET_NAME/jars/aerospike-graph-bulk-loader-3.1.1.jar",
          "-c",
          "s3://YOUR_BUCKET_NAME/config/bulk-loader.properties"
        ]
      },
      "ActionOnFailure": "CONTINUE",
      "Status": {
        "State": "RUNNING",
        "Timeline": {
          "CreationDateTime": "2025-12-09T14:39:18.795000-08:00",
          "StartDateTime": "2025-12-09T14:42:41.446000-08:00"
        }
      }
    }
  }
  ```

  The `State` field shows the current status. Possible values include:
  - `PENDING`: Job is waiting to start
  - `RUNNING`: Job is executing
  - `COMPLETED`: Job finished successfully
  - `FAILED`: Job encountered an error
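Rather than re-running `describe-step` by hand, you can poll until the step reaches a terminal state. A minimal sketch, assuming `CLUSTER_ID`, `STEP_ID`, and `REGION` hold your values from the `./bulkload.sh` output (the 30-second interval is arbitrary):

```shell
# Sketch: poll the Spark step until it reaches a terminal state.
# CLUSTER_ID, STEP_ID, and REGION are placeholders for your values.
get_state() {
  aws emr describe-step \
    --cluster-id "$CLUSTER_ID" --step-id "$STEP_ID" --region "$REGION" \
    --query 'Step.Status.State' --output text
}

wait_for_step() {
  while :; do
    state=$(get_state)
    echo "step state: $state"
    case "$state" in
      COMPLETED|FAILED|CANCELLED) break ;;
    esac
    sleep 30   # polling interval; adjust to taste
  done
  echo "final state: $state"
}
```

Calling `wait_for_step` blocks until the step finishes, which is handy in scripts that should only fetch logs after completion.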
- View detailed job logs.

  Once the job completes, you can view the detailed logs stored in S3:

  ```shell
  aws s3 ls s3://YOUR_BUCKET_NAME/logs/ --recursive
  ```

  Download and view the stdout log to see the bulk loader output:

  ```shell
  aws s3 cp s3://YOUR_BUCKET_NAME/logs/YOUR_CLUSTER_ID/steps/YOUR_STEP_ID/stdout.gz . && gunzip stdout.gz && cat stdout
  ```

  Example response:

  ```
  INFO EdgeOperations: Execution time in seconds for Edge write task: 2
  INFO ProgressBar:
  Bulk Loader Progress:
  Preflight check complete
  Temp data writing complete
  Supernode extraction complete
  Edge cache generation complete
  Vertex writing complete
  Total of 10 vertices have been successfully written
  Vertex validation complete
  Edge writing complete
  Total of 5 edges have been successfully written
  Edge validation complete
  ```
The sample dataset contains 10 vertices and 5 edges. If you loaded your own data, the counts will differ based on your CSV files.
Understanding the bulk load process
The distributed bulk loader executes several phases:
- Preflight check: Validates configuration and connectivity to Aerospike.
- Temp data writing: Writes intermediate data for processing.
- Supernode extraction: Identifies vertices with many edges.
- Edge cache generation: Prepares edge data for efficient writing.
- Vertex writing: Writes all vertex data to Aerospike.
- Vertex validation: Verifies vertices were written correctly.
- Edge writing: Writes all edge data to Aerospike.
- Edge validation: Verifies edges were written correctly.
Each phase uses Spark’s parallel processing capabilities to handle large datasets efficiently.
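All of these phases are driven by the `bulk-loader.properties` file uploaded earlier, which tells the loader where the CSV data lives and how to reach Aerospike. As a rough illustration only — the key names below are assumptions based on the Aerospike Graph bulk loader documentation, so verify them against the reference for your version:

```properties
# Illustrative bulk-loader.properties -- key names and values are
# assumptions; check the Aerospike Graph bulk loader reference.
aerospike.client.host=10.0.0.10
aerospike.client.port=3000
aerospike.client.namespace=test
aerospike.graphloader.vertices=s3://YOUR_BUCKET_NAME/vertices
aerospike.graphloader.edges=s3://YOUR_BUCKET_NAME/edges
```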
Clean up resources
After completing the tutorial, clean up AWS resources to avoid ongoing charges and follow best practices for resource management.
Required cleanup (stops charges)
Replace the placeholder values with your actual resource IDs from earlier in the tutorial.
- Terminate the EMR cluster.

  Replace `YOUR_CLUSTER_ID` with the cluster ID from the `./bulkload.sh` output (for example, `j-2ZI32VESUL9O5`):

  ```shell
  aws emr terminate-clusters --cluster-ids YOUR_CLUSTER_ID --region YOUR_REGION
  ```

- Delete the Aerolab cluster.

  Replace `YOUR_CLUSTER_NAME` with the cluster name you set in `set_variables.sh`:

  ```shell
  aerolab cluster destroy --name YOUR_CLUSTER_NAME
  ```

- Delete the S3 bucket and its contents.

  Replace `YOUR_BUCKET_NAME` with your S3 bucket name:

  ```shell
  aws s3 rb s3://YOUR_BUCKET_NAME --force --region YOUR_REGION
  ```
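After running the cleanup commands, you can confirm the billable AWS resources are actually gone. A small sketch, assuming `CLUSTER_ID`, `BUCKET`, and `REGION` hold the values used earlier in the tutorial (note that EMR clusters pass through a `TERMINATING` state before reaching `TERMINATED`):

```shell
# Sketch: verify the billable AWS resources are gone. CLUSTER_ID, BUCKET,
# and REGION are placeholders for your values.
cluster_state() {
  aws emr describe-cluster --cluster-id "$CLUSTER_ID" --region "$REGION" \
    --query 'Cluster.Status.State' --output text
}

bucket_exists() {
  # head-bucket exits non-zero once the bucket no longer exists
  aws s3api head-bucket --bucket "$BUCKET" 2>/dev/null
}

verify_cleanup() {
  if [ "$(cluster_state)" = "TERMINATED" ]; then
    echo "EMR cluster terminated"
  fi
  if ! bucket_exists; then
    echo "S3 bucket deleted"
  fi
}
```

Running `verify_cleanup` once the deletions have propagated should report both resources as gone.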
IAM role cleanup
EMR creates IAM roles that should be cleaned up when you’re finished. The role names below (EMR_EC2_DefaultRole and EMR_DefaultRole) are the standard AWS default names and don’t need to be replaced.
- Delete the EMR instance profile.

  First, remove the role from the instance profile, then delete the instance profile:

  ```shell
  aws iam remove-role-from-instance-profile --instance-profile-name EMR_EC2_DefaultRole --role-name EMR_EC2_DefaultRole
  aws iam delete-instance-profile --instance-profile-name EMR_EC2_DefaultRole
  ```

- Delete the EMR EC2 role.

  Detach the policy and delete the role:

  ```shell
  aws iam detach-role-policy --role-name EMR_EC2_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role
  aws iam delete-role --role-name EMR_EC2_DefaultRole
  ```

- Delete the EMR service role.

  Detach the policy and delete the role:

  ```shell
  aws iam detach-role-policy --role-name EMR_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2
  aws iam delete-role --role-name EMR_DefaultRole
  ```
Optional cleanup (networking resources)
If you created networking resources specifically for this tutorial, you can delete them. These resources don’t incur charges but cleaning them up keeps your account organized.
Replace the placeholder values with the resource IDs you saved from the Create an Aerolab cluster page.
- Delete the security group (if created for this tutorial).

  Replace `YOUR_SECURITY_GROUP_ID` with the security group ID (for example, `sg-05df537bd2b80c09c`):

  ```shell
  aws ec2 delete-security-group --group-id YOUR_SECURITY_GROUP_ID --region YOUR_REGION
  ```

- Delete the subnet (if created for this tutorial).

  Replace `YOUR_SUBNET_ID` with the subnet ID (for example, `subnet-0f8a303638eddc327`):

  ```shell
  aws ec2 delete-subnet --subnet-id YOUR_SUBNET_ID --region YOUR_REGION
  ```

- Delete the route table (if created for this tutorial).

  Replace `YOUR_ROUTE_TABLE_ID` with the route table ID (for example, `rtb-0123456789abcdef0`):

  ```shell
  aws ec2 delete-route-table --route-table-id YOUR_ROUTE_TABLE_ID --region YOUR_REGION
  ```

- Delete the Internet Gateway (if created for this tutorial).

  Detach it from the VPC first, then delete. Replace `YOUR_IGW_ID` with the Internet Gateway ID (for example, `igw-0123456789abcdef0`) and `YOUR_VPC_ID` with your VPC ID:

  ```shell
  aws ec2 detach-internet-gateway --internet-gateway-id YOUR_IGW_ID --vpc-id YOUR_VPC_ID --region YOUR_REGION
  aws ec2 delete-internet-gateway --internet-gateway-id YOUR_IGW_ID --region YOUR_REGION
  ```
If you used existing networking resources (VPC, subnet, security group) that were already in your account, you don’t need to delete them.