Run bulk load job

This page guides you through uploading your files to S3, creating an EMR cluster, and running the distributed bulk load job to load graph data into your Aerospike cluster.

Upload files to S3

  1. Upload the bulk loader configuration and JAR.

    From the AWS bulk loading directory, upload the bucket files to S3. Replace YOUR_BUCKET_NAME and YOUR_REGION with your values:

    Terminal window
    aws s3 cp ./bucket-files s3://YOUR_BUCKET_NAME/ --recursive --region YOUR_REGION
    upload: bucket-files/config/bulk-loader.properties to s3://YOUR_BUCKET_NAME/config/bulk-loader.properties
    upload: bucket-files/jars/aerospike-graph-bulk-loader-3.1.1.jar to s3://YOUR_BUCKET_NAME/jars/aerospike-graph-bulk-loader-3.1.1.jar
    Example response
  2. Upload the sample vertex and edge data.

    Upload the sample graph data from the common bulkload data directory:

    Terminal window
    aws s3 cp ../../../common/bulkload-data/ s3://YOUR_BUCKET_NAME/ --recursive --region YOUR_REGION
    upload: ../../../common/bulkload-data/vertices/vertices.csv to s3://YOUR_BUCKET_NAME/vertices/vertices.csv
    upload: ../../../common/bulkload-data/edges/edges.csv to s3://YOUR_BUCKET_NAME/edges/edges.csv
    Example response

If you want to load your own graph data instead of the sample data, replace the CSV files in the common/bulkload-data directory with your own vertex and edge files before uploading. Ensure your CSV files follow the Aerospike Graph CSV format.
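For reference, a minimal pair of input files in the Gremlin-style CSV format the bulk loader expects might look like the sketch below. The labels and property columns (`name`, `since`) are illustrative; `~id` and `~label` are required for vertices, and edges additionally need `~from` and `~to` referencing vertex IDs.

```shell
# Illustrative vertex and edge files in the bulk loader's CSV format.
# ~id and ~label are required for vertices; edges also need ~from and ~to.
# Any other column (name, since) is loaded as a property.
mkdir -p bulkload-data/vertices bulkload-data/edges

cat > bulkload-data/vertices/vertices.csv <<'EOF'
~id,~label,name
v1,person,Alice
v2,person,Bob
EOF

cat > bulkload-data/edges/edges.csv <<'EOF'
~id,~from,~to,~label,since
e1,v1,v2,knows,2020
EOF
```

Keep vertices and edges in separate directories, as above, so the loader configuration can point at each path independently.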

Create EMR cluster and submit the bulk load job

  1. Create the EMR default roles (if needed).

    EMR requires IAM roles to launch clusters. Many enterprise accounts don’t have these by default. Create them:

    Terminal window
    aws emr create-default-roles --region YOUR_REGION

    If the roles already exist, you’ll see a message indicating they’re already created.

  2. Run the bulk load script.

    Execute the bulk load script to create an EMR cluster and submit the Spark job:

    Terminal window
    ./bulkload.sh
    Creating EMR Cluster...
    Cluster ID: j-XXXXXXXXXXXXX
    Adding Spark job step...
    Step ID: s-XXXXXXXXXXXXX
    Example response

    Save the Cluster ID and Step ID to monitor the job.

The script performs two operations:

  1. Creates an EMR cluster with the specified configuration:

    • EMR release 6.15.0 (includes Spark and Java 11)
    • Instance type and count from your variables script
    • Configured to use Java 11 (required for Aerospike Graph)
    • Deployed in the same subnet as your Aerospike cluster
  2. Submits a Spark step that:

    • Runs the bulk loader JAR with your configuration
    • Reads vertex and edge CSV files from S3
    • Processes data in parallel across Spark executors
    • Writes graph data to your Aerospike cluster
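In outline, these two operations amount to the AWS CLI calls below. This is a hedged sketch, not the script itself: the cluster name, instance type, and instance count shown are placeholders, and your copy of bulkload.sh reads the real values from set_variables.sh.

```shell
# Sketch of the two operations bulkload.sh performs.
# Instance type/count and the cluster name are placeholder values.
CLUSTER_ID=$(aws emr create-cluster \
  --name "aerospike-graph-bulkload" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes SubnetId=YOUR_SUBNET_ID \
  --use-default-roles \
  --log-uri s3://YOUR_BUCKET_NAME/logs/ \
  --region YOUR_REGION \
  --query ClusterId --output text)

aws emr add-steps --cluster-id "$CLUSTER_ID" --region YOUR_REGION \
  --steps "Type=Spark,Name=Aerospike Graph AWS Spark Job,ActionOnFailure=CONTINUE,Args=[--class,com.aerospike.firefly.bulkloader.SparkBulkLoaderMain,s3://YOUR_BUCKET_NAME/jars/aerospike-graph-bulk-loader-3.1.1.jar,-c,s3://YOUR_BUCKET_NAME/config/bulk-loader.properties]"
```

The `--query ClusterId --output text` flags extract the bare cluster ID so the second call can reference it directly.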

Monitor the bulk load job

  1. Check the job status.

    Use the AWS CLI to monitor the bulk load step. Replace YOUR_CLUSTER_ID, YOUR_STEP_ID, and YOUR_REGION with your values:

    Terminal window
    aws emr describe-step --cluster-id YOUR_CLUSTER_ID --step-id YOUR_STEP_ID --region YOUR_REGION
    {
    "Step": {
    "Id": "s-XXXXXXXXXXXXX",
    "Name": "Aerospike Graph AWS Spark Job",
    "Config": {
    "Jar": "command-runner.jar",
    "Args": [
    "spark-submit",
    "--class",
    "com.aerospike.firefly.bulkloader.SparkBulkLoaderMain",
    "s3://YOUR_BUCKET_NAME/jars/aerospike-graph-bulk-loader-3.1.1.jar",
    "-c",
    "s3://YOUR_BUCKET_NAME/config/bulk-loader.properties"
    ]
    },
    "ActionOnFailure": "CONTINUE",
    "Status": {
    "State": "RUNNING",
    "Timeline": {
    "CreationDateTime": "2025-12-09T14:39:18.795000-08:00",
    "StartDateTime": "2025-12-09T14:42:41.446000-08:00"
    }
    }
    }
    }
    Example response

    The State field shows the current status. Possible values include:

    • PENDING: Job is waiting to start
    • RUNNING: Job is executing
    • COMPLETED: Job finished successfully
    • FAILED: Job encountered an error
  2. View detailed job logs.

    Once the job completes, you can view the detailed logs stored in S3:

    Terminal window
    aws s3 ls s3://YOUR_BUCKET_NAME/logs/ --recursive

    Download and view the stdout log to see the bulk loader output:

    Terminal window
    aws s3 cp s3://YOUR_BUCKET_NAME/logs/YOUR_CLUSTER_ID/steps/YOUR_STEP_ID/stdout.gz . && gunzip stdout.gz && cat stdout
    INFO EdgeOperations: Execution time in seconds for Edge write task: 2
    INFO ProgressBar:
    Bulk Loader Progress:
    Preflight check complete
    Temp data writing complete
    Supernode extraction complete
    Edge cache generation complete
    Vertex writing complete
    Total of 10 vertices have been successfully written
    Vertex validation complete
    Edge writing complete
    Total of 5 edges have been successfully written
    Edge validation complete
    Example response
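Rather than re-running describe-step by hand, you can wrap the status check in a small polling loop. The `step_state` helper below is a hypothetical convenience, not part of the tutorial scripts; it assumes `python3` is on your PATH for JSON parsing.

```shell
# Hypothetical helper: pull Step.Status.State out of describe-step JSON.
step_state() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["Step"]["Status"]["State"])'
}

# Poll until the step reaches a terminal state (uncomment to use):
# while true; do
#   state=$(aws emr describe-step --cluster-id YOUR_CLUSTER_ID \
#     --step-id YOUR_STEP_ID --region YOUR_REGION | step_state)
#   echo "state: $state"
#   case "$state" in COMPLETED|FAILED|CANCELLED) break ;; esac
#   sleep 30
# done

# The parser also works on a captured response:
echo '{"Step":{"Status":{"State":"RUNNING"}}}' | step_state   # prints RUNNING
```

Alternatively, `aws emr wait step-complete --cluster-id YOUR_CLUSTER_ID --step-id YOUR_STEP_ID --region YOUR_REGION` blocks until the step finishes, though it gives no intermediate output.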

The sample dataset contains 10 vertices and 5 edges. If you loaded your own data, the counts will differ based on your CSV files.
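If you load your own data, a quick way to predict those totals is to count the data rows in your input files. The `count_rows` helper below is a hypothetical one-liner: it subtracts the header line from the file's line count.

```shell
# Hypothetical sanity check: data rows per CSV = total lines minus the header.
count_rows() { echo $(( $(wc -l < "$1") - 1 )); }

# Demonstrate on a throwaway file; point it at your real vertices.csv/edges.csv.
printf '~id,~label\nv1,person\nv2,person\n' > /tmp/sample-vertices.csv
count_rows /tmp/sample-vertices.csv   # prints 2
```

Comparing these counts against the "successfully written" totals in the stdout log is a cheap end-to-end check that nothing was dropped.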

Understanding the bulk load process

The distributed bulk loader executes several phases:

  1. Preflight check: Validates configuration and connectivity to Aerospike.
  2. Temp data writing: Writes intermediate data for processing.
  3. Supernode extraction: Identifies vertices with many edges.
  4. Edge cache generation: Prepares edge data for efficient writing.
  5. Vertex writing: Writes all vertex data to Aerospike.
  6. Vertex validation: Verifies vertices were written correctly.
  7. Edge writing: Writes all edge data to Aerospike.
  8. Edge validation: Verifies edges were written correctly.

Each phase uses Spark’s parallel processing capabilities to handle large datasets efficiently.

Clean up resources

After completing the tutorial, clean up AWS resources to avoid ongoing charges and follow best practices for resource management.

Required cleanup (stops charges)

Replace the placeholder values with your actual resource IDs from earlier in the tutorial.

  1. Terminate the EMR cluster.

    Replace YOUR_CLUSTER_ID with the cluster ID from the ./bulkload.sh output (for example, j-2ZI32VESUL9O5):

    Terminal window
    aws emr terminate-clusters --cluster-ids YOUR_CLUSTER_ID --region YOUR_REGION
  2. Delete the Aerolab cluster.

    Replace YOUR_CLUSTER_NAME with the cluster name you set in set_variables.sh:

    Terminal window
    aerolab cluster destroy --name YOUR_CLUSTER_NAME
  3. Delete the S3 bucket and its contents.

    Replace YOUR_BUCKET_NAME with your S3 bucket name:

    Terminal window
    aws s3 rb s3://YOUR_BUCKET_NAME --force --region YOUR_REGION

IAM role cleanup

EMR creates IAM roles that should be cleaned up when you’re finished. The role names below (EMR_EC2_DefaultRole and EMR_DefaultRole) are the standard AWS default names and don’t need to be replaced.

  1. Delete the EMR instance profile.

    First, remove the role from the instance profile, then delete the instance profile:

    Terminal window
    aws iam remove-role-from-instance-profile --instance-profile-name EMR_EC2_DefaultRole --role-name EMR_EC2_DefaultRole
    aws iam delete-instance-profile --instance-profile-name EMR_EC2_DefaultRole
  2. Delete the EMR EC2 role.

    Detach the policies and delete the role:

    Terminal window
    aws iam detach-role-policy --role-name EMR_EC2_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role
    aws iam delete-role --role-name EMR_EC2_DefaultRole
  3. Delete the EMR service role.

    Detach the policies and delete the role:

    Terminal window
    aws iam detach-role-policy --role-name EMR_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2
    aws iam delete-role --role-name EMR_DefaultRole
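The three IAM cleanup steps above can be combined into a single script. This is a sketch using the standard default role names; the `|| true` guards are an added convenience so that re-running it doesn't abort on resources that are already gone.

```shell
# Remove the default EMR roles created by `aws emr create-default-roles`.
# `|| true` lets the script continue past resources that no longer exist.
aws iam remove-role-from-instance-profile \
  --instance-profile-name EMR_EC2_DefaultRole \
  --role-name EMR_EC2_DefaultRole || true
aws iam delete-instance-profile --instance-profile-name EMR_EC2_DefaultRole || true

aws iam detach-role-policy --role-name EMR_EC2_DefaultRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role || true
aws iam delete-role --role-name EMR_EC2_DefaultRole || true

aws iam detach-role-policy --role-name EMR_DefaultRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2 || true
aws iam delete-role --role-name EMR_DefaultRole || true
```

Skip this script entirely if other workloads in your account still launch EMR clusters with the default roles.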

Optional cleanup (networking resources)

If you created networking resources specifically for this tutorial, you can delete them. These resources don’t incur charges, but cleaning them up keeps your account organized.

Replace the placeholder values with the resource IDs you saved from the Create an Aerolab cluster page.

  1. Delete the security group (if created for this tutorial).

    Replace YOUR_SECURITY_GROUP_ID with the security group ID (for example, sg-05df537bd2b80c09c):

    Terminal window
    aws ec2 delete-security-group --group-id YOUR_SECURITY_GROUP_ID --region YOUR_REGION
  2. Delete the subnet (if created for this tutorial).

    Replace YOUR_SUBNET_ID with the subnet ID (for example, subnet-0f8a303638eddc327):

    Terminal window
    aws ec2 delete-subnet --subnet-id YOUR_SUBNET_ID --region YOUR_REGION
  3. Delete the route table (if created for this tutorial).

    Replace YOUR_ROUTE_TABLE_ID with the route table ID (for example, rtb-0123456789abcdef0):

    Terminal window
    aws ec2 delete-route-table --route-table-id YOUR_ROUTE_TABLE_ID --region YOUR_REGION
  4. Delete the Internet Gateway (if created for this tutorial).

    Detach it from the VPC first, then delete. Replace YOUR_IGW_ID with the Internet Gateway ID (for example, igw-0123456789abcdef0) and YOUR_VPC_ID with your VPC ID:

    Terminal window
    aws ec2 detach-internet-gateway --internet-gateway-id YOUR_IGW_ID --vpc-id YOUR_VPC_ID --region YOUR_REGION
    aws ec2 delete-internet-gateway --internet-gateway-id YOUR_IGW_ID --region YOUR_REGION

If you used existing networking resources (VPC, subnet, security group) that were already in your account, you don’t need to delete them.