# Prepare AWS resources

This page guides you through creating AWS resources and configuring the bulk loading scripts for your environment.
## Create an S3 bucket

1. **Create an S3 bucket for your graph data.**

   Create a bucket to store vertex and edge CSV files, the bulk loader JAR, and configuration files. Replace `YOUR_BUCKET_NAME` with a unique bucket name and `YOUR_REGION` with your AWS region (for example, `us-east-1`):

   ```shell
   aws s3 mb s3://YOUR_BUCKET_NAME/ --region YOUR_REGION
   ```

   Example response:

   ```shell
   make_bucket: YOUR_BUCKET_NAME
   ```
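Before creating the bucket, you can optionally sanity-check the name against S3's naming rules (3-63 characters; lowercase letters, digits, dots, and hyphens; must start and end with a letter or digit). A minimal sketch, not part of the tutorial scripts, using an illustrative bucket name:

```shell
# Illustrative bucket name; substitute your own.
BUCKET="my-graph-bucket"

# S3 bucket names: 3-63 chars of lowercase letters, digits, dots, hyphens,
# beginning and ending with a letter or digit.
if echo "$BUCKET" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'; then
  echo "valid bucket name"
else
  echo "invalid bucket name"
fi
```

This catches common mistakes (uppercase letters, underscores) before the `aws s3 mb` call fails.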
## Download the bulk loader

1. **Navigate to the AWS bulk loading directory.**

   Make sure you're in the directory containing the bulk loading scripts:

   ```shell
   cd aerospike-graph/misc/distributed-bulkload-example/AWS
   ```

2. **Create the bucket files directory structure.**

   Create a directory for the bulk loader JAR:

   ```shell
   mkdir -p bucket-files/jars
   ```

3. **Download the Aerospike Graph bulk loader JAR.**

   Download the bulk loader directly to the `bucket-files/jars` directory. Replace `VERSION` with the latest version (for example, `3.1.1`):

   ```shell
   curl -L -o bucket-files/jars/aerospike-graph-bulk-loader-VERSION.jar \
     "https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/VERSION/aerospike-graph-bulk-loader-VERSION.jar"
   ```

   For example, to download version 3.1.1:

   ```shell
   curl -L -o bucket-files/jars/aerospike-graph-bulk-loader-3.1.1.jar \
     "https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/3.1.1/aerospike-graph-bulk-loader-3.1.1.jar"
   ```

4. **Verify the download.**

   Confirm the JAR file was downloaded successfully:

   ```shell
   ls -la bucket-files/jars/
   ```

   You should see the bulk loader JAR file listed.
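If you script the download, you can derive the filename and URL from a single `VERSION` variable so the version string only needs to change in one place. A small sketch; the URL pattern matches the `curl` commands above:

```shell
# Build the download URL from one VERSION variable.
VERSION="3.1.1"
JAR="aerospike-graph-bulk-loader-${VERSION}.jar"
URL="https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/${VERSION}/${JAR}"

echo "$URL"
# Uncomment to perform the actual download:
# curl -L -o "bucket-files/jars/${JAR}" "$URL"
```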
## Edit the variables script

1. **Open `set_variables.sh` in your editor.**

   This script contains configuration settings for your Aerolab cluster, S3 bucket, and EMR cluster.

2. **Update the Aerolab cluster variables.**

   - `name`: A unique name for your Aerospike cluster (for example, `my-graph-cluster`)
   - `username`: Your username or identifier (for example, `myusername`)

   Example:

   ```shell
   name="my-graph-cluster"
   username="myusername"
   ```

3. **Update the AWS region.**

   Set `AWS_REGION` to match the region you configured in Aerolab:

   ```shell
   AWS_REGION="YOUR_REGION"
   ```

   Replace `YOUR_REGION` with your AWS region (for example, `us-west-2`).

4. **Update the S3 bucket path.**

   Set the `BUCKET_PATH` variable to your bucket name:

   ```shell
   BUCKET_PATH="s3://YOUR_BUCKET_NAME"
   ```

   Replace `YOUR_BUCKET_NAME` with the bucket name you created earlier.

5. **Locate these two variables:**

   ```shell
   SUBNET_ID="YOUR_SUBNET_ID"
   SECURITY_GROUP="YOUR_SECURITY_GROUP_ID"
   ```

   Do not change these values yet. You will update them after creating your Aerospike cluster in the next section, when Aerolab outputs the subnet ID and security group ID.

6. **Save the file.**
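Before provisioning, it can help to confirm that no placeholder values remain in the script (other than the two you update later). A minimal sketch, demonstrated on a throwaway copy in `/tmp`; in practice, run the `grep` against your real `set_variables.sh`:

```shell
# Throwaway demo file standing in for set_variables.sh.
cat > /tmp/demo_variables.sh <<'EOF'
name="my-graph-cluster"
AWS_REGION="us-west-2"
SUBNET_ID="YOUR_SUBNET_ID"
EOF

# All placeholders in the script start with "YOUR_", so a grep finds
# any values that still need to be filled in.
if grep -q 'YOUR_' /tmp/demo_variables.sh; then
  echo "placeholders remain"
else
  echo "no placeholders found"
fi
```

Here the check reports `placeholders remain` because `SUBNET_ID` is still unset, which is expected at this stage of the tutorial.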
## Understanding the configuration

The `set_variables.sh` script centralizes all configuration for the bulk loading pipeline. Understanding these settings helps you customize the deployment for your specific requirements.
As described in the overview, this tutorial uses several AWS services working together:
- Amazon EC2: Virtual servers that run your Aerospike database and EMR worker nodes.
- Amazon S3: Object storage for your graph data files, bulk loader JAR, and job logs.
- Amazon EMR: Managed cluster platform for running Apache Spark jobs that process large datasets.
- Amazon VPC: Isolated network where your AWS resources communicate securely.
### Aerolab settings

These variables control the Aerospike cluster that Aerolab creates on EC2:

| Variable | Description | Default |
|---|---|---|
| `name` | Unique identifier for your cluster. Aerolab uses this to manage cluster lifecycle. | — |
| `username` | Owner tag applied to AWS resources for cost tracking and access control. | — |
| `aerospike_version` | Aerospike Database version to install on the cluster nodes. | `8.0.0.7` |
| `instance_count` | Number of Aerospike nodes in the cluster. Scale up for larger datasets. | `1` |
| `aws_instance_type` | EC2 instance type for Aerospike nodes. Choose based on data size and performance needs. | `t3.medium` |
### Bulk load settings

| Variable | Description | Default |
|---|---|---|
| `AWS_REGION` | AWS region where all resources are created. Must be consistent across all components. | `us-east-1` |
### S3 locations

These paths define where the bulk loader reads data and writes logs in Amazon S3:

| Variable | Description | Example |
|---|---|---|
| `BUCKET_PATH` | Root S3 path for all bulk loading files. | `s3://my-bucket` |
| `LOG_URI` | Directory where EMR writes Spark job logs. Useful for debugging failed jobs. | `${BUCKET_PATH}/logs/` |
| `SPARK_JAR` | S3 location of the bulk loader JAR file. | `${BUCKET_PATH}/jars/aerospike-graph-bulk-loader-3.1.1.jar` |
| `SPARK_ARGS` | Command-line arguments passed to the bulk loader, including the path to the properties file. | `-c,${BUCKET_PATH}/config/bulk-loader.properties` |
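These paths build on one another: everything is rooted at `BUCKET_PATH`, so changing the bucket name in one place updates all derived locations. With the example bucket from the table, they expand as follows (illustrative values only):

```shell
# All S3 locations derive from BUCKET_PATH.
BUCKET_PATH="s3://my-bucket"
LOG_URI="${BUCKET_PATH}/logs/"
SPARK_JAR="${BUCKET_PATH}/jars/aerospike-graph-bulk-loader-3.1.1.jar"
SPARK_ARGS="-c,${BUCKET_PATH}/config/bulk-loader.properties"

echo "$LOG_URI"
echo "$SPARK_JAR"
```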
### EMR cluster settings

Amazon EMR runs Apache Spark to execute the bulk loader. Spark distributes the data loading work across multiple nodes for faster processing:

| Variable | Description | Default |
|---|---|---|
| `CLUSTER_NAME` | Display name for the EMR cluster in the AWS console. | `Aerospike AWS Graph Cluster` |
| `EMR_RELEASE` | EMR release version, which determines the Spark and Hadoop versions. The default includes Spark 3.4 and Java 11 support. | `emr-6.15.0` |
| `emr_instance_type` | EC2 instance type for EMR nodes. Larger instances speed up bulk loading for big datasets. | `m5.xlarge` |
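The tutorial's run scripts create the EMR cluster for you. For orientation only, here is a hedged sketch of how these settings might map onto `aws emr create-cluster` flags; the `echo` keeps it a dry run, and this is not a complete, verified invocation:

```shell
# Table defaults, as they would appear in set_variables.sh.
CLUSTER_NAME="Aerospike AWS Graph Cluster"
EMR_RELEASE="emr-6.15.0"
EMR_INSTANCE_TYPE="m5.xlarge"

# Dry run: print the command instead of executing it.
echo aws emr create-cluster \
  --name "$CLUSTER_NAME" \
  --release-label "$EMR_RELEASE" \
  --applications Name=Spark \
  --instance-type "$EMR_INSTANCE_TYPE"
```

In the real pipeline the scripts also supply the subnet, security group, and Spark step arguments described elsewhere on this page.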
### Network settings

For the bulk loader to write data to Aerospike, the EMR cluster must have network access to the Aerospike nodes within the same VPC:

| Variable | Description |
|---|---|
| `SUBNET_ID` | VPC subnet where EMR launches its nodes. Must be the same subnet as your Aerospike cluster for private IP connectivity. |
| `SECURITY_GROUP` | Security group that allows inbound traffic on port 3000 (Aerospike). Aerolab creates this automatically when it provisions the cluster. |