# Prepare AWS resources

This page guides you through creating AWS resources and configuring the bulk loading scripts for your environment.

## Create an S3 bucket

1.  Create an S3 bucket for your graph data.

    Create a bucket to store vertex and edge CSV files, the bulk loader JAR, and configuration files. Replace `YOUR_BUCKET_NAME` with a unique bucket name and `YOUR_REGION` with your AWS region (for example, `us-east-1`):

    ``` shell
    aws s3 mb s3://YOUR_BUCKET_NAME/ --region YOUR_REGION
    ```

    Example response:

    ``` plaintext
    make_bucket: YOUR_BUCKET_NAME
    ```
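
S3 rejects bucket names that break its naming rules, so it can help to check the name locally before running `aws s3 mb`. The following is a minimal sketch and not part of the tutorial scripts. The `is_valid_bucket_name` helper is hypothetical and covers only the basic rules: 3 to 63 characters; lowercase letters, digits, dots, and hyphens; and a letter or digit at each end.

``` shell
#!/usr/bin/env bash
# Hypothetical helper: checks a name against the basic S3 bucket naming rules.
# It does not cover every rule (for example, names formatted like IP addresses).
is_valid_bucket_name() {
  local LC_ALL=C          # make [a-z] match ASCII lowercase only
  local name="$1"
  # Length must be between 3 and 63 characters.
  if [ "${#name}" -lt 3 ] || [ "${#name}" -gt 63 ]; then
    return 1
  fi
  # Lowercase letters, digits, dots, and hyphens only;
  # the first and last characters must be a letter or digit.
  [[ "$name" =~ ^[a-z0-9][a-z0-9.-]*[a-z0-9]$ ]]
}

# Example: validate the name before creating the bucket.
BUCKET_NAME="my-graph-data-bucket"   # placeholder value
if is_valid_bucket_name "$BUCKET_NAME"; then
  echo "Name looks valid; run: aws s3 mb s3://${BUCKET_NAME}/ --region YOUR_REGION"
else
  echo "Invalid bucket name: ${BUCKET_NAME}" >&2
fi
```

A name that passes this check can still be rejected if another AWS account already owns it, because bucket names are globally unique.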

## Download the bulk loader

1.  Navigate to the AWS bulk loading directory.

    Make sure you’re in the directory containing the bulk loading scripts:

    ``` shell
    cd aerospike-graph/misc/distributed-bulkload-example/AWS
    ```

2.  Create the bucket files directory structure.

    Create a directory for the bulk loader JAR:

    ``` shell
    mkdir -p bucket-files/jars
    ```

3.  Download the Aerospike Graph bulk loader JAR.

    Download the bulk loader directly to the `bucket-files/jars` directory. Replace `VERSION` with the latest version (for example, `3.1.1`):

    ``` shell
    curl -L -o bucket-files/jars/aerospike-graph-bulk-loader-VERSION.jar \
      "https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/VERSION/aerospike-graph-bulk-loader-VERSION.jar"
    ```

    For example, to download version 3.1.1:

    ``` shell
    curl -L -o bucket-files/jars/aerospike-graph-bulk-loader-3.1.1.jar \
      "https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/3.1.1/aerospike-graph-bulk-loader-3.1.1.jar"
    ```

    ::: tip
    To find the latest version, visit the [Aerospike Graph bulk loader download page](https://aerospike.com/download/graph/loader/).
    :::

4.  Verify the download.

    Confirm the JAR file was downloaded successfully:

    ``` shell
    ls -la bucket-files/jars/
    ```

    You should see the bulk loader JAR file listed.
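
Because the `curl` commands above do not use `--fail`, a wrong version number can leave an HTML error page saved under the JAR's filename, and `ls` alone will not reveal that. The sketch below builds the download URL from a version string, following the URL pattern in step 3, and checks that the downloaded file starts with the ZIP signature `PK`. Both helper functions are hypothetical names introduced for illustration.

``` shell
#!/usr/bin/env bash
# Builds the download URL for a version, following the pattern shown in step 3.
bulk_loader_url() {
  local version="$1"
  echo "https://download.aerospike.com/artifacts/aerospike-graph-bulk-loader/${version}/aerospike-graph-bulk-loader-${version}.jar"
}

# Hypothetical check: a JAR is a ZIP archive, so a valid file is non-empty
# and begins with the two bytes "PK".
looks_like_jar() {
  local file="$1"
  [ -s "$file" ] && [ "$(head -c 2 "$file")" = "PK" ]
}

VERSION="3.1.1"   # example version from this page
JAR="bucket-files/jars/aerospike-graph-bulk-loader-${VERSION}.jar"

echo "Download URL: $(bulk_loader_url "$VERSION")"
if looks_like_jar "$JAR"; then
  echo "OK: ${JAR} looks like a JAR file"
else
  echo "Missing or invalid JAR: ${JAR}" >&2
fi
```

This check only inspects the first two bytes. It does not confirm that the archive is complete or that it contains the bulk loader.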

## Edit the variables script

1.  Open `set_variables.sh` in your editor.

    This script contains configuration settings for your Aerolab cluster, S3 bucket, and EMR cluster.

2.  Update these Aerolab cluster variables:

    -   `name`: A unique name for your Aerospike cluster (for example, `my-graph-cluster`)
    -   `username`: Your username or identifier (for example, `myusername`)

    Example:

    ``` properties
    name="my-graph-cluster"
    username="myusername"
    ```

3.  Update the AWS region.

    Set `AWS_REGION` to match the region you configured in Aerolab:

    ``` properties
    AWS_REGION="YOUR_REGION"
    ```

    Replace `YOUR_REGION` with your AWS region (for example, `us-west-2`).

4.  Update the S3 bucket path.

    Set the `BUCKET_PATH` variable to the S3 URI of your bucket:

    ``` properties
    BUCKET_PATH="s3://YOUR_BUCKET_NAME"
    ```

    Replace `YOUR_BUCKET_NAME` with the name of the bucket you created earlier.

5.  Locate these two variables:

    ``` properties
    SUBNET_ID="YOUR_SUBNET_ID"
    SECURITY_GROUP="YOUR_SECURITY_GROUP_ID"
    ```

    Do not change these values yet. You will update them after creating your Aerospike cluster in the next section, when Aerolab outputs the subnet ID and security group ID.

6.  Save the file.
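
After saving, a quick scan can confirm that the bucket and region placeholders were replaced. This is a minimal sketch that assumes `set_variables.sh` contains the literal strings `YOUR_BUCKET_NAME` and `YOUR_REGION` until you edit them. The `find_unset_placeholders` function is a hypothetical helper, and it deliberately ignores the `SUBNET_ID` and `SECURITY_GROUP` placeholders, which you fill in later.

``` shell
#!/usr/bin/env bash
# Hypothetical helper: prints the numbered lines of a file that still contain
# the YOUR_BUCKET_NAME or YOUR_REGION placeholders.
# Returns 0 if any placeholder is found, non-zero otherwise (grep semantics).
find_unset_placeholders() {
  local file="$1"
  grep -nE 'YOUR_BUCKET_NAME|YOUR_REGION' "$file"
}

VARS_FILE="set_variables.sh"
if [ ! -f "$VARS_FILE" ]; then
  echo "File not found: ${VARS_FILE}" >&2
elif find_unset_placeholders "$VARS_FILE"; then
  echo "Replace the placeholders listed above before continuing." >&2
else
  echo "No bucket or region placeholders remain in ${VARS_FILE}."
fi
```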

::: note
The bulk loader JAR version in `set_variables.sh` (line 16) should match the version you downloaded. If you downloaded a different version, update the `SPARK_JAR` variable accordingly.
:::
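
One way to compare the two versions is to extract the version string from each JAR path. The sketch below assumes the filename pattern `aerospike-graph-bulk-loader-VERSION.jar` used on this page. The `jar_version` function is a hypothetical helper, and the `SPARK_JAR` value is hard-coded here as an example rather than read from `set_variables.sh`.

``` shell
#!/usr/bin/env bash
# Hypothetical helper: extracts the version from a bulk loader JAR path,
# for example "3.1.1" from ".../aerospike-graph-bulk-loader-3.1.1.jar".
jar_version() {
  local path="$1"
  local file="${path##*/}"                      # strip directories
  file="${file#aerospike-graph-bulk-loader-}"   # strip the filename prefix
  echo "${file%.jar}"                           # strip the extension
}

# Example values; in practice, copy SPARK_JAR from set_variables.sh.
SPARK_JAR='${BUCKET_PATH}/jars/aerospike-graph-bulk-loader-3.1.1.jar'
LOCAL_JAR="bucket-files/jars/aerospike-graph-bulk-loader-3.1.1.jar"

if [ "$(jar_version "$SPARK_JAR")" = "$(jar_version "$LOCAL_JAR")" ]; then
  echo "Versions match: $(jar_version "$LOCAL_JAR")"
else
  echo "Version mismatch: update SPARK_JAR in set_variables.sh" >&2
fi
```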

## Understanding the configuration

The `set_variables.sh` script centralizes all configuration for the bulk loading pipeline. Understanding these settings helps you customize the deployment for your specific requirements.

As described in the [overview](https://aerospike.com/docs/graph/graph-and-aws-get-started#aws-infrastructure-and-data-loading), this tutorial uses several AWS services working together:

-   **[Amazon EC2](https://aws.amazon.com/ec2/)**: Virtual servers that run your Aerospike database and EMR worker nodes.
-   **[Amazon S3](https://aws.amazon.com/s3/)**: Object storage for your graph data files, bulk loader JAR, and job logs.
-   **[Amazon EMR](https://aws.amazon.com/emr/)**: Managed cluster platform for running [Apache Spark](https://spark.apache.org/) jobs that process large datasets.
-   **[Amazon VPC](https://aws.amazon.com/vpc/)**: Isolated network where your AWS resources communicate securely.

### Aerolab settings

These variables control the Aerospike cluster that Aerolab creates on EC2:

| Variable | Description | Default |
|----|----|----|
| `name` | Unique identifier for your cluster. Aerolab uses this to manage cluster lifecycle. | — |
| `username` | Owner tag applied to AWS resources for cost tracking and access control. | — |
| `aerospike_version` | Aerospike Database version to install on the cluster nodes. | `8.0.0.7` |
| `instance_count` | Number of Aerospike nodes in the cluster. Scale up for larger datasets. | `1` |
| `aws_instance_type` | [EC2 instance type](https://aws.amazon.com/ec2/instance-types/) for Aerospike nodes. Choose based on data size and performance needs. | `t3.medium` |

### Bulk load settings

| Variable | Description | Default |
|----|----|----|
| `AWS_REGION` | [AWS region](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/) where all resources are created. Must be consistent across all components. | `us-east-1` |

### S3 locations

These variables define where EMR finds the bulk loader JAR and its configuration, and where it writes logs, in [Amazon S3](https://aws.amazon.com/s3/):

| Variable | Description | Example |
|----|----|----|
| `BUCKET_PATH` | Root S3 path for all bulk loading files. | `s3://my-bucket` |
| `LOG_URI` | Directory where EMR writes Spark job logs. Useful for debugging failed jobs. | `${BUCKET_PATH}/logs/` |
| `SPARK_JAR` | S3 location of the bulk loader JAR file. | `${BUCKET_PATH}/jars/aerospike-graph-bulk-loader-3.1.1.jar` |
| `SPARK_ARGS` | Command-line arguments passed to the bulk loader, including the path to the properties file. | `-c,${BUCKET_PATH}/config/bulk-loader.properties` |
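
Because the other three variables are defined relative to `BUCKET_PATH`, changing the bucket in one place updates every derived location. The following sketch uses the example values from the table, with `s3://my-bucket` as a placeholder bucket, and prints the resolved paths so you can confirm they point where you expect.

``` shell
#!/usr/bin/env bash
# Example values taken from the table above; the bucket name is a placeholder.
BUCKET_PATH="s3://my-bucket"
LOG_URI="${BUCKET_PATH}/logs/"
SPARK_JAR="${BUCKET_PATH}/jars/aerospike-graph-bulk-loader-3.1.1.jar"
SPARK_ARGS="-c,${BUCKET_PATH}/config/bulk-loader.properties"

# Print the resolved values to confirm they point at the intended bucket.
printf '%s\n' \
  "LOG_URI=${LOG_URI}" \
  "SPARK_JAR=${SPARK_JAR}" \
  "SPARK_ARGS=${SPARK_ARGS}"
```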

### EMR cluster settings

[Amazon EMR](https://aws.amazon.com/emr/) runs [Apache Spark](https://spark.apache.org/) to execute the bulk loader. Spark distributes the data loading work across multiple nodes for faster processing:

| Variable | Description | Default |
|----|----|----|
| `CLUSTER_NAME` | Display name for the EMR cluster in the AWS console. | `Aerospike AWS Graph Cluster` |
| `EMR_RELEASE` | [EMR release version](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html), which determines Spark and Hadoop versions. The default includes Spark 3.4 and Java 11 support. | `emr-6.15.0` |
| `emr_instance_type` | [EC2 instance type](https://aws.amazon.com/ec2/instance-types/) for EMR nodes. Larger instances speed up bulk loading for big datasets. | `m5.xlarge` |

### Network settings

For the bulk loader to write data to Aerospike, the EMR cluster must have network access to the Aerospike nodes within the same [VPC](https://aws.amazon.com/vpc/):

| Variable | Description |
|----|----|
| `SUBNET_ID` | [VPC subnet](https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html) where EMR launches its nodes. Must be the same subnet as your Aerospike cluster for private IP connectivity. |
| `SECURITY_GROUP` | [Security group](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html) that allows inbound traffic on port 3000 (Aerospike). Aerolab creates this automatically when it provisions the cluster. |

::: tip
You’ll retrieve the `SUBNET_ID` and `SECURITY_GROUP` values from the Aerolab output in the next section. These ensure the EMR cluster can communicate with Aerospike over the private network.
:::
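
Once you have the values, a format check can catch a leftover placeholder or a pasted value of the wrong type before EMR tries to launch. This is a minimal sketch with hypothetical helper names. It relies on AWS resource IDs having the form `subnet-` or `sg-` followed by 8 or 17 hexadecimal characters, and it does not verify that the resources exist or share a VPC with your Aerospike cluster.

``` shell
#!/usr/bin/env bash
# Hypothetical helpers: check that a value looks like an AWS resource ID
# ("subnet-" or "sg-" followed by 8 or 17 lowercase hexadecimal characters).
is_subnet_id() {
  [[ "$1" =~ ^subnet-([0-9a-f]{8}|[0-9a-f]{17})$ ]]
}

is_security_group_id() {
  [[ "$1" =~ ^sg-([0-9a-f]{8}|[0-9a-f]{17})$ ]]
}

# Placeholder values as they appear before you update set_variables.sh.
SUBNET_ID="YOUR_SUBNET_ID"
SECURITY_GROUP="YOUR_SECURITY_GROUP_ID"

is_subnet_id "$SUBNET_ID" \
  || echo "SUBNET_ID is not set to a subnet ID yet" >&2
is_security_group_id "$SECURITY_GROUP" \
  || echo "SECURITY_GROUP is not set to a security group ID yet" >&2
```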

Before you continue, confirm each of the following:

-   I’ve created an S3 bucket.
-   I’ve downloaded the bulk loader JAR file.
-   I’ve edited the variables script with my AWS configuration.