# Bulk data loading for standalone processing

## Overview

This page describes how to load graph data into an Aerospike database with the [Aerospike Graph](https://aerospike.com/docs/graph.md) bulk data loader and the Gremlin `call` API. This method is for standalone processing of small data sets.

Data processing takes place on your Aerospike Graph Service (AGS) instance, so this option is appropriate for smaller data sets or for testing. For larger data sets, use the [distributed mode](https://aerospike.com/docs/graph/3.1.0/develop/data-loading/distributed.md).

## Bulk loading with the Gremlin `call` step

### Requirements

-   A running AGS instance with the [standard Docker image](https://hub.docker.com/r/aerospike/aerospike-graph-service/tags). See [Installation](https://aerospike.com/docs/graph/3.1.0/install/docker.md) for help with getting an AGS instance up and running.

    ::: note
    The slim version of the Docker image does not include the standalone bulk loader.
    :::

-   A running [Aerospike Database](https://aerospike.com/download/), version 7.0 or later.

-   Data files for edges and vertices in [CSV format](https://aerospike.com/docs/graph/3.1.0/develop/data-loading/csv-format.md), in the required directory structure.

### The bulk load command

Use the following base Gremlin command to initiate a standalone bulk loading job:

``` groovy
g.call("aerospike.graphloader.admin.bulk-load.load")
```

Response

``` text
==>Bulk load started successfully. Use the g.call("aerospike.graphloader.admin.bulk-load.status") command to get the status of the job.
```

The full usage of this command varies by storage backend. The following sections cover local files, Amazon S3, and Google Cloud Storage (GCS), in that order.

When using local source files, make sure your AGS container can access them via Docker bind mounts. For example, if your files are in `/etc/data`:

``` bash
docker run -p 8182:8182 -v /etc/data:/opt/aerospike-graph/data aerospike/aerospike-graph-service
```

Then run the following in the Gremlin console:

``` groovy
g.with("evaluationTimeout", 20000)
 .call("aerospike.graphloader.admin.bulk-load.load")
 .with("aerospike.graphloader.vertices", "/opt/aerospike-graph/data/vertices")
 .with("aerospike.graphloader.edges", "/opt/aerospike-graph/data/edges")
 .next()
```

When using Amazon S3, provide the following credentials during the call step:

-   `aerospike.graphloader.remote-user`: your `AWS_ACCESS_KEY_ID`
-   `aerospike.graphloader.remote-passkey`: your `AWS_SECRET_ACCESS_KEY`

``` groovy
g.with("evaluationTimeout", 60000)
 .call("aerospike.graphloader.admin.bulk-load.load")
 .with("aerospike.graphloader.vertices", "s3://BUCKET_NAME/vertices")
 .with("aerospike.graphloader.edges", "s3://BUCKET_NAME/edges")
 .with("aerospike.graphloader.remote-user", "AWS_ACCESS_KEY_ID")
 .with("aerospike.graphloader.remote-passkey", "AWS_SECRET_ACCESS_KEY")
 .next()
```

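The `aerospike.graphloader.remote-user` and `aerospike.graphloader.remote-passkey` options are required unless the Docker container is already configured with credentials. Replace the placeholders `BUCKET_NAME`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` with your own values before running the example.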

For Google Cloud Storage, you can authenticate in one of two ways.

#### Option 1: Use a key file

1.  Mount the service account key JSON file into a subdirectory inside the container, not directly under the root directory (`/`). The following example mounts it at `/opt/secrets/gcs-key.json` and names the container `ags` so that you can verify the mount with `docker exec`.

    ``` bash
    docker run --name ags -d -p 8182:8182 \
      -v /PATH_TO_GCS_KEY/GCS_KEY.json:/opt/secrets/gcs-key.json \
      container.aerospike.com/aerospike/aerospike-graph-service
    ```

2.  Verify the file is available inside the container:

    ``` bash
    docker exec -it ags sh -lc 'ls -l /opt/secrets && head -n2 /opt/secrets/gcs-key.json'
    ```

3.  In the bulk load call, set `aerospike.graphloader.gcs-keyfile` to the absolute path of the key file inside the container. In the following example, replace `BUCKET_NAME` with your bucket name, and change `/opt/secrets/gcs-key.json` if you mounted the key file at a different path:

    ``` groovy
    g.with("evaluationTimeout", 60000)
     .call("aerospike.graphloader.admin.bulk-load.load")
     .with("aerospike.graphloader.vertices", "gs://BUCKET_NAME/vertices")
     .with("aerospike.graphloader.edges", "gs://BUCKET_NAME/edges")
     .with("aerospike.graphloader.gcs-keyfile", "/opt/secrets/gcs-key.json")
     .next()
    ```

#### Option 2: Provide credentials inline

Provide the following fields from the service account key during the call step:

-   `aerospike.graphloader.remote-user`: your private key ID
-   `aerospike.graphloader.remote-passkey`: your private key
-   `aerospike.graphloader.gcs-email`: your client email

Replace the placeholders in the example below, including `BUCKET_NAME`, `PRIVATE_KEY_ID`, `PRIVATE_KEY`, and `CLIENT_EMAIL`, with your actual service account values before you run it.

``` groovy
g.with("evaluationTimeout", 60000)
 .call("aerospike.graphloader.admin.bulk-load.load")
 .with("aerospike.graphloader.vertices", "gs://BUCKET_NAME/vertices")
 .with("aerospike.graphloader.edges", "gs://BUCKET_NAME/edges")
 .with("aerospike.graphloader.remote-user", "PRIVATE_KEY_ID")
 .with("aerospike.graphloader.remote-passkey", "PRIVATE_KEY")
 .with("aerospike.graphloader.gcs-email", "CLIENT_EMAIL")
 .next()
```

These values come from the `private_key_id`, `private_key`, and `client_email` fields of the service account JSON key file:

``` json
{
  "type": "service_account",
  "project_id": "PROJECT_ID",
  "private_key_id": "PRIVATE_KEY_ID",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...",
  "client_email": "SERVICE_ACCOUNT_EMAIL"
}
```
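The mapping from key file fields to loader options can be sketched in Python. This is a minimal illustration, not part of the AGS tooling: the helper name `loader_options_from_key` is hypothetical, and only the three option names and the three JSON field names come from this page.

``` python
import json

# Loader option name -> field in the service account key file.
# Both sides of this mapping are taken from the lists on this page.
OPTION_TO_KEY_FIELD = {
    "aerospike.graphloader.remote-user": "private_key_id",
    "aerospike.graphloader.remote-passkey": "private_key",
    "aerospike.graphloader.gcs-email": "client_email",
}


def loader_options_from_key(key_json: str) -> dict:
    """Return the inline-credential options for a key file (hypothetical helper)."""
    key = json.loads(key_json)
    missing = [f for f in OPTION_TO_KEY_FIELD.values() if f not in key]
    if missing:
        raise KeyError("key file is missing fields: " + ", ".join(missing))
    return {option: key[field] for option, field in OPTION_TO_KEY_FIELD.items()}


if __name__ == "__main__":
    # Placeholder values only; a real key file holds actual credentials.
    sample = json.dumps({
        "type": "service_account",
        "private_key_id": "PRIVATE_KEY_ID",
        "private_key": "PRIVATE_KEY",
        "client_email": "CLIENT_EMAIL",
    })
    for option, value in loader_options_from_key(sample).items():
        print(option, "=", value)
```

Each resulting pair corresponds to one `.with(option, value)` modulator on the bulk load call. Failing early on a missing field makes an incomplete key file easier to spot than an authentication error at load time.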

::: note
Use the key file option if authentication fails when passing `private_key` inline. Large keys can be truncated by some Docker network setups. The key file avoids truncation.
:::

------------------------------------------------------------------------

### The [`evaluationTimeout`](https://aerospike.com/docs/graph/reference/config/#aerospikegraph-serviceevaluationtimeout.md) parameter

The default AGS command timeout is 10 seconds, which `evaluationTimeout` expresses in milliseconds as `10000`. Depending on system load or configuration, the bulk loader may take longer than that to initialize.

If your graph data is stored in remote cloud buckets, or your cluster is slow to initialize, increase the timeout with the `evaluationTimeout` parameter.

For example:

``` groovy
// For remote storage access (S3/GCS), longer initialization is expected
g.with("evaluationTimeout", 60000)
```

If the bulk load command fails or times out during initialization, try increasing this value.

## Status monitoring

Use the command `aerospike.graphloader.admin.bulk-load.status` to check the progress of a standalone bulk data loading job. In the Gremlin console:

``` groovy
g.call("aerospike.graphloader.admin.bulk-load.status").next()
```

::: caution
If your cluster has more than one AGS node, turn off the other nodes before querying the bulk load status. Otherwise, the results can be unreliable.
:::

This call returns a structured response describing the job’s current status. The available fields are:

| Key | Type | Availability | Description |
|----|----|----|----|
| `step` | String | Always | Current bulk load step. See [stages and steps](https://aerospike.com/docs/graph/3.1.0/develop/data-loading/distributed/#bulk-data-loading-job-stages-and-steps.md) for a complete list of bulk loading steps. |
| `complete` | Boolean | Always | If `true`, the current bulk loading job is complete. If `false`, the job is ongoing. |
| `status` | String | Always | Current job status. May be one of: `success`, `in progress`, `error` |
| `message` | String | Only when `complete` is `true` and `status` is `error` | Message from the Exception that caused the failure. |
| `stacktrace` | String | Only when `complete` is `true` and `status` is `error` | Stacktrace from the Exception that caused the failure. |
| `elements-written` | Long | Only when `stage` is `Vertex writing` or `Edge writing` | Number of vertex or edge elements written, depending on the current writing stage. |
| `complete-partitions-percentage` | Integer | Only when `stage` is `Vertex writing` or `Edge writing` | Percentage of partitions completed for the current writing stage. |
| `duplicate-vertex-ids` | Long | When `complete` is `true`. May be absent if `status` is `error` and the error which caused the job to fail makes this information inaccessible. | See [Error handling](https://aerospike.com/docs/graph/3.1.0/develop/data-loading/error-handling.md) for details. |
| `bad-entries` | Long | When `complete` is `true`. May be absent if `status` is `error` and the error which caused the job to fail makes this information inaccessible. | See [Error handling](https://aerospike.com/docs/graph/3.1.0/develop/data-loading/error-handling.md) for details. |
| `bad-edges` | Long | When `complete` is `true`. May be absent if `status` is `error` and the error which caused the job to fail makes this information inaccessible. | See [Error handling](https://aerospike.com/docs/graph/3.1.0/develop/data-loading/error-handling.md) for details. |
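
A client can poll the status call until `complete` is `true`. The following Python sketch shows one way to do this. It assumes the status response has already been converted to a dictionary keyed by the field names in the table above. The function names, the `fetch_status` callable, and the polling interval are all hypothetical and not part of the AGS API.

``` python
import time
from typing import Callable, Mapping


def summarize_status(status: Mapping) -> str:
    """Build a one-line summary from a bulk load status response."""
    parts = ["step=" + str(status["step"]), "status=" + str(status["status"])]
    # Progress fields are only present during the writing stages.
    if "elements-written" in status:
        parts.append("elements-written=" + str(status["elements-written"]))
    if "complete-partitions-percentage" in status:
        parts.append("partitions=" + str(status["complete-partitions-percentage"]) + "%")
    # The error message is only present when the job finished with an error.
    if status["complete"] and status["status"] == "error":
        parts.append("message=" + str(status.get("message")))
    return ", ".join(parts)


def wait_for_bulk_load(
    fetch_status: Callable[[], Mapping],
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping:
    """Call fetch_status until the job reports complete, then return that response."""
    while True:
        status = fetch_status()
        print(summarize_status(status))
        if status["complete"]:
            return status
        sleep(poll_seconds)


if __name__ == "__main__":
    # Canned responses stand in for a real status query against AGS.
    canned = iter([
        {"step": "Vertex writing", "complete": False, "status": "in progress",
         "elements-written": 1200, "complete-partitions-percentage": 40},
        {"step": "Edge writing", "complete": True, "status": "success",
         "duplicate-vertex-ids": 0, "bad-entries": 0, "bad-edges": 0},
    ])
    final = wait_for_bulk_load(lambda: next(canned), poll_seconds=0.0)
    print("finished with status:", final["status"])
```

In real use, `fetch_status` would wrap the `aerospike.graphloader.admin.bulk-load.status` call made through your Gremlin client. Passing it in as a callable keeps the loop independent of any particular client library.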