Bulk data loading for standalone processing
Overview
This page describes how to load graph data into an Aerospike database with the Aerospike Graph bulk data loader and the Gremlin call() API. This method is intended for standalone processing of small data sets.
Data processing takes place on your AGS instance, so this option is appropriate for smaller data sets or for testing. For larger data sets, we recommend distributed mode.
Bulk loading with the Gremlin call() step
Requirements
- A running AGS instance with the standard Docker image. See Installation for help with getting an AGS instance up and running.
- A running Aerospike Database, version 7.0 or later.
- Data files for edges and vertices in CSV format, in the required directory structure (see the sketch after this list).
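The exact CSV and directory requirements are covered in the data-format documentation; the following is a minimal illustrative sketch, assuming the loader's convention of separate vertices and edges directories with `~id` columns for vertex rows and `~from`/`~to` columns for edge rows (the file names and property columns here are examples only):

```
/etc/data/
├── vertices/
│   └── people.csv
└── edges/
    └── knows.csv
```

vertices/people.csv:

```csv
~id,~label,name
1,person,Alice
2,person,Bob
```

edges/knows.csv:

```csv
~from,~to,~label,since
1,2,knows,2020
```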
The bulk load command
Use the following base Gremlin command to initiate a standalone bulk loading job:
g.call("aerospike.graphloader.admin.bulk-load.load")
==>Bulk load started successfully. Use the g.call("aerospike.graphloader.admin.bulk-load.status") command to get the status of the job.
The full usage of this command varies by storage backend (local filesystem, Amazon S3, or Google Cloud Storage), as shown in the sections below.
Local files
When using local source files, make sure your AGS container can access them via Docker bind mounts. For example, if your files are in /etc/data:

```bash
docker run -p 8182:8182 \
  -v /etc/data:/opt/aerospike-graph/data \
  container.aerospike.com/aerospike/aerospike-graph-service
```
Then run the following in the Gremlin console:
g.with("evaluationTimeout", 20000) .call("aerospike.graphloader.admin.bulk-load.load") .with("aerospike.graphloader.vertices", "/opt/aerospike-graph/data/vertices") .with("aerospike.graphloader.edges", "/opt/aerospike-graph/data/edges") .next()
Amazon S3
When using Amazon S3, provide the following credentials during the call step:

- `aerospike.graphloader.remote-user`: your AWS_ACCESS_KEY_ID
- `aerospike.graphloader.remote-passkey`: your AWS_SECRET_ACCESS_KEY
g.with("evaluationTimeout", 60000) .call("aerospike.graphloader.admin.bulk-load.load") .with("aerospike.graphloader.vertices", "s3://<bucket-name>/vertices") .with("aerospike.graphloader.edges", "s3://<bucket-name>/edges") .with("aerospike.graphloader.remote-user", "AWS_ACCESS_KEY_ID") .with("aerospike.graphloader.remote-passkey", "AWS_SECRET_ACCESS_KEY") .next()
These options are required unless the Docker container is already configured with credentials.
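As an alternative to passing credentials in the call step, you can configure the container itself. A sketch of one common approach, assuming AGS resolves S3 credentials through the AWS SDK's default credential chain (placeholder values shown):

```bash
# Start AGS with S3 credentials in the container environment, so the
# remote-user/remote-passkey options can be omitted from the call step.
docker run -p 8182:8182 \
  -e AWS_ACCESS_KEY_ID=<your-access-key-id> \
  -e AWS_SECRET_ACCESS_KEY=<your-secret-access-key> \
  container.aerospike.com/aerospike/aerospike-graph-service
```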
Google Cloud Storage
For Google Cloud Storage, provide credentials from a GCS service account key:

- `aerospike.graphloader.remote-user`: your private_key_id
- `aerospike.graphloader.remote-passkey`: your private_key
- `aerospike.graphloader.gcs-email`: your client_email
g.with("evaluationTimeout", 60000) .call("aerospike.graphloader.admin.bulk-load.load") .with("aerospike.graphloader.vertices", "gs://<bucket-name>/vertices") .with("aerospike.graphloader.edges", "gs://<bucket-name>/edges") .with("aerospike.graphloader.remote-user", "private_key_id") .with("aerospike.graphloader.remote-passkey", "private_key") .with("aerospike.graphloader.gcs-email", "client_email") .next()
These fields are extracted from the JSON key file:
{ "type": "service_account", "project_id": "my-project", "private_key_id": "...", "private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "account@project.iam.gserviceaccount.com"}
The evaluationTimeout parameter
The default AGS command timeout is 10 seconds (specified in milliseconds as `10000`). Depending on system load or configuration, the bulk loader may take longer than that to initialize. If your graph data is stored in remote cloud buckets or your cluster takes longer to initialize, you can increase the timeout with the `evaluationTimeout` parameter.
For example:
```groovy
// For remote storage access (S3/GCS), longer initialization is expected
g.with("evaluationTimeout", 60000)
```
If commands are failing during initialization, try increasing this value.
Status monitoring
Use the `aerospike.graphloader.admin.bulk-load.status` command to check the progress of a standalone bulk data loading job. In the Gremlin console:
g.call("aerospike.graphloader.admin.bulk-load.status").next()
This call returns a structured response describing the job’s current status. The available fields are:
| Key | Type | Availability | Description |
|---|---|---|---|
| `step` | String | Always | Current bulk load step. See stages and steps for a complete list of bulk loading steps. |
| `complete` | Boolean | Always | If `true`, the current bulk loading job is complete. If `false`, the job is ongoing. |
| `status` | String | Always | Current job status. One of: `success`, `in progress`, `error`. |
| `message` | String | Only when `complete` is `true` and `status` is `error` | Message from the exception that caused the failure. |
| `stacktrace` | String | Only when `complete` is `true` and `status` is `error` | Stacktrace from the exception that caused the failure. |
| `elements-written` | Long | Only when the current stage is `Vertex writing` or `Edge writing` | Number of vertex or edge elements written, depending on the current writing stage. |
| `complete-partitions-percentage` | Integer | Only when the current stage is `Vertex writing` or `Edge writing` | Percentage of partitions completed for the current writing stage. |
| `duplicate-vertex-ids` | Long | When `complete` is `true`. May be absent if `status` is `error` and the error that caused the failure makes this information inaccessible. | See Error handling for details. |
| `bad-entries` | Long | When `complete` is `true`. May be absent if `status` is `error` and the error that caused the failure makes this information inaccessible. | See Error handling for details. |
| `bad-edges` | Long | When `complete` is `true`. May be absent if `status` is `error` and the error that caused the failure makes this information inaccessible. | See Error handling for details. |
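Because the load command returns as soon as the job starts, a common pattern is to poll the status call until `complete` is `true`. A minimal Gremlin console (Groovy) sketch, assuming the call returns a map with the keys described in the table above:

```groovy
// Poll the bulk load status every 10 seconds until the job finishes.
// The "complete" and "status" keys are described in the table above.
while (true) {
    s = g.call("aerospike.graphloader.admin.bulk-load.status").next()
    println(s)
    if (s["complete"]) break
    Thread.sleep(10000)
}
```

When the loop exits, a `status` of `success` means the load finished cleanly; if it is `error`, inspect the `message` and `stacktrace` fields.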