Bulk data loading for standalone processing
Overview
This page describes how to load graph data into an Aerospike database with the Aerospike Graph bulk data loader and the Gremlin call() API. This method is intended for standalone processing of small data sets.
Data processing takes place on your AGS instance, so this option is appropriate for smaller data sets or for testing. For larger data sets, we recommend distributed mode.
Bulk loading with the Gremlin call() step
Requirements
- A running AGS instance with the standard Docker image. See Installation for help with getting an AGS instance up and running.
- A running Aerospike Database, version 7.0 or later.
- Data files for edges and vertices in CSV format, in the required directory structure (see the sketch after this list).
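The exact CSV and directory requirements are covered in the data-format documentation; the following is a minimal illustrative sketch, assuming the loader's convention of separate vertices and edges directories with `~id` columns for vertex rows and `~from`/`~to` columns for edge rows (the file names and property columns here are examples only):

```
/etc/data/
├── vertices/
│   └── people.csv
└── edges/
    └── knows.csv
```

vertices/people.csv:

```csv
~id,~label,name
1,person,Alice
2,person,Bob
```

edges/knows.csv:

```csv
~from,~to,~label,since
1,2,knows,2020
```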
The bulk load command
Use the following base Gremlin command to initiate a standalone bulk loading job:
g.call("aerospike.graphloader.admin.bulk-load.load")
==>Bulk load started successfully. Use the g.call("aerospike.graphloader.admin.bulk-load.status") command to get the status of the job.
The full usage of this command varies by storage backend (local filesystem, Amazon S3, or Google Cloud Storage), as shown in the sections below.
Local files
When using local source files, make sure your AGS container can access them via Docker bind mounts. For example, if your files are in /etc/data:

```bash
docker run -p 8182:8182 \
  -v /etc/data:/opt/aerospike-graph/data \
  container.aerospike.com/aerospike/aerospike-graph-service
```
Then run the following in the Gremlin console:
g.with("evaluationTimeout", 20000) .call("aerospike.graphloader.admin.bulk-load.load") .with("aerospike.graphloader.vertices", "/opt/aerospike-graph/data/vertices") .with("aerospike.graphloader.edges", "/opt/aerospike-graph/data/edges") .next()
Amazon S3
When using Amazon S3, provide the following credentials during the call step:

- `aerospike.graphloader.remote-user`: your AWS_ACCESS_KEY_ID
- `aerospike.graphloader.remote-passkey`: your AWS_SECRET_ACCESS_KEY
g.with("evaluationTimeout", 60000) .call("aerospike.graphloader.admin.bulk-load.load") .with("aerospike.graphloader.vertices", "s3://<bucket-name>/vertices") .with("aerospike.graphloader.edges", "s3://<bucket-name>/edges") .with("aerospike.graphloader.remote-user", "AWS_ACCESS_KEY_ID") .with("aerospike.graphloader.remote-passkey", "AWS_SECRET_ACCESS_KEY") .next()
These options are required unless the Docker container is already configured with credentials.
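As an alternative to passing credentials in the call step, you can configure the container itself. A sketch of one common approach, assuming AGS resolves S3 credentials through the AWS SDK's default credential chain (placeholder values shown):

```bash
# Start AGS with S3 credentials in the container environment, so the
# remote-user/remote-passkey options can be omitted from the call step.
docker run -p 8182:8182 \
  -e AWS_ACCESS_KEY_ID=<your-access-key-id> \
  -e AWS_SECRET_ACCESS_KEY=<your-secret-access-key> \
  container.aerospike.com/aerospike/aerospike-graph-service
```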
Google Cloud Storage
For Google Cloud Storage, provide credentials from a GCS service account key:

- `aerospike.graphloader.remote-user`: your private_key_id
- `aerospike.graphloader.remote-passkey`: your private_key
- `aerospike.graphloader.gcs-email`: your client_email
g.with("evaluationTimeout", 60000) .call("aerospike.graphloader.admin.bulk-load.load") .with("aerospike.graphloader.vertices", "gs://<bucket-name>/vertices") .with("aerospike.graphloader.edges", "gs://<bucket-name>/edges") .with("aerospike.graphloader.remote-user", "private_key_id") .with("aerospike.graphloader.remote-passkey", "private_key") .with("aerospike.graphloader.gcs-email", "client_email") .next()
These fields are extracted from the JSON key file:
{ "type": "service_account", "project_id": "my-project", "private_key_id": "...", "private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "account@project.iam.gserviceaccount.com"}
The evaluationTimeout parameter
The default AGS command timeout is 10 seconds (specified in milliseconds as `10000`). Depending on system load or configuration, the bulk loader may take longer than that to initialize. If your graph data is stored in remote cloud buckets or your cluster takes longer to initialize, you can increase the timeout with the `evaluationTimeout` parameter.
For example:
```groovy
// For remote storage access (S3/GCS), longer initialization is expected
g.with("evaluationTimeout", 60000)
```
If commands are failing during initialization, try increasing this value.
Status monitoring
Use the `aerospike.graphloader.admin.bulk-load.status` command to check the progress of a standalone bulk data loading job. In the Gremlin console:
g.call("aerospike.graphloader.admin.bulk-load.status").next()
This call returns a structured response describing the job’s current status. The available fields are:
| Key | Type | Availability | Description |
|---|---|---|---|
| `step` | String | Always | Current bulk load step. See stages and steps for a complete list of bulk loading steps. |
| `complete` | Boolean | Always | If `true`, the current bulk loading job is complete. If `false`, the job is ongoing. |
| `status` | String | Always | Current job status. One of: `success`, `in progress`, `error`. |
| `message` | String | Only when `complete` is `true` and `status` is `error` | Message from the exception that caused the failure. |
| `stacktrace` | String | Only when `complete` is `true` and `status` is `error` | Stacktrace from the exception that caused the failure. |
| `elements-written` | Long | Only when the current stage is `Vertex writing` or `Edge writing` | Number of vertex or edge elements written, depending on the current writing stage. |
| `complete-partitions-percentage` | Integer | Only when the current stage is `Vertex writing` or `Edge writing` | Percentage of partitions completed for the current writing stage. |
| `duplicate-vertex-ids` | Long | When `complete` is `true`. May be absent if `status` is `error` and the error that caused the failure makes this information inaccessible. | See Error handling for details. |
| `bad-entries` | Long | When `complete` is `true`. May be absent if `status` is `error` and the error that caused the failure makes this information inaccessible. | See Error handling for details. |
| `bad-edges` | Long | When `complete` is `true`. May be absent if `status` is `error` and the error that caused the failure makes this information inaccessible. | See Error handling for details. |
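Because the load command returns as soon as the job starts, a common pattern is to poll the status call until `complete` is `true`. A minimal Gremlin console (Groovy) sketch, assuming the call returns a map with the keys described in the table above:

```groovy
// Poll the bulk load status every 10 seconds until the job finishes.
// The "complete" and "status" keys are described in the table above.
while (true) {
    s = g.call("aerospike.graphloader.admin.bulk-load.status").next()
    println(s)
    if (s["complete"]) break
    Thread.sleep(10000)
}
```

When the loop exits, a `status` of `success` means the load finished cleanly; if it is `error`, inspect the `message` and `stacktrace` fields.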