
Bulk Data Loading Procedures for Standalone Processing

You can load graph data into an Aerospike database efficiently with the Aerospike Graph bulk data loader and the Gremlin call API. Data processing takes place on your Aerospike Graph Service machine, so this option is appropriate for smaller data sets or for testing. For larger data sets, distributed mode is recommended.

Bulk loading with the Gremlin call step

note

The bulk loader can only load data into an empty database.

Requirements

  • A running Aerospike Graph Service (AGS) instance. See Installation for help with getting an AGS instance up and running.
  • A running Aerospike Database instance, version 6.2.0.7 or higher.
  • Data files for edges and vertices in the Gremlin CSV format.

Source data files

The bulk loader accepts data files in the Gremlin CSV format, with vertices and edges specified in separate files. Every CSV file should begin with a header row that names each column of data.

note

Aerospike Graph does not support user-provided ~id values for edges, so the ~id column is optional for edge CSV files. If your CSV file contains an ~id column, the values are ignored.
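
For reference, a minimal vertex file (stored under a subdirectory named for its label, such as people) and a minimal edge file (omitting the optional ~id column) might look like the following. The property names and types shown here are illustrative:

A vertex CSV file:

~id,~label,name:String,age:Int
p1,people,Alice,30
p2,people,Bob,42

An edge CSV file:

~from,~to,~label,since:Int
p1,p2,knows,2015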

CSV data files can either be local or in cloud-based storage. Cloud-based data files can be stored either in Amazon AWS S3 or Google Cloud Storage.

Data files should be stored in directories specified by the aerospike.graphloader.vertices and aerospike.graphloader.edges configuration options.

  • The directory specified in aerospike.graphloader.vertices should contain one or more subdirectories of vertex CSV files.
  • The directory specified in aerospike.graphloader.edges should contain one or more subdirectories of edge CSV files.
  • Each subdirectory should be named for the label of the data files it contains. For example, a subdirectory of vertex files containing data about people should be named people. A subdirectory of edge files containing data about connections in the people vertices, in which each row has the knows label, should be named knows.

For example, if your S3 bucket is named myBucket, that bucket should contain separate directories for edge and vertex data files, and those directories should contain subdirectories for the CSV files. If the aerospike.graphloader.vertices configuration option is set to s3://myBucket/vertices, you might have subdirectories named s3://myBucket/vertices/people and s3://myBucket/vertices/places, each containing one or more CSV files.

Example directory structure:

/myBucket 
|
---- /myBucket/vertices/
|
-------- /myBucket/vertices/people/
|
------------ /myBucket/vertices/people/vert_file1.csv
------------ /myBucket/vertices/people/vert_file2.csv
|
-------- /myBucket/vertices/places/
|
------------ /myBucket/vertices/places/vert_file3.csv
------------ /myBucket/vertices/places/vert_file4.csv
|
---- /myBucket/edges/
|
-------- /myBucket/edges/worksWith/
|
------------ /myBucket/edges/worksWith/edge_file1.csv
------------ /myBucket/edges/worksWith/edge_file2.csv
|
-------- /myBucket/edges/knows/
|
------------ /myBucket/edges/knows/edge_file3.csv
------------ /myBucket/edges/knows/edge_file4.csv

When using cloud-based source data files, be sure to include your cloud service credentials with the call function. The required parameters for cloud service credentials are listed in the Cloud storage configuration options section.

Bulk loading with local files

You can bulk load local files with the Gremlin call step by specifying their location in the Gremlin command. Use the aerospike.graphloader.vertices and aerospike.graphloader.edges options to specify file directory locations.

note

Local files must be accessible to the AGS Docker image. Specify local file locations in your Docker run command.

The call API runs the bulk loader on a single AGS instance. Aerospike Graph runs in Docker, so any local file paths that you pass to call must be accessible to the Docker image. If you are using local source data files, you must mount these in the Docker image to make them accessible to the bulk loader. More information about mounting directories is available in the Docker volumes documentation. In this example, we have the following directories:

  • /home/graph-user/graph/data/docker-bulk-load/sampledata/vertices/
  • /home/graph-user/graph/data/docker-bulk-load/sampledata/edges/

When we mount /home/graph-user/graph/data/docker-bulk-load/ to /opt/aerospike/etc/, the container sees all the subdirectories below /opt/aerospike/etc/, including sampledata/*. That is reflected in the paths specified in the call step below.

docker run -p 8182:8182 \
-v /home/graph-user/graph/data/docker-bulk-load/:/opt/aerospike/etc/ \
aerospike/aerospike-graph-service
note

If you are using cloud storage for your data source files, it's not necessary to specify their location in the Docker run command.

When using the -v flag, the path on the left side of the : character is your local path, and the path on the right side of the : character is the path within your AGS Docker image.
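
Before running a load, you can optionally confirm that the mounted files are visible inside the container. The container name below is illustrative; use docker ps to find the name or ID of your AGS container:

docker exec my-ags-container ls /opt/aerospike/etc/sampledata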

To specify your file locations in the Gremlin command, use the with step:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).call("bulk-load").with("aerospike.graphloader.vertices", "/opt/aerospike/etc/sampledata/vertices").with("aerospike.graphloader.edges", "/opt/aerospike/etc/sampledata/edges")

Bulk loading with remote files

The bulk loader supports remote data files stored in Google Storage Buckets on GCP and S3 buckets on AWS. You can specify remote file locations and the credentials necessary to reach them in the call step. Use the aerospike.graphloader.vertices and aerospike.graphloader.edges options to specify file directory locations, and use any necessary credential options to authenticate with your cloud provider.

The following example Gremlin command uses the bulk loader to add data from source data files stored in an AWS S3 bucket to an Aerospike Graph database:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).call("bulk-load").with("aerospike.graphloader.vertices", "s3://myBucket/vertices").with("aerospike.graphloader.edges", "s3://myBucket/edges").with("aerospike.graphloader.remote-user", "AWS_ACCESS_KEY_ID").with("aerospike.graphloader.remote-passkey", "AWS_SECRET_ACCESS_KEY")

The evaluationTimeout parameter

All bulk loader operations should include the evaluationTimeout parameter:

.with("evaluationTimeout", 24L * 60L * 60L * 1000L)

This parameter prevents the bulk loading operation from timing out when running for extended periods. In the above example, the timeout is set to 24 hours in milliseconds. You can adjust it as necessary.

note

Certain Gremlin language variants, such as Gremlin-Java, may expect the numeric value to be of type Long, as shown in the example above (24L * 60L * 60L * 1000L).

Configuration options

You can specify configuration options as part of the Gremlin call step.

The following options are available for the bulk loader:

Name                                       | Optional | Default | Description
aerospike.graphloader.edges                | no       | none    | The path to the directory or cloud storage location where CSV files containing edge data are stored. This directory may contain subdirectories with CSV files.
aerospike.graphloader.vertices             | no       | none    | The path to the directory or cloud storage location where CSV files containing vertex data are stored. This directory may contain subdirectories with CSV files.
aerospike.graphloader.sampling-percentage  | yes      | 0       | The percentage of the input data to sample to verify that the bulk loading job was successful.
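
For example, the following command runs the bulk loader and samples 10 percent of the input data to verify the load. The paths reuse the local-file locations from the earlier example, and the sampling value is illustrative:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).call("bulk-load").with("aerospike.graphloader.vertices", "/opt/aerospike/etc/sampledata/vertices").with("aerospike.graphloader.edges", "/opt/aerospike/etc/sampledata/edges").with("aerospike.graphloader.sampling-percentage", 10)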

Cloud storage configuration options

The bulk loader supports cloud-based storage locations for source data. If your edge and vertex CSV files are stored in AWS S3 or Google Cloud Storage buckets, the following configuration options apply.

note

These options may be optional or required, depending on the remote environment. Check your cloud service documentation for details.

The following options supply credentials for cloud storage:

  • aerospike.graphloader.remote-user
  • aerospike.graphloader.remote-passkey
  • aerospike.graphloader.gcs-email
  • aerospike.graphloader.gcs-keyfile

Additional cloud considerations

  • When loading from AWS S3, populate the aerospike.graphloader.remote-user option with your AWS_ACCESS_KEY_ID value, and populate the aerospike.graphloader.remote-passkey option with your AWS_SECRET_ACCESS_KEY value.
  • The AWS options are required for the Gremlin call step unless the Graph Docker environment is preconfigured with AWS credentials.
  • The GCS options (aerospike.graphloader.gcs-email and aerospike.graphloader.gcs-keyfile) are not applicable when loading from AWS S3.
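
For Google Cloud Storage, the same pattern applies using the GCS credential options. The following command is a sketch only: it assumes a bucket addressed with a gs:// URI and a service account whose email and keyfile are supplied through the gcs-email and gcs-keyfile options. The bucket name, service account email, and keyfile path are illustrative, and the keyfile must be accessible inside the AGS container:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).call("bulk-load").with("aerospike.graphloader.vertices", "gs://myBucket/vertices").with("aerospike.graphloader.edges", "gs://myBucket/edges").with("aerospike.graphloader.gcs-email", "loader@my-project.iam.gserviceaccount.com").with("aerospike.graphloader.gcs-keyfile", "/opt/aerospike/etc/gcs-keyfile.json")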