Version: Graph 2.4.2

Bulk data loading for standalone processing

Overview

This page describes how to load graph data into an Aerospike database with the Aerospike Graph Service (AGS) bulk data loader and the Gremlin call API. This method is for standalone processing of small data sets.

Data processing takes place on your AGS machine, so this option is appropriate for smaller data sets or for testing. For larger data sets, we recommend using the distributed mode.

Bulk loading with the Gremlin call step

Requirements

  • A running AGS instance with the standard Docker image. See Installation for help with getting an AGS instance up and running. NOTE: the slim version of the Docker image does not include the standalone bulk loader.
  • A running Aerospike Database, version 6.2.0.7 or later.
  • Data files for edges and vertices in CSV format, in the required directory structure.
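
The requirements above mention CSV files in a required directory structure. As a sketch of what such a layout can look like, the following Python snippet generates a tiny data set. The column names (`~id` and `~label` for vertices, `~from`, `~to`, and `~label` for edges) and the per-label subdirectories are assumptions here; consult the bulk loader data-format documentation for the authoritative layout.

```python
import csv
import os
import tempfile

# Build a minimal source layout under a temporary root:
#   <root>/vertices/person/people.csv
#   <root>/edges/knows/knows.csv
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "vertices", "person"))
os.makedirs(os.path.join(root, "edges", "knows"))

# Vertex file: each row is one vertex, identified by ~id with a ~label
# column, followed by property columns.
with open(os.path.join(root, "vertices", "person", "people.csv"), "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["~id", "~label", "name", "age"])
    w.writerow(["v1", "person", "alice", "34"])
    w.writerow(["v2", "person", "bob", "29"])

# Edge file: each row is one edge between two vertex ~id values.
with open(os.path.join(root, "edges", "knows", "knows.csv"), "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["~from", "~to", "~label", "since"])
    w.writerow(["v1", "v2", "knows", "2015"])
```

A directory tree like this can then be bind-mounted into the AGS container as shown in the next section.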

Bulk loading with local source files

Your source data files must be accessible by your AGS instance. When you start the AGS Docker container, bind your local source files with the Docker -v option. In the following example command, the local source files are located at /etc/data:

docker run -p 8182:8182 -v /etc/data:/opt/aerospike-graph/data aerospike/aerospike-graph-service

Run the following command in the Gremlin console to start the bulk data loading job:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000)
.call("aerospike.graphloader.admin.bulk-load.load")
.with("aerospike.graphloader.vertices", "/opt/aerospike-graph/data/vertices")
.with("aerospike.graphloader.edges", "/opt/aerospike-graph/data/edges")
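
If you submit the load command programmatically rather than typing it into the console, it can help to assemble it from its parts. The following helper is hypothetical, not part of the AGS API; the option keys and the 24-hour `evaluationTimeout` value are taken from the command above.

```python
# Hypothetical helper that assembles the Gremlin bulk-load command shown
# above as a single string, e.g. for submission via a Gremlin client.
def build_bulk_load_command(vertices_path, edges_path,
                            timeout_ms=24 * 60 * 60 * 1000,
                            extra_options=None):
    steps = [
        'g.with("evaluationTimeout", %d)' % timeout_ms,
        '.call("aerospike.graphloader.admin.bulk-load.load")',
        '.with("aerospike.graphloader.vertices", "%s")' % vertices_path,
        '.with("aerospike.graphloader.edges", "%s")' % edges_path,
    ]
    # Additional .with() options (e.g. cloud credentials) can be appended.
    for key, value in (extra_options or {}).items():
        steps.append('.with("%s", "%s")' % (key, value))
    return "".join(steps)

cmd = build_bulk_load_command("/opt/aerospike-graph/data/vertices",
                              "/opt/aerospike-graph/data/edges")
```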

Bulk loading with remote files

The bulk loader supports remote data files stored in Google Cloud Storage (GCS) buckets on GCP and Amazon S3 buckets on AWS. You can specify the remote file locations, and the credentials necessary to reach them, in the call step.

In the following example, the source data files are in S3 buckets:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000)
.call("aerospike.graphloader.admin.bulk-load.load")
.with("aerospike.graphloader.vertices", "s3://myBucket/vertices")
.with("aerospike.graphloader.edges", "s3://myBucket/edges")
.with("aerospike.graphloader.remote-user", "AWS_ACCESS_KEY_ID")
.with("aerospike.graphloader.remote-passkey", "AWS_SECRET_ACCESS_KEY")

Cloud storage configuration options

The bulk loader supports cloud-based source data storage locations. If your edge and vertex CSV files are stored in AWS S3 or Google Cloud Storage buckets, the following configuration options are relevant. These options may be optional or required, depending on the remote environment. Check your cloud service documentation for details.

info

The cloud-based configuration options must be included as part of the bulk loader execution command. They cannot be part of the AGS properties file.

  • aerospike.graphloader.remote-user: remote storage user credential (for AWS, your AWS_ACCESS_KEY_ID value).
  • aerospike.graphloader.remote-passkey: remote storage secret credential (for AWS, your AWS_SECRET_ACCESS_KEY value).
  • aerospike.graphloader.gcs-email: email address of the GCS service account used for authentication (GCS only).

Additional cloud considerations

When using AWS for source data files, specify the following options as part of the bulk loader command:

  • aerospike.graphloader.remote-user: your AWS_ACCESS_KEY_ID value.
  • aerospike.graphloader.remote-passkey: your AWS_SECRET_ACCESS_KEY value.

The AWS options are required in the Gremlin call step unless the Graph Docker environment is preconfigured with AWS credentials. The GCS options are not applicable.
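
Because the credential options can be omitted when the container is preconfigured with AWS credentials, a submission script might add them only when they are available. A sketch, assuming the credentials live in the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables; the helper itself is hypothetical:

```python
import os

# Build the .with() option steps for an S3-based bulk load, appending the
# AWS credential options only when both are present in the environment.
# (When the Graph Docker environment is preconfigured with AWS credentials,
# the call step does not need them.)
def cloud_option_steps(vertices_uri, edges_uri, env=os.environ):
    steps = [
        '.with("aerospike.graphloader.vertices", "%s")' % vertices_uri,
        '.with("aerospike.graphloader.edges", "%s")' % edges_uri,
    ]
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    if access_key and secret_key:
        steps.append('.with("aerospike.graphloader.remote-user", "%s")' % access_key)
        steps.append('.with("aerospike.graphloader.remote-passkey", "%s")' % secret_key)
    return steps

with_creds = cloud_option_steps(
    "s3://myBucket/vertices", "s3://myBucket/edges",
    env={"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_SECRET_ACCESS_KEY": "secretExample"})
without_creds = cloud_option_steps(
    "s3://myBucket/vertices", "s3://myBucket/edges", env={})
```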

Incremental data loading

Incremental data loading allows you to:

  • Add vertices to an existing graph.
  • Add edges to new and existing vertices.
  • Update properties of existing vertices.

To load data incrementally, add the incremental_load option to the call command.

g.with("evaluationTimeout", 24 * 60 * 60 * 1000)
.call("aerospike.graphloader.admin.bulk-load.load")
.with("aerospike.graphloader.vertices", "<path_to_vertices>")
.with("aerospike.graphloader.edges", "<path_to_edges>")
.with("incremental_load", true)
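
Since an incremental load can update properties of existing vertices as well as add new ones, an incremental source file mixes both kinds of rows. A sketch of such a vertices file, with the `~id`/`~label` column convention assumed from the standard bulk-load CSV format; whether a row updates or adds depends on whether its `~id` already exists in the graph:

```python
import csv
import os
import tempfile

# Sketch of an incremental vertices file: a row that reuses the ~id of a
# vertex already in the graph updates that vertex's properties, while a row
# with a previously unseen ~id adds a new vertex.
inc_dir = os.path.join(tempfile.mkdtemp(), "vertices", "person")
os.makedirs(inc_dir)

with open(os.path.join(inc_dir, "update.csv"), "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["~id", "~label", "age"])
    w.writerow(["v1", "person", "35"])   # existing vertex: property update
    w.writerow(["v3", "person", "41"])   # new vertex: added to the graph
```

This directory would then be passed as `<path_to_vertices>` in the incremental call above, with incremental_load set to true.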