
Bulk Data Loading Procedures for Standalone Processing

You can load graph data into an Aerospike database efficiently with the Aerospike Graph bulk data loader and the Gremlin call API. Data processing takes place on your Aerospike Graph Service machine, so this option is appropriate for smaller data sets or for testing. For larger data sets, distributed mode is recommended.

Bulk loading with the Gremlin call step

note

The bulk loader can only load data into an empty database.

Requirements

  • A running Aerospike Graph Service (AGS) instance. See Installation for help with getting an AGS instance up and running.
  • A running Aerospike Database instance, version 6.2.0.7 or higher.
  • Data files for edges and vertices in the Gremlin CSV format.

Source data files

The bulk loader accepts data files in the Gremlin CSV format, with vertices and edges specified in separate files. Every CSV file should begin with a header row that names each column of data.

note

Aerospike Graph does not support user-provided ~id values for edges, so the ~id column is optional for edge CSV files. If your CSV file contains an ~id column, the values are ignored.
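
For reference, a minimal vertex file (stored under a subdirectory named for its label, such as people) and a minimal edge file (omitting the optional ~id column) might look like the following. The property names and types shown here are illustrative:

A vertex CSV file:

~id,~label,name:String,age:Int
p1,people,Alice,30
p2,people,Bob,42

An edge CSV file:

~from,~to,~label,since:Int
p1,p2,knows,2015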

CSV data files can either be local or in cloud-based storage. Cloud-based data files can be stored either in Amazon AWS S3 or Google Cloud Storage.

Data files should be stored in directories specified by the aerospike.graphloader.vertices and aerospike.graphloader.edges configuration options.

  • The directory specified in aerospike.graphloader.vertices should contain one or more subdirectories of vertex CSV files.
  • The directory specified in aerospike.graphloader.edges should contain one or more subdirectories of edge CSV files.
  • Each subdirectory should be named for the label of the data files it contains. For example, a subdirectory of vertex files containing data about people should be named people. A subdirectory of edge files containing data about connections in the people vertices, in which each row has the knows label, should be named knows.

For example, if your S3 bucket is named myBucket, that bucket should contain separate directories for edge and vertex data files, and those directories should contain subdirectories for the CSV files. If the aerospike.graphloader.vertices configuration option is set to s3://myBucket/vertices, you might have subdirectories named s3://myBucket/vertices/people and s3://myBucket/vertices/places, each containing one or more CSV files.

Example directory structure:

/myBucket 
|
---- /myBucket/vertices/
|
-------- /myBucket/vertices/people/
|
------------ /myBucket/vertices/people/vert_file1.csv
------------ /myBucket/vertices/people/vert_file2.csv
|
-------- /myBucket/vertices/places/
|
------------ /myBucket/vertices/places/vert_file3.csv
------------ /myBucket/vertices/places/vert_file4.csv
|
---- /myBucket/edges/
|
-------- /myBucket/edges/worksWith/
|
------------ /myBucket/edges/worksWith/edge_file1.csv
------------ /myBucket/edges/worksWith/edge_file2.csv
|
-------- /myBucket/edges/knows/
|
------------ /myBucket/edges/knows/edge_file3.csv
------------ /myBucket/edges/knows/edge_file4.csv

When using cloud-based source data files, be sure to include your cloud service credentials with the call function. The required parameters for cloud service credentials are listed in the Cloud storage configuration options section.

Bulk loading with local files

You can bulk load local files with the Gremlin call step by specifying their location in the Gremlin command. Use the aerospike.graphloader.vertices and aerospike.graphloader.edges options to specify file directory locations.

note

Local files must be accessible to the AGS Docker image. Specify local file locations in your Docker run command.

The call API runs the bulk loader on a single AGS instance. Aerospike Graph runs in Docker, so any local file paths that you pass to call must be accessible to the Docker image. If you are using local source data files, you must mount these in the Docker image to make them accessible to the bulk loader. More information about mounting directories is available in the Docker volumes documentation. In this example, we have the following directories:

  • /home/graph-user/graph/data/docker-bulk-load/sampledata/vertices/
  • /home/graph-user/graph/data/docker-bulk-load/sampledata/edges/

When we mount /home/graph-user/graph/data/docker-bulk-load/ to /opt/aerospike/etc/, the container sees all the subdirectories below /opt/aerospike/etc/, including sampledata/*. That is reflected in the paths specified in the call step below.

docker run -p 8182:8182 \
-v /home/graph-user/graph/data/docker-bulk-load/:/opt/aerospike/etc/ \
aerospike/aerospike-graph-service
note

If you are using cloud storage for your data source files, it's not necessary to specify their location in the Docker run command.

When using the -v flag, the path on the left side of the : character is your local path, and the path on the right side of the : character is the path within your AGS Docker image.
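
Before running a load, you can optionally confirm that the mounted files are visible inside the container. The container name below is illustrative; use docker ps to find the name or ID of your AGS container:

docker exec my-ags-container ls /opt/aerospike/etc/sampledata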

To specify your file locations in the Gremlin command, use the with step:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).call("bulk-load").with("aerospike.graphloader.vertices", "/opt/aerospike/etc/sampledata/vertices").with("aerospike.graphloader.edges", "/opt/aerospike/etc/sampledata/edges")

Bulk loading with remote files

The bulk loader supports remote data files stored in Google Storage Buckets on GCP and S3 buckets on AWS. You can specify remote file locations and the credentials necessary to reach them in the call step. Use the aerospike.graphloader.vertices and aerospike.graphloader.edges options to specify file directory locations, and use any necessary credential options to authenticate with your cloud provider.

The following example Gremlin command uses the bulk loader to add data from source data files stored in an AWS S3 bucket to an Aerospike Graph database:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).call("bulk-load").with("aerospike.graphloader.vertices", "s3://myBucket/vertices").with("aerospike.graphloader.edges", "s3://myBucket/edges").with("aerospike.graphloader.remote-user", "AWS_ACCESS_KEY_ID").with("aerospike.graphloader.remote-passkey", "AWS_SECRET_ACCESS_KEY")

The evaluationTimeout parameter

All bulk loader operations should include the evaluationTimeout parameter:

.with("evaluationTimeout", 24L * 60L * 60L * 1000L)

This parameter prevents the bulk loading operation from timing out when running for extended periods. In the above example, the timeout is set to 24 hours in milliseconds. You can adjust it as necessary.

note

Certain Gremlin language variants, such as Gremlin-Java, may expect the numeric value to be of type Long, as shown in the example above (24L * 60L * 60L * 1000L).

Configuration options

You can specify configuration options as part of the Gremlin call step.

The following options are available for the bulk loader:

Name                                       | Optional | Default | Description
aerospike.graphloader.edges                | no       | none    | The path to the directory or cloud storage location where CSV files containing edge data are stored. This directory may contain subdirectories with CSV files.
aerospike.graphloader.vertices             | no       | none    | The path to the directory or cloud storage location where CSV files containing vertex data are stored. This directory may contain subdirectories with CSV files.
aerospike.graphloader.sampling-percentage  | yes      | 0       | The percentage of the input data to sample to verify that the bulk loading job was successful.
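
For example, the following command runs the bulk loader and samples 10 percent of the input data to verify the load. The paths reuse the local-file locations from the earlier example, and the sampling value is illustrative:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).call("bulk-load").with("aerospike.graphloader.vertices", "/opt/aerospike/etc/sampledata/vertices").with("aerospike.graphloader.edges", "/opt/aerospike/etc/sampledata/edges").with("aerospike.graphloader.sampling-percentage", 10)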

Cloud storage configuration options

The bulk loader supports cloud-based storage locations for source data. If your edge and vertex CSV files are stored in AWS S3 or Google Cloud Storage buckets, the following configuration options apply.

note

These options may be optional or required, depending on the remote environment. Check your cloud service documentation for details.

The following options supply credentials for cloud storage:

  • aerospike.graphloader.remote-user
  • aerospike.graphloader.remote-passkey
  • aerospike.graphloader.gcs-email
  • aerospike.graphloader.gcs-keyfile

Additional cloud considerations

  • When loading from AWS S3, populate the aerospike.graphloader.remote-user option with your AWS_ACCESS_KEY_ID value, and populate the aerospike.graphloader.remote-passkey option with your AWS_SECRET_ACCESS_KEY value.
  • The AWS options are required for the Gremlin call step unless the Graph Docker environment is preconfigured with AWS credentials.
  • The GCS options (aerospike.graphloader.gcs-email and aerospike.graphloader.gcs-keyfile) are not applicable when loading from AWS S3.
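
For Google Cloud Storage, the same pattern applies using the GCS credential options. The following command is a sketch only: it assumes a bucket addressed with a gs:// URI and a service account whose email and keyfile are supplied through the gcs-email and gcs-keyfile options. The bucket name, service account email, and keyfile path are illustrative, and the keyfile must be accessible inside the AGS container:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).call("bulk-load").with("aerospike.graphloader.vertices", "gs://myBucket/vertices").with("aerospike.graphloader.edges", "gs://myBucket/edges").with("aerospike.graphloader.gcs-email", "loader@my-project.iam.gserviceaccount.com").with("aerospike.graphloader.gcs-keyfile", "/opt/aerospike/etc/gcs-keyfile.json")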