Bulk data loading for standalone processing
Overview

This page describes how to load graph data into an Aerospike database with the Aerospike Graph Service (AGS) bulk data loader and the Gremlin call API. This method is for standalone processing of small data sets.

Data processing takes place on your AGS machine, so this option is appropriate for smaller data sets or for testing. For larger data sets, we recommend using the distributed mode.
Bulk loading with the Gremlin call step

Requirements
- A running AGS instance. See Installation for help with getting an AGS instance up and running.
- A running Aerospike Database, version 6.2.0.7 or later.
- Data files for edges and vertices in the Gremlin CSV format.
Source data files
The bulk loader accepts data files in the Gremlin CSV format, with vertices and edges specified in separate files. All CSV files should have header information with names for each column of data.
AGS does not support user-provided ~id values for edges, so the ~id column is optional for edge CSV files. If your CSV file contains an ~id column, the values are ignored.
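For illustration, a minimal vertex file and a minimal edge file in the Gremlin CSV format might look like the following. The labels, property names, and type suffixes here are hypothetical examples, not required values.

A vertex file, such as people/vert_file1.csv:

```csv
~id,~label,name:String,age:Int
v1,people,Alice,34
v2,people,Bob,29
```

An edge file, such as knows/edge_file1.csv (note that no ~id column is needed):

```csv
~from,~to,~label,since:Int
v1,v2,knows,2015
```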
CSV data files can either be local or in cloud-based storage. Cloud-based data files can be stored either in Amazon AWS S3 or Google Cloud Storage.
Data files must be stored in directories specified by the aerospike.graphloader.vertices and aerospike.graphloader.edges configuration options.
- The directory specified in aerospike.graphloader.vertices must contain one or more subdirectories of vertex CSV files.
- The directory specified in aerospike.graphloader.edges must contain one or more subdirectories of edge CSV files.
- Each subdirectory should be named for the label of the data files it contains.

For example, a subdirectory of vertex files containing data about people should be named people. A subdirectory of edge files containing data about connections in the people vertices, in which each row has the knows label, should be named knows.
For example, if your S3 bucket is named myBucket, that bucket must contain separate directories for edge and vertex data files, and those directories must contain subdirectories for the CSV files.

If the aerospike.graphloader.vertices configuration option is set to s3://myBucket/vertices, you might have subdirectories named s3://myBucket/vertices/people and s3://myBucket/vertices/places, each containing one or more CSV files.
Example directory structure:
/myBucket
|
---- /myBucket/vertices/
|
-------- /myBucket/vertices/people/
|
------------ /myBucket/vertices/people/vert_file1.csv
------------ /myBucket/vertices/people/vert_file2.csv
|
-------- /myBucket/vertices/places/
|
------------ /myBucket/vertices/places/vert_file3.csv
------------ /myBucket/vertices/places/vert_file4.csv
|
---- /myBucket/edges/
|
-------- /myBucket/edges/worksWith/
|
------------ /myBucket/edges/worksWith/edge_file1.csv
------------ /myBucket/edges/worksWith/edge_file2.csv
|
-------- /myBucket/edges/knows/
|
------------ /myBucket/edges/knows/edge_file3.csv
------------ /myBucket/edges/knows/edge_file4.csv
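For local testing, a staging area with the same shape can be created with standard shell commands. The directory names below simply mirror the illustrative layout above:

```shell
# Create a local directory tree matching the layout above:
# one subdirectory per vertex label and per edge label.
mkdir -p myBucket/vertices/people myBucket/vertices/places
mkdir -p myBucket/edges/worksWith myBucket/edges/knows
```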
When using cloud-based source data files, be sure to include your cloud
service credentials with the call
function. The required parameters for
cloud service credentials are listed in the Cloud storage configuration
options section.
Bulk loader configuration options

You can specify configuration options either as part of the Gremlin call step or in your aerospike-graph.properties file.
For a complete list of bulk loader configuration options,
see the options reference.
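As a sketch, the data-location options might appear in aerospike-graph.properties as follows. The paths are illustrative; note that the cloud credential options described below cannot go in this file:

```properties
aerospike.graphloader.vertices=/opt/aerospike/etc/sampledata/vertices
aerospike.graphloader.edges=/opt/aerospike/etc/sampledata/edges
```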
Cloud storage configuration options
The bulk loader supports cloud-based source data storage locations. If your edge and vertex CSV files are stored in AWS S3 or Google Cloud Storage buckets, the following configuration options are relevant. These options may be optional or required, depending on the remote environment. Check your cloud service documentation for details.
The cloud-based configuration options must be included as part of the bulk loader execution command. They cannot be part of the AGS properties file.
The relevant options are:
- aerospike.graphloader.remote-user
- aerospike.graphloader.remote-passkey
- aerospike.graphloader.gcs-email
Additional cloud considerations
- AWS
- GCS
When using AWS for source data files, specify the following options as part of the bulk loader command:
- aerospike.graphloader.remote-user: your AWS AWS_ACCESS_KEY_ID value.
- aerospike.graphloader.remote-passkey: your AWS AWS_SECRET_ACCESS_KEY value.

The AWS options are required for the Gremlin call step unless the Graph Docker environment is preconfigured with AWS credentials. The GCS options are not applicable.
When using Google Cloud Storage for source data files, you must configure a GCS Service Account. When running the bulk loader, specify the following options:
- aerospike.graphloader.remote-user: your GCS private_key_id value.
- aerospike.graphloader.remote-passkey: your GCS private_key value.
- aerospike.graphloader.gcs-email: your GCS client_email value.

These values can be found in the generated JSON key file for the GCS Service Account.
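As a rough sketch, the relevant fields of a Service Account key file (the JSON file downloaded when you create the key) map to the options as follows; the values shown are placeholders:

```json
{
  "private_key_id": "<maps to aerospike.graphloader.remote-user>",
  "private_key": "<maps to aerospike.graphloader.remote-passkey>",
  "client_email": "<maps to aerospike.graphloader.gcs-email>"
}
```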
Bulk loading with local files

You can bulk load local files with the Gremlin call step by specifying their location in the Gremlin command. Use the aerospike.graphloader.vertices and aerospike.graphloader.edges options to specify file directory locations.

Local files must be accessible to the AGS Docker image. Specify local file locations in your Docker run command.
The call
API runs the bulk loader on a single AGS instance.
AGS runs in Docker, so any local file paths that you pass to call
must be accessible to the Docker image. If you are using local
source data files, you must mount these in the Docker image to
make them accessible to the bulk loader. More information about mounting directories
is available in the Docker volumes documentation.
In this example, we have the following directories:
/home/graph-user/graph/data/docker-bulk-load/sampledata/vertices/
/home/graph-user/graph/data/docker-bulk-load/sampledata/edges/
When we mount /home/graph-user/graph/data/docker-bulk-load/ to /opt/aerospike/etc/, the container sees all the subdirectories below /opt/aerospike/etc/, including sampledata/*. That is reflected in the paths specified in the following call step.
docker run -p 8182:8182 \
  -v /home/graph-user/graph/data/docker-bulk-load/:/opt/aerospike/etc/ \
  aerospike/aerospike-graph-service
If you are using cloud storage for your data source files, it's not necessary
to specify their location in the Docker run
command.
When using the -v
flag, the path on the left side of the :
character is your
local path, and the path on the right side of the :
character is the path
within your AGS Docker image.
To specify your file locations in the Gremlin command, use the with step:

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).
  call("aerospike.graphloader.admin.bulk-load.load").
  with("aerospike.graphloader.vertices", "/opt/aerospike/etc/sampledata/vertices").
  with("aerospike.graphloader.edges", "/opt/aerospike/etc/sampledata/edges")
Bulk loading with remote files

The bulk loader supports remote data files stored in Google Storage Buckets on GCP and S3 buckets on AWS. You can specify remote file locations and the credentials necessary to reach them in the call step. Use the aerospike.graphloader.vertices and aerospike.graphloader.edges options to specify file directory locations, and use any necessary credential options to authenticate with your cloud provider.
The following example Gremlin command uses the bulk loader to add source data files stored in an AWS S3 bucket to an AGS database:
g.with("evaluationTimeout", 24 * 60 * 60 * 1000).
  call("aerospike.graphloader.admin.bulk-load.load").
  with("aerospike.graphloader.vertices", "s3://myBucket/vertices").
  with("aerospike.graphloader.edges", "s3://myBucket/edges").
  with("aerospike.graphloader.remote-user", "AWS_ACCESS_KEY_ID").
  with("aerospike.graphloader.remote-passkey", "AWS_SECRET_ACCESS_KEY")
The evaluationTimeout parameter

All bulk loader operations must include the evaluationTimeout parameter. This parameter prevents the bulk loading operation from timing out when running for extended periods. In the following examples, the timeout is set to 24 hours in milliseconds (86,400,000 ms). You can adjust it as necessary.
- Java
- Python
- Groovy (Gremlin Console)
.with("evaluationTimeout", 24L * 60L * 60L * 1000L)
.with_('evaluationTimeout', 24 * 60 * 60 * 1000)
.with("evaluationTimeout", 24L * 60L * 60L * 1000L)
Certain Gremlin language variants may expect the numeric value to be of type
Long
, as shown in the Java example.
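The 24-hour value used in these examples is expressed in milliseconds; the arithmetic can be checked in a shell:

```shell
# 24 hours expressed in milliseconds
echo $((24 * 60 * 60 * 1000))   # prints 86400000
```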
Incremental data loading
Incremental data loading allows you to:
- Add vertices to an existing graph.
- Add edges to new and existing vertices.
- Update properties of existing vertices.
To load data incrementally, add the incremental_load flag to the call command.

g.with("evaluationTimeout", 24 * 60 * 60 * 1000).
  call("aerospike.graphloader.admin.bulk-load.load").
  with("aerospike.graphloader.vertices", "<path_to_vertices>").
  with("aerospike.graphloader.edges", "<path_to_edges>").
  with("incremental_load", true)