Bulk data loading for standalone processing
Overview
This page describes how to load graph data into an Aerospike database with the Aerospike Graph Service (AGS) bulk data loader and the Gremlin call API.
Data processing takes place on your AGS machine, so this standalone method is appropriate for small data sets or for testing. For larger data sets, we recommend using the distributed mode.
Bulk loading with the Gremlin call step
Requirements
- A running AGS instance. See Installation for help with getting an AGS instance up and running.
- A running Aerospike Database, version 6.2.0.7 or later.
- Data files for edges and vertices in CSV format, in the required directory structure.
Bulk loading with local source files
Your source data files must be accessible by your AGS instance. When you start the AGS Docker container, bind your local source files with the Docker -v option.
In the following example command, the local source files are located at /etc/data:
docker run -p 8182:8182 -v /etc/data:/opt/aerospike-graph/data aerospike/aerospike-graph-service
Run the following command in the Gremlin console to start the bulk data loading job. The evaluationTimeout value is in milliseconds, so the example allows the job up to 24 hours:
g.with("evaluationTimeout", 24 * 60 * 60 * 1000)
.call("aerospike.graphloader.admin.bulk-load.load")
.with("aerospike.graphloader.vertices", "/opt/aerospike-graph/data/vertices")
.with("aerospike.graphloader.edges", "/opt/aerospike-graph/data/edges")
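The paths in the call step are paths inside the AGS container, not paths on the host. The following Python sketch illustrates how a host path maps to a container path under the -v binding from the example above. The helper name `container_path` is hypothetical and not part of AGS, and the sketch assumes the host data sits in a `vertices` directory under /etc/data.

```python
from pathlib import PurePosixPath


def container_path(host_path, host_root, container_root):
    """Translate a host path to the path seen inside the container.

    Hypothetical helper for illustration; it mirrors a Docker bind mount
    of host_root onto container_root (docker run -v host_root:container_root).
    Raises ValueError if host_path is not located under host_root.
    """
    relative = PurePosixPath(host_path).relative_to(host_root)
    return str(PurePosixPath(container_root) / relative)


# Binding from the example command: -v /etc/data:/opt/aerospike-graph/data
print(container_path("/etc/data/vertices", "/etc/data", "/opt/aerospike-graph/data"))
# prints /opt/aerospike-graph/data/vertices
```

A file outside the bound directory has no container path, which is why the helper raises an error in that case rather than guessing.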
Bulk loading with remote files
The bulk loader supports remote data files stored in Google Cloud Storage (GCS) buckets on GCP and S3 buckets on AWS. You can specify remote file locations and the credentials necessary to reach them in the call step.
In the following example, the source data files are in S3 buckets:
g.with("evaluationTimeout", 24 * 60 * 60 * 1000)
.call("aerospike.graphloader.admin.bulk-load.load")
.with("aerospike.graphloader.vertices", "s3://myBucket/vertices")
.with("aerospike.graphloader.edges", "s3://myBucket/edges")
.with("aerospike.graphloader.remote-user", "AWS_ACCESS_KEY_ID")
.with("aerospike.graphloader.remote-passkey", "AWS_SECRET_ACCESS_KEY")
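To avoid pasting credentials by hand, the text of the call step can be assembled from environment variables. The following Python sketch is illustrative only: `build_s3_bulk_load_query` and `groovy_literal` are hypothetical helpers, not part of AGS, and the sketch assumes the credentials are available in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. It only produces the Gremlin text shown above; you still run that text in the Gremlin console.

```python
import json
import os


def groovy_literal(value):
    """Quote a string for use inside a Gremlin (Groovy) double-quoted literal.

    json.dumps handles quotes, backslashes, and newlines; the dollar sign is
    escaped separately because Groovy would treat it as interpolation.
    """
    return json.dumps(value).replace("$", "\\$")


def build_s3_bulk_load_query(vertices_uri, edges_uri, env=None):
    """Return the bulk-load call as text, reading AWS credentials from env.

    Hypothetical helper for illustration; env defaults to os.environ.
    Raises KeyError if either credential variable is missing.
    """
    env = os.environ if env is None else env
    options = [
        ("aerospike.graphloader.vertices", vertices_uri),
        ("aerospike.graphloader.edges", edges_uri),
        ("aerospike.graphloader.remote-user", env["AWS_ACCESS_KEY_ID"]),
        ("aerospike.graphloader.remote-passkey", env["AWS_SECRET_ACCESS_KEY"]),
    ]
    lines = [
        'g.with("evaluationTimeout", 24 * 60 * 60 * 1000)',
        '.call("aerospike.graphloader.admin.bulk-load.load")',
    ]
    for name, value in options:
        lines.append(f".with({groovy_literal(name)}, {groovy_literal(value)})")
    return "\n".join(lines)
```

For example, `build_s3_bulk_load_query("s3://myBucket/vertices", "s3://myBucket/edges")` returns the same six-line command as the example above, with the two credential placeholders replaced by the values from the environment.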
Cloud storage configuration options
The bulk loader supports cloud-based source data storage locations. If your edge and vertex CSV files are stored in AWS S3 or Google Cloud Storage buckets, the following configuration options are relevant. Whether each option is required or optional depends on the remote environment. Check your cloud service documentation for details.
The cloud-based configuration options must be included as part of the bulk loader execution command. They cannot be part of the AGS properties file.
Name |
---|
aerospike.graphloader.remote-user |
aerospike.graphloader.remote-passkey |
aerospike.graphloader.gcs-email |
Additional cloud considerations
AWS
When using AWS for source data files, specify the following options as part of the bulk loader command:
- aerospike.graphloader.remote-user: your AWS_ACCESS_KEY_ID value.
- aerospike.graphloader.remote-passkey: your AWS_SECRET_ACCESS_KEY value.
The AWS options are required for the Gremlin call step unless the Graph Docker environment is preconfigured with AWS credentials. The GCS options are not applicable.
GCS
When using Google Cloud Storage for source data files, you must configure a GCS Service Account. When running the bulk loader, specify the following options:
- aerospike.graphloader.remote-user: your GCS private_key_id value.
- aerospike.graphloader.remote-passkey: your GCS private_key value.
- aerospike.graphloader.gcs-email: your GCS client_email value.
These values can be found in the JSON key file generated for the GCS Service Account.
Example JSON key file:
{
"type": "service_account",
"project_id": "account",
"private_key_id": "bede4fccf89903a2bf1ca5be3ace1f5df4cc96d8",
"private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQDBx81Dpdg58C0v\nXYxIeXFwahVzmv11OmgILLXfqwfBXCk+VqWZXvouNWT6LJj3hXuB3rg5QeYvqU3c\nBJFow5oS/YogkTBjLAGtthbDE+y1m8dJx1cMNNSHnItvLLZvjizlKlvuyRDvQT20\nN9ZKQkPNMu8YFCDcS8F9dcqc23IoAlCd01oigp5SvaZC9GoH5L6GBixzUxpkwhFl\nCWq6Sn7c3RPpwWB1w9opQX0Rf+0QtS5c5xmnlb6KkAAaTaWZTac0mTdjdc/3YDxI\ngcFR+3e23lyLMD0t4YWb1IoeAb9jnrSd+zV18vrMcvSJEmA3529e3h69E3q39XI4\nE2fR2t+BAgMBAAECggEABF+1ZR++7e3bIzzOEAu3sQUf91RKAMWg+ABEWahTXDyv\njKHpinjvli/r186+ZCXd6AxGJbq0TqKYaI0s2AvENEYHQl1slWx2nDxmLqCKQP4j\nZSmr8BFYM7hnmEqO0p1Hq6OFYDInIPNcyG7TNilhPOY3qdg4dqh6FVIdkOVOIhOD\nNVTx5g2sGk9z+hjiOM6Zjljl5EANR1U2WNGe8c3T4eau8SnJi/9tsFxITe/uAj8n\n9LRPSsL2fD7U3Kug5O8j7qhFmsGMVXDKs4LOqIGbdcDpeL10RtGr00jGJlgpkNjc\n7r3KuKdNjdyqfl8d1VLHgcBqxvjxO4Cfc+iTtq7wfQKBgQDysscmOBqhQnsVfp7C\nVyLSRz8dKElJkmYHYZfAeWV62ZAYcpovJghin/vPiSJXHITynwnbCbi6YNbmhn9W\nScJaTRODZy4fCbyKbeG59ZBIBu0/5Ou+RTqL5f5k8UzFSm0CAaP+EmdoiPSOUzkH\nvzARDqzCnsKQ/nqEBfx3TzTTvQKBgQDMZqxJjGwHRvACVE1K+8RTfC3F7tmdAo75\ncAB7JpXlCeitcQ5omShVQ3J410iysdWZ3hgV+nMwsKfRimAwUm/Zo0H9tFjO/+An\np+q36khSBUhs4UtJFTlyQTg9H4AURa7b0whDjkl4MIFo19CSEWErmeL1VumEXSJU\nxZ5hsLMVFQKBgBHrf2bkB5tWlE3+/mvtESYjmpZljhu/kocC/rh4fjS28bvMYnQO\nw9m8ZFRrlLyH340mjwy8SAaC9fspfSd65L3UKRevu6kRB/nUqTEY36Fh2Yy5M2rm\nI6+GuOTtKDT9DNV0F46//yCp1BzaKkDXLg5kXf80x7r6/0LWSlDo6UalAoGBAIdj\n0ub8vmmrkTrZwEDUt1xdOqyK41Xe5flPOOJZ0pvdjmOkKVkbad3gSSjF4P+MT+IV\nfHrCZB5yRRbEw6X+VNwiCYoVNWYXktBxp0WfR7wch7anHIkSJ/UIQkoqXVoQNhyh\nki29R+j2qCFcImk+XdDVo8HCifcFAcKJC7nFozlpAoGBAJwBWfaS7UcC+kCfO2DO\nnNXaoA8kdjxPbGnj6N7bbixjPJJTiYEIjU6pLk0QWx0rnIV3EQmY/nSQ4cPppWgv\nPCqtIpvkrcHMtJ/o4ChcDSvCP8ZS3i2YYVDWSzWC3/ocrTFxrPJfjm5LRlkXEjFM\n63jYaJce0rAigB0JJBNkTuDF\n-----END PRIVATE KEY-----\n",
"client_email": "account@project.iam.gserviceaccount.com",
"client_id": "10034131231241",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/account%40project.iam.gserviceaccount.com",
"universe_domain": "googleapis.com"
}
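The mapping from key file fields to bulk loader options can be sketched in Python as follows. This is a minimal illustration rather than an AGS tool: `gcs_loader_options` is a hypothetical helper name, and the only mapping it relies on is the one listed above (private_key_id, private_key, and client_email).

```python
import json

# Bulk loader option -> field in the service account key file, as listed above.
OPTION_TO_KEY_FIELD = {
    "aerospike.graphloader.remote-user": "private_key_id",
    "aerospike.graphloader.remote-passkey": "private_key",
    "aerospike.graphloader.gcs-email": "client_email",
}


def gcs_loader_options(key_file_text):
    """Return a dict of bulk loader options taken from a JSON key file.

    Hypothetical helper for illustration; key_file_text is the content of
    the key file. Raises KeyError if a needed field is absent.
    """
    key = json.loads(key_file_text)
    missing = [field for field in OPTION_TO_KEY_FIELD.values() if field not in key]
    if missing:
        raise KeyError(f"key file is missing fields: {missing}")
    return {option: key[field] for option, field in OPTION_TO_KEY_FIELD.items()}


# Usage with a shortened stand-in for a real key file.
sample = json.dumps({
    "private_key_id": "example-key-id",
    "private_key": "-----BEGIN PRIVATE KEY-----\nexample\n-----END PRIVATE KEY-----\n",
    "client_email": "account@project.iam.gserviceaccount.com",
})
for option, value in gcs_loader_options(sample).items():
    # Print only the length so the secret values do not end up in logs.
    print(option, "has", len(value), "characters")
```

Reading the values from the key file at run time keeps the private key out of scripts and shell history.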
Incremental data loading
Incremental data loading allows you to:
- Add vertices to an existing graph.
- Add edges to new and existing vertices.
- Update properties of existing vertices.
To load data incrementally, add the incremental_load flag to the call command:
g.with("evaluationTimeout", 24 * 60 * 60 * 1000)
.call("aerospike.graphloader.admin.bulk-load.load")
.with("aerospike.graphloader.vertices", "<path_to_vertices>")
.with("aerospike.graphloader.edges", "<path_to_edges>")
.with("incremental_load", true)