Incremental data loading

In addition to creating new graphs with the bulk loader, you can incrementally load data into an existing graph. Incremental data loading supports the following operations:

Add new vertices
Introduce new vertices into an existing graph.
Add disconnected data
Load new vertices and edges that are not connected to any existing elements in the graph.
Add connected data
Load new vertices and edges that establish connections with existing vertices in the graph.
- The CSV files must include references to the vertex IDs of existing elements to establish the connections correctly.
Add edges to new or existing vertices
Introduce new edges that connect any combination of new and/or pre-existing vertices.
- Edge definitions in the CSV must include source and target vertex IDs, including those already present in the graph.
Update properties of existing vertices
Modify the properties of vertices that already exist in the graph.
- You can append new properties to a vertex, such as a status or last_seen property not previously present. Include the new property in the CSV file; existing properties are preserved.
- You can overwrite an existing property by specifying the same property key with a new value in the CSV file. The value in the incremental load replaces the current one.
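As a rough illustration of the property-update case, the following Python sketch renders a small vertex CSV that adds status and last_seen values to existing vertices. The header layout (a `~id` column, a `~label` column, and `name:Type` property columns), the `String` type names, the vertex IDs, and the helper name `build_vertex_update_csv` are all assumptions made for this example; check them against the bulk loader's CSV format reference before relying on them.

```python
import csv
import io


def build_vertex_update_csv(rows, property_columns):
    """Render vertex rows as CSV text.

    Assumed header layout: ~id, ~label, then typed property columns
    such as "status:String". This layout is an assumption, not a
    confirmed specification of the bulk loader's format.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~label"] + list(property_columns))
    for row in rows:
        # Property columns are "name:Type"; look values up by the bare name.
        values = [row["props"][col.split(":")[0]] for col in property_columns]
        writer.writerow([row["id"], row["label"]] + values)
    return buf.getvalue()


# Hypothetical vertices that already exist in the graph under these IDs.
updates = [
    {"id": "v1", "label": "device",
     "props": {"status": "active", "last_seen": "2024-01-01"}},
    {"id": "v2", "label": "device",
     "props": {"status": "retired", "last_seen": "2023-12-15"}},
]

csv_text = build_vertex_update_csv(updates, ["status:String", "last_seen:String"])
print(csv_text)
```

Because the rows reuse the IDs of existing vertices, an incremental load of a file like this would update those vertices rather than create new ones.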

Vertex insertion

To insert vertices from an incremental dataset, the bulk loader uses the TinkerPop mergeV step.

The bulk loader merges vertices based on the ~id field.
If a vertex with the same ~id value is specified multiple times in the incremental dataset, the final vertex contains all the data of the combined rows. If the incremental dataset contains rows with duplicate ~id fields which have different values for the same properties, the final assigned property value is non-deterministic and may be any of the assigned values.
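The merge behavior described above can be modeled with a short Python sketch. This is not the bulk loader's implementation; it only imitates the documented outcome (rows with the same ~id are combined) and reports property keys whose final value would be non-deterministic. The function name `merge_rows_by_id` and the sample rows are hypothetical.

```python
from collections import defaultdict


def merge_rows_by_id(rows):
    """Combine rows sharing a ~id and collect conflicting property values.

    Returns (merged, conflicts). In `merged`, a conflicting key holds the
    last value seen, which is a choice of this sketch only: the loader
    makes no ordering guarantee for such keys.
    """
    merged = {}
    conflicts = defaultdict(set)
    for row in rows:
        vertex_id = row["~id"]
        target = merged.setdefault(vertex_id, {})
        for key, value in row.items():
            if key == "~id":
                continue
            if key in target and target[key] != value:
                # Same vertex, same key, different value: outcome is undefined.
                conflicts[(vertex_id, key)].update({target[key], value})
            target[key] = value
    return merged, dict(conflicts)


rows = [
    {"~id": "v1", "name": "alice"},
    {"~id": "v1", "status": "active"},      # combines cleanly with the row above
    {"~id": "v2", "status": "active"},
    {"~id": "v2", "status": "inactive"},    # conflicts with the row above
]

merged, conflicts = merge_rows_by_id(rows)
print(merged["v1"])
print(conflicts)
```

Running a check like this on a dataset before loading can flag rows that should be de-duplicated, so that the loaded property values are predictable.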

Edge insertion

Edge insertions in an incremental dataset behave the same as in a fresh data load. No merging occurs: every entry in the edge dataset creates a valid edge between the specified vertices.
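Since edges are not merged, a row that appears more than once in the edge dataset would presumably produce more than one edge. The following Python sketch counts repeated edge rows before a load. The column names `~from`, `~to`, and `~label`, the helper name `find_duplicate_edges`, and the sample rows are assumptions for illustration.

```python
from collections import Counter


def find_duplicate_edges(edge_rows):
    """Return {(from, to, label): count} for edge rows seen more than once.

    Column names (~from, ~to, ~label) are assumed, not confirmed.
    """
    counts = Counter(
        (row["~from"], row["~to"], row["~label"]) for row in edge_rows
    )
    return {edge: n for edge, n in counts.items() if n > 1}


edge_rows = [
    {"~from": "v1", "~to": "v2", "~label": "knows"},
    {"~from": "v1", "~to": "v2", "~label": "knows"},   # repeated row
    {"~from": "v2", "~to": "v3", "~label": "knows"},
]

duplicates = find_duplicate_edges(edge_rows)
print(duplicates)
```

Whether repeated edges are an error depends on the data model; some graphs intentionally hold parallel edges between the same pair of vertices.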

To load data incrementally in standalone mode, add the incremental_load flag to the call command, as demonstrated in the following example:

g.with("evaluationTimeout", 20000)
 .call("aerospike.graphloader.admin.bulk-load.load")
 .with("aerospike.graphloader.vertices", "<path_to_vertices>")
 .with("aerospike.graphloader.edges", "<path_to_edges>")
 .with("incremental_load", true)
 .next()

To load data incrementally in distributed mode, add the -incremental_load flag to the Spark submit command for your cloud service. The following example uses Google Cloud Dataproc:

gcloud dataproc jobs submit spark \
    --class=com.aerospike.firefly.bulkloader.SparkBulkLoader \
    --jars="gs://<bucket-name>/aerospike-graph-bulk-loader-x.y.z.jar" \
    --cluster="testcluster" \
    --region="us-central1" \
    -- -c "gs://<bucket-name>/bulk-loader.properties" -incremental_load

Spark job flags

The following flags are all optional.

Argument	Description
`-incremental_load`	Add new data to an existing graph.
`-validate_input_data`	Validate the format and data of all vertex and edge CSV files before writing to the Aerospike database.
`-verify_output_data`	Verify a percentage of loaded elements, specified by `aerospike.graphloader.sampling-percentage`, by reading them back after loading. The verification process uses a traversal query.
`-resume`	Resume a previously failed job.
`-clear_existing_data`	Delete all existing data before beginning the new job.
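When jobs are launched from scripts, it can help to assemble the command from the flags listed above and reject anything else. The following Python sketch builds the argument list for the Dataproc example shown earlier. The helper name `build_dataproc_command` is hypothetical, and the placeholder URIs, cluster name, and region are copied from that example rather than being real values.

```python
import shlex

# Flags taken from the table above; all are optional.
ALLOWED_FLAGS = {
    "-incremental_load",
    "-validate_input_data",
    "-verify_output_data",
    "-resume",
    "-clear_existing_data",
}


def build_dataproc_command(jar_uri, properties_uri, cluster, region, flags=()):
    """Return the gcloud argument list for a bulk loader Spark job."""
    unknown = [flag for flag in flags if flag not in ALLOWED_FLAGS]
    if unknown:
        raise ValueError(f"unsupported bulk loader flags: {unknown}")
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        "--class=com.aerospike.firefly.bulkloader.SparkBulkLoader",
        f"--jars={jar_uri}",
        f"--cluster={cluster}",
        f"--region={region}",
        "--",                      # everything after this goes to the loader
        "-c", properties_uri,
        *flags,
    ]


cmd = build_dataproc_command(
    jar_uri="gs://<bucket-name>/aerospike-graph-bulk-loader-x.y.z.jar",
    properties_uri="gs://<bucket-name>/bulk-loader.properties",
    cluster="testcluster",
    region="us-central1",
    flags=["-incremental_load", "-validate_input_data"],
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with the placeholder values.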