Incremental data loading
In addition to creating new graphs with the bulk loader, you can incrementally load data into an existing graph. Incremental data loading supports the following operations:
- Add new vertices: Introduce new vertices into an existing graph.
- Add disconnected data: Load new vertices and edges that are not connected to any existing elements in the graph.
- Add connected data: Load new vertices and edges that establish connections with existing vertices in the graph.
  - The CSV files must include references to the vertex IDs of existing elements to establish the connections correctly.
- Add edges to new or existing vertices: Introduce new edges that connect any combination of new and/or pre-existing vertices.
  - Edge definitions in the CSV must include source and target vertex IDs, including those already present in the graph.
- Update properties of existing vertices: Modify the properties of vertices that already exist in the graph.
  - You can append new properties to a vertex, such as a `status` or `last_seen` property not previously present.
  - You can also overwrite existing properties by specifying the same property key with a new value in the CSV file.
  - If a property already exists on a vertex, the value in the incremental load replaces the current one.
  - To add a new property to an existing vertex, you can use a CSV file which contains data for the new property. Existing properties are preserved.
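As an illustration of the "add connected data" and "update properties" cases, the following Python sketch builds a small incremental dataset in memory. The column headers (`~id`, `~label`, `~from`, `~to`, and typed property columns such as `status:String`), the vertex IDs, and the labels are assumptions made for this example; check them against the CSV format that your bulk loader version expects.

```python
import csv
import io

def to_csv(header, rows):
    """Render a header and rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical incremental vertex file.
# "p1" is assumed to exist in the graph already: its row adds or overwrites properties.
# "p3" is assumed to be new: its row creates a vertex.
vertices_csv = to_csv(
    ["~id", "~label", "status:String", "last_seen:String"],
    [
        ["p1", "person", "active", "2024-05-01"],
        ["p3", "person", "new", "2024-05-02"],
    ],
)

# Hypothetical incremental edge file: connects the new vertex p3
# to the existing vertex p1 by referencing both vertex IDs.
edges_csv = to_csv(
    ["~from", "~to", "~label"],
    [["p3", "p1", "knows"]],
)

print(vertices_csv)
print(edges_csv)
```

The edge row references `p1` by ID only, which is why the CSV files must carry the IDs of vertices that are already in the graph.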
Vertex insertion
To perform vertex insertions on an incremental dataset, the bulk loader uses the `mergeV` TinkerPop step.
- The bulk loader merges vertices based on the `~id` field.
- If a vertex with the same `~id` value is specified multiple times in the incremental dataset, the final vertex contains all the data of the combined rows. If the incremental dataset contains rows with duplicate `~id` fields which have different values for the same properties, the final assigned property value is non-deterministic and may be any of the assigned values.
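The merge rules above can be modeled with a short Python sketch. This is a simplified model under stated assumptions, not the bulk loader's implementation: the function name `merge_vertex_rows` and the sample data are hypothetical, and the model applies rows in list order, so a conflicting duplicate resolves to the last row. The bulk loader itself gives no such ordering guarantee.

```python
def merge_vertex_rows(existing, incoming_rows):
    """Model merge-by-~id and return a new dict of vertices keyed by ~id.

    existing: {vertex_id: {property: value}}
    incoming_rows: list of dicts, each with a "~id" key plus properties.
    """
    merged = {vid: dict(props) for vid, props in existing.items()}
    for row in incoming_rows:
        props = {k: v for k, v in row.items() if k != "~id"}
        # An unknown ~id creates a vertex. A known ~id keeps its old
        # properties and takes new or overwritten values from the row.
        merged.setdefault(row["~id"], {}).update(props)
    return merged

graph = {"p1": {"name": "Ada", "status": "inactive"}}
rows = [
    {"~id": "p1", "status": "active"},         # overwrites an existing property
    {"~id": "p1", "last_seen": "2024-05-01"},  # appends a new property
    {"~id": "p2", "name": "Grace"},            # creates a new vertex
]
result = merge_vertex_rows(graph, rows)
print(result)
```

In this model, `p1` keeps `name`, takes the new `status`, and gains `last_seen`, while `p2` is created. To avoid relying on an unspecified winner, keep at most one value per property for each `~id` within a single incremental dataset.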
Edge insertion
Edge insertions on an incremental dataset behave the same as a fresh data load. No merging occurs, and all entries in the edge dataset create a valid edge between the specified vertices.
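By contrast with vertices, the edge behavior can be modeled as a plain append. The sketch below is a simplified model, not the loader's code; the function name `append_edges`, the `~from`/`~to`/`~label` keys, and the sample IDs are assumptions for illustration.

```python
def append_edges(existing_edges, incoming_rows):
    """Model edge insertion: every incoming row becomes a new edge, with no merging."""
    new_edges = [(r["~from"], r["~to"], r["~label"]) for r in incoming_rows]
    return existing_edges + new_edges

edges = [("p1", "p2", "knows")]
incoming = [
    # Same endpoints and label as the edge that already exists.
    {"~from": "p1", "~to": "p2", "~label": "knows"},
    {"~from": "p3", "~to": "p1", "~label": "knows"},
]
edges_after = append_edges(edges, incoming)
print(len(edges_after))
```

Under this reading, loading an edge row that matches an existing edge produces a second, parallel edge, so an incremental edge file should contain only edges that are not yet in the graph.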
Usage
To load data incrementally in standalone mode, add the `incremental_load` flag to the `call` command, as demonstrated in the following example:
```groovy
// Standalone bulk load with the incremental_load option enabled.
g.with("evaluationTimeout", 20000).
  call("aerospike.graphloader.admin.bulk-load.load").
  with("aerospike.graphloader.vertices", "<path_to_vertices>").
  with("aerospike.graphloader.edges", "<path_to_edges>").
  with("incremental_load", true).next()
```
To load data incrementally in distributed mode, add the `-incremental_load` flag to the `submit spark` command for your cloud service. The following example uses Google Cloud Dataproc:
```shell
# Submit the bulk loader as a Dataproc Spark job; arguments after "--" are passed to the loader.
gcloud dataproc jobs submit spark \
  --class=com.aerospike.firefly.bulkloader.SparkBulkLoader \
  --jars="gs://my_bucket/aerospike-graph-bulk-loader-x.y.z.jar" \
  --cluster="testcluster" \
  --region="us-central1" \
  -- -c "gs://my_bucket/bulk-loader.properties" -incremental_load
```
Spark job flags
The following flags are all optional.
Argument | Description |
---|---|
`-incremental_load` | Add new data to an existing graph. |
`-validate_input_data` | Validate the format and data of all vertex and edge CSV files before writing to the Aerospike database. |
`-verify_output_data` | Verify a percentage of loaded elements by reading them back after loading. The percentage is set by `aerospike.graphloader.sampling-percentage`. The verification process uses a traversal query. |
`-resume` | Resume a previously failed job. |
`-clear_existing_data` | Delete all existing data before beginning the new job. |
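If you script job submission, a small helper can assemble the command and reject misspelled flags before anything is submitted. The following Python sketch is one possible approach, not part of the bulk loader: the function name `build_submit_command` is hypothetical, and the bucket, cluster, and region values are the placeholders from the example command. It only checks flag names against the table above. Whether particular flags can be combined is not stated here, so the sketch does not check combinations.

```python
import shlex

# Flag names taken from the table above; each is an optional switch with no value.
KNOWN_FLAGS = {
    "incremental_load",
    "validate_input_data",
    "verify_output_data",
    "resume",
    "clear_existing_data",
}

def build_submit_command(jar, cluster, region, properties, flags=()):
    """Return the argv list for a Dataproc Spark job submission."""
    unknown = set(flags) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown flag(s): {sorted(unknown)}")
    argv = [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        "--class=com.aerospike.firefly.bulkloader.SparkBulkLoader",
        f"--jars={jar}",
        f"--cluster={cluster}",
        f"--region={region}",
        "--", "-c", properties,
    ]
    # Loader flags follow the properties file, each with a single leading dash.
    argv += [f"-{name}" for name in flags]
    return argv

cmd = build_submit_command(
    jar="gs://my_bucket/aerospike-graph-bulk-loader-x.y.z.jar",
    cluster="testcluster",
    region="us-central1",
    properties="gs://my_bucket/bulk-loader.properties",
    flags=["incremental_load", "validate_input_data"],
)
print(shlex.join(cmd))
```

Returning an argv list instead of a single string lets you pass the result directly to `subprocess.run` without shell quoting concerns; `shlex.join` is used here only to print a copyable command line.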