Error Handling for Bulk Data Loading
The Aerospike Bulk Loader takes input data from files in CSV format. In some situations the CSV input may be inconsistent or incorrectly formatted, and you can specify how fault-tolerant the import process should be. Possible CSV errors include:
Vertices with duplicate IDs. If the bulk loader finds vertices with duplicate IDs, it writes only one to the database.
Incorrectly typed data. Each CSV file contains a header row that specifies the data type of each column. If the bulk loader finds a data item that does not match its column's specified type, it rejects the row.
Orphaned edges. If the bulk loader finds an edge that refers to a non-existent vertex, it rejects the row.
You can specify an allowable number for each type of error in your configuration file. The following configuration options specify how many errors to allow in a bulk load operation before it aborts.
You must use the -validate_input_data Spark flag when any of the allowable error configuration options are in use in distributed mode.
aerospike.graphloader.allowed-duplicate-vertex-id-count
aerospike.graphloader.allowed-bad-entry-count
aerospike.graphloader.allowed-bad-edges-count
See the options reference for full descriptions of the error-related configuration options.
The three allowable-error options accept positive integers as values and default to unlimited allowable errors. Set a reasonable limit for each error type before starting a bulk loader job; leaving the defaults in place may lead to unusable data sets due to a high number of missing rows. Bulk loader performance is also impacted when a high percentage of input data is rejected.
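For example, a bulk loader configuration file might cap each error type as shown in the following sketch. This assumes the Java-properties style used for aerospike.graphloader.* options; the threshold values are illustrative only, so choose limits that fit your data quality expectations.
# Illustrative thresholds only; tune these for your dataset.
# Abort the bulk load after more than 10 vertices with duplicate IDs.
aerospike.graphloader.allowed-duplicate-vertex-id-count=10
# Abort after more than 100 rows with incorrectly typed data.
aerospike.graphloader.allowed-bad-entry-count=100
# Abort after more than 50 edges that refer to non-existent vertices.
aerospike.graphloader.allowed-bad-edges-count=50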
Call API error reporting
If you use the bulk loader via the call API, the command returns a string with the number of each type of error, if any.
String result = (String) g.call("aerospike.graphloader.admin.bulk-load.load").with(...).next();
// If there were no errors, the result string reads:
// Success
// If there were errors, the result string reads:
// Warning: Errors were encountered during bulk loading.
// duplicate-vertex-id-count: 1
// bad-edge-count: 3
// bad-entry-count: 2
// Use the g.call("aerospike.graphloader.admin.bulk-load.errors") command for details.
When the bulk load operation completes, use the following call API command to get a report of how many errors occurred.
g.call("aerospike.graphloader.admin.bulk-load.error-count")
This command returns a map with the following keys:
duplicate-vertex-id-count: The number of vertices that have the same ID in the input dataset.
bad-edge-count: The number of edges that refer to missing vertex IDs.
bad-entry-count: The number of vertices and edges that have incorrectly typed data, for example strings in a column that the header row specifies as integers.
The value of each key is an integer specifying the number of errors found.
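The following sketch shows one way to read the returned map from Java. It assumes g is a GraphTraversalSource already connected to Aerospike Graph; the cast and the Number handling are illustrative, since the counts may arrive as Integer or Long.
import java.util.Map;

// Assumes `g` is a GraphTraversalSource connected to Aerospike Graph.
Map<String, Object> errorCounts =
        (Map<String, Object>) g.call("aerospike.graphloader.admin.bulk-load.error-count").next();

// Each value is the number of errors of that type; treat it as a Number to be safe.
long duplicateIds = ((Number) errorCounts.get("duplicate-vertex-id-count")).longValue();
long badEdges     = ((Number) errorCounts.get("bad-edge-count")).longValue();
long badEntries   = ((Number) errorCounts.get("bad-entry-count")).longValue();

System.out.printf("duplicate IDs: %d, bad edges: %d, bad entries: %d%n",
        duplicateIds, badEdges, badEntries);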
You can also retrieve error details by error type with the following command:
g.call("get-bulk-load-errors").with("type", "<TYPE>")
Replace <TYPE> with one of:
duplicate-vertex-ids
bad-edges
bad-entry
Each error type returns a map with key-value pairs containing the following information:
duplicate-vertex-ids
id: the duplicated vertex ID.
count: the number of times the ID is duplicated.
bad-edges
bad-vertex-id: the non-existent vertex ID referenced in the edge dataset.
count: the number of edges attached to the non-existent vertex ID.
bad-entry
row: string representation of the CSV row that did not match the specified header type. The value shown includes additional columns marked as null compared to the row in the CSV file.
file: the CSV file that contained the erroneous row.
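As an example, the following sketch fetches the details for bad edges. It assumes g is a GraphTraversalSource connected to Aerospike Graph and simply prints the raw result, since the exact wrapping of the returned map may vary by error type.
// Assumes `g` is a GraphTraversalSource connected to Aerospike Graph.
// Fetch details for edges that reference non-existent vertex IDs.
Object badEdgeDetails = g.call("get-bulk-load-errors")
        .with("type", "bad-edges")
        .next();

// For bad-edges, the report contains bad-vertex-id and count entries;
// print it (or cast it to a Map) to inspect the reported values.
System.out.println(badEdgeDetails);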