Version: Graph 2.1.0

Error Handling for Bulk Data Loading

The Aerospike Bulk Loader takes input data from files in CSV format. CSV data sometimes contains inconsistent or incorrectly formatted entries, and you can specify how fault-tolerant the import process should be. Possible CSV errors include:

  • Vertices with duplicate IDs. If the bulk loader finds vertices with duplicate IDs, it writes only one to the database.

  • Incorrectly typed data. Each CSV file contains a header row that specifies the data type of each column. If the bulk loader finds a data item that does not match its specified data type, it rejects the row.

  • Orphaned edges. If the bulk loader finds an edge that refers to a non-existent vertex, it rejects the row.
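
To make the three error categories concrete, the following minimal sketch counts each kind of problem in data that has already been read into memory. It is not the bulk loader's implementation: the class `BulkLoadPrecheck`, its method names, and the counting rules (for example, counting every repeat of an ID as one duplicate) are hypothetical and may not match how the loader counts errors.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-check helper; not part of the Aerospike bulk loader.
final class BulkLoadPrecheck {

    /** Counts vertex IDs that repeat an ID already seen earlier in the list. */
    static int countDuplicateVertexIds(List<String> vertexIds) {
        Set<String> seen = new HashSet<>();
        int duplicates = 0;
        for (String id : vertexIds) {
            if (!seen.add(id)) {   // add() returns false if the ID was already present
                duplicates++;
            }
        }
        return duplicates;
    }

    /** Counts edges whose source (index 0) or target (index 1) ID is not a known vertex ID. */
    static int countOrphanedEdges(List<String[]> edges, Set<String> vertexIds) {
        int orphaned = 0;
        for (String[] edge : edges) {
            if (!vertexIds.contains(edge[0]) || !vertexIds.contains(edge[1])) {
                orphaned++;
            }
        }
        return orphaned;
    }

    /** Returns true if the value cannot be parsed as an integer, i.e. it is incorrectly typed for an integer column. */
    static boolean isBadInt(String value) {
        try {
            Integer.parseInt(value.trim());
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }
}
```

Running a check like this on a sample of the input before a large load can help you choose realistic limits for the allowable error options.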

You can specify an allowable number for each type of error in your configuration file. The following configuration options specify how many errors to allow in a bulk load operation before it aborts:

  • aerospike.graphloader.allowed-duplicate-vertex-id-count
  • aerospike.graphloader.allowed-bad-entry-count
  • aerospike.graphloader.allowed-bad-edges-count

Note

You must use the -validate_input_data Spark flag when you use any of the allowable error configuration options in distributed mode.

See the options reference for full descriptions of the error-related configuration options.

The three allowable error options all accept positive integers as values, and each defaults to an unlimited number of allowable errors. Set a reasonable limit for each option before starting a bulk loader job. With the defaults in place, a job can complete with so many rows rejected that the resulting data set is unusable. Bulk loader performance also suffers when a high percentage of the input data is rejected.
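
As an illustration of setting explicit limits instead of relying on the defaults, the sketch below renders the three options as `key=value` lines. The option names come from the list above. The class `AllowedErrorConfig`, the `key=value` layout, and the limit values used in the usage note are assumptions for illustration, so check the options reference for the format your deployment expects.

```java
// Hypothetical helper that renders the three allowable error options as key=value lines.
final class AllowedErrorConfig {

    static String render(int duplicateVertexIds, int badEntries, int badEdges) {
        return String.join("\n",
            "aerospike.graphloader.allowed-duplicate-vertex-id-count=" + duplicateVertexIds,
            "aerospike.graphloader.allowed-bad-entry-count=" + badEntries,
            "aerospike.graphloader.allowed-bad-edges-count=" + badEdges);
    }
}
```

For example, `AllowedErrorConfig.render(10, 100, 50)` produces three lines that permit up to 10 duplicate vertex IDs, 100 incorrectly typed rows, and 50 orphaned edges. These numbers are placeholders, not recommendations.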

Call API error reporting

If you use the bulk loader via the call API, the command returns a string with the number of each type of error, if any.

String result = (String) g.call("aerospike.graphloader.admin.bulk-load.load").with(...).next();

// If there were no errors, the result string reads:
// Success

// If there were errors, the result string reads:
// Warning: Errors were encountered during bulk loading.
// duplicate-vertex-id-count: 1
// bad-edge-count: 3
// bad-entry-count: 2
// Use the g.call("aerospike.graphloader.admin.bulk-load.errors") command for details.
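
If you want to act on the returned string programmatically, one option is to extract the `*-count` lines into a map. The sketch below assumes the result string has exactly the line-oriented layout shown in the comments above. The class `BulkLoadResultParser` is hypothetical and not part of the Aerospike API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the result string; assumes one "name-count: N" entry per line.
final class BulkLoadResultParser {

    static Map<String, Integer> parseErrorCounts(String result) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : result.split("\\R")) {       // \R matches any line break
            int sep = line.lastIndexOf(':');
            if (sep < 0) {
                continue;                               // no "key: value" structure on this line
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 1).trim();
            if (!key.endsWith("-count")) {
                continue;                               // skips the "Warning: ..." header line
            }
            try {
                counts.put(key, Integer.parseInt(value));
            } catch (NumberFormatException ignored) {
                // Not a numeric count; leave it out of the map.
            }
        }
        return counts;
    }
}
```

A result of `Success` yields an empty map, so a caller can treat a non-empty map as a signal to request the detailed error report.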

When the bulk load operation completes, use the following call API command to get a report of how many errors occurred.

g.call("aerospike.graphloader.admin.bulk-load.error-count")

This command returns a map with the following keys:

  • duplicate-vertex-id-count: The number of vertices that have the same ID in the input dataset.

  • bad-edge-count: The number of edges that refer to missing vertex IDs.

  • bad-entry-count: The number of vertices and edges that have incorrectly typed data, such as strings in a column specified for integers in the header row.

The value of each key is an integer specifying the number of errors found.
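
Once you have the map, a simple way to decide whether a load needs attention is to sum the three counts. This is a minimal sketch under the assumption that the map has already been obtained from the call and that its values are numeric. The class `ErrorCountCheck` is hypothetical.

```java
import java.util.Map;

// Hypothetical helper that sums the three error counts described above.
final class ErrorCountCheck {

    private static final String[] KEYS = {
        "duplicate-vertex-id-count",
        "bad-edge-count",
        "bad-entry-count"
    };

    /** Returns the total number of reported errors; keys missing from the map count as zero. */
    static long totalErrors(Map<String, ? extends Number> errorCounts) {
        long total = 0;
        for (String key : KEYS) {
            Number count = errorCounts.get(key);
            if (count != null) {
                total += count.longValue();
            }
        }
        return total;
    }
}
```

A total of zero means no errors of any type were reported. A non-zero total suggests retrieving the per-type details.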

You can also retrieve error details for a specific error type with the following command:

g.call("get-bulk-load-errors").with("type", "<TYPE>")

Replace <TYPE> with one of:

  • duplicate-vertex-ids
  • bad-edges
  • bad-entry

Each error type returns a map with key-value pairs containing the following information:

duplicate-vertex-ids

  • id: duplicated vertex ID.
  • count: number of times duplicated.

bad-edges

  • bad-vertex-id: the non-existent vertex ID in the edge dataset.
  • count: number of edges attached to the non-existent vertex ID.

bad-entry

  • row: string representation of the CSV row that did not match the data types specified in the header. The value shown may include additional columns, marked as null, that do not appear in the row in the CSV file.
  • file: CSV file that contained the erroneous row.