CSV format for source data files
Overview
Source files for bulk data loading into Aerospike Graph Service (AGS) use the comma-separated values (CSV) format described here.
Header rows
Each CSV file has a comma-separated header row. Header rows must contain no spaces between delimited columns.
Vertex data file headers
Header | Required? | Description |
---|---|---|
~id | Yes | Unique ID for the vertex. ~id values may be of data type String or Long. Numeric values must be positive whole numbers. |
~label | No | Label for the vertex. If ~label is not specified, the bulk loader adds a default value of vertex for the ~label field. |
Edge data file headers
Header | Required? | Description |
---|---|---|
~from | Yes | Vertex ID of the from vertex. |
~to | Yes | Vertex ID of the to vertex. |
~label | No | Label for the edge. Each edge can have only one label. If ~label is not specified, the bulk loader adds a default value of edge for the ~label field. |
Property column headers
Specify a header for each data column with the format propertyName:type:cardinality.
- propertyName is the name for the property.
- type is an optional specifier for the data type and defaults to String if not provided.
- cardinality is an optional specifier to indicate that the property contains multiple values.
If no type or cardinality is specified, the value is treated as a single String.
Examples of valid headers:
Header | Value format |
---|---|
propertyName | Single String value |
propertyName:string:list | Multiple String values |
propertyName:int | Single Int value |
propertyName:int:list | Multiple Int values |
Header considerations:
- If a property name includes colons, you must specify both a type and cardinality. For example: yyyy:mm:dd:string:single.
- If you specify a cardinality, you must also specify a data type.
- Data type and cardinality header elements are case-insensitive.
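The header rules above can be sketched as a small parser. This is only an illustration of the documented format, not the bulk loader's actual implementation. The function name `parse_property_header` and the sets of accepted type and cardinality names are assumptions drawn from the headers and data types shown on this page.

```python
# Hypothetical helper: illustrates the propertyName:type:cardinality rules.
# The accepted names below are assumptions taken from this page's examples.
KNOWN_TYPES = {"bool", "boolean", "int", "integer", "long", "double", "string", "date"}
KNOWN_CARDINALITIES = {"single", "list"}


def parse_property_header(header):
    """Split a property column header into (name, type, cardinality).

    Type and cardinality are returned lowercased because the header
    elements are case-insensitive.
    """
    parts = header.split(":")
    if len(parts) == 1:
        # No type or cardinality: a single String value.
        return header, "string", "single"
    if len(parts) == 2:
        # Name and type only; cardinality is left at single.
        name, type_name = parts
        cardinality = "single"
    else:
        # Three or more parts: the last two are type and cardinality, and
        # everything before them is the (possibly colon-containing) name.
        name = ":".join(parts[:-2])
        type_name, cardinality = parts[-2], parts[-1]
    type_name = type_name.lower()
    cardinality = cardinality.lower()
    if type_name not in KNOWN_TYPES:
        raise ValueError(f"unknown data type in header {header!r}")
    if cardinality not in KNOWN_CARDINALITIES:
        raise ValueError(f"unknown cardinality in header {header!r}")
    return name, type_name, cardinality


print(parse_property_header("yyyy:mm:dd:string:single"))
```

Taking the last two elements as type and cardinality is what makes a name such as yyyy:mm:dd unambiguous, which is why both elements are required when the name contains colons.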
Allowable data types
The following data types are allowed:
Data type | Allowable values |
---|---|
Bool or Boolean | true, false |
Int or Integer | -2^31 to 2^31-1 |
Long | -2^63 to 2^63-1 |
Double | 64-bit IEEE 754 floating point |
String | Any string value. Quotation marks are optional. |
Date | Values must be in ISO-8601 format (for example, YYYY-MM-DD, YYYY-MM-DDTHH:MM:SS, YYYY-MM-DDTHH:MM:SSZ) |
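To check values against the table above before loading, a pre-validation step could look like the following sketch. The function `convert_value` is hypothetical and is not part of AGS. It assumes Boolean values are exactly the lowercase strings true and false, and it handles only the ISO-8601 forms listed in the table.

```python
from datetime import datetime

# Ranges from the data type table above.
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def _bounded_int(raw, low, high):
    """Parse a whole number and reject values outside [low, high]."""
    number = int(raw)
    if not low <= number <= high:
        raise ValueError(f"{raw} is outside the range {low} to {high}")
    return number


def convert_value(raw, type_name):
    """Convert one CSV field to a Python value for the named data type."""
    kind = type_name.lower()  # type names are case-insensitive
    if kind in ("bool", "boolean"):
        # Assumption: only the lowercase literals are accepted.
        if raw not in ("true", "false"):
            raise ValueError(f"{raw!r} is not true or false")
        return raw == "true"
    if kind in ("int", "integer"):
        return _bounded_int(raw, INT_MIN, INT_MAX)
    if kind == "long":
        return _bounded_int(raw, LONG_MIN, LONG_MAX)
    if kind == "double":
        return float(raw)
    if kind == "date":
        # Rewrite a trailing Z as an explicit UTC offset so that older
        # Python versions can parse it with fromisoformat.
        text = (raw[:-1] + "+00:00") if raw.endswith("Z") else raw
        return datetime.fromisoformat(text)
    if kind == "string":
        return raw
    raise ValueError(f"unknown data type {type_name!r}")


print(convert_value("2147483647", "Int"))
print(convert_value("2024-01-15T10:30:00Z", "Date"))
```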
Property column values
The value in a row of an edge or vertex file under the specified column header is taken as-is if the header indicates a single value. Multiple values are delimited with the ; character.
If you specify multiple values for a property:
- Ensure that the header for the column has its cardinality set to list. Example: propertyName:int:list.
- Separate the values with the ; character in the CSV file.
- Multiple string values are allowed, but the semicolon character cannot be escaped within a list of strings.
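As a rough illustration of how the cardinality in the header changes the reading of a field, the sketch below splits a field on semicolons only when the column is declared as a list. The function name `split_cell` is hypothetical, and the code is a model of the rule rather than the loader's own logic.

```python
def split_cell(raw, cardinality="single"):
    """Return the values held by one CSV field.

    A single-cardinality field is taken as-is, even if it contains a
    semicolon. A list-cardinality field is split on every semicolon,
    because the semicolon cannot be escaped.
    """
    if cardinality.lower() == "list":
        return raw.split(";")
    return [raw]


print(split_cell("32;67;21", "list"))  # three separate values
print(split_cell("32;67;21"))          # one value, taken as-is
```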
Vertex property multi-values
If you specify multiple values under a vertex property header in your CSV file, the resulting vertex has values for the property key that adhere to the TinkerPop multi-properties standard.
For example, imagine a graph of baseball players with a vertex named Shohei Ohtani. It could have a property named hasPlayedFor with the CSV value Angels;Dodgers. The resulting vertex has the following properties:
- Property 1: hasPlayedFor: Angels
- Property 2: hasPlayedFor: Dodgers
Edge property multi-values
If you specify multiple values under an edge property header with the list element in your CSV file, the resulting edge has a property where the multiple values are contained in a list.
For example, imagine a graph of baseball players with an edge property called teams. It could have the CSV value Yankees;Giants;Mariners. The resulting edge has a property with key teams and the value [Yankees, Giants, Mariners].
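The difference between the two behaviors can be modeled with plain Python data structures. This is only a conceptual sketch: the function names are hypothetical, and the tuples and dictionary stand in for graph properties rather than showing any AGS or TinkerPop API.

```python
def vertex_properties(key, raw):
    """Model a vertex multi-property: one (key, value) pair per value."""
    return [(key, value) for value in raw.split(";")]


def edge_property(key, raw):
    """Model an edge list property: one key holding a list of values."""
    return {key: raw.split(";")}


# A vertex gets two separate properties that share the same key.
print(vertex_properties("hasPlayedFor", "Angels;Dodgers"))
# An edge gets one property whose value is a list.
print(edge_property("teams", "Yankees;Giants;Mariners"))
```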
Data row elements
Rows must contain no spaces between delimited elements.
Element | Description |
---|---|
Delimiter | Fields are separated by commas. Records are separated by a newline (\n) or a carriage return followed by a newline (\r\n). |
Blank fields | Non-required columns may be left blank. Blank fields still require comma separators. |
Vertex IDs | The ~id value must be unique for all vertices in every vertex file. |
Edge IDs | User-provided ~id values for edges are not supported, so the ~id column is optional for edge CSV files. If your CSV file contains an ~id column, the values are ignored. |
Labels | Labels are case sensitive. |
String values | Surrounding string values with quotation marks is optional. Commas, newline, and carriage return characters are automatically escaped if they are included in a string surrounded by double quotation marks. |
CSV format specification
The CSV file format follows the RFC 4180 CSV specification, including the following requirements.
- Both Unix and Windows-style line endings are supported (\n or \r\n).
- Any field may be surrounded with double quotation marks (").
- Fields containing a line break, double quote, or comma must be quoted. If they are not, the load process errors out immediately.
- Blank fields are allowed. A blank field is considered an empty value.
- For list type columns, semicolons are used as list item delimiters.
For more information, see Common Format and MIME Type for CSV Files on the Internet Engineering Task Force (IETF) website.
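One way to produce files that meet these quoting requirements is to let a CSV library do the quoting instead of joining strings by hand. The sketch below uses the Python standard library `csv` module. The function name `render_csv` and the Note column are made up for illustration.

```python
import csv
import io


def render_csv(header, rows):
    """Render a header and data rows as RFC 4180 style CSV text.

    QUOTE_MINIMAL quotes only the fields that need it: those containing a
    comma, a double quote, or a line break. Embedded double quotes are
    doubled. Semicolon-delimited list values pass through unquoted.
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


text = render_csv(
    ["~id", "Name:String", "Scores:Int:list", "Note:String"],
    [["v1", "Bob Warner", "32;67;21", 'Said "hi", then left']],
)
print(text)
```

Because the header and every row go through the same writer, no spaces are introduced between delimited elements.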
Data file examples
The following vertex and edge data files illustrate a graph of university student records.
Vertex file
The CourseNum field in the following example has no specified data type in the header row, so it defaults to type String and the data values are all stored as strings.
Data file:
~id,Name:String,Scores:Int:list,Topic:String,Passed:Boolean,CourseNum
v1,"Bob Warner",32;67;21,"Physics",false,201
v2,"Gloria Mendes",41;85;92,"Music",true,"Three Hundred"
v3,"Susan Wolff",77;42;51,"Biology",false,330
v4,"James Halford",67;62;89,"Physics",true,101
v5,"Frieda Wolinsky",57;71;94,"Biology",true,"Two Forty"
v6,"Amy Cheng",28;59;73,"Music",false,101
v7,"Zack Hulot",59;77;93,"History",true,220
v8,"Rafael Kubelik",67;35;28,"History",false,"First Year Seminar"
v9,"Leah Starke",66;82;79,"Biology",true,330
v10,"Amber Florian",68;71;96,"Music",true,102
Tabular view of data:
~id | Name | Scores | Topic | Passed | CourseNum |
---|---|---|---|---|---|
v1 | "Bob Warner" | 32, 67, 21 | "Physics" | false | "201" |
v2 | "Gloria Mendes" | 41, 85, 92 | "Music" | true | "Three Hundred" |
v3 | "Susan Wolff" | 77, 42, 51 | "Biology" | false | "330" |
v4 | "James Halford" | 67, 62, 89 | "Physics" | true | "101" |
v5 | "Frieda Wolinsky" | 57, 71, 94 | "Biology" | true | "Two Forty" |
v6 | "Amy Cheng" | 28, 59, 73 | "Music" | false | "101" |
v7 | "Zack Hulot" | 59, 77, 93 | "History" | true | "220" |
v8 | "Rafael Kubelik" | 67, 35, 28 | "History" | false | "First Year Seminar" |
v9 | "Leah Starke" | 66, 82, 79 | "Biology" | true | "330" |
v10 | "Amber Florian" | 68, 71, 96 | "Music" | true | "102" |
Edge file
Data file:
~from,~to,~label,weight:Double
v1,v6,connected,0.7
v2,v9,connected,0.7
v3,v2,connected,0.7
v4,v8,connected,0.7
v5,v3,connected,0.7
v6,v4,connected,0.7
v7,v9,connected,0.7
v8,v1,connected,0.7
v9,v10,connected,0.7
v10,v3,connected,0.7
Tabular view of data:
~id | ~from | ~to | ~label | weight |
---|---|---|---|---|
(Auto-generated by AGS) | v1 | v6 | "connected" | 0.7 |
(Auto-generated by AGS) | v2 | v9 | "connected" | 0.7 |
(Auto-generated by AGS) | v3 | v2 | "connected" | 0.7 |
(Auto-generated by AGS) | v4 | v8 | "connected" | 0.7 |
(Auto-generated by AGS) | v5 | v3 | "connected" | 0.7 |
(Auto-generated by AGS) | v6 | v4 | "connected" | 0.7 |
(Auto-generated by AGS) | v7 | v9 | "connected" | 0.7 |
(Auto-generated by AGS) | v8 | v1 | "connected" | 0.7 |
(Auto-generated by AGS) | v9 | v10 | "connected" | 0.7 |
(Auto-generated by AGS) | v10 | v3 | "connected" | 0.7 |
Directory structure
Source data files must be stored in directories specified by the
aerospike.graphloader.vertices
and aerospike.graphloader.edges
configuration options.
- The directory specified in aerospike.graphloader.vertices must contain one or more subdirectories of vertex CSV files.
- The directory specified in aerospike.graphloader.edges must contain one or more subdirectories of edge CSV files.
- The CSV files in any one subdirectory must all contain the same row format, with the same header rows.
Directory structure examples
Local files example
The following examples illustrate the directory structures for local files.
- Data directory: /opt/aerospike/graph/data
- Vertex directory: /opt/aerospike/graph/data/vertices
  - The aerospike.graphloader.vertices configuration option must be set to /opt/aerospike/graph/data/vertices.
- Vertex CSV subdirectory 1: /opt/aerospike/graph/data/vertices/vert_dir1
  - All CSV files in the vert_dir1 subdirectory must have the same row format.
- Vertex CSV subdirectory 2: /opt/aerospike/graph/data/vertices/vert_dir2
  - All CSV files in the vert_dir2 subdirectory must have the same row format.
- Edge directory: /opt/aerospike/graph/data/edges
  - The aerospike.graphloader.edges configuration option must be set to /opt/aerospike/graph/data/edges.
- Edge CSV subdirectory 1: /opt/aerospike/graph/data/edges/edge_dir1
  - All CSV files in the edge_dir1 subdirectory must have the same row format.
Visual representation of the directory structure:
/opt/aerospike/graph/data
|---- /opt/aerospike/graph/data/vertices/
|-------- /opt/aerospike/graph/data/vertices/vert_dir1
|------------ /opt/aerospike/graph/data/vertices/vert_dir1/vert_file1.csv
|------------ /opt/aerospike/graph/data/vertices/vert_dir1/vert_file2.csv
|-------- /opt/aerospike/graph/data/vertices/vert_dir2
|------------ /opt/aerospike/graph/data/vertices/vert_dir2/vert_file3.csv
|------------ /opt/aerospike/graph/data/vertices/vert_dir2/vert_file4.csv
|---- /opt/aerospike/graph/data/edges/
|-------- /opt/aerospike/graph/data/edges/edge_dir1
|------------ /opt/aerospike/graph/data/edges/edge_dir1/edge_file1.csv
|------------ /opt/aerospike/graph/data/edges/edge_dir1/edge_file2.csv
Cloud storage files example
The following example illustrates the directory structure for a set of CSV files stored in Amazon S3 storage.
- S3 bucket: my-bucket
- Vertex directory: /my-bucket/vertices
  - The aerospike.graphloader.vertices configuration option must be set to s3://<bucket-name>/vertices.
- Vertex CSV subdirectory 1: /my-bucket/vertices/vert_dir1
  - All CSV files in the vert_dir1 subdirectory must have the same row format.
- Vertex CSV subdirectory 2: /my-bucket/vertices/vert_dir2
  - All CSV files in the vert_dir2 subdirectory must have the same row format.
- Edge directory: /my-bucket/edges
  - The aerospike.graphloader.edges configuration option must be set to s3://<bucket-name>/edges.
- Edge CSV subdirectory 1: /my-bucket/edges/edge_dir1
  - All CSV files in the edge_dir1 subdirectory must have the same row format.
Visual representation of the directory structure:
/my-bucket
|---- /my-bucket/vertices/
|-------- /my-bucket/vertices/vert_dir1/
|------------ /my-bucket/vertices/vert_dir1/vert_file1.csv
|------------ /my-bucket/vertices/vert_dir1/vert_file2.csv
|-------- /my-bucket/vertices/vert_dir2/
|------------ /my-bucket/vertices/vert_dir2/vert_file3.csv
|------------ /my-bucket/vertices/vert_dir2/vert_file4.csv
|---- /my-bucket/edges/
|-------- /my-bucket/edges/edge_dir1/
|------------ /my-bucket/edges/edge_dir1/edge_file1.csv
|------------ /my-bucket/edges/edge_dir1/edge_file2.csv