File Format for Source Data Files
Overview
Source files for bulk data loading into Aerospike Graph Service (AGS) use the comma-separated values (CSV) format described here.
The Aerospike Graph Bulk Loader requires source data files to be stored in a defined directory structure. For more information, see the Directory Structure section of this page.
Header rows
Each CSV file has a comma-separated header row. Header rows should contain no spaces between delimited columns.
Vertex data file headers
Header | Required? | Description |
---|---|---|
~id | Yes | Unique ID for the vertex. ~id values may be of data type String, Int, or Long. Number values must be positive whole numbers. |
~label | No | A label for the vertex. If ~label is not specified, the bulk loader adds a default value of vertex for the ~label field. |
Edge data file headers
Header | Required? | Description |
---|---|---|
~from | Yes | The vertex ID of the from vertex. |
~to | Yes | The vertex ID of the to vertex. |
~label | No | A label for the edge. Each edge can have only one label. If ~label is not specified, the bulk loader adds a default value of edge for the ~label field. |
AGS does not support user-provided ~id values for edges, so the ~id column is optional for edge CSV files. If your CSV file contains an ~id column, the values are ignored.
Property column headers
Specify a header for each data column with the format propertyname:type. propertyname specifies a name for the data column. type specifies a data type for the column. Columns with no data type specified default to type String.
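The propertyname:type rule can be sketched as a small header parser. This is a minimal illustration, not the bulk loader's own code: the names Column, parse_property_header, and parse_header_row are hypothetical, and the sketch assumes that reserved columns (those starting with ~) are handled elsewhere and that a trailing [] on the type marks a list column.

```python
from typing import List, NamedTuple


class Column(NamedTuple):
    name: str       # property name, for example "Scores"
    type_name: str  # element type, for example "Int"
    is_list: bool   # True when the declared type ends with "[]"


def parse_property_header(field: str) -> Column:
    """Split one header field such as 'Scores:Int[]' into its parts."""
    name, _, type_name = field.partition(":")
    if not type_name:
        # Columns with no data type specified default to String.
        type_name = "String"
    is_list = type_name.endswith("[]")
    if is_list:
        type_name = type_name[:-2]
    return Column(name, type_name, is_list)


def parse_header_row(header_row: str) -> List[Column]:
    """Parse every property column in a header row, skipping ~id, ~label, ~from, and ~to."""
    return [
        parse_property_header(field)
        for field in header_row.split(",")
        if not field.startswith("~")
    ]
```

Calling parse_header_row("~id,Name:String,CourseNum") yields a Name column and a CourseNum column, both of type String, because CourseNum carries no type suffix.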
Allowable data types
The following data types are allowed:
Data type | Allowable values |
---|---|
Bool or Boolean | true, false |
Int | -2^31 to 2^31-1 |
Long | -2^63 to 2^63-1 |
Double | 64-bit IEEE 754 floating point |
String | Any string value. Quotation marks are optional. |
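The ranges in the table above can be checked before a load with a short validation helper. The following is a sketch under stated assumptions: convert_value is a hypothetical name, Boolean values are assumed to be lowercase only, and Python's int() and float() accept some spellings (such as surrounding whitespace) that the bulk loader may reject.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def convert_value(text: str, type_name: str):
    """Convert one CSV cell to a Python value, enforcing the documented ranges."""
    if type_name in ("Bool", "Boolean"):
        # Assumption: only the lowercase literals from the table are accepted.
        if text not in ("true", "false"):
            raise ValueError(f"not a Boolean: {text!r}")
        return text == "true"
    if type_name in ("Int", "Long"):
        low, high = (INT_MIN, INT_MAX) if type_name == "Int" else (LONG_MIN, LONG_MAX)
        value = int(text)
        if not low <= value <= high:
            raise ValueError(f"{value} is out of range for {type_name}")
        return value
    if type_name == "Double":
        # Python floats are 64-bit IEEE 754 values on common platforms.
        return float(text)
    if type_name == "String":
        return text
    raise ValueError(f"unsupported data type: {type_name}")
```

For example, convert_value("2147483647", "Int") succeeds, while convert_value("2147483648", "Int") raises ValueError because the value exceeds 2^31-1; the same text is valid as a Long.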
List values
Any property may contain a list of values. All values in the list must be the same data type. Mixed type lists are not supported. To specify a list of values, add [] to the data type for the column. Example: qty:Int[].
List values are separated by semicolons. Lists of strings are allowed, but the semicolon character cannot be escaped in a list of strings.
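As an illustration of the semicolon rule, the following sketch splits one list cell and converts each item with a single converter, which also enforces the same-type requirement. The function name parse_list_cell is hypothetical, and treating a blank cell as an empty list is an assumption, not documented loader behavior.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def parse_list_cell(cell: str, convert: Callable[[str], T]) -> List[T]:
    """Split a list cell on semicolons and convert every item with one converter."""
    if cell == "":
        # Assumption: a blank list cell is treated as an empty list.
        return []
    # Semicolons cannot be escaped, so a plain split is sufficient.
    return [convert(item) for item in cell.split(";")]
```

With a column declared as Scores:Int[], parse_list_cell("32;67;21", int) returns [32, 67, 21]. A cell such as "32;high;21" raises ValueError from int(), reflecting that mixed-type lists are not supported.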
Data row elements
Rows should contain no spaces between delimited elements.
Element | Description |
---|---|
Delimiter | Fields are separated by commas. Records are separated by a newline (\n) or by a carriage return followed by a newline (\r\n). |
Blank fields | Non-required columns may be left blank. Blank fields still require comma separators. |
Vertex IDs | Each ~id value must be unique across all vertices in all vertex files. |
Edge IDs | User-provided ~id values for edges are not supported, so the ~id column is optional for edge CSV files. If your CSV file contains an ~id column, the values are ignored. |
Labels | Labels are case sensitive. |
String values | Surrounding string values with quotation marks is optional. Commas, newlines, and carriage returns are automatically escaped if they are included in a string surrounded by double quotation marks. |
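Because vertex IDs must be unique across files, it can help to scan the files for duplicates before loading. The sketch below is one way to do that with Python's csv module. The name find_duplicate_vertex_ids is hypothetical, and the function assumes every file has a header row containing an ~id column.

```python
import csv
from collections import Counter
from typing import Iterable, List, TextIO


def find_duplicate_vertex_ids(files: Iterable[TextIO]) -> List[str]:
    """Return the ~id values that appear more than once across all vertex files."""
    counts = Counter()
    for handle in files:
        # DictReader uses the header row, so ~id may be in any column position.
        for row in csv.DictReader(handle):
            counts[row["~id"]] += 1
    return sorted(vertex_id for vertex_id, seen in counts.items() if seen > 1)
```

Open each file with open(path, newline="") and pass the handles in a list; an empty result means no ~id value is repeated.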
CSV format specification
The CSV file format follows the RFC 4180 CSV specification, including the following requirements:
- Both Unix and Windows-style line endings are supported (\n or \r\n).
- Any field may be surrounded with double quotation marks (").
- Fields containing a line break, double quote, or comma must be quoted. If they are not, the load process errors out immediately.
- Blank fields are allowed. A blank field is considered an empty value.
- For list type columns, semicolons are used as list item delimiters.
For more information, see Common Format and MIME Type for CSV Files on the Internet Engineering Task Force (IETF) website.
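The quoting rules can be explored with Python's standard csv module, whose default dialect follows the same conventions for quoted fields. This is only an illustration of the format, not of how the bulk loader parses files; read_rows is a hypothetical helper.

```python
import csv
import io
from typing import List


def read_rows(text: str) -> List[List[str]]:
    """Parse CSV text, keeping commas and line breaks that appear inside quoted fields."""
    # newline="" leaves line endings untouched so the csv module can handle \n and \r\n itself.
    return list(csv.reader(io.StringIO(text, newline="")))


sample = (
    'v1,"Warner, Bob","line one\nline two"\r\n'  # quoted comma and quoted line break
    'v2,"She said ""hi""",\n'                    # doubled quote and a trailing blank field
)
rows = read_rows(sample)
```

Here rows holds two records of three fields each. The first keeps the comma and the line break inside its quoted fields, and the second ends with an empty string for the blank field.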
Data file examples
The following example vertex and edge data files illustrate an example graph of university student records.
Vertex file people.csv
Data file:
~id,Name:String,Scores:Int[],Topic:String,Passed:Boolean,CourseNum
v1,"Bob Warner",32;67;21,"Physics",false,201
v2,"Gloria Mendes",41;85;92,"Music",true,"Three Hundred"
v3,"Susan Wolff",77;42;51,"Biology",false,330
v4,"James Halford",67;62;89,"Physics",true,101
v5,"Frieda Wolinsky",57;71;94,"Biology",true,"Two Forty"
v6,"Amy Cheng",28;59;73,"Music",false,101
v7,"Zack Hulot",59;77;93,"History",true,220
v8,"Rafael Kubelik",67;35;28,"History",false,"First Year Seminar"
v9,"Leah Starke",66;82;79,"Biology",true,330
v10,"Amber Florian",68;71;96,"Music",true,102
Tabular view of data:
~id | Name | Scores | Topic | Passed | CourseNum |
---|---|---|---|---|---|
v1 | "Bob Warner" | 32, 67, 21 | "Physics" | false | "201" |
v2 | "Gloria Mendes" | 41, 85, 92 | "Music" | true | "Three Hundred" |
v3 | "Susan Wolff" | 77, 42, 51 | "Biology" | false | "330" |
v4 | "James Halford" | 67, 62, 89 | "Physics" | true | "101" |
v5 | "Frieda Wolinsky" | 57, 71, 94 | "Biology" | true | "Two Forty" |
v6 | "Amy Cheng" | 28, 59, 73 | "Music" | false | "101" |
v7 | "Zack Hulot" | 59, 77, 93 | "History" | true | "220" |
v8 | "Rafael Kubelik" | 67, 35, 28 | "History" | false | "First Year Seminar" |
v9 | "Leah Starke" | 66, 82, 79 | "Biology" | true | "330" |
v10 | "Amber Florian" | 68, 71, 96 | "Music" | true | "102" |
In the above example, the CourseNum field has no specified data type in the header row, so it defaults to type String and the data values are all stored as strings.
Edge file connected.csv
Data file:
~from,~to,~label,weight:Double
v1,v6,connected,0.7
v2,v9,connected,0.7
v3,v2,connected,0.7
v4,v8,connected,0.7
v5,v3,connected,0.7
v6,v4,connected,0.7
v7,v9,connected,0.7
v8,v1,connected,0.7
v9,v10,connected,0.7
v10,v3,connected,0.7
Tabular view of data:
~id | ~from | ~to | ~label | weight |
---|---|---|---|---|
(Auto-generated by AGS) | v1 | v6 | "connected" | 0.7 |
(Auto-generated by AGS) | v2 | v9 | "connected" | 0.7 |
(Auto-generated by AGS) | v3 | v2 | "connected" | 0.7 |
(Auto-generated by AGS) | v4 | v8 | "connected" | 0.7 |
(Auto-generated by AGS) | v5 | v3 | "connected" | 0.7 |
(Auto-generated by AGS) | v6 | v4 | "connected" | 0.7 |
(Auto-generated by AGS) | v7 | v9 | "connected" | 0.7 |
(Auto-generated by AGS) | v8 | v1 | "connected" | 0.7 |
(Auto-generated by AGS) | v9 | v10 | "connected" | 0.7 |
(Auto-generated by AGS) | v10 | v3 | "connected" | 0.7 |
Directory structure
Source data files should be stored in directories specified by the aerospike.graphloader.vertices and aerospike.graphloader.edges configuration options.
- The directory specified in aerospike.graphloader.vertices must contain one or more subdirectories of vertex CSV files.
- The directory specified in aerospike.graphloader.edges must contain one or more subdirectories of edge CSV files.
- The CSV files in any one subdirectory must all contain the same row format, with the same header rows.
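The same-header requirement can be checked on a local directory before running the loader. The following sketch makes several assumptions: the helper names are hypothetical, files use the .csv extension and UTF-8 encoding, "same header rows" is interpreted as an exact match of the first line, and the data lives on a local filesystem rather than in cloud storage.

```python
from pathlib import Path
from typing import List


def read_header(path: Path) -> str:
    """Return the first line of a CSV file without its line ending."""
    with path.open(newline="", encoding="utf-8") as handle:
        return handle.readline().rstrip("\r\n")


def find_inconsistent_subdirs(root: str) -> List[str]:
    """List the subdirectories of root whose CSV files do not all share one header row."""
    inconsistent = []
    for subdir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        headers = {read_header(csv_file) for csv_file in subdir.glob("*.csv")}
        if len(headers) > 1:
            inconsistent.append(subdir.name)
    return inconsistent
```

Run it once on the vertex directory and once on the edge directory; any subdirectory name it returns contains files with differing header rows.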
Directory structure examples
Local files example
This example uses the following directory structure:
- Data directory: /opt/aerospike/graph/data
- Vertex directory: /opt/aerospike/graph/data/vertices
  - The aerospike.graphloader.vertices configuration option should be set to /opt/aerospike/graph/data/vertices.
- Vertex CSV subdirectory 1: /opt/aerospike/graph/data/vertices/vert_dir1
  - All CSV files in the vert_dir1 subdirectory must have the same row format.
- Vertex CSV subdirectory 2: /opt/aerospike/graph/data/vertices/vert_dir2
  - All CSV files in the vert_dir2 subdirectory must have the same row format.
- Edge directory: /opt/aerospike/graph/data/edges
  - The aerospike.graphloader.edges configuration option should be set to /opt/aerospike/graph/data/edges.
- Edge CSV subdirectory 1: /opt/aerospike/graph/data/edges/edge_dir1
  - All CSV files in the edge_dir1 subdirectory must have the same row format.
Visual representation of the directory structure:
/opt/aerospike/graph/data
|
---- /opt/aerospike/graph/data/vertices/
|
-------- /opt/aerospike/graph/data/vertices/vert_dir1
|
------------ /opt/aerospike/graph/data/vertices/vert_dir1/vert_file1.csv
------------ /opt/aerospike/graph/data/vertices/vert_dir1/vert_file2.csv
|
-------- /opt/aerospike/graph/data/vertices/vert_dir2
|
------------ /opt/aerospike/graph/data/vertices/vert_dir2/vert_file3.csv
------------ /opt/aerospike/graph/data/vertices/vert_dir2/vert_file4.csv
|
---- /opt/aerospike/graph/data/edges/
|
-------- /opt/aerospike/graph/data/edges/edge_dir1
|
------------ /opt/aerospike/graph/data/edges/edge_dir1/edge_file1.csv
------------ /opt/aerospike/graph/data/edges/edge_dir1/edge_file2.csv
Cloud storage files example
The following example illustrates the directory structure for a set of CSV files stored in AWS S3. It uses the following directory structure:
- S3 bucket: myBucket
- Vertex directory: /myBucket/vertices
  - The aerospike.graphloader.vertices configuration option should be set to s3://myBucket/vertices.
- Vertex CSV subdirectory 1: /myBucket/vertices/vert_dir1
  - All CSV files in the vert_dir1 subdirectory must have the same row format.
- Vertex CSV subdirectory 2: /myBucket/vertices/vert_dir2
  - All CSV files in the vert_dir2 subdirectory must have the same row format.
- Edge directory: /myBucket/edges
  - The aerospike.graphloader.edges configuration option should be set to s3://myBucket/edges.
- Edge CSV subdirectory 1: /myBucket/edges/edge_dir1
  - All CSV files in the edge_dir1 subdirectory must have the same row format.
Visual representation of the directory structure:
/myBucket
|
---- /myBucket/vertices/
|
-------- /myBucket/vertices/vert_dir1/
|
------------ /myBucket/vertices/vert_dir1/vert_file1.csv
------------ /myBucket/vertices/vert_dir1/vert_file2.csv
|
-------- /myBucket/vertices/vert_dir2/
|
------------ /myBucket/vertices/vert_dir2/vert_file3.csv
------------ /myBucket/vertices/vert_dir2/vert_file4.csv
|
---- /myBucket/edges/
|
-------- /myBucket/edges/edge_dir1/
|
------------ /myBucket/edges/edge_dir1/edge_file1.csv
------------ /myBucket/edges/edge_dir1/edge_file2.csv