# CSV format for source data files

## Overview

Source files for bulk data loading into Aerospike Graph Service (AGS) use the comma-separated values (CSV) format described here.

::: note
The AGS bulk loader requires source data files to be stored in a defined directory structure as described in the [Directory Structure](#directory-structure) section of this page.
:::

## Header rows

Each CSV file has a comma-separated header row. Header rows must contain no spaces between delimited columns.

### Vertex data file headers

| Header | Required? | Description |
|----|----|----|
| `~id` | Yes | Unique ID for the vertex. `~id` values may be of data type `String` or `Long`. Number values must be positive whole numbers. |
| `~label` | No | Label for the vertex. If `~label` is not specified, the bulk loader adds a default value of `vertex` for the `~label` field. |

### Edge data file headers

| Header | Required? | Description |
|----|----|----|
| `~from` | Yes | Vertex ID of the *from* vertex. |
| `~to` | Yes | Vertex ID of the *to* vertex. |
| `~label` | No | Label for the edge. Each edge can have only one label. If `~label` is not specified, the bulk loader adds a default value of `edge` for the `~label` field. |

::: note
AGS does not support user-provided `~id` values for edges, so the `~id` column is optional for edge CSV files. If your CSV file contains an `~id` column, the values are ignored.
:::
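
The required-header rules in the two tables above can be checked before a load is started. The following Python sketch is illustrative only: `missing_required_headers` and `REQUIRED_HEADERS` are hypothetical names, not part of the AGS bulk loader.

```python
import csv
import io

# Required columns per file kind, taken from the header tables above.
REQUIRED_HEADERS = {
    "vertex": {"~id"},
    "edge": {"~from", "~to"},
}


def missing_required_headers(header_line, kind):
    """Return the sorted list of required headers absent from a header row."""
    columns = next(csv.reader(io.StringIO(header_line)))
    return sorted(REQUIRED_HEADERS[kind] - set(columns))


print(missing_required_headers("~id,Name:String", "vertex"))           # []
print(missing_required_headers("~from,~label,weight:Double", "edge"))  # ['~to']
```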

### Property column headers

Specify a header for each data column with the format `propertyName:type:cardinality`.

-   `propertyName` is the name for the property.
-   `type` is an optional specifier for the data type and defaults to `String` if not provided.
-   `cardinality` is an optional specifier to indicate that the property contains multiple values.

If no `type` or `cardinality` is specified, the value is treated as a single `String`.

Examples of valid headers:

| Header                     | Value format             |
|----------------------------|--------------------------|
| `propertyName`             | Single `String` value    |
| `propertyName:string:list` | Multiple `String` values |
| `propertyName:int`         | Single `Int` value       |
| `propertyName:int:list`    | Multiple `Int` values    |

Header considerations:

-   If a property name includes colons, you must specify both a type and cardinality. For example: `yyyy:mm:dd:string:single`.

-   If you specify a cardinality, you must also specify a data type.

-   Data type and cardinality header elements are case-insensitive.
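
One way to read the header rules above is sketched below in Python. This is not the bulk loader's own parser. `parse_property_header` is a hypothetical helper, and the sketch assumes that `single` and `list` are the only cardinality values, since those are the only ones shown on this page.

```python
# Type and cardinality names are compared in lowercase because the
# header elements are case-insensitive.
KNOWN_TYPES = {"bool", "boolean", "int", "integer", "long", "double", "string", "date"}
KNOWN_CARDINALITIES = {"single", "list"}  # assumption: only these two exist


def parse_property_header(header):
    """Split a header into (propertyName, type, cardinality)."""
    parts = header.split(":")
    if len(parts) == 1:
        # No type or cardinality: a single String value.
        return header, "string", "single"
    if len(parts) == 2:
        name, type_name = parts
        cardinality = "single"
    else:
        # Three or more parts: the last two are type and cardinality,
        # and everything before them (colons included) is the name.
        *name_parts, type_name, cardinality = parts
        name = ":".join(name_parts)
    type_name, cardinality = type_name.lower(), cardinality.lower()
    if type_name not in KNOWN_TYPES or cardinality not in KNOWN_CARDINALITIES:
        raise ValueError(f"unrecognized header: {header!r}")
    return name, type_name, cardinality


print(parse_property_header("propertyName:int:list"))      # ('propertyName', 'int', 'list')
print(parse_property_header("yyyy:mm:dd:string:single"))   # ('yyyy:mm:dd', 'string', 'single')
```

Because a name that contains colons is only recoverable when the last two elements are known to be type and cardinality, the sketch shows why both must be present in that case.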

### Allowable data types

The following data types are allowed:

| Data type | Allowable values |
|----|----|
| `Bool` or `Boolean` | `true`, `false` |
| `Int` or `Integer` | -2\^31 to 2\^31-1 |
| `Long` | -2\^63 to 2\^63-1 |
| `Double` | 64-bit IEEE 754 floating point |
| `String` | Any string value. Quotation marks are optional. |
| `Date` | Values must be in ISO-8601 format (for example, `YYYY-MM-DD`, `YYYY-MM-DDTHH:MM:SS`, `YYYY-MM-DDTHH:MM:SSZ`) |
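
The table above can be approximated with a small pre-load validator. The Python sketch below is an assumption-laden illustration, not the loader's conversion logic: `convert_value` is a hypothetical helper, Boolean values are assumed to be lowercase `true` or `false` exactly as listed, and dates are parsed with Python's `datetime.fromisoformat`, which accepts only a subset of ISO-8601.

```python
from datetime import datetime


def _bounded_int(raw, bits):
    """Parse an integer and check it fits a signed range of bits + 1 bits."""
    value = int(raw)
    if not -(2 ** bits) <= value <= 2 ** bits - 1:
        raise ValueError(f"{raw} is out of range")
    return value


def convert_value(raw, type_name):
    """Convert one CSV field to a Python value according to its declared type."""
    t = type_name.lower()
    if t in ("bool", "boolean"):
        if raw not in ("true", "false"):  # assumption: lowercase only
            raise ValueError(f"not a Boolean: {raw!r}")
        return raw == "true"
    if t in ("int", "integer"):
        return _bounded_int(raw, 31)   # -2^31 to 2^31-1
    if t == "long":
        return _bounded_int(raw, 63)   # -2^63 to 2^63-1
    if t == "double":
        return float(raw)
    if t == "date":
        # Older Python versions reject a trailing "Z", so rewrite it as an offset.
        text = raw[:-1] + "+00:00" if raw.endswith("Z") else raw
        return datetime.fromisoformat(text)
    if t == "string":
        return raw
    raise ValueError(f"unknown type: {type_name!r}")


print(convert_value("false", "Boolean"))   # False
print(convert_value("2147483647", "Int"))  # 2147483647
```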

### Property column values

In both vertex and edge files, a field under a column header that indicates a single value is taken as-is. A field under a header that indicates multiple values contains the values delimited with the `;` character.

If you specify multiple values for a property:

-   Ensure the header of the value has cardinality set to `list`. Example: `propertyName:int:list`.
-   Multiple values are separated by the `;` character in the CSV file.
-   Multiple string values are allowed, but the semicolon character cannot be escaped in a list of strings, so an individual string in a list cannot contain a semicolon.
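
The difference between single and `list` cardinality can be sketched in a few lines of Python. `parse_cell` is a hypothetical helper used only for illustration, and the `convert` argument stands in for whatever type conversion the column requires.

```python
def parse_cell(raw, convert=str, cardinality="single"):
    """Return one converted value, or a list of them for list cardinality."""
    if cardinality.lower() == "list":
        # Every semicolon is a delimiter; there is no escape mechanism.
        return [convert(item) for item in raw.split(";")]
    # Single cardinality: the field is taken as-is, semicolons included.
    return convert(raw)


print(parse_cell("32;67;21", int, "list"))  # [32, 67, 21]
print(parse_cell("Angels;Dodgers"))         # Angels;Dodgers
```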

#### Vertex property multi-values

If you specify multiple values underneath a vertex property header in your CSV file, the resulting vertex has values for the property key that adhere to the [TinkerPop multi-properties standard](https://aerospike.com/docs/graph/3.1.0/develop/query/multi-properties.md). For example, imagine a graph of baseball players with a vertex named `Shohei Ohtani`. It could have a property named `hasPlayedFor` with the CSV value `Angels;Dodgers`. The resulting vertex has the following properties:

-   Property 1 `hasPlayedFor: Angels`
-   Property 2 `hasPlayedFor: Dodgers`

#### Edge property multi-values

If you specify multiple values underneath an edge property header with the `list` element in your CSV file, the resulting edge has a single property whose value is a list. For example, imagine a graph of baseball players with an edge that has a property named `teams`. It could have the CSV value `Yankees;Giants;Mariners`. The resulting edge has a property with key `teams` and the value `[Yankees, Giants, Mariners]`.
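
The contrast between the two cases can be modeled with plain Python data structures. This is only a conceptual sketch: the function names are hypothetical, and the tuples and dictionary stand in for graph properties rather than using any TinkerPop or AGS API.

```python
def vertex_multi_property(key, raw):
    """Model a vertex multi-property: one separate property per value."""
    return [(key, value) for value in raw.split(";")]


def edge_list_property(key, raw):
    """Model an edge property: one property whose value is a list."""
    return {key: raw.split(";")}


print(vertex_multi_property("hasPlayedFor", "Angels;Dodgers"))
# [('hasPlayedFor', 'Angels'), ('hasPlayedFor', 'Dodgers')]
print(edge_list_property("teams", "Yankees;Giants;Mariners"))
# {'teams': ['Yankees', 'Giants', 'Mariners']}
```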

## Data row elements

Rows must contain no spaces between delimited elements.

| Element | Description |
|----|----|
| Delimiter | Fields are separated by commas. Records are separated by a newline (`\n`) or by a carriage return followed by a newline (`\r\n`). |
| Blank fields | Non-required columns may be left blank. Blank fields still require comma separators. |
| Vertex IDs | The `~id` value must be unique for all vertices in every vertex file. |
| Edge IDs | User-provided `~id` values for edges are not supported, so the `~id` column is optional for edge CSV files. If your CSV file contains an `~id` column, the values are ignored. |
| Labels | Labels are case sensitive. |
| String values | Surrounding string values with quotation marks is optional. Commas, newline, and carriage return characters are automatically escaped if they are included in a string surrounded by double quotation marks. |
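
Because `~id` values must be unique across every vertex file, it can be useful to scan for duplicates before loading. The following Python sketch assumes the file contents are already available as strings and compares IDs as text. `duplicate_vertex_ids` is a hypothetical helper, not an AGS tool.

```python
import csv
import io
from collections import Counter


def duplicate_vertex_ids(csv_texts):
    """Return the sorted ~id values that occur more than once across all files."""
    counts = Counter()
    for text in csv_texts:
        # newline="" leaves line endings for the csv module to interpret.
        for row in csv.DictReader(io.StringIO(text, newline="")):
            counts[row["~id"]] += 1
    return sorted(vertex_id for vertex_id, n in counts.items() if n > 1)


file_a = "~id,Name:String\nv1,Bob\nv2,Gloria\n"
file_b = "~id,Name:String\nv2,Susan\nv3,James\n"
print(duplicate_vertex_ids([file_a, file_b]))  # ['v2']
```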

## CSV format specification

The CSV file format follows the RFC 4180 CSV specification, including the following requirements.

-   Both Unix and Windows-style line endings are supported (`\n` or `\r\n`).

-   Any field may be surrounded with double quotation marks (`"`).

-   Fields containing a line break, a double quotation mark, or a comma must be enclosed in double quotation marks. If they are not, the load process fails immediately.

-   Blank fields are allowed. A blank field is considered an empty value.

-   For list type columns, semicolons are used as list item delimiters.

For more information, see [Common Format and MIME Type for CSV Files](https://tools.ietf.org/html/rfc4180) on the Internet Engineering Task Force (IETF) website.
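
To see how these quoting rules play out, the sketch below parses a small sample with Python's standard `csv` module, whose default dialect follows conventions close to RFC 4180. Whether the AGS parser matches it in every edge case is not confirmed here, and the sample data is invented for illustration.

```python
import csv
import io

# Windows-style line endings, a quoted comma, a doubled quote,
# an embedded line break, and a trailing blank field.
sample = (
    '~id,Name:String,Notes:String\r\n'
    'v1,"Warner, Bob","said ""hi""\nthen left"\r\n'
    'v2,Gloria,\r\n'
)

# newline="" stops io.StringIO from translating line endings itself.
rows = list(csv.reader(io.StringIO(sample, newline="")))

for row in rows:
    print(row)
```

In this sample the quoted comma and the embedded line break stay inside their fields, the doubled quotation mark becomes a single one, and the blank field is read as an empty string.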

## Data file examples

The following vertex and edge data files illustrate a graph of university student records.

### Vertex file

The `CourseNum` field in the following example has no specified data type in the header row, so it defaults to type `String` and the data values are all stored as strings.

Data file:

``` txt
~id,Name:String,Scores:Int:list,Topic:String,Passed:Boolean,CourseNum
v1,"Bob Warner",32;67;21,"Physics",false,201
v2,"Gloria Mendes",41;85;92,"Music",true,"Three Hundred"
v3,"Susan Wolff",77;42;51,"Biology",false,330
v4,"James Halford",67;62;89,"Physics",true,101
v5,"Frieda Wolinsky",57;71;94,"Biology",true,"Two Forty"
v6,"Amy Cheng",28;59;73,"Music",false,101
v7,"Zack Hulot",59;77;93,"History",true,220
v8,"Rafael Kubelik",67;35;28,"History",false,"First Year Seminar"
v9,"Leah Starke",66;82;79,"Biology",true,330
v10,"Amber Florian",68;71;96,"Music",true,102
```

Tabular view of data:

| \~id | Name              | Scores     | Topic     | Passed | CourseNum            |
|------|-------------------|------------|-----------|--------|----------------------|
| v1   | "Bob Warner"      | 32, 67, 21 | "Physics" | false  | "201"                |
| v2   | "Gloria Mendes"   | 41, 85, 92 | "Music"   | true   | "Three Hundred"      |
| v3   | "Susan Wolff"     | 77, 42, 51 | "Biology" | false  | "330"                |
| v4   | "James Halford"   | 67, 62, 89 | "Physics" | true   | "101"                |
| v5   | "Frieda Wolinsky" | 57, 71, 94 | "Biology" | true   | "Two Forty"          |
| v6   | "Amy Cheng"       | 28, 59, 73 | "Music"   | false  | "101"                |
| v7   | "Zack Hulot"      | 59, 77, 93 | "History" | true   | "220"                |
| v8   | "Rafael Kubelik"  | 67, 35, 28 | "History" | false  | "First Year Seminar" |
| v9   | "Leah Starke"     | 66, 82, 79 | "Biology" | true   | "330"                |
| v10  | "Amber Florian"   | 68, 71, 96 | "Music"   | true   | "102"                |

### Edge file

Data file:

``` txt
~from,~to,~label,weight:Double
v1,v6,connected,0.7
v2,v9,connected,0.7
v3,v2,connected,0.7
v4,v8,connected,0.7
v5,v3,connected,0.7
v6,v4,connected,0.7
v7,v9,connected,0.7
v8,v1,connected,0.7
v9,v10,connected,0.7
v10,v3,connected,0.7
```

Tabular view of data:

| \~id                    | \~from | \~to | \~label     | weight |
|-------------------------|--------|------|-------------|--------|
| (Auto-generated by AGS) | v1     | v6   | "connected" | 0.7    |
| (Auto-generated by AGS) | v2     | v9   | "connected" | 0.7    |
| (Auto-generated by AGS) | v3     | v2   | "connected" | 0.7    |
| (Auto-generated by AGS) | v4     | v8   | "connected" | 0.7    |
| (Auto-generated by AGS) | v5     | v3   | "connected" | 0.7    |
| (Auto-generated by AGS) | v6     | v4   | "connected" | 0.7    |
| (Auto-generated by AGS) | v7     | v9   | "connected" | 0.7    |
| (Auto-generated by AGS) | v8     | v1   | "connected" | 0.7    |
| (Auto-generated by AGS) | v9     | v10  | "connected" | 0.7    |
| (Auto-generated by AGS) | v10    | v3   | "connected" | 0.7    |

## Directory structure

Source data files must be stored in directories specified by the `aerospike.graphloader.vertices` and `aerospike.graphloader.edges` [configuration options](https://aerospike.com/docs/graph/reference/config.md).

-   The directory specified in `aerospike.graphloader.vertices` must contain one or more subdirectories of vertex CSV files.

-   The directory specified in `aerospike.graphloader.edges` must contain one or more subdirectories of edge CSV files.

-   The CSV files in any one subdirectory must all contain the same row format, with the same header rows.
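
The same-header requirement for each subdirectory can be checked on local files with a short script. The Python sketch below is a hedged example rather than an AGS utility: `inconsistent_subdirectories` is a hypothetical name, and the sketch assumes data files use the `.csv` extension as in the examples on this page.

```python
import csv
from pathlib import Path


def inconsistent_subdirectories(root):
    """Return names of subdirectories whose CSV files disagree on the header row."""
    problems = []
    for subdir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        headers = set()
        for csv_file in sorted(subdir.glob("*.csv")):
            with csv_file.open(newline="") as handle:
                # Read only the first record; an empty file yields ().
                headers.add(tuple(next(csv.reader(handle), ())))
        if len(headers) > 1:
            problems.append(subdir.name)
    return problems
```

For example, calling `inconsistent_subdirectories("/opt/aerospike/graph/data/vertices")` would return an empty list when every vertex subdirectory is internally consistent.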

### Directory structure examples

#### Local files example

The following example illustrates the directory structure for local files.

-   Data directory: `/opt/aerospike/graph/data`

-   Vertex directory: `/opt/aerospike/graph/data/vertices`

    -   The `aerospike.graphloader.vertices` configuration option must be set to `/opt/aerospike/graph/data/vertices`.

-   Vertex CSV subdirectory 1: `/opt/aerospike/graph/data/vertices/vert_dir1`

    -   All CSV files in the `vert_dir1` subdirectory must have the same row format.

-   Vertex CSV subdirectory 2: `/opt/aerospike/graph/data/vertices/vert_dir2`

    -   All CSV files in the `vert_dir2` subdirectory must have the same row format.

-   Edge directory: `/opt/aerospike/graph/data/edges`

    -   The `aerospike.graphloader.edges` configuration option must be set to `/opt/aerospike/graph/data/edges`.

-   Edge CSV subdirectory 1: `/opt/aerospike/graph/data/edges/edge_dir1`

    -   All CSV files in the `edge_dir1` subdirectory must have the same row format.

Visual representation of the directory structure:

``` txt
/opt/aerospike/graph/data
|
---- /opt/aerospike/graph/data/vertices/
|
-------- /opt/aerospike/graph/data/vertices/vert_dir1
|
------------ /opt/aerospike/graph/data/vertices/vert_dir1/vert_file1.csv
------------ /opt/aerospike/graph/data/vertices/vert_dir1/vert_file2.csv
|
-------- /opt/aerospike/graph/data/vertices/vert_dir2
|
------------ /opt/aerospike/graph/data/vertices/vert_dir2/vert_file3.csv
------------ /opt/aerospike/graph/data/vertices/vert_dir2/vert_file4.csv
|
---- /opt/aerospike/graph/data/edges/
|
-------- /opt/aerospike/graph/data/edges/edge_dir1
|
------------ /opt/aerospike/graph/data/edges/edge_dir1/edge_file1.csv
------------ /opt/aerospike/graph/data/edges/edge_dir1/edge_file2.csv
```

#### Cloud storage files example

The following example illustrates the directory structure for a set of CSV files stored in Amazon S3 storage.

-   S3 bucket: `my-bucket`

-   Vertex directory: `/my-bucket/vertices`

    -   The `aerospike.graphloader.vertices` configuration option must be set to `s3://<bucket-name>/vertices`.

-   Vertex CSV subdirectory 1: `/my-bucket/vertices/vert_dir1`

    -   All CSV files in the `vert_dir1` subdirectory must have the same row format.

-   Vertex CSV subdirectory 2: `/my-bucket/vertices/vert_dir2`

    -   All CSV files in the `vert_dir2` subdirectory must have the same row format.

-   Edge directory: `/my-bucket/edges`

    -   The `aerospike.graphloader.edges` configuration option must be set to `s3://<bucket-name>/edges`.

-   Edge CSV subdirectory 1: `/my-bucket/edges/edge_dir1`

    -   All CSV files in the `edge_dir1` subdirectory must have the same row format.

Visual representation of the directory structure:

``` txt
/my-bucket
|
---- /my-bucket/vertices/
|
-------- /my-bucket/vertices/vert_dir1/
|
------------ /my-bucket/vertices/vert_dir1/vert_file1.csv
------------ /my-bucket/vertices/vert_dir1/vert_file2.csv
|
-------- /my-bucket/vertices/vert_dir2/
|
------------ /my-bucket/vertices/vert_dir2/vert_file3.csv
------------ /my-bucket/vertices/vert_dir2/vert_file4.csv
|
---- /my-bucket/edges/
|
-------- /my-bucket/edges/edge_dir1/
|
------------ /my-bucket/edges/edge_dir1/edge_file1.csv
------------ /my-bucket/edges/edge_dir1/edge_file2.csv
```