Input Raw Data Specification

To use GraphStorm’s graph construction pipeline on a single machine or in a distributed environment, users should prepare their input raw data according to GraphStorm’s specifications explained below.

Data tables

The main part of GraphStorm input raw data is composed of two sets of tables, one for nodes and one for edges. These data tables can be in one of three file formats: csv files, parquet files, or HDF5 files. All three file formats store data in tables that contain headers, i.e., a list of column names, and the values belonging to each column.

Node tables

GraphStorm requires each node type to have its own table(s). It is suggested to store the table(s) of each node type in a separate folder.

In the table for one node type, there must be one column that stores the IDs of nodes. The IDs can be non-integers, such as strings. GraphStorm will treat non-integer IDs as strings and convert them into integer IDs.

If a node type has features, the features can be stored in multiple columns, each of which stores one type of feature. These features can be numerical, categorical, or textual data. Similarly, training labels associated with a node type can be stored in multiple columns, each of which stores one type of label.
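To make the ID conversion concrete, here is a minimal sketch of a string-to-integer mapping. It is only an illustration: the helper name `build_id_map` is hypothetical, and assigning integers in first-seen order is an assumption, not a description of how GraphStorm assigns IDs internally.

```python
def build_id_map(raw_ids):
    """Map each distinct raw ID (treated as a string) to a consecutive integer.

    Assumption: integers are assigned in first-seen order. GraphStorm's
    actual assignment may differ.
    """
    id_map = {}
    for raw_id in raw_ids:
        key = str(raw_id)  # non-integer IDs are handled as strings
        if key not in id_map:
            id_map[key] = len(id_map)
    return id_map


# String IDs in the style of the paper nodes used later in this document.
paper_ids = ["n1_1", "n1_2", "n1_3", "n1_4"]
id_map = build_id_map(paper_ids)
print(id_map)
```

The practical consequence is that raw IDs only need to be unique within a node type; they do not need to be numeric or contiguous.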

Edge tables

GraphStorm requires each edge type to have its own table(s). It is suggested to store the table(s) of each edge type in a separate folder.

In the table for one edge type, there must be two ID columns. One column stores the IDs of the source nodes of the edge type, and the other stores the IDs of the destination nodes. The source and destination node types should have their corresponding node tables. As with node features and labels, edge features and labels can be stored in multiple columns.

Note

If the number of rows is too large, it is suggested to split the data and store it in multiple table files that have an identical schema. Doing so can speed up data reading during graph construction when multiple processes are used.
It is suggested to use the parquet file format for its popularity and compressed file sizes. The HDF5 format is only suggested for data with a large volume of high-dimensional features.
Users can also store columns in multiple sets of table files, for example, putting “node IDs” and “feature_1” in one set consisting of the “table1_1.parquet” and “table1_2.parquet” files, and putting “feature_2” in another set consisting of the “table2_1.h5” and “table2_2.h5” files with the same row order.
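As a rough illustration of splitting rows across several files with an identical schema, the sketch below writes csv parts using only the Python standard library. The helper `write_csv_parts`, the `author_nodes` prefix, and the chunk size are hypothetical choices made for this example. Part numbers are zero-padded so that sorting the file names preserves the row order.

```python
import csv
import os
import tempfile


def write_csv_parts(header, rows, out_dir, prefix, rows_per_file, width=3):
    """Write rows into several csv files that all share the same header."""
    paths = []
    starts = range(0, len(rows), rows_per_file)
    for part, start in enumerate(starts, start=1):
        # Zero-padded part number, e.g. author_nodes_001.csv
        path = os.path.join(out_dir, f"{prefix}_{part:0{width}d}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # identical schema in every part
            writer.writerows(rows[start:start + rows_per_file])
        paths.append(path)
    return paths


header = ["n_id", "hdx"]
rows = [[60, 0.75], [70, 25.34], [80, 1.34]]
out_dir = tempfile.mkdtemp()
paths = write_csv_parts(header, rows, out_dir, "author_nodes", rows_per_file=2)
print([os.path.basename(p) for p in paths])
```

The same pattern applies to parquet or HDF5 output with the corresponding writer library; csv is used here only to keep the sketch dependency-free.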

Warning

If users split both rows and columns into multiple sets of table files, they need to make sure that, after the files are sorted by file name, the order of the rows in each column remains the same.

Suppose the columns are split into two file sets. One set includes a list of files, i.e., table_1.h5, table_2.h5, ..., table_9.h5, table_10.h5, table_11.h5, and another set also includes a list of files, i.e., table_001.h5, table_002.h5, ..., table_009.h5, table_010.h5, table_011.h5. The order of rows in the two sets of files is the same when using the original order of files in the two lists. However, after being sorted by the Linux OS, the first list becomes table_1.h5, table_10.h5, table_11.h5, table_2.h5, ..., table_9.h5, while the second list stays table_001.h5, table_002.h5, ..., table_009.h5, table_010.h5, table_011.h5. The order of files is now different, which causes a mismatch between node IDs and node features.

Therefore, it is strongly suggested to use the _000* file name template, like table_001, table_002, ..., table_009, table_010, table_011, ..., table_100, table_101, ....
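The sorting problem described above can be reproduced with a plain lexicographic sort. This short sketch uses Python's built-in `sorted` as a stand-in for file-name sorting and compares the two naming schemes.

```python
# File names without and with zero-padded numbering.
unpadded = [f"table_{i}.h5" for i in range(1, 12)]
padded = [f"table_{i:03d}.h5" for i in range(1, 12)]

# Lexicographic sorting moves table_10 and table_11 ahead of table_2.
print(sorted(unpadded)[:4])

# Zero-padded names keep their original order after sorting.
print(sorted(padded) == padded)
print(sorted(unpadded) == unpadded)
```

With three digits of padding the order stays stable up to 999 files; a wider padding is needed beyond that.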

Label split files (Optional)

In some cases, users may want to control which nodes or edges should be used for training, validation, or testing. To achieve this goal, users can set the customized label split information in three JSON files or parquet files.

For node split files, users just need to list the node IDs used for training in one file, the node IDs used for validation in another file, and the node IDs used for testing in a third file. If using JSON files, put one node ID on each line, as in the node split JSON example at the end of this document. If using parquet files, place the node IDs in one column and assign a column name to it.

For edge split files, users need to provide both source node IDs and destination node IDs in the split files. If using JSON files, put each edge on its own line as a JSON list with two elements, i.e., ["source node ID", "destination node ID"]. If using parquet files, place the source node IDs and destination node IDs in two columns and assign column names to them, as in the edge split parquet example at the end of this document.

If there is no validation or testing set, users do not need to create the corresponding file(s).
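The JSON layout for edge splits can be sketched as follows, with one two-element JSON list per line. The helper `write_edge_split` and the file names `train_edges.json` and `val_edges.json` are hypothetical. No test file is written here, which matches the case where no testing set exists.

```python
import json
import os
import tempfile


def write_edge_split(edges, path):
    """Write one edge per line as ["source node ID", "destination node ID"]."""
    with open(path, "w", encoding="utf-8") as f:
        for src, dst in edges:
            f.write(json.dumps([src, dst]) + "\n")


# Edges in the style of the example later in this document.
train_edges = [("n1_1", "eee"), ("n1_2", "eee"), ("n1_4", "llm")]
val_edges = [("n1_3", "mth")]

out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train_edges.json")  # hypothetical name
val_path = os.path.join(out_dir, "val_edges.json")      # hypothetical name
write_edge_split(train_edges, train_path)
write_edge_split(val_edges, val_path)

with open(train_path, encoding="utf-8") as f:
    print(f.read())
```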

A simple raw data example

To help users prepare the input raw data artifacts, this section provides a simple example.

This simple raw data has three types of nodes (paper, subject, and author) and two types of edges: (paper, has, subject) and (paper, written-by, author).

`paper` node tables

The paper table (paper_nodes.parquet) includes four columns, i.e., nid for node IDs, aff as a feature column with categorical values, class as a classification label column with 3 classes, and abs as a feature column with textual values.

nid	aff	class	abs
n1_1	NE	0	chips are
n1_2	MT	1	electricity
n1_3	UL	1	prime numbers
n1_4	TT	2	Questions are

`subject` node table

The subject table (subject_nodes.parquet) includes one column only, i.e., domain, functioning as node IDs.

domain
eee
mth
llm

`author` node table

The author table (author_nodes.parquet) includes two columns, i.e., n_id for node IDs, and hdx as a feature column with numerical values.

n_id	hdx
60	0.75
70	25.34
80	1.34

The author nodes also have 2048-dimensional embeddings, pre-computed from a textual feature and stored in an HDF5 file (author_node_embeddings.h5), as shown below.

embedding
0.2964, 0.0779, 1.2763, 2.8971, …, -0.2564, 0.9060, -0.8740
1.6941, -1.6765, 0.1862, -0.4449, …, 0.6474, 0.2358, -0.5952
-0.8417, 2.5096, -0.0393, -0.8208, …, 0.9894, 2.3389, 0.9778

Note

The order of rows in the author_node_embeddings.h5 file MUST be the same as in the author_nodes.parquet file, i.e., the first value row contains the embeddings for the author node with n_id 60, the second value row is for the author node with n_id 70, and so on.
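One way to guarantee this row order is to key the embeddings by node ID first and then emit them in node-table order before writing the file. The sketch below shows only that reordering step in plain Python. The helper `align_rows` is hypothetical, and the vectors are truncated to the first two values of each embedding row above purely for illustration.

```python
def align_rows(node_ids, rows_by_id):
    """Return rows ordered exactly like node_ids; fail if any ID is missing."""
    missing = [n for n in node_ids if n not in rows_by_id]
    if missing:
        raise KeyError(f"no embedding found for node IDs: {missing}")
    return [rows_by_id[n] for n in node_ids]


# Row order of the n_id column in the author node table.
node_order = [60, 70, 80]

# Embeddings keyed by n_id, deliberately listed in a different order.
embeddings_by_id = {
    80: [-0.8417, 2.5096],
    60: [0.2964, 0.0779],
    70: [1.6941, -1.6765],
}

aligned = align_rows(node_order, embeddings_by_id)
print(aligned[0])  # row for n_id 60
```

Writing `aligned` to the embedding file then keeps it consistent with the node table, since the embedding file itself carries no ID column to join on.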

`paper, has, subject` edge table

The paper, has, subject edge table (paper_has_subject_edges.parquet) includes three columns, i.e., nid as the source node IDs, domain as the destination IDs, and cnt as the label field for a regression task.

nid	domain	cnt
n1_1	eee	100
n1_2	eee	1
n1_3	mth	39
n1_4	llm	4700

`paper, written-by, author` edge table

The paper, written-by, author edge table (paper_written-by_author_edges.parquet) includes two columns, i.e., nid as the source node IDs and n_id as the destination IDs.

nid	n_id
n1_1	60
n1_2	60
n1_3	70
n1_4	70
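Before running graph construction, it can be useful to check that every ID in the edge tables also appears in the corresponding node table. The sketch below does this for the example data with in-memory values. The helper `find_dangling` is hypothetical, and the check is an optional sanity test rather than a step GraphStorm is stated to require.

```python
def find_dangling(edges, src_ids, dst_ids):
    """Return edges whose source or destination ID has no node table row."""
    return [(s, d) for s, d in edges if s not in src_ids or d not in dst_ids]


# Node IDs from the three example node tables.
paper_ids = {"n1_1", "n1_2", "n1_3", "n1_4"}
subject_ids = {"eee", "mth", "llm"}
author_ids = {60, 70, 80}

# Source and destination ID columns from the two example edge tables.
has_subject = [("n1_1", "eee"), ("n1_2", "eee"), ("n1_3", "mth"), ("n1_4", "llm")]
written_by = [("n1_1", 60), ("n1_2", 60), ("n1_3", 70), ("n1_4", 70)]

print(find_dangling(has_subject, paper_ids, subject_ids))
print(find_dangling(written_by, paper_ids, author_ids))
```

Both calls return an empty list for the example data. Note that author 80 has no edges, which this check does not flag because it only looks for edge endpoints that lack a node row.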

Node split JSON files

This example sets customized node split files on the paper nodes for a node classification task in the JSON format. There are two nodes in the training set, one node for validation, and one node for testing.

train.json contents

n1_2
n1_3

val.json contents

n1_4

test.json contents

n1_1

Edge split parquet files

This example sets customized edge split files on the paper, has, subject edges for an edge regression task in the parquet format. There are three edges in the training set, one edge for validation, and no edge for testing.

train_edges.parquet contents

nid	domain
n1_1	eee
n1_2	eee
n1_4	llm

val_edges.parquet contents

nid	domain
n1_3	mth
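A quick sanity check on customized split files is to confirm that the splits do not overlap and that every listed edge exists in the edge table. The sketch below runs that check on the example data. The variable names are illustrative, and the check itself is a suggestion rather than part of the GraphStorm specification.

```python
# (nid, domain) pairs from the paper, has, subject edge table.
all_edges = {("n1_1", "eee"), ("n1_2", "eee"), ("n1_3", "mth"), ("n1_4", "llm")}

# Contents of the two split files above.
train_split = {("n1_1", "eee"), ("n1_2", "eee"), ("n1_4", "llm")}
val_split = {("n1_3", "mth")}

overlap = train_split & val_split                  # edges listed in both splits
unknown = (train_split | val_split) - all_edges    # edges absent from the table

print(sorted(overlap))
print(sorted(unknown))
```

Both results are empty for this example, so the splits are disjoint and refer only to existing edges.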
