Use Your Own Data

It is easy for users to prepare their own graph data and leverage GraphStorm’s built-in GNN models, e.g., RGCN, RGAT and HGT, to perform GML tasks. It takes three steps to use your own graph data in GraphStorm:

  • Step 1: Prepare your own graph data in the required format.

  • Step 2: Modify the GraphStorm configuration YAML file.

  • Step 3: Launch GraphStorm commands for training/inference.

Step 1: Prepare Your Own Graph Data

There are two options to prepare your own graph data for using GraphStorm:

  • Option 1: prepare your graph in the raw table format that GraphStorm’s construction tools require, and use these tools to automatically generate the input files. This is the preferred method, as GraphStorm provides distributed data processing and construction tools that can handle extremely large graph data.

  • Option 2: prepare your data as a DGL heterogeneous graph following the specific format described below, and then use GraphStorm’s partition tools to generate the input files. This option is for experienced DGL users and relatively small graph data.

Option 1: Required raw data format

GraphStorm provides a set of graph construction tools to generate input files for the training/inference commands. To use these tools, users need to prepare their graph data in the required raw data format.

In general, the graph construction tool needs three sets of files as inputs. The detailed information about the raw data format can be found in the Graph Construction Configurations.

  • A configuration JSON file (required). It describes the graph structure, i.e. nodes and edges information, the tasks to perform, the node features, label information, and raw data file paths.

  • A set of raw node data files (optional). Each node type must have at least one associated file. If a file is too big, users can split it into multiple files that have the same columns but different rows.

  • A set of raw edge data files (required). Each edge type must have at least one associated file. If a file is too big, users can split it into multiple files that have the same columns but different rows.

This tutorial uses the ACM publication graph as a demonstration to show how to prepare users’ own graph data, and what these files and their contents look like.

Note

The following commands assume users have installed GraphStorm.

First run the below commands to clone GraphStorm source code from GitHub, and go to the root path of GraphStorm source code.

git clone https://github.com/awslabs/graphstorm.git
cd graphstorm

Then run the following command to create the ACM data in the required raw format.

python examples/acm_data.py --output-path /tmp/acm_raw

Once it succeeds, the command creates the three sets of files under the /tmp/acm_raw/ folder, as shown below. The next sections explain each of them in detail.

/tmp/acm_raw
config.json
|- edges
    author_writing_paper.parquet
    paper_cited_paper.parquet
    paper_citing_paper.parquet
    paper_is-about_subject.parquet
    paper_written-by_author.parquet
    subject_has_paper.parquet
|- nodes
    author.parquet
    paper.parquet
    subject.parquet

The input configuration JSON

The above command automatically creates the exemplary ACM config.json file, part of whose contents is listed below.

{
    "version": "gconstruct-v0.1",
    "nodes": [

        ......

        {
            "node_type": "paper",
            "format": {
                "name": "parquet"
            },
            "files": [
                "/tmp/acm_raw/nodes/paper.parquet"
            ],
            "node_id_col": "node_id",
            "features": [
                {
                    "feature_col": "feat",
                    "feature_name": "feat"
                }
            ],
            "labels": [
                {
                    "label_col": "label",
                    "task_type": "classification",
                    "split_pct": [
                        0.8,
                        0.1,
                        0.1
                    ]
                }
            ]
        },

        ......

    ],
    "edges": [

        ......

        {
            "relation": [
                "paper",
                "citing",
                "paper"
            ],
            "format": {
                "name": "parquet"
            },
            "files": [
                "/tmp/acm_raw/edges/paper_citing_paper.parquet"
            ],
            "source_id_col": "source_id",
            "dest_id_col": "dest_id",
            "features": [
                {
                    "feature_col": "cate_feat",
                    "feature_name": "cate_feat",
                    "transform": {
                        "name": "to_categorical"
                    }
                }
            ],
            "labels": [
                {
                    "task_type": "link_prediction",
                    "split_pct": [
                        0.8,
                        0.1,
                        0.1
                    ]
                }
            ]
        },

    ......

    ]
}

Based on the original ACM dataset, this example builds a simple heterogeneous graph that contains three types of nodes and six types of edges, as shown in the diagram below.

[Image: ACM_schema.png, showing the three node types and six edge types of the example ACM graph]

The exemplary ACM graph also predefines two sets of labels. One set is associated with the paper type nodes for a node classification demonstration, and the other is associated with the paper,citing,paper type edges for a link prediction demonstration. The JSON contents above specify how to split these labels, i.e., they ask the GraphStorm graph construction tools to randomly split the labels into three groups: 80% for training, 10% for validation, and the remaining 10% for testing.

Customized label split

If users want to split labels with their own logic, e.g., by time, they can split the labels first and then provide the split information in the configuration JSON file, pointing to either JSON or parquet files as in the examples below. When using parquet files as input, please specify the column that stores the labeled IDs via the column field, whose value can be either a string or a list with a single string. The parquet file paths themselves can be a single string, a list of strings, or a string containing wildcards.

JSON:

"labels": [
    {
        "label_col": "label",
        "task_type": "classification",
        "custom_split_filenames": {"train": "/tmp/acm_raw/nodes/train_idx.json",
                                   "valid": "/tmp/acm_raw/nodes/val_idx.json",
                                   "test": "/tmp/acm_raw/nodes/test_idx.json"}
    }
]

Parquet:

"labels": [
    {
        "label_col": "label",
        "task_type": "classification",
        "custom_split_filenames": {"train": "/tmp/acm_raw/nodes/train_idx.parquet",
                                   "valid": ["/tmp/acm_raw/nodes/val_idx_1.parquet", "/tmp/acm_raw/nodes/val_idx_2.parquet"],
                                   "test": "/tmp/acm_raw/nodes/test_idx_*.parquet"
                                    "column": "ID"}
    }
]
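
Such a parquet split file can be produced with pandas (assuming pyarrow is installed); below is a minimal sketch with hypothetical node IDs, using the ID column name from the example above.

import pandas as pd

# Node IDs of the training set, stored under the column given in the "column" field (here "ID").
pd.DataFrame({"ID": ["p50", "p51", "p52"]}).to_parquet("/tmp/acm_raw/nodes/train_idx.parquet")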

Instead of using split_pct, users can specify the custom_split_filenames configuration, whose value is a dictionary, to use a custom data split. The dictionary’s keys can include train, valid, and test, and its values are the JSON (or parquet) files that contain the node IDs of each set.

These JSON files only need to list the IDs of their own set. For example, suppose a node classification task has 100 nodes with IDs starting from 0, and the last 50 nodes (IDs 50 to 99) have labels associated. For some business logic, users want the first 10 of the 50 labeled nodes as the training set, the last 30 as the test set, and the middle 10 as the validation set. Then the train_idx.json file should contain the integers from 50 to 59, one integer per line. Similarly, the val_idx.json file should contain the integers from 60 to 69, and the test_idx.json file should contain the integers from 70 to 99. Contents of the train_idx.json file look like the following.

50
51
52
...
59
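
As a reference, the following minimal Python sketch writes these three split files using the hypothetical ID ranges from the scenario above.

# Write one node ID per line for each custom split file (paths match the JSON example above).
splits = {
    "/tmp/acm_raw/nodes/train_idx.json": range(50, 60),
    "/tmp/acm_raw/nodes/val_idx.json": range(60, 70),
    "/tmp/acm_raw/nodes/test_idx.json": range(70, 100),
}
for path, ids in splits.items():
    with open(path, "w") as f:
        for node_id in ids:
            f.write(f"{node_id}\n")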

For edge data, users can define customized edge labels in the same way as customized node labels. The configuration for JSON files looks the same; for parquet files, users need to specify both the source ID column and the destination ID column as a list of strings:

JSON:

"labels": [
    {
        "label_col": "label",
        "task_type": "classification",
        "custom_split_filenames": {"train": "/tmp/acm_raw/edges/train_idx.json",
                                   "valid": "/tmp/acm_raw/edges/val_idx.json",
                                   "test": "/tmp/acm_raw/edges/test_idx.json"}
    }
]

Parquet:

"labels": [
    {
        "label_col": "label",
        "task_type": "classification",
        "custom_split_filenames": {"train": "/tmp/acm_raw/edges/train_idx.parquet",
                                   "valid": "/tmp/acm_raw/edges/val_idx.parquet",
                                   "test": "/tmp/acm_raw/edges/test_idx.parquet",
                                   "column":  ["src", "dst"]}
    }
]

The files referenced in the dictionary should be JSON files here as well. Each line of a JSON file should be an array containing the source node ID and destination node ID of one edge. For example, contents of train_idx.json should look like the following:

["p0", "p1301"]
["p0", "p9830"]
["p1", "p1910"]
["p1", "p2165"]
["p1", "p6894"]
["p12497", "p12498"]

Input raw node/edge data files

The raw node and edge data files are both in the parquet format, whose contents are illustrated in the diagram below.

[Image: ACM_raw_parquet.png, showing example contents of the raw node and edge parquet files]

In this example, only the paper nodes have labels and the task is node classification. So, in the JSON file, the paper node has the labels field, and the task_type is specified as classification. Correspondingly, the paper node parquet file has a column, label, that stores the label values. The edge types generally do not have associated features (the paper,citing,paper edges above additionally carry a categorical cate_feat column), so an edge parquet file only needs two columns, source_id and dest_id. For the link prediction task there are no actual labels; users just need to specify the labels field in one or more edge objects of the JSON config file.
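
For illustration, the raw parquet files can be produced with pandas (assuming pyarrow is installed); the sketch below uses hypothetical toy values with only a few rows.

import numpy as np
import pandas as pd

# Node file: one row per paper node, with the node ID column, a feature column, and a label column.
paper_df = pd.DataFrame({
    "node_id": [f"p{i}" for i in range(4)],
    "feat": [np.random.rand(16).tolist() for _ in range(4)],  # per-node feature vectors
    "label": [0, 3, 1, 2],                                     # classification labels
})
paper_df.to_parquet("/tmp/acm_raw/nodes/paper.parquet")

# Edge file: one row per edge, with only the source and destination ID columns.
cite_df = pd.DataFrame({
    "source_id": ["p0", "p1", "p2"],
    "dest_id": ["p1", "p3", "p0"],
})
cite_df.to_parquet("/tmp/acm_raw/edges/paper_citing_paper.parquet")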

Run graph construction

The configuration JSON file along with these node and edge parquet files are the required inputs of GraphStorm’s construction tools. We can then use the tool to create the partitioned graph data with the following command.

python -m graphstorm.gconstruct.construct_graph \
          --conf-file /tmp/acm_raw/config.json \
          --output-dir /tmp/acm_gs \
          --num-parts 1 \
          --graph-name acm

Outputs of graph construction

The above command reads in the JSON file and matches its contents with the node and edge parquet files. It then reads all parquet files, checks file correctness, pre-processes features, constructs the graph, and eventually splits it into partitions. Outputs of the command are saved under the /tmp/acm_gs/ folder as follows:

/tmp/acm_gs
acm.json
edge_label_stats.json
edge_mapping.pt
node_label_stats.json
node_mapping.pt
|- part0
    edge_feat.dgl
    graph.dgl
    node_feat.dgl
|- raw_id_mappings
    |- author
        part-00000.parquet
    |- paper
        part-00000.parquet
    |- subject
        part-00000.parquet

Because the above command specifies the --num-parts to be 1, there is only one partition created, which is saved in the part0 folder. These files become the inputs of GraphStorm’s launch scripts.
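
As a quick sanity check, the generated partition configuration JSON can be inspected directly. The sketch below only prints the top-level keys; the exact set of keys follows DGL’s partition config format and may vary by version.

import json

# Peek at the partition configuration produced by the graph construction command above.
with open("/tmp/acm_gs/acm.json") as f:
    part_config = json.load(f)
print(part_config.get("graph_name"), part_config.get("num_parts"))
print(list(part_config.keys()))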

Note

  • Because the parquet format has some limitations, such as supporting at most 2 billion elements in a column, we suggest that users use the HDF5 format for very large datasets.

  • The mapping files, node_mapping.pt, edge_mapping.pt and the files under raw_id_mappings, record the mapping between the original node and edge IDs in the raw data files and the node and edge IDs in the graph ID space. They are important for mapping training and inference outputs back to the raw node IDs of the original input data. Therefore, DO NOT move or delete them. (A quick way to peek at these mapping files is shown below.)
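
For instance, the per-node-type mapping parquet files can be opened with pandas. The sketch below simply prints the columns and first rows rather than assuming specific column names, since those are determined by the construction tool.

import pandas as pd

# Peek at the raw-to-graph node ID mapping written for the paper node type.
mapping = pd.read_parquet("/tmp/acm_gs/raw_id_mappings/paper/part-00000.parquet")
print(mapping.columns.tolist())
print(mapping.head())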

Option 2: Required DGL graph

Users who are already familiar with DGL can convert their graph data into the required DGL graph format, and then use GraphStorm’s partition tools to create the inputs of GraphStorm’s launch scripts.

Required DGL graph format

  • The graph must be a DGL heterogeneous graph (dgl.heterograph).

  • All node/edge features are set in the nodes’/edges’ data field. Remember the feature names, which will be used in later steps.
    • For node features, the common way to set features is like g.nodes['nodetypename'].data['featurename']=nodefeaturetensor. The formal explanation of DGL’s node features can be found in Using node features. Please make sure every node feature is a 2D tensor.

    • For edge features, the common way to set features is like g.edges['edgetypename'].data['featurename']=edgefeaturetensor. The formal explanation of DGL’s edge features can be found in Using edge features. Please make sure every edge feature is a 2D tensor.

  • Save labels (for node/edge tasks) into the target nodes/edges as a feature, and remember the label feature names, which will be used in later steps.
    • The common way to set node-related labels as a feature is like g.nodes['predictnodetypename'].data['labelname']=nodelabeltensor.

    • The common way to set edge-related labels as a feature is like g.edges['predictedgetypename'].data['labelname']=edgelabeltensor.

    • For the link prediction task, a common way to obtain labels is to use existing edges as positive edges and use a negative sampling method to draw non-existent edges as negative edges. So in this step, we do not need to set labels; GraphStorm has implemented this functionality.

  • (Optional) If you have your own train/validation/test split on nodes/edges, you can store the train/validation/test node/edge index tensors as three node/edge features with the feature names train_mask, val_mask, and test_mask. If you do not have such a split, you can use the split functions provided in the GraphStorm partition tools to create them in the next step.
    • For training nodes, the setting is like g.nodes['predictnodetypename'].data['train_mask']=trainingnodeindextensor.

    • For validation nodes, the setting is like g.nodes['predictnodetypename'].data['val_mask']=validationnodeindextensor. Make sure you use 'val_mask' as the feature name because GraphStorm uses this name by default.

    • For test nodes, the setting is like g.nodes['predictnodetypename'].data['test_mask']=testnodeindextensor.

    • Similar to the node splits, you can use the same feature names, train_mask, val_mask, and test_mask, to assign the edge index tensors.

    • The index tensor is either a boolean tensor, or an integer tensor containing only 0s and 1s.

Once this DGL graph is constructed, you can use DGL’s save_graphs() function to save it to a local file. The file name must follow the GraphStorm convention: <datasetname>.dgl. You can give your graph dataset a name, e.g., acm or ogbn_mag.
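
Putting the pieces above together, the following is a minimal sketch, with hypothetical node/edge counts, feature names, and output path, of building and saving such a DGL graph.

import os

import dgl
import torch

os.makedirs("/tmp/acm_dgl", exist_ok=True)

# A tiny heterogeneous graph with two node types and two edge types.
g = dgl.heterograph({
    ("author", "writing", "paper"): (torch.tensor([0, 1]), torch.tensor([0, 1])),
    ("paper", "citing", "paper"): (torch.tensor([0, 1]), torch.tensor([1, 0])),
})

# 2D node feature tensor and label tensor on the prediction node type.
g.nodes["paper"].data["feat"] = torch.rand(g.num_nodes("paper"), 16)
g.nodes["paper"].data["label"] = torch.tensor([0, 1])

# Optional custom split masks (boolean tensors, or integer tensors of 0s and 1s).
g.nodes["paper"].data["train_mask"] = torch.tensor([True, False])
g.nodes["paper"].data["val_mask"] = torch.tensor([False, True])
g.nodes["paper"].data["test_mask"] = torch.tensor([False, False])

# Save the graph following the <datasetname>.dgl naming convention.
dgl.save_graphs("/tmp/acm_dgl/acm.dgl", [g])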

The ACM graph data example

For the ACM data, the following command can create a DGL graph as the input for GraphStorm’s partition tools.

python examples/acm_data.py \
       --output-type dgl \
       --output-path /tmp/acm_dgl

The images below show what the built DGL ACM graph looks like.

[Images: ACM_graph_schema.png and ACM_LabelAndMask.png, showing the DGL ACM graph schema and its labels and masks]

Note

  • Because the Option 2 method will no longer be supported after version 0.2, some new graph construction features, such as label statistics generation and node ID mapping, are not available in this option. To use the latest construction features, please refer to Option 1.

Partition the DGL ACM graph

GraphStorm provides two graph partition tools: partition_graph.py for node/edge prediction tasks, and partition_graph_lp.py for link prediction tasks.

The command below partitions the DGL ACM graph (the acm.dgl file in the /tmp/acm_dgl folder) into one partition, and saves the partitioned data to the /tmp/acm_nc/ folder for the node classification task.

python tools/partition_graph.py \
       --dataset acm \
       --filepath /tmp/acm_dgl \
       --num-parts 1 \
       --target-ntype paper \
       --nlabel-field paper:label \
       --output /tmp/acm_nc

Outputs of the command are under the /tmp/acm_nc/ folder, with contents similar to those of Option 1.

For the link prediction task, run the following command to partition the data and save it to the /tmp/acm_lp/ folder.

python tools/partition_graph_lp.py \
       --dataset acm \
       --filepath /tmp/acm_dgl \
       --num-parts 1 \
       --target-etype paper,citing,paper \
       --output /tmp/acm_lp

Please refer to the Graph Partition for DGL Graphs guideline for more details about the arguments of the two partition tools.

Step 2: Modify the YAML configuration file to include your own data’s information

It is common for users to copy and reuse GraphStorm’s built-in scripts and YAML files to run training/inference on their own graph data, but forget to change the contents of the YAML files to match their own data. Below are some parameters that users need to double-check and change accordingly.

  • node_feat_name: if some node types have features, please make sure to specify these feature names either in the YAML file or with an argument of the launch command. Otherwise, GraphStorm will ignore any features the nodes might have and only use learnable embeddings as their features.

For Classification/Regression tasks:

  • label_field: please change the value of this field to the field name of the labeled data in your graph data.

  • num_classes: please change the value of this field to the number of classes to be predicted in your graph data if doing a classification task.

For Node Classification/Regression tasks:

  • target_ntype: please change the value of this field to the node type that the labels are associated with, which is also the node type to predict on.

For Edge Classification/Regression tasks:

  • target_etype: please change the value of this field to the edge type that the labels are associated with, which is also the edge type to predict on.

For Link Prediction tasks:

  • train_etype: please specify the edge type(s) you want to perform link prediction on for your downstream task, e.g., recommendation or search. If this field is not specified, i.e., set to None, all edge types will be used for training, which is often not what you want in practice for most link prediction tasks.

  • eval_etype: it is highly recommended to set this value to be the same as train_etype, so that the evaluation metrics truly reflect the performance of the model.

Besides these parameters, it is also important to use the correct format when specifying node/edge types in the YAML files. For example, in an edge-related task, you should provide a canonical edge type, e.g., author,write,paper (no white spaces in this string), rather than the edge name alone, e.g., write.

For more detailed information of these parameters, please refer to the GraphStorm Training and Inference Configurations page.

Example ACM YAML files

Below is an example YAML configuration file for the ACM data, which configures GraphStorm’s built-in RGCN model for node classification on the paper nodes. The YAML file can also be found at /graphstorm/examples/use_your_own_data/acm_nc.yaml.

---
version: 1.0
gsf:
  basic:
    model_encoder_type: rgcn
    backend: gloo
    verbose: false
  gnn:
    fanout: "50,50"
    num_layers: 2
    hidden_size: 256
    use_mini_batch_infer: false
  input:
    restore_model_path: null
  output:
    save_model_path: /tmp/acm_nc/models
  hyperparam:
    dropout: 0.
    lr: 0.0001
    lm_tune_lr: 0.0001
    num_epochs: 200
    batch_size: 1024
    wd_l2norm: 0
    alpha_l2norm: 0.
  rgcn:
    num_bases: -1
    use_self_loop: true
    sparse_optimizer_lr: 1e-2
    use_node_embeddings: false
  node_classification:
    target_ntype: "paper"
    label_field: "label"
    multilabel: false
    num_classes: 14

For the link prediction task, the exemplary YAML file can be found at /graphstorm/examples/use_your_own_data/acm_lp.yaml.

Users can copy these YAML files to the /tmp folder within the GraphStorm container for the next step.

Step 3: Launch training and inference scripts on your own graphs

With the partitioned data and configuration YAML file available, it is easy to use GraphStorm’s training and inference scripts to launch the job.

Below is a launch script example that trains a GraphStorm built-in RGCN model on the ACM data for node classification.

python -m graphstorm.run.gs_node_classification \
          --workspace /tmp \
          --part-config /tmp/acm_gs/acm.json \
          --num-trainers 1 \
          --num-servers 1 \
          --cf /tmp/acm_nc.yaml \
          --save-model-path /tmp/acm_nc/models \
          --node-feat-name paper:feat author:feat subject:feat

Link prediction training can be performed using the following command.

python -m graphstorm.run.gs_link_prediction \
          --workspace /tmp \
          --part-config /tmp/acm_gs/acm.json \
          --num-trainers 1 \
          --num-servers 1 \
          --cf /tmp/acm_lp.yaml \
          --save-model-path /tmp/acm_lp/models \
          --node-feat-name paper:feat author:feat subject:feat

Similar to the Quick-Start tutorial, users can launch the inference scripts on their own data. Below are the customized scripts for inference on the ACM graph.

# Node Classification
python -m graphstorm.run.gs_node_classification \
          --inference \
          --workspace /tmp \
          --part-config /tmp/acm_gs/acm.json \
          --num-trainers 1 \
          --num-servers 1 \
          --cf /tmp/acm_nc.yaml \
          --node-feat-name paper:feat author:feat subject:feat \
          --restore-model-path /tmp/acm_nc/models/epoch-0 \
          --save-prediction-path  /tmp/acm_nc/predictions

# Link Prediction
python -m graphstorm.run.gs_link_prediction \
          --inference \
          --workspace /tmp \
          --part-config /tmp/acm_gs/acm.json \
          --num-trainers 1 \
          --num-servers 1 \
          --cf /tmp/acm_lp.yaml \
          --save-model-path /tmp/acm_lp/models \
          --node-feat-name paper:feat author:feat subject:feat \
          --restore-model-path /tmp/acm_lp/models/epoch-0 \
          --save-embed-path  /tmp/acm_lp/embeds

Once users are familiar with the three steps of using their own graph data, the next step is to look through GraphStorm’s Configurations that control these three steps for their specific requirements.