Notebook 0: Data Prepare

This notebook creates an example graph dataset that the other notebooks use to demonstrate how to program with GraphStorm APIs. The example graph data comes from DGL’s ACM publication dataset, which is the same data explained in the Use Your Own Data tutorial.

Prerequisites

This notebook assumes the following:

- Python 3;
- Linux OS, Ubuntu or Amazon Linux;
- GraphStorm and its dependencies (following the Setup GraphStorm with pip packages tutorial);
- Jupyter web interactive server.

Users can use the following command to check if the above prerequisites are met.

[2]:

import graphstorm as gs
print(gs.__version__)

0.2.1
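Beyond printing the version, a slightly broader check can be sketched with the Python standard library alone. The package list used here (`graphstorm`, `dgl`, `torch`) is an assumption based on the packages this notebook touches, not an official requirement list, and the helper name `check_prerequisites` is hypothetical.

```python
import importlib.util
import platform
import sys


def check_prerequisites(packages=("graphstorm", "dgl", "torch")):
    """Return a dict describing which prerequisites appear to be met."""
    report = {
        "python3": sys.version_info.major == 3,
        "linux": platform.system() == "Linux",
    }
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_prerequisites().items():
    print(f"{item}: {'OK' if ok else 'MISSING'}")
```

This only tests that the packages can be located. It does not import them or verify their versions.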

Download Data Generation Script

GraphStorm provides a Python script that downloads the DGL ACM publication data and converts it for GraphStorm usage. First, let’s download the script file from the GraphStorm GitHub repository.

[2]:

!wget -O ./acm_data.py https://github.com/awslabs/graphstorm/raw/main/examples/acm_data.py

--2023-12-11 18:58:35--  https://github.com/awslabs/graphstorm/raw/main/examples/acm_data.py
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/awslabs/graphstorm/main/examples/acm_data.py [following]
--2023-12-11 18:58:35--  https://raw.githubusercontent.com/awslabs/graphstorm/main/examples/acm_data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18296 (18K) [text/plain]
Saving to: ‘./acm_data.py’

./acm_data.py       100%[===================>]  17.87K  --.-KB/s    in 0.001s

2023-12-11 18:58:35 (29.6 MB/s) - ‘./acm_data.py’ saved [18296/18296]

Generate ACM Raw Table Data

Then we can use the command below to build the raw table data, which is the standard input data for GraphStorm’s gconstruct module.

[2]:

!python ./acm_data.py --output-path ./acm_raw --output-type raw_w_text

Namespace(download_path='/tmp/ACM.mat', dataset_name='acm', output_type='raw_w_text', output_path='./acm_raw')
Graph(num_nodes={'author': 17431, 'paper': 12499, 'subject': 73},
      num_edges={('author', 'writing', 'paper'): 37055, ('paper', 'cited', 'paper'): 30789, ('paper', 'citing', 'paper'): 30789, ('paper', 'is-about', 'subject'): 12499, ('paper', 'written-by', 'author'): 37055, ('subject', 'has', 'paper'): 12499},
      metagraph=[('author', 'paper', 'writing'), ('paper', 'paper', 'cited'), ('paper', 'paper', 'citing'), ('paper', 'subject', 'is-about'), ('paper', 'author', 'written-by'), ('subject', 'paper', 'has')])

 Number of classes: 14

 Paper node labels: torch.Size([12499])

 ('paper', 'citing', 'paper') edge labels:30789
Saving ACM data to /tmp/acm.dgl ......
/tmp/acm.dgl saved.
Saving ACM node text to /tmp/acm_text.pkl ......
/tmp/acm_text.pkl saved.
author nodes have: Index(['node_id', 'feat', 'text'], dtype='object') columns ......
paper nodes have: Index(['node_id', 'label', 'feat', 'text'], dtype='object') columns ......
subject nodes have: Index(['node_id', 'feat', 'text'], dtype='object') columns ......
Saved author node data to ./acm_raw/nodes/author.parquet.
Saved paper node data to ./acm_raw/nodes/paper.parquet.
Saved subject node data to ./acm_raw/nodes/subject.parquet.
Saved ('author', 'writing', 'paper') edge data to ./acm_raw/edges/author_writing_paper.parquet
Saved ('paper', 'cited', 'paper') edge data to ./acm_raw/edges/paper_cited_paper.parquet
Saved ('paper', 'citing', 'paper') edge data to ./acm_raw/edges/paper_citing_paper.parquet
Saved ('paper', 'is-about', 'subject') edge data to ./acm_raw/edges/paper_is-about_subject.parquet
Saved ('paper', 'written-by', 'author') edge data to ./acm_raw/edges/paper_written-by_author.parquet
Saved ('subject', 'has', 'paper') edge data to ./acm_raw/edges/subject_has_paper.parquet

Construct GraphStorm Input Graph Data

With the raw ACM tables, we can then use GraphStorm’s graph construction module to prepare the ACM graph for the other notebooks. The graph construction module performs the following steps:

- reads in the raw data and converts it to a DGL graph;
- splits the DGL graph into multiple partitions as distributed DGL graphs;
- produces node ID mapping files and other supporting files.

For the GraphStorm Standalone mode, we only need one partition. Therefore, in the command below we set --num-parts to 1. For the other arguments, users can refer to GraphStorm Graph Construction arguments.

[4]:

!python -m graphstorm.gconstruct.construct_graph \
          --conf-file ./acm_raw/config.json \
          --output-dir ./acm_gs_1p \
          --num-parts 1 \
          --graph-name acm

INFO:root:The graph has 3 node types and 6 edge types.
INFO:root:Node type author has 17431 nodes
INFO:root:Node type paper has 12499 nodes
INFO:root:Node type subject has 73 nodes
INFO:root:Edge type ('author', 'writing', 'paper') has 37055 edges
INFO:root:Edge type ('paper', 'cited', 'paper') has 30789 edges
INFO:root:Edge type ('paper', 'citing', 'paper') has 30789 edges
INFO:root:Edge type ('paper', 'is-about', 'subject') has 12499 edges
INFO:root:Edge type ('paper', 'written-by', 'author') has 37055 edges
INFO:root:Edge type ('subject', 'has', 'paper') has 12499 edges
INFO:root:Node type author has features: ['feat', 'input_ids', 'attention_mask', 'token_type_ids'].
INFO:root:Node type paper has features: ['feat', 'input_ids', 'attention_mask', 'token_type_ids', 'train_mask', 'val_mask', 'test_mask', 'label'].
INFO:root:Train/val/test on paper: 9999, 1249, 1249
INFO:root:Node type subject has features: ['feat', 'input_ids', 'attention_mask', 'token_type_ids'].
INFO:root:Edge type ('paper', 'citing', 'paper') has features: ['train_mask', 'val_mask', 'test_mask'].
INFO:root:Train/val/test on ('paper', 'citing', 'paper'): 24631, 3078, 3078
Converting to homogeneous graph takes 0.004s, peak mem: 3.161 GB
Save partitions: 0.020 seconds, peak memory: 3.349 GB
There are 160686 edges in the graph and 0 edge cuts for 1 partitions.
INFO:root:Graph construction generates new node IDs for 'author'. The ID map is saved in ./acm_gs_1p/author_id_remap.parquet.
INFO:root:Graph construction generates new node IDs for 'paper'. The ID map is saved in ./acm_gs_1p/paper_id_remap.parquet.
INFO:root:Graph construction generates new node IDs for 'subject'. The ID map is saved in ./acm_gs_1p/subject_id_remap.parquet.
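The train/val/test counts in the log can be reproduced with simple arithmetic. The sketch below assumes that each split size is the label count multiplied by the corresponding `split_pct` value from config.json (0.8, 0.1, 0.1) and rounded down. That rounding rule is an assumption that happens to match the logged numbers, not a statement about GraphStorm's implementation, and the helper name `split_sizes` is hypothetical.

```python
def split_sizes(num_items, split_pct):
    """Floor each fraction of num_items; leftover items stay unassigned."""
    return [int(num_items * pct) for pct in split_pct]


split_pct = [0.8, 0.1, 0.1]  # values from config.json

paper_split = split_sizes(12499, split_pct)   # labeled "paper" nodes
citing_split = split_sizes(30789, split_pct)  # ("paper", "citing", "paper") edges

print(paper_split)   # [9999, 1249, 1249]
print(citing_split)  # [24631, 3078, 3078]
```

Note that 9999 + 1249 + 1249 = 12497, so the logged split sizes leave two of the 12499 paper nodes outside all three sets.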

3-Partition Input Data

To better illustrate the input data structure that GraphStorm requires, we can use the following command to create 3-partition input data.

[5]:

!python -m graphstorm.gconstruct.construct_graph \
          --conf-file ./acm_raw/config.json \
          --output-dir ./acm_gs_3p \
          --num-parts 3 \
          --graph-name acm

INFO:root:The graph has 3 node types and 6 edge types.
INFO:root:Node type author has 17431 nodes
INFO:root:Node type paper has 12499 nodes
INFO:root:Node type subject has 73 nodes
INFO:root:Edge type ('author', 'writing', 'paper') has 37055 edges
INFO:root:Edge type ('paper', 'cited', 'paper') has 30789 edges
INFO:root:Edge type ('paper', 'citing', 'paper') has 30789 edges
INFO:root:Edge type ('paper', 'is-about', 'subject') has 12499 edges
INFO:root:Edge type ('paper', 'written-by', 'author') has 37055 edges
INFO:root:Edge type ('subject', 'has', 'paper') has 12499 edges
INFO:root:Node type author has features: ['feat', 'input_ids', 'attention_mask', 'token_type_ids'].
INFO:root:Node type paper has features: ['feat', 'input_ids', 'attention_mask', 'token_type_ids', 'train_mask', 'val_mask', 'test_mask', 'label'].
INFO:root:Train/val/test on paper: 9999, 1249, 1249
INFO:root:Node type subject has features: ['feat', 'input_ids', 'attention_mask', 'token_type_ids'].
INFO:root:Edge type ('paper', 'citing', 'paper') has features: ['train_mask', 'val_mask', 'test_mask'].
INFO:root:Train/val/test on ('paper', 'citing', 'paper'): 24631, 3078, 3078
Converting to homogeneous graph takes 0.004s, peak mem: 3.121 GB
Convert a graph into a bidirected graph: 0.004 seconds, peak memory: 3.308 GB
Construct multi-constraint weights: 0.010 seconds, peak memory: 3.308 GB
[21:29:03] /opt/dgl/src/graph/transform/metis_partition_hetero.cc:89: Partition a graph with 30003 nodes and 160987 edges into 3 parts and get 9798 edge cuts
Metis partitioning: 0.043 seconds, peak memory: 3.308 GB
Assigning nodes to METIS partitions takes 0.059s, peak mem: 3.308 GB
Reshuffle nodes and edges: 0.022 seconds
Split the graph: 0.014 seconds
Construct subgraphs: 0.015 seconds
Splitting the graph into partitions takes 0.050s, peak mem: 3.308 GB
part 0 has 6672 nodes of type author and 6015 are inside the partition
part 0 has 6656 nodes of type paper and 4152 are inside the partition
part 0 has 58 nodes of type subject and 25 are inside the partition
part 0 has 13733 edges of type ('author', 'writing', 'paper') and 12901 are inside the partition
part 0 has 11889 edges of type ('paper', 'cited', 'paper') and 10774 are inside the partition
part 0 has 11889 edges of type ('paper', 'citing', 'paper') and 10550 are inside the partition
part 0 has 5329 edges of type ('paper', 'is-about', 'subject') and 4028 are inside the partition
part 0 has 13733 edges of type ('paper', 'written-by', 'author') and 12732 are inside the partition
part 0 has 5329 edges of type ('subject', 'has', 'paper') and 4152 are inside the partition
part 1 has 6133 nodes of type author and 5664 are inside the partition
part 1 has 6161 nodes of type paper and 4031 are inside the partition
part 1 has 59 nodes of type subject and 23 are inside the partition
part 1 has 13067 edges of type ('author', 'writing', 'paper') and 12137 are inside the partition
part 1 has 12508 edges of type ('paper', 'cited', 'paper') and 11280 are inside the partition
part 1 has 12508 edges of type ('paper', 'citing', 'paper') and 10927 are inside the partition
part 1 has 5216 edges of type ('paper', 'is-about', 'subject') and 4264 are inside the partition
part 1 has 13067 edges of type ('paper', 'written-by', 'author') and 12372 are inside the partition
part 1 has 5216 edges of type ('subject', 'has', 'paper') and 4031 are inside the partition
part 2 has 6271 nodes of type author and 5752 are inside the partition
part 2 has 6977 nodes of type paper and 4316 are inside the partition
part 2 has 66 nodes of type subject and 25 are inside the partition
part 2 has 12772 edges of type ('author', 'writing', 'paper') and 12017 are inside the partition
part 2 has 10404 edges of type ('paper', 'cited', 'paper') and 8735 are inside the partition
part 2 has 10404 edges of type ('paper', 'citing', 'paper') and 9312 are inside the partition
part 2 has 5222 edges of type ('paper', 'is-about', 'subject') and 4207 are inside the partition
part 2 has 12772 edges of type ('paper', 'written-by', 'author') and 11951 are inside the partition
part 2 has 5222 edges of type ('subject', 'has', 'paper') and 4316 are inside the partition
Save partitions: 0.020 seconds, peak memory: 3.308 GB
There are 160686 edges in the graph and 0 edge cuts for 3 partitions.
INFO:root:Graph construction generates new node IDs for 'author'. The ID map is saved in ./acm_gs_3p/author_id_remap.parquet.
INFO:root:Graph construction generates new node IDs for 'paper'. The ID map is saved in ./acm_gs_3p/paper_id_remap.parquet.
INFO:root:Graph construction generates new node IDs for 'subject'. The ID map is saved in ./acm_gs_3p/subject_id_remap.parquet.
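The two construction commands above differ only in --output-dir and --num-parts. When several partition counts are needed, the command line could be assembled from Python, as in the sketch below. It uses only the flags shown in this notebook; the helper name `build_construct_cmd` is hypothetical, and the snippet prints the commands rather than running them.

```python
import shlex


def build_construct_cmd(conf_file, output_dir, num_parts, graph_name):
    """Assemble the construct_graph command line as a list of arguments."""
    return [
        "python", "-m", "graphstorm.gconstruct.construct_graph",
        "--conf-file", conf_file,
        "--output-dir", output_dir,
        "--num-parts", str(num_parts),
        "--graph-name", graph_name,
    ]


for num_parts in (1, 3):
    cmd = build_construct_cmd(
        "./acm_raw/config.json", f"./acm_gs_{num_parts}p", num_parts, "acm"
    )
    # Print only; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```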

Data Exploration and Explanation

The above commands created two sets of ACM data, i.e., the raw ACM data tables, and ACM GraphStorm input graphs. Below we explore these datasets, and explain their format so that users can prepare their own graph data easily.

We can explore the acm_raw folder with the ls -al command.

[6]:

!ls -al ./acm_raw

total 24
drwxrwxr-x 4 ubuntu ubuntu 4096 Dec 19 21:27 .
drwxrwxr-x 6 ubuntu ubuntu 4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu 5306 Dec 19 21:27 config.json
drwxrwxr-x 2 ubuntu ubuntu 4096 Dec 19 21:27 edges
drwxrwxr-x 2 ubuntu ubuntu 4096 Dec 19 21:27 nodes

[7]:

!ls -al ./acm_raw/nodes

total 38744
drwxrwxr-x 2 ubuntu ubuntu     4096 Dec 19 21:27 .
drwxrwxr-x 4 ubuntu ubuntu     4096 Dec 19 21:27 ..
-rw-rw-r-- 1 ubuntu ubuntu 18843566 Dec 19 21:27 author.parquet
-rw-rw-r-- 1 ubuntu ubuntu 20704514 Dec 19 21:27 paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu   113462 Dec 19 21:27 subject.parquet

[8]:

!ls -al ./acm_raw/edges

total 1016
drwxrwxr-x 2 ubuntu ubuntu   4096 Dec 19 21:27 .
drwxrwxr-x 4 ubuntu ubuntu   4096 Dec 19 21:27 ..
-rw-rw-r-- 1 ubuntu ubuntu 263138 Dec 19 21:27 author_writing_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 156358 Dec 19 21:27 paper_cited_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 162714 Dec 19 21:27 paper_citing_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu  87792 Dec 19 21:27 paper_is-about_subject.parquet
-rw-rw-r-- 1 ubuntu ubuntu 265948 Dec 19 21:27 paper_written-by_author.parquet
-rw-rw-r-- 1 ubuntu ubuntu  84005 Dec 19 21:27 subject_has_paper.parquet

Graph Description JSON File `config.json`

The acm_raw folder includes one config.json file that describes the table-based raw graph data. Besides a version object, the JSON file contains a nodes object and an edges object.

The nodes object contains a list of node objects, each of which includes a set of properties that describe one node type in the graph. For example, the config.json file defines a node type called “paper”. For each node type, GraphStorm defines a few other properties, such as format, files, and features.

Similarly, the edges object contains a list of edge objects. Most edge properties are the same as the node properties, except that an edge object has a relation property that defines the edge type in canonical format, i.e., source node type, relation type, and destination node type.

For a full list of the JSON configuration properties, users can refer to the GraphStorm Graph Construction JSON Explanations.

To use their own graph data, users need to prepare their own JSON file.

[9]:

!cat ./acm_raw/config.json

{
    "version": "gconstruct-v0.1",
    "nodes": [
        {
            "node_type": "author",
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/nodes/author.parquet"
            ],
            "node_id_col": "node_id",
            "features": [
                {
                    "feature_col": "feat",
                    "feature_name": "feat"
                },
                {
                    "feature_col": "text",
                    "feature_name": "text",
                    "transform": {
                        "name": "tokenize_hf",
                        "bert_model": "bert-base-uncased",
                        "max_seq_length": 16
                    }
                }
            ]
        },
        {
            "node_type": "paper",
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/nodes/paper.parquet"
            ],
            "node_id_col": "node_id",
            "features": [
                {
                    "feature_col": "feat",
                    "feature_name": "feat"
                },
                {
                    "feature_col": "text",
                    "feature_name": "text",
                    "transform": {
                        "name": "tokenize_hf",
                        "bert_model": "bert-base-uncased",
                        "max_seq_length": 16
                    }
                }
            ],
            "labels": [
                {
                    "label_col": "label",
                    "task_type": "classification",
                    "split_pct": [
                        0.8,
                        0.1,
                        0.1
                    ],
                    "label_stats_type": "frequency_cnt"
                }
            ]
        },
        {
            "node_type": "subject",
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/nodes/subject.parquet"
            ],
            "node_id_col": "node_id",
            "features": [
                {
                    "feature_col": "feat",
                    "feature_name": "feat"
                },
                {
                    "feature_col": "text",
                    "feature_name": "text",
                    "transform": {
                        "name": "tokenize_hf",
                        "bert_model": "bert-base-uncased",
                        "max_seq_length": 16
                    }
                }
            ]
        }
    ],
    "edges": [
        {
            "relation": [
                "author",
                "writing",
                "paper"
            ],
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/edges/author_writing_paper.parquet"
            ],
            "source_id_col": "source_id",
            "dest_id_col": "dest_id"
        },
        {
            "relation": [
                "paper",
                "cited",
                "paper"
            ],
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/edges/paper_cited_paper.parquet"
            ],
            "source_id_col": "source_id",
            "dest_id_col": "dest_id"
        },
        {
            "relation": [
                "paper",
                "citing",
                "paper"
            ],
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/edges/paper_citing_paper.parquet"
            ],
            "source_id_col": "source_id",
            "dest_id_col": "dest_id",
            "labels": [
                {
                    "task_type": "link_prediction",
                    "split_pct": [
                        0.8,
                        0.1,
                        0.1
                    ]
                }
            ]
        },
        {
            "relation": [
                "paper",
                "is-about",
                "subject"
            ],
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/edges/paper_is-about_subject.parquet"
            ],
            "source_id_col": "source_id",
            "dest_id_col": "dest_id"
        },
        {
            "relation": [
                "paper",
                "written-by",
                "author"
            ],
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/edges/paper_written-by_author.parquet"
            ],
            "source_id_col": "source_id",
            "dest_id_col": "dest_id"
        },
        {
            "relation": [
                "subject",
                "has",
                "paper"
            ],
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/edges/subject_has_paper.parquet"
            ],
            "source_id_col": "source_id",
            "dest_id_col": "dest_id"
        }
    ]
}
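A configuration with the same structure can also be generated programmatically. The sketch below uses only keys that appear in the file printed above. The “user” and “item” node types, the “buys” relation, and the ./my_raw paths are made up for illustration, and whether the resulting file is accepted unchanged should be checked against the GraphStorm Graph Construction JSON Explanations.

```python
import json


def node_entry(node_type, files, feature_cols=()):
    """Describe one node type stored in Parquet files."""
    entry = {
        "node_type": node_type,
        "format": {"name": "parquet"},
        "files": list(files),
        "node_id_col": "node_id",
    }
    if feature_cols:
        entry["features"] = [
            {"feature_col": col, "feature_name": col} for col in feature_cols
        ]
    return entry


def edge_entry(src_type, rel_type, dst_type, files):
    """Describe one edge type in canonical (source, relation, destination) form."""
    return {
        "relation": [src_type, rel_type, dst_type],
        "format": {"name": "parquet"},
        "files": list(files),
        "source_id_col": "source_id",
        "dest_id_col": "dest_id",
    }


# Hypothetical graph: the "user"/"item" types and ./my_raw paths are made up.
config = {
    "version": "gconstruct-v0.1",
    "nodes": [
        node_entry("user", ["./my_raw/nodes/user.parquet"], ["feat"]),
        node_entry("item", ["./my_raw/nodes/item.parquet"], ["feat"]),
    ],
    "edges": [
        edge_entry("user", "buys", "item", ["./my_raw/edges/user_buys_item.parquet"]),
    ],
}

print(json.dumps(config, indent=4))
```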

Raw ACM Tables in the `nodes/` and `edges/` Folders

As defined in the ./acm_raw/config.json file, the node data files are stored in the ./acm_raw/nodes/ folder, and the edge data files are stored in the ./acm_raw/edges/ folder. A general description of these files can be found in the Input raw node/edge data files documentation. Here, we read one node table (“paper”) and one edge table ([“paper”, “citing”, “paper”]) to learn more about them.

[10]:

import pandas as pd

paper_node_path = './acm_raw/nodes/paper.parquet'
paper_citing_paper_edge_path = './acm_raw/edges/paper_citing_paper.parquet'

The “paper” node table

The “paper” node table can be read in as a Pandas DataFrame. The table has a few columns, whose names are used in config.json. For the “paper” nodes, there are four columns: a node_id column, containing a unique identifier for each node; a feat column, containing a 256D numerical tensor for each node; a text column, containing a free-text feature for each node; and a label column, containing an integer that indicates the class each node is assigned to.

The other two node types, “author” and “subject”, have similar data tables. Users can explore them with code similar to that below.

[11]:

paper_node_df = pd.read_parquet(paper_node_path)

print(paper_node_df.shape)
paper_node_df.sample(4)

(12499, 4)

[11]:

	node_id	label	feat	text
4011	p4011	4	[0.012342263, -0.01471429, -0.012913096, 0.007...	'User behavior driven ranking without editoria...
11379	p11379	12	[-0.012718345, 0.020719944, -0.010691697, 0.00...	'Reducing truth-telling online mechanisms to o...
9401	p9401	8	[-0.013923097, 0.017362924, -0.009770028, -0.0...	'The lazy adversary conjecture fails We prove ...
4928	p4928	1	[0.019353714, 0.0066366955, 0.0115322415, 0.01...	'Privacy preserving schema and data matching ...

The (paper, citing, paper) edge table

The “paper, citing, paper” edge table can also be read in as a Pandas DataFrame. It has three columns. The source_id and dest_id columns contain the same identifiers listed in the “paper” node table. The label column is a placeholder used for splitting the “paper, citing, paper” edges for a link prediction task.

[12]:

pcp_edge_df = pd.read_parquet(paper_citing_paper_edge_path)

print(pcp_edge_df.shape)
pcp_edge_df.sample(4)

(30789, 3)

[12]:

	source_id	dest_id	label
28779	p11255	p12232	1.0
2791	p704	p6747	1.0
429	p119	p16	1.0
7301	p2354	p8747	1.0
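Since every source_id and dest_id should refer to a node in the “paper” table, a quick consistency check can be sketched as below. To stay self-contained, the example uses the four sampled rows shown above and assumes, for illustration only, that paper IDs run from "p0" to "p12498". The helper name `dangling_endpoints` is hypothetical.

```python
def dangling_endpoints(node_ids, edges):
    """Return edge endpoints that do not appear in the node ID collection."""
    known = set(node_ids)
    missing = set()
    for src, dst in edges:
        if src not in known:
            missing.add(src)
        if dst not in known:
            missing.add(dst)
    return sorted(missing)


# Assumption for illustration: paper IDs are "p0" ... "p12498".
paper_ids = [f"p{i}" for i in range(12499)]
# The four sampled (source_id, dest_id) rows shown above.
sampled_edges = [
    ("p11255", "p12232"),
    ("p704", "p6747"),
    ("p119", "p16"),
    ("p2354", "p8747"),
]

print(dangling_endpoints(paper_ids, sampled_edges))        # []
print(dangling_endpoints(paper_ids, [("p1", "p99999")]))   # ['p99999']
```

With the real tables, the same function could be called with `paper_node_df["node_id"]` and `zip(pcp_edge_df["source_id"], pcp_edge_df["dest_id"])`.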

GraphStorm Input Graph Data in the `./acm_gs_*p/` Folder

In the above cells, we created a 1-partition graph in the acm_gs_1p folder and a 3-partition graph in the acm_gs_3p folder. The contents of the two folders are nearly the same, including:

- a GraphStorm partition configuration JSON file;
- mapping files from the original node ID space to the GraphStorm node ID space, created during graph processing;
- mapping files from the GraphStorm node ID space to the shuffled node ID space, created during graph partitioning;
- label statistic files.

[13]:

!ls -al ./acm_gs_1p

total 1884
drwxrwxr-x 3 ubuntu ubuntu    4096 Dec 19 21:28 .
drwxrwxr-x 6 ubuntu ubuntu    4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu    1673 Dec 19 21:28 acm.json
-rw-rw-r-- 1 ubuntu ubuntu  213402 Dec 19 21:28 author_id_remap.parquet
-rw-rw-r-- 1 ubuntu ubuntu     191 Dec 19 21:28 edge_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 1287802 Dec 19 21:28 edge_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu     515 Dec 19 21:28 node_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu  241655 Dec 19 21:28 node_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu  150409 Dec 19 21:28 paper_id_remap.parquet
drwxrwxr-x 2 ubuntu ubuntu    4096 Dec 19 21:28 part0
-rw-rw-r-- 1 ubuntu ubuntu    2934 Dec 19 21:28 subject_id_remap.parquet

[14]:

!ls -al ./acm_gs_3p

total 1892
drwxrwxr-x 5 ubuntu ubuntu    4096 Dec 19 21:29 .
drwxrwxr-x 6 ubuntu ubuntu    4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu    3319 Dec 19 21:29 acm.json
-rw-rw-r-- 1 ubuntu ubuntu  213402 Dec 19 21:29 author_id_remap.parquet
-rw-rw-r-- 1 ubuntu ubuntu     191 Dec 19 21:29 edge_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 1287802 Dec 19 21:29 edge_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu     515 Dec 19 21:29 node_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu  241655 Dec 19 21:29 node_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu  150409 Dec 19 21:29 paper_id_remap.parquet
drwxrwxr-x 2 ubuntu ubuntu    4096 Dec 19 21:29 part0
drwxrwxr-x 2 ubuntu ubuntu    4096 Dec 19 21:29 part1
drwxrwxr-x 2 ubuntu ubuntu    4096 Dec 19 21:29 part2
-rw-rw-r-- 1 ubuntu ubuntu    2934 Dec 19 21:29 subject_id_remap.parquet

Because the two graphs were built with different numbers of partitions, the two folders contain different partition data sub-folders. The sub-folders are named “part0” to “part(N-1)”, where N is the number of partitions specified with the --num-parts argument of the construct_graph command.
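In the listings above, the graph built with --num-parts 1 contains only part0, while the graph built with --num-parts 3 contains part0, part1, and part2. A small sketch that derives the expected sub-folder names and reports missing ones is shown below. Both helper names are hypothetical, and the last line simply prints whatever is missing in the current working directory.

```python
from pathlib import Path


def expected_part_dirs(num_parts):
    """Sub-folder names for a graph built with --num-parts=num_parts."""
    return [f"part{i}" for i in range(num_parts)]


def missing_part_dirs(graph_dir, num_parts):
    """Return the expected partition sub-folders that are absent from graph_dir."""
    root = Path(graph_dir)
    return [
        name for name in expected_part_dirs(num_parts)
        if not (root / name).is_dir()
    ]


print(expected_part_dirs(1))  # ['part0']
print(expected_part_dirs(3))  # ['part0', 'part1', 'part2']
print(missing_part_dirs("./acm_gs_3p", 3))
```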

Tip: In the next sections, we use the 3-partition graph to explore these four sets of files and sub-folders one by one. But we will use the 1-partition graph in the other notebooks for GraphStorm standalone mode programming tutorials.

The GraphStorm Partition Configuration File `acm.json`

The acm.json file describes the partitioned graph that GraphStorm uses for model training and inference.

It includes basic information about the partitioned graph, such as node and edge types, the number of each node and edge type, and the number of partitions along with the other partition mapping information.

[15]:

!cat ./acm_gs_3p/acm.json

{
    "graph_name": "acm",
    "num_nodes": 30003,
    "num_edges": 160686,
    "part_method": "metis",
    "num_parts": 3,
    "halo_hops": 1,
    "node_map": {
        "author": [
            [
                0,
                6015
            ],
            [
                10192,
                15856
            ],
            [
                19910,
                25662
            ]
        ],
        "paper": [
            [
                6015,
                10167
            ],
            [
                15856,
                19887
            ],
            [
                25662,
                29978
            ]
        ],
        "subject": [
            [
                10167,
                10192
            ],
            [
                19887,
                19910
            ],
            [
                29978,
                30003
            ]
        ]
    },
    "edge_map": {
        "author:writing:paper": [
            [
                0,
                12901
            ],
            [
                55137,
                67274
            ],
            [
                110148,
                122165
            ]
        ],
        "paper:cited:paper": [
            [
                12901,
                23675
            ],
            [
                67274,
                78554
            ],
            [
                122165,
                130900
            ]
        ],
        "paper:citing:paper": [
            [
                23675,
                34225
            ],
            [
                78554,
                89481
            ],
            [
                130900,
                140212
            ]
        ],
        "paper:is-about:subject": [
            [
                34225,
                38253
            ],
            [
                89481,
                93745
            ],
            [
                140212,
                144419
            ]
        ],
        "paper:written-by:author": [
            [
                38253,
                50985
            ],
            [
                93745,
                106117
            ],
            [
                144419,
                156370
            ]
        ],
        "subject:has:paper": [
            [
                50985,
                55137
            ],
            [
                106117,
                110148
            ],
            [
                156370,
                160686
            ]
        ]
    },
    "ntypes": {
        "author": 0,
        "paper": 1,
        "subject": 2
    },
    "etypes": {
        "author:writing:paper": 0,
        "paper:cited:paper": 1,
        "paper:citing:paper": 2,
        "paper:is-about:subject": 3,
        "paper:written-by:author": 4,
        "subject:has:paper": 5
    },
    "part-0": {
        "node_feats": "part0/node_feat.dgl",
        "edge_feats": "part0/edge_feat.dgl",
        "part_graph": "part0/graph.dgl"
    },
    "part-1": {
        "node_feats": "part1/node_feat.dgl",
        "edge_feats": "part1/edge_feat.dgl",
        "part_graph": "part1/graph.dgl"
    },
    "part-2": {
        "node_feats": "part2/node_feat.dgl",
        "edge_feats": "part2/edge_feat.dgl",
        "part_graph": "part2/graph.dgl"
    }
}
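The node_map entries can be cross-checked against the construction log. The sketch below copies the node_map values printed above and treats each pair as a half-open [start, end) ID range per partition. That reading is an assumption, but it reproduces both the "inside the partition" counts in the 3-partition log and the total number of nodes per type. The helper name `nodes_per_partition` is hypothetical.

```python
# "node_map" copied from ./acm_gs_3p/acm.json above: one range per partition.
node_map = {
    "author": [[0, 6015], [10192, 15856], [19910, 25662]],
    "paper": [[6015, 10167], [15856, 19887], [25662, 29978]],
    "subject": [[10167, 10192], [19887, 19910], [29978, 30003]],
}


def nodes_per_partition(ranges_by_type):
    """Map each node type to the number of IDs in each partition's range."""
    return {
        ntype: [end - start for start, end in ranges]
        for ntype, ranges in ranges_by_type.items()
    }


counts = nodes_per_partition(node_map)
for ntype, per_part in counts.items():
    print(f"{ntype}: {per_part} -> total {sum(per_part)}")
# author: [6015, 5664, 5752] -> total 17431
# paper: [4152, 4031, 4316] -> total 12499
# subject: [25, 23, 25] -> total 73
```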

Raw Node ID Mapping Files `****_id_remap.parquet`

Because the original node IDs can be of any type, e.g., strings, integers, or even floats, GraphStorm conducts an ID mapping during graph processing. This maps the original node ID space given by users into an integer node ID space starting from 0. The mapping information is stored in the ****_id_remap.parquet files.

[16]:

author_nid_mapping_df = pd.read_parquet('./acm_gs_3p/author_id_remap.parquet')

print(author_nid_mapping_df.shape)
author_nid_mapping_df.sample(4)

(17431, 2)

[16]:

	orig	new
765	a765	765
8438	a8438	8438
14914	a14914	14914
5227	a5227	5227

As shown above, the author_id_remap.parquet file has two columns. The orig column contains the original string-type node IDs from the raw node table data, while the new column contains the new integer node IDs in the GraphStorm node ID space.
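To illustrate how the two columns are used, the sketch below builds lookup dictionaries in both directions from the four sampled rows shown above. It is a plain-Python illustration rather than GraphStorm's own remapping code, and the helper name `to_original_ids` is hypothetical.

```python
# Sample (orig, new) rows from author_id_remap.parquet shown above.
remap_rows = [("a765", 765), ("a8438", 8438), ("a14914", 14914), ("a5227", 5227)]

orig_to_new = dict(remap_rows)
new_to_orig = {new: orig for orig, new in remap_rows}


def to_original_ids(new_ids, lookup):
    """Translate integer node IDs back to the user-provided IDs."""
    return [lookup[i] for i in new_ids]


print(orig_to_new["a8438"])                       # 8438
print(to_original_ids([5227, 765], new_to_orig))  # ['a5227', 'a765']
```

For the full table, the row list could be built with `zip(author_nid_mapping_df["orig"], author_nid_mapping_df["new"])`.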

GraphStorm Partition Node/Edge ID Mapping Files `****_mapping.pt`

GraphStorm relies on the distributed DGL graph as its input graph data. The distributed DGL graph has its own node ID space, so graph partitioning creates another node ID mapping.

These node ID mappings, in the form of Python dictionaries, are stored in the ****_mapping.pt files, which can be loaded using PyTorch.

Tip: In general, users do not need to map IDs back themselves. When GraphStorm’s command line interface is used to train models and run inference, GraphStorm automatically remaps the partitioned ID space to the original node ID space.

[17]:

import torch as th

node_mapping_dict = th.load('./acm_gs_3p/node_mapping.pt')
print('Node id mapping:')
print(f'Node mapping keys: {list(node_mapping_dict.keys())}')
ntype0 = list(node_mapping_dict.keys())[0]
print(f'Node type \'{ntype0}\' first 10 mapping ids: {node_mapping_dict[ntype0][:10]}\n')

edge_mapping_dict = th.load('./acm_gs_3p/edge_mapping.pt')
print('Edge id mapping:')
print(f'Edge mapping keys: {list(edge_mapping_dict.keys())}')
etype0 = list(edge_mapping_dict.keys())[0]
print(f'Edge type \'{etype0}\' first 10 mapping ids: {edge_mapping_dict[etype0][:10]}\n')

Node id mapping:
Node mapping keys: ['author', 'paper', 'subject']
Node type 'author' first 10 mapping ids: tensor([16442,  7664,  7665,  7667, 16448,  7669,  7670, 16443,  7674, 16453])

Edge id mapping:
Edge mapping keys: [('author', 'writing', 'paper'), ('paper', 'cited', 'paper'), ('paper', 'citing', 'paper'), ('paper', 'is-about', 'subject'), ('paper', 'written-by', 'author'), ('subject', 'has', 'paper')]
Edge type '('author', 'writing', 'paper')' first 10 mapping ids: tensor([ 8198, 15018,  3479,   253, 21728, 20622, 15980, 13148, 11788,  9858])

The ID mapping logic in these tensors is as follows: the values stored in a tensor are GraphStorm graph IDs, and their position indexes are the new partitioned node IDs. For example, for “author” nodes, the GraphStorm graph ID 16442 has the new partitioned node ID 0 because the number 16442 is in the first position (index=0) of the mapping tensor.
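The lookup in both directions can be sketched with the first ten “author” entries printed above. Plain Python lists are used here instead of tensors to keep the example self-contained, and the variable names are hypothetical.

```python
# First 10 entries of the "author" mapping tensor printed above.
# Position = partitioned node ID, value = GraphStorm node ID.
author_mapping_head = [16442, 7664, 7665, 7667, 16448, 7669, 7670, 16443, 7674, 16453]

# Partitioned ID -> GraphStorm ID is a plain index lookup.
gs_id_of_part0 = author_mapping_head[0]

# GraphStorm ID -> partitioned ID requires inverting the mapping.
part_id_of = {gs_id: part_id for part_id, gs_id in enumerate(author_mapping_head)}

print(gs_id_of_part0)     # 16442
print(part_id_of[16443])  # 7
```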

Label Statistic Files `****_label_stats.json`

If users specify the label statistics property in the config.json file, e.g., by setting "label_stats_type": "frequency_cnt" in the “paper” node’s label object, GraphStorm collects the label statistics and stores them in the ****_label_stats.json files.

[18]:

!cat ./acm_gs_3p/node_label_stats.json

{
    "author": {},
    "paper": {
        "label": {
            "stats_type": "frequency_cnt",
            "info": {
                "0": 838,
                "1": 1118,
                "2": 1349,
                "3": 1277,
                "4": 1381,
                "5": 521,
                "7": 270,
                "8": 370,
                "9": 509,
                "10": 258,
                "11": 374,
                "12": 1270,
                "13": 464
            }
        }
    },
    "subject": {}
}
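A few summary numbers can be derived from these statistics, as sketched below with the “paper” counts copied from the output above. The check for absent classes assumes that labels run from 0 to 13, based on the 14 classes reported when the raw data was generated.

```python
import json

# "paper" label statistics copied from node_label_stats.json above.
stats_json = """
{"paper": {"label": {"stats_type": "frequency_cnt", "info": {
    "0": 838, "1": 1118, "2": 1349, "3": 1277, "4": 1381, "5": 521,
    "7": 270, "8": 370, "9": 509, "10": 258, "11": 374, "12": 1270, "13": 464}}}}
"""

info = json.loads(stats_json)["paper"]["label"]["info"]
label_counts = {int(label): cnt for label, cnt in info.items()}

total = sum(label_counts.values())
most_common = max(label_counts, key=label_counts.get)
# Assumption: 14 classes with labels 0..13, as reported during data generation.
absent = [c for c in range(14) if c not in label_counts]

print(total, most_common, absent)  # 9999 4 [6]
```

The total of 9999 equals the training set size reported in the construction log, which suggests (although the output does not state it) that the counts cover the training labels only.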

Partitioned Graph Data `partN/***.dgl`

The distributed DGL graph datasets are saved in these partN sub-folders, each of which contains three DGL-formatted files:

1. edge_feat.dgl: edge features of one partition, if any.
2. graph.dgl: graph structure of one partition.
3. node_feat.dgl: node features of one partition, if any.

[19]:

!ls -al ./acm_gs_3p/part0

total 13892
drwxrwxr-x 2 ubuntu ubuntu     4096 Dec 19 21:29 .
drwxrwxr-x 6 ubuntu ubuntu     4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu    31926 Dec 19 21:29 edge_feat.dgl
-rw-rw-r-- 1 ubuntu ubuntu  2081555 Dec 19 21:29 graph.dgl
-rw-rw-r-- 1 ubuntu ubuntu 12097671 Dec 19 21:29 node_feat.dgl
