Notebook 0: Data Preparation

This notebook creates an example graph dataset that the other notebooks use to demonstrate how to program with the GraphStorm APIs. The example graph data comes from DGL’s ACM publication dataset, which is the same data explained in the Use Your Own Data tutorial.

Prerequisites

This notebook assumes the following:

Python 3;
Linux OS, Ubuntu or Amazon Linux;
GraphStorm and its dependencies (following the Setup GraphStorm with pip packages tutorial);
Jupyter web interactive server.

Users can run the following cell to check that GraphStorm and its dependencies are installed. The import fails if they are not.

[1]:

import graphstorm as gs
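The import above only confirms that GraphStorm itself can be loaded. As a minimal sketch, the other prerequisites listed earlier can be probed without importing anything heavy. The function name check_prerequisites and the default package list (graphstorm, dgl, torch) are assumptions made for illustration, not part of GraphStorm.

```python
import importlib.util
import platform
import sys


def check_prerequisites(packages=("graphstorm", "dgl", "torch")):
    """Return a dict that maps each prerequisite to True or False."""
    report = {
        # The notebook assumes Python 3.
        "python3": sys.version_info.major == 3,
        # The notebook assumes a Linux OS (Ubuntu or Amazon Linux).
        "linux": platform.system() == "Linux",
    }
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_prerequisites().items():
    print(f"{item}: {'OK' if ok else 'MISSING'}")
```

A MISSING entry for graphstorm suggests revisiting the Setup GraphStorm with pip packages tutorial before continuing.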

Download Data Generation Script

GraphStorm provides a Python script that downloads the DGL ACM publication data and converts it for GraphStorm usage. First, let’s download the script file from the GraphStorm GitHub repository.

[25]:

!wget -O ./acm_data.py https://github.com/awslabs/graphstorm/raw/main/examples/acm_data.py

Generate ACM Raw Table Data

Then we can use the command below to build the raw table data, which is the standard input data for GraphStorm’s gconstruct module.

[26]:

!python ./acm_data.py --output-path ./acm_raw --output-type raw_w_text

Construct GraphStorm Input Graph Data

With the raw ACM tables, we can then use GraphStorm’s graph construction module to prepare the ACM graph for the other notebooks. The graph construction module performs the following steps:

reads in the raw data and converts it to a DGL graph;
splits the DGL graph into multiple partitions to form the distributed DGL graph;
produces node ID mapping files and other supporting files.

For the GraphStorm Standalone mode, we only need one partition. Therefore, in the command below we set --num-parts to 1. For the other arguments, users can refer to GraphStorm Graph Construction arguments.

[27]:

!python -m graphstorm.gconstruct.construct_graph \
          --conf-file ./acm_raw/config.json \
          --output-dir ./acm_gs_1p \
          --num-parts 1 \
          --graph-name acm

3-Partition Input Data

To better illustrate the input data structure that GraphStorm requires, we can use the following command to create a 3-partition version of the input data.

[28]:

!python -m graphstorm.gconstruct.construct_graph \
          --conf-file ./acm_raw/config.json \
          --output-dir ./acm_gs_3p \
          --num-parts 3 \
          --graph-name acm
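The two construction commands above differ only in the output directory and the number of partitions. As a small sketch, the argument list can be assembled in Python, which is convenient when generating several partition counts. The helper name build_construct_cmd is hypothetical; the module path and flags are the ones used in the cells above.

```python
import shlex
import sys


def build_construct_cmd(conf_file, output_dir, num_parts, graph_name):
    """Assemble the construct_graph command line as an argument list."""
    return [
        sys.executable, "-m", "graphstorm.gconstruct.construct_graph",
        "--conf-file", conf_file,
        "--output-dir", output_dir,
        "--num-parts", str(num_parts),
        "--graph-name", graph_name,
    ]


# One command per partition count used in this notebook.
commands = [
    build_construct_cmd("./acm_raw/config.json", f"./acm_gs_{n}p", n, "acm")
    for n in (1, 3)
]
for cmd in commands:
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

The sketch only prints the commands, so it can be run safely before GraphStorm is installed.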

Data Exploration and Explanation

The above commands created two sets of ACM data, i.e., the raw ACM data tables, and ACM GraphStorm input graphs. Below we explore these datasets, and explain their format so that users can prepare their own graph data easily.

We can explore the acm_raw folder with the ls -al command.

[6]:

!ls -al ./acm_raw

total 24
drwxrwxr-x 4 ubuntu ubuntu 4096 May 15 23:29 .
drwxrwxr-x 6 ubuntu ubuntu 4096 May 15 23:30 ..
-rw-rw-r-- 1 ubuntu ubuntu 5306 May 15 23:29 config.json
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:29 edges
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:29 nodes

[7]:

!ls -al ./acm_raw/nodes

total 38744
drwxrwxr-x 2 ubuntu ubuntu     4096 May 15 23:29 .
drwxrwxr-x 4 ubuntu ubuntu     4096 May 15 23:29 ..
-rw-rw-r-- 1 ubuntu ubuntu 18843828 May 15 23:29 author.parquet
-rw-rw-r-- 1 ubuntu ubuntu 20702289 May 15 23:29 paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu   113414 May 15 23:29 subject.parquet

[8]:

!ls -al ./acm_raw/edges

total 1016
drwxrwxr-x 2 ubuntu ubuntu   4096 May 15 23:29 .
drwxrwxr-x 4 ubuntu ubuntu   4096 May 15 23:29 ..
-rw-rw-r-- 1 ubuntu ubuntu 263138 May 15 23:29 author_writing_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 156358 May 15 23:29 paper_cited_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 162714 May 15 23:29 paper_citing_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu  87792 May 15 23:29 paper_is-about_subject.parquet
-rw-rw-r-- 1 ubuntu ubuntu 265948 May 15 23:29 paper_written-by_author.parquet
-rw-rw-r-- 1 ubuntu ubuntu  84005 May 15 23:29 subject_has_paper.parquet

Graph Description JSON File `config.json`

The acm_raw folder includes one config.json file that describes the table-based raw graph data. In addition to a version object, the JSON file contains a nodes object and an edges object.

The nodes object contains a list of node objects, each of which includes a set of properties that describe one node type in the graph data. For example, in the config.json file, there is a node type called “paper”. For each node type, GraphStorm defines a few other properties, such as format, files, and features.

Similarly, the edges object contains a list of edge objects. Most edge properties are the same as the node properties, except that an edge object has a relation property that defines an edge type in the canonical format, i.e., source node type, relation type, and destination node type.

For a full list of the JSON configuration properties, users can refer to the GraphStorm Graph Construction JSON Explanations.

To use their own graph, users need to prepare their own JSON file.
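As a rough sketch of the structure described above, the snippet below builds a much reduced configuration as a Python dictionary and checks that each relation refers to declared node types. Only version, nodes, edges, format, files, features, relation, and label_stats_type are named in this notebook; the remaining property names (node_type, node_id_col, source_id_col, dest_id_col, feature_col, feature_name, label_col, task_type), the version string, and the file paths are assumptions that should be verified against the GraphStorm Graph Construction JSON Explanations.

```python
import json

config = {
    "version": "gconstruct-v0.1",  # assumed value, check the generated file
    "nodes": [
        {
            "node_type": "paper",
            "format": {"name": "parquet"},
            "files": ["nodes/paper.parquet"],  # illustrative path
            "node_id_col": "node_id",
            "features": [{"feature_col": "feat", "feature_name": "feat"}],
            "labels": [
                {
                    "label_col": "label",
                    "task_type": "classification",
                    "label_stats_type": "frequency_cnt",
                }
            ],
        }
    ],
    "edges": [
        {
            # Canonical edge type: source type, relation type, destination type.
            "relation": ["paper", "citing", "paper"],
            "format": {"name": "parquet"},
            "files": ["edges/paper_citing_paper.parquet"],  # illustrative path
            "source_id_col": "source_id",
            "dest_id_col": "dest_id",
        }
    ],
}

# Sanity check: every relation must point at node types declared above.
node_types = {node["node_type"] for node in config["nodes"]}
undeclared = [
    edge["relation"]
    for edge in config["edges"]
    if edge["relation"][0] not in node_types or edge["relation"][2] not in node_types
]
print(json.dumps(config, indent=2))
print("undeclared node types in relations:", undeclared)
```

The generated ./acm_raw/config.json remains the authoritative example; this sketch only mirrors its overall shape.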

[29]:

!cat ./acm_raw/config.json

Raw ACM Tables in the `nodes/` and `edges/` Folders

As defined in the ./acm_raw/config.json file, the node data files are stored in the ./acm_raw/nodes/ folder, and the edge data files are stored in the ./acm_raw/edges/ folder. A general description of these files can be found in the Input raw node/edge data files documentation. Here, we read one node table (“paper”) and one edge table ([“paper”, “citing”, “paper”]) to learn more about them.

[10]:

import pandas as pd

paper_node_path = './acm_raw/nodes/paper.parquet'
paper_citing_paper_edge_path = './acm_raw/edges/paper_citing_paper.parquet'

The “paper” node table

The paper node table can be read in as a Pandas DataFrame. The table has a few columns, whose names are used in the config.json. For the “paper” nodes, there is a node_id column with a unique identifier for each node, a label column with an integer indicating the class assigned to each node, a feat column with a 256-dimensional numerical tensor for each node, and a text column with a free-text feature for each node.

The other two node types, “author” and “subject”, have similar data tables. Users can explore them with code similar to the cell below.

[11]:

paper_node_df = pd.read_parquet(paper_node_path)

print(paper_node_df.shape)
paper_node_df.sample(4)

(12499, 4)

[11]:

	node_id	label	feat	text
6933	p6933	3	[-0.006179405, 0.010796122, -0.018994818, -0.0...	'User-oriented text segmentation evaluation me...
743	p743	4	[-0.016835907, -0.020954693, 0.009945098, -0.0...	'Similarity-aware indexing for real-time entit...
11497	p11497	12	[0.009553924, 0.019706111, 0.013354154, -0.010...	'Polynomial time algorithm for computing the t...
2588	p2588	2	[0.0036002623, -0.007723761, -0.012699484, -0....	'Microformats: a pragmatic path to the semanti...
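When preparing your own node tables, it can help to verify the same properties described above: the expected columns exist, node IDs are unique, and every feature vector has the same length. The following is a minimal sketch on a tiny synthetic table (the values are made up, and the helper check_node_table is hypothetical); for the real table, pass paper_node_df and feat_dim=256.

```python
import pandas as pd

REQUIRED_COLUMNS = ("node_id", "label", "feat", "text")


def check_node_table(df, feat_dim):
    """Return a list of human-readable problems found in a node table."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if not df["node_id"].is_unique:
        problems.append("node_id values are not unique")
    # Each feat entry is expected to be a sequence of feat_dim numbers.
    if not (df["feat"].map(len) == feat_dim).all():
        problems.append(f"some feat vectors do not have {feat_dim} entries")
    return problems


# Synthetic stand-in for paper.parquet with 2-dimensional features.
node_df = pd.DataFrame({
    "node_id": ["p0", "p1", "p2"],
    "label": [3, 4, 12],
    "feat": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "text": ["title a", "title b", "title c"],
})

problems = check_node_table(node_df, feat_dim=2)
print(problems or "node table looks consistent")
```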

The (paper, citing, paper) edge table

The “paper, citing, paper” edge table can also be read in as a Pandas DataFrame. It has three columns. The source_id and dest_id columns contain the same identifiers as those listed in the “paper” node table. The label column is a placeholder used for splitting the “paper, citing, paper” edges for a link prediction task.

[12]:

pcp_edge_df = pd.read_parquet(paper_citing_paper_edge_path)

print(pcp_edge_df.shape)
pcp_edge_df.sample(4)

(30789, 3)

[12]:

	source_id	dest_id	label
1140	p241	p6987	1.0
17361	p6296	p6221	1.0
21762	p7578	p7328	1.0
28630	p11144	p11145	1.0
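Because source_id and dest_id must use the identifiers from the node table, a quick referential check is useful before graph construction. The sketch below uses small synthetic tables (made-up values) and plain pandas calls; with the real data, the same expression can be applied to pcp_edge_df and the node_id column of the “paper” node table.

```python
import pandas as pd

# Synthetic stand-ins for the node and edge tables.
node_df = pd.DataFrame({"node_id": ["p0", "p1", "p2"]})
edge_df = pd.DataFrame({
    "source_id": ["p0", "p1", "p2"],
    "dest_id": ["p1", "p2", "p9"],  # "p9" does not exist in node_df
    "label": [1.0, 1.0, 1.0],
})

known_ids = set(node_df["node_id"])
# An edge is resolvable only if both endpoints exist in the node table.
resolvable = edge_df["source_id"].isin(known_ids) & edge_df["dest_id"].isin(known_ids)
dangling = edge_df[~resolvable]

print(f"{len(dangling)} edge(s) reference unknown node IDs")
print(dangling)
```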

GraphStorm Input Graph Data in the `./acm_gs_*p/` Folder

In the above cells, we created a 1-partition graph in the acm_gs_1p folder and a 3-partition graph in the acm_gs_3p folder. The contents of the two folders are nearly the same, including:

a GraphStorm partition configuration JSON file;
a subfolder named raw_id_mappings that stores the files mapping the original node ID space to the GraphStorm node ID space, created during graph processing;
mapping files (node_mapping.pt and edge_mapping.pt) from the GraphStorm ID space to the shuffled partition ID space, created during graph partitioning;
label statistics files.

[16]:

!ls -l ./acm_gs_1p

total 1516
-rw-rw-r-- 1 ubuntu ubuntu    1673 May 15 23:30 acm.json
-rw-rw-r-- 1 ubuntu ubuntu     191 May 15 23:30 edge_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 1287802 May 15 23:30 edge_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu     515 May 15 23:30 node_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu  241655 May 15 23:30 node_mapping.pt
drwxrwxr-x 2 ubuntu ubuntu    4096 May 15 23:30 part0
drwxrwxr-x 5 ubuntu ubuntu    4096 May 15 23:30 raw_id_mappings

[17]:

!ls -l ./acm_gs_3p

total 1524
-rw-rw-r-- 1 ubuntu ubuntu    3325 May 15 23:30 acm.json
-rw-rw-r-- 1 ubuntu ubuntu     191 May 15 23:30 edge_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 1287802 May 15 23:30 edge_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu     515 May 15 23:30 node_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu  241655 May 15 23:30 node_mapping.pt
drwxrwxr-x 2 ubuntu ubuntu    4096 May 15 23:30 part0
drwxrwxr-x 2 ubuntu ubuntu    4096 May 15 23:30 part1
drwxrwxr-x 2 ubuntu ubuntu    4096 May 15 23:30 part2
drwxrwxr-x 5 ubuntu ubuntu    4096 May 15 23:30 raw_id_mappings

Because the two graphs were built with different numbers of partitions, the two folders contain different partition data sub-folders. These are named “part0” to “part(N-1)”, where N is the number of partitions specified with the --num-parts argument of the construct_graph command.

Tip: In the next sections, we use the 3-partition graph to explore these four sets of files and sub-folders one by one. But we will use the 1-partition graph in the other notebooks for GraphStorm standalone mode programming tutorials.

The GraphStorm Partition Configuration File `acm.json`

The acm.json file describes the partitioned graph that GraphStorm uses for model training and inference.

It includes basic information about the partitioned graph, such as node and edge types, the number of each node and edge type, and the number of partitions along with the other partition mapping information.

[30]:

!cat ./acm_gs_3p/acm.json

Raw Node ID Mapping Files in the `raw_id_mappings` Folder

Because the original node IDs could be of any type, e.g., strings, integers, or even floats, GraphStorm performs an ID mapping during graph processing, which maps the original node ID space given by users into an integer node ID space starting from 0. This mapping information is stored in the raw_id_mappings folder, which contains a set of subfolders, one named after each node type.

[18]:

!ls -l ./acm_gs_3p/raw_id_mappings/

total 12
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 author
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 paper
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 subject

[19]:

!ls -l ./acm_gs_3p/raw_id_mappings/author/

total 208
-rw-rw-r-- 1 ubuntu ubuntu 212064 May 15 23:30 part-00000.parquet

In each subfolder, there is a set of parquet files with names in the format part-*****.parquet. The number of these parquet files is determined by the number of nodes of each type: the more nodes, the more files. Users can use any parquet exploration tool to check their contents, as the code below does.

[23]:

author_nid_mapping_df = pd.read_parquet('./acm_gs_3p/raw_id_mappings/author/part-00000.parquet')

print(author_nid_mapping_df.shape)
author_nid_mapping_df.sample(4)

(17431, 2)

[23]:

	orig	new
7958	a7958	7958
7475	a7475	7475
13423	a13423	13423
9246	a9246	9246

As shown above, the author/part-00000.parquet file has two columns. The orig column contains the original string-type node IDs from the raw node table data, while the new column contains the new integer node IDs in the GraphStorm node ID space.
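The idea behind this mapping can be sketched in a few lines of plain Python: every distinct raw ID, whatever its type, receives an integer starting from 0. Assigning integers in first-seen order is an assumption made for illustration; the notebook does not state how GraphStorm orders the IDs internally, and the helper build_id_mapping is hypothetical.

```python
def build_id_mapping(raw_ids):
    """Map each distinct raw ID to an integer, starting from 0, in first-seen order."""
    mapping = {}
    for raw in raw_ids:
        if raw not in mapping:
            mapping[raw] = len(mapping)
    return mapping


# Raw IDs may be strings, integers, or floats; duplicates map to the same integer.
raw_ids = ["a7958", "a13", 42, 3.5, "a13"]
orig_to_new = build_id_mapping(raw_ids)

# The reverse lookup plays the role of reading the table from new back to orig.
new_to_orig = {new: orig for orig, new in orig_to_new.items()}

print(orig_to_new)
print(new_to_orig[1])
```

With the real mapping files, the same two lookups can be built from the orig and new columns of the DataFrame loaded above.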

GraphStorm Partition Node/Edge ID Mapping Files `****_mapping.pt`

GraphStorm relies on the distributed DGL graph as its input graph data. The distributed DGL graph has its own node ID space, so graph partitioning creates another node ID mapping.

These ID mappings, in the form of Python dictionaries, are stored in the ****_mapping.pt files, which can be loaded using PyTorch.

Tip: In general, users do not need to map IDs back themselves. When GraphStorm’s command line interface is used to train models and run inference, GraphStorm automatically remaps the partitioned ID space to the original node ID space.

[24]:

import torch as th

node_mapping_dict = th.load('./acm_gs_3p/node_mapping.pt')
print('Node id mapping:')
print(f'Node mapping keys: {list(node_mapping_dict.keys())}')
ntype0 = list(node_mapping_dict.keys())[0]
print(f'Node type \'{ntype0}\' first 10 mapping ids: {node_mapping_dict[ntype0][:10]}\n')

edge_mapping_dict = th.load('./acm_gs_3p/edge_mapping.pt')
print('Edge id mapping:')
print(f'Edge mapping keys: {list(edge_mapping_dict.keys())}')
etype0 = list(edge_mapping_dict.keys())[0]
print(f'Edge type \'{etype0}\' first 10 mapping ids: {edge_mapping_dict[etype0][:10]}\n')

Node id mapping:
Node mapping keys: ['author', 'paper', 'subject']
Node type 'author' first 10 mapping ids: tensor([9908, 5644, 5645, 5646, 5647, 5648, 5649, 5650, 9270, 5643])

Edge id mapping:
Edge mapping keys: [('author', 'writing', 'paper'), ('paper', 'cited', 'paper'), ('paper', 'citing', 'paper'), ('paper', 'is-about', 'subject'), ('paper', 'written-by', 'author'), ('subject', 'has', 'paper')]
Edge type '('author', 'writing', 'paper')' first 10 mapping ids: tensor([ 1622, 16688, 22176, 35837, 22116, 22183, 22234,  3538,  9921,  1062])

The mapping in these tensors is positional: each tensor stores GraphStorm graph IDs, and the position index of each value is the new partitioned ID. For example, for “author” nodes, the GraphStorm graph ID 9908 has the new partitioned node ID 0 because the number 9908 is in the first position (index=0) of the mapping tensor.

Warning: The first author node ID might not be 9908 because the partition process is not deterministic. Users may see author node IDs different from those in the given example.
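The positional logic can be illustrated without PyTorch by using a plain list in place of the mapping tensor. The values below are the first five entries printed in the example output above, and the two helper functions are hypothetical names used only for this sketch.

```python
# Plain list standing in for the "author" mapping tensor.
# Position = partitioned node ID, value = GraphStorm node ID.
author_mapping = [9908, 5644, 5645, 5646, 5647]


def partitioned_to_graphstorm(mapping, part_id):
    """Look up the GraphStorm node ID stored at a partitioned node ID."""
    return mapping[part_id]


def graphstorm_to_partitioned(mapping):
    """Invert the mapping: GraphStorm node ID -> partitioned node ID."""
    return {gs_id: part_id for part_id, gs_id in enumerate(mapping)}


inverse = graphstorm_to_partitioned(author_mapping)
print(partitioned_to_graphstorm(author_mapping, 0))  # 9908
print(inverse[9908])  # 0
```

With a real tensor loaded from node_mapping.pt, indexing works the same way, and calling .tolist() on the tensor gives a list that can be inverted as shown.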

Label Statistic Files `****_label_stats.json`

If users specify the label statistics property in the config.json file, e.g., by setting "label_stats_type": "frequency_cnt" in the “paper” node’s label object, GraphStorm collects the label statistics and stores them in the ****_label_stats.json files.

[8]:

!cat ./acm_gs_3p/node_label_stats.json
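A frequency count of this kind records how many times each label value occurs. The sketch below shows the idea on a made-up label list with Python’s Counter; it does not reproduce the exact layout of the node_label_stats.json file, which is best inspected with the cat command above.

```python
from collections import Counter

# Made-up class labels standing in for the label column of a node table.
labels = [3, 4, 12, 2, 3, 3, 4]

# Count how often each label value appears.
freq = Counter(labels)

# Sort by label value for a stable, readable summary.
summary = dict(sorted(freq.items()))
print(summary)  # {2: 1, 3: 3, 4: 2, 12: 1}
```

Such counts are a quick way to spot class imbalance before training a node classification model.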

Partitioned Graph Data `partN/***.dgl`

The distributed DGL graph datasets are saved in these partN subfolders, each of which contains three DGL-formatted files:

edge_feat.dgl: edge features of one partition, if any.
graph.dgl: graph structure of one partition.
node_feat.dgl: node features of one partition, if any.

[19]:

!ls -al ./acm_gs_3p/part0

total 13892
drwxrwxr-x 2 ubuntu ubuntu     4096 Dec 19 21:29 .
drwxrwxr-x 6 ubuntu ubuntu     4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu    31926 Dec 19 21:29 edge_feat.dgl
-rw-rw-r-- 1 ubuntu ubuntu  2081555 Dec 19 21:29 graph.dgl
-rw-rw-r-- 1 ubuntu ubuntu 12097671 Dec 19 21:29 node_feat.dgl
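To confirm that a partitioned output folder has the layout described above, one can walk the expected part0 to part(N-1) subfolders and report any of the three files that are absent. This is a minimal sketch using only the standard library; the helper missing_partition_files is hypothetical, and the temporary directory merely simulates a 3-partition output so that the example runs anywhere.

```python
import tempfile
from pathlib import Path

PART_FILES = ("edge_feat.dgl", "graph.dgl", "node_feat.dgl")


def missing_partition_files(graph_dir, num_parts):
    """List expected partition files that are absent under graph_dir."""
    graph_dir = Path(graph_dir)
    missing = []
    for i in range(num_parts):
        for name in PART_FILES:
            path = graph_dir / f"part{i}" / name
            if not path.is_file():
                missing.append(path.relative_to(graph_dir).as_posix())
    return missing


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a 3-partition layout with empty placeholder files.
    for i in range(3):
        part_dir = Path(tmp) / f"part{i}"
        part_dir.mkdir()
        for name in PART_FILES:
            (part_dir / name).touch()
    # Remove one file to show what a report looks like.
    (Path(tmp) / "part2" / "edge_feat.dgl").unlink()
    result = missing_partition_files(tmp, num_parts=3)

print(result)  # ['part2/edge_feat.dgl']
```

For the real data, call missing_partition_files("./acm_gs_3p", 3). A reported file is not necessarily an error; it is simply worth checking against the graph configuration.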
