Notebook 0: Data Preparation
This notebook creates example graph data that the other notebooks use to demonstrate how to program with GraphStorm APIs. The example graph data comes from DGL’s ACM publication dataset, which is the same data explained in the Use Your Own Data tutorial.
Prerequisites
This notebook assumes the following:
Python 3;
Linux OS, Ubuntu or Amazon Linux;
GraphStorm and its dependencies (following the Setup GraphStorm with pip packages tutorial)
Users can use the following command to check if the above prerequisites are met.
[1]:
import graphstorm as gs
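The import above only confirms that GraphStorm is installed. As a minimal sketch, the other two prerequisites can be checked with the standard library as well. The helper name `check_prerequisites` is hypothetical and not part of GraphStorm, and the check does not distinguish Ubuntu or Amazon Linux from other Linux distributions.

```python
import importlib.util
import platform
import sys


def check_prerequisites(package="graphstorm"):
    """Return simple environment checks without importing the package itself."""
    return {
        "python3": sys.version_info.major == 3,
        "linux": platform.system() == "Linux",
        # find_spec returns None when a top-level package is not installed
        "package_found": importlib.util.find_spec(package) is not None,
    }


report = check_prerequisites()
for name, ok in report.items():
    print(f"{name}: {'OK' if ok else 'not met'}")
```

Any entry reported as "not met" points at the prerequisite to revisit before running the remaining cells.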
Download Data Generation Script
GraphStorm provides a Python script that downloads the DGL ACM publication data and converts it for GraphStorm usage. First, let’s download the script file from the GraphStorm GitHub repository.
[25]:
!wget -O ./acm_data.py https://github.com/awslabs/graphstorm/raw/main/examples/acm_data.py
Generate ACM Raw Table Data
Then we can use the command below to build the raw table data, which is the standard input data for GraphStorm’s gconstruct module.
[26]:
!python ./acm_data.py --output-path ./acm_raw --output-type raw_w_text
Construct GraphStorm Input Graph Data
With the raw ACM tables, we can then use GraphStorm’s graph construction module to prepare the ACM graph for the other notebooks. The graph construction module performs the following steps:
read in the raw data, and convert it to DGL graph;
split the DGL graph into multiple partitions as the distributed DGL graphs;
produce node id mapping files and other supporting files.
For the GraphStorm Standalone mode, we only need one partition. Therefore, in the command below we set --num-parts to 1. For the other arguments, users can refer to GraphStorm Graph Construction arguments.
[27]:
!python -m graphstorm.gconstruct.construct_graph \
--conf-file ./acm_raw/config.json \
--output-dir ./acm_gs_1p \
--num-parts 1 \
--graph-name acm
3-Partition Input Data
To better illustrate the input data structure that GraphStorm requires, we can use the following command to create 3-partition input data.
[28]:
!python -m graphstorm.gconstruct.construct_graph \
--conf-file ./acm_raw/config.json \
--output-dir ./acm_gs_3p \
--num-parts 3 \
--graph-name acm
Data Exploration and Explanation
The above commands created two sets of ACM data, i.e., the raw ACM data tables, and ACM GraphStorm input graphs. Below we explore these datasets, and explain their format so that users can prepare their own graph data easily.
We can explore the acm_raw folder with the ls -al command.
[6]:
!ls -al ./acm_raw
total 24
drwxrwxr-x 4 ubuntu ubuntu 4096 May 15 23:29 .
drwxrwxr-x 6 ubuntu ubuntu 4096 May 15 23:30 ..
-rw-rw-r-- 1 ubuntu ubuntu 5306 May 15 23:29 config.json
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:29 edges
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:29 nodes
[7]:
!ls -al ./acm_raw/nodes
total 38744
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:29 .
drwxrwxr-x 4 ubuntu ubuntu 4096 May 15 23:29 ..
-rw-rw-r-- 1 ubuntu ubuntu 18843828 May 15 23:29 author.parquet
-rw-rw-r-- 1 ubuntu ubuntu 20702289 May 15 23:29 paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 113414 May 15 23:29 subject.parquet
[8]:
!ls -al ./acm_raw/edges
total 1016
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:29 .
drwxrwxr-x 4 ubuntu ubuntu 4096 May 15 23:29 ..
-rw-rw-r-- 1 ubuntu ubuntu 263138 May 15 23:29 author_writing_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 156358 May 15 23:29 paper_cited_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 162714 May 15 23:29 paper_citing_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 87792 May 15 23:29 paper_is-about_subject.parquet
-rw-rw-r-- 1 ubuntu ubuntu 265948 May 15 23:29 paper_written-by_author.parquet
-rw-rw-r-- 1 ubuntu ubuntu 84005 May 15 23:29 subject_has_paper.parquet
Graph Description JSON File config.json
The acm_raw folder includes one config.json file that describes the table-based raw graph data. Besides a version object, the JSON file contains a nodes object and an edges object.
The nodes object contains a list of node objects, each of which includes a set of properties that describe one node type in the graph data. For example, the config.json file defines a node type called “paper”. For each node type, GraphStorm defines a few other properties, such as format, files, and features.
Similarly, the edges object contains a list of edge objects. Most edge properties are the same as the node properties, except that an edge object has a relation property that defines an edge type in a canonical format, i.e., source node type, relation type, and destination node type.
For a full list of the JSON configuration properties, users can refer to the GraphStorm Graph Construction JSON Explanations.
To use their own graph, users need to prepare their own JSON file.
[29]:
!cat ./acm_raw/config.json
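To make the nodes and edges structure concrete, here is a minimal sketch that lists the node types and canonical edge types found in a configuration. The inline sample is a hypothetical, heavily trimmed configuration written for illustration, not the real file contents. The `node_type` key, the layout of the `format` object, and the version value are assumptions that should be checked against the output of the cell above.

```python
import json


def summarize_config(config):
    """Return (node types, canonical edge types) from a gconstruct-style config dict."""
    node_types = [node["node_type"] for node in config.get("nodes", [])]
    # Each relation is [source node type, relation type, destination node type]
    edge_types = [tuple(edge["relation"]) for edge in config.get("edges", [])]
    return node_types, edge_types


# Hypothetical trimmed sample; key names other than nodes/edges/relation are assumptions
sample_config = json.loads("""
{
  "version": "example-version",
  "nodes": [
    {"node_type": "paper", "format": {"name": "parquet"},
     "files": ["nodes/paper.parquet"]}
  ],
  "edges": [
    {"relation": ["paper", "citing", "paper"], "format": {"name": "parquet"},
     "files": ["edges/paper_citing_paper.parquet"]}
  ]
}
""")

node_types, edge_types = summarize_config(sample_config)
print(node_types)
print(edge_types)
```

To run the same summary on the generated file, load it with `json.load(open('./acm_raw/config.json'))` and pass the result to `summarize_config`.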
Raw ACM Tables in the nodes/ and edges/ Folders
As defined in the ./acm_raw/config.json file, the node data files are stored in the ./acm_raw/nodes/ folder, and the edge data files are stored in the ./acm_raw/edges/ folder. A general description of these files can be found in Input raw node/edge data files. Here, we read the “paper” node table and the [“paper”, “citing”, “paper”] edge table to learn more about them.
[10]:
import pandas as pd
paper_node_path = './acm_raw/nodes/paper.parquet'
paper_citing_paper_edge_path = './acm_raw/edges/paper_citing_paper.parquet'
The “paper” node table
The paper node table can be read in as a Pandas DataFrame. The table has a few columns, whose names are used in config.json. For the “paper” nodes, there is a node_id column containing a unique identifier for each node, a feat column containing a 256D numerical tensor for each node, a text column containing a free-text feature for each node, and a label column containing an integer that indicates the class assigned to each node.
The other two node types, “author” and “subject”, have similar data tables. Users can explore them with code similar to the cell below.
[11]:
paper_node_df = pd.read_parquet(paper_node_path)
print(paper_node_df.shape)
paper_node_df.sample(4)
(12499, 4)
[11]:
| | node_id | label | feat | text |
|---|---|---|---|---|
| 6933 | p6933 | 3 | [-0.006179405, 0.010796122, -0.018994818, -0.0... | 'User-oriented text segmentation evaluation me... |
| 743 | p743 | 4 | [-0.016835907, -0.020954693, 0.009945098, -0.0... | 'Similarity-aware indexing for real-time entit... |
| 11497 | p11497 | 12 | [0.009553924, 0.019706111, 0.013354154, -0.010... | 'Polynomial time algorithm for computing the t... |
| 2588 | p2588 | 2 | [0.0036002623, -0.007723761, -0.012699484, -0.... | 'Microformats: a pragmatic path to the semanti... |
The (paper, citing, paper) edge table
The “paper, citing, paper” edge table can also be read in as a Pandas DataFrame. It has three columns. The source_id and dest_id columns contain the same identifiers listed in the “paper” node table. The label column is a placeholder used for splitting the “paper, citing, paper” edges for a link prediction task.
[12]:
pcp_edge_df = pd.read_parquet(paper_citing_paper_edge_path)
print(pcp_edge_df.shape)
pcp_edge_df.sample(4)
(30789, 3)
[12]:
| | source_id | dest_id | label |
|---|---|---|---|
| 1140 | p241 | p6987 | 1.0 |
| 17361 | p6296 | p6221 | 1.0 |
| 21762 | p7578 | p7328 | 1.0 |
| 28630 | p11144 | p11145 | 1.0 |
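Because the source_id and dest_id values must match identifiers in the node table, a quick consistency check is useful when preparing your own data. The sketch below uses plain Python collections and a few made-up rows so that it runs without the ACM files. The helper name `find_dangling_edges` is hypothetical and not part of GraphStorm.

```python
def find_dangling_edges(node_ids, edges):
    """Return the (source_id, dest_id) pairs with an endpoint missing from node_ids."""
    known = set(node_ids)
    return [(src, dst) for src, dst in edges if src not in known or dst not in known]


# Made-up rows for illustration; "p6221" is deliberately absent from the node list
paper_ids = ["p241", "p6987", "p6296"]
citing_edges = [("p241", "p6987"), ("p6296", "p6221")]

dangling = find_dangling_edges(paper_ids, citing_edges)
print(dangling)  # [('p6296', 'p6221')]
```

With the DataFrames loaded in this notebook, the same function could be called as `find_dangling_edges(paper_node_df['node_id'], zip(pcp_edge_df['source_id'], pcp_edge_df['dest_id']))`.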
GraphStorm Input Graph Data in the ./acm_gs_*p/ Folder
In the above cells, we created a 1-partition graph in the acm_gs_1p folder and a 3-partition graph in the acm_gs_3p folder. The contents of the two folders are nearly the same, including:
a GraphStorm partitioned configuration JSON file;
a subfolder named raw_id_mappings that stores the mapping files from the original node ID space to the GraphStorm node ID space, created during graph processing;
mapping files from the GraphStorm node ID space to the shuffled node ID space, created during graph partitioning;
label statistic files.
[16]:
!ls -l ./acm_gs_1p
total 1516
-rw-rw-r-- 1 ubuntu ubuntu 1673 May 15 23:30 acm.json
-rw-rw-r-- 1 ubuntu ubuntu 191 May 15 23:30 edge_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 1287802 May 15 23:30 edge_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu 515 May 15 23:30 node_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 241655 May 15 23:30 node_mapping.pt
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 part0
drwxrwxr-x 5 ubuntu ubuntu 4096 May 15 23:30 raw_id_mappings
[17]:
!ls -l ./acm_gs_3p
total 1524
-rw-rw-r-- 1 ubuntu ubuntu 3325 May 15 23:30 acm.json
-rw-rw-r-- 1 ubuntu ubuntu 191 May 15 23:30 edge_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 1287802 May 15 23:30 edge_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu 515 May 15 23:30 node_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 241655 May 15 23:30 node_mapping.pt
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 part0
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 part1
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 part2
drwxrwxr-x 5 ubuntu ubuntu 4096 May 15 23:30 raw_id_mappings
Because the two graphs were built with different numbers of partitions, the two folders contain different partition data sub-folders, named “part0” to “partN-1”, where N is the number of partitions specified with the --num-parts argument of the construct_graph command.
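As the listings above show, a 1-partition graph has only part0 and a 3-partition graph has part0, part1, and part2. A minimal sketch of checking that an output folder has every expected partition sub-folder follows. Both helper names are hypothetical, and the demo runs on a temporary directory rather than on the real output.

```python
import os
import tempfile


def expected_part_dirs(num_parts):
    """Partition sub-folder names for a graph built with --num-parts num_parts."""
    return [f"part{i}" for i in range(num_parts)]


def missing_part_dirs(graph_dir, num_parts):
    """Return expected partition sub-folders that do not exist under graph_dir."""
    return [name for name in expected_part_dirs(num_parts)
            if not os.path.isdir(os.path.join(graph_dir, name))]


# Demo on a throwaway directory that only contains two of three partitions
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("part0", "part1"):
        os.mkdir(os.path.join(tmp_dir, name))
    demo_missing = missing_part_dirs(tmp_dir, 3)

print(expected_part_dirs(3))  # ['part0', 'part1', 'part2']
print(demo_missing)           # ['part2']
```

Calling `missing_part_dirs('./acm_gs_3p', 3)` after the construction cell has run should return an empty list.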
Tip: In the next sections, we use the 3-partition graph to explore these four sets of files and sub-folders one by one. But we will use the 1-partition graph in the other notebooks for GraphStorm standalone mode programming tutorials.
The GraphStorm Partition Configuration File acm.json
The acm.json file describes the partitioned graph that GraphStorm uses for model training and inference.
It includes basic information about the partitioned graph, such as node and edge types, the number of each node and edge type, and the number of partitions along with the other partition mapping information.
[30]:
!cat ./acm_gs_3p/acm.json
Raw Node ID Mapping Files in the raw_id_mappings Folder
Because the original node IDs can be of any type, e.g., strings, integers, or even floats, GraphStorm conducts an ID mapping during graph processing, which maps the original node ID space given by users into an integer node ID space starting from 0. This mapping information is stored in the raw_id_mappings folder, which contains a set of subfolders named after each node type.
[18]:
!ls -l ./acm_gs_3p/raw_id_mappings/
total 12
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 author
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 paper
drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 subject
[19]:
!ls -l ./acm_gs_3p/raw_id_mappings/author/
total 208
-rw-rw-r-- 1 ubuntu ubuntu 212064 May 15 23:30 part-00000.parquet
In each subfolder, there is a set of parquet files with names in the format part-*****.parquet. The number of these parquet files is determined by the number of nodes of each type: the greater the number of nodes, the more files there will be. Users can use any parquet file exploration tool to check their contents, as the code below does.
[23]:
author_nid_mapping_df = pd.read_parquet('./acm_gs_3p/raw_id_mappings/author/part-00000.parquet')
print(author_nid_mapping_df.shape)
author_nid_mapping_df.sample(4)
(17431, 2)
[23]:
| | orig | new |
|---|---|---|
| 7958 | a7958 | 7958 |
| 7475 | a7475 | 7475 |
| 13423 | a13423 | 13423 |
| 9246 | a9246 | 9246 |
As shown above, the author/part-00000.parquet file has two columns. The orig column contains the original string node IDs from the raw node table data, while the new column contains the new integer node IDs in the GraphStorm node ID space.
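The idea of mapping arbitrary raw IDs to consecutive integers starting from 0 can be sketched in a few lines. This is an illustration of the concept only, assuming IDs are numbered in first-seen order. It is not GraphStorm's implementation, and the ordering GraphStorm uses may differ.

```python
def build_id_mapping(raw_ids):
    """Assign consecutive integers from 0 to raw IDs in first-seen order."""
    mapping = {}
    for raw_id in raw_ids:
        if raw_id not in mapping:
            mapping[raw_id] = len(mapping)
    return mapping


# Made-up raw IDs; the repeated "a1" keeps the integer it was first given
orig_to_new = build_id_mapping(["a0", "a1", "a2", "a1"])
new_to_orig = {new_id: raw_id for raw_id, new_id in orig_to_new.items()}

print(orig_to_new)     # {'a0': 0, 'a1': 1, 'a2': 2}
print(new_to_orig[2])  # a2
```

The two dictionaries play the roles of the orig and new columns: one direction is used when building the graph, the other when reporting results in the original ID space.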
GraphStorm Partition Node/Edge ID Mapping Files ****_mapping.pt
GraphStorm relies on the distributed DGL graph as its input graph data. The distributed DGL graph has its own node ID space, so graph partitioning creates another node ID mapping.
These ID mappings, in the form of Python dictionaries, are stored in the ****_mapping.pt files, which can be loaded using PyTorch.
Tip: In general, users do not need to map IDs back themselves. When GraphStorm’s command line interface is used to train models and run inference, GraphStorm automatically remaps the partitioned ID space to the original node ID space.
[24]:
import torch as th
node_mapping_dict = th.load('./acm_gs_3p/node_mapping.pt')
print('Node id mapping:')
print(f'Node mapping keys: {list(node_mapping_dict.keys())}')
ntype0 = list(node_mapping_dict.keys())[0]
print(f'Node type \'{ntype0}\' first 10 mapping ids: {node_mapping_dict[ntype0][:10]}\n')
edge_mapping_dict = th.load('./acm_gs_3p/edge_mapping.pt')
print('Edge id mapping:')
print(f'Edge mapping keys: {list(edge_mapping_dict.keys())}')
etype0 = list(edge_mapping_dict.keys())[0]
print(f'Edge type \'{etype0}\' first 10 mapping ids: {edge_mapping_dict[etype0][:10]}\n')
Node id mapping:
Node mapping keys: ['author', 'paper', 'subject']
Node type 'author' first 10 mapping ids: tensor([9908, 5644, 5645, 5646, 5647, 5648, 5649, 5650, 9270, 5643])
Edge id mapping:
Edge mapping keys: [('author', 'writing', 'paper'), ('paper', 'cited', 'paper'), ('paper', 'citing', 'paper'), ('paper', 'is-about', 'subject'), ('paper', 'written-by', 'author'), ('subject', 'has', 'paper')]
Edge type '('author', 'writing', 'paper')' first 10 mapping ids: tensor([ 1622, 16688, 22176, 35837, 22116, 22183, 22234, 3538, 9921, 1062])
The ID mapping logic is that these tensors store GraphStorm graph IDs, and their position indexes are the new partitioned node IDs. For example, for “author” nodes, the GraphStorm graph ID 9908 has a new partitioned node ID 0 because the number 9908 is in the first position (index=0) of the mapping tensor.
Warning: The first author node ID might not be 9908 because the partition process is not deterministic. Users may see author node IDs different from the given example.
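The position-based lookup described above can be sketched with a plain Python list standing in for the mapping tensor. The four values are copied from the start of the example output and are not a complete mapping. The helper name `invert_position_mapping` is hypothetical.

```python
def invert_position_mapping(mapping):
    """mapping[i] is the GraphStorm ID stored at partitioned ID i.

    Returns a dict from GraphStorm ID to partitioned ID.
    """
    return {int(gs_id): part_id for part_id, gs_id in enumerate(mapping)}


# First values of the 'author' mapping from the example output (truncated)
author_mapping = [9908, 5644, 5645, 5646]
gs_to_part = invert_position_mapping(author_mapping)

print(author_mapping[2])  # partitioned ID 2 -> GraphStorm ID 5645
print(gs_to_part[9908])   # GraphStorm ID 9908 -> partitioned ID 0
```

For a loaded tensor, `node_mapping_dict['author'].tolist()` would give a list that can be passed to the same function.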
Label Statistic Files ****_label_stats.json
If users specify the label statistic property in the config.json file, e.g., by setting "label_stats_type": "frequency_cnt" in the label object of the “paper” node, GraphStorm collects the labels’ statistics and stores them in the ****_label_stats.json files.
[8]:
!cat ./acm_gs_3p/node_label_stats.json
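A frequency count records how many times each label value occurs. The sketch below illustrates that idea on a few made-up labels using the standard library. It does not reproduce the JSON layout that GraphStorm writes, which is best read from the output of the cell above.

```python
from collections import Counter


def label_frequency(labels):
    """Count how often each label value occurs, ordered by label value."""
    counts = Counter(labels)
    return dict(sorted(counts.items()))


# Made-up labels for illustration only
sample_labels = [3, 4, 12, 2, 3, 3]
freq = label_frequency(sample_labels)
print(freq)  # {2: 1, 3: 3, 4: 1, 12: 1}
```

Such counts are a quick way to spot class imbalance before training a node classification model.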
Partitioned Graph Data partN/***.dgl
The distributed DGL graph datasets are saved in these partN subfolders, each of which contains three DGL formatted files:
edge_feat.dgl: edge features of one partition, if any;
graph.dgl: graph structure of one partition;
node_feat.dgl: node features of one partition, if any.
[19]:
!ls -al ./acm_gs_3p/part0
total 13892
drwxrwxr-x 2 ubuntu ubuntu 4096 Dec 19 21:29 .
drwxrwxr-x 6 ubuntu ubuntu 4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu 31926 Dec 19 21:29 edge_feat.dgl
-rw-rw-r-- 1 ubuntu ubuntu 2081555 Dec 19 21:29 graph.dgl
-rw-rw-r-- 1 ubuntu ubuntu 12097671 Dec 19 21:29 node_feat.dgl