{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook 1: Use GraphStorm APIs for Building a Node Classification Pipeline\n", "\n", "This notebook demonstrates how to use GraphStorm's APIs to create graph machine learning pipelines. By playing with this notebook, users will be able to get familiar with GraphStorm APIs, hence further using them on their own tasks and models.\n", "\n", "In this notebook, we create a simple RGCN model and use it to conduct a node classification task on the ACM dataset created by the **[Notebook 0: Data Preparation](https://graphstorm.readthedocs.io/en/latest/api/notebooks/Notebook_0_Data_Prepare.html)**. \n", "\n", "
\n", "Note: This notebook is designed to run on GraphStorm's Standalone mode, i.e., in a single Linux machine with CPUs or a single GPU.
\n", "\n", "### Prerequisites\n", "\n", "- GraphStorm. Please find [more details on installation of GraphStorm](https://graphstorm.readthedocs.io/en/latest/install/env-setup.html#setup-graphstorm-with-pip-packages). Because this notebook runs on the Standalone mode, there is no need to configure the \"SSH No-password login\".\n", "- ACM data that has been created according to the [Notebook 0: Data Preparation](https://graphstorm.readthedocs.io/en/latest/api/notebooks/Notebook_0_Data_Prepare.html), and is stored in the `./acm_gs_1p/` folder. If users change the folder, please make sure change its location in this notebook.\n", "- Installation of other supporting libraries, e.g., `logging` and `matplotlib`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Setup log level in Jupyter Notebook to show running information\n", "import logging\n", "logging.basicConfig(level=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### 0. Initialize the GraphStorm Standalone Environment\n", "\n", "The first step to use GraphStorm is to call `gs.initialize()` for the Standalone mode." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "import graphstorm as gs\n", "gs.initialize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Setup GraphStorm Dataset and DataLoaders\n", "\n", "Similar as Pytorch model training pipeline, we create a dataset by constructing `gs.dataset.GSgnnData` class. In most cases, users only need to provide the location of the graph description JSON file, which is created in GraphStorm's gconstruct operation." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "# create a GraphStorm Dataset for the ACM graph data generated in the Notebook 0\n", "acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we create different `DataLoader`s for training, validation, and testing. As shown below, we allow users to specify different `DataLoader` settings, e.g., `fanout`, `batch_size`, except for a few model-related properties, such as `node_feats` and `label_field`.\n", "\n", "GNN models may only use parts of graph features, therefore, GraphStorm `DataLoader`s allow users to specifies the `node_feats` in the format of a dictionary of lists of strings. Keys of the dictionary are node type names, while values are lists of feature name strings." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "# define dataloaders for training, validation, and testing\n", "nfeats_4_modeling = {'author':['feat'], 'paper':['feat'],'subject':['feat']}\n", "\n", "train_dataloader = gs.dataloading.GSgnnNodeDataLoader(\n", " dataset=acm_data,\n", " target_idx=acm_data.get_node_train_set(ntypes=['paper']),\n", " node_feats=nfeats_4_modeling,\n", " label_field='label',\n", " fanout=[20, 20],\n", " batch_size=64,\n", " train_task=True)\n", "val_dataloader = gs.dataloading.GSgnnNodeDataLoader(\n", " dataset=acm_data,\n", " target_idx=acm_data.get_node_val_set(ntypes=['paper']),\n", " node_feats=nfeats_4_modeling,\n", " label_field='label',\n", " fanout=[100, 100],\n", " batch_size=256,\n", " train_task=False)\n", "test_dataloader = gs.dataloading.GSgnnNodeDataLoader(\n", " dataset=acm_data,\n", " target_idx=acm_data.get_node_test_set(ntypes=['paper']),\n", " node_feats=nfeats_4_modeling,\n", " label_field='label',\n", " fanout=[100, 100],\n", " batch_size=256,\n", " train_task=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Create a GraphStorm-compatible RGCN Model for Node Classification\n", "\n", "GraphStorm has a set of GNN component modules that could be freely combined for different tasks. This notebook depends on an RGCN model that extends from the `GSgnnModel` and uses a simple set of GraphStorm model APIs. Users can find the details in the `demon_models.py` file.\n", "\n", "We will explore GraphStorm model APIs in notebooks to be release in the future." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [], "source": [ "# import a simplified RGCN model for node classification\n", "from demo_models import RgcnNCModel\n", "\n", "model = RgcnNCModel(g=acm_data.g,\n", " num_hid_layers=2,\n", " node_feat_field=nfeats_4_modeling,\n", " hid_size=128,\n", " num_classes=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Setup a GraphStorm Evaluator\n", "\n", "To check the performance during model training, GraphStorm relies on a set of built-in `Evaluator`s for different tasks. Here we create a `GSgnnClassificationEvaluator` for the node classification task." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "# setup a classification evaluator for the trainer\n", "evaluator = gs.eval.GSgnnClassificationEvaluator(eval_frequency=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Setup a Trainer and Training\n", "\n", "For training loop, GraphStorm has different `Trainer`s for specific tasks. Here we use a `GSgnnNodePredictionTrainer` to orchestrate dataloaders, models, and evaluators. Users can specify other hyperparameters, e.g., `num_epochs`, when calling `Trainer`'s `fit()` function." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# create a GraphStorm node task trainer for the RGCN model\n", "trainer = gs.trainer.GSgnnNodePredictionTrainer(model)\n", "trainer.setup_evaluator(evaluator)\n", "trainer.setup_device(gs.utils.get_device())" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Train the model with the trainer using fit() function\n", "trainer.fit(train_loader=train_dataloader,\n", " val_loader=val_dataloader,\n", " test_loader=test_dataloader,\n", " num_epochs=5,\n", " save_model_path='a_save_path/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Optional) 5. Visualize Model Performance History\n", "\n", "Besides the log, we can examine the model performance on the validation, and testing by visualizing evalutors' history property." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# extract evaluation history of metrics from the evaluator's history property\n", "val_metrics, test_metrics = [], []\n", "for val_metric, test_metric in trainer.evaluator.history:\n", " val_metrics.append(val_metric['accuracy'])\n", " test_metrics.append(test_metric['accuracy'])\n", "\n", "# plot the performance curves\n", "fig, ax = plt.subplots()\n", "ax.plot(val_metrics, label='val')\n", "ax.plot(test_metrics, label='test')\n", "ax.set(xlabel='Epoch', ylabel='Accuracy')\n", "ax.legend(loc='best')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Inference with the Trained Model\n", "\n", "GraphStorm automatically save the best performed model according to the value specified in the `save_model_path` argument. We can first find out what is the best model and its path. And then restore it by using model's `restore_model()` method." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best model path: a_save_path/epoch-4\n" ] } ], "source": [ "# after training, the best model is saved to disk:\n", "best_model_path = trainer.get_best_model_path()\n", "print('Best model path:', best_model_path)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# we can restore the model from the saved path using the model's restore_model() function.\n", "model.restore_model(best_model_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do inference, users can either create a new dataloader as the following code does, or reuse one of the dataloaders defined in training." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Setup dataloader for inference\n", "infer_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data,\n", " target_idx=acm_data.get_node_test_set(ntypes=['paper']),\n", " node_feats=nfeats_4_modeling,\n", " label_field='label',\n", " fanout=[100, 100],\n", " batch_size=256,\n", " train_task=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GraphStorm provides a set of inferrers that can perform highly efficient inference for very large graphs. Similar to trainers, users can create inferrers by giving a model as its input argument, and then call its `infer()` method with a few inference related parameters, such as folder paths for saving inference results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create an Inferrer object\n", "infer = gs.inference.GSgnnNodePredictionInferrer(model)\n", "\n", "# Run inference on the inference dataset\n", "infer.infer(infer_dataloader,\n", " save_embed_path='infer/embeddings',\n", " save_prediction_path='infer/predictions',\n", " use_mini_batch_infer=True)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 640K\n", "-rw-rw-r-- 1 ubuntu ubuntu 626K May 9 22:57 embed-00000.pt\n", "-rw-rw-r-- 1 ubuntu ubuntu 11K May 9 22:57 embed_nids-00000.pt\n", "total 84K\n", "-rw-rw-r-- 1 ubuntu ubuntu 70K May 9 22:57 predict-00000.pt\n", "-rw-rw-r-- 1 ubuntu ubuntu 11K May 9 22:57 predict_nids-00000.pt\n" ] } ], "source": [ "# The GNN embeddings and predictions on the inference graph are saved to the folder named after the target_ntype\n", "!ls -lh infer/embeddings/paper\n", "!ls -lh infer/predictions/paper" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 4 }