{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c4392287-0c04-44e6-a6b2-e34f5d7157ad",
   "metadata": {},
   "source": [
    "# Notebook 0: Data Preparation\n",
    "\n",
    "This notebook creates an example graph dataset that the other notebooks use to demonstrate how to program with GraphStorm APIs.\n",
    "The example graph data comes from [DGL's ACM publication dataset](https://data.dgl.ai/dataset/ACM.mat), which is the same data explained in the [Use Your Own Data tutorial](https://graphstorm.readthedocs.io/en/latest/tutorials/own-data.html).\n",
    "\n",
    "-----\n",
    "\n",
    "## Prerequisites\n",
    "This notebook assumes the following:\n",
    "- Python 3;\n",
    "- Linux OS, Ubuntu or Amazon Linux;\n",
    "- GraphStorm and its dependencies (following the [Setup GraphStorm with pip packages tutorial](https://graphstorm.readthedocs.io/en/latest/install/env-setup.html#setup-graphstorm-with-pip-packages));\n",
    "- [Jupyter web interactive server](https://jupyter.org/).\n",
    "\n",
    "Run the following cell to check that GraphStorm is installed and can be imported."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ef5d9b4a-2efa-4158-be55-53b22226d3d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import graphstorm as gs"
   ]
  },
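  {
   "cell_type": "markdown",
   "id": "gs-nb0-env-check-note",
   "metadata": {},
   "source": [
    "The import above only confirms that GraphStorm can be imported. As a minimal sketch, the cell below also reports the Python version, the operating system, and whether a few related packages can be located. The package list (`graphstorm`, `dgl`, `torch`, `pandas`) is an assumption based on what this notebook uses, not an official requirement list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "gs-nb0-env-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import importlib.util\n",
    "import platform\n",
    "import sys\n",
    "\n",
    "# Assumed list of packages used in this notebook; adjust as needed.\n",
    "packages = ['graphstorm', 'dgl', 'torch', 'pandas']\n",
    "\n",
    "print(f'Python version: {sys.version_info.major}.{sys.version_info.minor}')\n",
    "print(f'Operating system: {platform.system()}')\n",
    "for name in packages:\n",
    "    # find_spec returns None when a top-level package cannot be located.\n",
    "    status = 'found' if importlib.util.find_spec(name) is not None else 'MISSING'\n",
    "    print(f'{name}: {status}')"
   ]
  },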
  {
   "cell_type": "markdown",
   "id": "edeaa1cd-283a-4c5e-8f20-bbd3dd186928",
   "metadata": {},
   "source": [
    "## Download Data Generation Script\n",
    "GraphStorm provides a Python script that downloads the DGL ACM publication data and converts it for GraphStorm usage. First, let's download the script file from the [GraphStorm GitHub repository](https://github.com/awslabs/graphstorm)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "609b0869-b4eb-4467-8353-31e2acc92203",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -O ./acm_data.py https://github.com/awslabs/graphstorm/raw/main/examples/acm_data.py"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a78c88ef-91f0-4737-9506-985dc7d03c7f",
   "metadata": {},
   "source": [
    "## Generate ACM Raw Table Data\n",
    "Then we can use the command below to build the raw table data, which is the standard input data for GraphStorm's gconstruct module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "4e3e6c4e-9d6d-4a6d-9c51-cf856eef06fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python ./acm_data.py --output-path ./acm_raw --output-type raw_w_text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02183d5a-f574-4938-956a-69eeec1fd1ee",
   "metadata": {},
   "source": [
    "## Construct GraphStorm Input Graph Data\n",
    "With the raw ACM tables, we can then use GraphStorm's graph construction module to prepare the ACM graph for the other notebooks. The graph construction module performs the following steps:\n",
    "- read in the raw data and convert it to a DGL graph;\n",
    "- split the DGL graph into multiple partitions to form a distributed DGL graph;\n",
    "- produce node ID mapping files and other supporting files.\n",
    "\n",
    "For the GraphStorm Standalone mode, we only need one partition. Therefore, in the command below we set `--num-parts` to `1`. For the other arguments, users can refer to [GraphStorm Graph Construction arguments](https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "2b8c752c-b6e8-4562-9637-4c56a4b09875",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m graphstorm.gconstruct.construct_graph \\\n",
    "          --conf-file ./acm_raw/config.json \\\n",
    "          --output-dir ./acm_gs_1p \\\n",
    "          --num-parts 1 \\\n",
    "          --graph-name acm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "314eed68-b10b-4f59-94c5-fcac538cc202",
   "metadata": {},
   "source": [
    "#### 3-Partition Input Data\n",
    "To better illustrate the input data structure that GraphStorm requires, we can use the following command to create 3-partition input data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "fc5eb53c-50ed-4b51-8b94-57094ee4dc15",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m graphstorm.gconstruct.construct_graph \\\n",
    "          --conf-file ./acm_raw/config.json \\\n",
    "          --output-dir ./acm_gs_3p \\\n",
    "          --num-parts 3 \\\n",
    "          --graph-name acm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e79e2712-4a3c-41a2-8c1b-819c7e8364d2",
   "metadata": {},
   "source": [
    "## Data Exploration and Explanation\n",
    "The above commands created two sets of ACM data: the **raw ACM data tables** and the ACM **GraphStorm input graphs**. Below we explore these datasets and explain their formats so that users can prepare their own graph data easily."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "148db3bc-ca76-4a75-b3b8-e2ed55784b03",
   "metadata": {},
   "source": [
    "### Raw ACM Table Data in the `./acm_raw` Folder\n",
    "We can explore the `acm_raw` folder with the `ls -al` command. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c0a6cc31-2230-498e-9a19-33466ed0f916",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 24\n",
      "drwxrwxr-x 4 ubuntu ubuntu 4096 May 15 23:29 .\n",
      "drwxrwxr-x 6 ubuntu ubuntu 4096 May 15 23:30 ..\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 5306 May 15 23:29 config.json\n",
      "drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:29 edges\n",
      "drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:29 nodes\n"
     ]
    }
   ],
   "source": [
    "!ls -al ./acm_raw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2cc6e2db-b5fd-4a03-9192-11a577f36988",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 38744\n",
      "drwxrwxr-x 2 ubuntu ubuntu     4096 May 15 23:29 .\n",
      "drwxrwxr-x 4 ubuntu ubuntu     4096 May 15 23:29 ..\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 18843828 May 15 23:29 author.parquet\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 20702289 May 15 23:29 paper.parquet\n",
      "-rw-rw-r-- 1 ubuntu ubuntu   113414 May 15 23:29 subject.parquet\n"
     ]
    }
   ],
   "source": [
    "!ls -al ./acm_raw/nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "998dc284-8f0c-4f2a-89ca-f63c5d2fc65d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 1016\n",
      "drwxrwxr-x 2 ubuntu ubuntu   4096 May 15 23:29 .\n",
      "drwxrwxr-x 4 ubuntu ubuntu   4096 May 15 23:29 ..\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 263138 May 15 23:29 author_writing_paper.parquet\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 156358 May 15 23:29 paper_cited_paper.parquet\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 162714 May 15 23:29 paper_citing_paper.parquet\n",
      "-rw-rw-r-- 1 ubuntu ubuntu  87792 May 15 23:29 paper_is-about_subject.parquet\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 265948 May 15 23:29 paper_written-by_author.parquet\n",
      "-rw-rw-r-- 1 ubuntu ubuntu  84005 May 15 23:29 subject_has_paper.parquet\n"
     ]
    }
   ],
   "source": [
    "!ls -al ./acm_raw/edges"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74be0c90-64f8-4d83-a741-35a6f58b3a9b",
   "metadata": {},
   "source": [
    "#### Graph Description JSON File `config.json`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2346f42f-b8bb-4e5f-8485-cc9c508384f2",
   "metadata": {},
   "source": [
    "The `acm_raw` folder includes one `config.json` file that describes the table-based raw graph data. Besides a **version** object, the JSON file contains a **nodes** object and an **edges** object.\n",
    "\n",
    "The **nodes** object contains a list of *node* objects, each of which includes a set of properties that describe one node type in the graph. For example, the `config.json` file defines a node type called \"paper\". For each node type, GraphStorm defines a few other properties, such as **format**, **files**, and **features**.\n",
    "\n",
    "Similarly, the **edges** object contains a list of *edge* objects. Most *edge* properties are the same as those of a *node*, except that an *edge* object has a **relation** property that defines an edge type in the canonical format, i.e., *source node type*, *relation type*, and *destination node type*.\n",
    "\n",
    "For a full list of the JSON configuration properties, users can refer to the [GraphStorm Graph Construction JSON Explanations](https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations).\n",
    "\n",
    "To use their own graph data, users need to prepare their own JSON file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "ddfef233-7fff-41ab-84f7-30d96885c472",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!cat ./acm_raw/config.json"
   ]
  },
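  {
   "cell_type": "markdown",
   "id": "gs-nb0-config-summary-note",
   "metadata": {},
   "source": [
    "Instead of reading the whole file, the configuration can also be summarized programmatically. The cell below is a minimal sketch that lists the node types with their feature columns and the edge types. The key names `node_type`, `features`, `feature_col`, and `relation` are assumptions taken from the gconstruct JSON documentation linked above, so the code uses `.get()` and prints `None` where a key is absent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "gs-nb0-config-summary-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "with open('./acm_raw/config.json', 'r') as f:\n",
    "    gconf = json.load(f)\n",
    "\n",
    "# Key names below are assumptions based on the gconstruct documentation.\n",
    "for node in gconf.get('nodes', []):\n",
    "    feat_cols = [feat.get('feature_col') for feat in node.get('features', [])]\n",
    "    print('node type:', node.get('node_type'), '| feature columns:', feat_cols)\n",
    "\n",
    "for edge in gconf.get('edges', []):\n",
    "    # A relation is expected to be [source node type, relation type, destination node type].\n",
    "    print('edge type:', edge.get('relation'))"
   ]
  },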
  {
   "cell_type": "markdown",
   "id": "6babcfd3-b570-4045-9cb6-541bf7ceaf4d",
   "metadata": {},
   "source": [
    "#### Raw ACM Tables in the `nodes/` and `edges/` Folders\n",
    "As defined in the `./acm_raw/config.json` file, the node data files are stored in the `./acm_raw/nodes/` folder, and the edge data files are stored in the `./acm_raw/edges/` folder. A general description of these files can be found in [Input raw node/edge data files](https://graphstorm.readthedocs.io/en/latest/tutorials/own-data.html#input-raw-node-edge-data-files). Here, we read one node table (\"paper\") and one edge table (\\[\"paper\", \"citing\", \"paper\"\\]) to learn more about them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "475be4f2-405a-4809-b51e-7a439566f00d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "paper_node_path = './acm_raw/nodes/paper.parquet'\n",
    "paper_citing_paper_edge_path = './acm_raw/edges/paper_citing_paper.parquet'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7237ea83-fc92-4a36-87b8-adecae9bdf99",
   "metadata": {},
   "source": [
    "**The \"paper\" node table**\n",
    "\n",
    "The paper node table can be read in as a Pandas DataFrame. The table has a few columns, whose names are used in the `config.json` file. For the \"paper\" nodes, there is a `node_id` column that holds a unique identifier for each node, a `feat` column that holds a 256D numerical tensor for each node, a `text` column that holds a free-text feature for each node, and a `label` column that holds an integer indicating the class assigned to each node.\n",
    "\n",
    "The other two node types, \"author\" and \"subject\", have similar data tables. Users can explore them with code similar to the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "930314ae-330c-4159-9c3b-d18ed8621c3d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(12499, 4)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>node_id</th>\n",
       "      <th>label</th>\n",
       "      <th>feat</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6933</th>\n",
       "      <td>p6933</td>\n",
       "      <td>3</td>\n",
       "      <td>[-0.006179405, 0.010796122, -0.018994818, -0.0...</td>\n",
       "      <td>'User-oriented text segmentation evaluation me...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>743</th>\n",
       "      <td>p743</td>\n",
       "      <td>4</td>\n",
       "      <td>[-0.016835907, -0.020954693, 0.009945098, -0.0...</td>\n",
       "      <td>'Similarity-aware indexing for real-time entit...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11497</th>\n",
       "      <td>p11497</td>\n",
       "      <td>12</td>\n",
       "      <td>[0.009553924, 0.019706111, 0.013354154, -0.010...</td>\n",
       "      <td>'Polynomial time algorithm for computing the t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2588</th>\n",
       "      <td>p2588</td>\n",
       "      <td>2</td>\n",
       "      <td>[0.0036002623, -0.007723761, -0.012699484, -0....</td>\n",
       "      <td>'Microformats: a pragmatic path to the semanti...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      node_id  label                                               feat  \\\n",
       "6933    p6933      3  [-0.006179405, 0.010796122, -0.018994818, -0.0...   \n",
       "743      p743      4  [-0.016835907, -0.020954693, 0.009945098, -0.0...   \n",
       "11497  p11497     12  [0.009553924, 0.019706111, 0.013354154, -0.010...   \n",
       "2588    p2588      2  [0.0036002623, -0.007723761, -0.012699484, -0....   \n",
       "\n",
       "                                                    text  \n",
       "6933   'User-oriented text segmentation evaluation me...  \n",
       "743    'Similarity-aware indexing for real-time entit...  \n",
       "11497  'Polynomial time algorithm for computing the t...  \n",
       "2588   'Microformats: a pragmatic path to the semanti...  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "paper_node_df = pd.read_parquet(paper_node_path)\n",
    "\n",
    "print(paper_node_df.shape)\n",
    "paper_node_df.sample(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74d50c03-0485-4b33-be76-b33fe7c013fd",
   "metadata": {},
   "source": [
    "**The (paper, citing, paper) edge table**\n",
    "\n",
    "The \"paper, citing, paper\" edge table can also be read in as a Pandas DataFrame. It has three columns. The `source_id` and `dest_id` columns contain the same identifiers listed in the \"paper\" node table. The `label` column is a placeholder used for splitting the \"paper, citing, paper\" edges for a link prediction task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "cdce82e9-eecd-43c7-a0ac-ab7b161d4211",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(30789, 3)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source_id</th>\n",
       "      <th>dest_id</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1140</th>\n",
       "      <td>p241</td>\n",
       "      <td>p6987</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17361</th>\n",
       "      <td>p6296</td>\n",
       "      <td>p6221</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21762</th>\n",
       "      <td>p7578</td>\n",
       "      <td>p7328</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28630</th>\n",
       "      <td>p11144</td>\n",
       "      <td>p11145</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      source_id dest_id  label\n",
       "1140       p241   p6987    1.0\n",
       "17361     p6296   p6221    1.0\n",
       "21762     p7578   p7328    1.0\n",
       "28630    p11144  p11145    1.0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pcp_edge_df = pd.read_parquet(paper_citing_paper_edge_path)\n",
    "\n",
    "print(pcp_edge_df.shape)\n",
    "pcp_edge_df.sample(4)"
   ]
  },
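  {
   "cell_type": "markdown",
   "id": "gs-nb0-edge-id-check-note",
   "metadata": {},
   "source": [
    "Because the edge endpoints refer to identifiers in the node table, a quick consistency check can help when preparing your own data. The cell below is a minimal sketch that reloads both tables and tests whether every `source_id` and `dest_id` appears in the `node_id` column of the \"paper\" table. The variable names are illustrative only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "gs-nb0-edge-id-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "paper_df = pd.read_parquet('./acm_raw/nodes/paper.parquet')\n",
    "citing_df = pd.read_parquet('./acm_raw/edges/paper_citing_paper.parquet')\n",
    "\n",
    "# Collect all known paper identifiers once, then test both edge endpoints.\n",
    "known_ids = set(paper_df['node_id'])\n",
    "src_ok = citing_df['source_id'].isin(known_ids).all()\n",
    "dst_ok = citing_df['dest_id'].isin(known_ids).all()\n",
    "print(f'All source IDs found in the paper table: {src_ok}')\n",
    "print(f'All destination IDs found in the paper table: {dst_ok}')"
   ]
  },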
  {
   "cell_type": "markdown",
   "id": "8cd709f6-9089-4084-b218-71d8cf0a8924",
   "metadata": {},
   "source": [
    "### GraphStorm Input Graph Data in the `./acm_gs_*p/` Folder\n",
    "\n",
    "In the above cells, we created a 1-partition graph in the `acm_gs_1p` folder and a 3-partition graph in the `acm_gs_3p` folder. The contents of the two folders are nearly the same, including:\n",
    "\n",
    "1. a GraphStorm partition configuration JSON file;\n",
    "2. a subfolder named `raw_id_mappings` that stores the mapping files from the original node ID space to the GraphStorm node ID space, created during graph processing;\n",
    "3. mapping files from the GraphStorm node ID space to the shuffled node ID space, created during graph partitioning;\n",
    "4. label statistic files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "86ef095b-5997-46e4-a8a1-99f1db61de57",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 1516\n",
      "-rw-rw-r-- 1 ubuntu ubuntu    1673 May 15 23:30 acm.json\n",
      "-rw-rw-r-- 1 ubuntu ubuntu     191 May 15 23:30 edge_label_stats.json\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 1287802 May 15 23:30 edge_mapping.pt\n",
      "-rw-rw-r-- 1 ubuntu ubuntu     515 May 15 23:30 node_label_stats.json\n",
      "-rw-rw-r-- 1 ubuntu ubuntu  241655 May 15 23:30 node_mapping.pt\n",
      "drwxrwxr-x 2 ubuntu ubuntu    4096 May 15 23:30 part0\n",
      "drwxrwxr-x 5 ubuntu ubuntu    4096 May 15 23:30 raw_id_mappings\n"
     ]
    }
   ],
   "source": [
    "!ls -l ./acm_gs_1p"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "d5a11133-75ef-4299-b167-81498c4f1dc4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 1524\n",
      "-rw-rw-r-- 1 ubuntu ubuntu    3325 May 15 23:30 acm.json\n",
      "-rw-rw-r-- 1 ubuntu ubuntu     191 May 15 23:30 edge_label_stats.json\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 1287802 May 15 23:30 edge_mapping.pt\n",
      "-rw-rw-r-- 1 ubuntu ubuntu     515 May 15 23:30 node_label_stats.json\n",
      "-rw-rw-r-- 1 ubuntu ubuntu  241655 May 15 23:30 node_mapping.pt\n",
      "drwxrwxr-x 2 ubuntu ubuntu    4096 May 15 23:30 part0\n",
      "drwxrwxr-x 2 ubuntu ubuntu    4096 May 15 23:30 part1\n",
      "drwxrwxr-x 2 ubuntu ubuntu    4096 May 15 23:30 part2\n",
      "drwxrwxr-x 5 ubuntu ubuntu    4096 May 15 23:30 raw_id_mappings\n"
     ]
    }
   ],
   "source": [
    "!ls -l ./acm_gs_3p"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7cbd682e-36e8-4410-ae77-b0d6931f3052",
   "metadata": {},
   "source": [
    "Because the two graphs use different numbers of partitions, the two folders have different partition data sub-folders, named from \"part0\" to \"part***N-1***\", where ***N*** is the number of partitions specified with the `--num-parts` argument of the `construct_graph` command.\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>Tip:</b> In the next sections, we use the 3-partition graph to explore these four sets of files and sub-folders one by one. But we will use the 1-partition graph in the other notebooks for GraphStorm standalone mode programming tutorials. </div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e675a0de-0324-4f5b-86b0-45e1909bbd68",
   "metadata": {},
   "source": [
    "#### The GraphStorm Partition Configuration File `acm.json`\n",
    "The `acm.json` file describes the partitioned graph that GraphStorm uses for model training and inference.\n",
    "\n",
    "It includes basic information about the partitioned graph, such as the node and edge types, the number of nodes and edges, and the number of partitions, along with other partition mapping information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "5f63bcba-fe24-4890-b232-4f9208886f02",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!cat ./acm_gs_3p/acm.json"
   ]
  },
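  {
   "cell_type": "markdown",
   "id": "gs-nb0-part-config-note",
   "metadata": {},
   "source": [
    "The same information can be pulled out with a few lines of Python. The cell below is a minimal sketch that loads `acm.json` and prints a handful of fields. The key names (`graph_name`, `num_parts`, `num_nodes`, `num_edges`, `ntypes`, `etypes`) are assumptions based on DGL's partition configuration format, so the code uses `.get()` and prints `None` or an empty list where a key is absent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "gs-nb0-part-config-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "with open('./acm_gs_3p/acm.json', 'r') as f:\n",
    "    part_conf = json.load(f)\n",
    "\n",
    "# Key names are assumptions based on DGL's partition configuration format.\n",
    "for key in ['graph_name', 'num_parts', 'num_nodes', 'num_edges']:\n",
    "    print(f'{key}: {part_conf.get(key)}')\n",
    "print('node types:', list(part_conf.get('ntypes', {}).keys()))\n",
    "print('edge types:', list(part_conf.get('etypes', {}).keys()))"
   ]
  },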
  {
   "cell_type": "markdown",
   "id": "c7c84137-0f41-4287-984f-a7358372fde6",
   "metadata": {},
   "source": [
    "#### Raw Node ID Mapping Files in the `raw_id_mappings` Folder\n",
    "Because the original node IDs can be of any type, e.g., strings, integers, or even floats, GraphStorm performs an ID mapping during graph processing, which maps the original node ID space given by users into an integer node ID space starting from 0. This mapping information is stored in the `raw_id_mappings` folder, which contains a set of subfolders named after each node type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3e1f4f36-67d1-46f7-b6a8-cafb4b765fc4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 12\n",
      "drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 author\n",
      "drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 paper\n",
      "drwxrwxr-x 2 ubuntu ubuntu 4096 May 15 23:30 subject\n"
     ]
    }
   ],
   "source": [
    "!ls -l ./acm_gs_3p/raw_id_mappings/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "eeb3791d-66c4-40d7-94b3-20ad7dfac8ca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 208\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 212064 May 15 23:30 part-00000.parquet\n"
     ]
    }
   ],
   "source": [
    "!ls -l ./acm_gs_3p/raw_id_mappings/author/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "597434ba-7b0e-41ce-a75e-9f61c91fce77",
   "metadata": {},
   "source": [
    "In each subfolder, there is a set of parquet files with names in the format `part-*****.parquet`. The number of these parquet files is determined by the number of nodes of each type: the more nodes there are, the more files there will be. Users can use any parquet file exploration tool to check their contents, as the code below does."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "bf47dee7-aabd-4d18-860b-218665a488a9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(17431, 2)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>orig</th>\n",
       "      <th>new</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7958</th>\n",
       "      <td>a7958</td>\n",
       "      <td>7958</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7475</th>\n",
       "      <td>a7475</td>\n",
       "      <td>7475</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13423</th>\n",
       "      <td>a13423</td>\n",
       "      <td>13423</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9246</th>\n",
       "      <td>a9246</td>\n",
       "      <td>9246</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         orig    new\n",
       "7958    a7958   7958\n",
       "7475    a7475   7475\n",
       "13423  a13423  13423\n",
       "9246    a9246   9246"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "author_nid_mapping_df = pd.read_parquet('./acm_gs_3p/raw_id_mappings/author/part-00000.parquet')\n",
    "\n",
    "print(author_nid_mapping_df.shape)\n",
    "author_nid_mapping_df.sample(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "249a6a09-489a-4e45-a101-5ba459ffa50c",
   "metadata": {},
   "source": [
    "As shown above, the `author/part-00000.parquet` file has two columns. The `orig` column contains the original string-type node IDs from the raw node table data, while the `new` column contains the new integer node IDs in the GraphStorm node ID space."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05de822c-5eb6-4656-bbe6-b3bd0ba034b4",
   "metadata": {},
   "source": [
    "#### GraphStorm Partition Node/Edge ID Mapping Files `****_mapping.pt`\n",
    "GraphStorm relies on the distributed DGL graph as its input graph data. The distributed DGL graph has its own node ID space, so graph partitioning creates another node ID mapping.\n",
    "\n",
    "These ID mappings, in the form of Python dictionaries, are stored in the `****_mapping.pt` files, which can be loaded using PyTorch.\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>Tip:</b> In general, users do not need to map IDs back themselves. When using GraphStorm's command line interface to train models and do inference, GraphStorm automatically remaps the partitioned ID space to the original node ID space. </div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "96b2fa7e-cf62-44f0-8f14-714b472e1c17",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Node id mapping:\n",
      "Node mapping keys: ['author', 'paper', 'subject']\n",
      "Node type 'author' first 10 mapping ids: tensor([9908, 5644, 5645, 5646, 5647, 5648, 5649, 5650, 9270, 5643])\n",
      "\n",
      "Edge id mapping:\n",
      "Edge mapping keys: [('author', 'writing', 'paper'), ('paper', 'cited', 'paper'), ('paper', 'citing', 'paper'), ('paper', 'is-about', 'subject'), ('paper', 'written-by', 'author'), ('subject', 'has', 'paper')]\n",
      "Edge type '('author', 'writing', 'paper')' first 10 mapping ids: tensor([ 1622, 16688, 22176, 35837, 22116, 22183, 22234,  3538,  9921,  1062])\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import torch as th\n",
    "\n",
    "node_mapping_dict = th.load('./acm_gs_3p/node_mapping.pt')\n",
    "print('Node id mapping:')\n",
    "print(f'Node mapping keys: {list(node_mapping_dict.keys())}')\n",
    "ntype0 = list(node_mapping_dict.keys())[0]\n",
    "print(f'Node type \\'{ntype0}\\' first 10 mapping ids: {node_mapping_dict[ntype0][:10]}\\n')\n",
    "\n",
    "edge_mapping_dict = th.load('./acm_gs_3p/edge_mapping.pt')\n",
    "print('Edge id mapping:')\n",
    "print(f'Edge mapping keys: {list(edge_mapping_dict.keys())}')\n",
    "etype0 = list(edge_mapping_dict.keys())[0]\n",
    "print(f'Edge type \\'{etype0}\\' first 10 mapping ids: {edge_mapping_dict[etype0][:10]}\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80ff4d2a",
   "metadata": {},
   "source": [
    "The ID mapping logic is as follows: each tensor stores GraphStorm graph IDs, and the position index of each value is its new partitioned node ID. For example, for \"author\" nodes, the GraphStorm graph ID `9908` has a new partitioned node ID `0` because the number `9908` is in the first position (index=`0`) of the mapping tensor.\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "<b>Warning:</b> The first author node ID might not be <b>9908</b> because the partition process is not deterministic. Users may see author node IDs different from the given example. </div>"
   ]
  },
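  {
   "cell_type": "markdown",
   "id": "gs-nb0-id-roundtrip-note",
   "metadata": {},
   "source": [
    "Combining this mapping with the raw ID mapping files recovers the original IDs. The cell below is a minimal sketch that maps a few partitioned \"author\" node IDs back to their original string IDs. It assumes that the whole \"author\" raw ID mapping is stored in the single file `part-00000.parquet`, as in this example; with more files, they would all need to be read. The variable names are illustrative only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "gs-nb0-id-roundtrip-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import torch as th\n",
    "\n",
    "node_mapping = th.load('./acm_gs_3p/node_mapping.pt')\n",
    "# Assumption: one parquet file holds the full author mapping in this example.\n",
    "raw_df = pd.read_parquet('./acm_gs_3p/raw_id_mappings/author/part-00000.parquet')\n",
    "gs_to_orig = dict(zip(raw_df['new'], raw_df['orig']))\n",
    "\n",
    "part_ids = [0, 1, 2]\n",
    "# The position in the tensor is the partitioned ID; the stored value is the GraphStorm ID.\n",
    "gs_ids = node_mapping['author'][part_ids].tolist()\n",
    "orig_ids = [gs_to_orig[gs_id] for gs_id in gs_ids]\n",
    "for part_id, gs_id, orig_id in zip(part_ids, gs_ids, orig_ids):\n",
    "    print(f'partitioned ID {part_id} -> GraphStorm ID {gs_id} -> original ID {orig_id}')"
   ]
  },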
  {
   "cell_type": "markdown",
   "id": "431462ea-401b-497b-a8b0-73a78f92a34e",
   "metadata": {},
   "source": [
    "#### Label Statistic Files `****_label_stats.json`\n",
    "\n",
    "If users specify the label statistic property in the `config.json` file, e.g., by setting `\"label_stats_type\": \"frequency_cnt\"` in the `label` object of the \"paper\" node, GraphStorm collects label statistics and stores them in the `****_label_stats.json` files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "144790b0-dcca-4907-ab5b-f68d522a985d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat ./acm_gs_3p/node_label_stats.json"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ae5ed8d-72e5-4fbb-b31b-0291dfb43ab3",
   "metadata": {},
   "source": [
    "#### Partitioned Graph Data `partN/***.dgl`\n",
    "\n",
    "The distributed DGL graph data is saved in these `partN` subfolders, each of which contains three DGL-formatted files:\n",
    "1. `edge_feat.dgl`: edge features of one partition, if any.\n",
    "2. `graph.dgl`: graph structure of one partition.\n",
    "3. `node_feat.dgl`: node features of one partition, if any."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "4de36699-68f3-4c2f-a7aa-ccc8640ea3e1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 13892\n",
      "drwxrwxr-x 2 ubuntu ubuntu     4096 Dec 19 21:29 .\n",
      "drwxrwxr-x 6 ubuntu ubuntu     4096 Dec 19 21:29 ..\n",
      "-rw-rw-r-- 1 ubuntu ubuntu    31926 Dec 19 21:29 edge_feat.dgl\n",
      "-rw-rw-r-- 1 ubuntu ubuntu  2081555 Dec 19 21:29 graph.dgl\n",
      "-rw-rw-r-- 1 ubuntu ubuntu 12097671 Dec 19 21:29 node_feat.dgl\n"
     ]
    }
   ],
   "source": [
    "!ls -al ./acm_gs_3p/part0"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}