GraphStorm Output Node ID Remapping
During Graph Construction, GraphStorm converts
user provided node IDs into integer-based node IDs. Thus, the outputs of
GraphStorm training and inference jobs, i.e., saved node
embeddings and saved prediction results,
are stored with their integer-based node IDs. GraphStorm provides a
gconstruct.remap_result
module to remap the integer-based node IDs
back to the original user provided node IDs according to the node ID
mapping files.
Note
If the training or inference tasks are launched by GraphStorm CLIs,
the gconstruct.remap_result
module is automatically triggered to
to remap the integer-based node IDs back to the original user provided
node IDs.
Output Node Embeddings after Remapping
By default, the output node embeddings after gconstruct.remap_result
are stored in the path specified by save_embed_path
in parquet format.
The node embeddings for different node
types are stored in separate directories, each named after the
corresponding node type. The content of the output directory will look like following:
emb_dir/
ntype0:
embed-00000_00000.parquet
embed-00000_00001.parquet
...
ntype1:
embed-00000_00000.parquet
embed-00000_00001.parquet
...
For multi-task learning tasks, the output node embeddings may have
task specific versions. (Details can be found in Multi-task
Learning Output). The task specific
node embeddings are also processed by the gconstruct.remap_result
module.
The content of the output directory will look like following:
emb_dir/
ntype0/
embed-00000_00000.parquet
embed-00000_00001.parquet
...
ntype1:
embed-00000_00000.parquet
embed-00000_00001.parquet
...
link_prediction-paper_cite_paper/
ntype0/
embed-00000_00000.parquet
embed-00000_00001.parquet
...
ntype1:
embed-00000_00000.parquet
embed-00000_00001.parquet
...
edge_regression-paper_cite_paper-year/
ntype0/
embed-00000_00000.parquet
embed-00000_00001.parquet
...
ntype1:
embed-00000_00000.parquet
embed-00000_00001.parquet
...
The content of each parquet file will look like following:
nid |
emb |
---|---|
n0 |
[0.2964, 0.0779, 1.2763, 2.8971, …, -0.2564, 0.9060, -0.8740] |
n1 |
[1.6941, -1.6765, 0.1862, -0.4449, …, 0.6474, 0.2358, -0.5952] |
n10 |
[-0.8417, 2.5096, -0.0393, -0.8208, …, 0.9894, 2.3389, 0.9778] |
Note
gconstruct.remap_result
uses nid
as the default column name
for node IDs and emb
as the default column name for embeddings
Output Prediction Results after Remapping
By default, the output prediction results after gconstruct.remap_result
are stored in the path specified by save_prediction_path
in parquet format.
The prediction results for different node
types are stored in separate directories, each named after the
corresponding node type. The prediction results for different edge
types are stored in separate directories, each named after the
corresponding edge type. The content of the directory of node prediction results will look like following:
predict_dir/
ntype0:
predict-00000_00000.parquet
predict-00000_00001.parquet
...
ntype1:
predict-00000_00000.parquet
predict-00000_00001.parquet
...
The content of the directory of edge prediction results will look like following:
predict_dir/
etype0:
predict-00000_00000.parquet
predict-00000_00001.parquet
...
etype1:
predict-00000_00000.parquet
predict-00000_00001.parquet
...
For multi-task learning tasks, there can be multiple prediction results
for different tasks. (Details can be found in Multi-task
Learning Output). The task specific
prediction results are also processed by the gconstruct.remap_result
module.
The content of the output directory will look like following:
prediction_dir/
edge_regression-paper_cite_paper-year/
paper_cite_paper/
predict-00000_00000.parquet
predict-00000_00001.parquet
...
node_classification-paper-venue/
paper/
predict-00000_00000.parquet
predict-00000_00001.parquet
...
The content of a node prediction result file will look like following:
nid |
pred |
---|---|
n0 |
[0.2964, 0.7036] |
n1 |
[0.1862, 0.8138] |
n10 |
[0.9778, 0.0222] |
Note
gconstruct.remap_result
uses ``nid``as the default column name
for node IDs and ``pred``as the default column name for prediction results.
The content of an edge prediction result file will look like following:
src_nid |
dst_nid |
pred |
---|---|---|
n0 |
n32 |
[0.2964, 0.7036] |
n1 |
n21 |
[0.1862, 0.8138] |
n10 |
n2 |
[0.9778, 0.0222] |
Run remap_result Command
If users want to run remap_result by themselves, they can run the
gconstruct.remap_result
command by following the command example:
python -m graphstorm.gconstruct.remap_result \
--node-id-mapping PATH_TO/id_mapping \
--pred-ntypes "n0" "n1" \
--prediction-dir PATH_TO/pred/ \
--node-emb-dir PATH_TO/emb/ \
This example provides the actual Python command. It will do node ID remapping for prediction results of node type n0 and n1` stored under PATH_TO/pred/. It will also do node ID remapping for node embeddings stored under PATH_TO/emb/. The remapped data will be saved in the save directory as the input data and the input data will be removed to save disk space.
Below lists the full argument list of the gconstruct.remap_result
command:
--node-id-mapping: (Required) the path storing the node ID mapping files.
--cf: the path to the yaml configuration file of the corresponding training or inference task. By providing the configuration file,
gconstruct.remap_result
will automatically infer the necessary information for ID remappings for node embeddings and prediction results.--num-processes: The number of processes to process the data simultaneously. A larger number of processes will speedup the ID remapping progress but consumes more CPU memory. Default is 4.
--node-emb-dir: The directory storing the node embeddings to be remapped. Default is None.
--prediction-dir: The directory storing the graph prediction results to be remapped. Default is None.
--pred-etypes: A list of canonical edge types which have prediction results to be remmaped. For example,
--pred-etypes user,rate,movie user,watch,movie
. Must be used with--prediction-dir
. Default is None.--pred-ntypes: A list of node types which have prediction results to be remmaped. For example,
--pred-ntypes user movie
. Must be used with--prediction-dir
. Default is None.--output-format: The output format. It can be
parquet
orcsv
. Default isparquet
.--output-delimiter: The delimiter used when --output-format set to
csv
. Default is,
.--column-names: Defines how to rename default column names to new names. For example, given
--column-names nid,~id emb,embedding
, the columnnid``will be renamed to ``~id
and the columnemb
will be renamed to embedding. Default is None.--logging-level: The logging level. The possible values: debug,
info
,warning
,error
. Default isinfo
.--output-chunk-size: Number of rows per output file.
gconstruct.remap_result
will automatically split output file into multiple files. By default, it is set tosys.maxsize
--preserve-input: Whether we preserve the input data. This is only for debug purpose. Default is False.