Single Machine Graph Construction
Prerequisites
A machine with Linux operation system that has proper CPU memory according to the raw data size.
Following the Setup GraphStorm with pip Packages guideline to install GraphStorm and its dependencies.
Following the Input Raw Data Explanations guideline to prepare the input raw data.
Graph consturction command
GraphStorm provides a gconstruct.construct_graph module for graph construction in a signle machine. Users can run the gconstruct.construct_graph command by following the command template below.
python -m graphstorm.gconstruct.construct_graph \
--conf-file config.json \
--output-dir /a_path \
--num-parts 1 \
--graph-name a_name
This template provides the actual Python command, and it also indicates the three required command arguments, i.e., --conf-file specifies a JSON file containing graph construction configurations, --output-dir specifies the directory for outputs, and --graph-name specifies a string as a name given to the constructed graph. The --num-parts whose default given value is 1 is also an important argument. It determines how many partitions to be constructed. In distrusted model training and inference, the number of machines is determined by the number of partitions.
Configuration JSON Object Explanations
The configuration JSON file is the key input argument for graph construction. The file contains a JSON object that defines the overall graph schema in terms of node type and edge type. For each node and edge type, it defines where the node and edge data are stored and in what file format. When a type of node or edge has features, it defines which columns in the data table are features and what feature transformation operations will be used to encode the features. When a type of node or edge has labels, it defines which columns in the data table are labels and how to split the labels into the training, validation, and testing sets.
In the highest level, the JSON object contains three fields: version, nodes and edges.
version (Optional)
version marks the version of the configuration file schema, allowing its identification to be self-contained for downstream applications. The current (and expected) version is gconstruct-v0.1.
nodes (Required)
nodes contains a list of node types and the information of a node type is stored in a dictionary. A node dictionary contains multiple fields and most fields are optional.
node_type: (Required) specifies the node type. Think this as a name given to one type of nodes, e.g. “author” and “paper”.files: (Required) specifies the input files for the node type. There are multiple options to specify the input files. For a single input file, it contains the path of a single file. For multiple files, it could contain the paths of files with a wildcard, e.g., file_name*.parquet, or a list of file paths, e.g., [“file_name001.parquet”, “file_name002.parquet”, …].format: (Required) specifies the input file format. Currently, the construction command supports three input file formats:csv,parquet, andHDF5. The value of this field is a dictionary, where the key isnameand the value is eithercsv,parquetorHDF5, e.g., {“name”:”csv”}. The detailed format information could be found in the Input Raw Data Explanations guideline.node_id_col: specifies the column name that contains the node IDs. This field is optional. If not provided, the construction command will create node IDs according to the total number of rows and consider each row in the node table is a unique node. If user choose to store columns of a node type in multiple sets of tables, only one of the set of tables require to specify the node ID column. For example of this multiple sets of tables, please refer to the simple input data example document.featuresis a list of dictionaries that define how to get features and transform features. This is optional. The format of a feature dictionary is defined in the Feature dictionary format section below.labelsis a list of dictionaries that define where to get labels and how to split the labels into training/validation/test set. This is optional. The format of a label dictionary is defined in the Label dictionary format section below.
edges (Required)
Similarly, edges contains a list of edge types and the information of an edge type is stored in a dictionary. An edge dictionary also contains the same fields of files, format, features and labels as the nodes field. In addition, it contains the following unique fields:
source_id_col: (Required) specifies the column name of the source node IDs.dest_id_col: (Required) specifies the column name of the destination node IDs.relation: (Required) is a list of three elements that contains the node type of the source nodes, the relation type of the edges, and the node type of the destination nodes. Values of node types should be same as the corresponding values specified in thenode_typefields innodesobjects, e.g., [“author”, “write”, “paper”].
Feature dictionary format
feature_col: (Required) specifies the column name in the input file that contains the feature. Thefeature_colcan accept either a string or a list. Whenfeature_colis specified as a list with multiple columns, the same feature transformation operation will be applied to each column, and then the transformed feature will be concatenated to form the final feature.feature_name: specifies the prefix of the column feature name. This is optional. If feature_name is not provided,feature_colis used as the feature name. If the feature transformation generates multiple tensors,feature_namebecomes the prefix of the names of the generated tensors. If there are multiple columns defined infeature_col,feature_nameis required.out_dtypespecifies the data type of the transformed feature.out_dtypeis optional. If it is not set, no data type casting is applied to the transformed feature. If it is set, the output feature will be cast into the corresponding data type. Now only float16, float32, and float64 are supported.transform: specifies the actual feature transformation. This is a dictionary and its name field indicates the feature transformation operation. Each transformation operation has its own argument(s). The list of feature transformations supported by the pipeline are listed in the section of Feature Transformation below.
Label dictionary format
task_type: (Required) specifies the task defined on the nodes or edges. Currently, its value can be one ofclassification,regression,link_prediction, andreconstruct_node_feat.label_col: specifies the column name in the input file that contains the labels. This has to be specified forclassificationandregressiontasks.label_colis also used as the label name.split_pct: specifies how to split the data into training/validation/test. If it’s not specified, the data is split into 80% for training 10% for validation and 10% for testing. The pipeline constructs three additional vectors indicating the training/validation/test masks. Forclassificationandregressiontasks, the names of the mask tensors aretrain_mask,val_maskandtest_mask.custom_split_filenames: specifies the customized training/validation/test mask. It has field namedtrain,valid, andtestto specify the path of the mask files. It is possible that one of the subfield here leaves empty and it will be treated as none. It will override thesplit_pctonce provided. Refer to Label split files for detailed explanations.label_stats_type: specifies the statistic type used to summarize labels. So far, only support one value, i.e.,frequency_cnt.
Feature transformation
GraphStorm provides a set of transformation operations for different types of feautures.
HuggingFace tokenizer transformation tokenizes text strings with a HuggingFace tokenizer. The
namefield in the feature transformation dictionary istokenize_hf. The dict should contain two additional fields.bert_modelspecifies the LM model used for tokenization. Users can choose any HuggingFace LM models from one of the following types:"bert", "roberta", "albert", "camembert", "ernie", "ibert", "luke", "mega", "mpnet", "nezha", "qdqbert","roc_bert", such as"bert-base-uncased" and "roberta-base"max_seq_lengthspecifies the maximal sequence length.
Example:
"transform": {"name": "tokenize_hf", "bert_model": "bert-base-uncased", "max_seq_length": 16},
HuggingFace LM transformation encodes text strings with a HuggingFace LM model. The
namefield in the feature transformation dictionary isbert_hf. The dict should contain two additional fields.bert_modelspecifies the LM model used for embedding text. Users can choose any HuggingFace LM models from one of the following types:"bert", "roberta", "albert", "camembert", "ernie", "ibert", "luke", "mega", "mpnet", "nezha", "qdqbert","roc_bert", such as"bert-base-uncased" and "roberta-base"max_seq_lengthspecifies the maximal sequence length.
Example:
"transform": {"name": "bert_hf", "bert_model": "roberta-base", "max_seq_length": 256},
Numerical MAX_MIN transformation normalizes numerical input features with val = (val-min)/(max-min), where val is the feature value, max is the maximum value in the feature and min is the minimum value in the feature. The
namefield in the feature transformation dictionary ismax_min_norm. The dictionary can contain four optional fields:max_bound,min_bound,max_valandmin_val.max_boundspecifies the maximum value allowed in the feature. Any number larger thanmax_boundwill be set tomax_bound. Here, max = min(np.amax(feats), ``max_bound`)`.min_boundspecifies the minimum value allowed in the feature. Any number smaller thanmin_boundwill be set tomin_bound. Here, min = max(np.amin(feats),min_bound).max_valdefines the max in the transformation formula. Whenmax_valis provided, max is always equal tomax_val.min_valdefines the min in the transformation formula. Whenmin_valis provided, min is always equal tomin_val.
max_valandmin_valare mainly used in the inference stage, where we want to use the same max and min values computed in the training stage to normalize inference data.Example:
"transform": {"name": "max_min_norm", "max_bound": 2., "min_bound": -2.}
Numerical Rank Gauss transformation normalizes numerical input features with rank gauss normalization. It maps the numeric feature values to gaussian distribution based on ranking. The method follows the description in the normalization section of the Porto Seguro’s Safe Driver Prediction kaggle competition. The
namefield in the feature transformation dictionary isrank_gauss. The dict can contains two optional fields, i.e.,epsilonwhich is used to avoidINFfloat during computation anduniquifywhich controls whether deduplicating input features before computing rank gauss norm.Example:
"transform": {"name": "rank_gauss", "epsilon": 1e-5, "uniquify": True, }Convert to categorical values converts text data to categorial values. The
namefield isto_categorical, andseparatorspecifies how to split the string into multiple categorical values (this is only used to define multiple categorical values). Ifseparatoris not specified, the entire string is considered as a single categorical value.mapping(optional) is a dictionary that specifies how to map a string to an integer value that defines a categorical value. Ifmappingis provided, any string value which is not in themappingwill be ignored. Themappingfield is mainly used in the inference stage when we want to keep the same categorical mapping as in the training stage.Example:
"transform": {"name": "to_categorical"},
Numerical Bucket transformation normalizes numerical input features with buckets. The input features are divided into one or multiple buckets. Each bucket stands for a range of floats. An input value can fall into one or more buckets depending on the transformation configuration. The
namefield in the feature transformation dictionary isbucket_numerical. Users can to providerangeandbucket_cntfields, whererangedefines a numerical range, andbucket_cntdefines number of buckets among the range. All buckets will have same length, and each of them is left included. e.g, bucket[a, b)will includea, but notb. All input feature column data are categorized into respective buckets using this method. Any input data lower than the minimum value will be assigned to the first bucket, and any input data exceeding the maximum value will be assigned to the last bucket. For example, withrange: [10,30]andbucket_cnt: 2, input data1will fall into the bucket[10, 20], input data11will be mapped to[10, 20], input data21will be mapped to[20, 30], input data31will be mapped to[20, 30]. Finally GraphStorm uses one-hot-encoding to encode the feature for each numerical bucket. If a user wants to make numeric values fall into more than one bucket, it is suggested to use theslide_window_sizefield.slide_window_sizedefines a number, e.g.,s. Then each valuevwill be transformed into a range fromv - s/2throughv + s/2, and assigns the valuevto every bucket that the range covers.Example:
"transform": {"name": "bucket_numerical", "range": [10, 50], "bucket_cnt": 2, "slide_window_size": 10},
No-op vector truncation (experimental) truncates feature vectors to the length requested. The
namefield can be empty (e.g.,{name: }), and an integertruncate_dimvalue will determine the length of the output vector. This can be useful when experimenting with input features that were trained using Matryoshka Representation Learning.Example:
"transform": {"name": , "truncate_dim": 24},
Outputs of the graph consturction command
The graph construction command outputs two formats: DistDGL or DGL specified by the argument --output-format.
If select DGL, the output includes an DGLGraph file, named <graph_name>.dgl under the folder specified by the --output-dir argument, where <graph_name> is the value of argument --graph-name.
If select DistDGL, the output will be a partitioned DistDGL graph. It includes a JSON file, named <graph_name>.json that describes the meta-information of the partitioned graph, a set of part* folders under the folder specified by the --output-dir argument, where the * is the number specified by the --num-parts argument.
Besides the graph data, the graph construction command also generate other files that contain related metadata information associated with the graph data, including a set of node and edge ID mapping files, a new construction configuration JSON file that records the details of feature transformation operations, and lable statistic summary files if required in the label_stats_type field.
Node and Edge Mapping Files: There are two node/edge id mapping stages during graph construction. The first mapping occurs when GraphStorm converts the original user provided node ids into integer-based node ids, and the second mapping happends when graph partition operation shuffles these integer-based node ids to each partition with new node ids. Meanwhile, graph construction also saves two sets of node id mapping files as parts of its outputs.
Outputs of the first mapping stage are stored at the
raw_id_mappingsfolder under the path specified by the --output-dir argument. For each node type, there is a dedicated folder named after thenode_typefiled, in which contains parquet format files named afterpart-*****.parquet, where*****represents five digit numbers starting from00000.Outputs of the second mapping stage are two PyTorch tensor files, i.e.,
node_mapping.ptandedge_mapping.pt, each of which maps the node and edge in the partitoined graph into the integer original node and edge id space. The node ID mapping is stored as a dictionary of 1D tensors whose key is the node type and value is a 1D tensor mapping between shuffled node IDs and the original node IDs. The edge ID mapping is stored as a dictionary of 1D tensors whose key is the edge type and value is a 1D tensor mapping between shuffled edge IDs and the original edge IDs.New Construction Configuration JSON: By default, GraphStorm will regenerate a construction configuration JSON file that copies the contents in the given JSON file specified by the –conf-file argument. In addition if there are transformations of features occurred, this newly generated JSON file will include some additional information. For example, if the original configuration JSON file requires to perform a Convert to categorical values transformation without giving the
mappingdictionary, the newly generated configuration JSON file will add thismappingdictionary with the actual values and their mapping ids. This added information could help construct new graphs for fine-tunning saved models or doing inference with saved models.If users provide a value of the --output-conf-file argument, the newly generated configuration file will use this value as the file name. Otherwise GraphStorm will save the configuration JSON file in the --output-dir with name
data_transform_new.json.Label Statistic Summary JSONs: If required in the
label_stats_typefield, the graph construction command will compute statistics of labels and save them in anode_label_stats.jsonor aedge_label_stats.json.
Note
These mapping files are important for mapping the training and inference outputs. Therefore, DO NOT move or delete them.
A construction configuration JSON example
This section provides a construction configuration JSON associated to the simple raw data example as an example for refernece.
{
"version": "gconstruct-v0.1",
"nodes": [
{
"node_id_col": "nid",
"node_type": "paper",
"format": {"name": "parquet"},
"files": "paper_nodes.parquet",
"features": [
{
"feature_col": ["aff"],
"feature_name": "aff_feat",
"transform": {"name": "to_categorical",
"mapping": {"NE": 0, "MT": 1,"UL": 2, "TT": 3,"UC": 4}}
},
{
"feature_col": "abs",
"feature_name": "abs_bert",
"out_dtype": "float32",
"transform": {"name": "bert_hf",
"bert_model": "roberta",
"max_seq_length": 16}
},
],
"labels": [
{
"label_col": "class",
"task_type": "classification",
"custom_split_filenames": {
"train": "train.json",
"valid": "val.json",
"test": "test.json"},
"label_stats_type": "frequency_cnt",
},
],
},
{
"node_id_col": "domain",
"node_type": "subject",
"format": {"name": "parquet"},
"files": "subject_nodes.parquet",
},
{
"node_id_col": "n_id",
"node_type": "author",
"format": {"name": "parquet"},
"files": "author_nodes.parquet",
"features": [
{
"feature_col": ["hdx"],
"feature_name": "feat",
"out_dtype": 'float16',
"transform": {"name": "max_min_norm",
"max_bound": 1000.,
"min_val": 0.}
},
],
},
{
"node_type": "author",
"format": {"name": "hdf5"},
"files": "author_node_embeddings.h5",
"features": [
{
"feature_col": ["embedding"],
"feature_name": "embed",
"out_dtype": 'float16',
},
],
},
],
"edges": [
{
"source_id_col": "nid",
"dest_id_col": "domain",
"relation": ["paper", "has", "subject"],
"format": {"name": "parquet"},
"files": ["paper_has_subject_edges.parquet"],
"labels": [
{
"label_col": "cnt",
"task_type": "regression",
"custom_split_filenames": {
"train": "train_edges.json",
"valid": "val_edges.json",
},
},
],
},
{
"source_id_col": "nid",
"dest_id_col": "n_id",
"relation": ["paper", "written-by", "author"],
"format": {"name": "parquet"},
"files": ["paper_written-by_author_edges.parquet"],
}
]
}
Note
For a real runnable example, please refer to the input JSON file used in the Use Your Own Graphs Tutorial.
A full argument list of the gconstruct.construct_graph command
--conf-file: (Required) the path of the configuration JSON file.
--num-processes: the number of processes to process the data simulteneously. Default is 1. Increase this number can speed up data processing, but will also increase the CPU memory consumption.
--num-processes-for-nodes: the number of processes to process node data simulteneously. Increase this number can speed up node data processing.
--num-processes-for-edges: the number of processes to process edge data simulteneously. Increase this number can speed up edge data processing.
--output-dir: (Required) the path of the output data files.
--graph-name: (Required) the name assigned for the graph.
--remap-node-id: boolean value to decide whether to rename node IDs or not. Adding this argument will set it to be true, otherwise false.
--add-reverse-edges: boolean value to decide whether to add reverse edges for the given graph. Adding this argument sets it to true; otherwise, it defaults to false. It is strongly suggested to include this argument for graph construction, as some nodes in the original data may not have in-degrees, and thus cannot update their presentations by aggregating messages from their neighbors. Adding this arugment helps prevent this issue.
--output-format: the format of constructed graph, options are
DGL,DistDGL. Default isDistDGL. It also accepts multiple graph formats at the same time separated by an space, for example--output-format "DGL DistDGL". The output format is explained in the Output section above.--num-parts: an integer value that specifies the number of graph partitions to produce. This is only valid if the output format is
DistDGL.--skip-nonexist-edges: boolean value to decide whether skip edges whose endpoint nodes don’t exist. Default is true.
--ext-mem-workspace: the directory where the tool can store intermediate data during graph construction. Suggest to use high-speed SSD as the external memory workspace.
--ext-mem-feat-size: the minimal number of feature dimensions that features can be stored in external memory. Default is 64.
--output-conf-file: The output file with the updated configurations that records the details of data transformation, e.g., convert to categorical value mappings, and max-min normalization ranges. If not specified, will save the updated configuration file in the --output-dir with name data_transform_new.json.
Graph Partition for DGL Graphs
Warning
The two graph partition tools in this section were originally implemented for quick code debugging and are no longer maintained. It is strongly suggested to use the gconstruct.construct_graph command or the Distributed Graph Construction guideline for graph construction.
For users who are already familiar with DGL and know how to construct DGL graphs, GraphStorm provides two graph partition tools to split DGL graphs into the required input format for GraphStorm model training and inference.
partition_graph.py: for Node/Edge Classification/Regress task graph partition.
partition_graph_lp.py: for Link Prediction task graph partition.
partition_graph.py arguments
--dataset: (Required) the graph dataset name defined for the saved DGL graph file.
--filepath: (Required) the file path of the saved DGL graph file.
--target-ntype: the node type for making prediction, required for node classification/regression tasks. This argument is associated with the node type having labels. Current GraphStorm supports one prediction node type only.
--ntype-task: the node type task to perform. Only support
classificationandregressionso far. Default isclassification.--nlabel-field: the field that stores labels on the prediction node type, required if target-ntype is set. The format is
nodetype:labelname, e.g., “paper:label”.--target-etype: the canonical edge type for making prediction, required for edge classification/regression tasks. This argument is associated with the edge type having labels. Current GraphStorm supports one prediction edge type only. The format is
src_ntype,etype,dst_ntype, e.g., “author,write,paper”.--etype-task: the edge type task to perform. Only allow
classificationandregressionso far. Default isclassification.--elabel-field: the field that stores labels on the prediction edge type, required if target-etype is set. The format is
src_ntype,etype,dst_ntype:labelname, e.g., “author,write,paper:label”.--generate-new-node-split: a boolean value, required if need the partition script to split nodes for training/validation/test sets. If this argument is set to
true, the target-ntype argument must also be set.--generate-new-edge-split: a boolean value, required if need the partition script to split edges for training/validation/test sets. If this argument is set to
true, the target-etype argument must also be set.--train-pct: a float value (>0. and <1.) with default value
0.8. If you want the partition script to split nodes/edges for training/validation/test sets, you can set this value to control the percentage of nodes/edges for training.--val-pct: a float value (>0. and <1.) with default value
0.1. You can set this value to control the percentage of nodes/edges for validation.
Note
The sum of the train-pct and val-pct should be less than 1. And the percentage of test nodes/edges is the result of 1-(train_pct + val_pct).
--add-reverse-edges: if add this argument, will add reverse edges to the given graph.
--num-parts: (Required) an integer value that specifies the number of graph partitions to produce. Remember this number because we will need to set it in the model training step.
--output: (Required) the folder path that the partitioned DGL graphs will be saved.
partition_graph_lp.py arguments
--dataset: (Required) the graph name defined for the saved DGL graph file.
--filepath: (Required) the file path of the saved DGL graph file.
--target-etypes: (Required) the canonical edge types for making prediction. GraphStorm supports multiple predict edge types that are separated by a white space. The format is
src_ntype1,etype1,dst_ntype1 src_ntype2,etype2,dst_ntype2, e.g., “author,write,paper paper,citing,paper”.--train-pct: a float value (>0. and <1.) with default value
0.8. If you want the partition script to split edges for training/validation/test sets, you can set this value to control the percentage of edges for training.--val-pct: a float value (>0. and <1.) with default value
0.1. You can set this value to control the percentage of edges for validation.
Note
The sum of the train-pct and val-pct should less than 1. And the percentage of test edges is the result of 1-(train_pct + val_pct).
--add-reverse-edges: if add this argument, will add reverse edges to the given graphs.
--num-parts: (Required) an integer value that specifies the number of graph partitions to produce. Remember this number because we will need to set it in the model training step.
--output: (Required) the folder path that the partitioned DGL graph will be saved.