GSProcessing Input Configuration
GraphStorm Processing uses a JSON configuration file to parse and process the data into the format needed by GraphStorm partitioning and training downstream.
We use this configuration format as an intermediate between other config formats, such as the one used by the single-machine GConstruct module.
GSProcessing can take a GConstruct-formatted file directly, and we also provide a script that can convert a GConstruct input configuration file into the GSProcessing format. That script is mostly aimed at developers; users can rely on the automatic conversion.
The GSProcessing input data configuration has two top-level objects:
{
"version": "gsprocessing-v0.3.1",
"graph": {}
}
- version (String, required): The version of the configuration file being used. We include the package name to allow self-contained identification of the file format.
- graph (JSON object, required): one configuration object that defines each of the edge and node types that constitute the graph.

We describe the graph object next.
Contents of the graph configuration object

The graph configuration object can have two top-level objects:
{
"edges": [{}],
"nodes": [{}]
}
- edges (array of JSON objects, required): Each JSON object in this array describes one edge type and determines how the edge structure will be parsed.
- nodes (array of JSON objects, optional): Each JSON object in this array describes one node type. This key is optional; if it is missing, node IDs are derived from the edges objects.
Contents of an edges configuration object

An edges configuration object can contain the following top-level keys; a complete example entry follows the key descriptions:
{
"data": {
"format": "String",
"files": ["String"],
"separator": "String"
},
"source": {"column": "String", "type": "String"},
"relation": {"type": "String"},
"dest": {"column": "String", "type": "String"},
"labels" : [
{
"column": "String",
"type": "String",
"split_rate": {
"train": "Float",
"val": "Float",
"test": "Float"
}
}
],
"features": [{}]
}
- data (JSON object, required): Describes the physical files that store the data described in this object. The JSON object has two top-level keys:
  - format (String, required): indicates the format the data is stored in. We accept either "csv" or "parquet" as valid file formats.
  - files (array of String, required): the physical location of files. The format accepts two options:
    - A single-element list with a directory-like relative path (ending in /) under which all the files that correspond to the current edge type are stored, e.g. "files": ["path/to/edge/type/"]. This option allows for concise listing of entire types and is the preferred option. All the files under the path will be loaded. You can also provide a wildcard pattern, e.g. "files": ["path/to/edge-type*.parquet"] would match all files that start with edge-type and end with .parquet under <input_prefix>/path/to.
    - A multi-element list of relative file paths, e.g. "files": ["path/to/edge/type/file_1.csv", "path/to/edge/type/file_2.csv"]. This option allows multiple types to be stored under the same input prefix, but results in more verbose spec files.
    Since the spec expects relative paths, the caller is responsible for providing a path prefix to the execution engine. The prefix determines whether the source is a local filesystem or S3, allowing the spec to be portable, i.e. a user can move the physical files and the spec will still be valid, as long as the relative structure is kept.
  - separator (String, optional): Only relevant for CSV files, determines the separator used between each column in the files.
- source (JSON object, required): Describes the source nodes for the edge type. The top-level keys for the object are:
  - column (String, required): The name of the column in the physical data files.
  - type (String, optional): The type name of the nodes. If not provided, we assume that the column name is the type name.
- dest (JSON object, required): Describes the destination nodes for the edge type. Its format is the same as the source key, with a JSON object that contains {"column": String, "type": String}.
- relation (JSON object, required): Describes the relation modeled by the edges. The top-level keys for the object are:
  - type (String, required): The type of the relation described by the edges. For example, for a source type user and a destination type movie we can have a relation type rated, for an edge type user:rated:movie.
- labels (List of JSON objects, optional): Describes the labels for the current edge type. Each label object has the following top-level keys:
  - column (String, required): The column that contains the values for the label. Should be the empty string, "", if the type key has the value "link_prediction".
  - type (String, required): The type of the learning task. Can take the following String values:
    - "classification": An edge classification task. The values in the specified column are treated as categorical variables.
    - "regression": An edge regression task. The values in the specified column are treated as numerical values.
    - "link_prediction": A link prediction task. The column should be "" in this case.
  - separator (String, optional): For multi-label classification tasks, this separator is used within the column to list multiple classification labels in one entry.
  - split_rate (JSON object, optional): Defines a split rate for the label items. The sum of the values for train, val and test needs to be 1.0.
    - train: The percentage of the data with available labels to assign to the train set (0.0, 1.0].
    - val: The percentage of the data with available labels to assign to the validation set [0.0, 1.0).
    - test: The percentage of the data with available labels to assign to the test set [0.0, 1.0).
  - custom_split_filenames (JSON object, optional): Specifies pre-assigned training/validation/test masks. If defined, GSProcessing will ignore split_rate if that is also provided.
    - column (List[String], optional): A list of length one for node splits, or two for edge splits, containing the name(s) of the column(s) that contain the node/edge ids for each split. For example, if the node ids to include in each split exist in column "nid" of the custom train/val/test files, this needs to be ["nid"]. For edges it would be a value like ["src_id", "dst_id"]. If not provided, for nodes we assume the first column in the data contains the node ids to include; for edges, we assume the first column is the source id and the second the destination id.
    - train (List[String], optional): Paths of the training mask parquet files, such that each line contains the original ID for node tasks, or the pair [source_id, destination_id] for edge tasks.
    - val (List[String], optional): Paths of the validation mask parquet files, with the same structure as the train files.
    - test (List[String], optional): Paths of the test mask parquet files, with the same structure as the train files.
    Note: At least one of the ["train", "val", "test"] keys must be present.
- features (List of JSON objects, optional): Describes the set of features for the current edge type. See the Contents of a features configuration object section for details.
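For illustration, here is a sketch of a complete edges entry for a hypothetical user:rated:movie edge type with an edge classification label. All column names and paths are placeholders:

{
    "data": {
        "format": "parquet",
        "files": ["edges/user-rated-movie/"]
    },
    "source": {"column": "user_id", "type": "user"},
    "relation": {"type": "rated"},
    "dest": {"column": "movie_id", "type": "movie"},
    "labels": [
        {
            "column": "rating",
            "type": "classification",
            "split_rate": {
                "train": 0.8,
                "val": 0.1,
                "test": 0.1
            }
        }
    ]
}

If pre-assigned splits are available instead of a split_rate, a label entry could instead point to them with a custom_split_filenames object, for example:

"custom_split_filenames": {
    "column": ["src_id", "dst_id"],
    "train": ["splits/train_edges.parquet"],
    "val": ["splits/val_edges.parquet"],
    "test": ["splits/test_edges.parquet"]
}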
Contents of a nodes configuration object

A node configuration object in a nodes field can contain the following top-level keys; a complete example entry follows the key descriptions:
{
"data": {
"format": "String",
"files": ["String"],
"separator": "String"
},
"column": "String",
"type": "String",
"labels" : [
{
"column": "String",
"type": "String",
"split_rate": {
"train": "Float",
"val": "Float",
"test": "Float"
}
}
],
"features": [{}]
}
- data (JSON object, required): Has the same definition as for the edges object, with one top-level key for the format that takes a String value, and one for the files that takes an array of String values.
- column (String, required): The name of the column in the data that stores the node ids.
- type (String, optional): A type name for the nodes described in this object. If not provided, the column value is used as the node type.
- labels (List of JSON objects, optional): Similar to the labels object defined for edges, but the values that the type can take are different.
  - column (String, required): The name of the column that contains the label values.
  - type (String, required): Specifies the target task type, which can be:
    - "classification": A node classification task. The values in the specified column are treated as categorical variables.
    - "regression": A node regression task. The values in the specified column are treated as float values.
  - separator (String, optional): For multi-label classification tasks, this separator is used within the column to list multiple classification labels in one entry, e.g. with separator | we can have action|comedy as a label value.
  - split_rate (JSON object, optional): Defines a split rate for the label items. The sum of the values for train, val and test needs to be 1.0.
    - train: The percentage of the data with available labels to assign to the train set (0.0, 1.0].
    - val: The percentage of the data with available labels to assign to the validation set [0.0, 1.0).
    - test: The percentage of the data with available labels to assign to the test set [0.0, 1.0).
- features (List of JSON objects, optional): Describes the set of features for the current node type. See the section Contents of a features configuration object for details.
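For illustration, here is a sketch of a nodes entry for a hypothetical user node type with a multi-label classification label. Column names and paths are placeholders:

{
    "data": {
        "format": "csv",
        "separator": ",",
        "files": ["nodes/user.csv"]
    },
    "column": "user_id",
    "type": "user",
    "labels": [
        {
            "column": "interests",
            "type": "classification",
            "separator": "|",
            "split_rate": {
                "train": 0.8,
                "val": 0.1,
                "test": 0.1
            }
        }
    ]
}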
Contents of a features configuration object

An element of a features configuration array (for edges or nodes) can contain the following top-level keys; an example feature entry follows the key descriptions:
{
"column": "String",
"name": "String",
"transformation": {
"name": "String",
"kwargs": {
"arg_name": "<value>"
}
},
"data": {
"format": "String",
"files": ["String"],
"separator": "String"
}
}
- column (String, required): The column that contains the raw feature values in the data.
- transformation (JSON object, optional): The type of transformation that will be applied to the feature. For details on the individual transformations supported see Supported transformations. If this key is missing, the feature is treated as a no-op feature without kwargs.
  - name (String, required): The name of the transformation to be applied.
  - kwargs (JSON object, optional): A dictionary of parameter names and values. Each individual transformation will have its own supported parameters, described in Supported transformations.
- name (String, optional): The name that will be given to the encoded feature. If not given, column is used as the output name.
- data (JSON object, optional): If the data for the feature exist in a file source that's different from the rest of the data of the node/edge type, they are provided here. For example, you could have each feature in its own file source:

# Example node config with multiple features
{
    # This is where the node structure data exist, just need an id col in these files
    "data": {
        "format": "parquet",
        "files": ["path/to/node_ids"]
    },
    "column": "node_id",
    "type": "my_node_type",
    "features": [
        # Feature 1
        {
            "column": "feature_one",
            # The files contain one "node_id" col and one "feature_one" col
            "data": {
                "format": "parquet",
                "files": ["path/to/feature_one/"]
            }
        },
        # Feature 2
        {
            "column": "feature_two",
            # The files contain one "node_id" col and one "feature_two" col
            "data": {
                "format": "parquet",
                "files": ["path/to/feature_two/"]
            }
        }
    ]
}

The file source needs to contain the column names of the parent node/edge type to allow a 1-1 mapping between the structure and feature files. For nodes, the feature files need to have one column named with the node id column name (the value of "column" for the parent node type); for edges, we need both the source and destination columns to use as a composite key.
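Putting the above together, a single feature entry that applies a transformation with arguments could look like the following sketch, where the column name and output name are hypothetical:

{
    "column": "age",
    "name": "age_normalized",
    "transformation": {
        "name": "numerical",
        "kwargs": {
            "imputer": "median",
            "normalizer": "min-max"
        }
    }
}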
Supported transformations
In this section we describe the transformations we support. The name of the transformation is the value that would appear in the ['transformation']['name'] element of the feature configuration, with the attached kwargs for the transformations that support arguments.
no-op
Passes along the data as-is to be written to storage and used in the partitioning pipeline. The data are assumed to be single values or vectors of floats.
kwargs:

- separator (String, optional): Only relevant for CSV file sources, when a separator is used to encode vector feature values into one column. If given, the separator will be used to split the values in the column and create a vector column output. Example: for a separator '|' the CSV value 1|2|3 would be transformed to a vector, [1, 2, 3].
- truncate_dim (Integer, optional): Relevant for vector inputs. Allows you to truncate the input vector to the first truncate_dim values, which can be useful when your inputs are Matryoshka representation learning embeddings.
- out_dtype (String, optional): Specify the data type of the transformed feature. Currently we only support float32 and float64.
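For example, a no-op feature for a hypothetical pre-computed embedding column stored in CSV, with values separated by ; and truncated to the first 64 dimensions, might be configured as:

{
    "column": "precomputed_embedding",
    "transformation": {
        "name": "no-op",
        "kwargs": {
            "separator": ";",
            "truncate_dim": 64
        }
    }
}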
numerical
Transforms a numerical column using a missing data imputer and an optional normalizer.
kwargs:

- imputer (String, optional): A method to fill in missing values in the data. Valid values are: none (Default), mean, median, and most_frequent. Missing values will be replaced with the respective value computed from the data.
- normalizer (String, optional): Applies a normalization to the data, after imputation. Can take the following values:
  - none: (Default) Don't normalize the numerical values during encoding.
  - min-max: Normalize each value by subtracting the minimum value from it, and then dividing it by the difference between the maximum value and the minimum.
  - standard: Normalize each value by dividing it by the sum of all the values.
  - rank-gauss: Normalize each value using Rank-Gauss normalization. Rank-Gauss first ranks all values, converts the ranks to the -1/1 range, and applies the inverse of the error function to make the values conform to a Gaussian distribution shape. This transformation only supports a single column as input.
- out_dtype (String, optional): Specify the data type of the transformed feature. Currently we only support float32 and float64.
- epsilon (optional): Only relevant for rank-gauss, this epsilon value is added to the denominator to avoid infinite values during normalization.
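As a sketch, a numerical feature using mean imputation and Rank-Gauss normalization might look like the following, with a hypothetical column name and an illustrative epsilon value:

{
    "column": "num_citations",
    "transformation": {
        "name": "numerical",
        "kwargs": {
            "imputer": "mean",
            "normalizer": "rank-gauss",
            "epsilon": 1e-6
        }
    }
}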
multi-numerical
Column-wise transformation for vector-like numerical data using a missing data imputer and an optional normalizer.
kwargs:

- imputer (String, optional): Same as for the numerical transformation, will apply no imputation by default.
- normalizer (String, optional): Same as for the numerical transformation, no normalization is applied by default.
- separator (String, optional): Same as for the no-op transformation, used to separate numerical values in CSV input. If the input data are in Parquet format, each value in the column is assumed to be an array of floats.
- out_dtype (String, optional): Specify the data type of the transformed feature. Currently we only support float32 and float64.
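For example, a vector-valued column stored in CSV with ;-separated values could be configured as follows (column name is hypothetical):

{
    "column": "sensor_readings",
    "transformation": {
        "name": "multi-numerical",
        "kwargs": {
            "imputer": "median",
            "normalizer": "min-max",
            "separator": ";"
        }
    }
}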
bucket-numerical
Transforms a numerical column to a one-hot or multi-hot bucket representation, using bucketization. Also supports optional missing value imputation through the imputer kwarg.
kwargs:

- imputer (String, optional): A method to fill in missing values in the data. Valid values are: none (Default), mean, median, and most_frequent. Missing values will be replaced with the respective value computed from the data.
- range (List[Float], required): The range defines the start and end point of the buckets as [a, b]. It should be a list of two floats. For example, [10, 30] defines a bucketing range between 10 and 30.
- bucket_cnt (Integer, required): The number of buckets used in the bucket feature transform. GSProcessing calculates the size of each bucket as (b - a) / c, where [a, b] is the provided range and c is the bucket count, and encodes each numeric value according to the bucket it falls into. Any value less than a is considered to belong in the first bucket, and any value greater than b is considered to belong in the last bucket.
- slide_window_size (Integer or Float, optional): slide_window_size can be used to make numeric values fall into more than one bucket, by specifying a slide-window size s, where s can be an integer or float. GSProcessing then transforms each numeric value v of the property into a range from v - s/2 through v + s/2, and assigns the value v to every bucket that the range covers.
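As an illustration, a bucket-numerical feature for a hypothetical age column, with 5 buckets over the range [0, 100] and a sliding window of 10, could be configured as:

{
    "column": "age",
    "transformation": {
        "name": "bucket-numerical",
        "kwargs": {
            "imputer": "mean",
            "range": [0, 100],
            "bucket_cnt": 5,
            "slide_window_size": 10
        }
    }
}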
categorical
Transforms values from a fixed list of possible values (categorical features) to a one-hot encoding. The length of the resulting vector will be the number of categories in the data minus one, with a 1 in the index of the single category, and zero everywhere else.
Note
The maximum number of categories for any categorical feature is 100. If a property has more than 100 distinct values, only the 99 most common ones are placed in distinct categories, and the rest are placed in a special category named OTHER.
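For example, a categorical feature over a hypothetical genre column needs no kwargs:

{
    "column": "genre",
    "transformation": {
        "name": "categorical"
    }
}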
multi-categorical
Encodes vector-like data from a fixed list of possible values (i.e. multi-label/multi-categorical data) using a multi-hot encoding. The length of the resulting vector will be the number of categories in the data minus one, and each value will have a 1 for every category that appears, and 0 everywhere else.

kwargs:

- separator (String, optional): Same as the one in the no-op transformation, the separator is used to split multiple input values for CSV files, e.g. detective|noir. If it is not provided, then the whole value will be considered as an array. For Parquet files, if the input type is ArrayType(StringType()), then the separator is ignored; if it is StringType(), the same logic as for CSV applies.
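As an example, a multi-categorical feature for a hypothetical genres column with |-separated values in CSV could be configured as:

{
    "column": "genres",
    "transformation": {
        "name": "multi-categorical",
        "kwargs": {
            "separator": "|"
        }
    }
}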
huggingface
Transforms a text feature column to tokens or embeddings with different Hugging Face models, enabling nuanced understanding and processing of natural language data.
kwargs:

- action (String, required): Currently we support embedding creation using HuggingFace models, where the input text is transformed to a vector representation, or tokenization of the text using HuggingFace tokenizers, where the output is a tokenized version of the text to be used downstream as input to a HuggingFace model during training.
  - tokenize_hf: Tokenize text strings with a HuggingFace tokenizer. The tokenizer can use any HuggingFace LM model available in the HuggingFace model repository. You can find more information about tokenization in the HuggingFace AutoTokenizer docs. The expected input are text strings, and the expected output will include input_ids for the token IDs of the input text, attention_mask for a mask to avoid performing attention on padding token indices, and token_type_ids for segmenting two sentences in models. The output here is compatible with GraphStorm language model training and inference pipelines.
  - embedding_hf: Encode text strings with a HuggingFace embedding model. The value can be any HuggingFace language model available in the HuggingFace model repository, e.g. bert-base-uncased. The expected input are text strings, and the expected output will be the vector embeddings for the text strings.
- hf_model (String, required): An identifier of a pre-trained model available in the Hugging Face Model Hub, e.g. bert-base-uncased. You can find all models in the HuggingFace model repository.
- max_seq_length (Integer, required): Specifies the maximum number of tokens of the input. You can use a length greater than the dataset's longest sentence, or for a safe value choose 128. Make sure to check the model's maximum supported length when setting this value.
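For illustration, a text feature that creates embeddings from a hypothetical abstract column using bert-base-uncased might be configured as:

{
    "column": "abstract",
    "transformation": {
        "name": "huggingface",
        "kwargs": {
            "action": "embedding_hf",
            "hf_model": "bert-base-uncased",
            "max_seq_length": 128
        }
    }
}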
edge_dst_hard_negative
Encodes a hard negative edge feature for link prediction. For detailed information on hard negative support, please refer to Hard Negative sampling.

kwargs:

- separator (String, optional): The separator is used to split multiple values in an input string for data in CSV files, e.g. p0;s1. If it is not provided, then the whole value will be treated as a single string.
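As a sketch, such a feature on an edge type, where a hypothetical hard_neg column holds ;-separated destination node IDs, could be configured as:

{
    "column": "hard_neg",
    "transformation": {
        "name": "edge_dst_hard_negative",
        "kwargs": {
            "separator": ";"
        }
    }
}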
Creating a graph for multi-task training
To create a graph for multi-task training, you need to define custom mask names in your config for each of the labels you want to use during training, using a mask_field_names entry in each label config. GSProcessing will generate separate train/val/test masks for each of your labels, named accordingly. After partitioning the data, you then need to provide the same mask names in a mask_fields entry in your train YAML file during multi-task training. For details on running multi-task training see Multi-task Learning in GraphStorm.
Here we list an example multi-task GSProcessing config for the ACM data described in /tutorials/use-own-data, where we prepare a node classification label for the paper node type, and a link prediction task on the paper,citing,paper edge type.
Example multi-task GSProcessing config
{
"version": "gsprocessing-v1.0",
"graph": {
"nodes": [
{
"data": {
"format": "parquet",
"files": [
"nodes/paper.parquet"
]
},
"type": "paper",
"column": "node_id",
"features": [
{
"column": "feat",
"name": "feat",
"transformation": {
"name": "no-op"
}
}
],
"labels": [
{
"column": "label",
"type": "classification",
"split_rate": {
"train": 0.8,
"val": 0.1,
"test": 0.1
},
"mask_field_names": [
"train_mask_class",
"val_mask_class",
"test_mask_class"
]
}
]
}
],
"edges": [
{
"data": {
"format": "parquet",
"files": [
"edges/paper_citing_paper.parquet"
]
},
"source": {
"column": "source_id",
"type": "paper"
},
"dest": {
"column": "dest_id",
"type": "paper"
},
"relation": {
"type": "citing"
},
"labels": [
{
"column": "",
"type": "link_prediction",
"split_rate": {
"train": 0.8,
"val": 0.1,
"test": 0.1
},
"mask_field_names": [
"train_mask_lp",
"val_mask_lp",
"test_mask_lp"
]
}
]
}
]
}
}
When using the above GSProcessing config you can then use the following train YAML file to run multi-task training:
Example multi-task training YAML
---
version: 1.0
gsf:
  basic:
    model_encoder_type: rgcn
    backend: gloo
    verbose: false
  gnn:
    fanout: "50,50"
    num_layers: 2
    hidden_size: 256
    use_mini_batch_infer: false
  hyperparam:
    dropout: 0.
    lr: 0.0001
    lm_tune_lr: 0.0001
    num_epochs: 300
    batch_size: 1024
    wd_l2norm: 0
    alpha_l2norm: 0.
  rgcn:
    num_bases: -1
    use_self_loop: true
  multi_task_learning:
    - node_classification:
        target_ntype: "paper"
        label_field: "label"
        mask_fields:
          - "train_mask_class"
          - "val_mask_class"
          - "test_mask_class"
        num_classes: 14
        task_weight: 1.0
    - link_prediction:
        num_negative_edges: 4
        num_negative_edges_eval: 100
        train_negative_sampler: joint
        train_etype:
          - "paper,citing,paper"
        mask_fields:
          - "train_mask_lp"
          - "val_mask_lp"
          - "test_mask_lp"
        reverse_edge_types_map: ["paper,citing,cited,paper"]
        task_weight: 0.5 # weight of the task
Creating a graph for inference
If no label entries are provided for any of the entries in the input configuration, the processed data will not include any train/val/test masks. You can use this mode when you want to produce a graph just for inference.
Examples
Node classification on the field label of paper nodes in the OAG-Paper dataset
{
"version" : "gsprocessing-v1.0",
"graph" : {
"edges" : [
{
"data": {
"format": "csv",
"files": [
"edges.csv"
],
"separator": ","
},
"source": {"column": "~from", "type": "paper"},
"dest": {"column": "~to", "type": "paper"},
"relation": {"type": "cites"}
}
],
"nodes" : [
{
"type": "paper",
"column": "ID",
"data": {
"format": "csv",
"separator": ",",
"files": [
"node_feat.csv"
]
},
"features": [
{
"column": "n_citation",
"transformation": {
"name": "numerical",
"kwargs": {
"imputer": "mean",
"normalizer": "min-max"
}
}
}
],
"labels": [
{
"column": "field",
"type": "classification",
"separator": ";",
"split_rate": {
"train": 0.7,
"val": 0.1,
"test": 0.2
}
}
]
}
]
}
}