GSPureLMNodeInputLayer

class graphstorm.model.GSPureLMNodeInputLayer(g, node_lm_configs, num_train=0, lm_infer_batch_size=16, use_fp16=True, cached_embed_path=None, wg_cached_embed=False)

Bases: GSNodeInputLayer

The node encoder input embedding layer that supports language models (LMs) only.

This input layer only has the LM layer and requires all node types to have textual features. The output dimension will be the same as the output dimension of the LM.

Use GSLMNodeEncoderInputLayer if there are extra node features or a different output dimension is required.
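Because every node type must have textual features, it can help to confirm that the LM configurations cover all node types before building the layer. The following is a minimal sketch, not part of GraphStorm: missing_lm_ntypes is a hypothetical helper, and it assumes each configuration lists its node types under the "node_types" key, as in the example on this page.

```python
def missing_lm_ntypes(graph_ntypes, node_lm_configs):
    """Return the node types not covered by any LM config (hypothetical helper)."""
    covered = set()
    for cfg in node_lm_configs:
        # Each config is assumed to carry a "node_types" list.
        covered.update(cfg["node_types"])
    return sorted(set(graph_ntypes) - covered)


node_lm_configs = [
    {
        "lm_type": "bert",
        "model_name": "bert-base-uncased",
        "gradient_checkpoint": True,
        "node_types": ["ntype1", "ntype2"],
    }
]

# In practice the node types would come from the graph, e.g. g.ntypes.
print(missing_lm_ntypes(["ntype1", "ntype2"], node_lm_configs))
print(missing_lm_ntypes(["ntype1", "ntype2", "ntype3"], node_lm_configs))
```

An empty result means every node type is covered; a non-empty result suggests that GSLMNodeEncoderInputLayer may be the better fit.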

Parameters

g: DistGraph

The input DGL distributed graph.

node_lm_configs: LM config

A list of language model configurations.

num_train: int

The number of nodes with textual features used for LM fine-tuning in a mini-batch. Default: 0.

lm_infer_batch_size: int

The batch size used to compute text embeddings with a static LM. Default: 16.

use_fp16: bool

Whether to use float16 to store LM embeddings. Default: True.

cached_embed_path: str

The path where the generated LM embeddings are cached. Default: None.
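To gauge the effect of use_fp16 on the size of the stored LM embeddings, a rough estimate can be computed as below. This is a back-of-the-envelope sketch under stated assumptions: the node count is made up, 768 is the hidden size of bert-base-uncased, and the estimate assumes float32 storage (4 bytes per value) when use_fp16 is False. The helper embedding_cache_bytes is hypothetical and not a GraphStorm function.

```python
def embedding_cache_bytes(num_nodes: int, embed_dim: int, use_fp16: bool = True) -> int:
    """Estimate the bytes needed to store one embedding per node (hypothetical helper)."""
    # float16 takes 2 bytes per value; float32 (assumed alternative) takes 4.
    bytes_per_value = 2 if use_fp16 else 4
    return num_nodes * embed_dim * bytes_per_value


num_nodes = 1_000_000  # assumed number of text nodes
embed_dim = 768        # hidden size of bert-base-uncased

fp16_bytes = embedding_cache_bytes(num_nodes, embed_dim, use_fp16=True)
fp32_bytes = embedding_cache_bytes(num_nodes, embed_dim, use_fp16=False)

print(f"fp16: {fp16_bytes / 2**30:.2f} GiB")  # fp16: 1.43 GiB
print(f"fp32: {fp32_bytes / 2**30:.2f} GiB")  # fp32: 2.86 GiB
```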

Examples:

from graphstorm.model import GSgnnNodeModel, GSPureLMNodeInputLayer
from graphstorm.dataloading import GSgnnData

node_lm_configs = [
    {
        "lm_type": "bert",
        "model_name": "bert-base-uncased",
        "gradient_checkpoint": True,
        "node_types": ['ntype1', 'ntype2']
    }
]
np_data = GSgnnData(...)
model = GSgnnNodeModel(...)
# Number of text nodes per mini-batch used to fine-tune the LM.
lm_train_nodes = 10
encoder = GSPureLMNodeInputLayer(g=np_data.g,
                                 node_lm_configs=node_lm_configs,
                                 num_train=lm_train_nodes)
model.set_node_input_encoder(encoder)

get_general_dense_parameters()

Get dense layers’ model parameters of this node encoder input layer.

Returns

list: an empty list. There are no dense parameters in this type of input layer.

get_lm_dense_parameters()

Get the language model related parameters.

Returns

list of Tensors: the language model related parameters.

prepare(g)

Prepare the input layer for training or inference.

If the number of nodes used for LM fine-tuning (num_train) is zero, freeze this layer.

freeze(_)

Generate LM cache.

The LM cache is used in the following cases:

  1. The LM does not need fine-tuning, i.e., num_train == 0. In this case, the LM cache is generated only once, before model training.

  2. GNN warm-up, when lm_freeze_epochs > 0 (controlled by the trainer). The emb_cache is generated before model training. During the first lm_freeze_epochs epochs, the number of nodes with text features used for LM fine-tuning is set to 0, and the LM cache is not refreshed.

  3. If num_train > 0, no emb_cache is used, except during the warm-up described in Case 2.
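The three cases above can be summarized as a small decision function. This is an illustrative sketch only, not GraphStorm's implementation: lm_cache_in_use and its epoch argument are hypothetical names, and the actual switching is done by the trainer through freeze() and unfreeze().

```python
def lm_cache_in_use(num_train: int, epoch: int = 0, lm_freeze_epochs: int = 0) -> bool:
    """Return True if the LM cache would be used in the given epoch (hypothetical helper)."""
    # Case 2: during GNN warm-up the LM is frozen, so the cache is used
    # regardless of num_train.
    if epoch < lm_freeze_epochs:
        return True
    # Case 1: no fine-tuning at all, so the cache generated before training is used.
    # Case 3: fine-tuning is active (num_train > 0), so no cache is used.
    return num_train == 0


print(lm_cache_in_use(num_train=0))                                # True  (Case 1)
print(lm_cache_in_use(num_train=10, epoch=0, lm_freeze_epochs=2))  # True  (Case 2)
print(lm_cache_in_use(num_train=10, epoch=2, lm_freeze_epochs=2))  # False (Case 3)
```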

unfreeze()

Disable LM caching.

If num_train > 0 and the LM cache is not in use, clear the existing LM cache.

require_cache_embed()

Ask to cache the embeddings for inference.

Returns

bool: True if the embeddings should be cached for inference.

forward(input_feats, input_nodes)

Input layer forward computation.

The forward function computes only the LM embeddings and ignores any input node features that are provided.

Parameters

input_feats: dict of Tensor

The input features in the format of {ntype: feats}.

input_nodes: dict of Tensor

The input node indexes in the format of {ntype: indexes}.

Returns

embs: a dict of Tensor

The projected node embeddings in the format of {ntype: emb}.

property in_dims

Return the LM embedding size.

The LM embeddings are usually pre-computed as node features, so the LM embedding size is treated as the input node feature size.

property out_dims

Return the LM embedding size.