Link Prediction Learning in GraphStorm
Link prediction is widely used in industry as a pre-training method to produce high-quality entity representations. However, performing link prediction training on large graphs is non-trivial, both in terms of model performance and training efficiency. GraphStorm offers a wide array of options for users to customize their link prediction model training from both the model performance and the training efficiency standpoints.
Optimizing model performance
GraphStorm incorporates three ways of improving the model performance of link prediction. Firstly, GraphStorm avoids information leakage in model training. Secondly, to better handle heterogeneous graphs, GraphStorm provides four ways to compute link prediction scores: dot product, DistMult, TransE, and RotatE. Thirdly, GraphStorm provides several options to compute training losses, including cross entropy loss, contrastive loss, Bayesian personalized ranking loss, and their weighted and adversarial variants. The following sub-sections provide more details.
Prevent Information Leakage in Link Prediction
Implementing a training loop for link prediction tasks needs to carefully handle the information leakage problems caused by 1) including target edges in message passing, and 2) including validation/test edges in message passing during training. (This paper provides more details.)
GraphStorm provides support to avoid these problems:
- To avoid including target edges in message passing, users need to set exclude_training_targets to True and provide reverse_edge_types_map when launching link prediction training tasks. These two arguments tell GraphStorm to exclude the training target edges and the corresponding reverse edges when doing message passing. More explanation of the two arguments can be found in the Training and Inference Configurations. (A minimal configuration sketch is shown after this list.)
- To avoid including validation/test edges in message passing during model training, users need to mask validation edges and test edges with val_mask and test_mask respectively. Users also need to mask all the other edges with train_mask.
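A minimal sketch of the relevant YAML options, assuming hypothetical edge types; see the Training and Inference Configurations for the exact option placement in the training YAML:

```yaml
exclude_training_targets: true
# Each entry maps an edge type to its reverse edge type, in the form
# "src_ntype,etype,reverse_etype,dst_ntype" (the types below are hypothetical).
reverse_edge_types_map: ["user,clicks,rev-clicks,item"]
```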
Computing Link Prediction Scores
GraphStorm provides four ways to compute link prediction scores: Dot Product, DistMult, TransE, and RotatE.
- Dot Product: The dot product score function is defined as:

  \[score = \mathrm{sum}(head\_emb * tail\_emb)\]

  where head_emb is the node embedding of the head node and tail_emb is the node embedding of the tail node. (A small code sketch of the dot product and DistMult scores appears after this list.)

- DistMult: The DistMult score function is defined as:

  \[score = \mathrm{sum}(head\_emb * relation\_emb * tail\_emb)\]

  where head_emb is the node embedding of the head node, tail_emb is the node embedding of the tail node, and relation_emb is the relation embedding of the specific edge type. The relation_emb values are initialized from a uniform distribution within the range of (-gamma/hidden_size, gamma/hidden_size), where gamma and hidden_size are hyperparameters defined in Model Configurations.

- TransE: The TransE score function is defined as:

  \[score = \gamma - \|head\_emb + relation\_emb - tail\_emb\|_{1} \quad \text{or} \quad score = \gamma - \|head\_emb + relation\_emb - tail\_emb\|_{2}\]

  depending on whether the L1 or the L2 norm is used. head_emb is the node embedding of the head node, tail_emb is the node embedding of the tail node, and relation_emb is the relation embedding of the specific edge type. The relation_emb values are initialized from a uniform distribution within the range of (-gamma/(hidden_size/2), gamma/(hidden_size/2)), where gamma and hidden_size are hyperparameters defined in Model Configurations. For more information about TransE, please refer to the DGL-KE documentation.

- RotatE: The RotatE score function is defined as:

  \[score = \gamma - \|head\_emb \circ relation\_emb - tail\_emb\|^2\]

  where head_emb is the node embedding of the head node, tail_emb is the node embedding of the tail node, relation_emb is the relation embedding of the specific edge type, and \(\circ\) is the element-wise product. The relation_emb values are initialized from a uniform distribution within the range of (-gamma/(hidden_size/2), gamma/(hidden_size/2)), where gamma and hidden_size are hyperparameters defined in Model Configurations. For more information about RotatE, please refer to the DGL-KE documentation.
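The following PyTorch sketch (not GraphStorm's internal implementation) illustrates how the dot product and DistMult scores defined above can be computed for a batch of candidate edges:

```python
import torch

def dot_product_score(head_emb: torch.Tensor, tail_emb: torch.Tensor) -> torch.Tensor:
    # score = sum(head_emb * tail_emb), computed per candidate edge
    return (head_emb * tail_emb).sum(dim=-1)

def distmult_score(head_emb: torch.Tensor,
                   relation_emb: torch.Tensor,
                   tail_emb: torch.Tensor) -> torch.Tensor:
    # score = sum(head_emb * relation_emb * tail_emb), computed per candidate edge
    return (head_emb * relation_emb * tail_emb).sum(dim=-1)

# Example: 4 candidate edges with 128-dimensional embeddings.
h, r, t = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128)
print(dot_product_score(h, t).shape)  # torch.Size([4])
print(distmult_score(h, r, t).shape)  # torch.Size([4])
```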
Link Prediction Loss Functions
GraphStorm provides several options to compute training losses:
- Cross Entropy Loss: The cross entropy loss turns a link prediction task into a binary classification task. We treat positive edges as 1 and negative edges as 0. The loss of an edge e is:

  \[loss = - \left[ y \cdot \log score + (1 - y) \cdot \log (1 - score) \right]\]

  where y is 1 when e is a positive edge and 0 when it is a negative edge, and score is the score value of e computed by the score function.

- Weighted Cross Entropy Loss: The weighted cross entropy loss is similar to Cross Entropy Loss except that it allows users to set a weight for each positive edge. The loss of an edge e is:

  \[loss = - w\_e \left[ y \cdot \log score + (1 - y) \cdot \log (1 - score) \right]\]

  where y is 1 when e is a positive edge and 0 when it is a negative edge, score is the score value of e computed by the score function, and w_e is the weight of e, defined as

  \[\begin{split}w\_e = \left \{ \begin{array}{lc} 1, & \text{ if } e \in G, \\ 0, & \text{ if } e \notin G \end{array} \right.\end{split}\]

  where G is the training graph.

- Adversarial Cross Entropy Loss: The adversarial cross entropy loss turns a link prediction task into a binary classification task. We treat positive edges as 1 and negative edges as 0. In addition, the adversarial cross entropy loss adjusts the loss value of each negative sample based on its degree of difficulty. This is enabled by setting the adversarial_temperature config.

  The loss of positive edges is:

  \[loss_{pos} = - \log score\]

  where score is the score value of the positive edges computed by the score function.

  The loss of negative edges is:

  \[\begin{split}\begin{gather*} loss_{neg} = -\log (1 - score) \\ loss_{neg} = \mathrm{softmax}(score * adversarial\_temperature) * loss_{neg} \end{gather*}\end{split}\]

  where score is the score value of the negative edges computed by the score function and adversarial_temperature is a hyper-parameter.

  The final loss is:

  \[loss = \dfrac{\mathrm{avg}(loss_{pos}) + \mathrm{avg}(loss_{neg})}{2}\]

- Weighted Adversarial Cross Entropy Loss: The weighted adversarial cross entropy loss is similar to Adversarial Cross Entropy Loss except that it allows users to set a weight for each positive edge. The loss of a positive edge e is:

  \[loss_{pos} = - w \cdot \log score\]

  where score is the score value of the positive edges computed by the score function and w is the weight of each positive edge. The loss of the negative edges is the same as in Adversarial Cross Entropy Loss. The final loss is:

  \[loss = \dfrac{\mathrm{avg}(loss_{pos}) + \mathrm{avg}(loss_{neg})}{2}\]

- Bayesian Personalized Ranking Loss: Bayesian personalized ranking (BPR, https://arxiv.org/abs/1205.2618) is a pairwise personalized ranking loss derived from the maximum posterior estimator. The loss is defined as:

  \[loss = - \log\left(\frac{ 1 }{ 1 + \exp(-x)}\right)\]

  where x is the distance between the score of a positive edge and the score of its corresponding negative edge, defined as:

  \[x = positive\_score - negative\_score\]

- Weighted Bayesian Personalized Ranking Loss: The weighted Bayesian personalized ranking loss is similar to Bayesian Personalized Ranking Loss except that it allows users to set a weight for each positive edge. The loss is defined as:

  \[loss = - w\_e \left[ \log\left(\frac{ 1 }{ 1 + \exp(-x)}\right) \right]\]

  where x is the distance between the score of a positive edge and the score of its corresponding negative edge, and w_e is the weight of the positive edge.

- Contrastive Loss: The contrastive loss encourages the representations of connected nodes to be similar while forcing the representations of disconnected nodes to be dissimilar. In the implementation, we use the score computed by the score function to represent the distance between nodes. When computing the loss, we group each positive edge with the N negative edges corresponding to it (a code sketch appears after this list). The loss function is as follows:

  \[loss = -\log \left(\dfrac{\exp(pos\_score)}{\sum_{i=0}^N \exp(score\_i)} \right)\]

  where pos_score is the score of the positive edge and score_i is the score of the i-th edge. In total, there are N+1 edges, of which one is the positive edge and N are negative edges.
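As referenced above, here is a minimal PyTorch sketch (not GraphStorm's implementation) of the contrastive loss: each positive edge is grouped with its N negative edges, and the loss is a softmax cross entropy over the N+1 scores in which the positive edge is the target class:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pos_score: torch.Tensor, neg_score: torch.Tensor) -> torch.Tensor:
    # pos_score: (batch,)   score of each positive edge
    # neg_score: (batch, N) scores of the N negatives grouped with each positive edge
    logits = torch.cat([pos_score.unsqueeze(1), neg_score], dim=1)  # (batch, N + 1)
    labels = torch.zeros(logits.shape[0], dtype=torch.long)         # positive edge is class 0
    return F.cross_entropy(logits, labels)

pos = torch.randn(8)      # 8 positive edges
neg = torch.randn(8, 32)  # 32 negatives per positive edge
print(contrastive_loss(pos, neg))
```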
Selecting the Negative Sampler
GraphStorm provides a variety of negative samplers (a configuration sketch for choosing one is shown after the list):
- Uniform negative sampling: Given N training edges under edge type (src_t, rel_t, dst_t) and the number of negatives set to K, uniform negative sampling randomly samples K nodes from dst_t for each training edge. It corrupts the training edge to form K negative edges by replacing its destination node with the sampled negative nodes. In total, it will sample N * K negative nodes.
  - uniform: Uniformly sample K negative edges for each positive edge.
  - fast_uniform: Same as uniform except that the sampled subgraphs will not exclude edges with val_mask and test_mask.
  - all_etype_uniform: Same as uniform, but it ensures that each training edge type appears in every mini-batch.
- Local uniform negative sampling: Local uniform negative sampling samples negative edges in the same way as uniform negative sampling, except that all the negative nodes are sampled from the local graph partition.
  - localuniform: Uniformly sample K negative edges for each positive edge. However, the negative nodes are sampled from the local graph partition instead of being sampled globally.
  - fast_localuniform: Same as localuniform except that the sampled subgraphs will not exclude edges with val_mask and test_mask. Please see the details in Speedup Link Prediction Training.
- Joint negative sampling: Given N training edges under edge type (src_t, rel_t, dst_t) and the number of negatives set to K, joint negative sampling randomly samples K nodes from dst_t for every K training edges. For these K training edges, it corrupts each edge to form K negative edges by replacing its destination node with the same set of negative nodes. In total, it only needs to sample N negative nodes. (We suppose N is divisible by K for simplicity.)
  - joint: Sample K negative nodes for every K positive edges. The K positive edges share the same set of negative nodes.
  - fast_joint: Same as joint except that the sampled subgraphs will not exclude edges with val_mask and test_mask. Please see the details in Speedup Link Prediction Training.
  - all_etype_joint: Same as joint, but it ensures that each training edge type appears in every mini-batch.
- Local joint negative sampling: Local joint negative sampling samples negative edges in the same way as joint negative sampling, except that all the negative nodes are sampled from the local graph partition.
  - localjoint: Sample K negative nodes for every K positive edges. However, the negative nodes are sampled from the local graph partition instead of being sampled globally.
  - fast_localjoint: Same as localjoint except that the sampled subgraphs will not exclude edges with val_mask and test_mask.
- In-batch negative sampling: In-batch negative sampling creates negative edges by exchanging destination nodes between training edges. For example, suppose there are three training edges (u_1, v_1), (u_2, v_2), (u_3, v_3). In-batch negative sampling will create two negative edges (u_1, v_2) and (u_1, v_3) for (u_1, v_1), two negative edges (u_2, v_1) and (u_2, v_3) for (u_2, v_2), and two negative edges (u_3, v_1) and (u_3, v_2) for (u_3, v_3). If the batch size is smaller than the number of negatives, any of the above negative sampling methods can be used to sample the extra negative edges.
  - inbatch_joint: In-batch joint negative sampling.
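A minimal configuration sketch for selecting a negative sampler. Assuming the standard GraphStorm training YAML, the sampler and the number of negatives per positive edge are typically set with the options below (see the Training and Inference Configurations for the authoritative option names and placement):

```yaml
train_negative_sampler: joint        # e.g., uniform, joint, localuniform, inbatch_joint, fast_localuniform, ...
num_negative_edges: 16               # K: number of negatives per positive training edge
eval_negative_sampler: uniform       # sampler used to build validation/test candidate sets
num_negative_edges_eval: 1000        # number of negatives per positive edge at evaluation time
```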
Speedup Link Prediction Training
GraphStorm relies on dgl.dataloading.MultiLayerNeighborSampler and train_mask to avoid sampling validation and test edges during training. Basically, it only samples edges whose train_mask is set to True. However, this implementation is not efficient. To speed up graph sampling during link prediction training, GraphStorm provides four link prediction dataloaders (i.e., fast_uniform, fast_joint, fast_localuniform and fast_localjoint) with a more efficient implementation but less precise neighbor sampling behavior. To be more specific, these dataloaders first do neighbor sampling regardless of any masks, and then remove the sampled edges whose val_mask or test_mask is set to True. In theory, a sampled subgraph may have fewer neighbor nodes than expected, as some of them are removed. However, with a graph having hundreds of millions of edges (or more) and small validation and test sets, e.g., each containing less than 10% of the edges, the impact is negligible.

With DGL 1.0.4, the fast_localuniform dataloader can achieve a 2.4X speedup over the localuniform dataloader when training a 2-layer RGCN on the MAG dataset on four g5.48x instances.
Hard Negative sampling
GraphStorm provides support for users to define hard negative edges for a positive edge during Link Prediction training.
Currently, hard negative edges are constructed by replacing the destination nodes of edges with pre-defined hard negatives.
For example, given an edge (src_pos, dst_pos) and its hard negative destination nodes hard_0 and hard_1, GraphStorm will construct two hard negative edges, i.e., (src_pos, hard_0) and (src_pos, hard_1).
The hard negatives are stored as edge features of the target edge type.
Users can provide the hard negatives for each edge type through train_etypes_negative_dstnode in the training config yaml. For example, the following yaml block defines the hard negatives for edge type (src_type,rel_type0,dst_type) as the edge feature negative_nid_field_0 and the hard negatives for edge type (src_type,rel_type1,dst_type) as the edge feature negative_nid_field_1:
```yaml
train_etypes_negative_dstnode:
  - src_type,rel_type0,dst_type:negative_nid_field_0
  - src_type,rel_type1,dst_type:negative_nid_field_1
```
Users can also define the number of hard negatives to sample for each edge type during training through num_train_hard_negatives in the training config yaml. For example, the following yaml block sets the number of hard negatives for edge type (src_type,rel_type0,dst_type) to 5 and the number of hard negatives for edge type (src_type,rel_type1,dst_type) to 10:
```yaml
num_train_hard_negatives:
  - src_type,rel_type0,dst_type:5
  - src_type,rel_type1,dst_type:10
```
Hard negative sampling can be used together with any link prediction negative sampler, such as uniform, joint, inbatch_joint, etc. By default, GraphStorm will sample hard negatives first to fulfill the requirement of num_train_hard_negatives and then sample random negatives to fulfill the requirement of num_negative_edges. In general, GraphStorm covers the following cases (a small code sketch of this logic follows the list):
- Case 1: num_train_hard_negatives is larger than or equal to num_negative_edges. GraphStorm will only sample hard negative nodes.
- Case 2: num_train_hard_negatives is smaller than num_negative_edges. GraphStorm will randomly sample num_train_hard_negatives hard negative nodes from the hard negative set and then randomly sample num_negative_edges - num_train_hard_negatives negative nodes.
- Case 3: Some edges do not have enough hard negatives provided by users. For example, the expected num_train_hard_negatives is 10, but an edge only has 5 hard negatives. In this case, GraphStorm will use all the provided hard negatives first and then randomly sample negative nodes to fulfill the requirement of num_train_hard_negatives. It then proceeds as in Case 1 or Case 2.
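A small Python sketch (ours, not GraphStorm code) of the selection logic described by the three cases, for the negatives of a single training edge:

```python
import random

def pick_negatives(hard_candidates, num_train_hard_negatives, num_negative_edges, all_nodes):
    # Case 3: if fewer hard negatives are provided than requested, use them all and
    # top up with random nodes so that num_train_hard_negatives is met.
    hard = list(hard_candidates)
    if len(hard) < num_train_hard_negatives:
        hard += random.choices(all_nodes, k=num_train_hard_negatives - len(hard))
    # Case 1: the hard-negative budget covers all requested negatives.
    if num_train_hard_negatives >= num_negative_edges:
        return random.sample(hard, num_negative_edges)
    # Case 2: take the hard negatives, then fill the rest with random nodes.
    num_random = num_negative_edges - num_train_hard_negatives
    return random.sample(hard, num_train_hard_negatives) + random.choices(all_nodes, k=num_random)

print(pick_negatives(["n1", "n2"], num_train_hard_negatives=3, num_negative_edges=5,
                     all_nodes=[f"n{i}" for i in range(100)]))
```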
Preparing graph data for hard negative sampling
Both the single-machine and the distributed graph construction pipelines of GraphStorm support loading hard negative data from the raw input.
Hard destination negatives can be defined through the edge_dst_hard_negative transformation. The feature_col field of edge_dst_hard_negative must store the raw node ids of the hard destination nodes. The following example shows how to define a hard negative feature for edges with the relation (node1, relation1, node1):
{ ... "edges": [ ... { "source_id_col": "src", "dest_id_col": "dst", "relation": ("node1", "relation1", "node1"), "format": {"name": "parquet"}, "files": "edge_data.parquet", "features": [ { "feature_col": "hard_neg", "feature_name": "hard_neg_feat", "transform": {"name": "edge_dst_hard_negative", "separator": ";"}, } ] } ] }
The hard negative data is stored in the column named hard_neg in the edge_data.parquet file. The edge feature that stores the hard negatives will be hard_neg_feat.
GraphStorm accepts two types of hard negative inputs:
- An array of strings or integers: When the input format is Parquet, the feature_col can store string or integer arrays. In this case, each row stores a string/integer array representing the hard negative node ids of the corresponding edge. For example, the feature_col can be a 2D string array, like [["e0_hard_0", "e0_hard_1"], ["e1_hard_0"], ..., ["en_hard_0", "en_hard_1"]], or a 2D integer array (for integer node ids), like [[10,2], [3], ..., [4,12]]. It is not required for each row to have the same dimension size. GraphStorm will automatically handle the case when some edges do not have enough pre-defined hard negatives.

  For example, the file storing hard negatives should look like the following:

  src       | dst       | hard_neg
  "src_0"   | "dst_0"   | ["dst_10", "dst_11"]
  "src_0"   | "dst_1"   | ["dst_5"]
  ...
  "src_100" | "dst_41"  | ["dst_0", "dst_2"]
- A single string: The feature_col stores strings instead of string arrays (when the input format is Parquet or CSV). In this case, a separator must be provided in the transformation definition to split the strings into node ids. The feature_col will be a 1D string list, for example ["e0_hard_0;e0_hard_1", "e1_hard_1", ..., "en_hard_0;en_hard_1"]. The string length, i.e., the number of hard negatives, can vary from row to row. GraphStorm will automatically handle the case when some edges do not have enough hard negatives.

  For example, the file storing hard negatives should look like the following:

  src       | dst       | hard_neg
  "src_0"   | "dst_0"   | "dst_10;dst_11"
  "src_0"   | "dst_1"   | "dst_5"
  ...
  "src_100" | "dst_41"  | "dst_0;dst_2"
GraphStorm will automatically translate the Raw Node IDs of hard negatives into Partition Node IDs in a DistDGL graph.
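For illustration, a short pandas sketch (the column names follow the examples above; writing list-valued columns to Parquet requires pyarrow) that produces such an edge file:

```python
import pandas as pd

edges = pd.DataFrame({
    "src": ["src_0", "src_0", "src_100"],
    "dst": ["dst_0", "dst_1", "dst_41"],
    # Either a list of raw destination node ids per edge ...
    "hard_neg": [["dst_10", "dst_11"], ["dst_5"], ["dst_0", "dst_2"]],
    # ... or, alternatively, a single separator-joined string per edge:
    # "hard_neg": ["dst_10;dst_11", "dst_5", "dst_0;dst_2"],
})
edges.to_parquet("edge_data.parquet")  # referenced by "files" in the construction config
```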
Link Prediction Metrics
GraphStorm supports several metrics for link prediction, to give a well-rounded view of model performance. In general, link prediction evaluation happens by constructing a set of negative edges with one of the sampling methods described above, and including one positive edge in this set of edges, which we will refer to as the candidate set. The model assigns a score to each edge in the candidate set, and ideally the true edge is ranked at the top position when edges are ranked by scores.
We define the set of ranking scores as \(\mathcal{I}\) and the number of candidate edges as \(|\mathcal{I}|\). We refer to the ranking of a positive edge within the list as \(r\).
Mean Reciprocal Rank (MRR)
Mean Reciprocal Rank, or MRR, is a metric commonly used in link prediction evaluation that represents the ability of the model to rank the correct edge among a list of candidate edges. It is defined as:

\[\text{MRR} = \frac{1}{|\mathcal{I}|} \sum_{r \in \mathcal{I}} \frac{1}{r}\]

where \(\mathcal{I}\) is the set of rankings of the positive edges, and \(r\) corresponds to the ranking of a positive edge as determined by the score the model assigns to each edge in its candidate set.
The ideal MRR is 1.0, meaning that the positive edge is ranked first in every score list. Because the positive edge is always included in the ranking, MRR cannot take the value 0.0, so its range is \((0, 1]\). MRR values are influenced by the size of the candidate lists, so MRR can only be used to compare performance when the number of negative edges per positive edge is the same.
Hits@k
The Hits@k metric measures the fraction of times the positive edge was ranked in the top k positions by the model in the sorted score list:

\[\text{Hits@}k = \frac{|\{ r \in \mathcal{I} : r \leq k \}|}{|\mathcal{I}|}\]

This metric is easy to interpret but has the disadvantage that any position beyond the top k is not taken into account, so it does not provide the holistic view needed for cross-model comparison, and it is also sensitive to the number of negatives in the set.
Adjusted Mean Ranking Index (AMRI)
AMRI was proposed in the paper On the Ambiguity of Rank-Based Evaluation of EA or LP Methods as a metric that allows cross-model comparison by looking at the entire score list, while not being sensitive to the chosen number of negative edges per positive edge. It is defined as:

\[\text{AMRI} = 1 - \frac{\text{MR} - 1}{\mathbb{E}[\text{MR} - 1]}\]

where \(\text{MR}\) is the mean rank, and \(\mathbb{E}[\text{MR}-1]\) is the expected mean rank, which is used to adjust for chance. Its values will be in the \([-1, 1]\) range, where 1 corresponds to optimal performance, in which each individual rank of the positive edge is 1. A value of 0 indicates model performance similar to a model assigning random scores, or an equal score to every candidate. The value is negative if the model performs worse than the all-equal-score model.
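For illustration, a small NumPy sketch of these metrics (the helper functions are ours, not GraphStorm APIs), assuming 1-based ranks of the positive edge and candidate-list sizes that count the positive edge plus its negatives:

```python
import numpy as np

def mrr(ranks):
    # Mean of the reciprocal rank of the positive edge in each candidate list.
    return float(np.mean(1.0 / np.asarray(ranks, dtype=float)))

def hits_at_k(ranks, k):
    # Fraction of candidate lists in which the positive edge is ranked in the top k.
    return float(np.mean(np.asarray(ranks) <= k))

def amri(ranks, num_candidates):
    # AMRI = 1 - (MR - 1) / E[MR - 1]; under random scoring the expected rank of the
    # positive edge in a list of size |C| is (|C| + 1) / 2, so E[r - 1] = (|C| - 1) / 2.
    ranks = np.asarray(ranks, dtype=float)
    expected = (np.asarray(num_candidates, dtype=float) - 1.0) / 2.0
    return float(1.0 - np.mean(ranks - 1.0) / np.mean(expected))

ranks = [1, 3, 2, 10]          # rank of the positive edge in 4 candidate lists
sizes = [101, 101, 101, 101]   # each list contains 1 positive edge + 100 negatives
print(mrr(ranks), hits_at_k(ranks, 3), amri(ranks, sizes))
```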