Real-time Inference on Amazon SageMaker

GraphStorm CLIs for model inference on single machine, distributed clusters, and Amazon SageMaker are designed to handle large datasets, which could take minutes to hours for predicting a large number of target nodes/edges or generating embeddings for all nodes. This is typically referred as offline inference, where the model processes a large batch of data at once and the results will be used downstream at a later point.

However, in certain use cases such as recommendation systems, social network analysis, and fraud detection, there is a need for real-time predictions. For instance, you may only want to predict a few targets and expect to get results immediately to perform an action in response to a user request. In these scenarios, you will need to deploy a server to host the trained model and respond to inference requests in real time. This is known as online inference, where the model is constantly available to make predictions on new data as it comes in, ensuring immediate responses for time-sensitive applications.

Since version 0.5, GraphStorm offers the capability to deploy a trained model as a SageMaker real-time inference endpoint. The following sections provide details of how to deploy an endpoint, and how to invoke it.

Prerequisites

In order to use GraphStorm real-time inference on Amazon SageMaker, users need to have an AWS account and access to the following AWS services.

SageMaker service. Needed to deploy endpoint, and optionally train models. Please refer to Amazon SageMaker service for how to get access to Amazon SageMaker.
Amazon ECR. Needed to store GraphStorm Sagemaker Docker images. Please refer to Amazon Elastic Container Registry service for how to get access to Amazon ECR.
S3 service. Needed for input and output for SageMaker. Please refer to Amazon S3 service for how to get access to Amazon S3.

Setup GraphStorm Real-time Inference Docker Image

Similarly to GraphStorm model training and inference on SageMaker, you will need to build a GraphStorm Docker image specifically for real-time inference. In addition, your executing role should have full ECR access to be able to pull from ECR to build the image, create an ECR repository if it doesn’t exist, and push the real-time inference image to the repository. See the official ECR docs for details.

In short you can run the following:

git clone https://github.com/awslabs/graphstorm.git
cd graphstorm/

# Build the GraphStorm real-time inference Docker image to be used on CPUs
bash docker/build_graphstorm_image.sh --environment sagemaker-endpoint --device cpu

# Will push an image to '123456789012.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sagemaker-endpoint-cpu'
bash docker/push_graphstorm_image.sh --environment sagemaker-endpoint --device cpu --region "us-east-1" --account "123456789012"

Replace the 123456789012 with your own AWS account ID. For more build and push options, see bash docker/build_graphstorm_image.sh --help and bash docker/push_graphstorm_image.sh --help.

Note

When comparing CPU instances to GPU instances for real-time inference, CPU instances prove more cost-effective while maintaining similar inference latency to GPU instances. While CPU instances are generally recommended for real-time inference, users are encouraged to run their own benchmarks and cost analysis to make the final decision between CPU and GPU instances based on their particular workload needs.

Deploy a SageMaker Real-time Inference endpoint

To deploy a SageMaker real-time inference endpoint, you will need three model artifacts that were generated during graph construction (GConstruct/GSProcessing) and model training.

The saved model path that contains the model.bin file. This path has the same purpose as the --restore-model-path used during offline inference CLIs in which GraphStorm looks for the model.bin file, and uses it to restore the trained model weights.
The updated graph construciton JSON file, data_transform_new.json. This JSON file is one of the outputs of graph construction pipeline. It contains updated information about feature transformations and feature dimensions. If using the Single Machine Graph Construction (GConstruct), the file is saved at the path specified by the --output-dir argument. For Distributed Graph Construction (GSProcessing), the file is saved at the path specified by either --output-data-s3 or --output-dir argument.
The updated model training configuration YAML file, GRAPHSTORM_RUNTIME_UPDATED_TRAINING_CONFIG.yaml. This YAML file is one of the outputs of model training. It contains the updated configurations of a model by replacing values of configuration YAML file with values given in the training CLI arguments. If set --save-model-path or --model-artifact-s3 configuration in model training, this updated YAML file will be saved to the location specified.

Note

Starting with v0.5, GraphStorm will save both updated JSON and YAML files into the same location as the trained model automatically, if the --save-model-path or --model-artifact-s3 configuration is set.

GraphStorm provides a helper script to package these model artifacts as a tar file and upload it to an S3 bucket, and then use SageMaker APIs with the inference Docker image previously built to deploy endpoint(s).

In short you can run the following:

# assume graphstorm source code has been cloned to the current folder
cd graphstorm/sagemaker/launch
python launch_realtime_endpoint.py \
        --image-uri <account_id>.dkr.ecr.<region>.amazonaws.com/graphstorm:sagemaker-endpoint-cpu \
        --role arn:aws:iam::<account_id>:role/<your_role> \
        --region <region> \
        --restore-model-path <restore-model-path>/<epoch-XX-iter-XX> \
        --model-yaml-config-file <restore-model-path>/GRAPHSTORM_RUNTIME_UPDATED_TRAINING_CONFIG.yaml \
        --graph-json-config-file <restore-model-path>/data_transform_new.json \
        --infer-task-type <task_type> \
        --upload-tarfile-s3 s3://<a-bucket> \
        --model-name <model-name>

Arguments of the launch endpoint script include:

--image-uri (Required): the URI of your GraphStorm real-time inference Docker image built and pushed in the previous Setup GraphStorm Real-time Inference Docker Image step.
--region (Required): the AWS region to deploy endpoint. This region should be same as the ECR where your Docker image is stored.
--role (Required): the role ARN that has SageMaker execution role. Please refer to the SageMaker AI document section for details.
--restore-model-path (Required): a local folder path where the model.bin file is saved.
--model-yaml-config-file (Required): a local file path where the updated model configuration YAML file is saved.
--graph-json-config-file (Required): a local file path where the updated graph construction configuration JSON file is saved.
--upload-tarfile-s3 (Required): an S3 prefix for uploading the packed and compressed model artifacts tar file.
--infer-task-type (Required): the name of real-time inference task. Options include node_classification and node_regression.
--instance-type: the instance type to be used for endpoints. (Default: ml.c6i.xlarge)
--instance-count: the number of endpoints to be deployed. (Default: 1)
--custom-production-variant: dictionary string that includes custom configurations of the SageMaker ProductionVariant. For details, please refer to ProductionVariant Documentation.
--async-execution: the mode of endpoint creation. Set True to deploy endpoint asynchronously, or False to wait for deployment to finish before exiting. (Default: False)
--model-name: the name of model. This name will be used to define names of SageMaker Model, EndpointConfig, and Endpoint by appending datetime to this model name. The name should follow a regular expression pattern: ^[a-zA-Z0-9]([\-a-zA-Z0-9]*[a-zA-Z0-9])$ as defined in SageMaker’s CreateEndpoint document. (Default: GSF-Model4Realtime)
--log-level: the level of log. Optional values include ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’, and ‘CRITICAL’. Default is ‘INFO’.
--vpc-subnet-ids: Optional list of VPC subnet ids to deploy the endpoint into. Needed to deploy the endpoint in the same VPC as e.g. a Neptune Database instance for online inference. When provided, you also need to provide --vpc-security-group-ids.
--vpc-security-group-ids: Optional list of security group ids to attach to the endpoint. Needed to allow the endpoint to communicate with restricted resources like a Neptune Database instance. Needs to be provided together with --vpc-subnet-ids.

Outputs of this command include the deployed endpoint name based on the value for --model-name, e.g., GSF-Model4Realtime-Endpoint-2025-06-04-23-47-11, and the region based on the value for --region. This endpoint name and region will be used in the invoke step. The endpoint name can also be found from Amazon SageMaker AI Web console under the “Inference -> Endpoints” menu at the region.

Invoke Real-time Inference Endpoints

For real-time inference, you will need to extract a subgraph around the target nodes/edges from a large graph, and use the subgraph as input of model, which is similar to how models are trained. Because time is critical for real-time inference, it is recommended to use an OLTP graph database, e.g., Amazon Neptune Database, as data source for subgraph extraction.

Once the subgraph is extracted, you will need to prepare it as the payload of different APIs for invoke models for real-time inference. GraphStorm defines a specification of the payload contents for your reference.

Below is an example payload JSON object for node classification inference.

{
    "version" : "gs-realtime-v0.1",
    "gml_task": "node_classification",
    "graph": {
        "nodes" : [
            {
                "node_type" : "author",
                "node_id"   : "a4444",
                "features"  : {"feat_num" : [ 0.011269339360296726, ......, ],
                               "feat_cat" : "UK"},
            },
            {
                "node_type" : "author",
                "node_id"   : "a39",
                "features"  : {"feat_num": [-0.0032965524587780237, ......, ],
                               "feat_cat" : "USA"},
            },
            ......
        ],
        "edges" : [
            {
                "edge_type" : [
                    "author",
                    "writing",
                    "paper"
                ],
                "src_node_id"   : "p4463",
                "dest_node_id"  : "p4463",
                "features"      : {"feat1": [ 1.411269339360296726, ......, ]},
                                   "feat2" : "1980s"},
            },
            ......
        ]
    },
    "targets" : [
        {
            "node_type" : "author",
            "node_id"   : "a39"
        }
    ]
}

Invoke endpoints

There are multiple ways to invoke a Sagemaker real-time inference endpoint as documented in SageMaker Developer Guide.

Note

GraphStorm real-time inference endpoints currently only support “application/json” as content type.

Here is an example of how you can read a payload from a JSON file and use the boto3 APIs to invoke an endpoint.

import json
import boto3


# Create a SageMaker client object,
sagemaker = boto3.client('sagemaker')
# Create a SageMaker runtime client object using your IAM role ARN\n",
runtime = boto3.client('sagemaker-runtime',
                       region_name='aws region') # e.g., us-east-1
endpoint_name='your endpoint name'    # e.g., GSF-Model4Realtime-Endpoint-2025-07-11-21-44-36
# load payload from a JSON file
with open('subg.json', 'r') as f:
     payload = json.load(f)
content_type = 'application/json'

# invoke endpoint
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType=content_type,
    )
# Decodes and prints the response body
print(response['Body'].read().decode('utf-8'))

Endpoint invoke response

As shown in the previous example, the response from a GraphStorm real-time inference endpoint will include a JSON object in the Body field of the SageMaker API response. The details of the response syntax can be found in the specification of real-time inference response.

An example of a successful inference response:

{
    "status_code" : 200,
    "request_uid" : "569d90892909c2f8",
    "message"     : "Request processed successfully.",
    "error"       : "",
    "data"        : {
                      "results" : [
                          {
                              "node_type" : "paper",
                              "node_id"   : "p9604",
                              "prediction": [
                                  0.03836942836642265,
                                  0.06707385182380676,
                                  0.11153795570135117,
                                  0.027591131627559662,
                                  0.03496604412794113,
                                  0.11081098765134811,
                                  0.005487487651407719,
                                  0.027667740359902382,
                                  0.11663214862346649,
                                  0.11842530965805054,
                                  0.020509174093604088,
                                  0.031869057565927505,
                                  0.27694952487945557,
                                  0.012110156007111073
                              ]
                          },
                          {
                              "node_type" : "paper",
                              "node_id"   : "p8946",
                              "prediction": [
                                  0.03848873823881149,
                                  0.06991259753704071,
                                  0.057228244841098785,
                                  0.02898392826318741,
                                  0.046037621796131134,
                                  0.09567245841026306,
                                  0.008081010542809963,
                                  0.02855496294796467,
                                  0.2774551510810852,
                                  0.07382062822580338,
                                  0.03699302300810814,
                                  0.047642651945352554,
                                  0.1794610172510147,
                                  0.011668065562844276
                              ]
                          }
                      ]
    }
}

In this example response for a classification task, the prediction field contains the logits of model predictions. You can use classification method, e.g., argmax, to get the final class. For example, the argmax result of the prediction of the paper node p9604 is 12, which means this paper is predicted to belong to the 13th class as classes are normally one-hot encoded from 0.

An example of an error response:

{
    "status_code"   : 401,
    "request_uid"   : "d3f2eaea2c2c7c76",
    "message"       : "",
    "error"         : "Missing Required Field: The input payload missed the 'targets' field. Please refer to the
                      GraphStorm realtime inference documentation for required fields.",
    "data"          : {}
}

This example response provides a feedback to the invoker, explaining there is an error (401) in the payload content. And the detailed reason is that the request payload should include a targets field.