Using GraphStorm with SageMaker Pipelines
GraphStorm provides integration with Amazon SageMaker Pipelines to automate and orchestrate graph machine learning workflows at scale. This guide shows you how to use the provided tools to create, configure, and execute SageMaker pipelines for graph construction, training, and inference.
Introduction
SageMaker Pipelines enable you to create automated MLOps workflows for your GraphStorm applications. Using these workflows you can:
Automate the end-to-end process of preparing graph data, training models, and running inference
Ensure reproducibility of your graph machine learning experiments
Scale your workflows efficiently using SageMaker’s managed infrastructure
Track and version your pipeline executions
Prerequisites
Before starting with GraphStorm Pipelines on SageMaker, you’ll need:
An execution environment with Python 3.8 or later
An AWS account with appropriate permissions. See the official SageMaker documentation for detailed information about required permissions.
Basic familiarity with Amazon SageMaker and SageMaker Pipelines.
Understanding of graph neural networks and the GraphStorm framework.
Setting Up Your Environment
To work with GraphStorm SageMaker pipelines, you’ll need the GraphStorm source code and a Python environment with the SageMaker SDK and AWS SDK (boto3) installed.
First, clone the GraphStorm repository and navigate to the pipeline directory:
git clone https://github.com/awslabs/graphstorm.git
cd graphstorm/sagemaker/pipeline
Install the required Python packages:
pip install sagemaker boto3
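To verify the installation, you can check that both packages import correctly (an optional sanity check):
python -c "import boto3, sagemaker; print('boto3', boto3.__version__, 'sagemaker', sagemaker.__version__)"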
Creating Your First Pipeline
GraphStorm provides Python scripts to help you create and manage SageMaker pipelines. The main tools are:
create_sm_pipeline.py: Creates or updates pipeline definitions
pipeline_parameters.py: Manages pipeline configuration
execute_sm_pipeline.py: Runs pipelines
Here’s an example of how to create a basic pipeline that includes graph construction, training, and inference:
python create_sm_pipeline.py \
--graph-construction-config-filename my_gconstruct_config.json \
--graph-name my-graph \
--graphstorm-pytorch-cpu-image-uri 123456789012.dkr.ecr.us-west-2.amazonaws.com/graphstorm:sagemaker-cpu \
--input-data-s3 s3://input-bucket/data \
--instance-count 2 \
--jobs-to-run gconstruct train inference \
--output-prefix-s3 s3://output-bucket/results \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--execution-role arn:aws:iam::123456789012:role/SageMakerExecutionRole \
--train-inference-task-type node_classification \
--train-yaml-s3 s3://config-bucket/train.yaml
This command sets up a pipeline with three main stages:
Graph construction using the configuration in my_gconstruct_config.json
Model training using the settings in train.yaml
Inference using the trained model
The pipeline will use 2 instances (specified by --instance-count) for distributed training and inference.
You’ll need to provide:
A SageMaker execution role (--execution-role) with appropriate permissions
A GraphStorm Docker image (--graphstorm-pytorch-cpu-image-uri) for running the tasks
S3 locations for your input data and where to store results
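If you have the AWS CLI installed and credentials configured, you can optionally confirm that the pipeline was created in the target region; this returns the pipeline definition and its current status:
aws sagemaker describe-pipeline \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2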
Running Pipeline Executions
Once you’ve created a pipeline, you can execute it using the execute_sm_pipeline.py script:
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2
You can override default parameters during execution to customize the run:
Note
Pipelines are created in a specific region. When executing a pipeline with the script below, you must specify the region in which the pipeline was created.
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--instance-count 4 \
--gpu-instance-type ml.g4dn.12xlarge
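You can also monitor executions with the AWS CLI (optional; assumes configured credentials), or follow them in the SageMaker Studio Pipelines UI:
aws sagemaker list-pipeline-executions \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2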
Pipeline Components
A GraphStorm SageMaker pipeline can include several components that you can combine based on your needs.
We list those here, with the step name that you can provide in --jobs-to-run in parentheses.
Single-instance Graph Construction (gconstruct): Graph construction on a single instance, for graphs that fit in one machine's memory.
Distributed Graph Pre-processing (gsprocessing): PySpark-based distributed data preparation for large graphs.
Distributed Graph Partitioning (dist_part): Multi-instance graph partitioning for distributed training.
GraphBolt Conversion (gb_convert): Converts partitioned data to GraphBolt format for improved training/inference efficiency.
Training (train): Trains your graph neural network model.
Inference (inference): Runs predictions using your trained model.
The choice of jobs to run will mostly depend on the size of your graph.
For graphs that fit into the memory of one machine, a typical job sequence is gconstruct train inference.
For graphs that are too large to fit into one machine, you will need to pre-process them using GSProcessing and partition them using distributed GSPartition; a typical sequence is gsprocessing dist_part train inference.
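The two scenarios differ only in the value passed to --jobs-to-run (remaining arguments elided here).
For a small graph:
python create_sm_pipeline.py \
... \
--jobs-to-run gconstruct train inference
For a large graph:
python create_sm_pipeline.py \
... \
--jobs-to-run gsprocessing dist_part train inference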
Configuration Options
This section provides a comprehensive list of all available configuration options for creating and executing GraphStorm SageMaker pipelines.
AWS Configuration
--execution-role: SageMaker execution IAM role ARN. (Required)
--region: AWS region. (Required)
--graphstorm-pytorch-cpu-image-uri: GraphStorm GConstruct/dist_part/train/inference CPU ECR image URI. (Required)
--graphstorm-pytorch-gpu-image-uri: GraphStorm GConstruct/dist_part/train/inference GPU ECR image URI.
--gsprocessing-pyspark-image-uri: GSProcessing SageMaker PySpark ECR image URI. (Required if running a gsprocessing job.)
Instance Configuration
--instance-count / --num-parts: Number of worker instances/partitions for partition, training, inference. (Required)
--cpu-instance-type: CPU instance type. (Default: ml.m5.4xlarge)
--gpu-instance-type: GPU instance type. (Default: ml.g5.4xlarge)
--train-on-cpu: Run training and inference on CPU instances instead of GPU. (Flag)
--graph-construction-instance-type: Instance type for graph construction. GSProcessing and GConstruct will use this instance type if provided. Otherwise they will use the instance type set in --cpu-instance-type.
--gsprocessing-instance-count: Number of GSProcessing instances (PySpark cluster size; default is equal to --instance-count).
--volume-size-gb: Additional volume size for SageMaker instances in GB. (Default: 100)
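For example, a sketch of running training and inference on CPU instances using the options above (remaining arguments elided):
python create_sm_pipeline.py \
... \
--train-on-cpu \
--cpu-instance-type ml.m5.4xlarge \
--instance-count 4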
Task Configuration
--graph-name: Name of the graph. (Required)
--input-data-s3: S3 path to the input graph data. (Required)
--output-prefix-s3: S3 prefix for the output data. (Required)
--pipeline-name: Name for the pipeline.
--base-job-name: Base job name for SageMaker jobs. (Default: ‘gs’)
--jobs-to-run: Space-separated list of jobs to run in the pipeline. Possible values are: gconstruct, gsprocessing, dist_part, gb_convert, train, inference. (Required)
--log-level: Logging level for the jobs. (Default: INFO)
--step-cache-expiration: Expiration time for the step cache. (Default: 30d)
--update-pipeline: Update an existing pipeline instead of creating a new one. (Flag)
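For example, to modify an existing pipeline after changing one of its arguments, re-run the creation script with the same pipeline name and add --update-pipeline (remaining arguments elided):
python create_sm_pipeline.py \
... \
--pipeline-name my-graphstorm-pipeline \
--update-pipeline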
Graph Construction Configuration
--graph-construction-config-filename: Filename for the graph construction config.
--graph-construction-args: Additional parameters to be passed directly to the GConstruct/GSProcessing job. For example, you can provide --num-processes 8 to GConstruct to use 8 processes when running graph construction.
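A sketch of passing extra construction arguments on the command line, assuming they are provided as a single quoted string (remaining arguments elided):
python create_sm_pipeline.py \
... \
--graph-construction-config-filename my_gconstruct_config.json \
--graph-construction-args "--num-processes 8"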
Partition Configuration
--partition-algorithm: Partitioning algorithm to use. (Default: random)
--partition-input-json: Name for the JSON file that describes the input data for distributed partitioning. (Default: updated_row_counts_metadata.json)
--partition-output-json: Name for the output JSON file that describes the partitioned data generated by GConstruct or GSPartition. (Default: metadata.json for GSPartition, <graph_name>.json for gconstruct.)
Training Configuration
--model-output-path: S3 path for model output.
--num-trainers: Number of trainers (per-instance training processes) to use during training/inference. Set this equal to the number of GPUs per instance. (Default: 1)
--train-inference-task-type: Task type for training and inference, e.g. link_prediction. For a complete list of available options see task_type in the runtime configuration documentation. (Required)
--train-yaml-s3: S3 path to the train YAML configuration file.
--use-graphbolt: Whether to use GraphBolt for GConstruct, training, and inference. (Default: false)
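For example, on a 4-GPU instance type such as ml.g4dn.12xlarge you would typically set --num-trainers to 4 (remaining arguments elided):
python create_sm_pipeline.py \
... \
--gpu-instance-type ml.g4dn.12xlarge \
--num-trainers 4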
Inference Configuration
--inference-yaml-s3: S3 path to the inference YAML configuration file.
--inference-model-snapshot: Which model snapshot to use for inference, e.g. epoch-9 to use the model generated by the 10th (zero-indexed) epoch.
--save-predictions: Whether to save predictions to S3 during inference. (Flag)
--save-embeddings: Whether to save embeddings to S3 during inference. (Flag)
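For example, to run inference with the model saved after the 10th epoch and store both predictions and embeddings (remaining arguments elided; the S3 path is illustrative):
python create_sm_pipeline.py \
... \
--inference-yaml-s3 s3://config-bucket/inference.yaml \
--inference-model-snapshot epoch-9 \
--save-predictions \
--save-embeddings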
Script Paths
The entry point scripts for all tasks are located under https://github.com/awslabs/graphstorm/tree/main/sagemaker/run.
--dist-part-script: Path to the DistPartition SageMaker entry point script.
--gb-convert-script: Path to the GraphBolt partition conversion script.
--train-script: Path to the training SageMaker entry point script.
--inference-script: Path to the inference SageMaker entry point script.
--gconstruct-script: Path to the GConstruct SageMaker entry point script.
--gsprocessing-script: Path to the GSProcessing SageMaker entry point script.
Using Configuration Options (Example)
When creating or executing a pipeline, you can use these options to customize your workflow. For example:
python create_sm_pipeline.py \
--graph-name my-large-graph \
--input-data-s3 s3://my-bucket/input-data \
--output-prefix-s3 s3://my-bucket/output \
--instance-count 4 \
--gpu-instance-type ml.g4dn.12xlarge \
--jobs-to-run gsprocessing dist_part gb_convert train inference \
--use-graphbolt true \
--train-yaml-s3 s3://my-bucket/train-config.yaml \
--inference-yaml-s3 s3://my-bucket/inference-config.yaml \
--save-predictions \
--save-embeddings
This example sets up a pipeline for a large graph, using distributed processing, GraphBolt conversion, GPU-based training and inference, and saving both predictions and embeddings. Required arguments such as the execution role, region, image URIs, and task type are omitted here for brevity; see the first example for a complete invocation.
Remember that not all options are required for every pipeline. The options you need depend on your specific use case and the components you include in your pipeline.
Advanced Usage
Using GraphBolt for Better Performance
GraphBolt enables faster training and inference; see Using GraphBolt to speed up training and inference. To enable GraphBolt for your pipeline:
python create_sm_pipeline.py \
... \
--use-graphbolt true
When using GConstruct, no follow-up job is needed: the pipeline appends --use-graphbolt true to the GConstruct arguments, and the graph files that GConstruct produces are ready for training with GraphBolt enabled.
For distributed processing with GraphBolt, you will need to include a gb_convert step after dist_part:
python create_sm_pipeline.py \
... \
--jobs-to-run gsprocessing dist_part gb_convert train inference \
--use-graphbolt true
For a complete example of running a GraphBolt-enabled pipeline, see this AWS ML blog post.
Asynchronous and Local Execution
For non-blocking pipeline execution:
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--async-execution
For local testing, where all pipeline steps run on your local machine:
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--local-execution
Note
Local execution requires a GPU if using GPU instance types.
Troubleshooting
If you encounter issues:
Check that all AWS permissions are correctly configured
Review SageMaker execution logs for detailed error messages
Verify S3 path accessibility
Confirm instance type availability in your region
For additional help, you can open an issue in the GraphStorm GitHub repository.